instancelib.feature_extraction.doc2vec module

class instancelib.feature_extraction.doc2vec.Doc2VecVectorizer(d2v_params, tokenizer=<function split_tokenizer>, storage_location=None, filename=None)[source]

Bases: BaseVectorizer[str], SaveableInnerModel

Parameters:

d2v_params (Dict[str, Any]) –
tokenizer (Callable[..., List[str]]) –
storage_location (Optional[PathLike[str]]) –
filename (Optional[PathLike[str]]) –
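
As a rough illustration of the constructor arguments, the sketch below builds a parameter dictionary and wraps construction in a small helper. It assumes that d2v_params is forwarded to gensim's Doc2Vec, so the keys shown are gensim keyword names rather than anything documented on this page. build_vectorizer is a hypothetical helper, not part of instancelib.

```python
from typing import Any, Dict

# Assumption: these keys are keyword arguments of gensim's Doc2Vec and
# are passed through unchanged by Doc2VecVectorizer.
d2v_params: Dict[str, Any] = {
    "vector_size": 50,  # dimensionality of each document vector
    "min_count": 1,     # keep words that occur at least once
    "epochs": 20,       # training passes over the corpus
    "workers": 1,       # single worker thread
}


def build_vectorizer(params: Dict[str, Any]):
    """Hypothetical helper: import lazily so this module loads without instancelib."""
    from instancelib.feature_extraction.doc2vec import Doc2VecVectorizer

    # tokenizer, storage_location and filename keep their defaults here.
    return Doc2VecVectorizer(params)
```

Calling build_vectorizer(d2v_params) requires instancelib and gensim to be installed. The remaining arguments can be left at the defaults shown in the signature above.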

fit(x_data, **kwargs)[source]

Fit the vectorizer according to the data in the given Sequence.

Parameters: x_data (Sequence[DT]) – A Sequence of examples with type DT.
Returns: A fitted vectorizer for data with type DT
Return type: BaseVectorizer[DT]

Examples

Assume d2v_params holds the model parameters and data_list holds a sequence of data examples.

>>> vectorizer = Doc2VecVectorizer(d2v_params)
>>> vectorizer = vectorizer.fit(data_list)

Parameters: kwargs (Any) –

fit_transform(x_data, **kwargs)[source]

Fit the vectorizer on x_data and transform the same data to a feature matrix in one step. Subsequent calls to transform() use the fit produced by this call.

Parameters: x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
Returns: A feature matrix with shape (n_examples, n_features)
Return type: npt.NDArray[Any]

Examples

Assume the vectorizer has been created; it does not need to be fitted beforehand, because this call performs the fit.

>>> x_mat = vectorizer.fit_transform(x_data)

Parameters: kwargs (Any) –
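
The relationship between fit(), transform() and fit_transform() can be sketched with a toy stand-in. ToyCountVectorizer below is hypothetical and counts vocabulary words instead of training a Doc2Vec model. It returns nested lists rather than an npt.NDArray, but follows the same contract: fit_transform() is a fit followed by a transform, and later transform() calls reuse that fit.

```python
from typing import Dict, List, Sequence


class ToyCountVectorizer:
    """Hypothetical stand-in: word counts over a fitted vocabulary."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, x_data: Sequence[str]) -> "ToyCountVectorizer":
        # Learn one feature column per distinct word in the training data.
        words = sorted({word for doc in x_data for word in doc.split()})
        self.vocab = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, x_data: Sequence[str]) -> List[List[int]]:
        # One row per example, one column per fitted vocabulary word.
        rows: List[List[int]] = []
        for doc in x_data:
            row = [0] * len(self.vocab)
            for word in doc.split():
                if word in self.vocab:  # words unseen during fit are ignored
                    row[self.vocab[word]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, x_data: Sequence[str]) -> List[List[int]]:
        return self.fit(x_data).transform(x_data)


train = ["a b", "b c c"]
vec = ToyCountVectorizer()
x_mat = vec.fit_transform(train)   # shape (2, 3): two examples, three features
new_mat = vec.transform(["c d"])   # reuses the fit above; "d" is unknown
```

The number of columns is fixed by the fit, so new_mat has the same width as x_mat even though it contains a word that was never seen during fitting.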

innermodel: Optional[Doc2Vec]

load()[source]

Return type: None

save()[source]

Return type: None

tokenizer: Callable[..., List[str]]
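
Any callable that maps a string to a list of string tokens fits this attribute's type. The function below is a hypothetical replacement tokenizer, shown only to illustrate the expected shape; its name and its normalization choices are assumptions.

```python
import re
from typing import List

# Runs of lowercase letters and digits; everything else acts as a separator.
_WORD = re.compile(r"[a-z0-9]+")


def lowercase_word_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase the text and keep alphanumeric runs."""
    return _WORD.findall(text.lower())


tokens = lowercase_word_tokenizer("Hello, World! It's 2024.")
```

Based on the constructor signature, such a function would be supplied as Doc2VecVectorizer(d2v_params, tokenizer=lowercase_word_tokenizer).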

transform(x_data, **kwargs)[source]

Transform a list of raw data points to a feature matrix according to the fitted vectorizer.

Parameters: x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
Returns: A feature matrix with shape (n_examples, n_features)
Return type: npt.NDArray[Any]

Examples

Assume the vectorizer is fitted

>>> x_mat = vectorizer.transform(x_data)

Parameters: kwargs (Any) –

instancelib.feature_extraction.doc2vec.get_line_docs(documents)[source]

Parameters: documents (Sequence[str]) –
Return type: TaggedLineDocument
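
gensim's TaggedLineDocument reads a source that holds one document per line and tags each document with its line number. How get_line_docs prepares that source is not shown on this page, so the following is only a standard-library sketch of the idea. write_line_docs and iter_tagged_lines are hypothetical names, and the sketch does not call gensim.

```python
import os
import tempfile
from typing import Iterator, List, Sequence, Tuple


def write_line_docs(documents: Sequence[str]) -> str:
    """Write one document per line to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for doc in documents:
            # Collapse internal whitespace so each document stays on one line.
            handle.write(" ".join(doc.split()) + "\n")
    return path


def iter_tagged_lines(path: str) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (tokens, [line_number]) pairs, mimicking line-based tagging."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield line.split(), [line_no]


path = write_line_docs(["first doc", "second\ndoc here"])
tagged = list(iter_tagged_lines(path))
os.remove(path)  # clean up the temporary file
```

The line number doubles as the document tag, which is why a document containing a newline has to be flattened before it is written.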

instancelib.feature_extraction.doc2vec.split_tokenizer(text)[source]

Parameters: text (str) –
Return type: List[str]
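
The page gives only the signature of split_tokenizer. A plausible reading of the name is a whitespace split, sketched below under that assumption. whitespace_split is a hypothetical function, and the library's own function may treat repeated or leading spaces differently.

```python
from typing import List


def whitespace_split(text: str) -> List[str]:
    """Assumed behaviour: split on any run of whitespace, dropping empty tokens."""
    return text.split()


tokens = whitespace_split("  fit the   vectorizer ")
```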