instancelib.feature_extraction.base module

class instancelib.feature_extraction.base.BaseVectorizer[source]

Bases: ABC, Generic[DT]

This ABC specifies a generic vectorizer. Vectorizers transform raw data examples into feature vectors. Given a data type DT, it specifies the method fit(), which initializes or fits the vectorizer, and the method transform(), which transforms the data into vector form.
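The fit/transform contract described above can be illustrated with a small standalone sketch. ToyCountVectorizer is a hypothetical class, not part of instancelib: it deliberately does not subclass BaseVectorizer so that it runs without the library installed, it returns nested lists where the real interface returns a NumPy array, and it raises a plain RuntimeError where a library implementation would likely use its own error type.

```python
from typing import Any, Dict, List, Sequence


class ToyCountVectorizer:
    """Hypothetical bag-of-words vectorizer that mimics the fit/transform contract."""

    def __init__(self) -> None:
        self._vocabulary: Dict[str, int] = {}
        self._fitted = False

    @property
    def fitted(self) -> bool:
        return self._fitted

    def fit(self, x_data: Sequence[str], **kwargs: Any) -> "ToyCountVectorizer":
        # Learn the vocabulary: one feature column per distinct token.
        tokens = sorted({tok for doc in x_data for tok in doc.lower().split()})
        self._vocabulary = {tok: idx for idx, tok in enumerate(tokens)}
        self._fitted = True
        return self

    def transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[int]]:
        if not self._fitted:
            raise RuntimeError("Call fit() before transform()")
        matrix: List[List[int]] = []
        for doc in x_data:
            row = [0] * len(self._vocabulary)
            for tok in doc.lower().split():
                idx = self._vocabulary.get(tok)
                if idx is not None:  # tokens unseen during fit are ignored
                    row[idx] += 1
            matrix.append(row)
        return matrix

    def fit_transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[int]]:
        # Fit on x_data, then transform the same data.
        return self.fit(x_data, **kwargs).transform(x_data, **kwargs)


data_list = ["a b a", "b c"]
vectorizer = ToyCountVectorizer().fit(data_list)
x_mat = vectorizer.transform(data_list)
print(x_mat)  # one row per example, one column per vocabulary token
```

The resulting matrix has shape (n_examples, n_features), here 2 rows by 3 columns, matching the return shape documented for transform() below.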

abstract fit(x_data, **kwargs)[source]

Fit the vectorizer according to the data in the given Sequence.

Parameters:

x_data (Sequence[DT]) – A Sequence of examples with type DT.

Returns:

A fitted vectorizer for data with type DT

Return type:

BaseVectorizer[DT]

Examples

Assume the creation of a vectorizer and a sequence of data examples in the variable data_list

>>> vectorizer = BaseVectorizer[DT]()
>>> vectorizer = vectorizer.fit(data_list)
Parameters:

kwargs (Any) –

abstract fit_transform(x_data, **kwargs)[source]

Transform a list of data to a feature matrix. The transformation is based on the data contained in the parameter x_data. Subsequent transformations with transform() will be based on the fit of the data provided in this call.

Parameters:

x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples

Returns:

A feature matrix with shape (n_examples, n_features)

Return type:

npt.NDArray[Any]

Examples

Assume a vectorizer has been created; it does not need to be fitted beforehand

>>> x_mat = vectorizer.fit_transform(x_data)
Parameters:

kwargs (Any) –

property fitted: bool

Check if the vectorizer has been fitted

Returns:

True if the vectorizer has been fitted

Return type:

bool

property name: str
abstract transform(x_data, **kwargs)[source]

Transform a list of raw data points to a feature matrix according to the fitted vectorizer

Parameters:

x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples

Returns:

A feature matrix with shape (n_examples, n_features)

Return type:

npt.NDArray[Any]

Examples

Assume the vectorizer is fitted

>>> x_mat = vectorizer.transform(x_data)
Parameters:

kwargs (Any) –

class instancelib.feature_extraction.base.SeparateContextVectorizer(data_vectorizer, context_vectorizer)[source]

Bases: ABC, Generic[DT, CT]

This ABC specifies a generic vectorizer for data types that consist of two parts that have to be fitted or configured according to different specifications. The feature vectors of the two parts are concatenated for each example.

The two parts are referred to as the data part and the context part. This vectorizer contains two inner vectorizers, one for the data part and one for the context part.
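The per-example concatenation described above can be sketched as follows. The helper concatenate_features and both input matrices are hypothetical and only illustrate the resulting shape; the library works with NumPy arrays, whereas this sketch uses nested lists so that it has no dependencies.

```python
from typing import List, Sequence


def concatenate_features(
    data_matrix: Sequence[Sequence[float]],
    context_matrix: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Join the data and context feature vectors of each example (hypothetical helper)."""
    # Row i of both matrices must describe the same example.
    return [list(d) + list(c) for d, c in zip(data_matrix, context_matrix)]


# Assumed outputs of the two inner vectorizers for two examples.
data_matrix = [[1.0, 0.0], [0.0, 1.0]]                # n_features_data = 2
context_matrix = [[0.5, 0.5, 0.0], [0.1, 0.2, 0.3]]   # n_features_context = 3

x_mat = concatenate_features(data_matrix, context_matrix)
print(len(x_mat), len(x_mat[0]))  # rows, then n_features_data + n_features_context
```

Each output row starts with the data features and ends with the context features, so the column count is the sum of the two feature counts.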

Parameters:

Examples

Construction:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> data_vectorizer = SklearnVectorizer[str](TfidfVectorizer())
>>> context_vectorizer = Doc2VecVectorizer[str]()
>>> vectorizer = SeparateContextVectorizer[str, str](data_vectorizer,
...     context_vectorizer)

Fitting:

>>> x_data = ["This...", "Another text...", ..., "Last Text"]
>>> x_context_data = ["Surrounding text", ... , "Another text"]
>>> vectorizer = vectorizer.fit(x_data, x_context_data)

Transforming:

>>> x_mat = vectorizer.transform(x_data, x_context_data)
fit(x_data, context_data, **kwargs)[source]

Fit the vectorizer according to the data in the given Sequences.

Parameters:
  • x_data (Sequence[DT]) – The data parts

  • context_data (Sequence[CT]) – The context parts

Returns:

A fitted vectorizer

Return type:

SeparateContextVectorizer[DT, CT]

Examples

Fitting this vectorizer can be performed as follows:

>>> x_data = ["This...", "Another text...", ..., "Last Text"]
>>> x_context_data = ["Surrounding text", ... , "Another text"]
>>> data_vectorizer = SklearnVectorizer[str](TfidfVectorizer())
>>> context_vectorizer = Doc2VecVectorizer[str]()
>>> vectorizer = SeparateContextVectorizer[str, str](
...     data_vectorizer,
...     context_vectorizer)
>>> vectorizer = vectorizer.fit(x_data, x_context_data)

Warning

We assume that the variables x_data and context_data are sequences of equal length.

Parameters:

kwargs (Any) –

fit_transform(x_data, context_data, **kwargs)[source]

Fit the inner vectorizers and transform a list of raw data points to a feature matrix. Subsequent transformations with transform() will be based on the fit of the data provided in this call.

Parameters:
  • x_data (Sequence[DT]) – A sequence with data parts of the data points of length n_docs

  • context_data (Sequence[CT]) – A sequence with context parts of the data points of length n_docs

  • kwargs (Any) –

Returns:

A feature matrix of concatenated vectors with shape (n_docs, n_features_data + n_features_context)

Return type:

npt.NDArray[Any]

property fitted: bool

Check if the vectorizer has been fitted

Returns:

True if the vectorizer has been fitted

Return type:

bool

transform(x_data, context_data, **kwargs)[source]

Transform a list of raw data points to a feature matrix according to the fitted vectorizers

Parameters:
  • x_data (Sequence[DT]) – A sequence with data parts of the data points of length n_docs

  • context_data (Sequence[CT]) – A sequence with context parts of the data points of length n_docs

Returns:

A feature matrix of concatenated vectors with shape (n_docs, n_features_data + n_features_context)

Return type:

npt.NDArray[Any]

Raises:

NotFittedError – If the model is not fitted

Warning

We assume that the variables x_data and context_data are sequences of equal length and that the indices of the sequences correspond to the same data point.

Parameters:

kwargs (Any) –
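The warning above requires that x_data and context_data are index-aligned. One way a caller might guarantee this is to keep the two parts together as pairs and split them only right before vectorizing. The helpers split_pairs and check_aligned are hypothetical and not part of instancelib; this is a sketch of a caller-side precaution, not of what the library does internally.

```python
from typing import List, Sequence, Tuple


def split_pairs(pairs: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (data, context) pairs into two index-aligned lists."""
    x_data = [data for data, _ in pairs]
    context_data = [context for _, context in pairs]
    return x_data, context_data


def check_aligned(x_data: Sequence[str], context_data: Sequence[str]) -> None:
    """Raise early when the two sequences cannot be index-aligned."""
    if len(x_data) != len(context_data):
        raise ValueError(
            f"x_data has {len(x_data)} items, context_data has {len(context_data)}"
        )


pairs = [("This...", "Surrounding text"), ("Last Text", "Another text")]
x_data, context_data = split_pairs(pairs)
check_aligned(x_data, context_data)  # passes: both lists have two items
```

A length check only catches mismatched sizes; it cannot detect two equally long sequences whose items are in a different order, which is why building both lists from the same pairs is the safer route.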

class instancelib.feature_extraction.base.StackVectorizer(vectorizer, *vectorizers)[source]

Bases: BaseVectorizer[DT], Generic[DT]

This ABC specifies a generic vectorizer that consists of several vectorizers that are fitted on the same data points.

The feature vectors of the contained vectorizers are concatenated in the transform step, according to the order they are specified in the constructor (argument order).
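The effect of argument order on the column layout can be sketched with toy components. LengthFeatures, VowelFeatures, and ToyStack are hypothetical stand-ins, not instancelib classes; they use nested lists instead of NumPy arrays so that the example runs without any dependencies.

```python
from typing import Any, List, Sequence


class LengthFeatures:
    """Hypothetical vectorizer: [number of characters, number of words]."""

    def fit(self, x_data: Sequence[str], **kwargs: Any) -> "LengthFeatures":
        return self  # nothing to learn

    def transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[float]]:
        return [[float(len(doc)), float(len(doc.split()))] for doc in x_data]


class VowelFeatures:
    """Hypothetical vectorizer: [number of vowels (a, e, i, o, u)]."""

    def fit(self, x_data: Sequence[str], **kwargs: Any) -> "VowelFeatures":
        return self  # nothing to learn

    def transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[float]]:
        return [[float(sum(ch in "aeiou" for ch in doc.lower()))] for doc in x_data]


class ToyStack:
    """Fits every inner vectorizer on the same data and joins their outputs."""

    def __init__(self, vectorizer: Any, *vectorizers: Any) -> None:
        # At least one vectorizer is required; the order is preserved.
        self.vectorizers = [vectorizer, *vectorizers]

    def fit(self, x_data: Sequence[str], **kwargs: Any) -> "ToyStack":
        for vec in self.vectorizers:
            vec.fit(x_data, **kwargs)
        return self

    def transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[float]]:
        parts = [vec.transform(x_data, **kwargs) for vec in self.vectorizers]
        # For each example, chain the rows of all parts in constructor order.
        return [[value for row in rows for value in row] for rows in zip(*parts)]


x_data = ["red apple", "sky"]
stacked = ToyStack(LengthFeatures(), VowelFeatures()).fit(x_data).transform(x_data)
swapped = ToyStack(VowelFeatures(), LengthFeatures()).fit(x_data).transform(x_data)
print(stacked)
print(swapped)
```

Swapping the constructor arguments changes which columns come first, so a model trained on one layout should be given features produced with the same argument order.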

Parameters:
  • vectorizer (BaseVectorizer[DT]) – At least one vectorizer is required

  • *vectorizers (BaseVectorizer[DT]) – Any number of vectorizers for the same data type

Examples

Construction

>>> tfidf = SklearnVectorizer[str](TfidfVectorizer())
>>> doc2vec = Doc2VecVectorizer[str]()
>>> count = SklearnVectorizer[str](CountVectorizer())
>>> vectorizer = StackVectorizer[str](tfidf, doc2vec, count)

Fitting

>>> x_data = ["This...", "Another text...", ..., "Last Text"]
>>> vectorizer = vectorizer.fit(x_data)

Transforming

>>> another_data = ["Another test text", ... , "Another text"]
>>> x_mat = vectorizer.transform(another_data)
fit(x_data, **kwargs)[source]

Fit the vectorizer according to the data in the given Sequence.

Parameters:

x_data (Sequence[DT]) – A Sequence of examples with type DT.

Returns:

A fitted vectorizer for data with type DT

Return type:

BaseVectorizer[DT]

Examples

Assume the creation of a vectorizer and a sequence of data examples in the variable data_list

>>> vectorizer = BaseVectorizer[DT]()
>>> vectorizer = vectorizer.fit(data_list)
Parameters:

kwargs (Any) –

fit_transform(x_data, **kwargs)[source]

Transform a list of data to a feature matrix. The transformation is based on the data contained in the parameter x_data. Subsequent transformations with transform() will be based on the fit of the data provided in this call.

Parameters:

x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples

Returns:

A feature matrix with shape (n_examples, n_features)

Return type:

npt.NDArray[Any]

Examples

Assume a vectorizer has been created; it does not need to be fitted beforehand

>>> x_mat = vectorizer.fit_transform(x_data)
Parameters:

kwargs (Any) –

property fitted: bool

Check if the vectorizer has been fitted

Returns:

True if the vectorizer has been fitted

Return type:

bool

transform(x_data, **kwargs)[source]

Transform a list of raw data points to a feature matrix according to the fitted vectorizer

Parameters:

x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples

Returns:

A feature matrix with shape (n_examples, n_features)

Return type:

npt.NDArray[Any]

Examples

Assume the vectorizer is fitted

>>> x_mat = vectorizer.transform(x_data)
Parameters:

kwargs (Any) –

vectorizers: List[BaseVectorizer[TypeVar(DT)]]

The internal vectorizers are stored in this list