instancelib.feature_extraction.base module
- class instancelib.feature_extraction.base.BaseVectorizer[source]
This ABC specifies a generic vectorizer. Vectorizers transform raw data examples into feature vectors. Given a data type DT, it specifies the method fit(), which initializes or fits the vectorizer, and the method transform(), which transforms the data into vector form.
- abstract fit(x_data, **kwargs)[source]
Fit the vectorizer according to the data in the given Sequence.
- Parameters:
x_data (Sequence[DT]) – A Sequence of examples with type DT.
- Returns:
A fitted vectorizer for data with type DT
- Return type:
Examples
Assume the creation of a vectorizer and a sequence of data examples in the variable data_list
>>> vectorizer = BaseVectorizer[DT]()
>>> vectorizer = vectorizer.fit(data_list)
- Parameters:
kwargs (Any) –
- abstract fit_transform(x_data, **kwargs)[source]
Transform a list of data to a feature matrix. The transformation is based on the data contained in the parameter x_data. Subsequent transformations with transform() will be based on the fit of the data provided in this call.
- Parameters:
x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
- Returns:
A feature matrix with shape (n_examples, n_features)
- Return type:
npt.NDArray[Any]
Examples
Assume a vectorizer has been created; it is fitted as part of this call
>>> x_mat = vectorizer.fit_transform(x_data)
- Parameters:
kwargs (Any) –
- property fitted: bool
Check if the vectorizer has been fitted
- Returns:
True if the vectorizer has been fitted
- Return type:
bool
- abstract transform(x_data, **kwargs)[source]
Transform a list of raw data points to a feature matrix according to the fitted vectorizer
- Parameters:
x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
- Returns:
A feature matrix with shape (n_examples, n_features)
- Return type:
npt.NDArray[Any]
Examples
Assume the vectorizer is fitted
>>> x_mat = vectorizer.transform(x_data)
- Parameters:
kwargs (Any) –
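The fit/transform contract described above can be illustrated with a small standalone sketch. It does not import instancelib: SketchVectorizer and WordCountVectorizer are hypothetical names, the default fit_transform() is an assumption about how the methods relate, and the sketch returns nested lists instead of the NumPy arrays the library documents, so that it runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, Sequence, TypeVar

DT = TypeVar("DT")


class SketchVectorizer(ABC, Generic[DT]):
    """Hypothetical stand-in that mirrors the documented interface."""

    @property
    @abstractmethod
    def fitted(self) -> bool:
        ...

    @abstractmethod
    def fit(self, x_data: Sequence[DT], **kwargs: Any) -> "SketchVectorizer[DT]":
        ...

    @abstractmethod
    def transform(self, x_data: Sequence[DT], **kwargs: Any) -> List[List[float]]:
        ...

    def fit_transform(self, x_data: Sequence[DT], **kwargs: Any) -> List[List[float]]:
        # Assumption: fitting and transforming the same data in one call.
        return self.fit(x_data, **kwargs).transform(x_data, **kwargs)


class WordCountVectorizer(SketchVectorizer[str]):
    """Toy bag-of-words vectorizer: one column per word seen during fit()."""

    def __init__(self) -> None:
        self._vocab: Dict[str, int] = {}
        self._fitted = False

    @property
    def fitted(self) -> bool:
        return self._fitted

    def fit(self, x_data: Sequence[str], **kwargs: Any) -> "WordCountVectorizer":
        words = sorted({w for text in x_data for w in text.lower().split()})
        self._vocab = {w: i for i, w in enumerate(words)}
        self._fitted = True
        return self  # returning self allows vectorizer = vectorizer.fit(...)

    def transform(self, x_data: Sequence[str], **kwargs: Any) -> List[List[float]]:
        if not self._fitted:
            raise RuntimeError("fit() must be called before transform()")
        rows: List[List[float]] = []
        for text in x_data:
            row = [0.0] * len(self._vocab)
            for word in text.lower().split():
                idx = self._vocab.get(word)
                if idx is not None:  # words unseen during fit() are ignored
                    row[idx] += 1.0
            rows.append(row)
        return rows


vectorizer = WordCountVectorizer()
vectorizer = vectorizer.fit(["a b", "b c c"])
x_mat = vectorizer.transform(["c a", "d"])  # shape (n_examples, n_features) = (2, 3)
```

The number of columns is fixed by the data passed to fit(), which is why transform() on new data still yields n_features columns.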
- class instancelib.feature_extraction.base.SeparateContextVectorizer(data_vectorizer, context_vectorizer)[source]
This ABC specifies a generic vectorizer for data types that consist of two parts that have to be fitted or configured according to different specifications. The feature vectors of the two parts are concatenated for each example.
The two parts are referred to as the data part and the context part. This vectorizer contains two inner vectorizers, one for the data part and one for the context part.
- Parameters:
data_vectorizer (BaseVectorizer[DT]) – The vectorizer for the data part
context_vectorizer (BaseVectorizer[CT]) – The vectorizer for the context part
Examples
Construction:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> data_vectorizer = SklearnVectorizer[str](TfidfVectorizer())
>>> context_vectorizer = Doc2VecVectorizer[str]()
>>> vectorizer = SeparateContextVectorizer[str, str](data_vectorizer,
...                                                  context_vectorizer)
Fitting:
>>> x_data = ["This...", "Another text...", ... "Last Text"]
>>> x_context_data = ["Surrounding text", ... , "Another text"]
>>> vectorizer = vectorizer.fit(x_data, x_context_data)
Transforming:
>>> x_mat = vectorizer.transform(x_data, x_context_data)
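The per-example concatenation of data and context features can be sketched as follows. This is an illustration of the idea only, not the library's implementation: concat_features is a hypothetical helper, the two input matrices are made-up values, and nested lists stand in for the NumPy arrays the library documents.

```python
from typing import List

Matrix = List[List[float]]


def concat_features(data_mat: Matrix, context_mat: Matrix) -> Matrix:
    """Join row i of the data matrix with row i of the context matrix.

    Rows are assumed to be aligned: index i in both matrices must
    describe the same example.
    """
    return [d_row + c_row for d_row, c_row in zip(data_mat, context_mat)]


# Two examples with 2 data features and 3 context features each.
data_mat = [[1.0, 0.0], [0.0, 2.0]]
context_mat = [[0.5, 0.5, 0.0], [0.1, 0.2, 0.3]]

x_mat = concat_features(data_mat, context_mat)
n_docs, n_features = len(x_mat), len(x_mat[0])  # (2, 2 + 3)
```

The data features always occupy the leading columns and the context features the trailing ones, so the resulting width is n_features_data + n_features_context.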
- fit(x_data, context_data, **kwargs)[source]
Fit the vectorizer according to the data in the given Sequences.
- Parameters:
- Returns:
A fitted vectorizer
- Return type:
Examples
Fitting this vectorizer can be performed as follows:
>>> x_data = ["This...", "Another text...", ... "Last Text"]
>>> x_context_data = ["Surrounding text", ... , "Another text"]
>>> data_vectorizer = SklearnVectorizer[str](TfidfVectorizer())
>>> context_vectorizer = Doc2VecVectorizer[str]()
>>> vectorizer = SeparateContextVectorizer[str, str](
...     data_vectorizer,
...     context_vectorizer)
>>> vectorizer = vectorizer.fit(x_data, x_context_data)
Warning
We assume that the variables x_data and context_data are sequences of equal length.
- Parameters:
kwargs (Any) –
- fit_transform(x_data, context_data, **kwargs)[source]
Fit and transform a list of raw data points to a feature matrix according to the fitted vectorizers. Subsequent transformations with transform() will be based on the fit of the data provided in this call.
- Parameters:
- Returns:
A feature matrix of concatenated vectors with shape (n_docs, n_features_data + n_features_context)
- Return type:
npt.NDArray[Any]
- property fitted: bool
Check if the vectorizer has been fitted
- Returns:
True if the vectorizer has been fitted
- Return type:
bool
- transform(x_data, context_data, **kwargs)[source]
Transform a list of raw data points to a feature matrix according to the fitted vectorizers
- Parameters:
- Returns:
A feature matrix of concatenated vectors with shape (n_docs, n_features_data + n_features_context)
- Return type:
npt.NDArray[Any]
- Raises:
NotFittedError – If the model is not fitted
Warning
We assume that the variables x_data and context_data are sequences of equal length and that the indices of the sequences correspond to the same data point.
- Parameters:
kwargs (Any) –
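The two preconditions stated for transform() (a fitted vectorizer, and data and context sequences of equal length) could be checked up front along these lines. This is a sketch under assumptions: NotFittedSketchError and check_transform_inputs are hypothetical names, and the documentation above does not say whether the library itself validates the sequence lengths.

```python
from typing import Any, Sequence


class NotFittedSketchError(Exception):
    """Hypothetical stand-in for the NotFittedError mentioned above."""


def check_transform_inputs(
    fitted: bool, x_data: Sequence[Any], context_data: Sequence[Any]
) -> None:
    """Raise early if transform() would run on invalid input."""
    if not fitted:
        raise NotFittedSketchError("call fit() before transform()")
    if len(x_data) != len(context_data):
        raise ValueError(
            f"x_data has {len(x_data)} items but context_data has {len(context_data)}"
        )


# Passes silently: fitted, and both sequences describe two examples.
check_transform_inputs(True, ["text a", "text b"], ["ctx a", "ctx b"])
```

A length check cannot detect sequences that have the same length but are ordered differently, so keeping the indices aligned remains the caller's responsibility.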
- class instancelib.feature_extraction.base.StackVectorizer(vectorizer, *vectorizers)[source]
Bases: BaseVectorizer[DT], Generic[DT]
This ABC specifies a generic vectorizer that consists of several vectorizers that are fitted on the same data points.
The feature vectors of the contained vectorizers are concatenated in the transform step, according to the order they are specified in the constructor (argument order).
- Parameters:
vectorizer (BaseVectorizer[DT]) – At least one vectorizer is required
*vectorizers (BaseVectorizer[DT]) – Any number of vectorizers for the same data type
Examples
Construction
>>> tf_idf = SklearnVectorizer[str](TfidfVectorizer())
>>> doc2vec = Doc2VecVectorizer[str]()
>>> count = SklearnVectorizer[str](CountVectorizer())
>>> vectorizer = StackVectorizer[str](tf_idf, doc2vec, count)
Fitting
>>> x_data = ["This...", "Another text...", ... "Last Text"]
>>> vectorizer = vectorizer.fit(x_data)
Transforming
>>> another_data = ["Another test text", ... , "Another text"]
>>> x_mat = vectorizer.transform(another_data)
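The effect of argument order on the stacked feature matrix can be sketched with plain functions in place of vectorizers. This is an illustration only: stack_transform, char_count, and word_count are hypothetical names, and nested lists stand in for the NumPy arrays the library documents.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
Featurizer = Callable[[Sequence[str]], Matrix]


def stack_transform(
    x_data: Sequence[str], first: Featurizer, *others: Featurizer
) -> Matrix:
    """Apply every featurizer to the same data and join the rows in argument order."""
    mats = [featurize(x_data) for featurize in (first, *others)]
    return [sum((mat[i] for mat in mats), []) for i in range(len(x_data))]


def char_count(texts: Sequence[str]) -> Matrix:
    return [[float(len(text))] for text in texts]


def word_count(texts: Sequence[str]) -> Matrix:
    return [[float(len(text.split()))] for text in texts]


texts = ["ab cd", "e"]
x_mat = stack_transform(texts, char_count, word_count)
x_mat_swapped = stack_transform(texts, word_count, char_count)
# x_mat         == [[5.0, 2.0], [1.0, 1.0]]
# x_mat_swapped == [[2.0, 5.0], [1.0, 1.0]]
```

Swapping the arguments swaps the columns, so a model trained on one ordering should be given matrices built with the same ordering.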
- fit(x_data, **kwargs)[source]
Fit the vectorizer according to the data in the given Sequence.
- Parameters:
x_data (Sequence[DT]) – A Sequence of examples with type DT.
- Returns:
A fitted vectorizer for data with type DT
- Return type:
Examples
Assume the creation of a vectorizer and a sequence of data examples in the variable data_list
>>> vectorizer = BaseVectorizer[DT]()
>>> vectorizer = vectorizer.fit(data_list)
- Parameters:
kwargs (Any) –
- fit_transform(x_data, **kwargs)[source]
Transform a list of data to a feature matrix. The transformation is based on the data contained in the parameter x_data. Subsequent transformations with transform() will be based on the fit of the data provided in this call.
- Parameters:
x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
- Returns:
A feature matrix with shape (n_examples, n_features)
- Return type:
npt.NDArray[Any]
Examples
Assume a vectorizer has been created; it is fitted as part of this call
>>> x_mat = vectorizer.fit_transform(x_data)
- Parameters:
kwargs (Any) –
- property fitted: bool
Check if the vectorizer has been fitted
- Returns:
True if the vectorizer has been fitted
- Return type:
bool
- transform(x_data, **kwargs)[source]
Transform a list of raw data points to a feature matrix according to the fitted vectorizer
- Parameters:
x_data (Sequence[DT]) – A sequence of raw data examples with length n_examples
- Returns:
A feature matrix with shape (n_examples, n_features)
- Return type:
npt.NDArray[Any]
Examples
Assume the vectorizer is fitted
>>> x_mat = vectorizer.transform(x_data)
- Parameters:
kwargs (Any) –
- vectorizers: List[BaseVectorizer[TypeVar(DT)]]
The internal vectorizers are stored in this list