instancelib.ingest.spreadsheet module

instancelib.ingest.spreadsheet.build_environment(df, label_mapper, labels, data_cols, label_cols)[source]

Build an environment from a data frame

Parameters:

df (pd.DataFrame) – A data frame that contains all texts and labels
label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings
labels (Optional[Iterable[str]]) – The set of possible labels
data_cols (Sequence[str]) – A sequence of columns that contain the texts
label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frame

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.build_environment_with_id(df, label_mapper, labels, id_col, data_cols, label_cols)[source]

Parameters:

df (DataFrame) –
label_mapper (Callable[[Any], Optional[str]]) –
labels (Optional[Iterable[str]]) –
id_col (str) –
data_cols (Sequence[str]) –
label_cols (Sequence[str]) –

Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.build_from_multiple_dfs(df_dict, label_mapper, labels, data_cols, label_cols)[source]

Build an environment from multiple data frames

Parameters:

df_dict (Dict[str, DataFrame]) – A dictionary of data frames, keyed by string, that contain all texts and labels
label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings
labels (Optional[Iterable[str]]) – The set of possible labels
data_cols (Sequence[str]) – A sequence of columns that contain the texts
label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frames

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.build_from_multiple_dfs_with_ids(df_dict, label_mapper, labels, id_col, data_cols, label_cols)[source]

Build an environment from multiple data frames, using a column as the identifier

Parameters:

df_dict (Dict[str, DataFrame]) – A dictionary of data frames, keyed by string, that contain all texts and labels
label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings
labels (Optional[Iterable[str]]) – The set of possible labels
id_col (str) – The column where the identifier is stored
data_cols (Sequence[str]) – A sequence of columns that contain the texts
label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frames

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.extract_data(dataset_df, data_cols, labelfunc)[source]

Extract text data and labels from a dataframe

Parameters:

dataset_df (pd.DataFrame) – The dataset
data_cols (List[str]) – The cols in which the text is stored
labelfunc (Callable[..., FrozenSet[str]]) – A function that maps rows to sets of labels

Returns:

The indices, the texts, and the label sets of the rows in the dataset

Return type:

Tuple[List[int], List[str], List[FrozenSet[str]]]

instancelib.ingest.spreadsheet.extract_data_with_id(dataset_df, id_col, data_cols, labelfunc)[source]

Extract text data and labels from a dataframe

Parameters:

dataset_df (pd.DataFrame) – The dataset
id_col (str) – The column where the identifier is stored
data_cols (List[str]) – The cols in which the text is stored
labelfunc (Callable[..., FrozenSet[str]]) – A function that maps rows to sets of labels

Returns:

The identifiers, the texts, and the label sets of the rows in the dataset

Return type:

Tuple[List[int], List[str], List[FrozenSet[str]]]

instancelib.ingest.spreadsheet.id_col(col)[source]

Parameters: col (str) –
Return type: Callable[[Series, Any], Any]

instancelib.ingest.spreadsheet.id_index()[source]

Return type: Callable[[Series, Any], Any]

instancelib.ingest.spreadsheet.id_index_prefix(prefix)[source]

Parameters: prefix (str) –
Return type: Callable[[Series, Any], str]
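
The three identifier helpers above (id_col, id_index, id_index_prefix) each return a callable that receives a row and its index. The following is a minimal standalone sketch of that pattern, not the library's implementation. Rows are plain dicts standing in for pandas Series, the *_sketch names are hypothetical, and the underscore separator in the prefixed variant is an assumption.

```python
from typing import Any, Callable, Mapping

Row = Mapping[str, Any]  # stand-in for a pandas Series


def id_col_sketch(col: str) -> Callable[[Row, Any], Any]:
    """Return an extractor that reads the identifier from column `col`."""
    def extractor(row: Row, idx: Any) -> Any:
        return row[col]
    return extractor


def id_index_sketch() -> Callable[[Row, Any], Any]:
    """Return an extractor that uses the row index as the identifier."""
    def extractor(row: Row, idx: Any) -> Any:
        return idx
    return extractor


def id_index_prefix_sketch(prefix: str) -> Callable[[Row, Any], str]:
    """Return an extractor that prepends `prefix` to the row index.

    The "_" separator is an assumption made for this sketch.
    """
    def extractor(row: Row, idx: Any) -> str:
        return f"{prefix}_{idx}"
    return extractor


row = {"doc_id": "a17", "text": "hello"}
from_column = id_col_sketch("doc_id")(row, 0)
from_index = id_index_sketch()(row, 0)
with_prefix = id_index_prefix_sketch("train")(row, 0)
```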

instancelib.ingest.spreadsheet.identity_mapper(value)[source]

Coerces any value to its string representation

Parameters: value (Any) – Any value that can be coerced into a string
Returns: The string representation of the value. If coercion fails, None is returned.
Return type: Optional[str]
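
The behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: identity_mapper_sketch is a hypothetical name, and treating any exception raised by str() as a failed coercion is an assumption.

```python
from typing import Any, Optional


def identity_mapper_sketch(value: Any) -> Optional[str]:
    """Coerce `value` to a string, returning None if coercion fails."""
    try:
        return str(value)
    except Exception:
        # Assumption: any error during coercion counts as a failure.
        return None


class Unprintable:
    """Helper for the example: an object whose __str__ raises."""
    def __str__(self) -> str:
        raise ValueError("cannot be printed")


mapped_int = identity_mapper_sketch(3)
mapped_bad = identity_mapper_sketch(Unprintable())
```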

instancelib.ingest.spreadsheet.instance_extractor(df, id_extractor, data_extractor, vector_extractor, repr_extractor, label_extractor, builder)[source]

Parameters:

df (DataFrame) –
id_extractor (Callable[[Series, Any], TypeVar(KT)]) –
data_extractor (Callable[[Series], TypeVar(DT)]) –
vector_extractor (Callable[[Series], TypeVar(VT)]) –
repr_extractor (Callable[[Series], TypeVar(RT)]) –
label_extractor (Callable[[Series], FrozenSet[TypeVar(LT)]]) –
builder (Callable[[TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT), Series, Any], TypeVar(IT, bound= Instance[Any, Any, Any, Any])]) –

Return type:

Iterator[Tuple[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any]), FrozenSet[TypeVar(LT)]]]
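
The signature above suggests a per-row pipeline: each extractor is applied to a row, the builder combines the results into an instance, and a (key, instance, labels) triple is yielded. The sketch below illustrates that reading with standard-library types only. It is not the library's implementation: instance_extractor_sketch and build_dict are hypothetical names, and the rows are (index, dict) pairs standing in for the rows of a pandas DataFrame.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, Iterator, Mapping, Tuple

Row = Mapping[str, Any]  # stand-in for a pandas Series


def instance_extractor_sketch(
    rows: Iterable[Tuple[Any, Row]],
    id_extractor: Callable[[Row, Any], Any],
    data_extractor: Callable[[Row], Any],
    vector_extractor: Callable[[Row], Any],
    repr_extractor: Callable[[Row], Any],
    label_extractor: Callable[[Row], FrozenSet[str]],
    builder: Callable[[Any, Any, Any, Any, Row, Any], Any],
) -> Iterator[Tuple[Any, Any, FrozenSet[str]]]:
    """Yield one (key, instance, labels) triple per row."""
    for idx, row in rows:
        key = id_extractor(row, idx)
        instance = builder(
            key,
            data_extractor(row),
            vector_extractor(row),
            repr_extractor(row),
            row,
            idx,
        )
        yield key, instance, label_extractor(row)


def build_dict(identifier, data, vector, representation, row, idx) -> Dict[str, Any]:
    """Hypothetical builder that stores the parts in a plain dict."""
    return {"id": identifier, "data": data, "vector": vector, "repr": representation}


rows = [
    (0, {"text": "good film", "label": "pos"}),
    (1, {"text": "dull plot", "label": "neg"}),
]
triples = list(
    instance_extractor_sketch(
        rows,
        id_extractor=lambda row, idx: idx,
        data_extractor=lambda row: row["text"],
        vector_extractor=lambda row: None,
        repr_extractor=lambda row: row["text"],
        label_extractor=lambda row: frozenset([row["label"]]),
        builder=build_dict,
    )
)
```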

instancelib.ingest.spreadsheet.inv_transform_mapping(columns, row, label_mapper=<function identity_mapper>)[source]

Convert the coded labels stored in the columns columns of row row to strings by applying label_mapper.

Parameters:

columns (Sequence[str]) – The columns in which the labels are stored
row (pd.Series) – A row from a Pandas DataFrame
label_mapper (Callable[[Any], str], optional) – A mapping from values to strings, by default identity_mapper, a function that coerces values to strings

Returns:

A set of labels that belong to the row

Return type:

FrozenSet[str]
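
A minimal sketch of the mapping step described above, assuming that values for which the mapper returns None are left out of the result. That filtering rule, the name inv_transform_mapping_sketch, and the use of a dict in place of a pandas Series are all assumptions made for illustration.

```python
from typing import Any, Callable, FrozenSet, Mapping, Optional, Sequence


def inv_transform_mapping_sketch(
    columns: Sequence[str],
    row: Mapping[str, Any],
    label_mapper: Callable[[Any], Optional[str]] = str,
) -> FrozenSet[str]:
    """Map the values in `columns` of `row` to label strings."""
    mapped = (label_mapper(row[col]) for col in columns)
    # Assumption: values the mapper cannot translate (None) are dropped.
    return frozenset(label for label in mapped if label is not None)


# A hypothetical coding scheme: 0 means "neg", 1 means "pos".
coding = {0: "neg", 1: "pos"}
row = {"sentiment": 1, "other": 7, "text": "good film"}

known_only = inv_transform_mapping_sketch(["sentiment", "other"], row, coding.get)
as_strings = inv_transform_mapping_sketch(["sentiment", "other"], row)
```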

instancelib.ingest.spreadsheet.no_vector()[source]

Return type: Callable[[Series], Optional[ndarray[Any, dtype[Any]]]]

instancelib.ingest.spreadsheet.one_hot_encoded_extractor(*cols)[source]

Parameters: cols (str) –
Return type: Callable[[Series], FrozenSet[str]]
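
The entry above carries no description, so the following is only one plausible reading of the name and return type: each column is treated as a label, and a row receives that label when the column holds a truthy value. The truthiness rule, the name one_hot_encoded_extractor_sketch, and the dict standing in for a pandas Series are assumptions.

```python
from typing import Any, Callable, FrozenSet, Mapping


def one_hot_encoded_extractor_sketch(*cols: str) -> Callable[[Mapping[str, Any]], FrozenSet[str]]:
    """Return an extractor that reads labels from one-hot encoded columns."""
    def extractor(row: Mapping[str, Any]) -> FrozenSet[str]:
        # Assumption: a truthy cell (for example 1) marks the label as present.
        return frozenset(col for col in cols if row[col])
    return extractor


extract = one_hot_encoded_extractor_sketch("sports", "politics", "science")
row_labels = extract({"text": "election results", "sports": 0, "politics": 1, "science": 0})
```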

instancelib.ingest.spreadsheet.pandas_to_env(df, data_cols, label_cols, labels=None)[source]

Parameters:

df (Union[DataFrame, Dict[str, DataFrame]]) –
data_cols (Union[str, Sequence[str]]) –
label_cols (Union[str, Sequence[str]]) –
labels (Optional[Iterable[str]]) –

Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.pandas_to_env_with_id(df, id_col, data_cols, label_cols, labels=None)[source]

Parameters:

df (Union[DataFrame, Dict[str, DataFrame]]) –
id_col (str) –
data_cols (Union[str, Sequence[str]]) –
label_cols (Union[str, Sequence[str]]) –
labels (Optional[Iterable[str]]) –

Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.read_csv_dataset(path, data_cols, label_cols, labels=None, label_mapper=<function identity_mapper>)[source]

Read CSV files that contain text data

Parameters:

path (Union[str, PathLike[str]]) – The path to the CSV file
data_cols (Sequence[str]) – The columns that contain the text data
label_cols (Sequence[str]) – The columns that contain the labels
labels (Optional[Iterable[str]], optional) – The set of labels that are possible. If None (the default), the set is inferred from the data.
label_mapper (Callable[[Any], Optional[str]], optional) – A function that transforms labels into another representation. By default, this is identity_mapper(), which coerces its input to a string.

Returns:

An environment that contains all the information from the CSV file

Return type:

AbstractEnvironment[TextInstance[int, npt.NDArray[Any]], Union[int, UUID], str, npt.NDArray[Any], str, str]

instancelib.ingest.spreadsheet.read_excel_dataset(path, data_cols, label_cols, labels=None, label_mapper=<function identity_mapper>)[source]

Read Excel files that contain text data

Parameters:

path (Union[str, PathLike[str]]) – The path to the Excel file
data_cols (Sequence[str]) – The columns that contain the text data
label_cols (Sequence[str]) – The columns that contain the labels
labels (Optional[Iterable[str]], optional) – The set of labels that are possible. If None (the default), the set is inferred from the data.
label_mapper (Callable[[Any], Optional[str]], optional) – A function that transforms labels into another representation. By default, this is identity_mapper(), which coerces its input to a string.

Returns:

An environment that contains all the information from the Excel file

Return type:

AbstractEnvironment[TextInstance[int, npt.NDArray[Any]], Union[int, UUID], str, npt.NDArray[Any], str, str]

instancelib.ingest.spreadsheet.text_builder(identifier, data, vector, representation, row, idx)[source]

Parameters:

identifier (TypeVar(KT)) –
data (str) –
vector (TypeVar(VT)) –
representation (str) –
row (Series) –
idx (Any) –

Return type:

MemoryTextInstance[TypeVar(KT), TypeVar(VT)]

instancelib.ingest.spreadsheet.text_concatenation(*cols)[source]

Parameters: cols (str) –
Return type: Callable[[Series], str]
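
The entry above has no description. Judging by the name and return type, the returned callable probably joins the text found in the given columns of a row. The sketch below shows that idea only: the single-space separator, the coercion of each value with str(), the name text_concatenation_sketch, and the dict standing in for a pandas Series are assumptions.

```python
from typing import Any, Callable, Mapping


def text_concatenation_sketch(*cols: str) -> Callable[[Mapping[str, Any]], str]:
    """Return an extractor that joins the text in `cols` into one string."""
    def extractor(row: Mapping[str, Any]) -> str:
        # Assumption: values are coerced to str and joined with a space.
        return " ".join(str(row[col]) for col in cols)
    return extractor


join_title_body = text_concatenation_sketch("title", "body")
joined = join_title_body({"title": "Release notes", "body": "Fixed two bugs.", "label": "docs"})
```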

instancelib.ingest.spreadsheet.text_from_pandas_multilabel(df_dict, text_cols, label_cols, labelset)[source]

Parameters:

df_dict (Dict[str, DataFrame]) –
text_cols (Sequence[str]) –
label_cols (Sequence[str]) –
labelset (FrozenSet[str]) –

instancelib.ingest.spreadsheet.to_dicts(triples)[source]

Parameters: triples (Iterator[Tuple[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any]), FrozenSet[TypeVar(LT)]]]) –
Return type: Tuple[Mapping[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any])], Mapping[TypeVar(KT), FrozenSet[TypeVar(LT)]]]
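
Going by the types above, to_dicts consumes (key, instance, labels) triples and returns two mappings that share the same keys: one from key to instance and one from key to label set. A minimal sketch of that transformation follows. The name to_dicts_sketch is hypothetical, and plain strings stand in for instances.

```python
from typing import Any, Dict, FrozenSet, Hashable, Iterable, Tuple


def to_dicts_sketch(
    triples: Iterable[Tuple[Hashable, Any, FrozenSet[str]]],
) -> Tuple[Dict[Hashable, Any], Dict[Hashable, FrozenSet[str]]]:
    """Split (key, instance, labels) triples into two key-indexed dicts."""
    instances: Dict[Hashable, Any] = {}
    labels: Dict[Hashable, FrozenSet[str]] = {}
    for key, instance, labelset in triples:
        instances[key] = instance
        labels[key] = labelset
    return instances, labels


triples = [
    (0, "good film", frozenset({"pos"})),
    (1, "dull plot", frozenset({"neg"})),
]
instances, labels = to_dicts_sketch(iter(triples))
```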

instancelib.ingest.spreadsheet.to_environment(prov_builder, labelprov_builder, dictionaries)[source]

Parameters:

prov_builder (Callable[[Mapping[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any])]], InstanceProvider[TypeVar(IT, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT)]]) –
labelprov_builder (Callable[[Mapping[TypeVar(KT), FrozenSet[TypeVar(LT)]]], LabelProvider[TypeVar(KT), TypeVar(LT)]]) –
dictionaries (Tuple[Mapping[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any])], Mapping[TypeVar(KT), FrozenSet[TypeVar(LT)]]]) –

Return type:

AbstractEnvironment[TypeVar(IT, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT), TypeVar(LT)]