instancelib.ingest.spreadsheet module

instancelib.ingest.spreadsheet.build_environment(df, label_mapper, labels, data_cols, label_cols)[source]

Build an environment from a data frame

Parameters:
  • df (pd.DataFrame) – A data frame that contains all texts and labels

  • label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings

  • labels (Optional[Iterable[str]]) – The set of possible labels. If None, the set is inferred from the data

  • data_cols (Sequence[str]) – A sequence of columns that contain the texts

  • label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frame

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.build_environment_with_id(df, label_mapper, labels, id_col, data_cols, label_cols)[source]
Parameters:
Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.build_from_multiple_dfs(df_dict, label_mapper, labels, data_cols, label_cols)[source]

Build an environment from multiple data frames

Parameters:
  • df_dict (Dict[str, DataFrame]) – A dictionary with string keys whose values are data frames that contain the texts and labels

  • label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings

  • labels (Optional[Iterable[str]]) – The set of possible labels. If None, the set is inferred from the data

  • data_cols (Sequence[str]) – A sequence of columns that contain the texts

  • label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frames

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.build_from_multiple_dfs_with_ids(df_dict, label_mapper, labels, id_col, data_cols, label_cols)[source]

Build an environment from multiple data frames, taking the instance identifiers from a column

Parameters:
  • df_dict (Dict[str, DataFrame]) – A dictionary with string keys whose values are data frames that contain the texts and labels

  • label_mapper (Callable[[Any], Optional[str]]) – A function that maps raw label values to label strings

  • labels (Optional[Iterable[str]]) – The set of possible labels. If None, the set is inferred from the data

  • id_col (str) – The column where the identifier is stored

  • data_cols (Sequence[str]) – A sequence of columns that contain the texts

  • label_cols (Sequence[str]) – The names of the columns that contain the label data

Returns:

A MemoryEnvironment that contains the texts and labels from the data frames

Return type:

MemoryEnvironment[int, str, npt.NDArray[Any], str]

instancelib.ingest.spreadsheet.extract_data(dataset_df, data_cols, labelfunc)[source]

Extract text data and labels from a dataframe

Parameters:
  • dataset_df (pd.DataFrame) – The dataset

  • data_cols (List[str]) – The columns in which the text is stored

  • labelfunc (Callable[..., FrozenSet[str]]) – A function that maps rows to sets of labels

Returns:

A tuple of three lists: the row indices, the texts, and the label sets

Return type:

Tuple[List[int], List[str], List[FrozenSet[str]]]

instancelib.ingest.spreadsheet.extract_data_with_id(dataset_df, id_col, data_cols, labelfunc)[source]

Extract text data and labels from a dataframe

Parameters:
  • dataset_df (pd.DataFrame) – The dataset

  • id_col (str) – The column where the identifier is stored

  • data_cols (List[str]) – The columns in which the text is stored

  • labelfunc (Callable[..., FrozenSet[str]]) – A function that maps rows to sets of labels

Returns:

A tuple of three lists: the identifiers, the texts, and the label sets

Return type:

Tuple[List[int], List[str], List[FrozenSet[str]]]

instancelib.ingest.spreadsheet.id_col(col)[source]
Parameters:

col (str) –

Return type:

Callable[[Series, Any], Any]

instancelib.ingest.spreadsheet.id_index()[source]
Return type:

Callable[[Series, Any], Any]

instancelib.ingest.spreadsheet.id_index_prefix(prefix)[source]
Parameters:

prefix (str) –

Return type:

Callable[[Series, Any], str]

instancelib.ingest.spreadsheet.identity_mapper(value)[source]

Coerces any value to its string representation

Parameters:

value (Any) – Any value that can be coerced into a string

Returns:

The string representation of the value. If coercion fails, None is returned.

Return type:

Optional[str]
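
The contract described above (a string on success, None when coercion fails) can be illustrated with a small stand-in. This is a sketch of the documented behaviour only, not the library's source; identity_mapper_sketch and Unprintable are hypothetical names introduced for the example.

```python
from typing import Any, Optional


def identity_mapper_sketch(value: Any) -> Optional[str]:
    """Hypothetical stand-in: coerce a value to str, or return None on failure."""
    try:
        return str(value)
    except Exception:
        # Coercion failed (for example a broken __str__), so signal it with None.
        return None


class Unprintable:
    """Hypothetical helper whose string conversion always fails."""

    def __str__(self) -> str:
        raise ValueError("cannot be rendered")


print(identity_mapper_sketch(3))              # 3
print(identity_mapper_sketch(Unprintable()))  # None
```

Returning None rather than raising lets a caller skip label values that cannot be converted.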

instancelib.ingest.spreadsheet.instance_extractor(df, id_extractor, data_extractor, vector_extractor, repr_extractor, label_extractor, builder)[source]
Parameters:
Return type:

Iterator[Tuple[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any]), FrozenSet[TypeVar(LT)]]]

instancelib.ingest.spreadsheet.inv_transform_mapping(columns, row, label_mapper=<function identity_mapper>)[source]

Convert the numerically coded labels stored in the columns columns of row row to strings by applying label_mapper.

Parameters:
  • columns – The columns in which the labels are stored

  • row (pd.Series) – A row from a Pandas DataFrame

  • label_mapper (Callable[[Any], str], optional) – A mapping from values to strings, by default identity_mapper, a function that coerces values to strings

Returns:

A set of labels that belong to the row

Return type:

FrozenSet[str]
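
The idea of mapping coded label columns to a set of label strings can be sketched without pandas by treating a row as a plain dict. The names inv_transform_sketch, to_str, and codes are hypothetical, and dropping values for which the mapper returns None is an assumption made for this sketch, not confirmed library behaviour.

```python
from typing import Any, Callable, FrozenSet, Mapping, Optional, Sequence


def to_str(value: Any) -> Optional[str]:
    """Default mapper for this sketch: plain string coercion."""
    return str(value)


def inv_transform_sketch(
    columns: Sequence[str],
    row: Mapping[str, Any],
    label_mapper: Callable[[Any], Optional[str]] = to_str,
) -> FrozenSet[str]:
    """Map the values in the given columns of one row to a set of labels."""
    mapped = (label_mapper(row[col]) for col in columns)
    # Assumption: values the mapper cannot translate (None) are dropped.
    return frozenset(label for label in mapped if label is not None)


# A hypothetical code book and a row represented as a dict.
codes = {0: "negative", 1: "positive", 7: "sports"}
row = {"text": "Great match!", "sentiment": 1, "topic": 7, "extra": 99}

print(sorted(inv_transform_sketch(["sentiment", "topic"], row, codes.get)))
# ['positive', 'sports']
print(inv_transform_sketch(["extra"], row, codes.get))
# frozenset()
```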

instancelib.ingest.spreadsheet.no_vector()[source]
Return type:

Callable[[Series], Optional[ndarray[Any, dtype[Any]]]]

instancelib.ingest.spreadsheet.one_hot_encoded_extractor(*cols)[source]
Parameters:

cols (str) –

Return type:

Callable[[Series], FrozenSet[str]]
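
This entry has no description, so the following is only a guess at the pattern its name and signature suggest: a factory that returns a function mapping a row to the set of column names that are switched on. The sketch uses a dict in place of a pandas Series, and both the name one_hot_extractor_sketch and the rule "a column is active when its value equals 1" are assumptions.

```python
from typing import Any, Callable, FrozenSet, Mapping


def one_hot_extractor_sketch(
    *cols: str,
) -> Callable[[Mapping[str, Any]], FrozenSet[str]]:
    """Hypothetical sketch: build a row -> label set function for one-hot columns."""

    def extract(row: Mapping[str, Any]) -> FrozenSet[str]:
        # Assumption: the column name itself serves as the label,
        # and a value equal to 1 marks the label as present.
        return frozenset(col for col in cols if row[col] == 1)

    return extract


extract = one_hot_extractor_sketch("sports", "politics", "tech")
row = {"text": "New stadium app", "sports": 1, "politics": 0, "tech": 1}
print(sorted(extract(row)))  # ['sports', 'tech']
```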

instancelib.ingest.spreadsheet.pandas_to_env(df, data_cols, label_cols, labels=None)[source]
Parameters:
Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.pandas_to_env_with_id(df, id_col, data_cols, label_cols, labels=None)[source]
Parameters:
Return type:

AbstractEnvironment[MemoryTextInstance[Any, ndarray[Any, dtype[Any]]], Union[Any, UUID], str, ndarray[Any, dtype[Any]], str, str]

instancelib.ingest.spreadsheet.read_csv_dataset(path, data_cols, label_cols, labels=None, label_mapper=<function identity_mapper>)[source]

Read CSV files that contain text data

Parameters:
  • path (Union[str, PathLike[str]]) – The path to the CSV file

  • data_cols (Sequence[str]) – The columns that contain the text data

  • label_cols (Sequence[str]) – The columns that contain the labels

  • labels (Optional[Iterable[str]], optional) – The set of possible labels. If None, the set is inferred from the data. The default is None.

  • label_mapper (Callable[[Any], Optional[str]], optional) – A function that transforms labels into another representation. The default is identity_mapper(), which coerces its input to a string.

Returns:

An environment that contains all the information from the CSV file

Return type:

AbstractEnvironment[TextInstance[int, npt.NDArray[Any]], Union[int, UUID], str, npt.NDArray[Any], str, str]

instancelib.ingest.spreadsheet.read_excel_dataset(path, data_cols, label_cols, labels=None, label_mapper=<function identity_mapper>)[source]

Read Excel files that contain text data

Parameters:
  • path (Union[str, PathLike[str]]) – The path to the Excel file

  • data_cols (Sequence[str]) – The columns that contain the text data

  • label_cols (Sequence[str]) – The columns that contain the labels

  • labels (Optional[Iterable[str]], optional) – The set of possible labels. If None, the set is inferred from the data. The default is None.

  • label_mapper (Callable[[Any], Optional[str]], optional) – A function that transforms labels into another representation. The default is identity_mapper(), which coerces its input to a string.

Returns:

An environment that contains all the information from the Excel file

Return type:

AbstractEnvironment[TextInstance[int, npt.NDArray[Any]], Union[int, UUID], str, npt.NDArray[Any], str, str]

instancelib.ingest.spreadsheet.text_builder(identifier, data, vector, representation, row, idx)[source]
Parameters:
Return type:

MemoryTextInstance[TypeVar(KT), TypeVar(VT)]

instancelib.ingest.spreadsheet.text_concatenation(*cols)[source]
Parameters:

cols (str) –

Return type:

Callable[[Series], str]
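
No description is given for this function. Judging from the name and the return type, it builds a function that joins the text of several columns of a row into one string. The sketch below illustrates that pattern with a dict standing in for a pandas Series; text_concatenation_sketch and its sep parameter (a single space by default) are hypothetical, and the separator the library actually uses is not stated here.

```python
from typing import Any, Callable, Mapping


def text_concatenation_sketch(
    *cols: str, sep: str = " "
) -> Callable[[Mapping[str, Any]], str]:
    """Hypothetical sketch: build a row -> text function over several columns."""

    def concatenate(row: Mapping[str, Any]) -> str:
        # Coerce each cell to str so numeric cells do not break the join.
        return sep.join(str(row[col]) for col in cols)

    return concatenate


to_text = text_concatenation_sketch("title", "body")
row = {"title": "Hello", "body": "world", "label": 1}
print(to_text(row))  # Hello world
```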

instancelib.ingest.spreadsheet.text_from_pandas_multilabel(df_dict, text_cols, label_cols, labelset)[source]
Parameters:

instancelib.ingest.spreadsheet.to_dicts(triples)[source]
Parameters:

triples (Iterator[Tuple[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any]), FrozenSet[TypeVar(LT)]]]) –

Return type:

Tuple[Mapping[TypeVar(KT), TypeVar(IT, bound= Instance[Any, Any, Any, Any])], Mapping[TypeVar(KT), FrozenSet[TypeVar(LT)]]]
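
The type signature indicates that an iterator of (key, instance, label set) triples is split into two mappings that share the same keys. A minimal sketch of that transformation, using plain strings as stand-ins for instances, could look like this; to_dicts_sketch is a hypothetical name and this is not the library's source.

```python
from typing import Dict, FrozenSet, Iterable, Tuple, TypeVar

KT = TypeVar("KT")
IT = TypeVar("IT")
LT = TypeVar("LT")


def to_dicts_sketch(
    triples: Iterable[Tuple[KT, IT, FrozenSet[LT]]],
) -> Tuple[Dict[KT, IT], Dict[KT, FrozenSet[LT]]]:
    """Split (key, instance, labels) triples into two dicts keyed by identifier."""
    instances: Dict[KT, IT] = {}
    labels: Dict[KT, FrozenSet[LT]] = {}
    for key, instance, labelset in triples:
        instances[key] = instance
        labels[key] = labelset
    return instances, labels


triples = iter([
    (0, "first text", frozenset({"sports"})),
    (1, "second text", frozenset({"tech", "politics"})),
])
instances, labels = to_dicts_sketch(triples)
print(instances[1])       # second text
print(sorted(labels[1]))  # ['politics', 'tech']
```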

instancelib.ingest.spreadsheet.to_environment(prov_builder, labelprov_builder, dictionaries)[source]
Parameters:
Return type:

AbstractEnvironment[TypeVar(IT, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT), TypeVar(LT)]