instancelib.instances.base module

class instancelib.instances.base.AbstractBucketProvider(*args, **kwds)[source]

Bases: InstanceProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]

This class allows the creation of subsets (buckets) from a provider, without copying data, while still preserving the InstanceProvider API.

For example, in pool-based Active Learning, the dataset is partitioned into several sets, e.g., the labeled and unlabeled parts of the dataset. In traditional supervised learning, these are the train, test and validation sets. No data is copied; this provider only keeps a set of identifiers. All data resides in the original provider.
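A minimal sketch of the bucket idea, assuming a plain dict stands in for the original provider. KeyBucket and its methods are hypothetical names used for illustration only and are not part of instancelib; the point is that each bucket stores identifiers while the data stays in one shared dataset.

```python
from typing import Dict, Iterator, Set


class KeyBucket:
    """Hypothetical stand-in for a bucket: stores keys, reads data from the dataset."""

    def __init__(self, dataset: Dict[int, str]) -> None:
        self.dataset = dataset  # shared reference, never copied
        self._keys: Set[int] = set()

    def add(self, key: int) -> None:
        # Only keys that exist in the original dataset can be bucketed.
        if key not in self.dataset:
            raise KeyError(key)
        self._keys.add(key)

    def discard(self, key: int) -> None:
        self._keys.discard(key)

    def __getitem__(self, key: int) -> str:
        if key not in self._keys:
            raise KeyError(key)
        return self.dataset[key]

    def __iter__(self) -> Iterator[int]:
        return iter(sorted(self._keys))

    def __len__(self) -> int:
        return len(self._keys)


dataset = {1: "first doc", 2: "second doc", 3: "third doc"}
labeled, unlabeled = KeyBucket(dataset), KeyBucket(dataset)
for key in dataset:
    unlabeled.add(key)

# Move document 2 from the unlabeled to the labeled bucket.
unlabeled.discard(2)
labeled.add(2)
```

Both buckets point at the same dataset object, so moving a document between them only changes two small key sets.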

Variables:

dataset – The InstanceProvider that you want to take a subset from

add_child(parent, child)[source]

Register a parent child relation between two instances

Parameters:
Return type:

None

clear()[source]
Return type:

None

create(*args, **kwargs)[source]

Create a new instance of type InstanceType. The created instance is subsequently added to the provider.

Note: The number of arguments and keyword arguments may differ in actual implementation, so there are no standard arguments.

Returns:

The newly created instance

Return type:

InstanceType

Parameters:
  • args (Any) –

  • kwargs (Any) –

data_chunker(batch_size=200)[source]

Iterate over the keys and raw data of all instances in this provider, in batches

Parameters:

batch_size (int) – The batch size, the generator will return lists with size batch_size

Yields:

Sequence[Tuple[KT,DT]] – A sequence of key data tuples with length batch_size. The last sequence may be shorter.

Return type:

Iterator[Sequence[Tuple[TypeVar(KT), TypeVar(DT)]]]
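The batching behaviour described above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: chunk_key_data is a hypothetical helper, and a dict of strings stands in for the provider.

```python
from itertools import islice
from typing import Dict, Iterator, Sequence, Tuple


def chunk_key_data(
    items: Dict[int, str], batch_size: int = 200
) -> Iterator[Sequence[Tuple[int, str]]]:
    """Yield lists of (key, data) tuples with at most batch_size elements."""
    iterator = iter(items.items())
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


docs = {i: f"doc {i}" for i in range(5)}
batches = list(chunk_key_data(docs, batch_size=2))
```

With five documents and a batch size of two, the final batch holds the single remaining tuple.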

data_chunker_selector(keys, batch_size=200)[source]
Parameters:
Return type:

Iterator[Sequence[Tuple[TypeVar(KT), TypeVar(DT)]]]

dataset: InstanceProvider[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT)]

The original dataset. All data will remain there

discard_children(parent)[source]

Discard the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to discard.

Return type:

None

property empty: bool

Determines if the provider does not contain instances

Returns:

True if the provider is empty

Return type:

bool

get_all()[source]

Get an iterator that iterates over all instances

Yields:

InstanceType – The instances contained in this provider

Return type:

Iterator[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any])]

get_children(parent)[source]

Get the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to get.

Returns:

A list containing the children

Return type:

Sequence[InstanceType]

get_children_keys(parent)[source]

Get the keys of the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to get.

Returns:

A list containing the keys of the children

Return type:

Sequence[InstanceType]

get_parent(child)[source]

Get the parent of a child

Parameters:

child (Union[KT, Instance[KT, DT, VT, RT]]) – The child instance whose parent you want to get.

Returns:

The parent of this child instance

Return type:

InstanceType

Raises:

KeyError – If there is no parent associated with this Instance
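The parent child bookkeeping described by add_child, get_children_keys, get_parent and discard_children can be sketched with two dictionaries. ChildRegistry is a hypothetical class that works on integer keys only; the actual providers may store these relations differently.

```python
from typing import Dict, List


class ChildRegistry:
    """Hypothetical registry keyed by identifiers only."""

    def __init__(self) -> None:
        self._children: Dict[int, List[int]] = {}
        self._parents: Dict[int, int] = {}

    def add_child(self, parent: int, child: int) -> None:
        self._children.setdefault(parent, []).append(child)
        self._parents[child] = parent

    def get_children_keys(self, parent: int) -> List[int]:
        return list(self._children.get(parent, []))

    def get_parent(self, child: int) -> int:
        # Raises KeyError when no parent was registered for this child.
        return self._parents[child]

    def discard_children(self, parent: int) -> None:
        for child in self._children.pop(parent, []):
            del self._parents[child]


registry = ChildRegistry()
registry.add_child(1, 10)
registry.add_child(1, 11)
```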

vector_chunker(batch_size=200)[source]

Iterate over all pairs of keys and vectors in this provider

Parameters:

batch_size (int) – The batch size, the generator will return lists with size batch_size

Returns:

An iterator over sequences of key vector tuples

Return type:

Iterator[Sequence[Tuple[KT, VT]]]

Yields:

Sequence[Tuple[KT, VT]] – Sequences of key vector tuples

vector_chunker_selector(keys, batch_size=200)[source]

Iterate over pairs of keys and vectors for the instances whose identifiers are contained in the keys parameter.

Parameters:
  • keys (Iterable[KT]) – The keys that should be chunked

  • batch_size (int) – The batch size, the generator will return lists with size batch_size

Yields:

Sequence[Tuple[KT, VT]] – A sequence of key vector tuples with length batch_size. The last sequence may be shorter.

Returns:

An iterator over sequences of key vector tuples

Return type:

Iterator[Sequence[Tuple[KT, VT]]]

class instancelib.instances.base.Instance(*args, **kwds)[source]

Bases: ABC, Generic[KT, DT, VT, RT]

A base Instance Class.

Every Instance contains 4 properties:

  • A unique identifier (identifier)

  • The raw data (data)

  • A vector representation of the data (vector)

  • A human readable representation (representation)

The ABC Instance has four Generic types:

  • KT: The type of the key

  • DT: The type of the data

  • VT: The type of the vector

  • RT: The type of the representation

Combining these four items in a single object enables easy transfer between different operations like prediction, annotation and transformation.
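As an illustration of the four properties and the four generic types, here is a minimal sketch that uses a dataclass. SimpleInstance is a hypothetical class and does not inherit from the abstract Instance; a real implementation would subclass Instance and implement its abstract properties.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar

KT = TypeVar("KT")
DT = TypeVar("DT")
VT = TypeVar("VT")
RT = TypeVar("RT")


@dataclass
class SimpleInstance(Generic[KT, DT, VT, RT]):
    identifier: KT          # unique key
    data: DT                # raw data
    vector: Optional[VT]    # may be unknown until a vectorizer has run
    representation: RT      # human readable form


doc: SimpleInstance[int, str, List[float], str] = SimpleInstance(
    identifier=1, data="A short text", vector=None, representation="A short text"
)

# The vector can be attached later, once it is available.
doc.vector = [0.1, 0.9]
```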

abstract property data: DT

Return the raw data of this instance

Returns:

The Raw Data

Return type:

DT

abstract property identifier: KT

Get the identifier of the instance

Returns:

The identifier key of the instance

Return type:

KT

static map_data(func)[source]

Transform a function that works on raw data into a function that works on Instance objects.

Parameters:

func (Callable[[DT], _V]) – The function that works on raw data

Returns:

The transformed function

Return type:

Callable[[Instance[KT, DT, VT, RT]], _V]
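The lifting that map_data describes can be sketched as a small higher order function. Doc and lift_data are hypothetical names for this illustration; the sketch only assumes that an instance exposes its raw data through a data attribute.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

_V = TypeVar("_V")


@dataclass
class Doc:
    identifier: int
    data: str


def lift_data(func: Callable[[str], _V]) -> Callable[[Doc], _V]:
    """Wrap a function on raw data so that it accepts a Doc instead."""

    def wrapped(instance: Doc) -> _V:
        return func(instance.data)

    return wrapped


count_words = lift_data(lambda text: len(text.split()))
n_words = count_words(Doc(1, "three short words"))
```

The same pattern applies to map_vector, with the vector taking the place of the raw data.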

static map_vector(func)[source]

Transform a function that works on vectors into a function that works on Instance objects.

Parameters:

func (Callable[[VT], _V]) – The function that works on vectors

Returns:

The transformed function

Return type:

Callable[[Instance[KT, DT, VT, RT]], _V]

abstract property representation: RT

Return a representation for annotation

Returns:

A representation of the raw data

Return type:

RT

to_dict()[source]
Return type:

Mapping[str, Any]

property type_info: TypeInfo

abstract property vector: VT | None

Get the vector representation of the raw data

Returns:

The Vector

Return type:

Optional[VT]

static vectorized_data_map(func)[source]

Transform a function that works on sequences of raw data into a function that works on sequences of Instance objects.

Parameters:

func (Callable[[Iterable[DT]], _V]) – The function that works on sequences of raw data

Returns:

The transformed function

Return type:

Callable[[Iterable[Instance[KT, DT, VT, RT]]], _V]

class instancelib.instances.base.InstanceProvider(*args, **kwds)[source]

Bases: MutableMapping[KT, InstanceType], ROInstanceProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]

The Base InstanceProvider class.

This class provides an abstract implementation for a dataset. The InstanceProvider has five Generic types:

  • InstanceType : A subclass of Instance

  • KT: The type of the key

  • DT: The type of the data

  • VT: The type of the vector

  • RT: The type of the representation

Specifying these allows Python to ensure the correctness of your implementation and eases further integration in your application.

Examples

Instance access:

>>> provider = InstanceProvider() # Replace with your implementation's constructor
>>> first_key = next(iter(provider))
>>> first_doc = provider[first_key]

Set operations:

>>> new_instance = Instance()
>>> provider.add(new_instance)
>>> provider.discard(new_instance)

Example implementation:

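>>> from typing import Any
>>> import numpy.typing as npt
>>> from instancelib.instances.base import Instance, InstanceProvider
>>> class TextProvider(InstanceProvider[Instance[int, str, npt.NDArray[Any], str],
...                                     int, str, npt.NDArray[Any], str]):
...     ...  # Further implementation is needed
>>> textprovider = TextProvider()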

Several abstract methods must be implemented in your own implementation. See the source of this file to see which ones.
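To show how the mapping interface and the set-like add and discard operations fit together, here is a minimal dict-backed sketch. DictProvider and Doc are hypothetical and do not subclass InstanceProvider; they only mirror the behaviour documented for add and discard on top of collections.abc.MutableMapping.

```python
from collections.abc import MutableMapping
from typing import Dict, Iterator, NamedTuple


class Doc(NamedTuple):
    identifier: int
    data: str


class DictProvider(MutableMapping):
    """Hypothetical dict-backed provider; not the library's implementation."""

    def __init__(self) -> None:
        self._store: Dict[int, Doc] = {}

    def __getitem__(self, key: int) -> Doc:
        return self._store[key]

    def __setitem__(self, key: int, value: Doc) -> None:
        self._store[key] = value

    def __delitem__(self, key: int) -> None:
        del self._store[key]

    def __iter__(self) -> Iterator[int]:
        return iter(self._store)

    def __len__(self) -> int:
        return len(self._store)

    def add(self, instance: Doc) -> None:
        # Adding an instance that is already present changes nothing.
        self._store.setdefault(instance.identifier, instance)

    def discard(self, instance: Doc) -> None:
        # Removing an absent instance is a no-op.
        self._store.pop(instance.identifier, None)


provider = DictProvider()
provider.add(Doc(20, "hello"))
provider.add(Doc(20, "hello"))
first_key = next(iter(provider))
```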

add(instance)[source]

Add an instance to this provider.

If the provider already contains instance, nothing happens.

Parameters:

instance (Instance[KT, DT, VT, RT]) – The instance that should be added to the provider

Return type:

None

abstract add_child(parent, child)[source]

Register a parent child relation between two instances

Parameters:
Return type:

None

add_range(*instances)[source]

Add multiple instances to this provider.

If the provider already contains one of the instances, nothing happens for that instance.

Parameters:
Return type:

None

abstract create(*args, **kwargs)[source]

Create a new instance of type InstanceType. The created instance is subsequently added to the provider.

Note: The number of arguments and keyword arguments may differ in actual implementation, so there are no standard arguments.

Returns:

The newly created instance

Return type:

InstanceType

Parameters:
  • args (Any) –

  • kwargs (Any) –

discard(instance)[source]

Remove an instance from this provider. If the provider does not contain instance, nothing happens.

Parameters:

instance (Instance[KT, DT, VT, RT]) – The instance that should be removed from the provider

Return type:

None

abstract discard_children(parent)[source]

Discard the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to discard.

Return type:

None

abstract get_children(parent)[source]

Get the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to get.

Returns:

A list containing the children

Return type:

Sequence[InstanceType]

get_children_keys(parent)[source]

Get the keys of the children that are registered to this parent

Parameters:

parent (Union[KT, Instance[KT, DT, VT, RT]]) – The parent whose children you want to get.

Returns:

A list containing the keys of the children

Return type:

Sequence[InstanceType]

abstract get_parent(child)[source]

Get the parent of a child

Parameters:

child (Union[KT, Instance[KT, DT, VT, RT]]) – The child instance whose parent you want to get.

Returns:

The parent of this child instance

Return type:

InstanceType

Raises:

KeyError – If there is no parent associated with this Instance

map_mutate(func)[source]

Run a function on this provider that modifies all Instances in place

Parameters:

func (Callable[[InstanceType], InstanceType]) – A function that modifies instances in place

Return type:

None

class instancelib.instances.base.ROInstanceProvider(*args, **kwds)[source]

Bases: Mapping[KT, InstanceType], ABC, Generic[InstanceType, KT, DT, VT, RT]

The Base InstanceProvider class (ReadOnly).

This class provides an abstract implementation for a dataset. The InstanceProvider has five Generic types:

  • InstanceType : A subclass of Instance

  • KT: The type of the key

  • DT: The type of the data

  • VT: The type of the vector

  • RT: The type of the representation

Specifying these allows Python to ensure the correctness of your implementation and eases further integration in your application.

Examples

Instance access:

>>> provider = InstanceProvider() # Replace with your implementation's constructor
>>> first_key = next(iter(provider))
>>> first_doc = provider[first_key]

Set operations:

>>> new_instance = Instance()
>>> provider.add(new_instance)
>>> provider.discard(new_instance)

Example implementation:

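>>> from typing import Any
>>> import numpy.typing as npt
>>> from instancelib.instances.base import Instance, InstanceProvider
>>> class TextProvider(InstanceProvider[Instance[int, str, npt.NDArray[Any], str],
...                                     int, str, npt.NDArray[Any], str]):
...     ...  # Further implementation is needed
>>> textprovider = TextProvider()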

Several abstract methods must be implemented in your own implementation. See the source of this file to see which ones.

abstract __contains__(item)[source]

Special method that checks if something is contained in this provider.

Parameters:

item (object) – The item of which we want to know if it is contained in this provider

Returns:

True if the provider contains item.

Return type:

bool

Examples

Example usage; check if the item exists and then remove it

>>> doc_id = 20
>>> provider = InstanceProvider()
>>> if doc_id in provider:
...     del provider[doc_id]

abstract __iter__()[source]

Enables you to iterate over the keys of the instances in this provider

Yields:

KT – Keys included in the provider

Return type:

Iterator[TypeVar(KT)]

all_data()[source]

Return all the raw data from the instances in this provider

Yields:

DT – Raw data

Return type:

Iterator[TypeVar(DT)]

bulk_add_vectors(keys, values)[source]

This method adds the vectors in values to the instances specified in keys.

In some use cases, vectors are not known beforehand. This library provides several vectorizers that convert raw data points into feature vectors. Once these vectors are available, they can be added to the provider by using this method.

Parameters:
Return type:

None

Warning

We assume that the indices and length of the parameters keys and values match.
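The index matching assumed in the warning above can be sketched as follows. attach_vectors and vector_store are hypothetical names; the sketch pairs keys[i] with values[i] and checks the lengths explicitly, which the documented method may not do.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def attach_vectors(
    store: Dict[int, Optional[Vector]],
    keys: Sequence[int],
    values: Sequence[Vector],
) -> None:
    """Store values[i] as the vector of the instance with key keys[i]."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    for key, vector in zip(keys, values):
        store[key] = vector


vector_store: Dict[int, Optional[Vector]] = {1: None, 2: None, 3: None}
attach_vectors(vector_store, keys=[1, 3], values=[[0.1, 0.2], [0.5, 0.6]])
```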

bulk_get_all()[source]

Returns a list of all instances in this provider.

Returns:

A list of all instances in this provider

Return type:

List[Instance[KT, DT, VT, RT]]

Warning

When using this method on very large providers with lazily loaded instances, this may yield Out of Memory errors, as all the data will be loaded into RAM. Use with caution!

bulk_get_vectors(keys)[source]

Given a list of instance keys, return the vectors

Parameters:

keys (Sequence[KT]) – A list of instance keys

Returns:

A tuple of two sequences, one with keys and one with vectors. The indices match, so the instance with keys[2] has as vector vectors[2]

Return type:

Tuple[Sequence[KT], Sequence[VT]]

Warning

Some underlying implementations do not preserve the ordering of the parameter keys. Therefore, always use the keys variable from the returned tuple for the correct matching.
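The ordering caveat can be illustrated with a hypothetical backend that returns results sorted by key instead of in request order. get_vectors is an assumed helper, not the library's method; the pattern to take away is to pair the returned keys with the returned vectors.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def get_vectors(
    store: Dict[int, Optional[Vector]], keys: Sequence[int]
) -> Tuple[List[int], List[Vector]]:
    # This hypothetical backend sorts by key and skips instances without a vector.
    found = sorted(key for key in keys if store.get(key) is not None)
    return found, [store[key] for key in found]


store: Dict[int, Optional[Vector]] = {1: [0.1], 2: [0.2], 3: None}
requested = [2, 1, 3]
returned_keys, vectors = get_vectors(store, requested)

# Match on the returned keys, never on the requested ones.
by_key = dict(zip(returned_keys, vectors))
```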

abstract clear()[source]

Removes all instances from the provider

Return type:

None

Warning

Use this operation with caution! This operation is intended for use with providers that function as temporary user queues, not for large proportions of the dataset like unlabeled and labeled sets.

data_chunker(batch_size=200)[source]

Iterate over the keys and raw data of all instances in this provider, in batches

Parameters:

batch_size (int) – The batch size, the generator will return lists with size batch_size

Yields:

Sequence[Tuple[KT,DT]] – A sequence of key data tuples with length batch_size. The last sequence may be shorter.

Return type:

Iterator[Sequence[Tuple[TypeVar(KT), TypeVar(DT)]]]

data_chunker_selector(keys, batch_size=200)[source]
Parameters:
Return type:

Iterator[Sequence[Tuple[TypeVar(KT), TypeVar(DT)]]]

data_map(func)[source]

A higher order function that maps any function that works on individual raw data (DT) objects onto every Instance object in this provider.

Parameters:

func (Callable[[TypeVar(DT)], TypeVar(_V)]) – The function that should be applied

Yields:

_V – The values produced by the function func

Return type:

Iterator[TypeVar(_V)]

abstract property empty: bool

Determines if the provider does not contain instances

Returns:

True if the provider is empty

Return type:

bool

abstract get_all()[source]

Get an iterator that iterates over all instances

Yields:

InstanceType – The instances contained in this provider

Return type:

Iterator[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any])]

instance_chunker(batch_size=200)[source]

Iterate over all instances (with or without vectors) in this provider

Parameters:

batch_size (int) – The batch size, the generator will return lists with size batch_size

Yields:

Sequence[Instance[KT, DT, VT, RT]] – A sequence of instances with length batch_size. The last sequence may be shorter.

Return type:

Iterator[Sequence[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any])]]

instance_chunker_selector(keys, batch_size=200)[source]
Parameters:
Return type:

Iterator[Sequence[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any])]]

property key_list: List[KT]

Return a list of all instance keys in this provider

Returns:

A list of instance keys

Return type:

List[KT]

map(func)[source]

A higher order function that maps any function that works on individual Instance objects on every contained object in this provider.

Parameters:

func (Callable[[InstanceType], _V]) – A function that works on Instance objects of type InstanceType

Yields:

_V – The values produced by the function func

Return type:

Iterator[TypeVar(_V)]

property type_info: TypeInfo | None

vector_chunker(batch_size=200)[source]

Iterate over all pairs of keys and vectors in this provider

Parameters:

batch_size (int) – The batch size, the generator will return lists with size batch_size

Returns:

An iterator over sequences of key vector tuples

Return type:

Iterator[Sequence[Tuple[KT, VT]]]

Yields:

Sequence[Tuple[KT, VT]] – Sequences of key vector tuples

vector_chunker_selector(keys, batch_size=200)[source]

Iterate over pairs of keys and vectors for the instances whose identifiers are contained in the keys parameter.

Parameters:
  • keys (Iterable[KT]) – The keys that should be chunked

  • batch_size (int) – The batch size, the generator will return lists with size batch_size

Yields:

Sequence[Tuple[KT, VT]] – A sequence of key vector tuples with length batch_size. The last sequence may be shorter.

Returns:

An iterator over sequences of key vector tuples

Return type:

Iterator[Sequence[Tuple[KT, VT]]]

vectorized_data_map(func, batch_size=200)[source]

Maps a function that works on multiple raw data points onto all the instances in batches of size batch_size.

Note: If you run a function that combines multiple instances into a single result, this may lead to undesirable results if batches are not taken into account.

Parameters:
Yields:

_V – The result type of the function in parameter func

Return type:

Iterator[TypeVar(_V)]

vectorized_map(func, batch_size=200)[source]

Maps a function that works on multiple instances onto all the instances in batches of size batch_size.

Note: If you run a function that combines multiple instances into a single result, this may lead to undesirable results if batches are not taken into account.

Parameters:
  • func (Callable[[Iterable[InstanceType]], _V]) – The function that should be applied

  • batch_size (int, optional) – The size of the batch, by default 200

Yields:

_V – The result type of the function in parameter func

Return type:

Iterator[TypeVar(_V)]
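The note about batches can be made concrete with a small sketch. batched_map is a hypothetical stand-in that works on a plain list; it assumes, as the documentation states, that func is called once per batch and that one result is yielded per call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

_T = TypeVar("_T")
_V = TypeVar("_V")


def batched_map(
    func: Callable[[List[_T]], _V], items: Iterable[_T], batch_size: int = 200
) -> Iterator[_V]:
    """Apply func to consecutive batches and yield one result per batch."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield func(batch)


lengths = [3, 5, 4, 8, 2]

# An aggregating function returns one value per batch, not one overall value.
per_batch_max = list(batched_map(max, lengths, batch_size=2))

# Combine the per-batch results to obtain the value for the whole collection.
overall_max = max(per_batch_max)
```

An aggregate such as a mean needs more care than a maximum, because batches can differ in size.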

property with_vector: FrozenSet[KT]

property without_vector: FrozenSet[KT]

class instancelib.instances.base.SubtractionProvider(*args, **kwds)[source]

Bases: AbstractBucketProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]

This abstract class allows the creation of large subsets (buckets) that do not contain some elements, specified in a bucket. No data is copied, however, the InstanceProvider API is preserved.

In some underlying implementations (like a Many to Many relation in Django), creating a large set of elements takes a lot of time. This class lets you subtract a (small) bucket from the dataset and include only the remainder.

This class can be used in the pool-based Active Learning setting: suppose you have a small labeled set and a huge dataset. You can subtract the labeled set from the dataset and create an InstanceProvider that contains all unlabeled examples.
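The subtraction idea can be sketched with a view that filters on membership instead of materializing the remainder. SubtractionView is a hypothetical class for illustration, with a dict standing in for the dataset and a set of keys standing in for the bucket.

```python
from typing import Dict, Iterator, Set


class SubtractionView:
    """Hypothetical view: everything in dataset except the keys in bucket."""

    def __init__(self, dataset: Dict[int, str], bucket: Set[int]) -> None:
        self.dataset = dataset
        self.bucket = bucket

    def __contains__(self, key: object) -> bool:
        return key in self.dataset and key not in self.bucket

    def __iter__(self) -> Iterator[int]:
        return (key for key in self.dataset if key not in self.bucket)

    def __getitem__(self, key: int) -> str:
        if key not in self:
            raise KeyError(key)
        return self.dataset[key]


dataset = {1: "first doc", 2: "second doc", 3: "third doc"}
labeled = {2}
unlabeled = SubtractionView(dataset, labeled)
before = list(unlabeled)

# Labeling another document shrinks the view without rebuilding it.
labeled.add(3)
after = list(unlabeled)
```

Every membership check consults both the dataset and the bucket, which is one reason a solution based on explicit key sets can be faster.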

Variables:

Warning

If possible, do not use this class: a solution that is based on only InstanceProvider objects and AbstractBucketProvider will probably be faster.

bucket: InstanceProvider[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT)]

The provider that should be excluded from the original dataset.

clear()[source]
Return type:

None

create(*args, **kwargs)[source]

Create a new instance of type InstanceType. The created instance is subsequently added to the provider.

Note: The number of arguments and keyword arguments may differ in actual implementation, so there are no standard arguments.

Returns:

The newly created instance

Return type:

InstanceType

Parameters:
  • args (Any) –

  • kwargs (Any) –

class instancelib.instances.base.TypeInfo(identifier, data, vector, representation)[source]

Bases: object

Parameters:
  • identifier (Type) –

  • data (Type) –

  • vector (Type) –

  • representation (Type) –

data: Type

identifier: Type

representation: Type

vector: Type

instancelib.instances.base.default_instance_viewer(ins)[source]
Parameters:

ins (Instance[Any, Any, Any, TypeVar(RT)]) –

Return type:

Mapping[str, TypeVar(RT)]