instancelib.instances.base module
- class instancelib.instances.base.AbstractBucketProvider(*args, **kwds)[source]
Bases:
InstanceProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]
This class allows the creation of subsets (buckets) from a provider without copying data, while still preserving the InstanceProvider API.
For example, in pool-based Active Learning, the dataset is partitioned into several sets, e.g., the labeled and unlabeled parts of the dataset. In traditional supervised learning, these are the train, test and validation sets. No data is copied; only a set of identifiers is kept in this provider. All data resides in the original provider.
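The bucket idea can be illustrated with a minimal sketch that does not use instancelib itself. The class `KeyBucket` and the plain `dict` standing in for the original provider are hypothetical; the real class implements the full InstanceProvider API.

```python
from typing import Dict, Iterator, Set


class KeyBucket:
    """Hypothetical bucket: stores only keys and reads data from `dataset`."""

    def __init__(self, dataset: Dict[int, str], keys: Set[int]) -> None:
        self.dataset = dataset  # the original provider; all data stays here
        self._keys = set(keys)  # the only state owned by the bucket

    def __getitem__(self, key: int) -> str:
        # Only keys that are part of this bucket are visible.
        if key not in self._keys:
            raise KeyError(key)
        return self.dataset[key]

    def __iter__(self) -> Iterator[int]:
        return iter(self._keys)

    def __len__(self) -> int:
        return len(self._keys)


dataset = {1: "first doc", 2: "second doc", 3: "third doc"}
labeled = KeyBucket(dataset, {1})
unlabeled = KeyBucket(dataset, {2, 3})
```

Both buckets return the very same objects that are stored in `dataset`; partitioning only costs the memory needed for the key sets.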
- Variables:
dataset – The InstanceProvider that you want to take a subset from
- create(*args, **kwargs)[source]
Create a new instance of type InstanceType. The created instance is subsequently added to the provider.
Note: The number of arguments and keyword arguments may differ between implementations, so there are no standard arguments.
- dataset: InstanceProvider[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT)]
The original dataset. All data will remain there.
- property empty: bool
Determines if the provider does not contain instances
- Returns:
True if the provider is empty
- Return type:
bool
- vector_chunker_selector(keys, batch_size=200)[source]
Iterate over all instances (with or without vectors) that belong to the identifiers in the Iterable passed as the keys parameter.
- Parameters:
- Yields:
Sequence[Instance[KT, DT, VT, RT]] – A sequence of instances with length batch_size. The last sequence may be shorter.
- Returns:
An iterator over sequences of key vector tuples
- Return type:
- class instancelib.instances.base.Instance(*args, **kwds)[source]
Bases:
ABC, Generic[KT, DT, VT, RT]
A base Instance class.
Every Instance contains 4 properties:
A unique identifier (identifier)
The raw data (data)
A vector representation of the data (vector)
A human readable representation (representation)
The ABC Instance has four Generic types:
KT: The type of the key
DT: The type of the data
VT: The type of the vector
RT: The type of the representation
Combining these four items in a single object enables easy transfer between different operations like prediction, annotation and transformation.
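As a rough illustration of the four properties and the four generic types, the following sketch uses a dataclass. The name `SimpleInstance` is hypothetical; the library's own concrete Instance classes may be structured differently.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

KT = TypeVar("KT")  # type of the key
DT = TypeVar("DT")  # type of the data
VT = TypeVar("VT")  # type of the vector
RT = TypeVar("RT")  # type of the representation


@dataclass
class SimpleInstance(Generic[KT, DT, VT, RT]):
    """Hypothetical container that bundles the four properties."""

    identifier: KT
    data: DT
    vector: Optional[VT]  # may be None until a vectorizer has run
    representation: RT


doc = SimpleInstance(
    identifier=20,
    data="An example text",
    vector=None,
    representation="An example text",
)
```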
- abstract property identifier: KT
Get the identifier of the instance
- Returns:
The identifier key of the instance
- Return type:
KT
- static map_data(func)[source]
Transform a function that works on raw data into a function that works on Instance objects.
- static map_vector(func)[source]
Transform a function that works on vectors into a function that works on Instance objects.
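The transformation described for map_data and map_vector is a form of function lifting. The sketch below shows the idea for raw data under the assumption that an instance exposes a `data` attribute; `lift_data` is a hypothetical helper, not the library's implementation.

```python
from types import SimpleNamespace
from typing import Any, Callable, TypeVar

DT = TypeVar("DT")
R = TypeVar("R")


def lift_data(func: Callable[[DT], R]) -> Callable[[Any], R]:
    """Hypothetical helper: wrap `func` so it accepts an instance-like object."""

    def wrapped(instance: Any) -> R:
        # Unpack the raw data and delegate to the original function.
        return func(instance.data)

    return wrapped


# A stand-in object with the same attribute names as an Instance.
doc = SimpleNamespace(identifier=1, data="hello world", vector=None)

count_words = lift_data(lambda text: len(text.split()))
word_count = count_words(doc)
```

The same pattern applies to vectors by reading `instance.vector` instead of `instance.data`.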
- abstract property representation: RT
Return a representation for annotation
- Returns:
A representation of the raw data
- Return type:
RT
- abstract property vector: VT | None
Get the vector representation of the raw data
- Returns:
The Vector
- Return type:
Optional[VT]
- class instancelib.instances.base.InstanceProvider(*args, **kwds)[source]
Bases:
MutableMapping[KT, InstanceType], ROInstanceProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]
The Base InstanceProvider class.
This class provides an abstract implementation for a dataset. The InstanceProvider has five Generic types:
InstanceType: A subclass of Instance
KT: The type of the key
DT: The type of the data
VT: The type of the vector
RT: The type of the representation
Specifying these allows static type checkers to verify the correctness of your implementation and eases further integration in your application.
Examples
Instance access:
>>> provider = InstanceProvider()  # Replace with your implementation's constructor
>>> first_key = next(iter(provider))
>>> first_doc = provider[first_key]
Set operations:
>>> new_instance = Instance()  # Replace with your implementation's constructor
>>> provider.add(new_instance)
>>> provider.discard(new_instance)
Example implementation:
>>> from typing import Any
>>> import numpy.typing as npt
>>> class TextProvider(InstanceProvider[Instance[int, str, npt.NDArray[Any], str],
...                                     int, str, npt.NDArray[Any], str]):
...     ...  # Further implementation is needed
>>> textprovider = TextProvider()
A number of abstract methods (marked with abstractmethod()) must be implemented in your own subclass. See the source of this file to see what you need to implement.
- add(instance)[source]
Add an instance to this provider.
If the provider already contains instance, nothing happens.
- add_range(*instances)[source]
Add multiple instances to this provider.
If the provider already contains one of the instances, nothing happens for that instance.
- abstract create(*args, **kwargs)[source]
Create a new instance of type InstanceType. The created instance is subsequently added to the provider.
Note: The number of arguments and keyword arguments may differ between implementations, so there are no standard arguments.
- discard(instance)[source]
Remove an instance from this provider. If the provider does not contain instance, nothing happens.
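The set-like behaviour documented for add, add_range and discard can be sketched as follows. `DictProvider` is a hypothetical stand-in keyed by `instance.identifier`; it only mirrors the documented "nothing happens" semantics and is not the library's implementation.

```python
from types import SimpleNamespace
from typing import Any, Dict


class DictProvider:
    """Hypothetical provider that stores instances by their identifier."""

    def __init__(self) -> None:
        self._instances: Dict[Any, Any] = {}

    def add(self, instance: Any) -> None:
        # setdefault leaves an existing entry untouched.
        self._instances.setdefault(instance.identifier, instance)

    def add_range(self, *instances: Any) -> None:
        for instance in instances:
            self.add(instance)

    def discard(self, instance: Any) -> None:
        # pop with a default never raises for a missing key.
        self._instances.pop(instance.identifier, None)

    def __len__(self) -> int:
        return len(self._instances)


provider = DictProvider()
doc = SimpleNamespace(identifier=1, data="first doc")
provider.add(doc)
provider.add(doc)       # already present: no effect
size_after_add = len(provider)
provider.discard(doc)
provider.discard(doc)   # already absent: no effect
size_after_discard = len(provider)
```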
- class instancelib.instances.base.ROInstanceProvider(*args, **kwds)[source]
Bases:
Mapping[KT, InstanceType], ABC, Generic[InstanceType, KT, DT, VT, RT]
The Base InstanceProvider class (ReadOnly).
This class provides an abstract implementation for a dataset. The InstanceProvider has five Generic types:
InstanceType: A subclass of Instance
KT: The type of the key
DT: The type of the data
VT: The type of the vector
RT: The type of the representation
Specifying these allows static type checkers to verify the correctness of your implementation and eases further integration in your application.
Examples
Instance access:
>>> provider = InstanceProvider()  # Replace with your implementation's constructor
>>> first_key = next(iter(provider))
>>> first_doc = provider[first_key]
Set operations:
>>> new_instance = Instance()  # Replace with your implementation's constructor
>>> provider.add(new_instance)
>>> provider.discard(new_instance)
Example implementation:
>>> from typing import Any
>>> import numpy.typing as npt
>>> class TextProvider(InstanceProvider[Instance[int, str, npt.NDArray[Any], str],
...                                     int, str, npt.NDArray[Any], str]):
...     ...  # Further implementation is needed
>>> textprovider = TextProvider()
A number of abstract methods (marked with abstractmethod()) must be implemented in your own subclass. See the source of this file to see what you need to implement.
- abstract __contains__(item)[source]
Special method that checks if something is contained in this provider.
- Parameters:
item (object) – The item of which we want to know if it is contained in this provider
- Returns:
True if the provider contains item.
- Return type:
bool
Examples
Example usage; check if the item exists and then remove it
>>> doc_id = 20
>>> provider = InstanceProvider()
>>> if doc_id in provider:
...     del provider[doc_id]
- bulk_add_vectors(keys, values)[source]
This method adds the vectors in values to the instances specified in keys.
In some use cases, vectors are not known beforehand. This library provides several vectorizers that convert raw data points into feature vector form. Once these vectors are available, they can be added to the provider by using this method.
- Parameters:
- Return type:
Warning
We assume that the indices and length of the parameters keys and values match.
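The index-matching assumption can be made concrete with a small sketch. The function `bulk_add_vectors_sketch` and the `dict` of stand-in objects are hypothetical; the explicit length check is an addition for illustration and is not documented behaviour of the library.

```python
from types import SimpleNamespace
from typing import Any, Dict, Sequence


def bulk_add_vectors_sketch(
    store: Dict[int, Any],
    keys: Sequence[int],
    values: Sequence[Sequence[float]],
) -> None:
    """Hypothetical stand-in: attach values[i] to the instance under keys[i]."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    for key, vector in zip(keys, values):
        store[key].vector = vector


store = {
    1: SimpleNamespace(identifier=1, data="first doc", vector=None),
    2: SimpleNamespace(identifier=2, data="second doc", vector=None),
}
# The i-th vector belongs to the i-th key, regardless of storage order.
bulk_add_vectors_sketch(store, keys=[2, 1], values=[[0.3, 0.4], [0.1, 0.2]])
```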
- bulk_get_all()[source]
Returns a list of all instances in this provider.
Warning
When using this method on very large providers with lazily loaded instances, this may yield Out of Memory errors, as all the data will be loaded into RAM. Use with caution!
- bulk_get_vectors(keys)[source]
Given a list of instance keys, return the vectors
- Parameters:
keys (Sequence[KT]) – A list of instance keys
- Returns:
A tuple of two sequences, one with keys and one with vectors. The indices match, so the instance with key keys[2] has vectors[2] as its vector.
- Return type:
Warning
Some underlying implementations do not preserve the ordering of the parameter keys. Therefore, always use the keys variable from the returned tuple for the correct matching.
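The following sketch illustrates why the returned keys should be used for matching. `bulk_get_vectors_sketch` is a hypothetical stand-in that deliberately returns results in storage order rather than request order, which is one way an underlying implementation might behave.

```python
from typing import Dict, List, Sequence, Tuple


def bulk_get_vectors_sketch(
    store: Dict[int, List[float]], keys: Sequence[int]
) -> Tuple[List[int], List[List[float]]]:
    """Hypothetical stand-in that ignores the order of `keys`."""
    wanted = set(keys)
    found = [(key, vec) for key, vec in store.items() if key in wanted]
    ret_keys = [key for key, _ in found]
    vectors = [vec for _, vec in found]
    return ret_keys, vectors


store = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
requested = [3, 1]
ret_keys, vectors = bulk_get_vectors_sketch(store, requested)

# Pair vectors with the returned keys, not with the requested ones.
by_key = dict(zip(ret_keys, vectors))
```

Zipping `vectors` with `requested` instead would assign the vector of key 1 to key 3 in this example.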
- abstract clear()[source]
Removes all instances from the provider.
- Return type:
None
Warning
Use this operation with caution! This operation is intended for use with providers that function as temporary user queues, not for large portions of the dataset like the unlabeled and labeled sets.
- data_map(func)[source]
A higher-order function that maps any function that works on individual KT objects onto every Instance object in this provider.
- abstract property empty: bool
Determines if the provider does not contain instances
- Returns:
True if the provider is empty
- Return type:
bool
- instance_chunker(batch_size=200)[source]
Iterate over all instances (with or without vectors) in this provider
- Parameters:
batch_size (int) – The batch size, the generator will return lists with size batch_size
- Yields:
Sequence[Instance[KT, DT, VT, RT]] – A sequence of instances with length batch_size. The last sequence may be shorter.
- Return type:
Iterator[Sequence[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any])]]
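The batching behaviour described above, including the shorter final batch, can be sketched with a generic generator. The function `chunker` is hypothetical and works on any iterable; it is not the library's implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunker(items: Iterable[T], batch_size: int = 200) -> Iterator[List[T]]:
    """Hypothetical chunker: yield lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        # islice consumes up to batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(chunker(range(5), batch_size=2))
```

Because the generator holds only one batch at a time, it avoids loading every instance into memory at once.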
- property key_list: List[KT]
Return a list of all instance keys in this provider
- Returns:
A list of instance keys
- Return type:
List[KT]
- map(func)[source]
A higher-order function that maps any function that works on individual Instance objects onto every contained object in this provider.
- vector_chunker_selector(keys, batch_size=200)[source]
Iterate over all instances (with or without vectors) that belong to the identifiers in the Iterable passed as the keys parameter.
- Parameters:
- Yields:
Sequence[Instance[KT, DT, VT, RT]] – A sequence of instances with length batch_size. The last sequence may be shorter.
- Returns:
An iterator over sequences of key vector tuples
- Return type:
- vectorized_data_map(func, batch_size=200)[source]
Maps a function that works on multiple raw data points onto all the instances in batches of size batch_size.
Note: If you run a function that combines multiple instances into a single result, this may lead to undesirable results if batches are not taken into account.
- vectorized_map(func, batch_size=200)[source]
Maps a function that works on multiple instances onto all the instances in batches of size batch_size.
Note: If you run a function that combines multiple instances into a single result, this may lead to undesirable results if batches are not taken into account.
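The batching caveat in the note can be shown with plain numbers. In this sketch the helper `in_batches` and the sample values are hypothetical; the point is that averaging per-batch averages differs from the overall average when batches have unequal sizes.

```python
from typing import Iterator, List, Sequence


def in_batches(items: Sequence[float], batch_size: int) -> Iterator[List[float]]:
    """Hypothetical helper: split a sequence into consecutive batches."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


values = [1, 1, 1, 1, 9]

# A function that combines a whole batch into a single result.
batch_means = [sum(batch) / len(batch) for batch in in_batches(values, 4)]

# Naive combination: treats the batch of 4 and the batch of 1 as equal.
naive_mean = sum(batch_means) / len(batch_means)

# Batch-aware combination: aggregate sums and counts, then divide once.
total = sum(sum(batch) for batch in in_batches(values, 4))
overall_mean = total / len(values)
```

Here the naive result is 5.0 while the batch-aware result is 2.6, because the last batch contains only one element.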
- class instancelib.instances.base.SubtractionProvider(*args, **kwds)[source]
Bases:
AbstractBucketProvider[InstanceType, KT, DT, VT, RT], ABC, Generic[InstanceType, KT, DT, VT, RT]
This abstract class allows the creation of large subsets (buckets) that do not contain some elements, specified in a bucket. No data is copied; however, the InstanceProvider API is preserved.
In some underlying implementations (like a many-to-many relation in Django), the creation of a large element set takes a lot of time. This class makes it possible to subtract a (small) bucket from the dataset and include only the remainder.
This class can be used in the pool-based Active Learning setting: suppose you have a small labeled set and a huge dataset. You can subtract the labeled set from the dataset and create an InstanceProvider that contains all unlabeled examples.
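A minimal sketch of the subtraction idea, using plain dictionaries in place of providers. `SubtractionView` is a hypothetical name and covers only key lookup, iteration and length; the real class exposes the full InstanceProvider API.

```python
from typing import Dict, Iterator


class SubtractionView:
    """Hypothetical view: everything in `dataset` that is not in `bucket`."""

    def __init__(self, dataset: Dict[int, str], bucket: Dict[int, str]) -> None:
        self.dataset = dataset  # the full dataset; nothing is copied
        self.bucket = bucket    # the (small) set to exclude

    def __contains__(self, key: object) -> bool:
        return key in self.dataset and key not in self.bucket

    def __iter__(self) -> Iterator[int]:
        return (key for key in self.dataset if key not in self.bucket)

    def __getitem__(self, key: int) -> str:
        if key not in self:
            raise KeyError(key)
        return self.dataset[key]

    def __len__(self) -> int:
        return sum(1 for _ in self)


dataset = {1: "first doc", 2: "second doc", 3: "third doc"}
labeled = {1: "first doc"}
unlabeled = SubtractionView(dataset, labeled)
keys_before = list(unlabeled)

# Labeling another document shrinks the view without any bookkeeping.
labeled[2] = dataset[2]
keys_after = list(unlabeled)
```

Every membership test consults both containers, which hints at why the warning for this class suggests that a plain bucket-based solution will probably be faster.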
- Variables:
dataset – The InstanceProvider that you want to take a subset from
bucket – The InstanceProvider that you want to exclude from the dataset
Warning
If possible, do not use this class: a solution that is based on only InstanceProvider objects and AbstractBucketProvider will probably be faster.
- bucket: InstanceProvider[TypeVar(InstanceType, bound= Instance[Any, Any, Any, Any]), TypeVar(KT), TypeVar(DT), TypeVar(VT), TypeVar(RT)]
The provider that should be excluded from the original dataset.