instancelib.instances.hdf5vector module
- class instancelib.instances.hdf5vector.HDF5VectorStorage(h5path, mode='r')[source]
Bases:
VectorStorage[KT,ndarray,ndarray], Generic[KT,DType]
This class handles on-disk vector storage in HDF5 format. In many cases, storing feature matrices or large sets of vectors in memory is not feasible.
This class provides methods that InstanceProvider implementations can use to ensure that only the vectors needed by a given operation are kept in memory. It enables processing all vectors in chunks that fit in memory, which makes it possible to order all unlabeled instances even for very large datasets.
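The chunked approach described above can be illustrated with a minimal, library-independent sketch. It does not use instancelib or HDF5: `iter_chunks` is a hypothetical stand-in for a chunked reader, and `top_k_by_score` is an assumed helper that ranks keys while holding only one chunk plus the current best `k` candidates in memory.

```python
import heapq
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def iter_chunks(
    store: Dict[str, Vector], chunk_size: int = 200
) -> Iterator[Tuple[List[str], List[Vector]]]:
    """Hypothetical stand-in for a chunked reader: yields (keys, vectors)."""
    keys = list(store)
    for start in range(0, len(keys), chunk_size):
        chunk_keys = keys[start:start + chunk_size]
        yield chunk_keys, [store[k] for k in chunk_keys]


def top_k_by_score(
    chunks: Iterable[Tuple[List[str], List[Vector]]],
    score: Callable[[Vector], float],
    k: int,
) -> List[str]:
    """Rank keys by score, keeping at most k candidates between chunks."""
    best: List[Tuple[float, str]] = []
    for keys, vectors in chunks:
        scored = [(score(vec), key) for key, vec in zip(keys, vectors)]
        # Merge the running top-k with the new chunk, then truncate again.
        best = heapq.nlargest(k, best + scored)
    return [key for _, key in best]


# Toy data standing in for vectors that would normally live on disk.
store = {f"doc{i}": [float(i), 1.0] for i in range(10)}
ranking = top_k_by_score(iter_chunks(store, chunk_size=3), lambda v: v[0], k=3)
```

With this toy data, `ranking` holds the three keys with the highest first component, even though no more than three vectors were scored at a time.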
- Parameters:
- add_bulk(input_keys, input_values)[source]
Add a batch of keys and their corresponding values (vectors) to the vector storage
- property datasets_exist: bool
Check if the HDF5 file contains a dataset
- Returns:
True, if the file contains a dataset
- Return type:
bool
- get_matrix(keys)[source]
Return a matrix containing the vectors that correspond with the keys
- Parameters:
keys (Sequence[KT]) – A list of identifier keys
- Returns:
A tuple containing:
- A list with identifier keys
(order may differ from keys argument)
- A matrix containing the vectors
(rows correspond with the returned list)
- Return type:
Tuple[Sequence[KT], npt.NDArray[DType]]
- Raises:
NoVectorsException – If there are no vectors returned
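Because the returned key list may be ordered differently from the keys argument, callers that need the request order have to realign the rows themselves. The sketch below shows one way to do that. It uses a hypothetical stand-in, `get_matrix_standin`, backed by a plain dict and nested lists rather than HDF5 and NumPy, and it raises `LookupError` where the real class documents NoVectorsException.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def get_matrix_standin(
    store: Dict[str, Vector], keys: Sequence[str]
) -> Tuple[List[str], List[Vector]]:
    """Hypothetical stand-in: returns found keys in storage order."""
    wanted = set(keys)
    found = [k for k in store if k in wanted]
    if not found:
        # Stands in for the documented NoVectorsException.
        raise LookupError("no vectors found for the requested keys")
    return found, [store[k] for k in found]


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
requested = ["c", "a"]

ret_keys, matrix = get_matrix_standin(store, requested)

# Map each returned key to its row, then read rows in the requested order.
row_of = {key: i for i, key in enumerate(ret_keys)}
aligned = [matrix[row_of[k]] for k in requested if k in row_of]
```

The key point is to index rows through the returned key list, never through the positions of the original request.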
- get_matrix_chunked(keys, chunk_size=200)[source]
Return matrices in chunks of chunk_size containing the vectors requested in keys
- Parameters:
- Yields:
Tuple[Sequence[KT], npt.NDArray[DType]] –
A tuple containing:
- A list with identifier keys
(order may differ from keys argument)
- A matrix containing the vectors
(rows correspond with the returned list)
- Raises:
StopIteration – When there are no more chunks to process
- Return type:
Iterator[Tuple[Sequence[TypeVar(KT)],ndarray[Any,dtype[TypeVar(DType,float64,int32,int64,float32,float16,bool_)]]]]
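The StopIteration entry above reflects ordinary Python generator behavior: a `for` loop consumes it silently, and it only surfaces when `next()` is called by hand. The following sketch shows both cases with a hypothetical generator, `chunked`, that merely splits a key list and is not the library's implementation.

```python
from typing import Iterator, List, Sequence


def chunked(keys: Sequence[str], chunk_size: int = 200) -> Iterator[List[str]]:
    """Hypothetical chunk generator: yields successive slices of keys."""
    for start in range(0, len(keys), chunk_size):
        yield list(keys[start:start + chunk_size])


keys = [f"k{i}" for i in range(5)]

# A for loop (or comprehension) handles StopIteration automatically.
sizes = [len(chunk) for chunk in chunked(keys, chunk_size=2)]

# Manual iteration has to handle exhaustion explicitly.
gen = chunked(keys, chunk_size=5)
first = next(gen)
exhausted = False
try:
    next(gen)
except StopIteration:
    exhausted = True
```

The last chunk can be smaller than `chunk_size`, so downstream code should not assume a fixed number of rows per chunk.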
- get_vectors_chunked(keys, chunk_size=200)[source]
Return vectors in chunks of chunk_size containing the vectors requested in keys
- Parameters:
- Yields:
Tuple[Sequence[KT], Sequence[npt.NDArray[DType]]] –
A tuple containing two lists:
- A list with identifiers (order may differ from keys argument)
- A list with vectors
- Return type:
Iterator[Tuple[Sequence[TypeVar(KT)],Sequence[ndarray[Any,dtype[TypeVar(DType,float64,int32,int64,float32,float16,bool_)]]]]]
- get_vectors_zipped(keys, chunk_size=200)[source]
Return vectors in chunks of chunk_size containing the vectors requested in keys
- matrices_chunker(chunk_size=200)[source]
Yield matrices in chunks of chunk_size containing all the vectors in this object
- Parameters:
chunk_size (int, optional) – The size of the chunks, by default 200
- Yields:
Tuple[Sequence[KT], npt.NDArray[DType]] –
A tuple containing:
- A list with identifier keys
- A matrix containing the vectors
(row indices correspond with the list indices)
- Raises:
StopIteration – When there are no more chunks to process
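A typical use of a chunker over all vectors is to accumulate a statistic without loading the full matrix. The sketch below computes column means incrementally. It assumes only the documented yield shape (a list of keys plus a matrix whose rows match that list) and uses nested lists in place of NumPy arrays; `column_means` is a hypothetical helper, not part of the library.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def column_means(chunks: Iterable[Tuple[Sequence[str], Matrix]]) -> List[float]:
    """Accumulate per-column sums chunk by chunk, then divide by row count."""
    totals: List[float] = []
    count = 0
    for _keys, matrix in chunks:
        for row in matrix:
            if not totals:
                # Size the accumulator from the first row seen.
                totals = [0.0] * len(row)
            for j, value in enumerate(row):
                totals[j] += value
            count += 1
    if count == 0:
        raise ValueError("no rows were yielded")
    return [total / count for total in totals]


# Two toy chunks in the (keys, matrix) shape described above.
chunks = [
    (["a", "b"], [[1.0, 2.0], [3.0, 4.0]]),
    (["c"], [[5.0, 6.0]]),
]
means = column_means(chunks)
```

Only the running totals and the current chunk are held in memory, which is the property that makes chunked iteration useful for very large stores.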
- rebuild_index(type_restorer=<function identity>)[source]
Rebuild the index after manual manipulation of an HDF5 file.
- Raises:
NoVectorsException – If there are no vectors, or if they are stored incorrectly
- Parameters:
- Return type:
- vectors_chunker(chunk_size=200)[source]
Return vectors in chunks of chunk_size. This generator will yield all vectors contained in this object.
- Parameters:
chunk_size (int, optional) – The size of the chunks, by default 200
- Yields:
Sequence[Tuple[KT, npt.NDArray[DType]]] –
A list containing tuples of:
- An identifier
- A vector
- Return type:
Iterator[Sequence[Tuple[TypeVar(KT),ndarray[Any,dtype[TypeVar(DType,float64,int32,int64,float32,float16,bool_)]]]]]
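The chunked methods on this page yield two different shapes: a list of (identifier, vector) tuples, as in vectors_chunker, or a tuple of two parallel lists, as in get_vectors_chunked. Converting between the two is a matter of `zip`. The sketch below uses plain lists as stand-ins for NumPy vectors, and `unzip_chunk` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def unzip_chunk(
    chunk: Sequence[Tuple[str, Vector]],
) -> Tuple[List[str], List[Vector]]:
    """Turn a list of (key, vector) pairs into parallel key and vector lists."""
    if not chunk:
        # zip(*[]) yields nothing to unpack, so handle the empty chunk here.
        return [], []
    keys, vectors = zip(*chunk)
    return list(keys), list(vectors)


# One chunk in the list-of-pairs shape.
pairs = [("a", [1.0, 2.0]), ("b", [3.0, 4.0])]

keys, vectors = unzip_chunk(pairs)

# Going back to the list-of-pairs shape is a plain zip.
rezipped = list(zip(keys, vectors))
```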