instancelib.instances.hdf5vector module

class instancelib.instances.hdf5vector.HDF5VectorStorage(h5path, mode='r')[source]

Bases: VectorStorage[KT, ndarray, ndarray], Generic[KT, DType]

This class handles on-disk vector storage in HDF5 format. In many cases, storing feature matrices or large sets of vectors in memory is not feasible.

This class provides methods that InstanceProvider implementations can use to ensure that only the vectors needed by a given operation are kept in memory. It also supports processing all vectors in chunks that fit in memory, which makes it possible to order all unlabeled instances even for very large datasets.

Parameters:
  • h5path (str) – The path to the HDF5 file

  • mode (str, optional) – The file mode (see h5py documentation), by default “r”
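The chunked processing described above can be illustrated without HDF5. The following is a minimal sketch, not the library's implementation: `toy_matrices_chunker` is a hypothetical in-memory stand-in that yields `(keys, rows)` tuples in the same shape as the chunking methods documented below, and `top_k_by_score` is an assumed helper that ranks instances while holding only one chunk and the current top `k` in memory.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Vector = List[float]
Chunk = Tuple[Sequence[str], Sequence[Vector]]


def toy_matrices_chunker(
    keys: Sequence[str], rows: Sequence[Vector], chunk_size: int = 200
) -> Iterator[Chunk]:
    """Hypothetical in-memory stand-in that yields (keys, rows) per chunk."""
    for start in range(0, len(keys), chunk_size):
        stop = start + chunk_size
        yield keys[start:stop], rows[start:stop]


def top_k_by_score(
    chunks: Iterable[Chunk], score: Callable[[Vector], float], k: int
) -> List[Tuple[float, str]]:
    """Keep only the k best (score, key) pairs while streaming over chunks."""
    best: List[Tuple[float, str]] = []  # min-heap holding the current top k
    for chunk_keys, chunk_rows in chunks:
        for key, row in zip(chunk_keys, chunk_rows):
            item = (score(row), key)
            if len(best) < k:
                heapq.heappush(best, item)
            else:
                # Push the new item, then drop the smallest of the k + 1.
                heapq.heappushpop(best, item)
    return sorted(best, reverse=True)


keys = ["a", "b", "c", "d", "e"]
rows = [[1.0, 2.0], [9.0, 8.0], [4.0, 4.0], [7.0, 0.0], [0.0, 3.0]]
ranking = top_k_by_score(toy_matrices_chunker(keys, rows, chunk_size=2), sum, k=3)
ranked_keys = [key for _, key in ranking]
```

With a real storage object, the stand-in generator would presumably be replaced by one of the chunking methods of this class; the ranking helper itself only relies on the `(keys, rows)` tuple shape.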

__len__()[source]

Returns the size of the dataset

Returns:

The size of the dataset

Return type:

int

add_bulk(input_keys, input_values)[source]

Add a bulk of keys and values (vectors) to the vector storage

Parameters:
  • input_keys (Sequence[KT]) – The keys of the Instances

  • input_values (Sequence[Optional[npt.NDArray[DType]]]) – The vectors that correspond with the keys

Return type:

None

add_bulk_matrix(keys, matrix)[source]

Add the rows of a matrix to the vector storage in bulk

Parameters:
  • keys (Sequence[KT]) – A list of identifiers. The following should hold: len(keys) == matrix.shape[0]

  • matrix (npt.NDArray[DType]) – A matrix. The rows should correspond with the identifiers in keys

Return type:

None

close()[source]

Close the file and write changes to the index to disk

Return type:

None

property datasets_exist: bool

Check if the HDF5 file contains a dataset

Returns:

True, if the file contains a dataset

Return type:

bool

get_matrix(keys)[source]

Return a matrix containing the vectors that correspond with the keys

Parameters:

keys (Sequence[KT]) – A list of identifier keys

Returns:

A tuple containing:

  • A list with identifier keys

    (order may differ from keys argument)

  • A matrix containing the vectors

    (rows correspond with the returned list)

Return type:

Tuple[Sequence[KT], npt.NDArray[DType]]

Raises:

NoVectorsException – If there are no vectors returned
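Because the returned key list may be ordered differently from the keys argument, callers that need rows in the requested order have to realign them. The sketch below shows one way to do that; `rows_in_requested_order` is a hypothetical helper, not part of instancelib, and plain nested lists stand in for the matrix so that the example has no dependencies.

```python
from typing import Hashable, List, Sequence, TypeVar

KT = TypeVar("KT", bound=Hashable)


def rows_in_requested_order(
    requested: Sequence[KT],
    returned_keys: Sequence[KT],
    returned_rows: Sequence[Sequence[float]],
) -> List[Sequence[float]]:
    """Reorder rows so that row i belongs to requested[i].

    Raises KeyError if a requested key is missing from returned_keys.
    """
    # Map each returned key to the index of its row.
    position = {key: i for i, key in enumerate(returned_keys)}
    return [returned_rows[position[key]] for key in requested]


# Hypothetical lookup result whose key order differs from the request.
requested = [3, 1, 2]
returned_keys = [1, 2, 3]
returned_rows = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
aligned = rows_in_requested_order(requested, returned_keys, returned_rows)
```

With a NumPy matrix, the same realignment can be done in one step by indexing the matrix with the list of positions, for example `matrix[[position[key] for key in requested]]`.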

get_matrix_chunked(keys, chunk_size=200)[source]

Return matrices in chunks of chunk_size containing the vectors requested in keys

Parameters:
  • keys (Sequence[KT]) – A list of identifier keys

  • chunk_size (int, optional) – The size of the chunks, by default 200

Yields:

Tuple[Sequence[KT], npt.NDArray[DType]]

A tuple containing:

  • A list with identifier keys

    (order may differ from keys argument)

  • A matrix containing the vectors

    (rows correspond with the returned list)

Raises:

StopIteration – When there are no more chunks to process

Return type:

Iterator[Tuple[Sequence[TypeVar(KT)], ndarray[Any, dtype[TypeVar(DType, float64, int32, int64, float32, float16, bool_)]]]]
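The StopIteration entry above is ordinary Python generator exhaustion: a for loop handles it implicitly, and it only surfaces when next() is called by hand. The sketch below demonstrates this with `toy_matrix_chunks`, a hypothetical stand-in generator that yields tuples in the documented `(keys, rows)` shape; it is not the library's implementation.

```python
from typing import Iterator, List, Sequence, Tuple

Row = List[float]


def toy_matrix_chunks(
    keys: Sequence[str], rows: Sequence[Row], chunk_size: int = 200
) -> Iterator[Tuple[Sequence[str], Sequence[Row]]]:
    """Hypothetical stand-in: yield (keys, rows) for consecutive chunks."""
    for start in range(0, len(keys), chunk_size):
        stop = start + chunk_size
        yield keys[start:stop], rows[start:stop]


keys = ["a", "b", "c"]
rows = [[1.0], [2.0], [3.0]]

# Iterating consumes the generator and stops cleanly once it is exhausted.
chunk_lengths = [len(chunk_keys) for chunk_keys, _ in toy_matrix_chunks(keys, rows, 2)]

# Calling next() by hand is where StopIteration becomes visible.
gen = toy_matrix_chunks(keys, rows, chunk_size=2)
next(gen)  # first chunk: keys "a" and "b"
next(gen)  # second chunk: key "c"
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True
```

In practice a for loop is usually the simplest way to consume the chunks, since no explicit exception handling is needed.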

get_vectors(keys)[source]

Return the vectors that correspond with the keys

Parameters:

keys (Sequence[KT]) – A list of identifier keys

Returns:

A tuple containing two lists:

  • A list with identifiers (order may differ from keys argument)

  • A list with vectors

Return type:

Tuple[Sequence[KT], Sequence[npt.NDArray[DType]]]

get_vectors_chunked(keys, chunk_size=200)[source]

Return vectors in chunks of chunk_size containing the vectors requested in keys

Parameters:
  • keys (Sequence[KT]) – A list of identifier keys

  • chunk_size (int, optional) – The size of the chunks, by default 200

Yields:

Tuple[Sequence[KT], Sequence[npt.NDArray[DType]]]

A tuple containing two lists:

  • A list with identifiers (order may differ from keys argument)

  • A list with vectors

Return type:

Iterator[Tuple[Sequence[TypeVar(KT)], Sequence[ndarray[Any, dtype[TypeVar(DType, float64, int32, int64, float32, float16, bool_)]]]]]

get_vectors_zipped(keys, chunk_size=200)[source]

Return (key, vector) pairs in chunks of chunk_size for the vectors requested in keys

Parameters:
  • keys (Sequence[KT]) – A list of identifier keys

  • chunk_size (int, optional) – The size of the chunks, by default 200

Yields:

Sequence[Tuple[KT, npt.NDArray[DType]]]

A list containing tuples of:

  • An identifier (order may differ from keys argument)

  • A vector

Return type:

Iterator[Sequence[Tuple[TypeVar(KT), ndarray[Any, dtype[TypeVar(DType, float64, int32, int64, float32, float16, bool_)]]]]]
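Each yielded chunk is a list of (key, vector) pairs, so the chunks can be flattened into a single lookup table when the selection is small enough to fit in memory. A minimal sketch, assuming hand-written `chunks` data in the documented shape (plain lists stand in for the vectors):

```python
from itertools import chain
from typing import Dict, List, Sequence, Tuple

Vector = List[float]

# Hypothetical chunks in the documented shape: lists of (key, vector) pairs.
chunks: Sequence[Sequence[Tuple[str, Vector]]] = [
    [("a", [1.0, 0.0]), ("b", [0.0, 1.0])],
    [("c", [1.0, 1.0])],
]

# Flatten all chunks into one key -> vector mapping.
lookup: Dict[str, Vector] = dict(chain.from_iterable(chunks))
```

For selections that do not fit in memory, processing each chunk inside the loop and discarding it afterwards keeps the memory footprint bounded by chunk_size.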

matrices_chunker(chunk_size=200)[source]

Yield matrices in chunks of chunk_size containing all the vectors in this object

Parameters:

chunk_size (int, optional) – The size of the chunks, by default 200

Yields:

Tuple[Sequence[KT], npt.NDArray[DType]]

A tuple containing:

  • A list with identifier keys

  • A matrix containing the vectors

    (row indices correspond with the list indices)

Raises:

StopIteration – When there are no more chunks to process

rebuild_index(type_restorer=<function identity>)[source]

Rebuild the index after manual manipulation of an HDF5 file.

Raises:

NoVectorsException – If there are no vectors, or if they are stored incorrectly

Parameters:

type_restorer (Callable[[Any], TypeVar(KT)]) –

Return type:

None
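The documentation does not describe type_restorer beyond its signature, Callable[[Any], KT], and its default, the identity function. Judging from that signature alone, it appears to map a key as read from the file back to the key type used by the caller; this is an assumption, not confirmed behaviour. Under that assumption, a restorer for integer keys could look like the following sketch, where `restore_int_key` is a hypothetical name:

```python
from typing import Any


def restore_int_key(raw: Any) -> int:
    """Hypothetical type_restorer: turn a stored key back into an int.

    Assumes keys may come back as bytes, str, or int; how keys are
    actually stored on disk is not specified in this documentation.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return int(raw)


restored = [restore_int_key(raw) for raw in (b"12", "7", 3)]
```

If the assumption holds, such a function would be passed as rebuild_index(type_restorer=restore_int_key).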

reload()[source]

Reload the index from disk

Return type:

None

vectors_chunker(chunk_size=200)[source]

Return vectors in chunks of chunk_size. This generator will yield all vectors contained in this object.

Parameters:

chunk_size (int, optional) – The size of the chunks, by default 200

Yields:

Sequence[Tuple[KT, npt.NDArray[DType]]]

A list containing tuples of:

  • An identifier

  • A vector

Return type:

Iterator[Sequence[Tuple[TypeVar(KT), ndarray[Any, dtype[TypeVar(DType, float64, int32, int64, float32, float16, bool_)]]]]]

property writeable: bool

Check if the storage is writeable

Returns:

True when writeable

Return type:

bool

instancelib.instances.hdf5vector.keys_wrapper(keys)[source]
Parameters:

keys (Sequence[Any]) –

Return type:

Sequence[Union[str, int]]