daft.functions.minhash
minhash
minhash(text: Expression, *, num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'xxhash32', 'xxhash64', 'xxhash3_64', 'sha1'] = 'murmurhash3') -> Expression
Runs the MinHash algorithm on the series.
For each string, computes the minimum hash over all of its n-grams and repeats this for each of the num_hashes permutations. The result is returned as a list of num_hashes 32-bit unsigned integers.
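To make the description above concrete, here is a minimal pure-Python sketch of the general MinHash technique. It is not Daft's implementation: the function name `minhash_sketch`, the SHA-1-based stand-in hash, the random affine permutations, and the handling of inputs shorter than `ngram_size` are all assumptions made for illustration, so the values will not match what `minhash` returns.

```python
import hashlib
import random

# Assumed constants for this sketch: a Mersenne prime modulus for the
# affine "permutations" and a mask that keeps results in the UInt32 range.
_PRIME = (1 << 61) - 1
_MASK32 = (1 << 32) - 1


def _base_hash(shingle: str, seed: int) -> int:
    """Stand-in 32-bit string hash (SHA-1 truncated); not murmurhash3."""
    data = seed.to_bytes(8, "little") + shingle.encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")


def minhash_sketch(text: str, *, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    """Return num_hashes minimum hash values for the n-grams of text."""
    # Tokens are delimited by spaces; empty tokens are dropped in this sketch.
    tokens = [t for t in text.split(" ") if t]

    # Build the n-grams (shingles). Assumption: if there are fewer tokens
    # than ngram_size, the whole token sequence is treated as one shingle.
    if len(tokens) < ngram_size:
        shingles = [" ".join(tokens)]
    else:
        shingles = [
            " ".join(tokens[i : i + ngram_size])
            for i in range(len(tokens) - ngram_size + 1)
        ]

    # Hash every shingle once, then derive num_hashes permuted values from it.
    base = [_base_hash(s, seed) for s in shingles]
    rng = random.Random(seed)
    perms = [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(num_hashes)]

    # For each permutation, keep the minimum permuted hash over all shingles.
    return [min(((a * h + b) % _PRIME) & _MASK32 for h in base) for a, b in perms]


signature = minhash_sketch("the quick brown fox jumps", num_hashes=8, ngram_size=3)
print(len(signature))  # one value per permutation
```

Because each output slot is a minimum over the set of n-grams, two strings that share most of their n-grams tend to agree in many slots, which is what makes the signatures useful for near-duplicate detection.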
Tokens for the n-grams are delimited by spaces. The strings are not normalized or pre-processed in any way, so normalizing them yourself before calling minhash is recommended.
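Since no normalization is applied, it may help to clean strings before hashing. The sketch below shows one possible normalization in plain Python; the function name `normalize_for_minhash` and the particular steps (Unicode NFKC, lowercasing, punctuation removal, whitespace collapsing) are assumptions for illustration, and the right choices depend on your data.

```python
import re
import unicodedata


def normalize_for_minhash(text: str) -> str:
    """Hypothetical helper: produce space-delimited, lowercased tokens."""
    # Fold compatibility characters and case so equivalent text hashes alike.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation with spaces so it does not stick to tokens.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse tabs, newlines, and repeated spaces into single spaces,
    # because tokens are split on spaces only.
    return " ".join(text.split())


print(normalize_for_minhash("  Hello,   World!  "))  # hello world
```

The normalized strings can then be used as the text input, so that differences in case, punctuation, or spacing do not produce different n-grams.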
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | String Expression | The expression to hash. | required |
| num_hashes | int | The number of hash permutations to compute. | required |
| ngram_size | int | The number of tokens in each shingle/ngram. | required |
| seed | int | Seed used for generating permutations and the initial string hashes. | 1 |
| hash_function | str | Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". | 'murmurhash3' |
Returns:
| Name | Type | Description |
|---|---|---|
| Expression | FixedSizeList[UInt32, num_hashes] Expression | Expression representing the MinHash values. |
Source code in daft/functions/misc.py