
daft.functions.minhash

minhash

minhash(text: Expression, *, num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'xxhash32', 'xxhash64', 'xxhash3_64', 'sha1'] = 'murmurhash3') -> Expression

Computes MinHash signatures for a string expression.

For each string, calculates the minimum hash over all of its ngrams, repeating with num_hashes permutations. The result is returned as a list of num_hashes 32-bit unsigned integers.
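To make the description above concrete, here is a minimal pure-Python sketch of the general MinHash technique. It is not Daft's implementation and will not reproduce Daft's output values: the SHA-1 based `_base_hash`, the `(a * h + b) mod p` permutation scheme, the dropping of empty tokens, and the handling of texts shorter than `ngram_size` are all assumptions made for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative permutations (assumption)
MAX_U32 = (1 << 32) - 1         # results are truncated to 32-bit unsigned integers


def _base_hash(ngram: str) -> int:
    # Illustrative 32-bit hash built from SHA-1; a stand-in for the real hash function.
    digest = hashlib.sha1(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def _permutations(num_hashes: int, seed: int) -> list[tuple[int, int]]:
    # One (a, b) pair per permutation, derived deterministically from the seed.
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]


def minhash_sketch(text: str, *, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    # Tokens are delimited by spaces; empty tokens are dropped (assumption).
    tokens = [t for t in text.split(" ") if t]
    if len(tokens) < ngram_size:
        # Assumption: a short text is treated as a single shingle.
        ngrams = [" ".join(tokens)]
    else:
        ngrams = [
            " ".join(tokens[i : i + ngram_size])
            for i in range(len(tokens) - ngram_size + 1)
        ]
    hashes = [_base_hash(g) for g in ngrams]
    # For each permutation, keep the minimum permuted hash over all ngrams.
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_U32 for h in hashes)
        for a, b in _permutations(num_hashes, seed)
    ]
```

Calling `minhash_sketch("the quick brown fox", num_hashes=16, ngram_size=2)` returns a list of 16 integers, and the same text and seed always yield the same list.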

Tokens for the ngrams are delimited by spaces. The strings are not normalized or pre-processed, so normalize them yourself before hashing.
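Because no normalization is applied, strings that differ only in case, punctuation, or spacing produce different ngrams. One possible normalization step is sketched below. The helper name `normalize_text` and the specific choices (NFKC, lowercasing, replacing punctuation with spaces, collapsing whitespace) are assumptions, not something Daft prescribes; adjust them to your data.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    # Hypothetical helper: unify Unicode forms and case.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is neither a word character nor whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse all whitespace runs to single spaces, matching the space-delimited tokenization.
    return " ".join(text.split())
```

Applying a step like this before hashing means tabs, newlines, and repeated spaces no longer change which tokens end up in each ngram.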

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | String Expression | The expression to hash. | required |
| num_hashes | int | The number of hash permutations to compute. | required |
| ngram_size | int | The number of tokens in each shingle/ngram. | required |
| seed | int | Seed used for generating permutations and the initial string hashes. | 1 |
| hash_function | str | Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". | 'murmurhash3' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Expression | FixedSizedList[UInt32, num_hashes] Expression | Expression representing the MinHash values. |
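MinHash values are typically compared position by position: the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying ngram sets. The helper below is a hypothetical sketch of that comparison on plain Python lists. It assumes both signatures were produced with the same `num_hashes`, `ngram_size`, `seed`, and `hash_function`.

```python
def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Hypothetical helper: signatures are only comparable if they have the same length.
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    # Count the positions where the two minimum hashes agree.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A larger `num_hashes` gives a finer-grained estimate at the cost of longer signatures.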

Source code in daft/functions/misc.py
def minhash(
    text: Expression,
    *,
    num_hashes: int,
    ngram_size: int,
    seed: int = 1,
    hash_function: Literal["murmurhash3", "xxhash", "xxhash32", "xxhash64", "xxhash3_64", "sha1"] = "murmurhash3",
) -> Expression:
    """Runs the MinHash algorithm on the series.

    For a string, calculates the minimum hash over all its ngrams,
    repeating with `num_hashes` permutations. Returns as a list of 32-bit unsigned integers.

    Tokens for the ngrams are delimited by spaces.
    The strings are not normalized or pre-processed, so it is recommended
    to normalize the strings yourself.

    Args:
        text (String Expression): The expression to hash.
        num_hashes (int): The number of hash permutations to compute.
        ngram_size (int): The number of tokens in each shingle/ngram.
        seed (int, default=1): Seed used for generating permutations and the initial string hashes. Defaults to 1.
        hash_function (str, default="murmurhash3"): Hash function to use for initial string hashing. One of "murmurhash3", "xxhash" (alias for "xxhash3_64"), "xxhash32", "xxhash64", "xxhash3_64", or "sha1". Defaults to "murmurhash3".

    Returns:
        Expression (FixedSizedList[UInt32, num_hashes] Expression):
            expression representing the MinHash values.

    """
    return Expression._call_builtin_scalar_fn(
        "minhash", text, num_hashes=num_hashes, ngram_size=ngram_size, seed=seed, hash_function=hash_function
    )