From 6f84713d1ac884b6fd79bef6fae4536d0013fe34 Mon Sep 17 00:00:00 2001 From: zackcam Date: Tue, 18 Feb 2025 00:45:40 +0000 Subject: [PATCH 01/10] Adding bloom command meta data and bloom group, adding bloom data type as well Signed-off-by: zackcam --- commands/bf.add.md | 12 +++++ commands/bf.card.md | 13 +++++ commands/bf.exists.md | 16 ++++++ commands/bf.info.md | 35 +++++++++++++ commands/bf.insert.md | 40 +++++++++++++++ commands/bf.load.md | 4 ++ commands/bf.madd.md | 15 ++++++ commands/bf.mexists.md | 16 ++++++ commands/bf.reserve.md | 23 +++++++++ commands/commands | 1 + groups.json | 4 ++ resp2_replies.json | 49 ++++++++++++++++++ resp3_replies.json | 49 ++++++++++++++++++ topics/bloomfilters.md | 111 +++++++++++++++++++++++++++++++++++++++++ topics/data-types.md | 8 +++ 15 files changed, 396 insertions(+) create mode 100644 commands/bf.add.md create mode 100644 commands/bf.card.md create mode 100644 commands/bf.exists.md create mode 100644 commands/bf.info.md create mode 100644 commands/bf.insert.md create mode 100644 commands/bf.load.md create mode 100644 commands/bf.madd.md create mode 100644 commands/bf.mexists.md create mode 100644 commands/bf.reserve.md create mode 120000 commands/commands create mode 100644 topics/bloomfilters.md diff --git a/commands/bf.add.md b/commands/bf.add.md new file mode 100644 index 00000000..a6d8091e --- /dev/null +++ b/commands/bf.add.md @@ -0,0 +1,12 @@ +Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name. 
+## Arguments +* key (required) - A Valkey key of Bloom data type +* item (required) - Item to add + +## Examples +``` +127.0.0.1:6379> BF.ADD key val +1 +127.0.0.1:6379> BF.ADD key val +0 +``` diff --git a/commands/bf.card.md b/commands/bf.card.md new file mode 100644 index 00000000..bae8c87f --- /dev/null +++ b/commands/bf.card.md @@ -0,0 +1,13 @@ +Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter. +## Arguments +* key (required) - A Valkey key of Bloom data type + +## Examples +``` +127.0.0.1:6379> BF.ADD key val +1 +127.0.0.1:6379> BF.CARD key +1 +127.0.0.1:6379> BF.CARD missing +0 +``` \ No newline at end of file diff --git a/commands/bf.exists.md b/commands/bf.exists.md new file mode 100644 index 00000000..d072f738 --- /dev/null +++ b/commands/bf.exists.md @@ -0,0 +1,16 @@ +Determines if a specified item has been added to the specified bloom filter. +Syntax + +## Arguments +* key (required) - A Valkey key of Bloom data type +* item (required) - The item that we are checking if it exists in the bloom object + +## Examples +``` +127.0.0.1:6379> BF.ADD key val +1 +127.0.0.1:6379> BF.EXISTS key val +1 +127.0.0.1:6379> BF.EXISTS key missing +0 +``` diff --git a/commands/bf.info.md b/commands/bf.info.md new file mode 100644 index 00000000..3a8301d8 --- /dev/null +++ b/commands/bf.info.md @@ -0,0 +1,35 @@ +Returns information about a bloomfilter + +## Arguments +* key (required) - A valkey key of bloom data type +* CAPACITY (optional) - Returns the number of unique items that would need to be added before scaling would happen +* SIZE (optional) - Returns the memory size which is the number of bytes allocated +* FILTERS (optional) - Returns the number of filters in the specified key +* ITEMS (optional) - Returns the number of unique items that have been added the the Bloom filter +* ERROR (optional) - Returns the false positive rate for the bloom filter +* EXPANSION (optional) - Returns the expansion rate +* 
MAXSCALEDCAPACITY (optional) - Returns the maximum capacity that can be reached before an error occurs +If none of the optional fields are specified, all the fields will be returned. MAXSCALEDCAPACITY will be an unrecognized argument on non scaling filters + +## Examples +``` +127.0.0.1:6379> BF.ADD key val +1 +127.0.0.1:6379> BF.INFO key + 1) Capacity + 2) (integer) 100 + 3) Size + 4) (integer) 384 + 5) Number of filters + 6) (integer) 1 + 7) Number of items inserted + 8) (integer) 2 + 9) Error rate +10) "0.01" +11) Expansion rate +12) (integer) 2 +13) Max scaled capacity +14) (integer) 26214300 +127.0.0.1:6379> BF.INFO key CAPACITY +100 +``` \ No newline at end of file diff --git a/commands/bf.insert.md b/commands/bf.insert.md new file mode 100644 index 00000000..4ad7e6e9 --- /dev/null +++ b/commands/bf.insert.md @@ -0,0 +1,40 @@ +Creates a bloom object with the specified parameters. If a parameter is not specified then the default value will be used. If ITEMS is specified then it will also attempt to add all items specified after + +## Arguments +Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSCALING and VALIDATESCALETO isn't allowed +* key (required) - Is the key name for a Bloom filter to add the item to +* CAPACITY capacity (optional) - capacity for the inital bloom filter +* ERROR fp_error (optional)- The false positive rate for the bloom filter +* EXPANSION expansion(optional) - The expansion rate for a scaling filter +* NOCREATE (optional) - Will not create the bloom filter and add items if the filter does not exist already +* TIGHTENING (optional) - The tightening ratio for the bloom filter +* SEED (optional) - The seed the hash functions will use +* NONSCALING (optional) - Will make it so the filter can not scale +* VALIDATESCALETO validatescaleto (optional) - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter +* ITEMS (optional) - Items we will add to the bloom 
filter + +## Examples +``` +127.0.0.1:6379> BF.INSERT key ITEMS item1 item2 +1) (integer) 1 +2) (integer) 1 +# This does not update the capcity but uses the origianl filters values +127.0.0.1:6379> BF.INSERT key CAPACITY 1000 ITEMS item2 item3 +1) (integer) 0 +2) (integer) 1 +127.0.0.1:6379> BF.INSERT key_new CAPACITY 1000 +[] +``` + +``` +127.0.0.1:6379> BF.INSERT key NONSCALING VALIDATESCALETO 100 +cannot use NONSCALING and VALIDATESCALETO options together +127.0.0.1:6379> BF.INSERT key CAPACITY 1000 VALIDATESCALETO 999999999999999999999 ITEMS item2 item3 +provided VALIDATESCALETO causes bloom object to exceed memory limit +127.0.0.1:6379> BF.INSERT key VALIDATESCALETO 999999999999999999999 EXPANSION 1 ITEMS item2 item3 +provided VALIDATESCALETO causes false positive to degrade to 0 +``` +``` +127.0.0.1:6379> BF.INSERT key NOCREATE ITEMS item1 item2 +not found +``` \ No newline at end of file diff --git a/commands/bf.load.md b/commands/bf.load.md new file mode 100644 index 00000000..04ae9979 --- /dev/null +++ b/commands/bf.load.md @@ -0,0 +1,4 @@ +Loads a bloom filter from a dump of an existing bloom object with all the properties and bit vector dump. +## Arguments +* key (required) - Is the key name for a Bloom filter to add the item to +* dump (required) - Is the dump we are restoring diff --git a/commands/bf.madd.md b/commands/bf.madd.md new file mode 100644 index 00000000..a044125c --- /dev/null +++ b/commands/bf.madd.md @@ -0,0 +1,15 @@ +Adds one or more items to a Bloom Filter, if the specified filter does not exist creates a default bloom filter with that name. 
+## Arguments +* key (required) - Is the key name for a Bloom filter to add the item to +* item (requires at least 1 item but can add as many as wanted ) - Is the item/s to add +## Examples +``` +127.0.0.1:6379> BF.MADD key item1 item2 +1) (integer) 1 +2) (integer) 1 +127.0.0.1:6379> BF.MADD key item2 item3 +1) (integer) 0 +2) (integer) 1 +127.0.0.1:6379> BF.MADD key_new item1 +1) (integer) 1 +``` \ No newline at end of file diff --git a/commands/bf.mexists.md b/commands/bf.mexists.md new file mode 100644 index 00000000..30dfa289 --- /dev/null +++ b/commands/bf.mexists.md @@ -0,0 +1,16 @@ +Determines if one or more items has been added to the specified bloom filter +## Arguments +* key (required) - A Valkey key of Bloom data type +* item (requires at least 1 item but can add as many as desired) - The item/s that we are checking if it exists in the bloom object +## Examples +``` +127.0.0.1:6379> BF.MADD key item1 item2 +1) (integer) 1 +2) (integer) 1 +127.0.0.1:6379> BF.MEXISTS key item1 item2 item3 +1) (integer) 1 +2) (integer) 1 +3) (integer) 0 +127.0.0.1:6379> BF.MEXISTS key item1 +1) (integer) 1 +``` \ No newline at end of file diff --git a/commands/bf.reserve.md b/commands/bf.reserve.md new file mode 100644 index 00000000..5fb5fe39 --- /dev/null +++ b/commands/bf.reserve.md @@ -0,0 +1,23 @@ +Creates an empty bloom object with the capacity and false positive rate specified +## Arguments +* key (required) - A Valkey key of Bloom data type +* error_rate (required) - The fp rate the bloom filter will be created with +* capacity (required) - The starting capacity the bloom filter will be created with +* EXPANSION expansion(optional)- The rate in which filters will increase by +* NONSCALING (optional) - Setting this will make it so the bloom object can’t expand past its initial capacity + +## Examples +``` +127.0.0.1:6379> BF.RESERVE key 0.01 1000 +OK +127.0.0.1:6379> BF.RESERVE key 0.1 1000000 +(error) ERR item exists +``` +``` +127.0.0.1:6379> BF.RESERVE 
bf_expansion 0.0001 5000 EXPANSION 3 +OK +``` +``` +127.0.0.1:6379> BF.RESERVE bf_nonscaling 0.0001 5000 NONSCALING +OK +``` diff --git a/commands/commands b/commands/commands new file mode 120000 index 00000000..55a41c2b --- /dev/null +++ b/commands/commands @@ -0,0 +1 @@ +../valkey-doc/commands \ No newline at end of file diff --git a/groups.json b/groups.json index 46f69a50..67735da4 100644 --- a/groups.json +++ b/groups.json @@ -3,6 +3,10 @@ "display": "Bitmap", "description": "Operations on the Bitmap data type" }, + "bloom": { + "display": "Bloom", + "description": "Operations on the Bloom filter data type" + }, "cluster": { "display": "Cluster", "description": "Valkey Cluster management" diff --git a/resp2_replies.json b/resp2_replies.json index b678d630..f7226512 100644 --- a/resp2_replies.json +++ b/resp2_replies.json @@ -62,6 +62,55 @@ "AUTH": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`, or an error if the password, or username/password pair, is invalid." ], + "BF.ADD": [ + "One of the following:", + "* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", + "* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" + ], + "BF.CARD": [ + "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.EXISTS": [ + "One of the following:", + "* [Integer reply](../topics/protocol.md#integers): '1'. The item exists in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): '0'. 
The bloom filter does not exist or the item has not been added to the bloom filter", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.INFO": [ + "When no optional arguments are provided:", + "[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", + "When an optional argument is provided:", + "* [Integer reply](../topics/protocol.md#integers): argument value", + "* [String reply??](../topics/protocol.md#simple-strings): argument value", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if trying to get MAXSCALEDCAPACITY of a non scaling filter" + ], + "BF.INSERT": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", + "[String reply??](../topics/protocol.md#simple-strings): not found, if the filter does not exist and NOCREATE is specified", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + ], + "BF.MADD": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.MEXISTS": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.RESERVE": [ + "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + ], "BGREWRITEAOF": [ "[Simple string reply](../topics/protocol.md#simple-strings): a simple string reply 
indicating that the rewriting started or is about to start ASAP when the call is executed with success.", "", diff --git a/resp3_replies.json b/resp3_replies.json index f3fc3994..82f6ad16 100644 --- a/resp3_replies.json +++ b/resp3_replies.json @@ -62,6 +62,55 @@ "AUTH": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`, or an error if the password, or username/password pair, is invalid." ], + "BF.ADD": [ + "One of the following:", + "* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", + "* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" + ], + "BF.CARD": [ + "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.EXISTS": [ + "One of the following:", + "* [Integer reply](../topics/protocol.md#integers): '1'. The item exists in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): '0'. 
The bloom filter does not exist or the item has not been added to the bloom filter", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.INFO": [ + "When no optional arguments are provided:", + "[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", + "When an optional argument is provided:", + "* [Integer reply](../topics/protocol.md#integers): argument value", + "* [String reply??](../topics/protocol.md#simple-strings): argument value", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if trying to get MAXSCALEDCAPACITY of a non scaling filter" + ], + "BF.INSERT": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", + "[String reply??](../topics/protocol.md#simple-strings): not found, if the filter does not exist and NOCREATE is specified", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + ], + "BF.MADD": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.MEXISTS": [ + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will fail if the wrong number of arguments are provided" + ], + "BF.RESERVE": [ + "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", + "", + "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + ], "BGREWRITEAOF": [ "[Bulk string reply](../topics/protocol.md#bulk-strings): a simple string reply indicating 
that the rewriting started or is about to start ASAP when the call is executed with success.", "", diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md new file mode 100644 index 00000000..6cd10c14 --- /dev/null +++ b/topics/bloomfilters.md @@ -0,0 +1,111 @@ +--- +title: "Bloom Filters" +description: > + Introduction to Bloom Filters +--- + +Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives. + +## Bloom commands +* `BF.ADD` adds an item to a bloom filter +* `BF.CARD` returns the cardinality of a bloom filter +* `BF.EXISTS` checks if an item has been added to a bloom filter +* `BF.INFO` returns information about a bloom filter +* `BF.INSERT` can create and/or add items to a bloom filter +* `BF.LOAD` load a bloom filter from a dump +* `BF.MADD` adds one or more items to a bloom filter +* `BF.MEXISTS` checks if one of more items are present in a bloomfilter +* `BF.RESERVE` creates an empty bloom filter + +See the [bloom commands in more detail](../commands/#bloom). + +## Common use cases for bloom filters + +**Financial fraud detection** + +Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits. + +For the above each user would have a Bloom filter which is then checked for every transaction. + +**Ad placement** + +Bloom filters can help answer the following questions to advertisers: +* Has the user already seen this ad? +* Has the user already bought this product? + +Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter. + +* If no, the ad is shown to the user and is added to the Bloom filter. 
+* If yes, the process restarts and repeats until it finds a product that is not present in the filter. + +**Check if URL's are malicious** + +Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter. + +* If no then we allow access to the site +* If yes then we can deny access or perform a full check of the URL + +**Check if a username is taken** + +Bloom filters can answer the question: Has this username/email/domain name/slug already been used? + +For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter. + +* If no, the user is created and the username is added to the Bloom filter. +* If yes, the app can decide to either check the main database or reject the username. + +## Default bloom properties + +Capacity - 100 + +Error rate - 0.01 + +Expansion - 2 + +As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module. +Example of default bloom objects information: + +``` +127.0.0.1:6379> BF.ADD default_filter item +1 +127.0.0.1:6379> BF.INFO default_filter + 1) Capacity + 2) (integer) 100 + 3) Size + 4) (integer) 384 + 5) Number of filters + 6) (integer) 1 + 7) Number of items inserted + 8) (integer) 1 + 9) Error rate +10) "0.01" +11) Expansion rate +12) (integer) 2 +13) Max scaled capacity +14) (integer) 26214300 +``` + +## Performance + +Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. 
This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item. + +There are a few bloom commands that are O(1) as they don't work on items but instead work on the data about the bloom filter itself. + +## Limits + +The consumption of memory by a single Bloom object is limited to a default of 128 MB (configurable in the bloom module), which is the size of the in-memory data structure not the capacity of the Bloom object. You can check the amount of memory consumed by a Bloom object by using the BF.INFO command. When a bloom filter scales out it will add another filter, there is a limit on the number of filters that can be added. This filter limit will change depending on the false positive rate, capacity, expansion and tightening ratio, where this filter limit is specified on the memory limit of the bloom objects. + +We have implemented an optional argument into insert (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB). +There is also a way to check the max capacity that can be reached for Bloom objects. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom object can reach. 
+ +Example usage for a default bloom object: +``` +127.0.0.1:6379> bf.insert validate_scale_fail VALIDATESCALETO 26214301 +(error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit +127.0.0.1:6379> bf.insert validate_scale_valid VALIDATESCALETO 26214300 +[] +127.0.0.1:6379> bf.info validate_scale_valid MAXSCALEDCAPACITY +(integer) 26214300 +``` + +As you can see above when trying to create a bloom object that the user wants to achieve a capacity more than what is possible given the memory limits the command will output an error and not create the bloom object. However if the wanted capacity is within the limits then the creation of the bloom object will succeed. diff --git a/topics/data-types.md b/topics/data-types.md index cea10c4d..09e53321 100644 --- a/topics/data-types.md +++ b/topics/data-types.md @@ -92,6 +92,14 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat * [Overview of HyperLogLog](hyperloglogs.md) * [HyperLogLog command reference](../commands/#hyperloglog) +## Bloom Filter + +[Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives. +For more information, see: + +* [Overview of Bloom Filters](bloomfilters.md) +* [Bloom filter command reference](../commands/#bloom) + ## Extensions To extend the features provided by the included data types, use one of these options: From 1f892b902118e5c741c8677dcae27824fcca31cb Mon Sep 17 00:00:00 2001 From: zackcam Date: Fri, 21 Feb 2025 07:47:33 +0000 Subject: [PATCH 02/10] First round of rewording and changes to documentation. 
Added ability to generate bloom man pages Signed-off-by: zackcam --- Makefile | 21 ++++++++++++++++----- commands/bf.add.md | 8 ++++---- commands/bf.card.md | 3 +-- commands/bf.exists.md | 13 ++++++++----- commands/bf.info.md | 22 ++++++++++++---------- commands/bf.insert.md | 38 ++++++++++++++++++++------------------ commands/bf.load.md | 3 --- commands/bf.madd.md | 9 +++++---- commands/bf.mexists.md | 13 +++++++++---- commands/bf.reserve.md | 14 ++++++++------ groups.json | 2 +- modules.json | 7 +++++++ resp2_replies.json | 16 ++++++++-------- resp3_replies.json | 16 ++++++++-------- topics/bloomfilters.md | 29 +++++++++++++---------------- topics/data-types.md | 5 ++++- 16 files changed, 124 insertions(+), 95 deletions(-) create mode 100644 modules.json diff --git a/Makefile b/Makefile index 255070ed..0150221e 100644 --- a/Makefile +++ b/Makefile @@ -8,7 +8,7 @@ DATE ?= 2025-01-08 # Path to the code repo. VALKEY_ROOT ?= ../valkey - +VALKEY_BLOOM_ROOT ?= ../valkey-bloom # Where to install man pages INSTALL_MAN_DIR ?= /usr/local/share/man @@ -30,6 +30,10 @@ ifeq ("$(wildcard $(VALKEY_ROOT))","") $(error Please provide the VALKEY_ROOT variable pointing to the Valkey source code) endif +ifeq ("$(wildcard $(VALKEY_BLOOM_ROOT))","") + $(error Please provide the VALKEY_BLOOM_ROOT variable pointing to the valkey-bloom source code) +endif + ifeq ("$(shell which pandoc)","") $(error Please install pandoc) endif @@ -54,7 +58,9 @@ endif documented_commands = $(wildcard commands/*.md) commands_json_files = $(wildcard $(VALKEY_ROOT)/src/commands/*.json) -existing_commands = $(commands_json_files:$(VALKEY_ROOT)/src/commands/%.json=commands/%.md) +bloom_commands_json_files = $(wildcard $(VALKEY_BLOOM_ROOT)/src/commands/*.json) +existing_commands = $(commands_json_files:$(VALKEY_ROOT)/src/commands/%.json=commands/%.md) \ + $(bloom_commands_json_files:$(VALKEY_BLOOM_ROOT)/src/commands/%.json=commands/%.md) topics = $(wildcard topics/*) commands = $(filter 
$(existing_commands),$(documented_commands)) @@ -65,7 +71,9 @@ topics_pics = $(filter-out %.md,$(topics)) # ---- Temp files ---- # JSON files for the commands that have a .md file (excluding undocumented commands). -json_for_documented_commands = $(commands:commands/%.md=$(VALKEY_ROOT)/src/commands/%.json) +json_for_documented_commands = \ + $(patsubst commands/%.md,$(VALKEY_ROOT)/src/commands/%.json,$(filter $(commands_json_files:$(VALKEY_ROOT)/src/commands/%.json=commands/%.md),$(commands))) \ + $(patsubst commands/%.md,$(VALKEY_BLOOM_ROOT)/src/commands/%.json,$(filter $(bloom_commands_json_files:$(VALKEY_BLOOM_ROOT)/src/commands/%.json=commands/%.md),$(commands))) $(BUILD_DIR)/.commands-per-group.json: $(VALKEY_ROOT)/src/commands/. utils/build-command-groups.py | $(BUILD_DIR) utils/build-command-groups.py $(json_for_documented_commands) > $@~~ @@ -175,11 +183,14 @@ $(MAN_DIR)/man1/valkey-%.1.gz: topics/%.md $(man_scripts) utils/preprocess-markdown.py --man --page-type program \ --version $(VERSION) --date $(DATE) \$< \ | utils/links-to-man.py - | $(to_man) > $@ -$(MAN_DIR)/man3/%.3valkey.gz: commands/%.md $(VALKEY_ROOT)/src/commands/%.json $(BUILD_DIR)/.commands-per-group.json $(man_scripts) +$(MAN_DIR)/man3/%.3valkey.gz: commands/%.md $(BUILD_DIR)/.commands-per-group.json $(man_scripts) + $(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)) + $(eval FINAL_ROOT := $(firstword $(foreach root,$(VALKEY_ROOTS),$(if $(wildcard $(root)/src/commands/$*.json),$(root))))) + $(if $(FINAL_ROOT),,$(eval FINAL_ROOT := $(lastword $(VALKEY_ROOTS)))) utils/preprocess-markdown.py --man --page-type command \ --version $(VERSION) --date $(DATE) \ --commands-per-group-json $(BUILD_DIR)/.commands-per-group.json \ - --valkey-root $(VALKEY_ROOT) $< \ + --valkey-root $(FINAL_ROOT) $< \ | utils/links-to-man.py - | $(to_man) > $@ $(MAN_DIR)/man5/%.5.gz: topics/%.md $(man_scripts) utils/preprocess-markdown.py --man --page-type config \ diff --git a/commands/bf.add.md 
b/commands/bf.add.md index a6d8091e..1dbcd728 100644 --- a/commands/bf.add.md +++ b/commands/bf.add.md @@ -1,9 +1,9 @@ -Adds an item to a bloom filter, if the specified filter does not exist creates a default bloom filter with that name. -## Arguments -* key (required) - A Valkey key of Bloom data type -* item (required) - Item to add +Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. + +If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. ## Examples + ``` 127.0.0.1:6379> BF.ADD key val 1 diff --git a/commands/bf.card.md b/commands/bf.card.md index bae8c87f..19bbc092 100644 --- a/commands/bf.card.md +++ b/commands/bf.card.md @@ -1,8 +1,7 @@ Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter. -## Arguments -* key (required) - A Valkey key of Bloom data type ## Examples + ``` 127.0.0.1:6379> BF.ADD key val 1 diff --git a/commands/bf.exists.md b/commands/bf.exists.md index d072f738..ae289a7e 100644 --- a/commands/bf.exists.md +++ b/commands/bf.exists.md @@ -1,11 +1,14 @@ -Determines if a specified item has been added to the specified bloom filter. -Syntax +Determines if an item has been added to the bloom filter. + +A Bloom filter has two possible responses when you check if an item exists: + +* "No" (Definite) - If the filter says an item is NOT present, this is 100% certain. The item is definitely not in the set. + +* "Maybe" (Probabilistic) - If the filter says an item IS present, this is uncertain. There's a chance it's a false positive. 
The item might be in the set, but may not be -## Arguments -* key (required) - A Valkey key of Bloom data type -* item (required) - The item that we are checking if it exists in the bloom object ## Examples + ``` 127.0.0.1:6379> BF.ADD key val 1 diff --git a/commands/bf.info.md b/commands/bf.info.md index 3a8301d8..8ef6acd9 100644 --- a/commands/bf.info.md +++ b/commands/bf.info.md @@ -1,17 +1,19 @@ -Returns information about a bloomfilter +Returns information about a bloom filter + +## Info Fields + +* CAPACITY - Returns the number of unique items that would need to be added before scaling would happen +* SIZE - Returns the number of bytes allocated +* FILTERS - Returns the number of filters in the specified key +* ITEMS - Returns the number of unique items that have been added the the bloom filter +* ERROR - Returns the false positive rate for the bloom filter +* EXPANSION - Returns the expansion rate +* MAXSCALEDCAPACITY - Returns the maximum capacity that can be reached before an error occurs -## Arguments -* key (required) - A valkey key of bloom data type -* CAPACITY (optional) - Returns the number of unique items that would need to be added before scaling would happen -* SIZE (optional) - Returns the memory size which is the number of bytes allocated -* FILTERS (optional) - Returns the number of filters in the specified key -* ITEMS (optional) - Returns the number of unique items that have been added the the Bloom filter -* ERROR (optional) - Returns the false positive rate for the bloom filter -* EXPANSION (optional) - Returns the expansion rate -* MAXSCALEDCAPACITY (optional) - Returns the maximum capacity that can be reached before an error occurs If none of the optional fields are specified, all the fields will be returned. 
MAXSCALEDCAPACITY will be an unrecognized argument on non scaling filters ## Examples + ``` 127.0.0.1:6379> BF.ADD key val 1 diff --git a/commands/bf.insert.md b/commands/bf.insert.md index 4ad7e6e9..6c1ee348 100644 --- a/commands/bf.insert.md +++ b/commands/bf.insert.md @@ -1,19 +1,21 @@ -Creates a bloom object with the specified parameters. If a parameter is not specified then the default value will be used. If ITEMS is specified then it will also attempt to add all items specified after +Creates a bloom filter with the specified parameters. If a parameter is not specified then the default value will be used. If ITEMS is specified then it will also attempt to add all items specified + +## Insert Fields + +* CAPACITY capacity - capacity for the initial bloom filter +* ERROR `fp_error` - The false positive rate for the bloom filter +* EXPANSION expansion - The expansion rate for a scaling filter +* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already +* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter +* SEED seed - The seed the hash functions will use +* NONSCALING - Will make it so the filter can not scale +* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter +* ITEMS item - One or more items we will add to the bloom filter -## Arguments Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSCALING and VALIDATESCALETO isn't allowed -* key (required) - Is the key name for a Bloom filter to add the item to -* CAPACITY capacity (optional) - capacity for the inital bloom filter -* ERROR fp_error (optional)- The false positive rate for the bloom filter -* EXPANSION expansion(optional) - The expansion rate for a scaling filter -* NOCREATE (optional) - Will not create the bloom filter and add items if the filter does not exist already -* TIGHTENING (optional) - The tightening ratio for the 
bloom filter -* SEED (optional) - The seed the hash functions will use -* NONSCALING (optional) - Will make it so the filter can not scale -* VALIDATESCALETO validatescaleto (optional) - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter -* ITEMS (optional) - Items we will add to the bloom filter ## Examples + ``` 127.0.0.1:6379> BF.INSERT key ITEMS item1 item2 1) (integer) 1 @@ -28,13 +30,13 @@ Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSC ``` 127.0.0.1:6379> BF.INSERT key NONSCALING VALIDATESCALETO 100 -cannot use NONSCALING and VALIDATESCALETO options together -127.0.0.1:6379> BF.INSERT key CAPACITY 1000 VALIDATESCALETO 999999999999999999999 ITEMS item2 item3 -provided VALIDATESCALETO causes bloom object to exceed memory limit -127.0.0.1:6379> BF.INSERT key VALIDATESCALETO 999999999999999999999 EXPANSION 1 ITEMS item2 item3 -provided VALIDATESCALETO causes false positive to degrade to 0 +(error) ERR cannot use NONSCALING and VALIDATESCALETO options together +127.0.0.1:6379> BF.INSERT key CAPACITY 1000 VALIDATESCALETO 999999999999999999 ITEMS item2 item3 +(error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit +127.0.0.1:6379> BF.INSERT key VALIDATESCALETO 999999999999999999 EXPANSION 1 ITEMS item2 item3 +(error) ERR provided VALIDATESCALETO causes false positive to degrade to 0 ``` ``` 127.0.0.1:6379> BF.INSERT key NOCREATE ITEMS item1 item2 -not found +(error) ERR not found ``` \ No newline at end of file diff --git a/commands/bf.load.md b/commands/bf.load.md index 04ae9979..91c61ac3 100644 --- a/commands/bf.load.md +++ b/commands/bf.load.md @@ -1,4 +1 @@ Loads a bloom filter from a dump of an existing bloom object with all the properties and bit vector dump. 
-## Arguments -* key (required) - Is the key name for a Bloom filter to add the item to -* dump (required) - Is the dump we are restoring diff --git a/commands/bf.madd.md b/commands/bf.madd.md index a044125c..a94bee87 100644 --- a/commands/bf.madd.md +++ b/commands/bf.madd.md @@ -1,8 +1,9 @@ -Adds one or more items to a Bloom Filter, if the specified filter does not exist creates a default bloom filter with that name. -## Arguments -* key (required) - Is the key name for a Bloom filter to add the item to -* item (requires at least 1 item but can add as many as wanted ) - Is the item/s to add +Adds one or more items to a Bloom Filter, if the bloom filter does not exist creates a default bloom filter. + +If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. + ## Examples + ``` 127.0.0.1:6379> BF.MADD key item1 item2 1) (integer) 1 diff --git a/commands/bf.mexists.md b/commands/bf.mexists.md index 30dfa289..f5865656 100644 --- a/commands/bf.mexists.md +++ b/commands/bf.mexists.md @@ -1,8 +1,13 @@ -Determines if one or more items has been added to the specified bloom filter -## Arguments -* key (required) - A Valkey key of Bloom data type -* item (requires at least 1 item but can add as many as desired) - The item/s that we are checking if it exists in the bloom object +Determines if one or more items has been added to a bloom filter. + +A Bloom filter has two possible responses when you check if an item exists: + +* "No" (Definite) - If the filter says an item is NOT present, this is 100% certain. The item is definitely not in the set. + +* "Maybe" (Probabilistic) - If the filter says an item IS present, this is uncertain. There's a chance it's a false positive. 
The item might be in the set, but may not be + ## Examples + ``` 127.0.0.1:6379> BF.MADD key item1 item2 1) (integer) 1 diff --git a/commands/bf.reserve.md b/commands/bf.reserve.md index 5fb5fe39..4afe3013 100644 --- a/commands/bf.reserve.md +++ b/commands/bf.reserve.md @@ -1,12 +1,14 @@ Creates an empty bloom object with the capacity and false positive rate specified -## Arguments -* key (required) - A Valkey key of Bloom data type -* error_rate (required) - The fp rate the bloom filter will be created with -* capacity (required) - The starting capacity the bloom filter will be created with -* EXPANSION expansion(optional)- The rate in which filters will increase by -* NONSCALING (optional) - Setting this will make it so the bloom object can’t expand past its initial capacity + +## Reserve fields + +* error_rate - The false positive rate the bloom filter will be created with +* capacity - The starting capacity the bloom filter will be created with +* EXPANSION expansion - The rate in which filters will increase by +* NONSCALING - Setting this will make it so the bloom object can’t expand past its initial capacity ## Examples + ``` 127.0.0.1:6379> BF.RESERVE key 0.01 1000 OK diff --git a/groups.json b/groups.json index 67735da4..29724020 100644 --- a/groups.json +++ b/groups.json @@ -4,7 +4,7 @@ "description": "Operations on the Bitmap data type" }, "bloom": { - "display": "Bloom", + "display": "Bloom filter", "description": "Operations on the Bloom filter data type" }, "cluster": { diff --git a/modules.json b/modules.json new file mode 100644 index 00000000..03ca3a68 --- /dev/null +++ b/modules.json @@ -0,0 +1,7 @@ +{ + "valkey_bloom": { + "name": "valkey-bloom", + "repo": "https://github.com/valkey-io/valkey-bloom", + "description": "Module that allows users to use the bloom filter data type" + } +} \ No newline at end of file diff --git a/resp2_replies.json b/resp2_replies.json index f7226512..a49d2621 100644 --- a/resp2_replies.json +++ b/resp2_replies.json @@ 
-64,8 +64,8 @@ ], "BF.ADD": [ "One of the following:", - "* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", - "* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", + "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" ], @@ -76,23 +76,23 @@ ], "BF.EXISTS": [ "One of the following:", - "* [Integer reply](../topics/protocol.md#integers): '1'. The item exists in the bloom filter", - "* [Integer reply](../topics/protocol.md#integers): '0'. The bloom filter does not exist or the item has not been added to the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter", "", "The command will fail if the wrong number of arguments are provided" ], "BF.INFO": [ "When no optional arguments are provided:", - "[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", - "When an optional argument is provided:", + "* [Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", + "When an optional argument excluding ERROR is provided:", "* [Integer reply](../topics/protocol.md#integers): argument value", - "* [String reply??](../topics/protocol.md#simple-strings): argument value", + "When ERROR is provided as an optional argument:", + "* [String reply](../topics/protocol.md#simple-strings): argument value", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if trying 
to get MAXSCALEDCAPACITY of a non scaling filter" ], "BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", - "[String reply??](../topics/protocol.md#simple-strings): not found, if the filter does not exist and NOCREATE is specified", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" ], diff --git a/resp3_replies.json b/resp3_replies.json index 82f6ad16..9832c111 100644 --- a/resp3_replies.json +++ b/resp3_replies.json @@ -64,8 +64,8 @@ ], "BF.ADD": [ "One of the following:", - "* [Integer reply](../topics/protocol.md#integers): '1'. The item was successfully added", - "* [Integer reply](../topics/protocol.md#integers): '0'. The item already existed in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", + "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" ], @@ -76,23 +76,23 @@ ], "BF.EXISTS": [ "One of the following:", - "* [Integer reply](../topics/protocol.md#integers): '1'. The item exists in the bloom filter", - "* [Integer reply](../topics/protocol.md#integers): '0'. 
The bloom filter does not exist or the item has not been added to the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter", + "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter", "", "The command will fail if the wrong number of arguments are provided" ], "BF.INFO": [ "When no optional arguments are provided:", - "[Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", - "When an optional argument is provided:", + "* [Array reply](../topics/protocol.md#arrays): List of information about the bloom filter.", + "When an optional argument excluding ERROR is provided:", "* [Integer reply](../topics/protocol.md#integers): argument value", - "* [String reply??](../topics/protocol.md#simple-strings): argument value", + "When ERROR is provided as an optional argument:", + "* [String reply](../topics/protocol.md#simple-strings): argument value", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if trying to get MAXSCALEDCAPACITY of a non scaling filter" ], "BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. 
An empty array if no items are provided", - "[String reply??](../topics/protocol.md#simple-strings): not found, if the filter does not exist and NOCREATE is specified", "", "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" ], diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index 6cd10c14..772e84bb 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -4,30 +4,26 @@ description: > Introduction to Bloom Filters --- +The bloom filter data type is taken from a [separate module](https://github.com/valkey-io/valkey-bloom) that users will need to install in order to use. + Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives. -## Bloom commands +## Basic Bloom commands + * `BF.ADD` adds an item to a bloom filter * `BF.CARD` returns the cardinality of a bloom filter * `BF.EXISTS` checks if an item has been added to a bloom filter * `BF.INFO` returns information about a bloom filter -* `BF.INSERT` can create and/or add items to a bloom filter -* `BF.LOAD` load a bloom filter from a dump -* `BF.MADD` adds one or more items to a bloom filter -* `BF.MEXISTS` checks if one of more items are present in a bloomfilter -* `BF.RESERVE` creates an empty bloom filter -See the [bloom commands in more detail](../commands/#bloom). +See the [complete list of bloom filter commands](../commands/#bloom). ## Common use cases for bloom filters -**Financial fraud detection** +### Financial fraud detection -Bloom filters can help answer the question "Has the user paid from this location before?", which can then give insights if there has been suspicious activity in shopping habits. 
+Bloom filters can help answer the question "Has this card been flagged as stolen?". Add every card reported stolen to a bloom filter, then check each card against the filter when it is used. If the card is not present, it is definitely not marked as stolen. If it is present, the application can either check the main database or deny the purchase. -For the above each user would have a Bloom filter which is then checked for every transaction. - -**Ad placement** +### Ad placement Bloom filters can help answer the following questions to advertisers: * Has the user already seen this ad? @@ -38,14 +34,14 @@ Use a Bloom filter for every user, storing all bought products. The recommendati * If no, the ad is shown to the user and is added to the Bloom filter. * If yes, the process restarts and repeats until it finds a product that is not present in the filter. -**Check if URL's are malicious** +### Check if URLs are malicious -Bloom filters can answer the question is a URL malicious. Any URL inputted would be checked against a malicious URL bloom filter. +Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter. * If no then we allow access to the site * If yes then we can deny access or perform a full check of the URL -**Check if a username is taken** +### Check if a username is taken Bloom filters can answer the question: Has this username/email/domain name/slug already been used? @@ -95,7 +91,8 @@ There are a few bloom commands that are O(1) as they don't work on items but ins The consumption of memory by a single Bloom object is limited to a default of 128 MB (configurable in the bloom module), which is the size of the in-memory data structure not the capacity of the Bloom object. You can check the amount of memory consumed by a Bloom object by using the BF.INFO command. When a bloom filter scales out it will add another filter, there is a limit on the number of filters that can be added.
This filter limit will change depending on the false positive rate, capacity, expansion and tightening ratio, where this filter limit is specified on the memory limit of the bloom objects. -We have implemented an optional argument into insert (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB). +We have implemented an optional argument into BF.INSERT (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB). + There is also a way to check the max capacity that can be reached for Bloom objects. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom object can reach. Example usage for a default bloom object: diff --git a/topics/data-types.md b/topics/data-types.md index 09e53321..3fa4efdf 100644 --- a/topics/data-types.md +++ b/topics/data-types.md @@ -94,11 +94,14 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat ## Bloom Filter -[Bloom filters](bloomfilters.md) provides a space efficient probabilistic data structure that allows checking if an element is a member of a set. False positives are possible, but it guarantees no false negatives. 
+[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set. + +Bloom filters are provided by the module `valkey-bloom` For more information, see: * [Overview of Bloom Filters](bloomfilters.md) * [Bloom filter command reference](../commands/#bloom) +* [The valkey-bloom module on GitHub](https://github.com/valkey-io/valkey-bloom/) ## Extensions From f062a8a0740af7122d488b447a1beb8aec622ad6 Mon Sep 17 00:00:00 2001 From: zackcam Date: Fri, 7 Mar 2025 19:16:09 +0000 Subject: [PATCH 03/10] Changes based on feedback for bloom commands and documentation Signed-off-by: zackcam --- Makefile | 2 +- README.md | 4 ++-- commands/bf.add.md | 6 ++++-- commands/bf.card.md | 4 ++-- commands/bf.exists.md | 9 ++++----- commands/bf.info.md | 14 +++++++------- commands/bf.insert.md | 20 +++++++++++--------- commands/bf.load.md | 2 +- commands/bf.madd.md | 4 ++-- commands/bf.mexists.md | 6 +++--- commands/bf.reserve.md | 12 +++++++----- topics/bloomfilters.md | 28 ++++++++++++++++++++++++---- topics/data-types.md | 2 +- 13 files changed, 69 insertions(+), 44 deletions(-) diff --git a/Makefile b/Makefile index 0150221e..6822503d 100644 --- a/Makefile +++ b/Makefile @@ -31,7 +31,7 @@ ifeq ("$(wildcard $(VALKEY_ROOT))","") endif ifeq ("$(wildcard $(VALKEY_BLOOM_ROOT))","") - $(error Please provide the VALKEY_ROOT variable pointing to the Valkey source code) + $(error Please provide the VALKEY_BLOOM_ROOT variable pointing to the valkey-bloom source code) endif ifeq ("$(shell which pandoc)","") diff --git a/README.md b/README.md index fe0383ab..36b7c5dc 100644 --- a/README.md +++ b/README.md @@ -7,11 +7,11 @@ for generating content for the website and man pages. This repo comes with a Makefile to build and install man pages. 
- make VALKEY_ROOT=path/to/valkey + make VALKEY_ROOT=path/to/valkey VALKEY_BLOOM_ROOT=path/to/valkey-bloom sudo make install INSTALL_MAN_DIR=/usr/local/share/man Prerequisites: GNU Make, Python 3, Python 3 YAML (pyyaml), Pandoc. -Additionally, the scripts need access to the valkey code repo, +Additionally, the scripts need access to the valkey and valkey-bloom code repos, where metadata files about the commands are stored. The pages are generated under `_build/man/` by default. The default install diff --git a/commands/bf.add.md b/commands/bf.add.md index 1dbcd728..e659c8b4 100644 --- a/commands/bf.add.md +++ b/commands/bf.add.md @@ -1,6 +1,8 @@ -Adds an item to a bloom filter, if the specified bloom filter does not exist creates a bloom filter with default configurations with that name. +Adds a single item to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties. -If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. +To add multiple items to a bloom filter, you can use the BF.MADD or BF.INSERT commands. + +If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. ## Examples diff --git a/commands/bf.card.md b/commands/bf.card.md index 19bbc092..00224d26 100644 --- a/commands/bf.card.md +++ b/commands/bf.card.md @@ -1,4 +1,4 @@ -Gets the cardinality of a Bloom filter - number of items that have been successfully added to a Bloom filter. +Returns the cardinality of a Bloom filter which is the number of items that have been successfully added to it. 
## Examples @@ -7,6 +7,6 @@ Gets the cardinality of a Bloom filter - number of items that have been successf 1 127.0.0.1:6379> BF.CARD key 1 -127.0.0.1:6379> BF.CARD missing +127.0.0.1:6379> BF.CARD nonexistentkey 0 ``` \ No newline at end of file diff --git a/commands/bf.exists.md b/commands/bf.exists.md index ae289a7e..a5269abb 100644 --- a/commands/bf.exists.md +++ b/commands/bf.exists.md @@ -1,11 +1,10 @@ -Determines if an item has been added to the bloom filter. +Determines if an item has been added to the bloom filter previously. A Bloom filter has two possible responses when you check if an item exists: -* "No" (Definite) - If the filter says an item is NOT present, this is 100% certain. The item is definitely not in the set. - -* "Maybe" (Probabilistic) - If the filter says an item IS present, this is uncertain. There's a chance it's a false positive. The item might be in the set, but may not be +* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. +* 1 - The item exists with a given false positive (fp) percentage. There is an fp rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. 
## Examples @@ -14,6 +13,6 @@ A Bloom filter has two possible responses when you check if an item exists: 1 127.0.0.1:6379> BF.EXISTS key val 1 -127.0.0.1:6379> BF.EXISTS key missing +127.0.0.1:6379> BF.EXISTS key nonexistent 0 ``` diff --git a/commands/bf.info.md b/commands/bf.info.md index 8ef6acd9..aa26f433 100644 --- a/commands/bf.info.md +++ b/commands/bf.info.md @@ -1,14 +1,14 @@ -Returns information about a bloom filter +Returns usage information and properties of a specific bloom filter ## Info Fields -* CAPACITY - Returns the number of unique items that would need to be added before scaling would happen -* SIZE - Returns the number of bytes allocated +* CAPACITY - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. +* SIZE - The number of bytes allocated by this bloom filter. * FILTERS - Returns the number of filters in the specified key -* ITEMS - Returns the number of unique items that have been added the the bloom filter -* ERROR - Returns the false positive rate for the bloom filter -* EXPANSION - Returns the expansion rate -* MAXSCALEDCAPACITY - Returns the maximum capacity that can be reached before an error occurs +* ITEMS - The number of unique items that have been added to the bloom filter +* ERROR - The false positive rate of the bloom filter +* EXPANSION - The expansion rate of the bloom filter. Non scaling filters will have an expansion rate of nil. +* MAXSCALEDCAPACITY - The [maximum capacity](../topics/bloomfilters.md) that a scalable bloom filter can expand to and reach before a subsequent scale out will fail. If none of the optional fields are specified, all the fields will be returned.
MAXSCALEDCAPACITY will be an unrecognized argument on non scaling filters diff --git a/commands/bf.insert.md b/commands/bf.insert.md index 6c1ee348..2f3f6e5a 100644 --- a/commands/bf.insert.md +++ b/commands/bf.insert.md @@ -1,16 +1,18 @@ -Creates a bloom filter with the specified parameters. If a parameter is not specified then the default value will be used. If ITEMS is specified then it will also attempt to add all items specified +If the bloom filter does not exist under the specified name, a bloom filter is created with the specified parameters. Default properties will be used if the options below are not specified. + +When the ITEMS option is provided, all items provided will be attempted to be added. ## Insert Fields -* CAPACITY capacity - capacity for the initial bloom filter -* ERROR `fp_error` - The false positive rate for the bloom filter -* EXPANSION expansion - The expansion rate for a scaling filter +* CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. +* ERROR `fp_error` - The false positive rate of the bloom filter +* EXPANSION `expansion` - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. 
* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already * TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter -* SEED seed - The seed the hash functions will use -* NONSCALING - Will make it so the filter can not scale -* VALIDATESCALETO `validatescaleto` - Checks if the filter could scale to this capacity and if not show an error and don’t create the bloom filter -* ITEMS item - One or more items we will add to the bloom filter +* SEED `seed` - The seed the hash functions will use +* NONSCALING - This option will configure the bloom filter as non scaling; it cannot expand / scale beyond its specified capacity. +* VALIDATESCALETO `validatescaleto` - Validates if the filter can scale out and reach this capacity based on limits and if not, return an error without creating the bloom filter +* ITEMS `item` - One or more items to be added to the bloom filter Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSCALING and VALIDATESCALETO isn't allowed @@ -20,7 +22,7 @@ Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSC 127.0.0.1:6379> BF.INSERT key ITEMS item1 item2 1) (integer) 1 2) (integer) 1 -# This does not update the capcity but uses the origianl filters values +# This does not update the capacity since the filter already exists. It only adds the provided items. 127.0.0.1:6379> BF.INSERT key CAPACITY 1000 ITEMS item2 item3 1) (integer) 0 2) (integer) 1 diff --git a/commands/bf.load.md b/commands/bf.load.md index 91c61ac3..77706361 100644 --- a/commands/bf.load.md +++ b/commands/bf.load.md @@ -1 +1 @@ -Loads a bloom filter from a dump of an existing bloom object with all the properties and bit vector dump. +Restores a bloom filter from a dump of an existing bloom filter with all of its specific properties and bit vector dump of sub filter/s. This command is only generated during AOF Rewrite to restore a bloom filter in the future.
diff --git a/commands/bf.madd.md b/commands/bf.madd.md index a94bee87..433ac5c0 100644 --- a/commands/bf.madd.md +++ b/commands/bf.madd.md @@ -1,6 +1,6 @@ -Adds one or more items to a Bloom Filter, if the bloom filter does not exist creates a default bloom filter. +Adds one or more items to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties -If you want to create a bloom filter with non-standard options, use the `BF.INSERT` or `BF.RESERVE` command. +If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. ## Examples diff --git a/commands/bf.mexists.md b/commands/bf.mexists.md index f5865656..782d18f7 100644 --- a/commands/bf.mexists.md +++ b/commands/bf.mexists.md @@ -1,10 +1,10 @@ -Determines if one or more items has been added to a bloom filter. +Determines if the provided item/s have been added to a bloom filter previously. A Bloom filter has two possible responses when you check if an item exists: -* "No" (Definite) - If the filter says an item is NOT present, this is 100% certain. The item is definitely not in the set. +* 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. -* "Maybe" (Probabilistic) - If the filter says an item IS present, this is uncertain. There's a chance it's a false positive. The item might be in the set, but may not be +* 1 - The item exists with a given false positive (fp) percentage. There is an fp rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. 
## Examples diff --git a/commands/bf.reserve.md b/commands/bf.reserve.md index 4afe3013..4607d5d4 100644 --- a/commands/bf.reserve.md +++ b/commands/bf.reserve.md @@ -1,11 +1,13 @@ -Creates an empty bloom object with the capacity and false positive rate specified +Creates an empty bloom filter with the capacity and false positive rate specified. By default, a scaling filter is created with the default expansion rate. + +To specify the scaling / non scaling nature of the bloom filter, use the options: NONSCALING or SCALING . It is invalid to provide both options together. ## Reserve fields -* error_rate - The false positive rate the bloom filter will be created with -* capacity - The starting capacity the bloom filter will be created with -* EXPANSION expansion - The rate in which filters will increase by -* NONSCALING - Setting this will make it so the bloom object can’t expand past its initial capacity +* error_rate - The false positive rate of the bloom filter +* capacity - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. +* EXPANSION expansion - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. +* NONSCALING - This option will configure the bloom filter as non scaling; it cannot expand / scale beyond its specified capacity. ## Examples diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index 772e84bb..d4cffe18 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -50,6 +50,20 @@ For example for usernames. Use a Bloom filter for every username that has signed * If no, the user is created and the username is added to the Bloom filter. * If yes, the app can decide to either check the main database or reject the username. 
+## Scaling and non scaling bloom filters + +The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow, while non-scaling bloom filters have a fixed capacity, which also means a fixed size. + +When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object). + +When a non scaling filter reaches its capacity, if a user tries to add a new unique item an error will be returned. + +The expansion rate is the rate that a scaling bloom filter will have its capacity increased by on the scale out. For example we have a bloom filter with capacity 100 at creation with an expansion rate of 2. After adding 101 unique items we will scale out and create a new filter with capacity 200. Then after adding 200 more unique items (301 items total) we will create a new filter of capacity 400 and so on. + +### When should you use scaling vs non-scaling filters + +If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filter. + ## Default bloom properties Capacity - 100 @@ -81,11 +95,17 @@ Example of default bloom objects information: 14) (integer) 26214300 ``` +### Advanced Properties + +Seed - The seed used by the bloom filter can be specified by the user in the BF.INSERT command. This property is only useful if you have a specific 32 byte seed that you want your bloom filter to use. By default every bloom filter will use a random seed.
+ +Tightening Ratio - We do not recommend fine tuning this unless there is a specific use case for lower memory usage with higher false positive or vice versa. + ## Performance Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item. -There are a few bloom commands that are O(1) as they don't work on items but instead work on the data about the bloom filter itself. +There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself. ## Limits @@ -97,11 +117,11 @@ There is also a way to check the max capacity that can be reached for Bloom obje Example usage for a default bloom object: ``` -127.0.0.1:6379> bf.insert validate_scale_fail VALIDATESCALETO 26214301 +127.0.0.1:6379> BF.INSERT validate_scale_fail VALIDATESCALETO 26214301 (error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit -127.0.0.1:6379> bf.insert validate_scale_valid VALIDATESCALETO 26214300 +127.0.0.1:6379> BF.INSERT validate_scale_valid VALIDATESCALETO 26214300 [] -127.0.0.1:6379> bf.info validate_scale_valid MAXSCALEDCAPACITY +127.0.0.1:6379> BF.INFO validate_scale_valid MAXSCALEDCAPACITY (integer) 26214300 ``` diff --git a/topics/data-types.md b/topics/data-types.md index 3fa4efdf..00435505 100644 --- a/topics/data-types.md +++ b/topics/data-types.md @@ -94,7 +94,7 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat ## Bloom Filter -[Bloom filters](bloomfilters.md) are a space efficient data type that can tell you if something is definitely not in a set, or it might be in the set. 
+[Bloom filters](bloomfilters.md) are a space efficient probabilistic data type that can be used to check if item/s are definitely not present in a set, or if they exist within the set (with the configured false positive rate). Bloom filters are provided by the module `valkey-bloom` For more information, see: From 0249aec83d5a9b34c0519789f40d40eaae624e5b Mon Sep 17 00:00:00 2001 From: zackcam Date: Mon, 17 Mar 2025 22:44:54 +0000 Subject: [PATCH 04/10] Adding table for bloom default properties as well as cleaning up man creation and spelling Signed-off-by: zackcam --- Makefile | 18 +++++---- README.md | 5 ++- commands/bf.exists.md | 2 +- commands/bf.insert.md | 14 +++---- commands/bf.mexists.md | 2 +- commands/commands | 1 - resp2_replies.json | 24 ++++-------- resp3_replies.json | 24 ++++-------- topics/bloomfilters.md | 83 +++++++++++++++++++++++++++++++----------- 9 files changed, 99 insertions(+), 74 deletions(-) delete mode 120000 commands/commands diff --git a/Makefile b/Makefile index 6822503d..b2ac4b71 100644 --- a/Makefile +++ b/Makefile @@ -31,7 +31,7 @@ ifeq ("$(wildcard $(VALKEY_ROOT))","") endif ifeq ("$(wildcard $(VALKEY_BLOOM_ROOT))","") - $(error Please provide the VALKEY_BLOOM_ROOT variable pointing to the valkey-bloom source code) + $(info Valkey bloom variable pointed to nothing, skipping bloom filter commands) endif ifeq ("$(shell which pandoc)","") @@ -156,6 +156,9 @@ progs = valkey-cli valkey-server valkey-benchmark valkey-sentinel valkey-check-r programs = $(progs:valkey-%=topics/%.md) configs = topics/valkey.conf.md +# Define the base directories where valkey commands can come from +VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT) + man1_src = $(filter $(programs),$(topics_md)) man3_src = $(commands) man5_src = $(filter $(configs),$(topics_md)) @@ -184,14 +187,13 @@ $(MAN_DIR)/man1/valkey-%.1.gz: topics/%.md $(man_scripts) --version $(VERSION) --date $(DATE) \$< \ | utils/links-to-man.py - | $(to_man) > $@ $(MAN_DIR)/man3/%.3valkey.gz: 
commands/%.md $(BUILD_DIR)/.commands-per-group.json $(man_scripts) - $(eval VALKEY_ROOTS := $(VALKEY_ROOT) $(VALKEY_BLOOM_ROOT)) $(eval FINAL_ROOT := $(firstword $(foreach root,$(VALKEY_ROOTS),$(if $(wildcard $(root)/src/commands/$*.json),$(root))))) - $(if $(FINAL_ROOT),,$(eval FINAL_ROOT := $(lastword $(VALKEY_ROOTS)))) - utils/preprocess-markdown.py --man --page-type command \ - --version $(VERSION) --date $(DATE) \ - --commands-per-group-json $(BUILD_DIR)/.commands-per-group.json \ - --valkey-root $(FINAL_ROOT) $< \ - | utils/links-to-man.py - | $(to_man) > $@ + $(if $(FINAL_ROOT), \ + utils/preprocess-markdown.py --man --page-type command \ + --version $(VERSION) --date $(DATE) \ + --commands-per-group-json $(BUILD_DIR)/.commands-per-group.json \ + --valkey-root $(FINAL_ROOT) $< \ + | utils/links-to-man.py - | $(to_man) > $@) $(MAN_DIR)/man5/%.5.gz: topics/%.md $(man_scripts) utils/preprocess-markdown.py --man --page-type config \ --version $(VERSION) --date $(DATE) $< \ diff --git a/README.md b/README.md index 36b7c5dc..9baaf310 100644 --- a/README.md +++ b/README.md @@ -11,8 +11,9 @@ This repo comes with a Makefile to build and install man pages. sudo make install INSTALL_MAN_DIR=/usr/local/share/man Prerequisites: GNU Make, Python 3, Python 3 YAML (pyyaml), Pandoc. -Additionally, the scripts need access to the valkey and valkey-bloom code repos, -where metadata files about the commands are stored. +Additionally, the scripts need access to the valkey code repo, +where metadata files about the commands are stored. Additionally +access to the valkey-bloom repo is optional. The pages are generated under `_build/man/` by default. The default install location is `/usr/local/share/man` (in the appropriate subdirectories). 
diff --git a/commands/bf.exists.md b/commands/bf.exists.md index a5269abb..a759134d 100644 --- a/commands/bf.exists.md +++ b/commands/bf.exists.md @@ -4,7 +4,7 @@ A Bloom filter has two possible responses when you check if an item exists: * 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. -* 1 - The item exists with a given false positive (fp) percentage. There is an fp rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. +* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. ## Examples diff --git a/commands/bf.insert.md b/commands/bf.insert.md index 2f3f6e5a..91681692 100644 --- a/commands/bf.insert.md +++ b/commands/bf.insert.md @@ -4,15 +4,15 @@ When the ITEMS option is provided, all items provided will be attempted to be ad ## Insert Fields -* CAPACITY `capacity` - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. -* ERROR `fp_error` - The false positive rate of the bloom filter -* EXPANSION `expansion` - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. +* CAPACITY *capacity* - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. +* ERROR *fp_error* - The false positive rate of the bloom filter +* EXPANSION *expansion* - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. 
* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already -* TIGHTENING `tightening_ratio` - The tightening ratio for the bloom filter -* SEED `seed` - The seed the hash functions will use +* TIGHTENING *tightening_ratio* - The tightening ratio for the bloom filter +* SEED *seed* - The seed the hash functions will use * NONSCALING - This option will configure the bloom filter as non scaling; it cannot expand / scale beyond its specified capacity. -* VALIDATESCALETO `validatescaleto` - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter -* ITEMS `item` - One or more items to be added to the bloom filter +* VALIDATESCALETO *validatescaleto* - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter +* ITEMS *item* - One or more items to be added to the bloom filter Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSCALING and VALIDATESCALETO isn't allowed diff --git a/commands/bf.mexists.md b/commands/bf.mexists.md index 782d18f7..5ee4744f 100644 --- a/commands/bf.mexists.md +++ b/commands/bf.mexists.md @@ -4,7 +4,7 @@ A Bloom filter has two possible responses when you check if an item exists: * 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. -* 1 - The item exists with a given false positive (fp) percentage. There is an fp rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. +* 1 - The item exists with a given false positive (`fp`) percentage. There is an `fp` rate % chance that the item does not exist. You can create bloom filters with a more strict false positive rate as needed. 
## Examples diff --git a/commands/commands b/commands/commands deleted file mode 120000 index 55a41c2b..00000000 --- a/commands/commands +++ /dev/null @@ -1 +0,0 @@ -../valkey-doc/commands \ No newline at end of file diff --git a/resp2_replies.json b/resp2_replies.json index fc1c166e..a114ab77 100644 --- a/resp2_replies.json +++ b/resp2_replies.json @@ -67,19 +67,15 @@ "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" + "The command may fail with an error when attempting to add to a non scaling filter that is full" ], "BF.CARD": [ - "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist", - "", - "The command will fail if the wrong number of arguments are provided" + "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist" ], "BF.EXISTS": [ "One of the following:", "* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter", - "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter", - "", - "The command will fail if the wrong number of arguments are provided" + "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter" ], "BF.INFO": [ "When no optional arguments are provided:", @@ -89,27 +85,23 @@ "When ERROR is provided as an optional argument:", "* [String reply](../topics/protocol.md#simple-strings): argument value", "", - "The command may fail with an error for several reasons: 
if the wrong number of arguments are provided, if trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" + "The command may fail with an error when trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" ], "BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + "The command may fail with an error for several reasons: if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" ], "BF.MADD": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", - "", - "The command will fail if the wrong number of arguments are provided" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" ], "BF.MEXISTS": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", - "", - "The command will fail if the wrong number of arguments are provided" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" ], "BF.RESERVE": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + "The command may fail with an error for several reasons: if the provided arguments are not in the valid range, if the provided bloom filter name already exists" ], "BGREWRITEAOF": [ "[Simple string reply](../topics/protocol.md#simple-strings): a simple string reply indicating that the rewriting started or is about to start ASAP when the call is executed with success.", diff --git a/resp3_replies.json 
b/resp3_replies.json index 71254790..0231887a 100644 --- a/resp3_replies.json +++ b/resp3_replies.json @@ -67,19 +67,15 @@ "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if attempting to add to a non scaling filter that is full" + "The command may fail with an error when attempting to add to a non scaling filter that is full" ], "BF.CARD": [ - "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist", - "", - "The command will fail if the wrong number of arguments are provided" + "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist" ], "BF.EXISTS": [ "One of the following:", "* [Integer reply](../topics/protocol.md#integers): `1` if the item exists in the bloom filter", - "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter", - "", - "The command will fail if the wrong number of arguments are provided" + "* [Integer reply](../topics/protocol.md#integers): `0` if the bloom filter does not exist or the item has not been added to the bloom filter" ], "BF.INFO": [ "When no optional arguments are provided:", @@ -89,27 +85,23 @@ "When ERROR is provided as an optional argument:", "* [String reply](../topics/protocol.md#simple-strings): argument value", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" + "The command may fail with an error when trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" ], 
"BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + "The command may fail with an error for several reasons: if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" ], "BF.MADD": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", - "", - "The command will fail if the wrong number of arguments are provided" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" ], "BF.MEXISTS": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", - "", - "The command will fail if the wrong number of arguments are provided" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" ], "BF.RESERVE": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", "", - "The command may fail with an error for several reasons: if the wrong number of arguments are provided, if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + "The command may fail with an error for several reasons: if the provided arguments are not in the valid range, if the provided bloom filter name already exists" ], "BGREWRITEAOF": [ "[Bulk string reply](../topics/protocol.md#bulk-strings): a simple string reply indicating that the rewriting started or is about to start ASAP when the call is executed with success.", diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index e190cbdd..70e73711 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -56,7 +56,7 @@ The difference between scaling and non scaling bloom filters is that scaling blo When a scaling filter reaches 
its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object). -When a non scaling filter reaches its capcity, if a user tries to add a new unique item an error will be returned +When a non scaling filter reaches its capacity, if a user tries to add a new unique item an error will be returned The expansion rate is the rate that a scaling bloom filter will have its capacity increased by on the scale out. For example we have a bloom filter with capacity 100 at creation with an expansion rate of 2. After adding 101 unique items we will scale out and create a new filter with capacity 200. Then after adding 200 more unique items (301 items total) we will create a new filter of capacity 400 and so on. @@ -66,16 +66,62 @@ If the data size is known and fixed then using a non-scaling bloom filter is pre There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better. -## Default bloom properties +## Bloom properties -Capacity - 100 -* The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. +* Capacity - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. -Error rate - 0.01 -* The probability of the filter incorrectly indicating that an element is present when it actually is not. +* False Positive Rate (Error rate) - The rate that controls the probability of bloom check/set operations being false positives. 
Example: An item addition returning 0 (or an item check returning 1) indicating that the item was already added even though it was not. + +* Expansion - This is a property of scalable bloom filters which controls the growth in overall capacity when a bloom filter scales out by determining the capacity of the new sub filter which gets created. This new capacity is equal to the previous filters capacity * expansion rate + +### Advanced Properties + +The following two properties can be specified in the `BF.INSERT` command: + +* Seed - This is the key with which hash functions are created for the bloom filter. In case of scalable bloom filters, the same seed is used across all sub filters. This property is only useful if you have a specific 32 byte seed that you want your bloom filter to use. By default every bloom filter will use a random seed. + +* Tightening Ratio - This is a property of scalable bloom filters which controls the overall correctness of the bloom filter as it scales out by keeping the actual false positive rate closer to the user requested false positive rate when the bloom filter was created. This is done by using the tightening ratio to set a stricter false positive on the new sub filter which gets created during each scale out. We do not recommend fine tuning this unless there is a specific use case for lower memory usage with higher false positive or vice versa. + +### Default bloom properties + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Property | Default Value | Engine Config (Global) |
|---|---|---|
| Capacity | 100 | BF.BLOOM-CAPACITY |
| False Positive Rate | 0.01 | BF.BLOOM-FP-RATE |
| Scaling / Non Scaling | Scaling | BF.BLOOM-EXPANSION |
| Expansion Rate | 2 | BF.BLOOM-EXPANSION |
| Tightening Ratio | 0.5 | BF.BLOOM-TIGHTENING-RATIO |
| Seed | Random Seed | BF.BLOOM-USE-RANDOM-SEED |
-Expansion - 2 -* The rate that a scaling bloom filter will have its capacity increased by on the scale out. For example we have a bloom filter with capacity 100 at creation with an expansion rate of 2. After adding 101 unique items we will scale out and create a new filter with capacity 200. Then after adding 200 more unique items we will create a new filter of capacity 400 and so forth. As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module. Example of default bloom objects information: @@ -102,13 +148,6 @@ Example of default bloom objects information: 16) (integer) 26214300 ``` -### Advanced Properties - -The following two properties can be specified in the `BF.INSERT` command: -* Seed - This property is only useful if you have a specific 32 byte seed that you want your bloom filter to use. By defualt every bloom filter will use a random seed. - -* Tightening Ratio - We do not recommend fine tuning this unless there is a specific use case for lower memory usage with higher false positive or vice versa. - ## Performance Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item. @@ -152,21 +191,21 @@ bf_bloom_defrag_misses:0 ### Bloom filter core metrics -* bf_bloom_total_memory_bytes: Current total number of bytes used by all bloom filters. +* `bf_bloom_total_memory_bytes`: Current total number of bytes used by all bloom filters. -* bf_bloom_num_objects: Current total number of bloom objects. +* `bf_bloom_num_objects`: Current total number of bloom objects. 
-* bf_bloom_num_filters_across_objects: Current total number of filters across all bloom objects. +* `bf_bloom_num_filters_across_objects`: Current total number of filters across all bloom objects. -* bf_bloom_num_items_across_objects: Current total number of items across all bloom objects. +* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects. -* bf_bloom_capacity_across_objects: Current total number of filters across all bloom objects. +* `bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects. ### Bloom filter defrag metrics -* bf_bloom_defrag_hits: Total number of defrag hits that have occured on bloom objects. +* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom objects. -* bf_bloom_defrag_misses: Total number of defrag misses that have occured on bloom objects. +* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom objects. ## Limits From 47332ac068018e445c7c923c6051ab6c85fd4df3 Mon Sep 17 00:00:00 2001 From: zackcam Date: Mon, 24 Mar 2025 19:46:46 +0000 Subject: [PATCH 05/10] Topic documentation updates for bloomfilter Signed-off-by: zackcam --- topics/bloomfilters.md | 91 +++++++++++++++++++++++++++--------------- 1 file changed, 58 insertions(+), 33 deletions(-) diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index 70e73711..5dfccc72 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -6,7 +6,7 @@ description: > In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature. -Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. 
False positives are possible, but it guarantees no false negatives. +Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives. A false positive is when a structure incorrectly indicates that an element is in the set when it actually is not. False negatives are when a structure incorrectly indicates that an element is not in the set when it actually is. ## Basic Bloom commands @@ -21,38 +21,38 @@ See the [complete list of bloom filter commands](../commands/#bloom). ### Financial fraud detection -Bloom filters can help answer the question "Has this card been flagged as stolen?", use a bloom filter that has cards reported stolen added to it. Check a card on use that it is not present in the bloom filter. If it isn't then the card is not marked as stolen, if present then a check to the main database can happen or deny the purchase. +Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied. ### Ad placement -Bloom filters can help answer the following questions to advertisers: +Bloom filters can help advertisers answer the following questions: * Has the user already seen this ad? -* Has the user already bought this product? +* Has the user already purchased this product? -Use a Bloom filter for every user, storing all bought products. The recommendation engine can then suggest a new product and checks if the product is in the user's Bloom filter. +For each user, use a Bloom filter to store all the products they have purchased. 
The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter. + -* If no, the ad is shown to the user and is added to the Bloom filter. -* If yes, the process restarts and repeats until it finds a product that is not present in the filter. +* If the product is not in the filter, the ad is shown to the user, and the product is added to the filter. +* If the product is already in the filter, it means the ad has already been shown to the user and the recommendation engine finds a different ad to show. ### Check if URL's are malicious Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter. -* If no then we allow access to the site -* If yes then we can deny access or perform a full check of the URL +* If no, then we allow access to the site +* If yes, then we can deny access or perform a full check of the URL ### Check if a username is taken Bloom filters can answer the question: Has this username/email/domain name/slug already been used? -For example for usernames. Use a Bloom filter for every username that has signed up. A new user types in the desired username. The app checks if the username exists in the Bloom filter. +In this username example, we can use a Bloom filter to track every username that has signed up. When a new user attempts to sign up with their desired username, the app checks if the username exists in the Bloom filter. * If no, the user is created and the username is added to the Bloom filter. * If yes, the app can decide to either check the main database or reject the username. ## Scaling and non scaling bloom filters -The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size. 
+The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow, while non-scaling bloom filters have a fixed capacity, which also means a fixed size. Scaling bloom filters consist of a vector of “Subfilters” with length >= 1 while non scaling filters contain only 1 subfilter. When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object). @@ -62,9 +62,11 @@ The expansion rate is the rate that a scaling bloom filter will have its capacit ### When should you use scaling vs non-scaling filters -If the data size is known and fixed then using a non-scaling bloom filter is preferred, for example a static dictionary could use a non scaling bloom filter as the amount of items should be fixed. Likewise the reverse case for dynamic data and unknown final sizes is when you should use a scaling bloom filters. +If the capacity (number of items we want to add) is known and fixed, using a non-scaling bloom filter is preferred. Conversely, if the capacity is unknown or dynamically calculated, using a scaling bloom filter is ideal. -There are a few benefits for using non scaling filters, a non scaling filter will have better performance than a filter that has scaled out. A non scaling filter also will use less memory for the capacity that is available. However if you don't want to hit an error and want use-as-you-go capacity, scaling is better. +There are a few benefits to using non scaling filters. A non scaling filter will have better performance than a filter that has scaled out several times (e.g. > 100). Also, a non scaling filter in general uses less memory than a scaling filter that has scaled out several times to hold the same capacity. 
+ +However, to ensure you do not hit any capacity related errors, and want use-as-you-go capacity, scaling is better. ## Bloom properties @@ -84,47 +86,57 @@ The following two properties can be specified in the `BF.INSERT` command: ### Default bloom properties +These are the default bloom properties along with the commands and configs which allow customizing. + - - - + + + + + + + + + +
| Property | Default Value | Command Name | Configuration name |
|---|---|---|---|
| Capacity | 100 | BF.INSERT, BF.RESERVE | BF.BLOOM-CAPACITY |
| False Positive Rate | 0.01 | BF.INSERT, BF.RESERVE | BF.BLOOM-FP-RATE |
| Scaling / Non Scaling | Scaling | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION |
| Expansion Rate | 2 | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION |
| Tightening Ratio | 0.5 | BF.INSERT | BF.BLOOM-TIGHTENING-RATIO |
| Seed | Random Seed | BF.INSERT | BF.BLOOM-USE-RANDOM-SEED |
-As bloom filters have a default expansion of 2 this means all default bloom objects will be scaling. These options are used when not specified explicitly in the commands used to create a new bloom object. For example doing a BF.ADD for a new filter will create a filter with the exact above qualities. These default properties can be configured through configs on the bloom module. -Example of default bloom objects information: +Since bloom filters have a default expansion of 2, any default creation as a result of `BF.ADD`, `BF.MADD`, or `BF.INSERT` will be a scalable bloom filter. Users can create a non scaling bloom filter using `BF.RESERVE NONSCALING` or by specifying `NONSCALING` in `BF.INSERT`. Additionally, the other default properties of a bloom filter creation can be seen in the table above and in the BF.INFO command response below. These default properties can be configured through configs on the bloom module. + +Example of default bloom filter information: ``` 127.0.0.1:6379> BF.ADD default_filter item @@ -150,11 +162,11 @@ Example of default bloom objects information: ## Performance -Most bloom commands are O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only work with one 1 item. +The bloom commands which involve adding items or checking the existence of items have a time complexity of O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that BF.ADD and BF.EXISTS are both O(n) as they only operate on one item. -As performance can rely on the number of hash functions, choosing the correct capacity and expansion rate can be very important. When you scale out you will be adding more hash functions that will be used. 
For this reason it is recommended that you should choose a capacity after evaluating your use case as this can avoid several scale outs. +Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload to help avoid several scale outs and reduce the number of checks. -There are a few bloom commands that are O(1): BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (if no items are specified). These commands have constant time complexity since they don't work on items but instead work on the data about the bloom filter itself. +The other bloom filter commands have O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided). ## Monitoring @@ -193,36 +205,49 @@ bf_bloom_defrag_misses:0 * `bf_bloom_total_memory_bytes`: Current total number of bytes used by all bloom filters. -* `bf_bloom_num_objects`: Current total number of bloom objects. +* `bf_bloom_num_objects`: Current total number of bloom filters. -* `bf_bloom_num_filters_across_objects`: Current total number of filters across all bloom objects. +* `bf_bloom_num_filters_across_objects`: Current total number of subfilters across all bloom filters. -* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom objects. +* `bf_bloom_num_items_across_objects`: Current total number of items across all bloom filters. -* `bf_bloom_capacity_across_objects`: Current total number of filters across all bloom objects. +* `bf_bloom_capacity_across_objects`: Current total capacity across all bloom filters. ### Bloom filter defrag metrics -* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom objects. 
+* `bf_bloom_defrag_hits`: Total number of defrag hits that have occurred on bloom filters. -* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom objects. +* `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom filters. ## Limits -The consumption of memory by a single Bloom object is limited to a default of 128 MB (configurable in the bloom module), which is the size of the in-memory data structure not the capacity of the Bloom object. You can check the amount of memory consumed by a Bloom object by using the BF.INFO command. When a bloom filter scales out it will add another filter, there is a limit on the number of filters that can be added. This filter limit will change depending on the false positive rate, capacity, expansion and tightening ratio, where this filter limit is specified on the memory limit of the bloom objects. +There are two limits a bloom filter faces. + +1. Memory Usage Limit: + + The memory usage limit per bloom filter by default is defined by the BF.BLOOM-MEMORY-USAGE-LIMIT module configuration which has a default value of 128 MB. If a command results in a creation / scale out causing the overall memory usage to exceed this limit, the command is rejected. + +2. Number of sub filters (in case of scalable bloom filters): + + When a bloom filter scales out, a new sub filter is added. The limit on the number of sub filters depends on the false positive rate and tightening ratio. Each sub filter has a stricter false positive, and this is controlled by the tightening ratio. If a command attempting a scale out results in the sub filter reaching a false positive of 0, the command is rejected. -We have implemented an optional argument into BF.INSERT (VALIDATESCALETO) that can help you determine the max capacity of the objects on creation. 
The VALIDATESCALETO when specified would check a few things, the first is that when a bloom filter has scaled out to the desired capacity will the tightening ratio reach zero, and if so we will reject the creation. The second thing it will check is that once we reach the capacity that is desired will the bloom object be less than the max memory limit (by default 128 MB). -There is also a way to check the max capacity that can be reached for Bloom objects. Using MAXSCALEDCAPACITY in BF.INFO will provide the exact capacity that the bloom object can reach. +We have implemented VALIDATESCALETO as an optional arg of BF.INSERT to help determine whether the bloom filter can scale out to the reach the specified capacity without hitting either limits mentioned above, and will reject the command otherwise. + +As seen below when trying to create a bloom filter with a capacity more than what is possible given the memory limits, the command is rejected. However if the wanted capacity is within the limits then the creation of the bloom filter will succeed. + +Example: -Example usage for a default bloom object: ``` 127.0.0.1:6379> BF.INSERT validate_scale_fail VALIDATESCALETO 26214301 (error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit 127.0.0.1:6379> BF.INSERT validate_scale_valid VALIDATESCALETO 26214300 [] +``` + +We can use the BF.INFO command's MAXSCALEDCAPACITY field to find out the maximum capacity that the scalable bloom filter can expand to hold. + +``` 127.0.0.1:6379> BF.INFO validate_scale_valid MAXSCALEDCAPACITY (integer) 26214300 ``` - -As you can see above when trying to create a bloom object that the user wants to achieve a capacity more than what is possible given the memory limits the command will output an error and not create the bloom object. However if the wanted capacity is within the limits then the creation of the bloom object will succeed. 
From 8e1f364937bcc55725ccb7a95decd33e9fa7001b Mon Sep 17 00:00:00 2001 From: zackcam Date: Fri, 28 Mar 2025 17:13:33 -0700 Subject: [PATCH 06/10] Apply suggestions from code review Making changes based on review comments Co-authored-by: KarthikSubbarao Signed-off-by: zackcam --- commands/bf.add.md | 4 ++-- commands/bf.info.md | 13 +++++++------ commands/bf.insert.md | 16 ++++++++-------- commands/bf.madd.md | 2 +- commands/bf.reserve.md | 4 ++-- resp2_replies.json | 14 +++++++------- resp3_replies.json | 14 +++++++------- topics/bloomfilters.md | 36 +++++++++++++++++++----------------- topics/data-types.md | 4 ++-- 9 files changed, 55 insertions(+), 52 deletions(-) diff --git a/commands/bf.add.md b/commands/bf.add.md index e659c8b4..6c5691e1 100644 --- a/commands/bf.add.md +++ b/commands/bf.add.md @@ -1,8 +1,8 @@ Adds a single item to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties. -To add multiple items to a bloom filter, you can use the BF.MADD or BF.INSERT commands. +To add multiple items to a bloom filter, you can use the `BF.MADD` or `BF.INSERT` commands. -If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. +To create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. ## Examples diff --git a/commands/bf.info.md b/commands/bf.info.md index 48a62495..924eb812 100644 --- a/commands/bf.info.md +++ b/commands/bf.info.md @@ -1,17 +1,18 @@ -Returns usage information and properties of a specific bloom filter +Returns usage information and properties of a specific bloom filter. ## Info Fields * CAPACITY - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. * SIZE - The number of bytes allocated by this bloom filter. 
-* FILTERS - Returns the number of filters in the specified key -* ITEMS - The number of unique items that have been added to the bloom filter -* ERROR - The false positive rate of the bloom filter +* FILTERS - Returns the number of sub filters contained within the bloom filter. +* ITEMS - The number of unique items that have been added to the bloom filter. +* ERROR - The false positive rate of the bloom filter. * EXPANSION - The expansion rate of the bloom filter. Non scaling filters will have an expansion rate of nil. -* TIGHTENING - The tightening ratio of the bloom filter +* TIGHTENING - The tightening ratio of the bloom filter. * MAXSCALEDCAPACITY - The [maximum capacity](../topics/bloomfilters.md) that a scalable bloom filter can be expand to and reach before a subsequent scale out will fail. -For non-scaling filters, the TIGHTENING and MAXSCALEDCAPACITY fields are not applicable and will not be returned, as they don't provide relevant functionality. When no optional fields are specified, the system will return all available fields for the given filter type. +For non-scaling filters, the `TIGHTENING` and `MAXSCALEDCAPACITY` fields are not applicable and will not be returned. +When no optional fields are specified, all available fields for the given filter type are returned. ## Examples diff --git a/commands/bf.insert.md b/commands/bf.insert.md index 91681692..60a0365b 100644 --- a/commands/bf.insert.md +++ b/commands/bf.insert.md @@ -1,20 +1,20 @@ If the bloom filter does not exist under the specified name, a bloom filter is created with the specified parameters. Default properties will be used if the options below are not specified. -When the ITEMS option is provided, all items provided will be attempted to be added. +When the `ITEMS` option is provided, all items provided will be attempted to be added. 
## Insert Fields * CAPACITY *capacity* - The number of unique items that would need to be added before a scale out occurs or (non scaling) before it rejects addition of unique items. -* ERROR *fp_error* - The false positive rate of the bloom filter +* ERROR *fp_error* - The false positive rate of the bloom filter. * EXPANSION *expansion* - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. -* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already -* TIGHTENING *tightening_ratio* - The tightening ratio for the bloom filter -* SEED *seed* - The seed the hash functions will use +* NOCREATE - Will not create the bloom filter and add items if the filter does not exist already. +* TIGHTENING *tightening_ratio* - The tightening ratio for the bloom filter. +* SEED *seed* - The seed the hash functions will use. * NONSCALING - This option will configure the bloom filter as non scaling; it cannot expand / scale beyond its specified capacity. -* VALIDATESCALETO *validatescaleto* - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter -* ITEMS *item* - One or more items to be added to the bloom filter +* VALIDATESCALETO *validatescaleto* - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter. +* ITEMS *item* - One or more items to be added to the bloom filter. -Due to the nature of NONSCALING and VALIDATESCALETO arguments, specifying NONSCALING and VALIDATESCALETO isn't allowed +Due to the nature of `NONSCALING` and `VALIDATESCALETO` arguments, specifying `NONSCALING` and `VALIDATESCALETO` together is not allowed. 
## Examples diff --git a/commands/bf.madd.md b/commands/bf.madd.md index 433ac5c0..d44a8b4b 100644 --- a/commands/bf.madd.md +++ b/commands/bf.madd.md @@ -1,4 +1,4 @@ -Adds one or more items to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties +Adds one or more items to a bloom filter. If the specified bloom filter does not exist, a bloom filter is created with the provided name with default properties. If you want to create a bloom filter with non-default properties, use the `BF.INSERT` or `BF.RESERVE` command. diff --git a/commands/bf.reserve.md b/commands/bf.reserve.md index 4607d5d4..8ced88f9 100644 --- a/commands/bf.reserve.md +++ b/commands/bf.reserve.md @@ -1,6 +1,6 @@ -Creates an empty bloom filter with the capacity and false positive rate specified. By default, a scaling filter is created with the default expansion rate. +Creates an empty bloom filter with the specified capacity and false positive rate. By default, a scaling filter is created with the default expansion rate. -To specify the scaling / non scaling nature of the bloom filter, use the options: NONSCALING or SCALING . It is invalid to provide both options together. +To specify the scaling / non scaling nature of the bloom filter, use the options: `NONSCALING` or `SCALING `. It is invalid to provide both options together. 
## Reserve fields diff --git a/resp2_replies.json b/resp2_replies.json index a114ab77..a7894274 100644 --- a/resp2_replies.json +++ b/resp2_replies.json @@ -67,7 +67,7 @@ "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", - "The command may fail with an error when attempting to add to a non scaling filter that is full" + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." ], "BF.CARD": [ "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist" @@ -83,17 +83,17 @@ "When an optional argument excluding ERROR is provided:", "* [Integer reply](../topics/protocol.md#integers): argument value", "When ERROR is provided as an optional argument:", - "* [String reply](../topics/protocol.md#simple-strings): argument value", - "", - "The command may fail with an error when trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" + "* [String reply](../topics/protocol.md#simple-strings): argument value" ], "BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", "", - "The command may fail with an error for several reasons: if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." 
], "BF.MADD": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." ], "BF.MEXISTS": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" @@ -101,7 +101,7 @@ "BF.RESERVE": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", "", - "The command may fail with an error for several reasons: if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + "The command will be rejected if input is invalid, if a key with the same name already exists, or if the bloom filter creation exceeds limits." ], "BGREWRITEAOF": [ "[Simple string reply](../topics/protocol.md#simple-strings): a simple string reply indicating that the rewriting started or is about to start ASAP when the call is executed with success.", diff --git a/resp3_replies.json b/resp3_replies.json index 0231887a..78369bba 100644 --- a/resp3_replies.json +++ b/resp3_replies.json @@ -67,7 +67,7 @@ "* [Integer reply](../topics/protocol.md#integers): `1` if the item was successfully added", "* [Integer reply](../topics/protocol.md#integers): `0` if the item already existed in the bloom filter", "", - "The command may fail with an error when attempting to add to a non scaling filter that is full" + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." 
], "BF.CARD": [ "[Integer reply](../topics/protocol.md#integers): The number of items successfully added to the bloom filter, or 0 if the key does not exist" @@ -83,17 +83,17 @@ "When an optional argument excluding ERROR is provided:", "* [Integer reply](../topics/protocol.md#integers): argument value", "When ERROR is provided as an optional argument:", - "* [String reply](../topics/protocol.md#simple-strings): argument value", - "", - "The command may fail with an error when trying to get MAXSCALEDCAPACITY or TIGHTENING of a non scaling filter" + "* [String reply](../topics/protocol.md#simple-strings): argument value" ], "BF.INSERT": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s) - if filter already exists or if creation was successful. An empty array if no items are provided", "", - "The command may fail with an error for several reasons: if the provided VALIDATESCALETO is not possible, if the provided optional args are not valid" + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." ], "BF.MADD": [ - "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" + "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)", + "", + "The command will be rejected if input is invalid, if a non bloom filter key with the same name already exists, if the bloom filter creation / scale out exceeds limits, or if an item is being added to a full non scaling filter." 
], "BF.MEXISTS": [ "[Array reply](../topics/protocol.md#arrays): Array of ints (1’s and 0’s)" @@ -101,7 +101,7 @@ "BF.RESERVE": [ "[Simple string reply](../topics/protocol.md#simple-strings): `OK`.", "", - "The command may fail with an error for several reasons: if the provided arguments are not in the valid range, if the provided bloom filter name already exists" + "The command will be rejected if input is invalid, if a key with the same name already exists, or if the bloom filter creation exceeds limits." ], "BGREWRITEAOF": [ "[Bulk string reply](../topics/protocol.md#bulk-strings): a simple string reply indicating that the rewriting started or is about to start ASAP when the call is executed with success.", diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index 5dfccc72..a1fa4eb5 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -6,7 +6,7 @@ description: > In Valkey, the bloom filter data type / commands are implemented in the [valkey-bloom module](https://github.com/valkey-io/valkey-bloom) which is an official valkey module compatible with versions 8.0 and above. Users will need to load this module onto their valkey server in order to use this feature. -Bloom filters are a space efficient probabilistic data structure that allows checking whether an element is member of a set. False positives are possible, but it guarantees no false negatives. A false positive is when a structure incorrectly indicates that an element is in the set when it actually is not. False negatives are when a structure incorrectly indicates that an element is not in the set when it actually is. +Bloom filters are a space efficient probabilistic data structure that allows adding elements and checking whether elements exist. False positives are possible where a filter incorrectly indicates that an element exists, even though it was not added. 
However, Bloom Filters guarantee that false negatives (incorrectly indicating that an element does not exist, even though it was added) do not occur. ## Basic Bloom commands @@ -19,11 +19,11 @@ See the [complete list of bloom filter commands](../commands/#bloom). ## Common use cases for bloom filters -### Financial fraud detection +### Fraud detection Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied. -### Ad placement +### Ad placement / Deduplication Bloom filters can help advertisers answer the following questions: * Has the user already seen this ad? @@ -52,13 +52,15 @@ In this username example, we can use use a Bloom filter to track every username ## Scaling and non scaling bloom filters -The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, but a capacity that can grow. While non-scaling bloom filters will have a fixed capacity which also means a fixed size. Scaling bloom filters consist of a vector of “Subfilters” with length >= 1 while non scaling will only contain 1 subfilter. +The bloom filter data type can act either as a "scaling bloom filter" or "non scaling bloom filter" depending on user configuration. -When a scaling filter reaches its capacity, adding a new unique item will cause a new bloom filter to be created and added to the vector of bloom filters. This new bloom filter will have a larger capacity (previous bloom filter's capacity * expansion rate of the bloom object). +The difference between scaling and non scaling bloom filters is that scaling bloom filters do not have a fixed capacity, instead they can grow. 
Non-scaling bloom filters will have a fixed capacity, meaning only a fixed number of items can be inserted into them. Scaling bloom filters consist of a vector of "Sub filters" with length >= 1, while non scaling will only contain 1 sub filter. -When a non scaling filter reaches its capacity, if a user tries to add a new unique item an error will be returned +When a scaling bloom filter reaches its capacity, adding a new unique item will trigger a scale out and a new sub filter is created and added to the vector of sub filters. This new sub filter will have a larger capacity (previous sub filter's capacity * expansion rate of the bloom filter). -The expansion rate is the rate that a scaling bloom filter will have its capacity increased by on the scale out. For example we have a bloom filter with capacity 100 at creation with an expansion rate of 2. After adding 101 unique items we will scale out and create a new filter with capacity 200. Then after adding 200 more unique items (301 items total) we will create a new filter of capacity 400 and so on. +After a non scaling bloom filter reaches its capacity, if a user tries to add a new unique item, an error will be returned. + +The expansion rate is the rate that a scaling bloom filter's capacity is increased by upon scale out. For example, we have a bloom filter with capacity 100 at creation with an expansion rate of 2. After adding 101 unique items, it will scale out and create a new sub filter with capacity 200. Then, after adding 200 more unique items (301 items total), a new sub filter of capacity 400 is added upon scale out and so on. ### When should you use scaling vs non-scaling filters @@ -164,18 +166,18 @@ Example of default bloom filter information: The bloom commands which involve adding items or checking the existence of items have a time complexity of O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted.
This means that both BF.ADD and BF.EXISTS are both O(n) as they only operate on one item. -Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload to help avoid several scale outs and reduce the number of checks. +Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload to avoid several scale outs and reduce the number of checks. -There other bloom filter commands are O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided). +The other bloom filter commands are O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided). ## Monitoring -To check the overall bloom filter metrics not just for specific filters, you can use the following `info bf` or `info modules`. +To check the server's overall bloom filter metrics, you can use the `INFO BF` or the `INFO MODULES` command. -Example of `info bf` calls in different scenarios: +Example of `INFO BF` calls in different scenarios: ``` -127.0.0.1:6379> info bf +127.0.0.1:6379> INFO BF # bf_bloom_core_metrics bf_bloom_total_memory_bytes:0 bf_bloom_num_objects:0 @@ -207,7 +209,7 @@ bf_bloom_defrag_misses:0 * `bf_bloom_num_objects`: Current total number of bloom filters. 
-* `bf_bloom_num_filters_across_objects`: Current total number of subfilters across all bloom filters. +* `bf_bloom_num_filters_across_objects`: Current total number of sub filters across all bloom filters. * `bf_bloom_num_items_across_objects`: Current total number of items across all bloom filters. @@ -225,16 +227,16 @@ There are two limits a bloom filter faces. 1. Memory Usage Limit: - The memory usage limit per bloom filter by default is defined by the BF.BLOOM-MEMORY-USAGE-LIMIT module configuration which has a default value of 128 MB. If a command results in a creation / scale out causing the overall memory usage to exceed this limit, the command is rejected. + The memory usage limit per bloom filter by default is defined by the `BF.BLOOM-MEMORY-USAGE-LIMIT` module configuration which has a default value of 128 MB. If a command results in a creation / scale out causing the overall memory usage to exceed this limit, the command is rejected. 2. Number of sub filters (in case of scalable bloom filters): When a bloom filter scales out, a new sub filter is added. The limit on the number of sub filters depends on the false positive rate and tightening ratio. Each sub filter has a stricter false positive, and this is controlled by the tightening ratio. If a command attempting a scale out results in the sub filter reaching a false positive of 0, the command is rejected. -We have implemented VALIDATESCALETO as an optional arg of BF.INSERT to help determine whether the bloom filter can scale out to the reach the specified capacity without hitting either limits mentioned above, and will reject the command otherwise. +We have implemented `VALIDATESCALETO` as an optional arg of `BF.INSERT` to help determine whether the bloom filter can scale out to the reach the specified capacity without hitting either limits mentioned above. It will reject the command otherwise. 
-As seen below when trying to create a bloom filter with a capacity more than what is possible given the memory limits, the command is rejected. However if the wanted capacity is within the limits then the creation of the bloom filter will succeed. +As seen below, when trying to create a bloom filter with a capacity that cannot be achieved through scale outs (given the memory limits), the command is rejected. However, if the capacity can be achieved through scale out (even with the limits) then the creation of the bloom filter will succeed. Example: @@ -245,7 +247,7 @@ Example: [] ``` -We can use the BF.INFO command's MAXSCALEDCAPACITY field to find out the maximum capacity that the scalable bloom filter can expand to hold. +We can use the `BF.INFO` command's `MAXSCALEDCAPACITY` field to find out the maximum capacity that the scalable bloom filter can expand to hold. ``` 127.0.0.1:6379> BF.INFO validate_scale_valid MAXSCALEDCAPACITY diff --git a/topics/data-types.md b/topics/data-types.md index 00435505..a62dede6 100644 --- a/topics/data-types.md +++ b/topics/data-types.md @@ -94,9 +94,9 @@ The [HyperLogLog](hyperloglogs.md) data structures provide probabilistic estimat ## Bloom Filter -[Bloom filters](bloomfilters.md) are a space efficient probabilistic data type that can be used to check if item/s are definitely not present in a set, or if they exist within the set (with the configured false positive rate). +[Bloom filters](bloomfilters.md) are a space efficient probabilistic data structure that allows adding elements and checking if item/s are definitely not present, or if there is a possibility they exist (with the configured false positive rate). -Bloom filters are provided by the module `valkey-bloom` +The Bloom filter data type / command support is provided by the `valkey-bloom` module. 
For more information, see: * [Overview of Bloom Filters](bloomfilters.md) From 61f03c0623d7a4aef51d2d6005d58116cbaac257 Mon Sep 17 00:00:00 2001 From: zackcam Date: Mon, 31 Mar 2025 18:24:14 +0000 Subject: [PATCH 07/10] Updating for review comments Signed-off-by: zackcam --- commands/bf.insert.md | 2 +- topics/bloomfilters.md | 40 ++++++++++++++++++++++++---------------- topics/data-types.md | 2 +- 3 files changed, 26 insertions(+), 18 deletions(-) diff --git a/commands/bf.insert.md b/commands/bf.insert.md index 60a0365b..783c0924 100644 --- a/commands/bf.insert.md +++ b/commands/bf.insert.md @@ -9,7 +9,7 @@ When the `ITEMS` option is provided, all items provided will be attempted to be * EXPANSION *expansion* - This option will specify the bloom filter as scaling and controls the size of the sub filter that will be created upon scale out / expansion of the bloom filter. * NOCREATE - Will not create the bloom filter and add items if the filter does not exist already. * TIGHTENING *tightening_ratio* - The tightening ratio for the bloom filter. -* SEED *seed* - The seed the hash functions will use. +* SEED *seed* - The 32 byte seed the bloom filter's hash functions will use. * NONSCALING - This option will configure the bloom filter as non scaling; it cannot expand / scale beyond its specified capacity. * VALIDATESCALETO *validatescaleto* - Validates if the filter can scale out and reach to this capacity based on limits and if not, return an error without creating the bloom filter. * ITEMS *item* - One or more items to be added to the bloom filter. diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index a1fa4eb5..a89c07ea 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -19,27 +19,35 @@ See the [complete list of bloom filter commands](../commands/#bloom). ## Common use cases for bloom filters -### Fraud detection - -Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". 
To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. If the card is present in the filter, a check can be made against the main database, or the purchase can be denied. +### Advertisement / Campaign placement and deduplication -### Ad placement / Deduplication +Bloom filters can help e-commerce sites, streaming services, advertising networks, or marketing platforms answer the following questions: -Bloom filters can help advertisers answer the following questions: -* Has the user already seen this ad? -* Has the user already purchased this product? +* Has an advertisement already been shown to a user? +* Has a promotional email or notification already been sent to a user? +* Has a product already been purchased by a user? -For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter. +Example: For each user, use a Bloom filter to store all the products they have purchased. The recommendation engine can then suggest a new product and check if it is present in the user's Bloom filter. * If the product is not in the filter, the ad is shown to the user, and the product is added to the filter. * If the product is already in the filter, it means the ad has already been shown to the user and the recommendation engine finds a different ad to show. -### Check if URL's are malicious +### Fraud detection + +Bloom filters can be used to answer the question, "Has this card been flagged as stolen?". To do this, use a bloom filter that contains cards reported as stolen. When a card is used, check whether it is present in the bloom filter. If the card is not found, it means it is not marked as stolen. 
If the card is present in the filter, a check can be made against the main database, or the purchase can be denied. + +### Filtering Spam / Harmful Content +Bloom filters provide an efficient way to screen content for potential threats and harmful material. Here's how they can be effectively used: + +Example: Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter. + +* If no, then we allow access to the site. +* If yes, then we can deny access or perform a full check of the URL. -Bloom filters can answer the question "is a URL malicious?". Any URL inputted would be checked against a malicious URL bloom filter. +Example: Bloom filters can answer the question "is this content harmful or spam?". Create a bloom filter that contains spam email addresses or spam phone numbers. When an email or text is received, check whether the sender's email address or phone number is present in the bloom filter. -* If no, then we allow access to the site -* If yes, then we can deny access or perform a full check of the URL +* If no, then the message can be displayed to the user. +* If yes, then we can send the message to the spam folder or perform a full check on the email or number. ### Check if a username is taken @@ -164,9 +172,9 @@ Example of default bloom filter information: ## Performance -The bloom commands which involve adding items or checking the existence of items have a time complexity of O(n * k) where n is the number of hash functions used by the bloom filter and k is the number of elements being inserted. This means that both BF.ADD and BF.EXISTS are both O(n) as they only operate on one item. +The bloom commands which involve adding items or checking the existence of items have a time complexity of O(N * K) where N is the number of hash functions used by the bloom filter and K is the number of elements being added or checked. This means that BF.ADD and BF.EXISTS are both O(N) as they only operate on one item.
-Since performance relies on the number of hash functions, choosing the correct capacity and expansion rate can be important. In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity after evaluating the use case / workload to avoid several scale outs and reduce the number of checks. +In case of scalable bloom filters, with every scale out, we increase the number of checks (using hash functions of each sub filter) performed during any add / exists operation. For this reason, it is recommended that users choose a capacity and expansion rate after evaluating the use case / workload to avoid several scale outs and reduce the number of checks. The other bloom filter commands are O(1) time complexity: BF.CARD, BF.INFO, BF.RESERVE, and BF.INSERT (when no items are provided). @@ -236,7 +244,7 @@ There are two limits a bloom filter faces. We have implemented `VALIDATESCALETO` as an optional arg of `BF.INSERT` to help determine whether the bloom filter can scale out to the reach the specified capacity without hitting either limits mentioned above. It will reject the command otherwise. -As seen below, when trying to create a bloom filter with a capacity that cannot be achieved through scale outs (given the memory limits), the command is rejected. However, if the capacity can be achieved through scale out (even with the limits) then the creation of the bloom filter will succeed. +As seen below, when trying to create a bloom filter with a capacity that cannot be achieved through scale outs (given the memory limits), the command is rejected. However, if the capacity can be achieved through scale out (even with the limits), then the creation of the bloom filter will succeed. 
Example: @@ -247,7 +255,7 @@ Example: [] ``` -We can use the `BF.INFO` command's `MAXSCALEDCAPACITY` field to find out the maximum capacity that the scalable bloom filter can expand to hold. +The `BF.INFO` command's `MAXSCALEDCAPACITY` field can be used to find out the maximum capacity that the scalable bloom filter can expand to hold. ``` 127.0.0.1:6379> BF.INFO validate_scale_valid MAXSCALEDCAPACITY diff --git a/topics/data-types.md b/topics/data-types.md index a62dede6..e1f06d89 100644 --- a/topics/data-types.md +++ b/topics/data-types.md @@ -101,7 +101,7 @@ For more information, see: * [Overview of Bloom Filters](bloomfilters.md) * [Bloom filter command reference](../commands/#bloom) -* [The valkey-bloom module on GitHub](https://github.com/valkey-io/valkey-bloom/) +* [Valkey-bloom module on GitHub](https://github.com/valkey-io/valkey-bloom/) ## Extensions From 613ed723bbabf708f3d7b5b3cb502c691cb260a3 Mon Sep 17 00:00:00 2001 From: zackcam Date: Mon, 31 Mar 2025 15:36:44 -0700 Subject: [PATCH 08/10] Aligning capitalization across bloom commands Co-authored-by: Harkrishn Patro Signed-off-by: zackcam --- commands/bf.card.md | 2 +- commands/bf.exists.md | 2 +- commands/bf.load.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/commands/bf.card.md b/commands/bf.card.md index 00224d26..6b7a0653 100644 --- a/commands/bf.card.md +++ b/commands/bf.card.md @@ -1,4 +1,4 @@ -Returns the cardinality of a Bloom filter which is the number of items that have been successfully added to it. +Returns the cardinality of a bloom filter which is the number of items that have been successfully added to it. ## Examples diff --git a/commands/bf.exists.md b/commands/bf.exists.md index a759134d..3d1c3df3 100644 --- a/commands/bf.exists.md +++ b/commands/bf.exists.md @@ -1,6 +1,6 @@ Determines if an item has been added to the bloom filter previously. 
-A Bloom filter has two possible responses when you check if an item exists: +A bloom filter has two possible responses when you check if an item exists: * 0 - The item definitely does not exist since with bloom filters, false negatives are not possible. diff --git a/commands/bf.load.md b/commands/bf.load.md index 77706361..fab45654 100644 --- a/commands/bf.load.md +++ b/commands/bf.load.md @@ -1 +1 @@ -Restores a bloom filter from a dump of an existing bloom filter with all of its specific the properties and bit vector dump of sub filter/s. This command is only generated during AOF Rewrite to restore a bloom filter in the future. +Restores a bloom filter from a dump of an existing bloom filter with all of its specific the properties and bit vector dump of sub filter/s. This command is only generated during AOF rewrite to restore a bloom filter in the future. From 8719dad485c610ea6c3a1d91f84bd15a1d61e66c Mon Sep 17 00:00:00 2001 From: zackcam Date: Tue, 1 Apr 2025 16:42:19 +0000 Subject: [PATCH 09/10] Updating command responses and making table in markdown not HTML Signed-off-by: zackcam --- commands/bf.add.md | 4 +-- commands/bf.card.md | 8 +++--- commands/bf.exists.md | 6 ++-- commands/bf.info.md | 4 +-- topics/bloomfilters.md | 62 +++++++++--------------------------------- 5 files changed, 24 insertions(+), 60 deletions(-) diff --git a/commands/bf.add.md b/commands/bf.add.md index 6c5691e1..88842339 100644 --- a/commands/bf.add.md +++ b/commands/bf.add.md @@ -8,7 +8,7 @@ To create a bloom filter with non-default properties, use the `BF.INSERT` or `BF ``` 127.0.0.1:6379> BF.ADD key val -1 +(integer) 1 127.0.0.1:6379> BF.ADD key val -0 +(integer) 0 ``` diff --git a/commands/bf.card.md b/commands/bf.card.md index 6b7a0653..ac617ce5 100644 --- a/commands/bf.card.md +++ b/commands/bf.card.md @@ -4,9 +4,9 @@ Returns the cardinality of a bloom filter which is the number of items that have ``` 127.0.0.1:6379> BF.ADD key val -1 +(integer) 1 127.0.0.1:6379> BF.CARD key -1 
+(integer) 1 127.0.0.1:6379> BF.CARD nonexistentkey -0 -``` \ No newline at end of file +(integer) 0 +``` diff --git a/commands/bf.exists.md b/commands/bf.exists.md index 3d1c3df3..670eae42 100644 --- a/commands/bf.exists.md +++ b/commands/bf.exists.md @@ -10,9 +10,9 @@ A bloom filter has two possible responses when you check if an item exists: ``` 127.0.0.1:6379> BF.ADD key val -1 +(integer) 1 127.0.0.1:6379> BF.EXISTS key val -1 +(integer) 1 127.0.0.1:6379> BF.EXISTS key nonexistent -0 +(integer) 0 ``` diff --git a/commands/bf.info.md b/commands/bf.info.md index 924eb812..48ccb535 100644 --- a/commands/bf.info.md +++ b/commands/bf.info.md @@ -18,7 +18,7 @@ When no optional fields are specified, all available fields for the given filter ``` 127.0.0.1:6379> BF.ADD key val -1 +(integer) 1 127.0.0.1:6379> BF.INFO key 1) Capacity 2) (integer) 100 @@ -37,5 +37,5 @@ When no optional fields are specified, all available fields for the given filter 15) Max scaled capacity 16) (integer) 26214300 127.0.0.1:6379> BF.INFO key CAPACITY -100 +(integer) 100 ``` \ No newline at end of file diff --git a/topics/bloomfilters.md b/topics/bloomfilters.md index a89c07ea..7f45c6ff 100644 --- a/topics/bloomfilters.md +++ b/topics/bloomfilters.md @@ -98,50 +98,14 @@ The following two properties can be specified in the `BF.INSERT` command: These are the default bloom properties along with the commands and configs which allow customizing. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Property | Default Value | Command Name | Configuration name
Capacity | 100 | BF.INSERT, BF.RESERVE | BF.BLOOM-CAPACITY
False Positive Rate | 0.01 | BF.INSERT, BF.RESERVE | BF.BLOOM-FP-RATE
Scaling / Non Scaling | Scaling | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION
Expansion Rate | 2 | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION
Tightening Ratio | 0.5 | BF.INSERT | BF.BLOOM-TIGHTENING-RATIO
Seed | Random Seed | BF.INSERT | BF.BLOOM-USE-RANDOM-SEED
+| Property | Default Value | Command Name | Configuration name |
+|----------|--------------|--------------|-------------------|
+| Capacity | 100 | BF.INSERT, BF.RESERVE | BF.BLOOM-CAPACITY |
+| False Positive Rate | 0.01 | BF.INSERT, BF.RESERVE | BF.BLOOM-FP-RATE |
+| Scaling / Non Scaling | Scaling | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION |
+| Expansion Rate | 2 | BF.INSERT, BF.RESERVE | BF.BLOOM-EXPANSION |
+| Tightening Ratio | 0.5 | BF.INSERT | BF.BLOOM-TIGHTENING-RATIO |
+| Seed | Random Seed | BF.INSERT | BF.BLOOM-USE-RANDOM-SEED |
 Since bloom filters have a default expansion of 2, this means any default creation as a result of `BF.ADD`, `BF.MADD`, `BF.INSERT` will be a scalable bloom filter. Users can create a non scaling bloom filter using `BF.RESERVE NONSCALING` or by specifying `NONSCALING` in `BF.INSERT`. Additionally, the other default properties of a bloom filter creation can be seen in the table above and BF.INFO command response below. These default properties can be configured through configs on the bloom module.
@@ -229,20 +193,20 @@ bf_bloom_defrag_misses:0
 * `bf_bloom_defrag_misses`: Total number of defrag misses that have occurred on bloom filters.
-## Limits
+## Handling Large Bloom Filters
-There are two limits a bloom filter faces.
+There are two notable validations that bloom filters face.
-1. Memory Usage Limit:
+1. Memory Usage:
- The memory usage limit per bloom filter by default is defined by the `BF.BLOOM-MEMORY-USAGE-LIMIT` module configuration which has a default value of 128 MB. If a command results in a creation / scale out causing the overall memory usage to exceed this limit, the command is rejected.
+ The memory usage limit per bloom filter by default is defined by the `BF.BLOOM-MEMORY-USAGE-LIMIT` module configuration which has a default value of 128 MB. If a command results in a creation / scale out causing the overall memory usage to exceed this limit, the command is rejected.
This config is modifiable and can be increased as needed.
 2. Number of sub filters (in case of scalable bloom filters):
 When a bloom filter scales out, a new sub filter is added. The limit on the number of sub filters depends on the false positive rate and tightening ratio. Each sub filter has a stricter false positive, and this is controlled by the tightening ratio. If a command attempting a scale out results in the sub filter reaching a false positive of 0, the command is rejected.
-We have implemented `VALIDATESCALETO` as an optional arg of `BF.INSERT` to help determine whether the bloom filter can scale out to the reach the specified capacity without hitting either limits mentioned above. It will reject the command otherwise.
+You can use `VALIDATESCALETO` as an optional arg of `BF.INSERT` to help determine whether the bloom filter can scale out to reach the specified capacity without hitting either of the limits mentioned above. Otherwise, the command is rejected.
 As seen below, when trying to create a bloom filter with a capacity that cannot be achieved through scale outs (given the memory limits), the command is rejected. However, if the capacity can be achieved through scale out (even with the limits), then the creation of the bloom filter will succeed.
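The interaction between capacity, expansion rate, and tightening ratio described in this section can be worked through with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that this patch does not spell out: that sub filter `i` holds `capacity * expansion**i` items, and that its false positive rate is `fp_rate * tightening_ratio**i`. The function names are hypothetical and the module's exact formulas may differ.

```python
from typing import List


def scaled_capacity(capacity: int, expansion: int, num_filters: int) -> int:
    """Total items held by num_filters sub filters, assuming geometric growth."""
    return sum(capacity * expansion**i for i in range(num_filters))


def sub_filter_fp_rates(fp_rate: float, tightening_ratio: float, num_filters: int) -> List[float]:
    """Per sub filter false positive rate, assuming geometric tightening."""
    return [fp_rate * tightening_ratio**i for i in range(num_filters)]


# Defaults from the table in this patch: capacity 100, expansion 2,
# false positive rate 0.01, tightening ratio 0.5.
total = scaled_capacity(capacity=100, expansion=2, num_filters=18)
rates = sub_filter_fp_rates(fp_rate=0.01, tightening_ratio=0.5, num_filters=3)
print(total)   # 26214300
print(rates)   # [0.01, 0.005, 0.0025]
```

Under these assumptions, 18 sub filters with the default properties give a total of 26214300, which is the same value as the `Max scaled capacity` field in the `BF.INFO` example in `commands/bf.info.md`. The shrinking false positive rates show why a filter cannot scale out indefinitely: each new sub filter needs a stricter rate than the last.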
From 11c88c216e43dba525ff47717dd4016e0c0531d4 Mon Sep 17 00:00:00 2001 From: zackcam Date: Wed, 2 Apr 2025 17:17:58 +0000 Subject: [PATCH 10/10] Adding words that are causing spellcheck to fail Signed-off-by: zackcam --- wordlist | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/wordlist b/wordlist index 19841317..9ca5bdeb 100644 --- a/wordlist +++ b/wordlist @@ -81,6 +81,7 @@ behaviour benchmarked Benchmarking benchmarking +BF.BLOOM-FP-RATE big-endian BigNumber \w+:\w.* @@ -179,6 +180,7 @@ deauthenticate deauthenticated Deauthenticates deduplicated +deduplication Defrag defrag defragging @@ -277,6 +279,7 @@ FlameGraph fmt foo[0-9] formatter +fp_error france_location FreeBSD FreeString @@ -908,6 +911,7 @@ UTF-8 utf8 utils v[0-9\.]+ +validatescaleto [Vv]alkey [Vv]alkey-[\w+-]+ Valkey's