Store the raw format name for DiskBBQ in field metadata #134812

thecoop · 2025-09-16T14:17:15Z

Store the flat vector format name in the field entry info so it can be loaded at read time.

At the moment there is just one supported flat format, but more can be added relatively easily.

thecoop · 2025-09-16T14:53:59Z

...r/src/main/java/org/elasticsearch/index/codec/vectors/diskbbq/ES920DiskBBQVectorsFormat.java

    public static final String CLUSTER_EXTENSION = "clivf";
    static final String IVF_META_EXTENSION = "mivf";

+    static final String RAW_VECTOR_FORMAT = "raw_vector_format";


Should this be scoped in some fashion?

elasticsearchmachine · 2025-09-16T15:54:29Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

benwtrent

I was thinking something way simpler.

On write, we add a string literal to writeMeta that indicates the flat format.

On read, we parse that string literal in readFields.

Though maybe the segment info works ok. Let me think.
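A minimal sketch of the write/read idea described above, using plain `java.io` streams so it runs without Lucene on the classpath. `FieldMetaSketch`, its fields, and `roundTrip` are hypothetical names for illustration, not the PR's actual classes; in the real reader and writer the meta stream would be a Lucene `IndexOutput`/`IndexInput`, where `writeString`/`readString` would take the place of `writeUTF`/`readUTF`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a per-field metadata entry (not the PR's class).
final class FieldMetaSketch {
    final int fieldNumber;
    final String rawVectorFormatName;

    FieldMetaSketch(int fieldNumber, String rawVectorFormatName) {
        this.fieldNumber = fieldNumber;
        this.rawVectorFormatName = rawVectorFormatName;
    }

    // Write side: record the flat format's name alongside the field number.
    void writeMeta(DataOutputStream meta) throws IOException {
        meta.writeInt(fieldNumber);
        meta.writeUTF(rawVectorFormatName);
    }

    // Read side: parse the values back in the same order they were written.
    static FieldMetaSketch readField(DataInputStream meta) throws IOException {
        int fieldNumber = meta.readInt();
        String formatName = meta.readUTF();
        return new FieldMetaSketch(fieldNumber, formatName);
    }

    // Helper for the demo: serialize to memory and read the entry back.
    static FieldMetaSketch roundTrip(FieldMetaSketch entry) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                entry.writeMeta(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readField(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the name travels with the field's own metadata, each field can carry a different flat format without touching anything shared at the segment level.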

thecoop · 2025-09-17T08:53:38Z

So something like thecoop/elasticsearch@main...thecoop:elasticsearch:diskbbq_raw_format_name_meta then...

Both are pretty much the same. I think I prefer the attribute one as the meta feels very low-level, and the attribute one is more top-level functionality. But either would work.

benwtrent · 2025-09-17T17:44:39Z

> Both are pretty much the same. I think I prefer the attribute one as the meta feels very low-level, and the attribute one is more top-level functionality. But either would work.

The top level attribute effectively blocks all other formats from being able to provide a different raw reader for their fields. Segment Info is shared. So, either we would have a unique segment info input per vector format (seems really messy to me), or we allow the format to handle it directly (seems better, even though lower level).

Of course, for either, we will need a map that caches the loaded formats. Readers are thread-safe and we know them all ahead of time, so we can populate the map eagerly and avoid the `synchronized` block that you have in the meta version.
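To illustrate the eager-population point, here is a rough sketch under stated assumptions: `RawReaderSketch`, `RawFormatSketch`, `NamedFormatSketch`, and `EagerReaderCacheSketch` are all hypothetical stand-ins, not Lucene or Elasticsearch types. The map is filled once in the constructor and then only read, so lookups need no locking.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a thread-safe raw vector reader.
interface RawReaderSketch {
    String formatName();
}

// Hypothetical stand-in for a flat format that can open a reader.
interface RawFormatSketch {
    String name();

    RawReaderSketch openReader();
}

// Trivial format used only to exercise the cache.
final class NamedFormatSketch implements RawFormatSketch {
    private final String name;

    NamedFormatSketch(String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public RawReaderSketch openReader() {
        return () -> name;
    }
}

final class EagerReaderCacheSketch {
    // Built once in the constructor and never mutated afterwards.
    private final Map<String, RawReaderSketch> readersByFormat;

    EagerReaderCacheSketch(List<RawFormatSketch> knownFormats) {
        Map<String, RawReaderSketch> readers = new LinkedHashMap<>();
        for (RawFormatSketch format : knownFormats) {
            readers.put(format.name(), format.openReader());
        }
        this.readersByFormat = Collections.unmodifiableMap(readers);
    }

    // Plain read of an immutable map: no synchronized block required.
    RawReaderSketch readerFor(String formatName) {
        RawReaderSketch reader = readersByFormat.get(formatName);
        if (reader == null) {
            throw new IllegalStateException("No raw vector reader for format [" + formatName + "]");
        }
        return reader;
    }
}
```

The trade-off is that every known format opens a reader up front, even if no field in the segment uses it; that only makes sense while the set of formats is small and known ahead of time, as the comment assumes.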

benwtrent · 2025-09-17T17:46:16Z

Also, I am not sure we actually need to validate against "supported" formats. We should just use the KnnVectorsFormat named loader and support anything that is a FlatVectorsFormat for hnsw.

thecoop · 2025-09-18T08:25:32Z

Lucene99FlatVectorsFormat isn't registered with SPI, as it's always constructed directly. Hence the separate lookup map to actually create the formats.
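A rough sketch of what such a separate lookup map might look like when a format cannot be resolved by name through SPI. `FlatFormatSketch` and `SupportedFlatFormatsSketch` are hypothetical, and the key string is illustrative only; the real map would hold suppliers for actual flat vectors format instances.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a flat vectors format instance.
final class FlatFormatSketch {
    final String name;

    FlatFormatSketch(String name) {
        this.name = name;
    }
}

final class SupportedFlatFormatsSketch {
    // Name -> factory. Adding another supported format means adding one entry here.
    private static final Map<String, Supplier<FlatFormatSketch>> SUPPORTED = Map.of(
        "Lucene99FlatVectorsFormat", () -> new FlatFormatSketch("Lucene99FlatVectorsFormat")
    );

    private SupportedFlatFormatsSketch() {}

    // Resolve the name stored in field metadata, failing loudly on unknown names.
    static FlatFormatSketch forName(String name) {
        Supplier<FlatFormatSketch> factory = SUPPORTED.get(name);
        if (factory == null) {
            throw new IllegalArgumentException(
                "Unknown raw vector format [" + name + "]; supported: " + SUPPORTED.keySet());
        }
        return factory.get();
    }
}
```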

thecoop · 2025-09-18T09:22:06Z

I've updated the PR to use meta rather than attrs. Adding direct IO support is best done after Lucene 10.3 is merged, then I can update #130893 and apply it to the supported formats used by DiskBBQ

thecoop · 2025-09-18T10:20:55Z

server/src/main/java/org/elasticsearch/index/codec/vectors/diskbbq/IVFVectorsReader.java

+        if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) {
+            // IVF only works on floats, so pass through any others straight
+            // to the first flat vectors format we can find
+            return rawVectorReaders.values().iterator().next();


This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.

benwtrent

I like this! I think it can be simplified even more, but this is a good start. Just record the name; the format can then be loaded via the KnnVectorsFormat.forName API. Store the reader in a map and BOOM, good to go!

benwtrent · 2025-09-18T14:49:25Z

server/src/main/java/org/elasticsearch/index/codec/vectors/diskbbq/IVFVectorsReader.java

+        if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) {
+            // IVF only works on floats, so pass through any others straight
+            // to the first flat vectors format we can find
+            return rawVectorReaders.values().iterator().next();


> This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.

I think this is OK. The flat format needs to fully support the flat vectors reader API.

This getReaderForField should just be a map lookup.
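One possible shape for the lookup discussed in this thread, as a sketch only: float fields resolve through a plain map keyed by the recorded format name, and non-float fields go to an explicit byte-handling reader passed to the constructor (the alternative raised in the quoted comment). `EncodingSketch` and `ReaderLookupSketch` are hypothetical names, and the reader type is left generic because the real reader class is not shown here.

```java
import java.util.Map;

// Hypothetical stand-in for the field's vector encoding.
enum EncodingSketch {
    BYTE,
    FLOAT32
}

final class ReaderLookupSketch<R> {
    private final Map<String, R> floatReadersByFormatName;
    private final R byteReader;

    ReaderLookupSketch(Map<String, R> floatReadersByFormatName, R byteReader) {
        // Defensive immutable copy: the map is only read after construction.
        this.floatReadersByFormatName = Map.copyOf(floatReadersByFormatName);
        this.byteReader = byteReader;
    }

    R getReaderForField(EncodingSketch encoding, String rawFormatName) {
        if (encoding != EncodingSketch.FLOAT32) {
            // Non-float vectors bypass IVF and use the explicit byte reader.
            return byteReader;
        }
        // Float vectors: a plain map lookup by the name stored in field metadata.
        R reader = floatReadersByFormatName.get(rawFormatName);
        if (reader == null) {
            throw new IllegalStateException("No raw vector reader for format [" + rawFormatName + "]");
        }
        return reader;
    }
}
```

Naming the byte reader explicitly removes the dependence on map iteration order that `values().iterator().next()` has, at the cost of one more constructor argument.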

benwtrent · 2025-09-19T17:01:29Z

> isn't registered with SPI, as it's always constructed directly - hence the separate lookup map to actually create the formats

Yeah, this is silly. Lucene actually passes the name up to the SPI loader, but doesn't expose itself at all. Which is funny.

I think this will change eventually, but until then, simply holding a format object like this is pretty cheap for now.

benwtrent

I updated how we handle bytes, but this looks good to me now. Gonna merge :)

elasticsearchmachine added the v9.2.0 label Sep 16, 2025

Store the raw format name in field metadata

3b444c0

thecoop force-pushed the diskbbq_raw_format_name_attr branch from 1296862 to 3b444c0 Compare September 16, 2025 14:27

[CI] Auto commit changes from spotless

ee173bb

thecoop commented Sep 16, 2025

Add lookup map

d9e64c6

thecoop requested a review from benwtrent September 16, 2025 15:53

thecoop marked this pull request as ready for review September 16, 2025 15:53

thecoop added >refactoring :Search Relevance/Vectors Vector search labels Sep 16, 2025

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Sep 16, 2025

thecoop changed the title ~~Store the raw format name in field metadata~~ Store the raw format name for DiskBBQ in field metadata Sep 16, 2025

benwtrent reviewed Sep 16, 2025

Store format name in raw metadata

c983399

iter

3f08036

thecoop force-pushed the diskbbq_raw_format_name_attr branch from 0d0b436 to 3f08036 Compare September 18, 2025 09:17

thecoop added 2 commits September 18, 2025 10:35

Merge branch 'main' into diskbbq_raw_format_name_attr

bbc3663

Pass-through bytes directly

4fb062b

thecoop commented Sep 18, 2025

benwtrent reviewed Sep 18, 2025

simplifying some things

1e5157e

benwtrent approved these changes Sep 19, 2025

benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 19, 2025

benwtrent added 2 commits September 19, 2025 13:16

Merge branch 'main' into diskbbq_raw_format_name_attr

823f85c

Merge branch 'main' into diskbbq_raw_format_name_attr

2fa4599

elasticsearchmachine merged commit 00e4d12 into elastic:main Sep 20, 2025
34 checks passed

thecoop deleted the diskbbq_raw_format_name_attr branch September 20, 2025 14:34
