Conversation

thecoop
Member

@thecoop thecoop commented Sep 16, 2025

Store the flat vector format name in the field entry info so that it can be loaded at read time.

At the moment there is just one supported flat format, but more can be added relatively easily.

@thecoop thecoop force-pushed the diskbbq_raw_format_name_attr branch from 1296862 to 3b444c0 Compare September 16, 2025 14:27
public static final String CLUSTER_EXTENSION = "clivf";
static final String IVF_META_EXTENSION = "mivf";

static final String RAW_VECTOR_FORMAT = "raw_vector_format";
Member Author

Should this be scoped in some fashion?

@thecoop thecoop requested a review from benwtrent September 16, 2025 15:53
@thecoop thecoop marked this pull request as ready for review September 16, 2025 15:53
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Sep 16, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@thecoop thecoop changed the title Store the raw format name in field metadata Store the raw format name for DiskBBQ in field metadata Sep 16, 2025
Member

@benwtrent benwtrent left a comment

I was thinking something way simpler.

On write, we add a string literal to writeMeta that indicates the flat format.

On read, we parse that string literal in readFields.

Though maybe the segment info works ok. Let me think.
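A minimal sketch of the write/read round trip described above, assuming a per-field meta entry that carries the flat format name as a string. `FieldMetaSketch`, `FieldEntry`, and the byte layout are hypothetical, and `java.io` data streams stand in for Lucene's index input/output classes; this is not the code in the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only: names and layout are assumptions, not the PR's code.
final class FieldMetaSketch {
    // Per-field metadata: the field number plus the name of the flat (raw) vector format.
    record FieldEntry(int fieldNumber, String rawVectorFormatName) {}

    // On write: append the format name as a string after the other per-field metadata.
    static byte[] writeMeta(FieldEntry entry) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entry.fieldNumber());
            out.writeUTF(entry.rawVectorFormatName());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // On read: parse the fields back in the same order they were written.
    static FieldEntry readFields(byte[] meta) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta))) {
            int fieldNumber = in.readInt();
            String formatName = in.readUTF();
            return new FieldEntry(fieldNumber, formatName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the name travels with the field's own metadata, each vector format can decide for itself which raw reader its fields use.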

@thecoop
Member Author

thecoop commented Sep 17, 2025

So something like thecoop/elasticsearch@main...thecoop:elasticsearch:diskbbq_raw_format_name_meta then...

Both are pretty much the same. I think I prefer the attribute one as the meta feels very low-level, and the attribute one is more top-level functionality. But either would work.

@benwtrent
Member

> Both are pretty much the same. I think I prefer the attribute one as the meta feels very low-level, and the attribute one is more top-level functionality. But either would work.

The top level attribute effectively blocks all other formats from being able to provide a different raw reader for their fields. Segment Info is shared. So, either we would have a unique segment info input per vector format (seems really messy to me), or we allow the format to handle it directly (seems better, even though lower level).

Of course, for either, we will need a map that caches the loaded formats. Readers are thread-safe and we know them all ahead of time, so we can populate the map eagerly and avoid the "synchronize" block that you have in the meta version.
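A rough sketch of the eager-population idea, assuming the set of flat formats is known when the reader is constructed. `EagerReaderCacheSketch`, `FlatFormat`, and `FlatReader` are hypothetical stand-ins rather than Lucene or Elasticsearch types.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only: these types are stand-ins, not Lucene or Elasticsearch classes.
final class EagerReaderCacheSketch {
    // Stand-in for a flat vectors reader.
    interface FlatReader {
        String describe();
    }

    // Stand-in for a flat vectors format: a name plus a way to open its reader.
    record FlatFormat(String name, Supplier<FlatReader> opener) {}

    // Built once in the constructor and never mutated, so lookups need no locking.
    private final Map<String, FlatReader> readersByFormatName;

    EagerReaderCacheSketch(List<FlatFormat> knownFormats) {
        Map<String, FlatReader> readers = new LinkedHashMap<>();
        for (FlatFormat format : knownFormats) {
            // Open every known reader up front instead of lazily on first use.
            readers.put(format.name(), format.opener().get());
        }
        this.readersByFormatName = Map.copyOf(readers);
    }

    FlatReader readerFor(String formatName) {
        FlatReader reader = readersByFormatName.get(formatName);
        if (reader == null) {
            throw new IllegalArgumentException("Unknown flat vector format [" + formatName + "]");
        }
        return reader;
    }
}
```

The trade-off is that every known reader is opened even if no field uses it, which is acceptable only while the list of formats stays small.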

@benwtrent
Member

Also, I am not sure we actually need to validate with "supported" formats. We should just use the KnnVectorsFormat named loader and support anything that is a FlatVectorsFormat for HNSW.

@thecoop
Member Author

thecoop commented Sep 18, 2025

Lucene99FlatVectorsFormat isn't registered with SPI, as it's always constructed directly. Hence the separate lookup map to actually create the formats.

@thecoop thecoop force-pushed the diskbbq_raw_format_name_attr branch from 0d0b436 to 3f08036 Compare September 18, 2025 09:17
@thecoop
Member Author

thecoop commented Sep 18, 2025

I've updated the PR to use meta rather than attrs. Adding direct IO support is best done after Lucene 10.3 is merged; then I can update #130893 and apply it to the supported formats used by DiskBBQ.

if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) {
    // IVF only works on floats, so pass through any others straight
    // to the first flat vectors format we can find
    return rawVectorReaders.values().iterator().next();
Member Author

@thecoop thecoop Sep 18, 2025

This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.

Member

> This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.

I think this is OK. The flat format needs to fully support the flat vectors reader API.

This getReaderForField should just be a map lookup.
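One way to read "just a map lookup" is sketched below, assuming each field's recorded format name is already available from its metadata. `ReaderLookupSketch`, `FlatReader`, and both maps are hypothetical; only the method name `getReaderForField` comes from the discussion, and its real signature may differ.

```java
import java.util.Map;

// Hypothetical illustration only: not the PR's implementation.
final class ReaderLookupSketch {
    // Stand-in for a flat vectors reader that handles every vector encoding itself.
    interface FlatReader {}

    private final Map<String, String> formatNameByField;
    private final Map<String, FlatReader> readerByFormatName;

    ReaderLookupSketch(Map<String, String> formatNameByField, Map<String, FlatReader> readerByFormatName) {
        this.formatNameByField = Map.copyOf(formatNameByField);
        this.readerByFormatName = Map.copyOf(readerByFormatName);
    }

    // No branch on vector encoding: the flat reader is trusted to support bytes and floats alike.
    FlatReader getReaderForField(String fieldName) {
        String formatName = formatNameByField.get(fieldName);
        if (formatName == null) {
            throw new IllegalArgumentException("No format recorded for field [" + fieldName + "]");
        }
        FlatReader reader = readerByFormatName.get(formatName);
        if (reader == null) {
            throw new IllegalStateException("No reader loaded for format [" + formatName + "]");
        }
        return reader;
    }
}
```

Dropping the encoding check relies on the point made above: any flat format used here has to implement the full flat vectors reader API.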

Member

@benwtrent benwtrent left a comment

I like this! I think it can be simplified even more, but this is a good start. Just record the name; the format can then be loaded via the KnnVectorsFormat.forName API. Store the reader in a map and BOOM, good to go!

@benwtrent
Member

> isn't registered with SPI, as it's always constructed directly - hence the separate lookup map to actually create the formats

Yeah, this is silly. Lucene actually passes the name up to the SPI loader, but doesn't expose itself at all. Which is funny.

I think this will change eventually, but until then, simply holding a format object like this is pretty cheap for now.

Member

@benwtrent benwtrent left a comment

I updated how we handle bytes, but this looks good to me now. Gonna merge :)

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 19, 2025
@elasticsearchmachine elasticsearchmachine merged commit 00e4d12 into elastic:main Sep 20, 2025
34 checks passed
@thecoop thecoop deleted the diskbbq_raw_format_name_attr branch September 20, 2025 14:34
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 22, 2025
DonalEvans pushed a commit to DonalEvans/elasticsearch that referenced this pull request Sep 22, 2025
Labels

auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
>refactoring
:Search Relevance/Vectors (Vector search)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
v9.2.0

3 participants