-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Store the raw format name for DiskBBQ in field metadata #134812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store the raw format name for DiskBBQ in field metadata #134812
Conversation
1296862
to
3b444c0
Compare
public static final String CLUSTER_EXTENSION = "clivf"; | ||
static final String IVF_META_EXTENSION = "mivf"; | ||
|
||
static final String RAW_VECTOR_FORMAT = "raw_vector_format"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be scoped in some fashion?
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking something way simpler.
On write, we add a string literal to writeMeta
that indicates the flat format.
On read, we parse that string literal readFields
.
Though maybe the segment info works ok. Let me think.
So something like thecoop/elasticsearch@main...thecoop:elasticsearch:diskbbq_raw_format_name_meta then... Both are pretty much the same. I think I prefer the attribute one as the meta feels very low-level, and the attribute one is more top-level functionality. But either would work. |
The top level attribute effectively blocks all other formats from being able to provide a different raw reader for their fields. Segment Info is shared. So, either we would have a unique segment info input per vector format (seems really messy to me), or we allow the format to handle it directly (seems better, even though lower level). Of course, for either, we will need a map that caches the loaded formats. Readers are thread-safe and we know them all ahead of time, so we can populate the map eagerly and avoid the "synchronize" block that you have in the meta version |
Also, I am not sure we actually need to validate with "supported" formats. We should just use the format KnnVectorFormat named loader and support anything that is a FlatVectorFormat for hnsw |
|
0d0b436
to
3f08036
Compare
I've updated the PR to use meta rather than attrs. Adding direct IO support is best done after Lucene 10.3 is merged, then I can update #130893 and apply it to the supported formats used by DiskBBQ |
if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) { | ||
// IVF only works on floats, so pass through any others straight | ||
// to the first flat vectors format we can find | ||
return rawVectorReaders.values().iterator().next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.
I think this is OK. The flat format needs to fully support the flat vectors reader API.
This getReaderForField
should just be a map lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! I think it can be simplified even more But this is a good start. Just record the name and the name can be loaded via the KnnVectorsFormat.forName
API store the reader in a map and BOOM, good to go!
if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) { | ||
// IVF only works on floats, so pass through any others straight | ||
// to the first flat vectors format we can find | ||
return rawVectorReaders.values().iterator().next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that all the flat formats that could be used here handle bytes in the same way. Alternatively, an explicit byte-handling reader can be specified separately in the constructor.
I think this is OK. The flat format needs to fully support the flat vectors reader API.
This getReaderForField
should just be a map lookup.
Yeah, this is silly. Lucene actually passes the name up to the SPI loader, but doesn't expose itself at all. Which is funny. I think this will change eventually, but until then, simply holding a format object like this is pretty cheap for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated how we handle bytes, but this looks good to me now. Gonna merge :)
Store the flat vector format in fieldentry info for loading at read time. At the moment, there's just one supported flat format, but more can be added relatively easily
Store the flat vector format in fieldentry info for loading at read time. At the moment, there's just one supported flat format, but more can be added relatively easily
Store the flat vector format in fieldentry info for loading at read time.
At the moment, there's just one supported flat format, but more can be added relatively easily