Add base classes for generic HNSW formats #135343
    reason = "TODO Deprecate any lenient usage of Boolean#parseBoolean https://github.com/elastic/elasticsearch/issues/128993"
)
private static boolean getUseDirectIO() {
    return Boolean.parseBoolean(System.getProperty("vector.rescoring.directio", "false"));
TODO: update docs
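For context on the "lenient usage" the annotation above refers to: Boolean.parseBoolean maps any text other than "true" (ignoring case) to false, so a mistyped property value is accepted silently. A minimal sketch of a strict alternative follows. StrictBooleanProperty and its methods are hypothetical names for illustration only and are not part of Elasticsearch or of this PR.

```java
/** Sketch only: a hypothetical helper, not an Elasticsearch class. */
final class StrictBooleanProperty {
    private StrictBooleanProperty() {}

    /** Accepts exactly "true" or "false"; falls back to defaultValue when raw is null. */
    static boolean parse(String key, String raw, boolean defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        switch (raw) {
            case "true":
                return true;
            case "false":
                return false;
            default:
                // Fail loudly instead of silently treating the value as false
                throw new IllegalArgumentException(
                    "Failed to parse [" + key + "]: expected [true] or [false] but got [" + raw + "]"
                );
        }
    }

    /** Reads a JVM system property and parses it strictly. */
    static boolean fromSystemProperty(String key, boolean defaultValue) {
        return parse(key, System.getProperty(key), defaultValue);
    }

    public static void main(String[] args) {
        // Lenient parsing: an unrecognised value quietly becomes false
        System.out.println(Boolean.parseBoolean("yes")); // prints false
        // Strict parsing: the same value would throw IllegalArgumentException
        System.out.println(parse("vector.rescoring.directio", "true", false)); // prints true
    }
}
```

The strict variant trades convenience for early failure, which is usually preferable for options that change I/O behaviour.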
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @thecoop, I've created a changelog YAML for you.
...r/src/main/java/org/elasticsearch/index/codec/vectors/es93/ES93GenericHnswVectorsFormat.java
@Override
public final KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    var readFormats = supportedReadFlatVectorsFormats();
    return new ES93GenericHnswVectorsReader(state, f -> {
        var format = readFormats.get(f);
        if (format == null) return null;
        return format.fieldsReader(state);
    });
}
So, instead of doing this, why don't we actually wrap the readers/writers? This new "meta format" would write information to its own meta file, vfi (vector format information), and then use that information to construct the appropriate readers.
Example (with many TODOs):
private static final class FieldsWriter extends KnnVectorsWriter {
    private final IndexOutput metaOut;
    private final SegmentWriteState state;
    private final KnnVectorsWriter rawVectorWriter;

    FieldsWriter(SegmentWriteState state, KnnVectorsWriter rawWriter) throws IOException {
        this.rawVectorWriter = rawWriter;
        this.state = state;
        final String metaFileName = IndexFileNames.segmentFileName(
            state.segmentInfo.name,
            state.segmentSuffix,
            VECTOR_FORMAT_INFO_EXTENSION
        );
        this.metaOut = state.directory.createOutput(metaFileName, state.context);
        // Write an index header so that the reader's CodecUtil.checkIndexHeader call can validate it
        CodecUtil.writeIndexHeader(metaOut, META_CODEC_NAME, VERSION_CURRENT, state.segmentInfo.getId(), state.segmentSuffix);
        // TODO write meta information about the writer (the reader expects the inner format name as a string)
    }

    @Override
    public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
        return rawVectorWriter.addField(fieldInfo);
    }

    @Override
    public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
        rawVectorWriter.flush(maxDoc, sortMap);
    }

    @Override
    public void finish() throws IOException {
        rawVectorWriter.finish();
        // The reader calls CodecUtil.checkFooter, so the meta file needs a footer
        CodecUtil.writeFooter(metaOut);
    }

    @Override
    public void close() throws IOException {
        // Close the meta output along with the wrapped writer
        IOUtils.close(rawVectorWriter, metaOut);
    }

    @Override
    public long ramBytesUsed() {
        return rawVectorWriter.ramBytesUsed();
    }

    @Override
    public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
        rawVectorWriter.mergeOneField(fieldInfo, mergeState);
    }
}
private static final class FieldsReader extends KnnVectorsReader {
    KnnVectorsReader rawVectorReader;

    FieldsReader(SegmentReadState state) throws IOException {
        // read in the meta information
        final String metaFileName = IndexFileNames.segmentFileName(
            state.segmentInfo.name,
            state.segmentSuffix,
            VECTOR_FORMAT_INFO_EXTENSION
        );
        int versionMeta = -1;
        try (var metaIn = state.directory.openChecksumInput(metaFileName)) {
            Throwable priorE = null;
            Map<String, FlatVectorsReader> readers = null;
            try {
                versionMeta = CodecUtil.checkIndexHeader(
                    metaIn,
                    META_CODEC_NAME,
                    VERSION_START,
                    VERSION_CURRENT,
                    state.segmentInfo.getId(),
                    state.segmentSuffix
                );
                String innerFormatName = metaIn.readString();
                // TODO load format
                FlatVectorsFormat format = new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer());
                rawVectorReader = new Lucene99HnswVectorsReader(state, format.fieldsReader(state));
            } catch (Throwable exception) {
                priorE = exception;
            } finally {
                CodecUtil.checkFooter(metaIn, priorE);
            }
        }
    }

    @Override
    public void checkIntegrity() throws IOException {
        rawVectorReader.checkIntegrity();
    }

    @Override
    public FloatVectorValues getFloatVectorValues(String field) throws IOException {
        return rawVectorReader.getFloatVectorValues(field);
    }

    @Override
    public ByteVectorValues getByteVectorValues(String field) throws IOException {
        return rawVectorReader.getByteVectorValues(field);
    }

    @Override
    public void search(String field, float[] target, KnnCollector knnCollector, AcceptDocs acceptDocs) throws IOException {
        rawVectorReader.search(field, target, knnCollector, acceptDocs);
    }

    @Override
    public void search(String field, byte[] target, KnnCollector knnCollector, AcceptDocs acceptDocs) throws IOException {
        rawVectorReader.search(field, target, knnCollector, acceptDocs);
    }

    @Override
    public void close() throws IOException {
        rawVectorReader.close();
    }
}
I like it! I'll see how far that takes me...
Looks like it works! One thing I'm not sure about is merging - currently it just records a single flat vector format for the segment. I'm not sure if we need to handle merging segments with different flat formats, or even how we do that?
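To make the "single flat vector format per segment" point concrete, the round trip discussed in this thread can be sketched without Lucene, using plain java.io streams in place of IndexOutput and ChecksumIndexInput. Everything here is an assumption for illustration: FormatInfoSketch, the MAGIC value, and the byte layout are hypothetical and do not reflect the actual vfi file layout. The sketch says nothing about how merging segments with different flat formats should be handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Sketch only: hypothetical stand-in for a per-segment "vector format information" file. */
final class FormatInfoSketch {
    static final int MAGIC = 0x56464931;   // arbitrary marker chosen for this sketch
    static final int VERSION_CURRENT = 0;

    private FormatInfoSketch() {}

    /** Writes a header followed by the name of the one flat format used for the segment. */
    static byte[] writeMeta(String flatFormatName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeInt(VERSION_CURRENT);
            out.writeUTF(flatFormatName);
        }
        return bytes.toByteArray();
    }

    /** Validates the header and returns the recorded flat format name. */
    static String readMeta(byte[] meta) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta))) {
            if (in.readInt() != MAGIC) {
                throw new IOException("not a format info file");
            }
            int version = in.readInt();
            if (version != VERSION_CURRENT) {
                throw new IOException("unsupported version " + version);
            }
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] meta = writeMeta("some_flat_format");
        System.out.println(readMeta(meta)); // prints some_flat_format
    }
}
```

The reader side would then use the returned name to choose which flat vectors reader to construct, which is the step marked "TODO load format" in the example above.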
I think we are at a good phase 0 here.
Though I don't think we should allow on_disk_rescore to be set to true on things that don't support it yet.
This reverts commit e4ef744.
Adds classes that form a generic layer over HNSW and DiskBBQ flat vector storage, allowing the flat vector format to be swapped out without changing the HNSW/DiskBBQ format. The format to use for flat vectors is stored in the top-level metadata, and loaded from the top-level format via a string key (as the relevant formats are not necessarily registered with SPI).
The previous DirectIO JVM option is removed, pending the addition of an on_disk_rescore index option. This does not change the actual formats that are used by ES, but later PRs will build on this infrastructure. Until then, direct IO is not available.
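As a rough illustration of loading a format "via a string key" rather than through SPI, the lookup can be thought of as an explicit map owned by the top-level format. This is a minimal sketch under stated assumptions: FlatFormatRegistrySketch, FlatFormat, and the keys "format_a" and "format_b" are hypothetical and are not the names used in the PR.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Sketch only: a hypothetical explicit registry, standing in for SPI-based discovery. */
final class FlatFormatRegistrySketch {
    /** Hypothetical stand-in for a flat vectors format. */
    static final class FlatFormat {
        final String name;

        FlatFormat(String name) {
            this.name = name;
        }
    }

    // The top-level format lists the flat formats it can read, keyed by the stored name
    private static final Map<String, Supplier<FlatFormat>> SUPPORTED = Map.of(
        "format_a", () -> new FlatFormat("format_a"),
        "format_b", () -> new FlatFormat("format_b")
    );

    private FlatFormatRegistrySketch() {}

    /** Resolves the key read from segment metadata; unknown keys fail with a clear error. */
    static FlatFormat lookup(String key) {
        Supplier<FlatFormat> supplier = SUPPORTED.get(key);
        if (supplier == null) {
            throw new IllegalStateException("Unknown flat vector format [" + key + "]");
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        System.out.println(lookup("format_a").name); // prints format_a
    }
}
```

Throwing on an unknown key is a design choice of this sketch; the fieldsReader snippet reviewed earlier in the conversation returns null in that case and leaves the handling to the caller.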