T digest field mapper #137546
Conversation
```java
/**
 * Verify that, at a range of compression values, the size of the produced digest is not much larger than 10 times the compression
 */
public void testCompression() {
```
This is tangential to the main point of this PR, but part of the exploratory work and seems like a reasonable test in general.
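For reference, a minimal sketch of the shape such a test could take, assuming the usual ESTestCase/Hamcrest setup; `encodedDigest` and `randomSamples` are hypothetical helpers, not the PR's actual utilities, and the 10x bound comes from the javadoc above:

```java
public void testCompression() {
    for (int compression : new int[] { 50, 100, 200, 1000 }) {
        // Hypothetical helpers: build a digest over many random samples and
        // serialize it the same way the mapper stores it.
        byte[] encoded = encodedDigest(randomSamples(100_000), compression);
        // A t-digest retains on the order of `compression` centroids, so the
        // encoded size should stay within a small multiple of compression.
        assertThat((long) encoded.length, lessThanOrEqualTo(10L * compression));
    }
}
```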
```diff
  // Supported tdigest types.
- protected enum Type {
+ public enum Type {
```
Expanding the visibility here so we can use it as a parameter to the field.
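Presumably something along these lines, assuming a generic enum helper like `Parameter.enumParam`; the parameter name, field name, and default constant below are illustrative guesses, not the PR's code:

```java
// Illustrative only: wiring the now-public enum up as a mapping parameter.
private final Parameter<TDigestFieldMapper.Type> implementation = Parameter.enumParam(
    "implementation",                 // hypothetical parameter name
    false,                            // not updateable after index creation
    m -> toType(m).implementation,
    TDigestFieldMapper.Type.MERGING,  // assumed default variant
    TDigestFieldMapper.Type.class
);
```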
```java
public static class Builder extends FieldMapper.Builder {
    private static final int DEFAULT_COMPRESSION = 100;
    private static final int MAXIMUM_COMPRESSION = 10000;
```
10000 is probably way too high, but it gives us good headroom. 1000 is already quite a lot of accuracy.
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
```java
);
this.compression = Parameter.intParam("compression", false, m -> toType(m).compression, DEFAULT_COMPRESSION).addValidator(c -> {
    if (c <= 0 || c > MAXIMUM_COMPRESSION) {
        throw new IllegalArgumentException("compression must be a positive integer between 1 and " + MAXIMUM_COMPRESSION);
```
Nit: maybe list the passed value as well.
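For instance (illustrative wording only):

```java
// One way to address the nit: include the rejected value in the message.
throw new IllegalArgumentException(
    "compression must be a positive integer between 1 and " + MAXIMUM_COMPRESSION + ", got [" + c + "]"
);
```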
```java
b.startObject();
// ...
value.reset(binaryValue);
b.startArray("centroids");
```
centroids is definitely more accurate. Still, I wonder if we should use values here, for consistency with the existing histogram.
I prefer the accuracy in naming, personally. It's not like we expect users to be looking at the raw data most of the time, so this should be pretty hidden either way.
```java
b.endArray();
// ...
value.reset(binaryValue);
b.startArray("counts");
```
Nit: move centroids and counts to static constants that are also shared with the parser class.
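Roughly like the following; the holder class name here is made up for illustration:

```java
// Hypothetical shared home for the field names, used by both the mapper's
// XContent output and TDigestParser.
public final class TDigestFields {
    public static final ParseField CENTROIDS_FIELD = new ParseField("centroids");
    public static final ParseField COUNTS_FIELD = new ParseField("counts");

    private TDigestFields() {}
}

// In the mapper, instead of the string literals:
b.startArray(TDigestFields.CENTROIDS_FIELD.getPreferredName());
```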
```java
private static final ParseField VALUES_FIELD = new ParseField("centroids");
```
```java
/**
 * A parsed histogram field, can represent either a T-Digest or a HDR histogram.
```
No HDR here..
```java
/**
 * Parses an XContent object into a histogram.
 * The parse is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.
```
Suggested change:

```diff
- * The parse is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.
+ * The parser is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.
```
Looks like this typo was copy pasted from me, could you also fix it in ExponentialHistogramParser and HistogramParser while at it?
```java
 * @param values the centroids, guaranteed to be distinct and in increasing order
 * @param counts the counts, guaranteed to be non-negative and of the same length as values
 */
public record ParsedHistogram(List<Double> values, List<Long> counts) {}
```
Let's be consistent with values vs centroids.
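For example (a sketch, assuming the record keeps its current name):

```java
/**
 * @param centroids the centroids, guaranteed to be distinct and in increasing order
 * @param counts the counts, guaranteed to be non-negative and of the same length as centroids
 */
public record ParsedHistogram(List<Double> centroids, List<Long> counts) {}
```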
...nalytics/src/test/java/org/elasticsearch/xpack/analytics/mapper/TDigestFieldMapperTests.java
```java
@Override
protected Object getSampleValueForDocument() {
    // TODO - In hybrid mode, this will not even build a t-digest. Let's test with bigger data
    return Map.of("centroids", new double[] { 2, 3 }, "counts", new int[] { 0, 4 });
```
Add randomization?
SyntheticSourceExample below has logic for generating random histograms; it can be moved to a util and reused.
I don't want to just use random histogram. Most of the tests should take actual, valid t-digests, and we can do specific tests for things that don't fit elsewhere. At any rate, I'd been planning to do this in a follow up PR (thus the todo), but I can do it now if you want.
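A rough sketch of a sample-value helper that satisfies the invariants documented on ParsedHistogram (strictly increasing centroids, non-negative counts); the helper name is hypothetical and the random utilities are the standard ESTestCase ones:

```java
// Hypothetical helper producing a sample document that satisfies the parser's
// invariants: distinct, strictly increasing centroids with non-negative counts.
protected Object randomValidTDigestValue() {
    int size = randomIntBetween(1, 100);
    List<Double> centroids = new ArrayList<>(size);
    List<Long> counts = new ArrayList<>(size);
    double centroid = randomDoubleBetween(-1_000_000, 1_000_000, true);
    for (int i = 0; i < size; i++) {
        centroids.add(centroid);
        counts.add((long) randomIntBetween(0, 10_000));
        centroid += randomDoubleBetween(0.01, 100, true); // keep strictly increasing
    }
    return Map.of("centroids", centroids, "counts", counts);
}
```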
kkrik-es left a comment:
Nice, just a few nits.
```java
public class TDigestParser {

    private static final ParseField COUNTS_FIELD = new ParseField("counts");
    private static final ParseField VALUES_FIELD = new ParseField("centroids");
```
If you are going for a different member-name on the former values field, you should probably also rename the constant:
Suggested change:

```diff
- private static final ParseField VALUES_FIELD = new ParseField("centroids");
+ private static final ParseField CENTROIDS_FIELD = new ParseField("centroids");
```
Part of #137649
This PR adds a T-Digest specific field type to Elasticsearch. This allows for storing pre-computed percentile sketches using the t-digest algorithm, for later aggregation.
For this initial PR, it is mostly just a copy of the existing Histogram field mapper. I've added field level parameters for the t-digest implementation algorithm and compression parameters, with sensible defaults. This first iteration doesn't do any new input validation, so if a user sends a huge list of values, we'll just store it right now. In the future, we'd like to put some safeguards around that.
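For illustration, a mapping using the new field type might look like the following; only the `compression` parameter is confirmed by the diffs above, while the type name and the index name are assumptions:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "latency_percentiles": {
        "type": "tdigest",
        "compression": 200
      }
    }
  }
}
```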
I'm labeling this `non-issue`, as it's still behind a feature flag and thus we don't want it in the release notes. That said, it will eventually be an `enhancement`, so there's no plan to backport it.