
Conversation

@not-napoleon (Member) commented Nov 3, 2025

Part of #137649

This PR adds a T-Digest-specific field type to Elasticsearch. This allows storing pre-computed percentile sketches, built with the t-digest algorithm, for later aggregation.

For this initial PR, it is mostly a copy of the existing Histogram field mapper. I've added field-level parameters for the t-digest implementation algorithm and compression, with sensible defaults. This first iteration doesn't do any new input validation, so if a user sends a huge list of values, we'll just store it as-is for now. In the future, we'd like to put some safeguards around that.

I'm labeling this >non-issue, as it's still behind a feature flag and thus we don't want it in the release notes. That said, it will eventually be an >enhancement, and there's no plan to backport it.
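The stored representation (centroid means plus counts) is enough to answer percentile queries later. As a toy illustration of why, here is a minimal sketch assuming parallel sorted arrays; the class and method names are made up, and it snaps to a centroid rather than interpolating the way the real t-digest does:

```java
// Toy sketch (class and method are made up; NOT the Elasticsearch or t-digest
// implementation): estimate a quantile from a stored sketch represented as
// parallel centroid/count arrays by walking the cumulative counts.
public class TDigestPercentileSketch {
    // centroids must be sorted ascending; counts[i] is the weight of centroids[i].
    public static double quantile(double[] centroids, long[] counts, double q) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double target = q * total;
        long seen = 0;
        for (int i = 0; i < centroids.length; i++) {
            seen += counts[i];
            if (seen >= target) {
                return centroids[i];
            }
        }
        return centroids[centroids.length - 1];
    }

    public static void main(String[] args) {
        double[] centroids = { 1.0, 5.0, 10.0 };
        long[] counts = { 50, 30, 20 };
        System.out.println(quantile(centroids, counts, 0.5)); // median lands in the first centroid
    }
}
```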

/**
* Verify that, across a range of compression values, the size of the produced digest is not much larger than 10 times the compression
*/
public void testCompression() {
@not-napoleon (Member Author) commented:

This is tangential to the main point of this PR, but part of the exploratory work and seems like a reasonable test in general.
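As a rough illustration of the property such a test checks, here is a toy digest (made-up class; NOT the real t-digest algorithm) that greedily merges the closest adjacent centroids until it fits the compression budget, then verifies the size bound:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration of a compression-vs-size check. This is NOT the real
// t-digest merging rule: it simply merges the closest adjacent pair of
// centroids (weighted by count) until at most `compression` centroids remain.
public class ToyDigest {
    // Each entry is {mean, count}; input must be sorted by mean.
    public static List<double[]> compress(List<double[]> centroids, int compression) {
        List<double[]> out = new ArrayList<>(centroids);
        while (out.size() > compression) {
            int best = 0; // left member of the closest adjacent pair
            for (int i = 1; i < out.size() - 1; i++) {
                if (out.get(i + 1)[0] - out.get(i)[0] < out.get(best + 1)[0] - out.get(best)[0]) {
                    best = i;
                }
            }
            double[] a = out.get(best);
            double[] b = out.remove(best + 1);
            double count = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / count; // merged weighted mean
            a[1] = count;
        }
        return out;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int compression : new int[] { 10, 100, 1000 }) {
            List<double[]> centroids = new ArrayList<>();
            for (int i = 0; i < 5000; i++) {
                centroids.add(new double[] { random.nextDouble() * 1000, 1 });
            }
            centroids.sort((x, y) -> Double.compare(x[0], y[0]));
            if (compress(centroids, compression).size() > compression) {
                throw new AssertionError("digest larger than compression bound");
            }
        }
        System.out.println("size bounds hold");
    }
}
```

The real t-digest uses a quantile-dependent size limit per centroid rather than this nearest-pair rule, but the shape of the assertion (final size bounded by a function of compression) is the same idea.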


// Supported tdigest types.
- protected enum Type {
+ public enum Type {
@not-napoleon (Member Author) commented:

Expanding the visibility here so we can use it as a parameter to the field.


public static class Builder extends FieldMapper.Builder {
private static final int DEFAULT_COMPRESSION = 100;
private static final int MAXIMUM_COMPRESSION = 10000;
@not-napoleon (Member Author) commented:

10000 is probably way too high, but it gives us good headroom. 1000 is already quite a lot of accuracy.

@not-napoleon not-napoleon marked this pull request as ready for review November 5, 2025 18:53
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-storage-engine (Team:StorageEngine)

);
this.compression = Parameter.intParam("compression", false, m -> toType(m).compression, DEFAULT_COMPRESSION).addValidator(c -> {
if (c <= 0 || c > MAXIMUM_COMPRESSION) {
throw new IllegalArgumentException("compression must be a positive integer between 1 and " + MAXIMUM_COMPRESSION);
A contributor commented:

Nit: maybe include the passed value in the error message as well.
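A standalone sketch of what that nit suggests, with the rejected value appended to the message. The class name and exact wording are illustrative, not the PR's code (where this logic lives inside the mapper's Parameter validator); the bounds mirror the PR's constants:

```java
// Illustrative standalone version of the compression validator, including the
// rejected value in the error message as the reviewer suggests. Class name and
// message wording are made up; the bounds mirror the PR's constants.
public class CompressionValidator {
    static final int MAXIMUM_COMPRESSION = 10000;

    public static void validate(int compression) {
        if (compression <= 0 || compression > MAXIMUM_COMPRESSION) {
            throw new IllegalArgumentException(
                "compression must be a positive integer between 1 and " + MAXIMUM_COMPRESSION
                    + ", got [" + compression + "]"
            );
        }
    }

    public static void main(String[] args) {
        validate(100); // the default value passes
        try {
            validate(0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // message now names the bad value
        }
    }
}
```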

b.startObject();

value.reset(binaryValue);
b.startArray("centroids");
A contributor commented:

centroids is definitely more accurate. Still, I wonder if we should use values here, for consistency with the existing histogram.

@not-napoleon (Member Author) replied:

I prefer the accuracy in naming, personally. It's not like we expect users to be looking at the raw data most of the time, so this should be pretty hidden either way.

b.endArray();

value.reset(binaryValue);
b.startArray("counts");
A contributor commented:

Nit: move centroids and counts to static constants that are also shared with the parser class.

private static final ParseField VALUES_FIELD = new ParseField("centroids");

/**
* A parsed histogram field, can represent either a T-Digest or a HDR histogram.
A contributor commented:

No HDR here.


/**
* Parses an XContent object into a histogram.
* The parse is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.
A contributor commented:

Suggested change:
- * The parse is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.
+ * The parser is expected to point at the next token after {@link XContentParser.Token#START_OBJECT}.

A contributor commented:

Looks like this typo was copy-pasted from me; could you also fix it in ExponentialHistogramParser and HistogramParser while you're at it?

* @param values the centroids, guaranteed to be distinct and in increasing order
* @param counts the counts, guaranteed to be non-negative and of the same length as values
*/
public record ParsedHistogram(List<Double> values, List<Long> counts) {}
A contributor commented:

Let's be consistent with values vs centroids.
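Whatever name wins, the record's javadoc promises distinct, increasing values and non-negative counts of matching length. One way to keep those guarantees honest is a compact constructor; a sketch (not the actual PR code, which relies on the parser to establish these invariants):

```java
import java.util.List;

// Sketch: enforce the record's documented guarantees (distinct, increasing
// values; non-negative counts of matching length) in a compact constructor.
// Not the actual PR code, just an illustration of the invariants.
public record ParsedHistogram(List<Double> values, List<Long> counts) {
    public ParsedHistogram {
        if (values.size() != counts.size()) {
            throw new IllegalArgumentException("values and counts must have the same length");
        }
        for (int i = 0; i < values.size(); i++) {
            if (counts.get(i) < 0) {
                throw new IllegalArgumentException("counts must be non-negative");
            }
            if (i > 0 && values.get(i) <= values.get(i - 1)) {
                throw new IllegalArgumentException("values must be distinct and in increasing order");
            }
        }
    }
}
```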

@Override
protected Object getSampleValueForDocument() {
// TODO - In hybrid mode, this will not even build a t-digest. Let's test with bigger data
return Map.of("centroids", new double[] { 2, 3 }, "counts", new int[] { 0, 4 });
A contributor commented:

Add randomization?

A contributor commented:

SyntheticSourceExample below has logic for generating random histograms; it can be moved to a util and reused.

@not-napoleon (Member Author) replied:

I don't want to just use a random histogram. Most of the tests should take actual, valid t-digests, and we can do specific tests for things that don't fit elsewhere. At any rate, I'd been planning to do this in a follow-up PR (thus the TODO), but I can do it now if you want.
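Whenever randomization does land, the generated sample still has to satisfy the field's invariants. A sketch of such a generator (hypothetical test util, not part of this PR) producing distinct, increasing centroids with non-negative counts:

```java
import java.util.Random;
import java.util.TreeSet;

// Hypothetical test util (not part of the PR): generate a random but valid
// sample of `size` distinct, increasing centroids with non-negative counts.
public class RandomTDigestSample {
    public static double[][] generate(Random random, int size) {
        TreeSet<Double> distinct = new TreeSet<>(); // sorted and deduplicated
        while (distinct.size() < size) {
            distinct.add(random.nextDouble() * 1000);
        }
        double[] centroids = new double[size];
        double[] counts = new double[size];
        int i = 0;
        for (double c : distinct) {
            centroids[i] = c;
            counts[i] = random.nextInt(1000); // non-negative by construction
            i++;
        }
        return new double[][] { centroids, counts };
    }
}
```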

@kkrik-es (Contributor) left a review:

Nice, just a few nits.

public class TDigestParser {

private static final ParseField COUNTS_FIELD = new ParseField("counts");
private static final ParseField VALUES_FIELD = new ParseField("centroids");
A contributor commented:

If you are going for a different member name on the former values field, you should probably also rename the constant:

Suggested change:
- private static final ParseField VALUES_FIELD = new ParseField("centroids");
+ private static final ParseField CENTROIDS_FIELD = new ParseField("centroids");

@not-napoleon not-napoleon enabled auto-merge (squash) November 6, 2025 14:37
@not-napoleon not-napoleon merged commit d78c7e6 into elastic:main Nov 6, 2025
34 checks passed
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
Add a T-Digest specific field type to Elasticsearch, storing pre-computed percentile sketches using the t-digest algorithm for later aggregation.

Co-authored-by: Kostas Krikellas <[email protected]>

4 participants