tablet metrics - less proto operations + batching #29347
base: main
Pull request overview
This PR optimizes tablet metrics handling in Hive by introducing a lightweight TMetrics struct to replace frequent protobuf operations and adding a batching mechanism for database updates.
Key changes:
- Introduces TMetrics struct as a non-protobuf alternative to NKikimrTabletBase::TMetrics for in-memory operations
- Replaces direct field access via protobuf getters/setters with direct struct member access
- Adds batching infrastructure (ProcessMetricsQueue, EnqueueUpdateMetrics, TTxProcessMetrics) to defer and batch database writes
- Removes immediate protobuf serialization to database in tx__update_tablet_metrics.cpp
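The batching idea in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's implementation: only the names EnqueueUpdateMetrics, ProcessMetricsQueue and UpdateMetricsEnqueued come from the summary, while TTabletSketch, TMetricsBatcherSketch, the Cpu field and the in-memory Persisted map (standing in for the database table) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical stand-in for a tablet record; only the flag name
// UpdateMetricsEnqueued is taken from the PR summary.
struct TTabletSketch {
    uint64_t Id = 0;
    uint64_t Cpu = 0;                   // latest in-memory metric value (illustrative)
    bool UpdateMetricsEnqueued = false; // true while the tablet waits in the queue
};

class TMetricsBatcherSketch {
public:
    // Enqueue the tablet at most once; repeated updates before the next
    // flush only change the in-memory value, not the queue.
    void EnqueueUpdateMetrics(TTabletSketch& tablet) {
        if (!tablet.UpdateMetricsEnqueued) {
            tablet.UpdateMetricsEnqueued = true;
            Queue.push_back(&tablet);
        }
    }

    // Drain up to maxBatch tablets and write their latest values.
    // Returns the number of rows written in this batch.
    size_t ProcessMetricsQueue(size_t maxBatch) {
        size_t written = 0;
        while (written < maxBatch && !Queue.empty()) {
            TTabletSketch* tablet = Queue.front();
            Queue.pop_front();
            tablet->UpdateMetricsEnqueued = false;
            Persisted[tablet->Id] = tablet->Cpu; // stands in for the database write
            ++written;
        }
        return written;
    }

    bool HasPending() const { return !Queue.empty(); }

    // Stands in for the persistent metrics table (assumption for the sketch).
    std::unordered_map<uint64_t, uint64_t> Persisted;

private:
    std::deque<TTabletSketch*> Queue;
};
```

With this shape, a tablet that reports metrics many times between flushes costs one queue entry and one write, which is the saving the batching is presumably after.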
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ydb/core/protos/counters_hive.proto | Adds new transaction type TXTYPE_PROCESS_METRICS for batched metrics processing |
| ydb/core/mind/hive/ya.make | Registers new tx__process_metrics.cpp transaction file |
| ydb/core/mind/hive/hive.h | Defines TMetrics struct with scalar fields and vector fields for throughput/iops records |
| ydb/core/mind/hive/hive.cpp | Implements TMetrics::operator+= and ToProto() for converting to protobuf |
| ydb/core/mind/hive/hive_impl.h | Updates function signatures to use TMetrics instead of protobuf, adds batching infrastructure |
| ydb/core/mind/hive/hive_impl.cpp | Implements batching logic, updates aggregate functions to work with TMetrics |
| ydb/core/mind/hive/hive_events.h | Adds TEvProcessMetrics event for triggering batched processing |
| ydb/core/mind/hive/tablet_info.h | Changes ResourceValues from protobuf to TMetrics, adds UpdateMetricsEnqueued flag |
| ydb/core/mind/hive/tablet_info.cpp | Updates resource tracking to use direct field access instead of protobuf operations |
| ydb/core/mind/hive/node_info.h | Updates function signatures to accept TMetrics instead of protobuf |
| ydb/core/mind/hive/node_info.cpp | Updates resource tracking functions to work with TMetrics |
| ydb/core/mind/hive/tx__update_tablet_metrics.cpp | Removes immediate protobuf serialization to database (moved to batching) |
| ydb/core/mind/hive/tx__seize_tablets.cpp | Updates to use ToProto() instead of CopyFrom() |
| ydb/core/mind/hive/tx__create_tablet.cpp | Updates to use ToProto() instead of CopyFrom()/MergeFrom() |
| ydb/core/mind/hive/hive_statics.cpp | Adds overloads for resource value formatting/aggregation functions for TMetrics |
| ydb/core/mind/hive/monitoring.cpp | Adds JSON serialization for TMetrics, removes debug tooltip with proto string |
| ydb/core/mind/hive/hive_ut.cpp | Adds performance test for batched metrics updates |
| ydb/core/mind/hive/hive_impl_ut.cpp | Updates test to use direct field access |
| ydb/core/mind/hive/fill.cpp | Simplifies logging by removing resource value details |
| ydb/core/mind/hive/drain.cpp | Simplifies logging by removing resource value details |
| ydb/core/mind/hive/balancer.cpp | Simplifies logging by removing resource value details |
| ui64 GetReadThroughput(const NKikimrTabletBase::TMetrics& values); | ||
| ui64 GetWriteThroughput(const NKikimrTabletBase::TMetrics& values); |
Copilot
AI
Nov 21, 2025
The TMetrics overloads for GetReadThroughput and GetWriteThroughput are implemented in hive_statics.cpp but not declared in the header. Consider adding:
ui64 GetReadThroughput(const TMetrics& values);
ui64 GetWriteThroughput(const TMetrics& values);
This improves code clarity and allows the functions to be used elsewhere if needed.
| Self->GetDefaultResourceValuesForTabletType(tablet.Type).ToProto(&resourceValues); | ||
| BLOG_D("THive::TTxCreateTablet::Execute; Default resources after merge for type " << tablet.Type << ": {" << resourceValues.ShortDebugString() << "}"); | ||
| if (IsValidObjectId(tablet.ObjectId)) { | ||
| resourceValues.MergeFrom(Self->GetDefaultResourceValuesForObject(tablet.ObjectId)); | ||
| Self->GetDefaultResourceValuesForObject(tablet.ObjectId).ToProto(&resourceValues); | ||
| BLOG_D("THive::TTxCreateTablet::Execute; Default resources after merge for object " << tablet.ObjectId << ": {" << resourceValues.ShortDebugString() << "}"); | ||
| } | ||
| // TODO: provide Hive with resource profile used by the tablet instead of default one. | ||
| resourceValues.MergeFrom(Self->GetDefaultResourceValuesForProfile(tablet.Type, "default")); | ||
| Self->GetDefaultResourceValuesForProfile(tablet.Type, "default").ToProto(&resourceValues); |
Copilot
AI
Nov 21, 2025
The original code used MergeFrom on lines 448 and 452 (as shown in the diff), which merges protobuf fields by taking the maximum or combining values. The new ToProto method only sets non-zero values and doesn't implement merge semantics. This changes the behavior: instead of merging default values from type, object, and profile, this code now overwrites with each successive call. To fix this, you should either:
- Accumulate metrics in a TMetrics object first (using operator+= or a merge method), then call ToProto once at the end, or
- Implement merge semantics in ToProto (e.g., only overwrite if the proto field is not already set).
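The first option could look roughly like the sketch below: combine the type, object and profile defaults in a plain struct, then convert once. Everything here is a hypothetical reduction for illustration. TMetricsSketch, TProtoSketch, BuildDefaultResourceValues and the CPU/Memory fields are not the PR's types, and the merge rule (a non-zero value from a later source replaces the earlier one) is an assumption; substitute whatever rule the defaults actually need.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced stand-in for the PR's TMetrics (two scalar fields only).
struct TMetricsSketch {
    uint64_t CPU = 0;
    uint64_t Memory = 0;

    // Assumed merge rule: a non-zero value from `other` replaces ours,
    // a zero value leaves ours untouched.
    void MergeFrom(const TMetricsSketch& other) {
        if (other.CPU != 0) { CPU = other.CPU; }
        if (other.Memory != 0) { Memory = other.Memory; }
    }
};

// Hypothetical stand-in for the protobuf message, with explicit presence flags.
struct TProtoSketch {
    bool HasCPU = false;
    uint64_t CPU = 0;
    bool HasMemory = false;
    uint64_t Memory = 0;
};

// Mirrors the behavior described in the review: only non-zero values are set.
void ToProto(const TMetricsSketch& metrics, TProtoSketch* proto) {
    if (metrics.CPU != 0) { proto->HasCPU = true; proto->CPU = metrics.CPU; }
    if (metrics.Memory != 0) { proto->HasMemory = true; proto->Memory = metrics.Memory; }
}

// Merge all three sources in memory, then touch the proto exactly once.
TProtoSketch BuildDefaultResourceValues(const TMetricsSketch& byType,
                                        const TMetricsSketch& byObject,
                                        const TMetricsSketch& byProfile) {
    TMetricsSketch merged = byType;
    merged.MergeFrom(byObject);
    merged.MergeFrom(byProfile);
    TProtoSketch proto;
    ToProto(merged, &proto);
    return proto;
}
```

Keeping the merge rule in one named method makes the intended semantics explicit and reviewable, instead of leaving them implied by the order of ToProto calls.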
| ReadThroughput += other.ReadThroughput; | ||
| WriteThroughput += other.WriteThroughput; | ||
| ReadIops += other.ReadIops; | ||
| WriteIops += other.WriteIops; |
Copilot
AI
Nov 21, 2025
The operator+= implementation doesn't handle the vector fields (GroupReadThroughput, GroupWriteThroughput, GroupReadIops, GroupWriteIops). This means aggregating metrics with += will lose these fields. Consider either:
- Adding logic to merge/append these vectors (depending on the intended semantics), or
- Documenting that this operator is only meant for aggregating scalar fields if that's intentional.
| WriteIops += other.WriteIops; | |
| WriteIops += other.WriteIops; | |
| GroupReadThroughput.insert(GroupReadThroughput.end(), other.GroupReadThroughput.begin(), other.GroupReadThroughput.end()); | |
| GroupWriteThroughput.insert(GroupWriteThroughput.end(), other.GroupWriteThroughput.begin(), other.GroupWriteThroughput.end()); | |
| GroupReadIops.insert(GroupReadIops.end(), other.GroupReadIops.begin(), other.GroupReadIops.end()); | |
| GroupWriteIops.insert(GroupWriteIops.end(), other.GroupWriteIops.begin(), other.GroupWriteIops.end()); |
It really is intentional; "merging" the vectors like that would make no sense.
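The intent described in this reply can be made explicit in code. The sketch below is a hypothetical reduction, not the PR's struct: the scalar and vector field names follow the snippets quoted in this thread, but TMetricsSketch and the element type of the per-group vectors (a group id paired with a value) are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical reduction of TMetrics; the vector element type is assumed.
struct TMetricsSketch {
    uint64_t ReadThroughput = 0;
    uint64_t WriteThroughput = 0;
    uint64_t ReadIops = 0;
    uint64_t WriteIops = 0;

    // Per-group records, assumed here to be (group id, value) pairs.
    std::vector<std::pair<uint32_t, uint64_t>> GroupReadThroughput;
    std::vector<std::pair<uint32_t, uint64_t>> GroupWriteThroughput;

    // Aggregates scalar totals only. Per-group vectors are intentionally
    // left untouched: concatenating records from different tablets would
    // not produce a meaningful aggregate.
    TMetricsSketch& operator+=(const TMetricsSketch& other) {
        ReadThroughput += other.ReadThroughput;
        WriteThroughput += other.WriteThroughput;
        ReadIops += other.ReadIops;
        WriteIops += other.WriteIops;
        return *this;
    }
};
```

A comment like the one on operator+= addresses the second option from the review (documenting that the operator is scalar-only) without changing behavior.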
| TTabletInfo* tablet = Self->FindTablet(tabletId, followerId); | ||
| if (tablet != nullptr && metrics.HasResourceUsage()) { | ||
| tablet->UpdateResourceUsage(metrics.GetResourceUsage()); | ||
| const NKikimrTabletBase::TMetrics& metrics(tablet->GetResourceValues()); | ||
| db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::ProtoMetrics>(metrics); | ||
| db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::MaximumCPU>(tablet->GetResourceMetricsAggregates().MaximumCPU); | ||
| db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::MaximumMemory>(tablet->GetResourceMetricsAggregates().MaximumMemory); |
Copilot
AI
Nov 21, 2025
The removal of the ProtoMetrics database update is problematic. The code in tx__load_everything.cpp (line 774) still tries to load Schema::Metrics::ProtoMetrics from the database during startup. Without persisting these metrics, tablets will lose their resource values after a restart, which could lead to incorrect balancing decisions.
If the intention is to batch these updates in the new TTxProcessMetrics transaction, ensure that:
- EnqueueUpdateMetrics() is called from UpdateResourceUsage()
- The TTxProcessMetrics transaction properly persists the metrics to the database
- The batching interval is reasonable to avoid excessive data loss on restart
ydb/core/mind/hive/hive_impl.cpp
Outdated
| tabletInfo.SetRestartsPerPeriod(follower.GetRestartsPerPeriod(restartsBarrierTime)); | ||
| if (req.GetReturnMetrics()) { | ||
| tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues()); | ||
| //tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues()); |
Copilot
AI
Nov 21, 2025
The code for returning follower metrics is commented out. If this is intentional (e.g., follower metrics are no longer being tracked), this line should be removed entirely rather than left as a comment. If this is unintentional, it should be updated to:
follower.GetResourceValues().ToProto(tabletInfo.MutableMetrics());
| //tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues()); | |
| follower.GetResourceValues().ToProto(tabletInfo.MutableMetrics()); |
0c1dba6 to bc9b995