@vporyadke
Collaborator

Changelog entry

...

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

...

@github-actions

github-actions bot commented Nov 21, 2025

2025-11-21 16:54:49 UTC Pre-commit check linux-x86_64-release-asan for 5190f5b has started.
2025-11-21 16:55:26 UTC Artifacts will be uploaded here
2025-11-21 16:57:21 UTC ya make is running...
🔴 2025-11-21 16:58:17 UTC Build failed, see the logs. Also see fail summary

@github-actions

🟢 2025-11-21 16:55:17 UTC The validation of the Pull Request description is successful.

@github-actions

github-actions bot commented Nov 21, 2025

2025-11-21 16:56:01 UTC Pre-commit check linux-x86_64-relwithdebinfo for 5190f5b has started.
2025-11-21 16:56:19 UTC Artifacts will be uploaded here
2025-11-21 16:58:27 UTC ya make is running...
🔴 2025-11-21 16:59:37 UTC Build failed, see the logs. Also see fail summary

Copilot finished reviewing on behalf of vporyadke November 21, 2025 16:57
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes tablet metrics handling in Hive by introducing a lightweight TMetrics struct to replace frequent protobuf operations and adding a batching mechanism for database updates.

Key changes:

  • Introduces TMetrics struct as a non-protobuf alternative to NKikimrTabletBase::TMetrics for in-memory operations
  • Replaces protobuf getter/setter calls with direct struct member access
  • Adds batching infrastructure (ProcessMetricsQueue, EnqueueUpdateMetrics, TTxProcessMetrics) to defer and batch database writes
  • Removes immediate protobuf serialization to database in tx__update_tablet_metrics.cpp
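
As a rough illustration of the first two bullets, the sketch below shows a plain struct whose fields are read and written directly and converted to a protobuf-like message only at the boundary. TMetricsSketch, TProtoMetricsStub and TOptionalField are hypothetical stand-ins, not the PR's actual types, and only the four scalar fields named in this review are modelled. The rule that ToProto writes only non-zero values follows the review comment further down; everything else is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one optional protobuf field: a value plus a presence bit.
struct TOptionalField {
    bool Set = false;
    uint64_t Value = 0;
    void Assign(uint64_t v) { Set = true; Value = v; }
};

// Hypothetical stand-in for the generated message; NOT NKikimrTabletBase::TMetrics.
struct TProtoMetricsStub {
    TOptionalField ReadThroughput;
    TOptionalField WriteThroughput;
    TOptionalField ReadIops;
    TOptionalField WriteIops;
};

// Lightweight in-memory metrics: plain members, no protobuf machinery.
struct TMetricsSketch {
    uint64_t ReadThroughput = 0;
    uint64_t WriteThroughput = 0;
    uint64_t ReadIops = 0;
    uint64_t WriteIops = 0;

    // Convert only at the boundary. Zero values leave the proto field untouched.
    void ToProto(TProtoMetricsStub* proto) const {
        if (ReadThroughput != 0) proto->ReadThroughput.Assign(ReadThroughput);
        if (WriteThroughput != 0) proto->WriteThroughput.Assign(WriteThroughput);
        if (ReadIops != 0) proto->ReadIops.Assign(ReadIops);
        if (WriteIops != 0) proto->WriteIops.Assign(WriteIops);
    }
};
```

Hot-path code would then touch only the plain members (for example `m.ReadThroughput += delta`), and ToProto would run only where a protobuf message is actually required.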

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

File | Description
ydb/core/protos/counters_hive.proto | Adds new transaction type TXTYPE_PROCESS_METRICS for batched metrics processing
ydb/core/mind/hive/ya.make | Registers new tx__process_metrics.cpp transaction file
ydb/core/mind/hive/hive.h | Defines TMetrics struct with scalar fields and vector fields for throughput/iops records
ydb/core/mind/hive/hive.cpp | Implements TMetrics::operator+= and ToProto() for converting to protobuf
ydb/core/mind/hive/hive_impl.h | Updates function signatures to use TMetrics instead of protobuf, adds batching infrastructure
ydb/core/mind/hive/hive_impl.cpp | Implements batching logic, updates aggregate functions to work with TMetrics
ydb/core/mind/hive/hive_events.h | Adds TEvProcessMetrics event for triggering batched processing
ydb/core/mind/hive/tablet_info.h | Changes ResourceValues from protobuf to TMetrics, adds UpdateMetricsEnqueued flag
ydb/core/mind/hive/tablet_info.cpp | Updates resource tracking to use direct field access instead of protobuf operations
ydb/core/mind/hive/node_info.h | Updates function signatures to accept TMetrics instead of protobuf
ydb/core/mind/hive/node_info.cpp | Updates resource tracking functions to work with TMetrics
ydb/core/mind/hive/tx__update_tablet_metrics.cpp | Removes immediate protobuf serialization to database (moved to batching)
ydb/core/mind/hive/tx__seize_tablets.cpp | Updates to use ToProto() instead of CopyFrom()
ydb/core/mind/hive/tx__create_tablet.cpp | Updates to use ToProto() instead of CopyFrom()/MergeFrom()
ydb/core/mind/hive/hive_statics.cpp | Adds overloads for resource value formatting/aggregation functions for TMetrics
ydb/core/mind/hive/monitoring.cpp | Adds JSON serialization for TMetrics, removes debug tooltip with proto string
ydb/core/mind/hive/hive_ut.cpp | Adds performance test for batched metrics updates
ydb/core/mind/hive/hive_impl_ut.cpp | Updates test to use direct field access
ydb/core/mind/hive/fill.cpp | Simplifies logging by removing resource value details
ydb/core/mind/hive/drain.cpp | Simplifies logging by removing resource value details
ydb/core/mind/hive/balancer.cpp | Simplifies logging by removing resource value details

Comment on lines 141 to 142
ui64 GetReadThroughput(const NKikimrTabletBase::TMetrics& values);
ui64 GetWriteThroughput(const NKikimrTabletBase::TMetrics& values);

Copilot AI Nov 21, 2025

The TMetrics overloads for GetReadThroughput and GetWriteThroughput are implemented in hive_statics.cpp but not declared in the header. Consider adding:

ui64 GetReadThroughput(const TMetrics& values);
ui64 GetWriteThroughput(const TMetrics& values);

This improves code clarity and allows the functions to be used elsewhere if needed.

Comment on lines +445 to +452
+ Self->GetDefaultResourceValuesForTabletType(tablet.Type).ToProto(&resourceValues);
  BLOG_D("THive::TTxCreateTablet::Execute; Default resources after merge for type " << tablet.Type << ": {" << resourceValues.ShortDebugString() << "}");
  if (IsValidObjectId(tablet.ObjectId)) {
-     resourceValues.MergeFrom(Self->GetDefaultResourceValuesForObject(tablet.ObjectId));
+     Self->GetDefaultResourceValuesForObject(tablet.ObjectId).ToProto(&resourceValues);
      BLOG_D("THive::TTxCreateTablet::Execute; Default resources after merge for object " << tablet.ObjectId << ": {" << resourceValues.ShortDebugString() << "}");
  }
  // TODO: provide Hive with resource profile used by the tablet instead of default one.
- resourceValues.MergeFrom(Self->GetDefaultResourceValuesForProfile(tablet.Type, "default"));
+ Self->GetDefaultResourceValuesForProfile(tablet.Type, "default").ToProto(&resourceValues);

Copilot AI Nov 21, 2025

The original code used MergeFrom on lines 448 and 452 (as shown in the diff), which merges protobuf fields by taking the maximum or combining values. The new ToProto method only sets non-zero values and doesn't implement merge semantics. This changes the behavior: instead of merging default values from type, object, and profile, this code now overwrites with each successive call. To fix this, you should either:

  1. Accumulate metrics in a TMetrics object first (using operator+= or a merge method), then call ToProto once at the end, or
  2. Implement merge semantics in ToProto (e.g., only overwrite if the proto field is not already set).
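
Option 1 might look like the following sketch: the three layers of defaults are combined in a plain struct, and the protobuf conversion would happen once, after the merge. TMergeableMetrics and MergeDefaults are hypothetical names, and the "non-zero value in the later layer overrides" rule is an assumption chosen for illustration; the PR may intend different semantics (for example summation via operator+=).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-struct metrics; only the scalar fields named in this review.
struct TMergeableMetrics {
    uint64_t ReadThroughput = 0;
    uint64_t WriteThroughput = 0;
    uint64_t ReadIops = 0;
    uint64_t WriteIops = 0;

    // Assumed merge rule: a non-zero value in `other` overrides ours,
    // a zero value in `other` keeps what we already have.
    void MergeFrom(const TMergeableMetrics& other) {
        if (other.ReadThroughput != 0) ReadThroughput = other.ReadThroughput;
        if (other.WriteThroughput != 0) WriteThroughput = other.WriteThroughput;
        if (other.ReadIops != 0) ReadIops = other.ReadIops;
        if (other.WriteIops != 0) WriteIops = other.WriteIops;
    }
};

// Layer the defaults in the same order as the reviewed code:
// tablet type first, then object, then profile.
inline TMergeableMetrics MergeDefaults(const TMergeableMetrics& byType,
                                       const TMergeableMetrics& byObject,
                                       const TMergeableMetrics& byProfile) {
    TMergeableMetrics result = byType;
    result.MergeFrom(byObject);
    result.MergeFrom(byProfile);
    return result;  // the caller would convert to protobuf once, after this point
}
```

Keeping the merge rule in one named method makes the intended semantics explicit, so the final conversion step does not need to know which layer a value came from.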

ReadThroughput += other.ReadThroughput;
WriteThroughput += other.WriteThroughput;
ReadIops += other.ReadIops;
WriteIops += other.WriteIops;

Copilot AI Nov 21, 2025

The operator+= implementation doesn't handle the vector fields (GroupReadThroughput, GroupWriteThroughput, GroupReadIops, GroupWriteIops). This means aggregating metrics with += will lose these fields. Consider either:

  1. Adding logic to merge/append these vectors (depending on the intended semantics), or
  2. Documenting that this operator is only meant for aggregating scalar fields if that's intentional.
Suggested change
- WriteIops += other.WriteIops;
+ WriteIops += other.WriteIops;
+ GroupReadThroughput.insert(GroupReadThroughput.end(), other.GroupReadThroughput.begin(), other.GroupReadThroughput.end());
+ GroupWriteThroughput.insert(GroupWriteThroughput.end(), other.GroupWriteThroughput.begin(), other.GroupWriteThroughput.end());
+ GroupReadIops.insert(GroupReadIops.end(), other.GroupReadIops.begin(), other.GroupReadIops.end());
+ GroupWriteIops.insert(GroupWriteIops.end(), other.GroupWriteIops.begin(), other.GroupWriteIops.end());

Collaborator Author

It really is intentional; there would be no point in "merging" the vectors like that.
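
The scalar-only aggregation discussed in this thread can be sketched as follows. TAggregatableMetrics and TGroupRecordSketch are hypothetical; the real element type of the per-group vectors is not shown in this conversation, and only two of the four vector fields are modelled to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-group record; the real element type is not shown in this thread.
struct TGroupRecordSketch {
    uint32_t GroupId = 0;
    uint64_t Value = 0;
};

struct TAggregatableMetrics {
    uint64_t ReadThroughput = 0;
    uint64_t WriteThroughput = 0;
    uint64_t ReadIops = 0;
    uint64_t WriteIops = 0;
    std::vector<TGroupRecordSketch> GroupReadThroughput;
    std::vector<TGroupRecordSketch> GroupWriteThroughput;

    // Sums the scalar totals only. The per-group vectors are intentionally
    // left untouched, matching the author's stated intent in this thread.
    TAggregatableMetrics& operator+=(const TAggregatableMetrics& other) {
        ReadThroughput += other.ReadThroughput;
        WriteThroughput += other.WriteThroughput;
        ReadIops += other.ReadIops;
        WriteIops += other.WriteIops;
        return *this;
    }
};
```

A short comment on the operator, as in the sketch, would cover the reviewer's second option (documenting that only scalar fields are aggregated) without changing behavior.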

Comment on lines 28 to 33
  TTabletInfo* tablet = Self->FindTablet(tabletId, followerId);
  if (tablet != nullptr && metrics.HasResourceUsage()) {
      tablet->UpdateResourceUsage(metrics.GetResourceUsage());
-     const NKikimrTabletBase::TMetrics& metrics(tablet->GetResourceValues());
-
-     db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::ProtoMetrics>(metrics);

      db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::MaximumCPU>(tablet->GetResourceMetricsAggregates().MaximumCPU);
      db.Table<Schema::Metrics>().Key(tabletId, followerId).Update<Schema::Metrics::MaximumMemory>(tablet->GetResourceMetricsAggregates().MaximumMemory);

Copilot AI Nov 21, 2025

The removal of the ProtoMetrics database update is problematic. The code in tx__load_everything.cpp (line 774) still tries to load Schema::Metrics::ProtoMetrics from the database during startup. Without persisting these metrics, tablets will lose their resource values after a restart, which could lead to incorrect balancing decisions.

If the intention is to batch these updates in the new TTxProcessMetrics transaction, ensure that:

  1. EnqueueUpdateMetrics() is called from UpdateResourceUsage()
  2. The TTxProcessMetrics transaction properly persists the metrics to the database
  3. The batching interval is reasonable to avoid excessive data loss on restart
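
The three checks above can be pictured with a minimal sketch of a deduplicating write queue. The names EnqueueUpdateMetrics, ProcessMetricsQueue, UpdateResourceUsage and UpdateMetricsEnqueued come from this PR's summary; their bodies here, the THiveBatchSketch and TTabletBatchSketch types, the in-memory map standing in for the Metrics table, and the use of a batch-size cap instead of a time interval are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>

// Hypothetical tablet record with a single metric for brevity.
struct TTabletBatchSketch {
    uint64_t Id = 0;
    uint64_t ReadThroughput = 0;
    bool UpdateMetricsEnqueued = false;  // true while the tablet waits in the queue
};

struct THiveBatchSketch {
    std::deque<TTabletBatchSketch*> MetricsQueue;
    std::map<uint64_t, uint64_t> Db;  // stand-in for the persistent Metrics table

    // Queue a tablet for persistence at most once, however often it is updated.
    void EnqueueUpdateMetrics(TTabletBatchSketch* tablet) {
        if (tablet->UpdateMetricsEnqueued) {
            return;
        }
        tablet->UpdateMetricsEnqueued = true;
        MetricsQueue.push_back(tablet);
    }

    // Check 1: every in-memory update also schedules a deferred write.
    void UpdateResourceUsage(TTabletBatchSketch* tablet, uint64_t readThroughput) {
        tablet->ReadThroughput = readThroughput;
        EnqueueUpdateMetrics(tablet);
    }

    // Check 2: the batch step persists the latest value of each queued tablet.
    // Returns the number of rows written; at most maxBatch per call.
    std::size_t ProcessMetricsQueue(std::size_t maxBatch) {
        std::size_t written = 0;
        while (!MetricsQueue.empty() && written < maxBatch) {
            TTabletBatchSketch* tablet = MetricsQueue.front();
            MetricsQueue.pop_front();
            tablet->UpdateMetricsEnqueued = false;
            Db[tablet->Id] = tablet->ReadThroughput;
            ++written;
        }
        return written;
    }
};
```

In this sketch, several updates to the same tablet between two batch runs collapse into one write of the latest value. Anything still in the queue when the process stops is lost, which is why check 3 asks for a reasonable batching interval.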

  tabletInfo.SetRestartsPerPeriod(follower.GetRestartsPerPeriod(restartsBarrierTime));
  if (req.GetReturnMetrics()) {
-     tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues());
+     //tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues());

Copilot AI Nov 21, 2025

The code for returning follower metrics is commented out. If this is intentional (e.g., follower metrics are no longer being tracked), this line should be removed entirely rather than left as a comment. If this is unintentional, it should be updated to:

follower.GetResourceValues().ToProto(tabletInfo.MutableMetrics());
Suggested change
- //tabletInfo.MutableMetrics()->CopyFrom(follower.GetResourceValues());
+ follower.GetResourceValues().ToProto(tabletInfo.MutableMetrics());

@github-actions

github-actions bot commented Nov 24, 2025

2025-11-24 10:43:13 UTC Pre-commit check linux-x86_64-release-asan for e752f22 has started.
2025-11-24 10:43:31 UTC Artifacts will be uploaded here
2025-11-24 10:45:42 UTC ya make is running...
🟡 2025-11-24 12:42:03 UTC Some tests failed, follow the links below. This failure is not in the blocking policy yet

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
14240 | 14081 | 0 | 114 | 32 | 13

🟢 2025-11-24 12:42:13 UTC Build successful.
🟡 2025-11-24 12:42:38 UTC ydbd size 3.8 GiB changed* by +147.1 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 8efc84b | merge: e752f22 | diff | diff %
ydbd size | 4 110 164 488 Bytes | 4 110 315 144 Bytes | +147.1 KiB | +0.004%
ydbd stripped size | 1 527 755 920 Bytes | 1 527 787 216 Bytes | +30.6 KiB | +0.002%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Nov 24, 2025

2025-11-24 10:43:51 UTC Pre-commit check linux-x86_64-relwithdebinfo for e752f22 has started.
2025-11-24 10:44:09 UTC Artifacts will be uploaded here
2025-11-24 10:46:19 UTC ya make is running...
🟡 2025-11-24 12:58:18 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
41504 | 38637 | 0 | 4 | 2837 | 26

2025-11-24 12:58:30 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-11-24 13:09:04 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
56 (only retried tests) | 40 | 0 | 1 | 0 | 15

2025-11-24 13:09:11 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-11-24 13:16:44 UTC Tests successful.

Ya make output | Test bloat | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
32 (only retried tests) | 17 | 0 | 0 | 0 | 15

🟢 2025-11-24 13:16:51 UTC Build successful.
🟡 2025-11-24 13:17:18 UTC ydbd size 2.3 GiB changed* by +102.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 1dd9d52 | merge: e752f22 | diff | diff %
ydbd size | 2 455 293 832 Bytes | 2 455 398 664 Bytes | +102.4 KiB | +0.004%
ydbd stripped size | 523 292 888 Bytes | 523 304 472 Bytes | +11.3 KiB | +0.002%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Nov 26, 2025

2025-11-26 10:18:36 UTC Pre-commit check linux-x86_64-release-asan for 2b183da has started.
2025-11-26 10:18:52 UTC Artifacts will be uploaded here
2025-11-26 10:20:59 UTC ya make is running...
🟡 2025-11-26 12:04:57 UTC Some tests failed, follow the links below. This failure is not in the blocking policy yet

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
13465 | 13351 | 0 | 74 | 29 | 11

🟢 2025-11-26 12:05:07 UTC Build successful.
🟡 2025-11-26 12:05:32 UTC ydbd size 3.8 GiB changed* by +139.1 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 462c794 | merge: 2b183da | diff | diff %
ydbd size | 4 111 743 728 Bytes | 4 111 886 160 Bytes | +139.1 KiB | +0.003%
ydbd stripped size | 1 528 202 992 Bytes | 1 528 226 096 Bytes | +22.6 KiB | +0.002%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Nov 26, 2025

2025-11-26 10:19:03 UTC Pre-commit check linux-x86_64-relwithdebinfo for 2b183da has started.
2025-11-26 10:19:21 UTC Artifacts will be uploaded here
2025-11-26 10:21:32 UTC ya make is running...
🟡 2025-11-26 12:33:03 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
41621 | 38752 | 0 | 4 | 2838 | 27

2025-11-26 12:33:17 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-11-26 12:42:57 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
60 (only retried tests) | 46 | 0 | 0 | 0 | 14

🟢 2025-11-26 12:43:04 UTC Build successful.
🟡 2025-11-26 12:43:23 UTC ydbd size 2.3 GiB changed* by +102.1 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 462c794 | merge: 2b183da | diff | diff %
ydbd size | 2 456 362 792 Bytes | 2 456 467 384 Bytes | +102.1 KiB | +0.004%
ydbd stripped size | 523 441 336 Bytes | 523 452 024 Bytes | +10.4 KiB | +0.002%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@vporyadke vporyadke self-assigned this Nov 26, 2025