Skip to content

Conversation

@frrist
Copy link
Member

@frrist frrist commented Aug 13, 2022

This PR fixes #1031

The previous implementation diffed previous and current partitions directly. This was incorrect since miners
may compact their partitions resulting in sectors moving to different partitions, which the previous code proceeded to record as invalid events. Instead, we now collect all sectors from all miner partitions and diff the sets directly.

- previous implementation diffed previous and current partitions, this was incorrect since miners
may compact their partitions resulting in sectors moving to different partitions. Instead we
collect all sectors from all miner partitions and diff the sets directly.
@frrist frrist self-assigned this Aug 13, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2022

Codecov Report

Merging #1032 (5881258) into master (2a5df7c) will decrease coverage by 1.1%.
The diff coverage is 31.6%.

@@           Coverage Diff            @@
##           master   #1032     +/-   ##
========================================
- Coverage    35.6%   34.4%   -1.2%     
========================================
  Files          44      44             
  Lines        2881    2925     +44     
========================================
- Hits         1027    1008     -19     
- Misses       1750    1821     +71     
+ Partials      104      96      -8     

Copy link
Contributor

@placer14 placer14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me.

A thought: It would probably be easy to write a query which proves the invariant isn't in our exports which we could periodically run over our exported data.

Can you outline/include how this data bug was detected originally? I can see how we can run tests like these in CI. (Ideas @kasteph?)

Edit: Would the check be as simple as taking all Termination events and making sure that these sectors never have additional events after the Termination occurs?

faulted := bitfield.New()
recovered := bitfield.New()
recovering := bitfield.New()
// SectorStates contains a set of bitfields for active, live, fault, and recovering sectors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to mention how this bitfield represents each type of sector represented here. (a packed binary list of sector IDs?)

return out, nil

// previous faulty sectors minus current active sectors are sectors recovered this epoch.
recovered, err := bitfield.IntersectBitField(previous.Faulty, current.Active)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment and code don't match. I assume we only want to see which previous faults overlap with current actives... this makes sense as being "Recovered". Let's update this comment.

// previous faulty sectors which match (intersect) active sectors are sectors recovered this epoch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, bad copy paste

@frrist
Copy link
Member Author

frrist commented Aug 16, 2022

A thought: It would probably be easy to write a query which proves the invariant isn't in our exports which we could periodically run over our exported data.

@davidgasquez is working on this as a part of the dbt workflow.

Edit: Would the check be as simple as taking all Termination events and making sure that these sectors never have additional events after the Termination occurs?

That would be part of the check, but there are more cases, for example - sectors cannot recover before becoming faulted, sectors cannot terminate before being added, faulted sectors cannot become faulty, etc.

A sector lifecycle is a state machine, so we'll want to validate all sector events correspond to valid states.

@placer14
Copy link
Contributor

My approval is sticky. Good catch on accidental context sharing. 🙇

@frrist frrist merged commit 45f15be into master Aug 23, 2022
@frrist frrist deleted the frrist/fix-1031 branch August 23, 2022 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fault/recover event still happen after a termination event

4 participants