- 
                Notifications
    
You must be signed in to change notification settings  - Fork 52
 
feat: add LSM timeline support with timeline layout management for old and new table versions #395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
          Codecov Report❌ Patch coverage is  Additional details and impacted files@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   86.72%   84.63%   -2.09%     
==========================================
  Files          49       52       +3     
  Lines        2637     2747     +110     
==========================================
+ Hits         2287     2325      +38     
- Misses        350      422      +72     ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few note to reviewers.
        
          
                crates/core/src/timeline/layout.rs
              
                Outdated
          
        
      | #[derive(Debug, Clone)] | ||
| pub enum TimelineLayoutType { | ||
| Active(super::active::ActiveTimeline), | ||
| Lsm(super::lsm::LsmTimeline), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think of Timeline as a top level interface where callers can interact with, the layout is an internal config that resolved based on table version and timeline layout version (as you resolved in create_layout(). It'd be a bit odd to see a "Type" struct having a Timeline instance basically for choosing a Timeline to use.
How about we design it like:
Timelineremains as the top-level interface- When a Timeline is created, it resolves 
HudiConfigs, and create an internal model callTimelineLoaderbased on the configs TimelineLoadercan be an enum, it has these items:
LayoutOneActive
LayoutOneArchived
LayoutTwoActive
LayoutTwoCompacted
Each loader will implement the logic to load its intended commit files.
- When some APIs are invoked and need to load archived or compacted commits, the Timeline instance will use the needed 
Loaders to do IO Timelineis responsible for collecting and stitching theLoaders' return data and returning final data to callers- We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level
 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, better suggestion overall! Few things if you can clarify:
We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level
Should the LSM tree logic be a separate module that the loaders use, or should the loaders themselves be the LSM tree implementations? So, in the context of TimelineLoader::LayoutTwoActive, would this variant contain an LSMTree instance, or would it implement LSM tree logic directly?
Also, should the current Timeline::load_instants() method remain as-is, or would you prefer a different public API? For example, another API that takes a boolean to decide whether to load archived/compacted timeline. Alternatively, we could just add another API keeping load_instants() as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope cool! keeping lsm tree specific in lsm_tree.rs is good. may not need to worry about re-using it for now - it can be used by the loaders exclusively, but somewhat positioned for future re-use
Looking closer - so load_instants() is actually a more generic helper function, it's the using of a selector in one of the init functions (new_from_storage()) that defines the behavior of timeline's instants loading. So depends on where load_instants() gets called, we just need to pass in the desired TimelineSelector
…d and new table versions Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
This commit introduces the foundational architecture for supporting LSM timelines in hudi-rs, enabling compatibility with both v6 (active timeline) and v8 (LSM timeline) Hudi table formats. The key changes include: - A `TimelineLoader` enum to abstract different timeline storage formats. - A `TimelineBuilder` to handle configuration resolution and loader instantiation based on `hoodie.table.version` and `hoodie.timeline.layout.version`. - Refactored `Timeline` to use the new loader architecture while maintaining backward compatibility with existing APIs. - Graceful fallback for invalid table configurations to ensure test stability. Signed-off-by: Sagar Sumit <[email protected]>
Implements the orchestration logic within the `Timeline` struct to support querying both active and archived/compacted timeline data. Adds a new `load_instants_with_archive` method which queries the active loader and, if requested, the archived loader, then merges the results. Existing `get_completed_*` methods are updated to use this new orchestration layer while maintaining their original behavior. Signed-off-by: Sagar Sumit <[email protected]>
- Renamed lsm.rs to lsm_tree.rs to better reflect its purpose - Expanded LSMTree struct with core logic for manifest reading - Integrated LSMTree with TimelineLoader for v8 timeline support - Updated TimelineLoader to handle LayoutTwoActive using the new LSM tree logic Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
| } | ||
| 
               | 
          ||
| pub struct LSMTree { | ||
| storage: Arc<Storage>, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to take any hudi configs for LSM tree specific settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll pass configs into the LSM reader when adding archived reads to honor timeline-specific settings.
Signed-off-by: Sagar Sumit <[email protected]>
- Made load_archived_instants pub(crate) - Removed include_archived from public surface; added internal load_instants_internal used by Timeline methods - LayoutTwoActive now sorts results, ignores .crc, and drops the _ prefix filter - LSM manifest docs - Dropped v5 support in builder - Added TimelineSelector::has_time_filter() so Timeline knows when a time range is requested. - Updated Timeline::load_instants(...) to automatically include archived results when a time filter is present and an archived loader exists. Active is always queried first, archived appended if available. Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
ea11bad    to
    4f2eded      
    Compare
  
    | 
           resuming review on this over the next 2 days  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have some suggestions for code simplification. once addressed, we should be good to move forward
| if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") { | ||
| continue; | ||
| } | ||
| match selector.try_create_instant(file_info.name.as_str()) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this API needs to handle v8 instants as the completed ones have completion ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope this is the only blocker for merging. if we need to pass v8 timeline loading test cases, this parsing needs to be there.
| // Config wiring: if archived read not enabled, return behavior based on policy | ||
| let storage = match self { | ||
| TimelineLoader::LayoutOneArchived(storage) | ||
| | TimelineLoader::LayoutTwoArchived(storage) => storage.clone(), | ||
| _ => return Ok(Vec::new()), // Active loaders don't have archived parts | ||
| }; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this storage clone does not need happen here?
        
          
                crates/core/src/timeline/loader.rs
              
                Outdated
          
        
      | let enabled = configs | ||
| .get_or_default(TimelineArchivedReadEnabled) | ||
| .to::<bool>(); | ||
| if !enabled { | ||
| let behavior = configs | ||
| .get_or_default(TimelineArchivedUnavailableBehavior) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we just need 1 config? TimelineArchivedUnavailableBehavior default to continue will just either load archived or do nothing. even for internal config, we need to keep it simple for easy maintenance
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
Description
Implements LSM timeline reading capability for the hudi-rs core to support both v6 and v8 timeline formats, following Java implementation patterns with timeline layout abstraction. This is the foundation PR and we will follow up with future work on LSM timeline history reading, support for multiple layers of timeline compaction and manifest parsing for compacted timeline files.
Core Implementation
TimelineLoaderto abstract different timeline storage formats.TimelineBuilderto handle configuration resolution and loader instantiation based on table configuration.Timelineto support querying both active and archived/compacted timeline data.LayoutOneActive(reads from.hoodie/directory) andLayoutOneArchived(future)LayoutTwoActive(foundation for.hoodie/timeline/) andLayoutTwoCompacted(future)Configuration Integration
hoodie.table.versionandhoodie.timeline.layout.versionRelated to #377
How are the changes test-covered