csv-managed is a Rust command-line utility for high‑performance exploration and transformation of CSV data at scale, emphasizing streaming, typed operations, and reproducible workflows via schema and index files.

softwaresalt/csv-managed

csv-managed

csv-managed is a high‑performance Rust CLI for exploring, validating, transforming, and indexing very large CSV/TSV (and future delimited) datasets using streaming, typed schemas, and multi‑variant indexes.

Feature Matrix (Concise)

| Area | Highlights |
| --- | --- |
| Delimiters & Encodings | Comma/tab/pipe/semicolon/custom; independent input/output encoding; stdin/stdout streaming |
| Schema Discovery | Sample or full scan inference; diff, overrides, placeholder normalization, snapshots |
| Header Detection | Automatic header/headerless with synthetic field_#; force via --assume-header |
| Datatype Transformations | Ordered datatype_mappings chains (parse, round, trim, case) before final typing |
| Decimal & Currency | Fixed decimal(p,s) (≤28 precision) and currency scale (2 or 4) enforcement |
| Indexing & Sorting | Multi-variant B-Tree index; longest matching prefix acceleration; covering expansion |
| Filtering & Derivation | Typed comparisons + Evalexpr expressions; temporal helpers; positional aliases |
| Verification | Streaming per-cell type enforcement; tiered invalid reporting |
| Statistics & Frequency | Numeric + temporal metrics; distinct counts with --frequency / --top |
| Append & Pipelines | Multi-file union with schema consistency; efficient chained stdin workflows |
| Boolean & Table Output | Configurable boolean formats; elastic preview/table rendering |
| Snapshots | Layout/inference regression guard (--snapshot) |
| Error & Logging | Contextual failures; debug logging for inference/index/mappings |

Extended details moved to dedicated docs (see Documentation Map).


Global Documentation Table of Contents

Core (This README)

  1. Feature Matrix
  2. Global Documentation TOC
  3. Quick Start
  4. Installation
  5. Core Concepts (Brief)
  6. Datatypes
  7. Expressions Overview
  8. Indexes & Sorting Overview
  9. Streaming & Pipelines Overview
  10. Command Guide
  11. Advanced Topics
  12. Roadmap
  13. Contributing
  14. License
  15. Support

Deep Dives (Docs Directory)

  1. Schema Inference Internals
  2. Schema Command Examples
  3. Datatype Mappings Deep Dive
  4. Statistics & Frequency Deep Dive
  5. Header Detection & FAQ
  6. Naming Conventions
  7. Snapshots vs Verification
  8. Expressions Reference & Extended Examples
  9. Indexing & Sorting Guide
  10. Pipelines & Multi-Stage Patterns
  11. Encoding Normalization
  12. Boolean Formatting & Table Output
  13. Operational Notes (Perf / Errors / Logging / Testing)
  14. CLI Help Reference

Quick Cross-Reference

| Capability | Primary Doc |
| --- | --- |
| Inference algorithm details | schema-inference |
| Placeholder / NA handling | schema-inference, schema-examples |
| Decimal & Currency rules | schema-inference, datatype-mappings |
| Mapping strategies & strategies matrix | datatype-mappings |
| Overrides vs mappings vs replacements | schema-examples, datatype-mappings |
| Header detection heuristic | header-detection |
| Naming / snake_case rationale | naming-conventions |
| Snapshot vs verify comparison | snapshots-and-verification |
| Invalid reporting tiers | snapshots-and-verification |
| Index variant design & covering | indexing-and-sorting |
| Streaming pipeline safety (header shape) | pipelines |
| Encoding normalization patterns | encoding-normalization |
| Boolean output modes | boolean-formatting |
| Statistics aggregation & frequency counting | stats |
| Expressions functions, quoting, bucketing | expressions |
| Performance & logging guidance | operations |
| CLI option reference | cli-help |

Use this TOC as a hub: internal anchors for quick orientation; deep dives for authoritative detail.


Documentation Map

| Topic | Doc |
| --- | --- |
| Expressions (full reference & examples) | expressions |
| Indexing & Sorting internals | indexing-and-sorting |
| Multi-stage pipelines & header shape rules | pipelines |
| Schema inference internals | schema-inference |
| Schema command usage examples | schema-examples |
| Header detection algorithm & FAQ | header-detection |
| Naming conventions (snake_case rationale) | naming-conventions |
| Snapshots vs verification + reporting tiers | snapshots-and-verification |
| Boolean formatting & table output | boolean-formatting |
| Encoding normalization pipelines | encoding-normalization |
| Datatype mappings & transformation strategies | datatype-mappings |
| Statistics & frequency metrics | stats |
| Operational notes (performance, errors, logging, testing) | operations |
| CLI flag reference (captured help output) | cli-help |

Roadmap/backlog: see the roadmap.


Quick Start

# 1. Infer schema
./target/release/csv-managed.exe schema infer -i ./data/orders.csv -o ./data/orders-schema.yml --sample-rows 0
# 2. Build indexes
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc -m ./data/orders-schema.yml
# 3. Typed processing (filters / derives / sort)
./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders-schema.yml -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter "status = shipped" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv
# 4. Stats (numeric & temporal)
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml
# 5. Frequency counts
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml --frequency --top 10
# 6. Preview (no file output allowed with --preview)
./target/release/csv-managed.exe process -i ./data/orders.csv --preview --limit 15

See extended examples in collapsible sections throughout this README.


Installation

cargo build --release

Binary (Windows): target\release\csv-managed.exe

From crates.io:

cargo install csv-managed

Local path dev install:

cargo install --path .

Helper command (wraps cargo install):

./target/release/csv-managed.exe install --locked

Environment logging examples (PowerShell first, then cmd.exe):

$env:RUST_LOG='info'
set RUST_LOG=info

Core Concepts (Brief)

Schemas declare column order, types, optional renames, mapping chains, and replacements. Per-cell flow: raw → mappings → replacements → final parse. See docs/schema-inference.md and docs/schema-examples.md.
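The per-cell order above can be pictured with a small std-only Rust sketch. It is not the tool's implementation: the `Mapping` enum, the `normalize_cell` function, and the exact-match replacement rule are assumptions made for illustration, and the final parse targets Integer only.

```rust
/// Hypothetical mapping steps; the names are illustrative, not the tool's own.
enum Mapping {
    Trim,
    Lowercase,
}

/// Apply the assumed per-cell order: mappings, then replacements, then final parse.
fn normalize_cell(raw: &str, mappings: &[Mapping], replacements: &[(&str, &str)]) -> Option<i64> {
    // 1. Mapping chain, applied in declaration order.
    let mut value = raw.to_string();
    for mapping in mappings {
        value = match mapping {
            Mapping::Trim => value.trim().to_string(),
            Mapping::Lowercase => value.to_lowercase(),
        };
    }
    // 2. Replacements: an exact match swaps the whole cell value (assumed rule).
    for (from, to) in replacements {
        if value == *from {
            value = to.to_string();
            break;
        }
    }
    // 3. Final parse into the declared type (Integer here); None marks an invalid cell.
    value.parse::<i64>().ok()
}
```

With a `Trim, Lowercase` chain and a replacement of `n/a` with `0`, the raw cell `"  N/A "` ends up as the integer 0, which shows why replacements should be written against the post-mapping value.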

Header detection, naming guidance, and FAQ: docs/header-detection.md, docs/naming-conventions.md.

Overrides vs mappings vs replacements decision table: docs/schema-examples.md.

Verification tiers + snapshot comparison: docs/snapshots-and-verification.md.

Snapshot Internals (Deep Dive)

Snapshot includes: header+type hash (SHA-256), textual inference table, observation summaries. Hash changes on any header reorder or type change. Regenerate intentionally after approved inference adjustments.
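To see why a header reorder or a type change invalidates a snapshot, here is a minimal sketch of a layout fingerprint. The real snapshot uses SHA-256; the Rust standard library has no SHA-256, so this sketch swaps in `DefaultHasher` purely to show the order and type sensitivity. The `layout_fingerprint` name and the `(name, datatype)` pair representation are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint an ordered list of (column name, datatype) pairs.
/// Stand-in hash: DefaultHasher instead of the SHA-256 the tool reports.
fn layout_fingerprint(columns: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, datatype) in columns {
        // Feeding pairs in order makes the result depend on column position.
        name.hash(&mut hasher);
        datatype.hash(&mut hasher);
    }
    hasher.finish()
}
```

Two layouts with the same columns in a different order, or with one type changed, produce different fingerprints, so a stored snapshot no longer matches until it is regenerated.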


Datatypes (Supported)

| Type | Examples | Notes |
| --- | --- | --- |
| String | any UTF‑8 | Post-mapping names usable in expressions |
| Integer | 42, -7 | 64-bit signed |
| Float | 3.14, 2 | f64 (integers accepted) |
| Boolean | true/false, yes/no, 1/0 | Input variants normalized; output format selectable |
| Date | 2024-08-01, 08/01/2024 | Canonical YYYY-MM-DD |
| DateTime | 2024-08-01T13:45:00 | Naive (no TZ) |
| Time | 06:00:00, 14:30 | Canonical HH:MM:SS |
| Currency | $12.34, 123.4567 | Enforce 2 or 4 scale; symbol stripped |
| Decimal | 123.4567, (1,234.50) | Fixed precision/scale ≤28 |
| Guid | RFC 4122 hyphenated or 32hex | Case-insensitive |
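
As an illustration of the Boolean row, the sketch below normalizes the listed input variants and then renders the value with a caller-chosen pair of output strings. Case-insensitive matching, whitespace trimming, and the function names are assumptions; the tool's actual accepted spellings and output modes are described in the boolean-formatting doc.

```rust
/// Normalize the input variants listed in the table (true/false, yes/no, 1/0).
/// Case-insensitivity and trimming are assumptions of this sketch.
fn parse_boolean(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None, // anything else would be reported as an invalid cell
    }
}

/// Render a normalized value using a hypothetical pair of output strings.
fn format_boolean(value: bool, true_text: &str, false_text: &str) -> String {
    if value { true_text.to_string() } else { false_text.to_string() }
}
```

Separating parsing from formatting is what lets one input column with mixed spellings come out in a single consistent output style.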

Expressions & Derived Logic (Overview)

Derived columns: --derive name=expr • Filters: --filter, --filter-expr • Positional aliases: c0, c1, ... • row_number is available when --row-numbers is enabled.

Quick Cheat Sheets

| Pattern | Example | Description |
| --- | --- | --- |
| Arithmetic | total_with_tax=amount*1.0825 | Multiply numeric column |
| Conditional flag | high_value=if(amount>1000,1,0) | 1/0 indicator |
| Date diff | ship_lag=date_diff_days(shipped_at,ordered_at) | Days between dates |
| Time diff | window=time_diff_seconds(end_time,start_time) | Seconds difference |
| Concat | channel_tag=concat(channel,"-",region) | Combine strings |
| Guid passthrough | id_copy=id | Duplicate column |
| Row number | row_index=row_number | Sequential index |

| Aspect | --filter | --filter-expr |
| --- | --- | --- |
| Operators | Basic typed comparisons | Full Evalexpr syntax |
| Logic | AND via repetition | AND/OR, nested if |
| Temporal helpers | Direct typed compare | date_diff_days, etc. |
| Complexity | Concise | Arbitrary expression |
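
"AND via repetition" means that repeating --filter narrows the result: a row is kept only if every filter holds. The sketch below models that semantics for numeric columns only. The `Filter` and `Op` types are hypothetical, and only the two operators that appear in this README's examples (`=` and `>=`) are modeled.

```rust
use std::collections::HashMap;

/// Comparison operators modeled in this sketch (only those shown in this README).
#[derive(Clone, Copy)]
enum Op {
    Eq,
    Ge,
}

/// One simple typed comparison against a numeric column (hypothetical shape).
struct Filter {
    column: &'static str,
    op: Op,
    value: f64,
}

/// A row passes only when every repeated filter holds (logical AND).
fn row_passes(row: &HashMap<&str, f64>, filters: &[Filter]) -> bool {
    filters.iter().all(|filter| match row.get(filter.column) {
        Some(&cell) => match filter.op {
            Op::Eq => cell == filter.value,
            Op::Ge => cell >= filter.value,
        },
        // An unknown column fails here; the CLI reports it as an error instead.
        None => false,
    })
}
```

Anything that needs OR or nesting does not fit this model, which is the case --filter-expr covers.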

Full Expression Reference

Temporal Helpers: date_add, date_sub, date_diff_days, date_format, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, time_add_seconds, time_diff_seconds.

Pitfalls:

  • PowerShell quoting: wrap whole expression in single quotes, internal literals in double quotes.
  • c0 is first column (0-based). Verify mapping order.
  • row_number exists only if --row-numbers set.
  • Use helpers, not raw string comparisons, for temporal correctness.
  • Mapping chains precede replacements which precede final parse; expressions see normalized values.
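
The temporal pitfall is easy to reproduce: as raw strings, "08/01/2024" sorts before "12/31/2023" even though it is the later date. The following std-only sketch shows what a day-difference helper has to do instead. It is not the tool's date_diff_days; the first-minus-second argument order is an assumption, only canonical YYYY-MM-DD input is handled, and the day count uses the well-known days-from-civil calculation for the proleptic Gregorian calendar.

```rust
/// Parse a canonical YYYY-MM-DD date into (year, month, day).
/// Only range checks are applied; day-of-month validity per month is not verified.
fn parse_date(text: &str) -> Option<(i64, i64, i64)> {
    let mut parts = text.split('-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: i64 = parts.next()?.parse().ok()?;
    let day: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some((year, month, day))
}

/// Days since 1970-01-01 in the proleptic Gregorian calendar.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat March as the first month so the leap day falls at the end of the year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let year_of_era = y - era * 400;
    let shifted_month = if month > 2 { month - 3 } else { month + 9 };
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Whole days from `earlier` to `later` (assumed order: first minus second).
fn date_diff_days(later: &str, earlier: &str) -> Option<i64> {
    let (ly, lm, ld) = parse_date(later)?;
    let (ey, em, ed) = parse_date(earlier)?;
    Some(days_from_civil(ly, lm, ld) - days_from_civil(ey, em, ed))
}
```

Because the comparison happens on day counts rather than characters, the leap day in February 2024 is counted correctly, which a string comparison cannot do.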

Function Index (alphabetical): concat, date_add, date_diff_days, date_format, date_sub, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, if, time_add_seconds, time_diff_seconds.

Debugging: Increase logging with RUST_LOG=csv_managed=debug. Future deep expression tracing may emit expr: prefixed debug lines.


Indexes & Sorting (Overview)

Indexes store byte offsets keyed by concatenated column values. A single .idx contains multiple named variants (different column sequences and directions). process chooses the variant with the longest matching prefix for a requested --sort unless --index-variant pins a specific one.
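
Variant selection can be sketched as follows. This is one plausible reading of "longest matching prefix", not the tool's code: each variant is scored by how many leading (column, direction) keys it shares with the requested sort, and the highest non-zero score wins. The `Variant` struct and function names are hypothetical, and tie-breaking between equally good variants is left unspecified.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Direction {
    Asc,
    Desc,
}

/// A (column, direction) pair, as in `order_date:asc`.
type SortKey = (&'static str, Direction);

/// One named variant inside an index file (hypothetical in-memory shape).
struct Variant {
    name: &'static str,
    keys: Vec<SortKey>,
}

/// Count how many leading keys the variant shares with the requested sort.
fn shared_prefix_len(variant_keys: &[SortKey], requested: &[SortKey]) -> usize {
    variant_keys
        .iter()
        .zip(requested)
        .take_while(|(a, b)| a == b)
        .count()
}

/// Pick the variant with the longest shared prefix; None means no variant helps.
fn choose_variant(variants: &[Variant], requested: &[SortKey]) -> Option<(&'static str, usize)> {
    variants
        .iter()
        .map(|variant| (variant.name, shared_prefix_len(&variant.keys, requested)))
        .filter(|(_, len)| *len > 0)
        .max_by_key(|(_, len)| *len)
}
```

With the `default` and `recent` variants from the Quick Start, a sort of order_date:asc,customer_id:asc selects `default` with a shared prefix of 2, while order_date:desc selects `recent`.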

Building:

./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx \
  --spec default=order_date:asc,customer_id:asc \
  --spec recent=order_date:desc -m ./data/orders-schema.yml

Covering (--covering): Generate systematic direction/prefix permutations from a concise pattern (e.g. geo=date:asc|desc,customer:asc).
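
One way to picture the expansion is shown below. The exact set of variants that --covering generates is documented in the indexing-and-sorting guide; this sketch assumes that every prefix of the column list is emitted once per combination of the listed directions. The `expand_covering` name is hypothetical, and the input is the part of the pattern after the `name=` prefix.

```rust
/// Expand a pattern such as "date:asc|desc,customer:asc" into sort-key lists.
/// Assumption: one variant per prefix and per direction combination.
fn expand_covering(pattern: &str) -> Vec<Vec<(String, String)>> {
    // Parse each "column:dir1|dir2" entry; malformed entries are skipped.
    let columns: Vec<(String, Vec<String>)> = pattern
        .split(',')
        .filter_map(|entry| {
            let (name, dirs) = entry.split_once(':')?;
            Some((
                name.trim().to_string(),
                dirs.split('|').map(|dir| dir.trim().to_string()).collect(),
            ))
        })
        .collect();

    let mut variants = Vec::new();
    // Direction combinations for the prefix built so far; starts with one empty prefix.
    let mut combos: Vec<Vec<(String, String)>> = vec![Vec::new()];
    for (name, dirs) in &columns {
        let mut next = Vec::new();
        for combo in &combos {
            for dir in dirs {
                let mut extended = combo.clone();
                extended.push((name.clone(), dir.clone()));
                next.push(extended);
            }
        }
        // Every combination at this prefix length becomes a variant.
        variants.extend(next.iter().cloned());
        combos = next;
    }
    variants
}
```

Under that assumption the example pattern yields four key lists: date:asc, date:desc, date:asc,customer:asc, and date:desc,customer:asc.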

Fallback: When no index variant matches the entire sort signature, process falls back to a stable in-memory multi-column sort. Transforms before and after the sort still stream where possible.
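
A stable multi-column sort of the kind the fallback describes can be sketched with the standard library, whose `sort_by` is stable. The row shape (`Vec<String>`) and the `(column index, ascending)` key representation are assumptions, and cells are compared as strings here, whereas the tool compares typed values.

```rust
use std::cmp::Ordering;

/// Sort rows in place by several (column index, ascending) keys.
/// `sort_by` is stable, so rows that tie on every key keep their input order.
fn sort_rows(rows: &mut [Vec<String>], keys: &[(usize, bool)]) {
    rows.sort_by(|a, b| {
        for &(column, ascending) in keys {
            // String comparison stands in for the typed comparison of the real tool.
            let ordering = a[column].cmp(&b[column]);
            let ordering = if ascending { ordering } else { ordering.reverse() };
            if ordering != Ordering::Equal {
                return ordering;
            }
        }
        Ordering::Equal
    });
}
```

The whole slice has to be in memory before sorting starts, which is the cost an index variant avoids.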


Streaming & Pipelines (Overview)

Use -i - to read from stdin; schema strongly recommended for typed semantics. Each stage must explicitly declare stdin usage. Avoid header shape changes between typed stages unless you also provide a matching updated schema.

Core guidelines:

  • Keep early projections narrow.
  • Apply filters before sorting or heavy derives.
  • Normalize encodings up front (--input-encoding / --output-encoding).
  • Use --preview --limit for fast inspection; remove before chaining downstream.
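
The guidelines above rest on row-at-a-time processing. The sketch below shows the general shape of such a stage: it copies the header, keeps matching rows, and stops at a limit without ever holding the full input. It is not the tool's implementation; `stream_filtered` is a hypothetical name, and splitting on newlines ignores quoted fields that contain line breaks, which a real CSV parser must handle.

```rust
use std::io::{self, BufRead, Write};

/// Copy the header plus at most `limit` matching rows from `input` to `output`.
/// Returns the number of data rows written.
fn stream_filtered<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    keep: impl Fn(&str) -> bool,
    limit: usize,
) -> io::Result<usize> {
    let mut lines = input.lines();
    // The header passes through unchanged so the next stage sees the same shape.
    if let Some(header) = lines.next() {
        let header = header?;
        writeln!(output, "{}", header)?;
    }
    let mut written = 0;
    for line in lines {
        if written == limit {
            break; // stop reading early once the limit is reached
        }
        let line = line?;
        if keep(&line) {
            writeln!(output, "{}", line)?;
            written += 1;
        }
    }
    Ok(written)
}
```

In a pipeline stage the reader and writer would be `io::stdin().lock()` and `io::stdout().lock()`; memory use stays constant regardless of input size because only one line is held at a time.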

Extended Pipeline Examples & Troubleshooting

Filter then stats:

Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml `
  --filter "Performance_Gls >= 10" --limit 40 |
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls

Append with one streamed input:

Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe append -i - -i .\tmp\big_5_preview.csv `
  --schema .\tests\data\big_5_players_stats-schema.yml -o .\tmp\players_union.csv

Encoding normalization:

Get-Content .\tmp\big_5_windows1252.csv |
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 `
  --schema .\tests\data\big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table

Troubleshooting:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Hang | Upstream not producing | Add --preview --limit to inspect |
| Column not found | Rename/mapping changed | Re-check schema columns / header output |
| Zero stats rows | Filters excluded all rows | Relax/remove filters |
| Invalid datatype downstream | Schema mismatch | Supply correct schema per stage |

Command Guide (Summary)

Concise flag references; see concept sections for deep behavior.

schema

Probe, infer, verify, list columns, diff and snapshot inference output.

| Sub/Flag | Summary |
| --- | --- |
| probe | Inference preview table (no file) |
| infer | Inference + optional write (-o) + diff/snapshot integration |
| verify | Streaming type & replacement validation |
| columns | Tabular listing of schema columns |
| --snapshot | Layout regression guard |
| --diff <schema> | Unified diff vs existing schema |
| --assume-header | Override header detection |
| --mapping | Emit mapping scaffold & snake_case suggestions |
| --replace-template | Inject empty replace arrays |

process

Transform & emit rows: filtering, derives, column selection, sorting (indexed or fallback), boolean formatting, row numbering, preview/table output.

stats

Numeric & temporal summary metrics; --frequency for distinct counts; filter integration.
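
What --frequency with --top computes can be sketched as a count per distinct value followed by a ranking. The function name and the tie-breaking rule (alphabetical by value) are assumptions of this sketch, not documented behavior.

```rust
use std::collections::HashMap;

/// Count distinct values and return the `top` most frequent ones.
fn top_frequencies<'a>(
    values: impl IntoIterator<Item = &'a str>,
    top: usize,
) -> Vec<(&'a str, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for value in values {
        *counts.entry(value).or_insert(0) += 1;
    }
    let mut ranked: Vec<(&str, usize)> = counts.into_iter().collect();
    // Highest count first; ties broken by value so the output is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    ranked.truncate(top);
    ranked
}
```

Memory grows with the number of distinct values rather than the number of rows, so high-cardinality columns such as identifiers are the expensive case.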

append

Concatenate multiple CSV inputs enforcing header/schema consistency.
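
The header consistency rule can be sketched as: emit the first input's header once, then require every later input to start with the same header before copying its rows. The function below is a simplified illustration on in-memory strings; the name and the error messages are hypothetical, and schema-based type checks are not modeled.

```rust
/// Concatenate CSV texts, keeping one header and rejecting mismatched inputs.
fn append_inputs(inputs: &[&str]) -> Result<String, String> {
    let mut output = String::new();
    let mut expected_header: Option<&str> = None;
    for (index, input) in inputs.iter().enumerate() {
        let mut lines = input.lines();
        let header = lines
            .next()
            .ok_or_else(|| format!("input {} is empty", index))?;
        match expected_header {
            // The first input defines the header, which is written exactly once.
            None => {
                expected_header = Some(header);
                output.push_str(header);
                output.push('\n');
            }
            Some(expected) if expected != header => {
                return Err(format!(
                    "input {} header {:?} does not match {:?}",
                    index, header, expected
                ));
            }
            Some(_) => {}
        }
        // Data rows are copied through in input order.
        for row in lines {
            output.push_str(row);
            output.push('\n');
        }
    }
    Ok(output)
}
```

Failing on the first mismatched header, before any of that input's rows are written, keeps a partially merged file from looking valid.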

index

Build multi-variant B-tree index files (--spec, --covering) for accelerated sort alignment.

install

Wrapper around cargo install csv-managed (version / force / locked / root flags).

schema columns

List schema-declared columns and datatypes (resolves renames).


Advanced Topics

Performance Considerations

  • Indexed sort avoids retaining all rows in memory.
  • Early filtering reduces downstream CPU & sort footprint.
  • Median requires buffering column values; limit wide median usage on huge datasets.
  • Decimal & currency parsing add overhead—declare only where needed.
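
The median bullet is worth a short illustration. A mean needs only a running sum and count, while a median must see the whole column before it can answer. The sketch below contrasts the two with std only; the function names are hypothetical and the tool's own statistics code may use a different strategy.

```rust
/// Mean needs only a running sum and count, so it streams in constant memory.
fn streaming_mean(values: impl Iterator<Item = f64>) -> Option<f64> {
    let (sum, count) = values.fold((0.0, 0u64), |(sum, count), value| (sum + value, count + 1));
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

/// Median must see every value first, so the column is buffered and sorted.
fn buffered_median(values: impl Iterator<Item = f64>) -> Option<f64> {
    let mut buffer: Vec<f64> = values.collect();
    if buffer.is_empty() {
        return None;
    }
    buffer.sort_by(|a, b| a.total_cmp(b));
    let mid = buffer.len() / 2;
    if buffer.len() % 2 == 1 {
        Some(buffer[mid])
    } else {
        // Even count: average the two middle values.
        Some((buffer[mid - 1] + buffer[mid]) / 2.0)
    }
}
```

The buffer in `buffered_median` grows with the row count for each column it is applied to, which is why requesting medians across many wide columns of a huge dataset is costly.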

Error Handling

anyhow contexts annotate origin (I/O, parse, schema, expression). Fast failure on unknown columns, invalid expressions, header mismatches, precision overflow, unsupported mapping strategies.

Logging

Set RUST_LOG=csv_managed=debug for phase insights (inference voting, index selection, mapping application). Higher verbosity may impact throughput—toggle only when diagnosing.

Testing

Run cargo test. Integration tests cover inference, indexing, process flags, piping, stats. Use assert_cmd for pipeline locking. Add new tests for any behavior that changes output formatting (update snapshots intentionally).


Roadmap

See consolidated backlog & release planning in [.plan/backlog.md](.plan/backlog.md) for upcoming features (join redesign, primary key indexes, batch definition ingestion, additional file formats).


Contributing

  1. Fork & branch (feat/<name>).
  2. Add unit + integration tests.
  3. cargo fmt && cargo clippy && cargo test must pass.
  4. Update README sections or move features from roadmap to implemented list.

License

See LICENSE.

Support

Open issues for bugs, enhancements, or documentation gaps. Pull requests welcome.

Join the community conversation, ask questions, or propose ideas in GitHub Discussions.


Documentation Notes

Deep dive sections removed from README and relocated to docs/. Use the Documentation Map above for full references.

