csv-managed is a high‑performance Rust CLI for exploring, validating, transforming, and indexing very large CSV/TSV (and future delimited) datasets using streaming, typed schemas, and multi‑variant indexes.
| Area | Highlights |
|---|---|
| Delimiters & Encodings | Comma/tab/pipe/semicolon/custom; independent input/output encoding; stdin/stdout streaming |
| Schema Discovery | Sample or full scan inference; diff, overrides, placeholder normalization, snapshots |
| Header Detection | Automatic header/headerless with synthetic field_#; force via --assume-header |
| Datatype Transformations | Ordered datatype_mappings chains (parse, round, trim, case) before final typing |
| Decimal & Currency | Fixed decimal(p,s) (≤28 precision) and currency scale (2 or 4) enforcement |
| Indexing & Sorting | Multi-variant B-Tree index; longest matching prefix acceleration; covering expansion |
| Filtering & Derivation | Typed comparisons + Evalexpr expressions; temporal helpers; positional aliases |
| Verification | Streaming per-cell type enforcement; tiered invalid reporting |
| Statistics & Frequency | Numeric + temporal metrics; distinct counts with --frequency / --top |
| Append & Pipelines | Multi-file union with schema consistency; efficient chained stdin workflows |
| Boolean & Table Output | Configurable boolean formats; elastic preview/table rendering |
| Snapshots | Layout/inference regression guard (--snapshot) |
| Error & Logging | Contextual failures; debug logging for inference/index/mappings |
Extended details moved to dedicated docs (see Documentation Map).
- Feature Matrix
- Global Documentation TOC
- Quick Start
- Installation
- Core Concepts (Brief)
- Datatypes
- Expressions Overview
- Indexes & Sorting Overview
- Streaming & Pipelines Overview
- Command Guide
- Advanced Topics
- Roadmap
- Contributing
- License
- Support
- Schema Inference Internals
- Schema Command Examples
- Datatype Mappings Deep Dive
- Statistics & Frequency Deep Dive
- Header Detection & FAQ
- Naming Conventions
- Snapshots vs Verification
- Expressions Reference & Extended Examples
- Indexing & Sorting Guide
- Pipelines & Multi-Stage Patterns
- Encoding Normalization
- Boolean Formatting & Table Output
- Operational Notes (Perf / Errors / Logging / Testing)
- CLI Help Reference
| Capability | Primary Doc |
|---|---|
| Inference algorithm details | schema-inference |
| Placeholder / NA handling | schema-inference, schema-examples |
| Decimal & Currency rules | schema-inference, datatype-mappings |
| Mapping strategies & strategy matrix | datatype-mappings |
| Overrides vs mappings vs replacements | schema-examples, datatype-mappings |
| Header detection heuristic | header-detection |
| Naming / snake_case rationale | naming-conventions |
| Snapshot vs verify comparison | snapshots-and-verification |
| Invalid reporting tiers | snapshots-and-verification |
| Index variant design & covering | indexing-and-sorting |
| Streaming pipeline safety (header shape) | pipelines |
| Encoding normalization patterns | encoding-normalization |
| Boolean output modes | boolean-formatting |
| Statistics aggregation & frequency counting | stats |
| Expressions functions, quoting, bucketing | expressions |
| Performance & logging guidance | operations |
| CLI option reference | cli-help |
Use this TOC as a hub: internal anchors for quick orientation; deep dives for authoritative detail.
| Topic | Doc |
|---|---|
| Expressions (full reference & examples) | expressions |
| Indexing & Sorting internals | indexing-and-sorting |
| Multi-stage pipelines & header shape rules | pipelines |
| Schema inference internals | schema-inference |
| Schema command usage examples | schema-examples |
| Header detection algorithm & FAQ | header-detection |
| Naming conventions (snake_case rationale) | naming-conventions |
| Snapshots vs verification + reporting tiers | snapshots-and-verification |
| Boolean formatting & table output | boolean-formatting |
| Encoding normalization pipelines | encoding-normalization |
| Datatype mappings & transformation strategies | datatype-mappings |
| Statistics & frequency metrics | stats |
| Operational notes (performance, errors, logging, testing) | operations |
| CLI flag reference (captured help output) | cli-help |
Roadmap/backlog: see the roadmap.
```powershell
# 1. Infer schema
./target/release/csv-managed.exe schema infer -i ./data/orders.csv -o ./data/orders-schema.yml --sample-rows 0
# 2. Build indexes
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc -m ./data/orders-schema.yml
# 3. Typed processing (filters / derives / sort)
./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders-schema.yml -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter "status = shipped" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv
# 4. Stats (numeric & temporal)
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml
# 5. Frequency counts
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml --frequency --top 10
# 6. Preview (no file output allowed with --preview)
./target/release/csv-managed.exe process -i ./data/orders.csv --preview --limit 15
```

See extended examples in collapsible sections throughout this README.
```powershell
cargo build --release
```

Binary (Windows): `target\release\csv-managed.exe`

From crates.io:

```powershell
cargo install csv-managed
```

Local path dev install:

```powershell
cargo install --path .
```

Helper command (wraps `cargo install`):

```powershell
./target/release/csv-managed.exe install --locked
```

Environment logging examples:

```powershell
$env:RUST_LOG='info'
```

```batch
set RUST_LOG=info
```

Schemas declare column order, types, optional renames, mapping chains, and replacements. Per-cell flow: raw → mappings → replacements → final parse. See `docs/schema-inference.md` and `docs/schema-examples.md`.
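The per-cell flow (raw → mappings → replacements → final parse) can be sketched roughly as follows. This is a minimal illustration rather than csv-managed's actual code: the `Mapping` enum, the `normalize_cell` function, exact-match replacements, and the Integer target type are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical mapping steps; the real strategies live in docs/datatype-mappings.md.
#[derive(Debug, Clone, Copy)]
enum Mapping {
    Trim,
    Lowercase,
}

/// Apply the mapping chain in order, then replacements, then the final typed parse.
fn normalize_cell(
    raw: &str,
    mappings: &[Mapping],
    replacements: &HashMap<String, String>,
) -> Result<i64, std::num::ParseIntError> {
    // 1. Mappings run first, in declared order.
    let mut value = raw.to_string();
    for mapping in mappings {
        value = match mapping {
            Mapping::Trim => value.trim().to_string(),
            Mapping::Lowercase => value.to_lowercase(),
        };
    }
    // 2. Replacements see the mapped value, not the raw one (exact match assumed).
    if let Some(replacement) = replacements.get(&value) {
        value = replacement.clone();
    }
    // 3. Final parse into the declared type (Integer in this sketch).
    value.parse::<i64>()
}
```

With a `Trim`, `Lowercase` chain and a replacement of `n/a` → `0`, the raw cell `"  N/A "` parses as `0` in this sketch, because the replacement is looked up only after the mappings have run.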
Header detection, naming guidance, and FAQ: docs/header-detection.md, docs/naming-conventions.md.
Overrides vs mappings vs replacements decision table: docs/schema-examples.md.
Verification tiers + snapshot comparison: docs/snapshots-and-verification.md.
Snapshot includes: header+type hash (SHA-256), textual inference table, observation summaries. Hash changes on any header reorder or type change. Regenerate intentionally after approved inference adjustments.
| Type | Examples | Notes |
|---|---|---|
| String | any UTF‑8 | Post-mapping names usable in expressions |
| Integer | `42`, `-7` | 64-bit signed |
| Float | `3.14`, `2` | f64 (integers accepted) |
| Boolean | `true`/`false`, `yes`/`no`, `1`/`0` | Input variants normalized; output format selectable |
| Date | `2024-08-01`, `08/01/2024` | Canonical YYYY-MM-DD |
| DateTime | `2024-08-01T13:45:00` | Naive (no TZ) |
| Time | `06:00:00`, `14:30` | Canonical HH:MM:SS |
| Currency | `$12.34`, `123.4567` | Enforce 2 or 4 scale; symbol stripped |
| Decimal | `123.4567`, `(1,234.50)` | Fixed precision/scale ≤28 |
| Guid | RFC 4122 hyphenated or 32hex | Case-insensitive |
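To make two rows of the table concrete, here is a small sketch of boolean normalization and a currency scale check. Both functions (`parse_bool`, `currency_scale`) are hypothetical helpers written for illustration. Case-insensitive matching, stripping only a leading `$`, and rejecting (rather than rounding) other scales are assumptions, not documented behavior.

```rust
/// Normalize the boolean spellings listed in the table to a Rust `bool`.
/// Case-insensitive matching and whitespace trimming are assumptions of this sketch.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None,
    }
}

/// Return the number of fractional digits of a currency literal if it is 2 or 4.
/// Values with any other scale (or no decimal point) yield `None` here.
fn currency_scale(raw: &str) -> Option<usize> {
    let digits = raw.trim().trim_start_matches('$');
    let (_, fraction) = digits.split_once('.')?;
    let valid = !fraction.is_empty() && fraction.chars().all(|c| c.is_ascii_digit());
    match fraction.len() {
        2 | 4 if valid => Some(fraction.len()),
        _ => None,
    }
}
```

Under these assumptions, the table's examples `$12.34` and `123.4567` yield scales 2 and 4 respectively, while a value such as `12.345` is rejected.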
Derived columns: --derive name=expr • Filters: --filter, --filter-expr • Positional aliases: c0, c1, ... • row_number when --row-numbers enabled.
| Pattern | Example | Description |
|---|---|---|
| Arithmetic | `total_with_tax=amount*1.0825` | Multiply numeric column |
| Conditional flag | `high_value=if(amount>1000,1,0)` | 1/0 indicator |
| Date diff | `ship_lag=date_diff_days(shipped_at,ordered_at)` | Days between dates |
| Time diff | `window=time_diff_seconds(end_time,start_time)` | Seconds difference |
| Concat | `channel_tag=concat(channel,"-",region)` | Combine strings |
| Guid passthrough | `id_copy=id` | Duplicate column |
| Row number | `row_index=row_number` | Sequential index |
| Aspect | `--filter` | `--filter-expr` |
|---|---|---|
| Operators | Basic typed comparisons | Full Evalexpr syntax |
| Logic | AND via repetition | AND/OR, nested if |
| Temporal helpers | Direct typed compare | date_diff_days, etc. |
| Complexity | Concise | Arbitrary expression |
Temporal Helpers: date_add, date_sub, date_diff_days, date_format, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, time_add_seconds, time_diff_seconds.
Pitfalls:
- PowerShell quoting: wrap whole expression in single quotes, internal literals in double quotes.
- `c0` is the first column (0-based). Verify mapping order.
- `row_number` exists only if `--row-numbers` is set.
- Use helpers, not raw string comparisons, for temporal correctness.
- Mapping chains precede replacements which precede final parse; expressions see normalized values.
Function Index (alphabetical): concat, date_add, date_diff_days, date_format, date_sub, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, if, time_add_seconds, time_diff_seconds.
Debugging: Increase logging with RUST_LOG=csv_managed=debug. Future deep expression tracing may emit expr: prefixed debug lines.
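The positional aliases and the conditional `row_number` variable described above can be pictured as a per-row set of bindings. The sketch below is an assumption about how such bindings could be assembled, not the tool's internals: `expression_bindings` is a hypothetical helper, and values are kept as strings for brevity, whereas the real tool works with typed values.

```rust
use std::collections::BTreeMap;

/// Build the variable bindings one row could expose to an expression.
fn expression_bindings(
    headers: &[&str],
    row: &[&str],
    row_number: Option<u64>,
) -> BTreeMap<String, String> {
    let mut bindings = BTreeMap::new();
    for (index, (name, value)) in headers.iter().zip(row).enumerate() {
        // Each column is reachable by name...
        bindings.insert(name.to_string(), value.to_string());
        // ...and by positional alias: c0 is the first column (0-based).
        bindings.insert(format!("c{index}"), value.to_string());
    }
    // `row_number` is bound only when row numbering is enabled.
    if let Some(n) = row_number {
        bindings.insert("row_number".to_string(), n.to_string());
    }
    bindings
}
```

Passing `None` for `row_number` leaves that variable unbound, which mirrors the pitfall that `row_number` is unavailable unless `--row-numbers` is set.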
Indexes store byte offsets keyed by concatenated column values. A single .idx contains multiple named variants (different column sequences and directions). process chooses the variant with the longest matching prefix for a requested --sort unless --index-variant pins a specific one.
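One plausible reading of "longest matching prefix" is sketched below. It assumes that a variant is a candidate when its full key list (column and direction) is a prefix of the requested sort, and that the longest such variant wins. The `Direction` enum, `SortKey` alias, and `choose_variant` function are hypothetical names, and the real selection logic may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Direction {
    Asc,
    Desc,
}

/// A sort key: column name plus direction.
type SortKey = (&'static str, Direction);

/// Pick the variant whose keys form the longest prefix of the requested sort.
/// Returns `None` when no variant qualifies, i.e. a fallback sort would be needed.
fn choose_variant<'a>(
    variants: &'a [(&'a str, Vec<SortKey>)],
    requested: &[SortKey],
) -> Option<&'a str> {
    variants
        .iter()
        // Keep variants whose whole key list matches the start of the request.
        .filter(|(_, keys)| !keys.is_empty() && requested.starts_with(keys))
        // Prefer the variant that covers the most leading sort keys.
        .max_by_key(|(_, keys)| keys.len())
        .map(|(name, _)| *name)
}
```

For a request of `order_date:asc,customer_id:asc`, a two-key variant named `default` with exactly those keys would be preferred over a one-key variant on `order_date:asc`, and a variant on `order_date:desc` would not be considered at all.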
Building:

```bash
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx \
  --spec default=order_date:asc,customer_id:asc \
  --spec recent=order_date:desc -m ./data/orders-schema.yml
```

Covering (`--covering`): Generate systematic direction/prefix permutations from a concise pattern (e.g. `geo=date:asc|desc,customer:asc`).
Fallback: When no index variant matches the entire sort signature, an in-memory stable multi-column sort executes (still streaming transforms earlier/later as possible).
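A stable multi-column sort of the kind the fallback describes can be sketched with the standard library alone. The `SortDirection` enum and `sort_rows` function are hypothetical, and cells are compared as plain strings here for brevity, whereas the tool compares typed values.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy)]
enum SortDirection {
    Asc,
    Desc,
}

/// Sort rows in place by (column index, direction) keys.
/// `sort_by` is stable, so rows that compare equal keep their input order.
fn sort_rows(rows: &mut [Vec<String>], keys: &[(usize, SortDirection)]) {
    rows.sort_by(|a, b| {
        for &(col, direction) in keys {
            let ordering = match direction {
                SortDirection::Asc => a[col].cmp(&b[col]),
                // Descending order swaps the operands.
                SortDirection::Desc => b[col].cmp(&a[col]),
            };
            // The first key that differs decides; later keys only break ties.
            if ordering != Ordering::Equal {
                return ordering;
            }
        }
        Ordering::Equal
    });
}
```

Because every row must be held in the slice before sorting, memory grows with the row count, which is the cost an index variant avoids.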
Use -i - to read from stdin; schema strongly recommended for typed semantics. Each stage must explicitly declare stdin usage. Avoid header shape changes between typed stages unless you also provide a matching updated schema.
Core guidelines:
- Keep early projections narrow.
- Apply filters before sorting or heavy derives.
- Normalize encodings up front (`--input-encoding` / `--output-encoding`).
- Use `--preview --limit` for fast inspection; remove before chaining downstream.
Filter then stats:

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml `
  --filter "Performance_Gls >= 10" --limit 40 |
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls
```

Append with one streamed input:

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe append -i - -i .\tmp\big_5_preview.csv `
  --schema .\tests\data\big_5_players_stats-schema.yml -o .\tmp\players_union.csv
```

Encoding normalization:

```powershell
Get-Content .\tmp\big_5_windows1252.csv |
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 `
  --schema .\tests\data\big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table
```

Troubleshooting:
| Symptom | Cause | Fix |
|---|---|---|
| Hang | Upstream not producing | Add --preview --limit to inspect |
| Column not found | Rename/mapping changed | Re-check schema columns / header output |
| Zero stats rows | Filters excluded all rows | Relax/remove filters |
| Invalid datatype downstream | Schema mismatch | Supply correct schema per stage |
Concise flag references; see concept sections for deep behavior.
`schema`: Probe, infer, verify, list columns, diff, and snapshot inference output.
| Sub/Flag | Summary |
|---|---|
| `probe` | Inference preview table (no file) |
| `infer` | Inference + optional write (`-o`) + diff/snapshot integration |
| `verify` | Streaming type & replacement validation |
| `columns` | Tabular listing of schema columns |
| `--snapshot` | Layout regression guard |
| `--diff <schema>` | Unified diff vs existing schema |
| `--assume-header` | Override header detection |
| `--mapping` | Emit mapping scaffold & snake_case suggestions |
| `--replace-template` | Inject empty replace arrays |
- `process`: Transform & emit rows with filtering, derives, column selection, sorting (indexed or fallback), boolean formatting, row numbering, and preview/table output.
- `stats`: Numeric & temporal summary metrics; `--frequency` for distinct counts; filter integration.
- `append`: Concatenate multiple CSV inputs, enforcing header/schema consistency.
- `index`: Build multi-variant B-tree index files (`--spec`, `--covering`) for accelerated sort alignment.
- `install`: Wrapper around `cargo install csv-managed` (version / force / locked / root flags).
- `columns`: List schema-declared columns and datatypes (resolves renames).
Performance:
- Indexed sort avoids retaining all rows in memory.
- Early filtering diminishes downstream CPU & sort footprint.
- Median requires buffering column values; limit wide median usage on huge datasets.
- Decimal & currency parsing add overhead—declare only where needed.
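The note about median buffering can be illustrated by contrasting constant-memory running metrics with an exact median. This is a generic sketch, not the tool's stats implementation; `RunningStats` and `median` are hypothetical names.

```rust
/// Running metrics that need only constant memory per column.
#[derive(Debug, Default)]
struct RunningStats {
    count: u64,
    sum: f64,
    min: Option<f64>,
    max: Option<f64>,
}

impl RunningStats {
    /// Fold one value into the running metrics; nothing is retained per row.
    fn push(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
        self.min = Some(self.min.map_or(value, |m| m.min(value)));
        self.max = Some(self.max.map_or(value, |m| m.max(value)));
    }

    fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// An exact median needs every value, so memory grows with the row count.
fn median(mut values: Vec<f64>) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    if values.len() % 2 == 0 {
        // Even count: average the two middle values.
        Some((values[mid - 1] + values[mid]) / 2.0)
    } else {
        Some(values[mid])
    }
}
```

Count, sum, min, and max update in place as rows stream past, while the median has to collect the whole column first, which is why restricting median to the columns that need it matters on very large inputs.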
Errors: `anyhow` contexts annotate origin (I/O, parse, schema, expression). The tool fails fast on unknown columns, invalid expressions, header mismatches, precision overflow, and unsupported mapping strategies.

Logging: Set `RUST_LOG=csv_managed=debug` for phase insights (inference voting, index selection, mapping application). Higher verbosity may impact throughput; toggle only when diagnosing.

Testing: Run `cargo test`. Integration tests cover inference, indexing, process flags, piping, and stats. Use `assert_cmd` for pipeline locking. Add new tests for any behavior that changes output formatting (update snapshots intentionally).
See consolidated backlog & release planning in [.plan/backlog.md](.plan/backlog.md) for upcoming features (join redesign, primary key indexes, batch definition ingestion, additional file formats).
- Fork & branch (`feat/<name>`).
- Add unit + integration tests.
- `cargo fmt && cargo clippy && cargo test` must pass.
- Update README sections or move features from roadmap to implemented list.
See LICENSE.
Open issues for bugs, enhancements, or documentation gaps. Pull requests welcome.
Join the community conversation, ask questions, or propose ideas in GitHub Discussions.
Deep dive sections removed from README and relocated to docs/. Use the Documentation Map above for full references.