csv-managed

csv-managed is a high‑performance Rust CLI for exploring, validating, transforming, and indexing very large CSV/TSV (and future delimited) datasets using streaming, typed schemas, and multi‑variant indexes.

Feature Matrix (Concise)

Area	Highlights
Delimiters & Encodings	Comma/tab/pipe/semicolon/custom; independent input/output encoding; stdin/stdout streaming
Schema Discovery	Sample or full scan inference; diff, overrides, placeholder normalization, snapshots
Header Detection	Automatic header/headerless with synthetic `field_#`; force via `--assume-header`
Datatype Transformations	Ordered `datatype_mappings` chains (parse, round, trim, case) before final typing
Decimal & Currency	Fixed `decimal(p,s)` (≤28 precision) and currency scale (2 or 4) enforcement
Indexing & Sorting	Multi-variant B-Tree index; longest matching prefix acceleration; covering expansion
Filtering & Derivation	Typed comparisons + Evalexpr expressions; temporal helpers; positional aliases
Verification	Streaming per-cell type enforcement; tiered invalid reporting
Statistics & Frequency	Numeric + temporal metrics; distinct counts with `--frequency` / `--top`
Append & Pipelines	Multi-file union with schema consistency; efficient chained stdin workflows
Boolean & Table Output	Configurable boolean formats; elastic preview/table rendering
Snapshots	Layout/inference regression guard (`--snapshot`)
Error & Logging	Contextual failures; debug logging for inference/index/mappings

Extended details moved to dedicated docs (see Documentation Map).

Global Documentation Table of Contents

Deep Dives (Docs Directory)

Quick Cross-Reference

Capability	Primary Doc
Inference algorithm details	schema-inference
Placeholder / NA handling	schema-inference, schema-examples
Decimal & Currency rules	schema-inference, datatype-mappings
Mapping strategies & strategies matrix	datatype-mappings
Overrides vs mappings vs replacements	schema-examples, datatype-mappings
Header detection heuristic	header-detection
Naming / snake_case rationale	naming-conventions
Snapshot vs verify comparison	snapshots-and-verification
Invalid reporting tiers	snapshots-and-verification
Index variant design & covering	indexing-and-sorting
Streaming pipeline safety (header shape)	pipelines
Encoding normalization patterns	encoding-normalization
Boolean output modes	boolean-formatting
Statistics aggregation & frequency counting	stats
Expressions functions, quoting, bucketing	expressions
Performance & logging guidance	operations
CLI option reference	cli-help

Use this TOC as a hub: internal anchors for quick orientation; deep dives for authoritative detail.

Documentation Map

Topic	Doc
Expressions (full reference & examples)	expressions
Indexing & Sorting internals	indexing-and-sorting
Multi-stage pipelines & header shape rules	pipelines
Schema inference internals	schema-inference
Schema command usage examples	schema-examples
Header detection algorithm & FAQ	header-detection
Naming conventions (snake_case rationale)	naming-conventions
Snapshots vs verification + reporting tiers	snapshots-and-verification
Boolean formatting & table output	boolean-formatting
Encoding normalization pipelines	encoding-normalization
Datatype mappings & transformation strategies	datatype-mappings
Statistics & frequency metrics	stats
Operational notes (performance, errors, logging, testing)	operations
CLI flag reference (captured help output)	cli-help

Roadmap/backlog: see the roadmap.

Quick Start

# 1. Infer schema
./target/release/csv-managed.exe schema infer -i ./data/orders.csv -o ./data/orders-schema.yml --sample-rows 0
# 2. Build indexes
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc -m ./data/orders-schema.yml
# 3. Typed processing (filters / derives / sort)
./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders-schema.yml -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter "status = shipped" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv
# 4. Stats (numeric & temporal)
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml
# 5. Frequency counts
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml --frequency --top 10
# 6. Preview (no file output allowed with --preview)
./target/release/csv-managed.exe process -i ./data/orders.csv --preview --limit 15

See extended examples in collapsible sections throughout this README.

Installation

cargo build --release

Binary (Windows): target\release\csv-managed.exe

From crates.io:

cargo install csv-managed

Local path dev install:

cargo install --path .

Helper command (wraps cargo install):

./target/release/csv-managed.exe install --locked

Environment logging examples:

$env:RUST_LOG='info'

set RUST_LOG=info

Core Concepts (Brief)

Schemas declare column order, types, optional renames, mapping chains, and replacements. Per-cell flow: raw → mappings → replacements → final parse. See docs/schema-inference.md and docs/schema-examples.md.

Header detection, naming guidance, and FAQ: docs/header-detection.md, docs/naming-conventions.md.

Overrides vs mappings vs replacements decision table: docs/schema-examples.md.

Verification tiers + snapshot comparison: docs/snapshots-and-verification.md.

Snapshot Internals (Deep Dive)

Snapshot includes: header+type hash (SHA-256), textual inference table, observation summaries. Hash changes on any header reorder or type change. Regenerate intentionally after approved inference adjustments.

Datatypes (Supported)

Type	Examples	Notes
String	any UTF‑8	Post-mapping names usable in expressions
Integer	`42`, `-7`	64-bit signed
Float	`3.14`, `2`	f64 (integers accepted)
Boolean	`true/false`, `yes/no`, `1/0`	Input variants normalized; output format selectable
Date	`2024-08-01`, `08/01/2024`	Canonical `YYYY-MM-DD`
DateTime	`2024-08-01T13:45:00`	Naive (no TZ)
Time	`06:00:00`, `14:30`	Canonical `HH:MM:SS`
Currency	`$12.34`, `123.4567`	Enforce 2 or 4 scale; symbol stripped
Decimal	`123.4567`, `(1,234.50)`	Fixed precision/scale ≤28
Guid	RFC 4122 hyphenated or 32hex	Case-insensitive

Expressions & Derived Logic (Overview)

Derived columns: --derive name=expr • Filters: --filter, --filter-expr • Positional aliases: c0, c1, ... • row_number when --row-numbers enabled.

Quick Cheat Sheets

Pattern	Example	Description
Arithmetic	`total_with_tax=amount*1.0825`	Multiply numeric column
Conditional flag	`high_value=if(amount>1000,1,0)`	1/0 indicator
Date diff	`ship_lag=date_diff_days(shipped_at,ordered_at)`	Days between dates
Time diff	`window=time_diff_seconds(end_time,start_time)`	Seconds difference
Concat	`channel_tag=concat(channel,"-",region)`	Combine strings
Guid passthrough	`id_copy=id`	Duplicate column
Row number	`row_index=row_number`	Sequential index

Aspect	`--filter`	`--filter-expr`
Operators	Basic typed comparisons	Full Evalexpr syntax
Logic	AND via repetition	AND/OR, nested `if`
Temporal helpers	Direct typed compare	`date_diff_days`, etc.
Complexity	Concise	Arbitrary expression

Full Expression Reference

Temporal Helpers: date_add, date_sub, date_diff_days, date_format, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, time_add_seconds, time_diff_seconds.

Pitfalls:

PowerShell quoting: wrap whole expression in single quotes, internal literals in double quotes.
c0 is first column (0-based). Verify mapping order.
row_number exists only if --row-numbers set.
Use helpers, not raw string comparisons, for temporal correctness.
Mapping chains precede replacements which precede final parse; expressions see normalized values.

Function Index (alphabetical): concat, date_add, date_diff_days, date_format, date_sub, datetime_add_seconds, datetime_diff_seconds, datetime_format, datetime_to_date, datetime_to_time, if, time_add_seconds, time_diff_seconds.

Debugging: Increase logging with RUST_LOG=csv_managed=debug. Future deep expression tracing may emit expr: prefixed debug lines.

Indexes & Sorting (Overview)

Indexes store byte offsets keyed by concatenated column values. A single .idx contains multiple named variants (different column sequences and directions). process chooses the variant with the longest matching prefix for a requested --sort unless --index-variant pins a specific one.

Building:

./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx \
  --spec default=order_date:asc,customer_id:asc \
  --spec recent=order_date:desc -m ./data/orders-schema.yml

Covering (--covering): Generate systematic direction/prefix permutations from a concise pattern (e.g. geo=date:asc|desc,customer:asc).

Fallback: When no index variant matches the entire sort signature, an in-memory stable multi-column sort executes (still streaming transforms earlier/later as possible).

Streaming & Pipelines (Overview)

Use -i - to read from stdin; schema strongly recommended for typed semantics. Each stage must explicitly declare stdin usage. Avoid header shape changes between typed stages unless you also provide a matching updated schema.

Core guidelines:

Keep early projections narrow.
Apply filters before sorting or heavy derives.
Normalize encodings up front (--input-encoding / --output-encoding).
Use --preview --limit for fast inspection; remove before chaining downstream.

Extended Pipeline Examples & Troubleshooting

Filter then stats:

Get-Content .\tests\data\big_5_players_stats_2023_2024.csv | \
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml \
  --filter "Performance_Gls >= 10" --limit 40 | \
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls

Append with one streamed input:

Get-Content .\tests\data\big_5_players_stats_2023_2024.csv | \
  .\target\release\csv-managed.exe append -i - -i .\tmp\big_5_preview.csv \
  --schema .\tests\data\big_5_players_stats-schema.yml -o .\tmp\players_union.csv

Encoding normalization:

Get-Content .\tmp\big_5_windows1252.csv | \
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 \
  --schema .\tests\data\big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table

Troubleshooting:

Symptom	Cause	Fix
Hang	Upstream not producing	Add `--preview --limit` to inspect
Column not found	Rename/mapping changed	Re-check `schema columns` / header output
Zero stats rows	Filters excluded all rows	Relax/remove filters
Invalid datatype downstream	Schema mismatch	Supply correct schema per stage

Command Guide (Summary)

Concise flag references; see concept sections for deep behavior.

schema

Probe, infer, verify, list columns, diff and snapshot inference output.

Sub/Flag	Summary
`probe`	Inference preview table (no file)
`infer`	Inference + optional write (`-o`) + diff/snapshot integration
`verify`	Streaming type & replacement validation
`columns`	Tabular listing of schema columns
`--snapshot`	Layout regression guard
`--diff <schema>`	Unified diff vs existing schema
`--assume-header`	Override header detection
`--mapping`	Emit mapping scaffold & snake_case suggestions
`--replace-template`	Inject empty `replace` arrays

process

Transform & emit rows: filtering, derives, column selection, sorting (indexed or fallback), boolean formatting, row numbering, preview/table output.

stats

Numeric & temporal summary metrics; --frequency for distinct counts; filter integration.

append

Concatenate multiple CSV inputs enforcing header/schema consistency.

index

Build multi-variant B-tree index files (--spec, --covering) for accelerated sort alignment.

install

Wrapper around cargo install csv-managed (version / force / locked / root flags).

schema columns

List schema-declared columns and datatypes (resolves renames).

Advanced Topics

Performance Considerations

Indexed sort avoids retaining all rows in memory.
Early filtering diminishes downstream CPU & sort footprint.
Median requires buffering column values; limit wide median usage on huge datasets.
Decimal & currency parsing add overhead—declare only where needed.

Error Handling

anyhow contexts annotate origin (I/O, parse, schema, expression). Fast failure on unknown columns, invalid expressions, header mismatches, precision overflow, unsupported mapping strategies.

Logging

Set RUST_LOG=csv_managed=debug for phase insights (inference voting, index selection, mapping application). Higher verbosity may impact throughput—toggle only when diagnosing.

Testing

Run cargo test. Integration tests cover inference, indexing, process flags, piping, stats. Use assert_cmd for pipeline locking. Add new tests for any behavior that changes output formatting (update snapshots intentionally).

Roadmap

See consolidated backlog & release planning in [.plan/backlog.md](.plan/backlog.md) for upcoming features (join redesign, primary key indexes, batch definition ingestion, additional file formats).

Contributing

Fork & branch (feat/<name>).
Add unit + integration tests.
cargo fmt && cargo clippy && cargo test must pass.
Update README sections or move features from roadmap to implemented list.

License

See LICENSE.

Support

Open issues for bugs, enhancements, or documentation gaps. Pull requests welcome.

Join the community conversation, ask questions, or propose ideas in GitHub Discussions.

Documentation Notes

Deep dive sections removed from README and relocated to docs/. Use the Documentation Map above for full references.

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github		.github
.plan		.plan
.specify		.specify
.todos		.todos
.vscode		.vscode
benches		benches
docs		docs
src		src
tests		tests
.gitignore		.gitignore
.markdownlint.json		.markdownlint.json
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
gemini.md		gemini.md

License

softwaresalt/csv-managed

Folders and files

Latest commit

History

Repository files navigation

csv-managed

Feature Matrix (Concise)

Global Documentation Table of Contents

Core (This README)

Deep Dives (Docs Directory)

Quick Cross-Reference

Documentation Map

Quick Start

Installation

Core Concepts (Brief)

Snapshot Internals (Deep Dive)

Datatypes (Supported)

Expressions & Derived Logic (Overview)

Quick Cheat Sheets

Full Expression Reference

Indexes & Sorting (Overview)

Streaming & Pipelines (Overview)

Extended Pipeline Examples & Troubleshooting

Command Guide (Summary)

schema

process

stats

append

index

install

schema columns

Advanced Topics

Performance Considerations

Error Handling

Logging

Testing

Roadmap

Contributing

License

Support

Documentation Notes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages