"Gentle Introduction to Arrow / Record Batches" #11336 #18051

sm4rtm4art · 2025-10-14T10:49:32Z

Which issue does this PR close?

Closes Add a "Gentle Introduction to Arrow / Record Batches" #11336

Since this is my first contribution, I suppose to mention @alamb , author of the Issue #11336

Could you please trigger the CI? Thanks!

Rationale for this change

The Arrow introduction guide (#11336) needed improvements to make it more accessible for newcomers while providing better navigation to advanced topics.

What changes are included in this PR?

Issue #11336 requested a gentle introduction to Apache Arrow and RecordBatches to help DataFusion users understand the foundational concepts. This PR enhances the existing Arrow introduction guide with clearer explanations, practical examples, visual aids, and comprehensive navigation links to make it more accessible for newcomers while providing pathways to advanced topics.

Was unsure if this fits to `docs/source/user-guide/dataframe.md'

Are these changes tested?

applyed prettier, like described.

Are there any user-facing changes?

Yes - improved documentation for the Arrow introduction guide at docs/source/user-guide/arrow-introduction.md

This adds a new user guide page addressing issue apache#11336 to provide a gentle introduction to Apache Arrow and RecordBatches for DataFusion users. The guide includes: - Explanation of Arrow as a columnar specification - Visual comparison of row vs columnar storage (with ASCII diagrams) - Rationale for RecordBatch-based streaming (memory + vectorization) - Practical examples: reading files, building batches, querying with MemTable - Clear guidance on when Arrow knowledge is needed (extension points) - Links back to DataFrame API and library user guide - Link to DataFusion Invariants for contributors who want to go deeper This helps users understand the foundation without getting overwhelmed, addressing feedback from PR apache#11290 that DataFrame examples 'throw people into the deep end of Arrow.'

…navigation - Add explanation of Arc and ArrayRef for Rust newcomers - Add visual diagram showing RecordBatch streaming through pipeline - Make common pitfalls more concrete with specific examples - Emphasize Arrow's unified type system as DataFusion's foundation - Add comprehensive API documentation links throughout document - Link to extension points guides (TableProvider, UDFs, custom operators) These improvements make the Arrow introduction more accessible for newcomers while providing clear navigation paths to advanced topics for users extending DataFusion. Addresses apache#11336

Run prettier to fix markdown link reference formatting (lowercase convention)

Apply prettier formatting to fix pre-existing formatting issues in: - query-optimizer.md - concepts-readings-events.md - scalar_functions.md

Jefffrey

Thanks for picking this up; I left a few comments, though my overall thoughts are that the guide as is feels a little disjointed in some of the information that is presented and is confusing to me as I don't know what preexisting knowledge it assumes of readers. Maybe the article would benefit from having a tighter focus and leaving more verbose details to external links (such as the Arrow docs).

Then again I'm not coming from a fresh user perspective so I'm a biased in that regard 😅

Jefffrey · 2025-10-14T14:13:18Z

docs/source/user-guide/dataframe.md

+---
+
+# REFERENCES
+


These links below aren't actually visible so don't think this header is necessary

Sorry, forgot this.
I wanted to implement the gentle introduction to Arraw in the Dataframe part based on the idea coming from the smalest part to a bigger one, but I guess this would be overwhelming and way to long for a user guid. In addition I've learned, that those references are not viible...

Jefffrey · 2025-10-14T14:28:25Z