
Conversation

sm4rtm4art

Which issue does this PR close?

Since this is my first contribution, I suppose I should mention @alamb, the author of issue #11336.

Could you please trigger the CI? Thanks!

Rationale for this change

The Arrow introduction guide (#11336) needed improvements to make it more accessible for newcomers while providing better navigation to advanced topics.

What changes are included in this PR?

Issue #11336 requested a gentle introduction to Apache Arrow and RecordBatches to help DataFusion users understand the foundational concepts. This PR enhances the existing Arrow introduction guide with clearer explanations, practical examples, visual aids, and comprehensive navigation links to make it more accessible for newcomers while providing pathways to advanced topics.

I was unsure whether this would fit better in `docs/source/user-guide/dataframe.md`.

Are these changes tested?

Applied prettier, as described.

Are there any user-facing changes?

Yes - improved documentation for the Arrow introduction guide at docs/source/user-guide/arrow-introduction.md

Martin added 2 commits October 14, 2025 00:13
This adds a new user guide page addressing issue apache#11336 to provide
a gentle introduction to Apache Arrow and RecordBatches for DataFusion users.

The guide includes:
- Explanation of Arrow as a columnar specification
- Visual comparison of row vs columnar storage (with ASCII diagrams)
- Rationale for RecordBatch-based streaming (memory + vectorization)
- Practical examples: reading files, building batches, querying with MemTable
- Clear guidance on when Arrow knowledge is needed (extension points)
- Links back to DataFrame API and library user guide
- Link to DataFusion Invariants for contributors who want to go deeper

This helps users understand the foundation without getting overwhelmed,
addressing feedback from PR apache#11290 that DataFrame examples 'throw people
into the deep end of Arrow.'
…navigation

- Add explanation of Arc and ArrayRef for Rust newcomers
- Add visual diagram showing RecordBatch streaming through pipeline
- Make common pitfalls more concrete with specific examples
- Emphasize Arrow's unified type system as DataFusion's foundation
- Add comprehensive API documentation links throughout document
- Link to extension points guides (TableProvider, UDFs, custom operators)

These improvements make the Arrow introduction more accessible for
newcomers while providing clear navigation paths to advanced topics
for users extending DataFusion.

Addresses apache#11336
@github-actions bot added the `documentation` label (Improvements or additions to documentation) Oct 14, 2025
Martin added 3 commits October 14, 2025 14:25
Run prettier to fix markdown link reference formatting (lowercase convention)
Apply prettier formatting to fix pre-existing formatting issues in:
- query-optimizer.md
- concepts-readings-events.md
- scalar_functions.md
Contributor

@Jefffrey Jefffrey left a comment

Thanks for picking this up; I left a few comments, though my overall thoughts are that the guide as-is feels a little disjointed in some of the information presented, and it is confusing to me as I don't know what preexisting knowledge it assumes of readers. Maybe the article would benefit from having a tighter focus and leaving more verbose details to external links (such as the Arrow docs).

Then again, I'm not coming from a fresh user perspective, so I'm biased in that regard 😅

Comment on lines +114 to +117
---

# REFERENCES

Contributor

These links below aren't actually visible, so I don't think this header is necessary.

Author

Sorry, forgot this.
I wanted to implement the gentle introduction to Arrow in the DataFrame part, based on the idea of going from the smallest piece to a bigger one, but I guess this would be overwhelming and way too long for a user guide. In addition, I've learned that those references are not visible...

Comment on lines +69 to +72
- **Vectorized Execution**: Process entire columns at once using SIMD instructions
- **Better Compression**: Similar values stored together compress more efficiently
- **Cache Efficiency**: Scanning specific columns doesn't load unnecessary data
- **Zero-Copy Data Sharing**: Systems can share Arrow data without conversion overhead
Contributor

I would be hesitant to mention compression here; being an in-memory format, Arrow isn't typically compressed (as compared to something like Parquet)

Author

You're absolutely right - I was thinking way more down the line, I was conflating storage format benefits with in-memory benefits. Arrow's columnar layout enables better compression when written to disk (like Parquet), but that's not relevant for the in-memory processing context. I'll remove this point or rephrase to focus on the actual in-memory benefits like cache efficiency and SIMD operations.

Comment on lines +96 to +100
**Key Properties**:

- Arrays are immutable (create new batches to modify data)
- NULL values tracked via efficient validity bitmaps
- Variable-length data (strings, lists) use offset arrays for efficient access
Contributor

I feel the last two properties are a bit mismatched here; they are instead properties of arrays and not RecordBatches, but more importantly, in a guide that is meant to be a gentle introduction, they seem to be placed here randomly. If someone were to read "Variable-length data (strings, lists) use offset arrays for efficient access", there isn't much to glean from that information (that is relevant to the overall theme of the guide) 🤔

Author

Good catch!
I mixed RecordBatch properties with Array implementation details. These technical details don't help someone understand "why RecordBatches" at a conceptual level. I'll either:

  1. Remove these details entirely, OR
  2. Reframe as "What this means for users" (e.g., "Data is immutable, so operations create new batches rather than modifying existing ones")

The offset arrays detail especially doesn't belong in a gentle introduction.

Would you prefer I remove this section or refocus it on user-facing implications?

Contributor

I do feel it is worth mentioning the immutable aspect of RecordBatches/Arrays, as that is an important detail if you want to get hands on.

Comment on lines +86 to +90
### Why Not Process Single Rows?

- **Lost Vectorization**: Can't use SIMD instructions on single values
- **Poor Cache Utilization**: Jumping between rows defeats CPU cache optimization
- **High Overhead**: Managing individual rows has significant bookkeeping costs
Contributor

This section feels a bit misplaced, as some of these downsides were mentioned right above under "Why this matters", so it feels a little inconsistent to have the points stated again right below

Author

You're right - I essentially repeated the same points. My intention was to show the progression from "too big" (entire table) → "too small" (single rows) → "just right" (batches), but I see it reads as repetitive.

I'll consolidate into a single "Why batches are the sweet spot" section that covers both extremes concisely without redundancy.

Do you have suggestions I might not be seeing?

Contributor

We could even simplify it to a single line like "it's more efficient to process data in batches etc." and have more focus on how/when you interact with the record batches directly, rather than having details on why DF uses recordbatches (if the point of the guide is to ease users into getting familiar with interacting with arrow api)


## What is a RecordBatch? (And Why Batch?)

A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.
Contributor

Suggested change
A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.
A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays that form a common schema.

I'm not sure about this wording either, but it feels slightly wrong to call the schema as being shared by arrays 🤔

Author

Good point about the wording.

How about:

A RecordBatch represents a horizontal slice of a table—a collection of equal-length columnar arrays that conform to a defined schema.

This makes it clearer that the schema defines the structure, and the arrays conform to it, rather than "sharing" it.

Contributor

I like the wording of that 👍


Sometimes you need to create Arrow data programmatically rather than reading from files. This example shows the core building blocks: creating typed arrays (like [`Int32Array`] for numbers), defining a [`Schema`] that describes your columns, and assembling them into a [`RecordBatch`].

You'll notice [`Arc`] ([Atomically Reference Counted](https://doc.rust-lang.org/std/sync/struct.Arc.html)) is used frequently—this is how Arrow enables efficient, zero-copy data sharing. Instead of copying data, different parts of the query engine can safely share read-only references to the same underlying memory. [`ArrayRef`] is simply a type alias for `Arc<dyn Array>`, representing a reference to any Arrow array type.
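As an aside for readers of this thread, here is a minimal sketch of the building blocks that paragraph describes, assuming the standard `arrow` crate APIs (`Int32Array`, `Schema`, `Field`, `RecordBatch`), which DataFusion also re-exports under `datafusion::arrow`:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Each column is a typed, immutable Arrow array, shared via Arc (ArrayRef = Arc<dyn Array>).
    let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let names: ArrayRef = Arc::new(StringArray::from(vec!["alice", "bob", "carol"]));

    // The schema describes the columns the arrays must conform to.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));

    // A RecordBatch ties the schema and the equal-length columns together.
    let batch = RecordBatch::try_new(schema, vec![ids, names])?;
    assert_eq!(batch.num_rows(), 3);
    Ok(())
}
```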
Contributor

This wording implies Arc is the key to Arrow, though it can be misleading considering that's more of an implementation detail on the Rust side 🤔

Author

You're absolutely right - Arc is a Rust implementation detail, not core to understanding Arrow conceptually. I included it because users will see Arc/ArrayRef in code examples, but I'm giving it too much emphasis. I'll either:

  1. Move the Arc explanation to a small note: "Note: You'll see Arc in Rust code - it's how Rust safely shares data between threads"
  2. Remove it entirely and let users learn about Arc when they actually need to write code

Which approach would you prefer?

Contributor

I lean toward the latter but I don't know how user friendly that might turn out 🤔

Maybe just add a small footnote about DataFusion being built around async + having pointers to the arrays themselves = use of Arc frequently to wrap these data structures

2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge

## Further reading
Contributor

I feel we can trim some of these references; for example including IPC is probably unnecessary for the goal of this guide.

Author

Agreed. Thank you for helping me tighten the focus and leave the more verbose details to external links.

IPC is too deep for this guide's scope. I'll trim the references to focus on:

  • Main Arrow documentation (for those wanting to go deeper)
  • DataFusion-specific references (MemTable, TableProvider, DataFrame)
  • The academic paper (for those interested in the theory)

I'll remove IPC, memory layout internals, and other implementation-focused references.

Comment on lines +239 to +249
## Next Steps: Working with DataFrames

Now that you understand Arrow's RecordBatch format, you're ready to work with DataFusion's high-level APIs. The [DataFrame API](dataframe.md) provides a familiar, ergonomic interface for building queries without needing to think about Arrow internals most of the time.

The DataFrame API handles all the Arrow details under the hood - reading files into RecordBatches, applying transformations, and producing results. You only need to drop down to the Arrow level when implementing custom data sources, UDFs, or other extension points.

**Recommended reading order:**

1. [DataFrame API](dataframe.md) - High-level query building interface
2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge
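For context on the excerpt above, a rough sketch of what "handling the Arrow details under the hood" means in practice, assuming DataFusion's `SessionContext` and a made-up `example.csv` path and `id` column:

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // The DataFrame API reads the file into RecordBatches internally;
    // "example.csv" and the "id" column are placeholders for illustration.
    let df = ctx
        .read_csv("example.csv", CsvReadOptions::new())
        .await?
        .filter(col("id").gt(lit(10)))?;

    // Arrow only surfaces at the edges: collect() returns Vec<RecordBatch>.
    let batches: Vec<RecordBatch> = df.collect().await?;
    println!("collected {} batches", batches.len());
    Ok(())
}
```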
Contributor

It feels weird to have a Next steps about working with the DataFrame API, given this guide itself is meant to be an introduction to Arrow for DataFusion users who may not need to use Arrow directly.

Author

I see your point - if users don't need Arrow directly, why guide them to DataFrames?

My thinking was: users reading this guide are trying to understand the foundation before using DataFusion. But you're right that it creates a circular path. Would it be better to:

  • Remove "Next Steps" entirely, OR
  • Reframe as "When you'll encounter Arrow" focusing on the extension points where Arrow knowledge becomes necessary?

The second option would reinforce that most users can stay at the DataFrame level. (See the earlier comment about dataframe.md, where I originally wanted to place the introduction to Arrow.)

Contributor

If going with the second option, it might be a little odd to put it at the end of an article that is introducing users to recordbatch/arrow api as usually you'd provide a justification/reason upfront for why you might need this (to highlight why users would need to read the guide in the first place) 🤔

Maybe can just have the recommended reading links, but put some descriptions for each link so users would know why they might be interested in checking out the links (e.g. "understand arrow internals", "creating your own udf efficiently")

- **interval**: Bin interval.
- **expression**: Time expression to operate on. Can be a constant, column, or function.
- **origin-timestamp**: Optional. Starting point used to determine bin boundaries. If not specified defaults 1970-01-01T00:00:00Z (the UNIX epoch in UTC). The following intervals are supported:

Contributor

this file is autogenerated; if you want to change the doc, please change the user doc for `pub struct DateBinFunc`

Author

Sorry, in my first attempt I used prettier on all files and it "fixed" this one... I thought I was doing good by fixing it.

Contributor

@comphead comphead left a comment

Thanks @sm4rtm4art, I would delegate the RecordBatch description, pros/cons, and examples to Arrow itself, otherwise it would be quite complicated to keep the documentation in sync. WDYT?

@sm4rtm4art
Author

sm4rtm4art commented Oct 16, 2025

Thank you both for your valuable feedback! @Jefffrey @comphead

My intention was to create a gentle introduction that gives users just enough context about Arrow to understand DataFusion better, without diving into implementation details. I hear your point about it feeling disjointed. To address this while keeping the gentle introduction approach, I propose:

  1. Tighten the narrative flow: Focus on a single journey - "Why DataFusion uses Arrow" → "What is a RecordBatch conceptually" → "When you'll encounter it"
  2. Move technical details to footnotes or links: Keep implementation details (like Arc, offset arrays) as brief notes or external links
  3. Clarify assumed knowledge upfront: Add a brief "Who this guide is for" section stating we assume basic DataFusion knowledge but no Arrow background

The goal is to give users mental models, not implementation knowledge. Would this approach address your concerns about focus?

@comphead: I understand your maintenance concerns. My approach would be to provide:

  1. Conceptual overview - Brief explanation of what RecordBatch is and why it matters to DataFusion users
  2. Practical code example - The simplest example showing how it looks in practice (like the current "build a RecordBatch" example)
  3. Direct links to Arrow docs - For readers who want deeper technical details

This way, we give users enough context to understand DataFusion without duplicating Arrow's technical documentation. The guide would serve as a bridge - explaining the "why" and showing the "what it looks like", while Arrow's docs handle the detailed "how it works internally".

Would this three-part approach (concept → example → link to details) work for you? It keeps our maintenance burden low while still providing value to users who encounter RecordBatch in DataFusion code.

Hope I don't go overboard with the text.

Edit:
One thing I didn’t call out that I think would be valuable is a short note on data types. Readers hit RecordBatch alongside Arrow’s type system (e.g., Int64, Utf8, Timestamp[TZ], List, Struct) and null semantics. Should this live here as a small “Data types at a glance” box, or should we open a separate issue/page and keep this PR focused on RecordBatch?

If included here, I’d keep it tight:

  • Logical vs physical types (one-sentence overview)
  • Common mappings to SQL types
  • Nulls and casting/promotion gotchas
  • Link to Arrow docs for full details (timestamps/time zones, nested types)
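If such a box were added, a small sketch of the kind of Arrow-to-SQL mapping it could show (the column names and types below are invented for illustration, not taken from the guide):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema, TimeUnit};

fn main() {
    // Hypothetical schema illustrating common Arrow <-> SQL type mappings.
    let schema = Schema::new(vec![
        // Arrow Int64 ~ SQL BIGINT; nullable, so the array carries a validity bitmap.
        Field::new("order_id", DataType::Int64, true),
        // Arrow Utf8 ~ SQL VARCHAR / TEXT.
        Field::new("customer", DataType::Utf8, true),
        // Arrow Timestamp(Nanosecond, "UTC") ~ SQL TIMESTAMP WITH TIME ZONE.
        Field::new(
            "created_at",
            DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())),
            false,
        ),
    ]);
    println!("{schema:#?}");
}
```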
