[Python][Types] Type stub improvements for better coverage with Arrow IPC and compute operations #48711

@rustyconover

Description

While integrating pyarrow-stubs into a project using Arrow IPC streaming, I encountered several type annotation gaps that required workarounds (# type: ignore comments or cast() calls). This issue documents these gaps to help improve stub coverage.

Environment:

  • pyarrow-stubs version: 17.11
  • pyarrow version: 19.0.1
  • mypy version: 1.14.1
  • Python version: 3.12

Issues Found

  1. pa.PythonFile constructor doesn't accept standard file-like objects

Problem: The stub signature for the PythonFile constructor is too restrictive: it doesn't accept IO[bytes] or io.BufferedIOBase objects without an explicit cast.

Workaround required:

import io
import subprocess
import sys
from typing import cast

import pyarrow as pa

# Placeholder subprocess so the snippet is self-contained:
proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE)

# This requires a cast:
stdin_sink = pa.PythonFile(cast(io.IOBase, proc.stdin))

# Similarly for stdout:
pa.PythonFile(cast(io.IOBase, sys.stdout.buffer), mode="w")

Expected: PythonFile.__init__ should accept IO[bytes], io.BufferedIOBase, or typing.BinaryIO.
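
A stub signature along these lines would cover the common cases (a sketch only; the handle and mode parameter names follow the pyarrow documentation and may not match the current stubs exactly):

from typing import IO

# Fragment of the pyarrow stub; NativeFile is defined in the same module.
class PythonFile(NativeFile):
    def __init__(self, handle: IO[bytes], mode: str | None = None) -> None: ...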


  2. pa.BufferReader incompatible with pa.ipc.read_schema()

Problem: When passing a BufferReader to ipc.read_schema(), mypy reports an argument type error.

Workaround required:

import pyarrow as pa

output_schema_bytes: bytes = ...  # placeholder: serialized schema bytes
output_schema = pa.ipc.read_schema(pa.BufferReader(output_schema_bytes))  # type: ignore[arg-type]

Expected: ipc.read_schema() should accept BufferReader (or its parent NativeFile) as a valid input type.
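
In the stub, the parameter type could be widened roughly like this (a sketch, not the stubs' actual code; the optional dictionary_memo argument from the runtime signature is omitted for brevity):

from typing import IO

from pyarrow import Buffer, NativeFile, Schema

def read_schema(obj: Buffer | bytes | NativeFile | IO[bytes]) -> Schema: ...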


  3. pa.schema() field list typing is overly restrictive

Problem: Creating a schema from a list of tuples [("name", pa.string())] or pa.Field objects causes type errors.

Workaround required:

from typing import Any

import pyarrow as pa

def make_schema(fields: list[Any]) -> pa.Schema:
    """Helper to avoid mypy errors with field lists."""
    return pa.schema(fields)

# Usage:
schema = make_schema([("x", pa.int64()), ("y", pa.string())])
schema = make_schema([pa.field("x", pa.int64())])

Expected: pa.schema() should accept any of the following (a possible stub signature is sketched after the list):

  • list[tuple[str, DataType]]
  • list[Field]
  • Iterable[tuple[str, DataType] | Field]
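
A single signature can cover all three forms (a sketch, not the stubs' actual code; the metadata parameter follows the runtime signature of pa.schema):

from collections.abc import Iterable

from pyarrow import DataType, Field, Schema

def schema(
    fields: Iterable[Field | tuple[str, DataType]],
    metadata: dict[bytes | str, bytes | str] | None = None,
) -> Schema: ...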

  4. pyarrow.compute.filter() missing RecordBatch overload

Problem: pc.filter() works with RecordBatch at runtime, but the stubs only define overloads for Array and ChunkedArray.

Workaround required:

import pyarrow as pa
import pyarrow.compute as pc

batch: pa.RecordBatch = ...  # placeholder
mask: pa.BooleanArray = ...  # placeholder
result = pc.filter(batch, mask)  # type: ignore[call-overload]

Expected: Add overload for RecordBatch:

from typing import Literal, overload

from pyarrow import Array, ChunkedArray, RecordBatch

@overload
def filter(
    values: RecordBatch,
    selection_filter: Array | ChunkedArray,
    /,
    null_selection_behavior: Literal["drop", "emit_null"] = ...,
) -> RecordBatch: ...
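
At runtime the call already succeeds, which is easy to confirm (a minimal demonstration, independent of the stubs):

import pyarrow as pa
import pyarrow.compute as pc

batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})
mask = pa.array([True, False, True])

filtered = pc.filter(batch, mask)
print(filtered.num_rows)  # 2 -- RecordBatch is accepted at runtime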

  5. pa.Scalar generic requires TYPE_CHECKING import pattern

Problem: Using pa.Scalar[T] in an annotation that is evaluated at runtime raises a TypeError, because Scalar is generic in the stubs but not subscriptable at runtime.

Current pattern required:

from __future__ import annotations  # defer annotation evaluation so the guarded import is enough

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    from pyarrow import Scalar

# Then use as:
positional: tuple[Scalar[Any] | None, ...] = ()
named: dict[str, Scalar[Any]] = {}

This is a minor issue but worth noting for documentation.
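
Quoting the annotations achieves the same effect without the __future__ import (a minimal sketch):

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    from pyarrow import Scalar

positional: "tuple[Scalar[Any] | None, ...]" = ()
named: "dict[str, Scalar[Any]]" = {}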

Component(s)

Python
