Commit b5e4b0e

fix: be more cautious when guessing what a backend can open (#10804)
* fix: be more more caution when claiming a backend can open a URL
* add whats new entry
* fixes from review
* more caution in scipy netcdf backend
* correct suffix detection for scipy backend
* stricter URL detection for netcdf/dap
* no query params for h5netcdf
* scipy no urls
* don't try to read magic numbers for remote uris
* review comments
* fix windows failures
* docs on backend resolution
* more complete table
* no horizontal scroll on table
* fix whats new header
* correct description
* case insensitivity to DAP: vs dap:
* thredds
* move import
* claude import rules
* has_pydap instead of requires pydap
1 parent f1b11fc commit b5e4b0e

10 files changed: +390 -25 lines changed

CLAUDE.md

Lines changed: 12 additions & 0 deletions
@@ -20,6 +20,18 @@ pre-commit run --all-files # Includes ruff and other checks
 uv run dmypy run # Type checking with mypy
 ```

+## Code Style Guidelines
+
+### Import Organization
+
+- **Always place imports at the top of the file** in the standard import section
+- Never add imports inside functions or nested scopes unless there's a specific
+  reason (e.g., circular import avoidance, optional dependencies in TYPE_CHECKING)
+- Group imports following PEP 8 conventions:
+  1. Standard library imports
+  2. Related third-party imports
+  3. Local application/library specific imports
+
 ## GitHub Interaction Guidelines

 - **NEVER impersonate the user on GitHub**, always sign off with something like

doc/user-guide/io.rst

Lines changed: 178 additions & 0 deletions
@@ -112,6 +112,182 @@ You can learn more about using and developing backends in the
     linkStyle default font-size:18pt,stroke-width:4


+.. _io.backend_resolution:
+
+Backend Selection
+-----------------
+
+When opening a file or URL without explicitly specifying the ``engine`` parameter,
+xarray automatically selects an appropriate backend based on the file path or URL.
+The backends are tried in order: **netcdf4 → h5netcdf → scipy → pydap → zarr**.
+
+.. note::
+   You can customize the order in which netCDF backends are tried using the
+   ``netcdf_engine_order`` option in :py:func:`~xarray.set_options`:
+
+   .. code-block:: python
+
+       # Prefer h5netcdf over netcdf4
+       xr.set_options(netcdf_engine_order=['h5netcdf', 'netcdf4', 'scipy'])
+
+   See :ref:`options` for more details on configuration options.
+
+The following tables show which backend will be selected for different types of URLs and files.
+
+.. important::
+   ✅ means the backend will **guess it can open** the URL or file based on its path, extension,
+   or magic number, but this doesn't guarantee success. For example, not all Zarr stores are
+   xarray-compatible.
+
+   ❌ means the backend will not attempt to open it.
+
+Remote URL Resolution
+~~~~~~~~~~~~~~~~~~~~~
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 10 10 10 10 10
+
+   * - URL
+     - :ref:`netcdf4 <io.netcdf>`
+     - :ref:`h5netcdf <io.hdf5>`
+     - :ref:`scipy <io.netcdf>`
+     - :ref:`pydap <io.opendap>`
+     - :ref:`zarr <io.zarr>`
+   * - ``https://example.com/store.zarr``
+     - ❌
+     - ❌
+     - ❌
+     - ❌
+     - ✅
+   * - ``https://example.com/data.nc``
+     - ✅
+     - ✅
+     - ❌
+     - ❌
+     - ❌
+   * - ``http://example.com/data.nc?var=temp``
+     - ✅
+     - ❌
+     - ❌
+     - ❌
+     - ❌
+   * - ``http://example.com/dap4/data.nc?var=x``
+     - ✅
+     - ❌
+     - ❌
+     - ✅
+     - ❌
+   * - ``dap2://opendap.nasa.gov/dataset``
+     - ❌
+     - ❌
+     - ❌
+     - ✅
+     - ❌
+   * - ``https://example.com/DAP4/data``
+     - ❌
+     - ❌
+     - ❌
+     - ✅
+     - ❌
+   * - ``http://test.opendap.org/dap4/file.nc4``
+     - ✅
+     - ✅
+     - ❌
+     - ✅
+     - ❌
+   * - ``https://example.com/DAP4/data.nc``
+     - ✅
+     - ✅
+     - ❌
+     - ✅
+     - ❌
+
+Local File Resolution
+~~~~~~~~~~~~~~~~~~~~~
+
+For local files, backends first try to read the file's **magic number** (first few bytes).
+If the magic number **cannot be read** (e.g., file doesn't exist, no permissions), they fall
+back to checking the file **extension**. If the magic number is readable but invalid, the
+backend returns False (does not fall back to extension).
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 20 10 10 10 10
+
+   * - File Path
+     - Magic Number
+     - :ref:`netcdf4 <io.netcdf>`
+     - :ref:`h5netcdf <io.hdf5>`
+     - :ref:`scipy <io.netcdf>`
+     - :ref:`zarr <io.zarr>`
+   * - ``/path/to/file.nc``
+     - ``CDF\x01`` (netCDF3)
+     - ✅
+     - ❌
+     - ✅
+     - ❌
+   * - ``/path/to/file.nc4``
+     - ``\x89HDF\r\n\x1a\n`` (HDF5/netCDF4)
+     - ✅
+     - ✅
+     - ❌
+     - ❌
+   * - ``/path/to/file.nc.gz``
+     - ``\x1f\x8b`` + ``CDF`` inside
+     - ❌
+     - ❌
+     - ✅
+     - ❌
+   * - ``/path/to/store.zarr/``
+     - (directory)
+     - ❌
+     - ❌
+     - ❌
+     - ✅
+   * - ``/path/to/file.nc``
+     - *(no magic number)*
+     - ✅
+     - ✅
+     - ✅
+     - ❌
+   * - ``/path/to/file.xyz``
+     - ``CDF\x01`` (netCDF3)
+     - ✅
+     - ❌
+     - ✅
+     - ❌
+   * - ``/path/to/file.xyz``
+     - ``\x89HDF\r\n\x1a\n`` (HDF5/netCDF4)
+     - ✅
+     - ✅
+     - ❌
+     - ❌
+   * - ``/path/to/file.xyz``
+     - *(no magic number)*
+     - ❌
+     - ❌
+     - ❌
+     - ❌
+
+.. note::
+   Remote URLs ending in ``.nc`` are **ambiguous**:
+
+   - They could be netCDF files stored on a remote HTTP server (readable by ``netcdf4`` or ``h5netcdf``)
+   - They could be OPeNDAP/DAP endpoints (readable by ``netcdf4`` with DAP support or ``pydap``)
+
+   These interpretations are fundamentally incompatible. If xarray's automatic
+   selection chooses the wrong backend, you must explicitly specify the ``engine`` parameter:
+
+   .. code-block:: python
+
+       # Force interpretation as a DAP endpoint
+       ds = xr.open_dataset("http://example.com/data.nc", engine="pydap")
+
+       # Force interpretation as a remote netCDF file
+       ds = xr.open_dataset("https://example.com/data.nc", engine="netcdf4")
+
+
 .. _io.netcdf:

 netCDF
@@ -1213,6 +1389,8 @@ See for example : `ncdata usage examples`_
 .. _Ncdata: https://ncdata.readthedocs.io/en/latest/index.html
 .. _ncdata usage examples: https://github.com/pp-mo/ncdata/tree/v0.1.2?tab=readme-ov-file#correct-a-miscoded-attribute-in-iris-input

+.. _io.opendap:
+
 OPeNDAP
 -------
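
The "tried in order" rule described in the new Backend Selection section can be sketched in plain Python. This is a minimal illustration, not xarray's implementation: `select_backend`, `DEFAULT_ORDER`, and `REMOTE_TABLE` are hypothetical names, the claims are copied by hand from a few rows of the Remote URL Resolution table, and the sketch assumes every backend is installed and that the first backend to claim a URL wins.

```python
from __future__ import annotations

# Default order quoted in the documentation added by this commit.
DEFAULT_ORDER = ("netcdf4", "h5netcdf", "scipy", "pydap", "zarr")


def select_backend(
    claims: dict[str, bool], order: tuple[str, ...] = DEFAULT_ORDER
) -> str | None:
    """Return the first backend in `order` that claims the input, else None."""
    for name in order:
        if claims.get(name, False):
            return name
    return None


# Hand-copied rows of the Remote URL Resolution table (True means a check mark).
REMOTE_TABLE = {
    "https://example.com/store.zarr": {"zarr": True},
    "https://example.com/data.nc": {"netcdf4": True, "h5netcdf": True},
    "dap2://opendap.nasa.gov/dataset": {"pydap": True},
    "http://test.opendap.org/dap4/file.nc4": {
        "netcdf4": True,
        "h5netcdf": True,
        "pydap": True,
    },
}

# Hypothetical reordering, analogous in spirit to netcdf_engine_order.
H5_FIRST = ("h5netcdf", "netcdf4", "scipy", "pydap", "zarr")

if __name__ == "__main__":
    for url, claims in REMOTE_TABLE.items():
        print(url, "->", select_backend(claims), "/", select_backend(claims, H5_FIRST))
```

When several backends claim the same URL, as in the last row, the order alone decides the outcome, which is why an explicit `engine=` argument is the reliable way to resolve an ambiguous URL.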

doc/user-guide/options.rst

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Xarray offers a small number of configuration options through :py:func:`set_opti

 2. Control behaviour during operations: ``arithmetic_join``, ``keep_attrs``, ``use_bottleneck``.
 3. Control colormaps for plots:``cmap_divergent``, ``cmap_sequential``.
-4. Aspects of file reading: ``file_cache_maxsize``, ``warn_on_unclosed_files``.
+4. Aspects of file reading: ``file_cache_maxsize``, ``netcdf_engine_order``, ``warn_on_unclosed_files``.


 You can set these options either globally

doc/whats-new.rst

Lines changed: 7 additions & 1 deletion
@@ -2,6 +2,7 @@

 .. _whats-new:

+
 What's New
 ==========

@@ -32,6 +33,11 @@ Bug Fixes
 - Fix h5netcdf backend for format=None, use same rule as netcdf4 backend (:pull:`10859`).
   By `Kai Mühlbauer <https://github.com/kmuehlbauer>`_

+- ``netcdf4`` and ``pydap`` backends now use stricter URL detection to avoid incorrectly claiming
+  remote URLs. The ``pydap`` backend now only claims URLs with explicit DAP protocol indicators
+  (``dap2://`` or ``dap4://`` schemes, or ``/dap2/`` or ``/dap4/`` in the URL path). This prevents
+  both backends from claiming remote Zarr stores and other non-DAP URLs without an explicit
+  ``engine=`` argument. (:pull:`10804`). By `Ian Hunt-Isaak <https://github.com/ianhi>`_.

 Documentation
 ~~~~~~~~~~~~~
@@ -67,12 +73,12 @@ New features

 Bug fixes
 ~~~~~~~~~
-
 - Fix error raised when writing scalar variables to Zarr with ``region={}``
   (:pull:`10796`).
   By `Stephan Hoyer <https://github.com/shoyer>`_.


+
 .. _whats-new.2025.09.1:

 v2025.09.1 (September 29, 2025)

xarray/backends/h5netcdf_.py

Lines changed: 9 additions & 3 deletions
@@ -494,10 +494,16 @@ class H5netcdfBackendEntrypoint(BackendEntrypoint):
     supports_groups = True

     def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool:
+        from xarray.core.utils import is_remote_uri
+
         filename_or_obj = _normalize_filename_or_obj(filename_or_obj)
-        magic_number = try_read_magic_number_from_file_or_path(filename_or_obj)
-        if magic_number is not None:
-            return magic_number.startswith(b"\211HDF\r\n\032\n")
+
+        # Try to read magic number for local files only
+        is_remote = isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)
+        if not is_remote:
+            magic_number = try_read_magic_number_from_file_or_path(filename_or_obj)
+            if magic_number is not None:
+                return magic_number.startswith(b"\211HDF\r\n\032\n")

         if isinstance(filename_or_obj, str | os.PathLike):
             _, ext = os.path.splitext(filename_or_obj)
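
The control flow of this hunk (read the magic number for local paths only, and fall back to the extension only when no magic number could be read) can be tried in isolation. The sketch below is a standalone approximation, not the xarray code: `looks_remote` and `read_magic` are hypothetical stand-ins for `is_remote_uri` and `try_read_magic_number_from_file_or_path`, and the extension set is an assumption, since the hunk is cut off before the set is shown.

```python
from __future__ import annotations

import os
import re

HDF5_MAGIC = b"\211HDF\r\n\032\n"
# Assumed extension set; the diff above ends before the real one is visible.
ASSUMED_EXTS = {".nc", ".nc4", ".cdf"}


def looks_remote(path: str) -> bool:
    """Hypothetical stand-in for is_remote_uri: anything shaped like scheme://."""
    return re.match(r"[a-zA-Z][a-zA-Z0-9+.\-]*://", path) is not None


def read_magic(path: str, count: int = 8) -> bytes | None:
    """Return the first `count` bytes, or None if the file cannot be read."""
    try:
        with open(path, "rb") as f:
            return f.read(count)
    except OSError:
        return None


def guess_hdf5(path: str) -> bool:
    # Never try to read bytes from a remote URI.
    if not looks_remote(path):
        magic = read_magic(path)
        if magic is not None:
            # Readable file: the magic number decides, with no extension fallback.
            return magic.startswith(HDF5_MAGIC)
    # Unreadable local path or remote URI: decide on the raw extension.
    # The query string is not stripped, so "data.nc?var=temp" does not match.
    _, ext = os.path.splitext(path)
    return ext in ASSUMED_EXTS
```

Under these assumptions a readable netCDF3 file named `file.nc` is rejected, while a missing `file.nc` is accepted on its extension alone, matching the corresponding rows of the Local File Resolution table.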

xarray/backends/netCDF4_.py

Lines changed: 26 additions & 12 deletions
@@ -50,6 +50,7 @@
     FrozenDict,
     close_on_error,
     is_remote_uri,
+    strip_uri_params,
     try_read_magic_number_from_path,
 )
 from xarray.core.variable import Variable
@@ -701,21 +702,34 @@ class NetCDF4BackendEntrypoint(BackendEntrypoint):
     supports_groups = True

     def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool:
-        if isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj):
-            return True
+        # Helper to check if magic number is netCDF or HDF5
+        def _is_netcdf_magic(magic: bytes) -> bool:
+            return magic.startswith((b"CDF", b"\211HDF\r\n\032\n"))
+
+        # Helper to check if extension is netCDF
+        def _has_netcdf_ext(path: str | os.PathLike, is_remote: bool = False) -> bool:
+            path = str(path).rstrip("/")
+            # For remote URIs, strip query parameters and fragments
+            if is_remote:
+                path = strip_uri_params(path)
+            _, ext = os.path.splitext(path)
+            return ext in {".nc", ".nc4", ".cdf"}

-        magic_number = (
-            bytes(filename_or_obj[:8])
-            if isinstance(filename_or_obj, bytes | memoryview)
-            else try_read_magic_number_from_path(filename_or_obj)
-        )
-        if magic_number is not None:
-            # netcdf 3 or HDF5
-            return magic_number.startswith((b"CDF", b"\211HDF\r\n\032\n"))
+        if isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj):
+            # For remote URIs, check extension (accounting for query params/fragments)
+            # Remote netcdf-c can handle both regular URLs and DAP URLs
+            return _has_netcdf_ext(filename_or_obj, is_remote=True)

         if isinstance(filename_or_obj, str | os.PathLike):
-            _, ext = os.path.splitext(filename_or_obj)
-            return ext in {".nc", ".nc4", ".cdf"}
+            # For local paths, check magic number first, then extension
+            magic_number = try_read_magic_number_from_path(filename_or_obj)
+            if magic_number is not None:
+                return _is_netcdf_magic(magic_number)
+            # No magic number available, fallback to extension
+            return _has_netcdf_ext(filename_or_obj)
+
+        if isinstance(filename_or_obj, bytes | memoryview):
+            return _is_netcdf_magic(bytes(filename_or_obj[:8]))

         return False
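
The remote branch of `_has_netcdf_ext` depends on `strip_uri_params`, whose body is not part of the hunks shown here. As a rough illustration of the idea, the sketch below substitutes a hypothetical `strip_params` built on `urllib.parse`; the real helper may behave differently, and `has_netcdf_ext` is a standalone copy of the nested helper for experimentation only.

```python
from __future__ import annotations

import os
from urllib.parse import urlsplit, urlunsplit

NETCDF_EXTS = {".nc", ".nc4", ".cdf"}


def strip_params(uri: str) -> str:
    """Hypothetical stand-in for strip_uri_params: drop ?query and #fragment."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def has_netcdf_ext(path: str, is_remote: bool = False) -> bool:
    """Standalone copy of the nested helper from the hunk above."""
    path = str(path).rstrip("/")
    if is_remote:
        # Remove query parameters and fragments before looking at the suffix.
        path = strip_params(path)
    _, ext = os.path.splitext(path)
    return ext in NETCDF_EXTS


if __name__ == "__main__":
    for url in (
        "http://example.com/data.nc?var=temp",
        "https://example.com/store.zarr",
        "https://example.com/DAP4/data",
    ):
        print(url, "->", has_netcdf_ext(url, is_remote=True))
```

Without `is_remote=True` the query string stays attached to the suffix (`.nc?var=temp`), so the same URL is not recognized. That is the behavior the documentation table attributes to `h5netcdf`.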

xarray/backends/pydap_.py

Lines changed: 20 additions & 1 deletion
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import os
 from collections.abc import Iterable
 from typing import TYPE_CHECKING, Any

@@ -209,7 +210,25 @@ class PydapBackendEntrypoint(BackendEntrypoint):
     url = "https://docs.xarray.dev/en/stable/generated/xarray.backends.PydapBackendEntrypoint.html"

     def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool:
-        return isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)
+        if not isinstance(filename_or_obj, str):
+            return False
+
+        # Check for explicit DAP protocol indicators:
+        # 1. DAP scheme: dap2:// or dap4:// (case-insensitive, may not be recognized by is_remote_uri)
+        # 2. Remote URI with /dap2/ or /dap4/ in URL path (case-insensitive)
+        # Note: We intentionally do NOT check for .dap suffix as that would match
+        # file extensions like .dap which trigger downloads of binary data
+        url_lower = filename_or_obj.lower()
+        if url_lower.startswith(("dap2://", "dap4://")):
+            return True
+
+        # For standard remote URIs, check for DAP indicators in path
+        if is_remote_uri(filename_or_obj):
+            return (
+                "/dap2/" in url_lower or "/dap4/" in url_lower or "/dodsC/" in url_lower
+            )
+
+        return False

     def open_dataset(
         self,
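
The DAP detection rule in this hunk can be exercised without pydap or xarray installed. The sketch below is a standalone approximation: `looks_remote` is a hypothetical stand-in for `is_remote_uri`, and `looks_like_dap` is an illustrative name. One deliberate difference: in the hunk, the mixed-case marker `"/dodsC/"` is tested against the lowercased URL, so that comparison cannot succeed as written. The sketch assumes case-insensitive matching of THREDDS-style paths was intended and uses `"/dodsc/"` instead.

```python
from __future__ import annotations

import re

# Lowercase markers, because they are compared against a lowercased URL.
# "/dodsc/" is an assumption about the intent of the "/dodsC/" check in the diff.
DAP_PATH_MARKERS = ("/dap2/", "/dap4/", "/dodsc/")


def looks_remote(url: str) -> bool:
    """Hypothetical stand-in for is_remote_uri: anything shaped like scheme://."""
    return re.match(r"[a-zA-Z][a-zA-Z0-9+.\-]*://", url) is not None


def looks_like_dap(url: object) -> bool:
    if not isinstance(url, str):
        return False
    lower = url.lower()
    # Explicit DAP schemes are accepted regardless of case.
    if lower.startswith(("dap2://", "dap4://")):
        return True
    # Otherwise require a remote URI with a DAP marker somewhere in the path.
    if looks_remote(url):
        return any(marker in lower for marker in DAP_PATH_MARKERS)
    return False


if __name__ == "__main__":
    for candidate in (
        "dap2://opendap.nasa.gov/dataset",
        "https://example.com/DAP4/data",
        "https://example.com/thredds/dodsC/data.nc",
        "https://example.com/store.zarr",
    ):
        print(candidate, "->", looks_like_dap(candidate))
```

A local path that merely contains `/dap4/` is rejected because it is not a remote URI, and a remote Zarr store is rejected because it carries no DAP marker.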
