
Conversation

@Halpph commented Feb 19, 2025

  • Tests pass
  • ruff format
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

{ "pattern": "^.$" }
],
Previous description: "Only for format = json. How multiple json documents are delimited within one file"
Proposed description: "For JSON format, only 'new_line' or 'array' is allowed to indicate how multiple JSON documents are delimited. For CSV format, any single character can be used as the delimiter between columns. Only valid for CSV."
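The proposed description distinguishes two ways multiple JSON documents can be delimited within one file. A minimal stdlib sketch of what 'new_line' vs 'array' delimiting means in practice (the helper name split_json_documents is hypothetical, not part of the codebase):

```python
import json

def split_json_documents(text: str, delimiter: str) -> list:
    """Split a file's content into individual JSON documents.

    delimiter: 'new_line' -> one document per line (NDJSON style)
               'array'    -> the file is a single top-level JSON array
    """
    if delimiter == "new_line":
        # Parse each non-empty line as its own JSON document.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if delimiter == "array":
        docs = json.loads(text)
        if not isinstance(docs, list):
            raise ValueError("expected a top-level JSON array")
        return docs
    raise ValueError(f"unsupported delimiter: {delimiter!r}")
```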

Author

@jochenchrist should I not modify this one and just change the one in the other repo then?

Contributor

Please don't modify the file here, but propose a change in the other repo.

We can still continue to use the provided keys in the data_contract_specification (as custom extensions).

stefanedwards and others added 5 commits February 26, 2025 15:33
Allows duckdb to load the csv file correctly and lets SodaCL check for field presence.

This fix does not check for incorrect ordering of columns.
fix: Typo in datacontract.schema
"rich>=13.7,<13.10",
"sqlglot>=26.6.0,<27.0.0",
"duckdb==1.1.2",
"fsspec",
Contributor

Where is this used?

Contributor

In datacontract/engines/soda/connections/duckdb.py, the method sniff_csv_header uses duckdb.from_csv_auto to read a CSV file as a stream.

Without fsspec, the following fails:

return duckdb.from_csv_auto(io.BytesIO(header_line), **csv_params).columns
E       duckdb.duckdb.InvalidInputException: Invalid Input Error: This operation could not be completed because required module 'fsspec' is not installed
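The failing call hands duckdb a file-like object, which duckdb can only register when the optional fsspec module is installed. For illustration, a dependency-free stand-in for the same header-sniffing idea, using only Python's stdlib csv module (hypothetical helper, not the code under review):

```python
import csv
import io

def sniff_csv_header(sample: bytes, encoding: str = "utf-8") -> list:
    """Return the column names from a CSV sample's first line.

    Stdlib-only stand-in for the duckdb.from_csv_auto call: the dialect
    (delimiter, quoting) is detected with csv.Sniffer, then the header
    row is parsed with csv.reader.
    """
    text = sample.decode(encoding)
    dialect = csv.Sniffer().sniff(text)
    return next(csv.reader(io.StringIO(text), dialect))
```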

Contributor

The filename is not in line with the naming conventions (test__.py).
Please move these tests to test_test_local_csv.py.

Contributor

But the tests do not test the test routine. They test whether duckdb can handle mismatched column specifications when reading CSV files.

Do you still want them in test_test_local_csv.py?

@dmaresma
Contributor

Hi, there is also a regression, introduced by sniff_csv_header:

when using `with open(model_path, 'rb')`, model_path could be an abfss:// or s3:// blob or data lake store URL, which is not supported by Python's built-in open() function.
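One way to guard against that regression is to branch on the URL scheme before calling open(), so that remote object-store paths can be routed through fsspec.open() (or a cloud SDK) instead. A minimal sketch using only the stdlib (the scheme list and helper name are assumptions for illustration):

```python
from urllib.parse import urlparse

# Schemes that need fsspec (or a cloud SDK) rather than the built-in open().
# This set is an assumption for illustration, not from the codebase.
REMOTE_SCHEMES = {"s3", "abfss", "gs", "az", "http", "https"}

def is_remote_path(model_path: str) -> bool:
    """Return True when model_path points at a remote object store.

    Plain open() only handles local files; s3:// or abfss:// URLs
    would have to be opened through fsspec instead.
    """
    return urlparse(model_path).scheme in REMOTE_SCHEMES
```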
