Mimetype for CSV Sparql Query Results should use correct encoding as defined in the Specification #4856

@pajoma

Current Behavior

The query results are encoded in UTF-8:

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"),
     StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);

The specification says:

Systems providing these formats should note that the content types for CSV is text/csv and for TSV text/tab-separated-values. Being text/*, the default character set is US-ASCII. The charset parameter should be used in conjunction with SPARQL Results; UTF-8 is recommended: text/csv; charset=utf-8 and text/tab-separated-values; charset=utf-8.

However, the MIME type exposed by RDF4J is plain "text/csv", without a charset parameter (in SparqlMimeTypes):

public static final String CSV_VALUE = "text/csv";

UTF-8 is obviously the correct choice, but standard clients such as the Python requests library assume "ISO-8859-1" for the Content-Type "text/csv" when no charset parameter is present.
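To illustrate the practical effect, here is a minimal sketch using only the Python standard library. The sample row is invented for illustration; it only shows what a client sees when UTF-8 bytes are decoded as ISO-8859-1.

```python
# A made-up CSV result containing a non-ASCII literal, serialized as UTF-8
# (this stands in for what the server actually sends).
payload = "name\r\nZoë\r\n".encode("utf-8")

# What a client gets if it assumes ISO-8859-1 for a bare "text/csv".
misdecoded = payload.decode("iso-8859-1")

# What the client gets when it decodes with the real encoding.
correct = payload.decode("utf-8")

print(repr(misdecoded))
print(repr(correct))
```

Pure ASCII results are unaffected, which is probably why the mismatch is easy to miss until a literal contains an accented character.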

I can modify the REST controllers so that they do not use the standard RDF4J MIME types, e.g.:

    @PostMapping(value = "/query", consumes = {MediaType.TEXT_PLAIN_VALUE, SparqlMimeTypes.SPARQL_QUERY_VALUE},
            produces = { SparqlMimeTypes.JSON_VALUE, SparqlMimeTypes.CSV_VALUE+ ";charset=UTF-8"}
    )
    @ResponseStatus(HttpStatus.OK)
    Flux<BindingSet> queryBindingsPost(@RequestBody String query) {...}

but then I have to map "text/csv;charset=UTF-8" back to "text/csv" everywhere else to obtain the correct ResultWriters.
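The mapping step itself is small. As a language-neutral sketch (written in Python here, not in the Java of the controller), it amounts to stripping the parameters before the lookup. The function name `base_media_type` and the `WRITERS` table are hypothetical and not part of RDF4J.

```python
def base_media_type(content_type: str) -> str:
    """Drop parameters such as ';charset=UTF-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical lookup table keyed by the bare media type; the values are
# placeholder labels, not real RDF4J classes.
WRITERS = {
    "text/csv": "csv-writer",
    "application/sparql-results+json": "json-writer",
}

writer = WRITERS[base_media_type("text/csv;charset=UTF-8")]
print(writer)
```

Having to repeat this normalization at every lookup site is the overhead the issue is complaining about.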

Expected Behavior

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"), StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR); 

should declare its MIME type as "text/csv;charset=utf-8".

If "text/csv" remains included, the SPARQLResultsCSVWriter should use "ISO-8859-1" as the encoding (perhaps with a warning?).

Steps To Reproduce

  1. Expose a SPARQL endpoint using the standard MIME types defined in RDF4J.
  2. Call it with the Python requests library and observe that it assumes "ISO-8859-1" as the encoding of the result:
            response = requests.post(
                url=f"...",
                data=query.encode("utf-8"),
                headers={
                    "X-API-KEY": api_key,
                    "Content-Type": "text/plain",
                    "Accept": "text/csv",
                    "X-Application": scope,
                },
            )
   
            enc = response.encoding  # reports "ISO-8859-1", although the body is actually UTF-8
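Until the header carries a charset, a client can work around the problem by choosing the encoding itself. The sketch below uses only the standard library; `resolve_charset` is a hypothetical helper, and the UTF-8 fallback is an assumption based on the recommendation in the specification text quoted above.

```python
from email.message import Message


def resolve_charset(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


print(resolve_charset("text/csv"))
print(resolve_charset("text/csv; charset=ISO-8859-1"))
```

With requests, the resolved value could be assigned to `response.encoding` before reading `response.text`, so that the body is decoded as UTF-8 rather than ISO-8859-1.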

Version

4.3.8

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response

Labels: specification (issues related to compliance to standards and external specs), 🐞 bug (issue is a bug)
