This repository was archived by the owner on Oct 24, 2023. It is now read-only.
Blog entry about defects found in public JSON schemas. #40
Closed
9 commits
ea20da5
add a small blog
39cef2c
expand, add some context
64377d6
add caveats and other expansions
8d07bca
proofreading
b8a47bd
proofreading
020e39c
add reference to added example
a80eced
more proof reading
489ebfd
add link and further caveats
3e4a48a
typo--
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,189 @@ | ||
--- | ||
title: "An Analysis of JSON Schema Defects" | ||
date: 2023-09-26 | ||
tags: | ||
- Specification | ||
type: Opinion | ||
cover: /img/posts/2023/analysis-of-json-schema-defects/cover.webp | ||
authors: | ||
- name: Fabien Coelho | ||
photo: /img/avatars/fabien.jpg | ||
link: https://www.linkedin.com/in/fabien-coelho-65433a18/ | ||
byline: Professor in CS | ||
- name: Claire Medrala | ||
photo: /img/avatars/claire.jpg | ||
link: https://www.linkedin.com/in/claire-medrala/ | ||
byline: Research Engineer | ||
excerpt: Evidence that schemas are hard to write, and possible changes to the spec | |
--- | ||
|
||
## Context | ||
|
||
While teaching back-end programming at [Mines Paris](https://minesparis.psl.eu/), | ||
an engineering school which is part of [PSL University](https://psl.eu/), we have | ||
looked at how JSON data could be validated when transferred from a front-end (e.g. react-native) | |
to a back-end (e.g. a REST API with Flask) and to storage (e.g. a Postgres database). | |
|
||
We stumbled upon JSON Schema, and our investigation led to an *academic* study | |
which analyses many schemas, finds common defects, and proposes changes to the spec | |
which would syntactically rule out most of these defects, at the price of some | |
constraints. | |
|
||
More precisely, the methodology consisted of: | |
|
||
- reading all versions of the specs (yes, really!), | ||
- collecting all the public schemas we could find (especially aggregating corpora from prior academic studies), | |
- writing several tools to analyze schemas and report *definite* or *probable* defects, | ||
- looking at the reported defects to try to guess *why* these defects are there | ||
(most of the time some typo, a misplaced `}`, some type errors…), | ||
- thinking about what changes in the spec could rule out these schemas, while | |
still allowing useful JSON data structures to be described. | |
|
||
Overall, the quality of publicly available schemas is… not great: | |
over **60%** of schemas are shown to have some type of defect, which in | |
the worst case lets unintended data pass validation, possibly risking system breakage | |
or even cybersecurity issues. | |
|
||
The changes we recommend go beyond [Last Breaking Change](/blog/posts/the-last-breaking-change), | |
and somewhat change the philosophy of the specification, so they can be perceived as controversial. | |
However, they reach their target, which is to turn most defects into errors. | |
Although the added restrictions would require updating some existing schemas, we found | |
that a significant number of public schemas already conform to our proposed restrictions. | |
|
||
## Common Defects | ||
|
||
Defects come mostly from JSON Schema's lax, independent keywords and loose defaults: | |
with JSON Schema, there is *no* constraint on where you put valid keywords, and | |
unknown keywords are silently ignored to ensure *upward* compatibility. | |
As a result, mistyping, misnaming, misspelling or misplacing a keyword simply | |
results in the keyword being silently ignored, and these unintentional errors | |
tend to stay in schemas without ever being detected. | |
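
To make the "silently ignored" problem concrete, here is a minimal sketch of a lint pass that reports keys it does not recognise and suggests the closest known keyword. The `KNOWN_KEYWORDS` list is a deliberately small, illustrative subset, and `unknown_keywords` is a hypothetical helper, not part of any JSON Schema tool or of the tools used in the study.

```python
import difflib

# Illustrative subset only; a real linter would list every keyword of the
# draft it targets.
KNOWN_KEYWORDS = sorted([
    "type", "enum", "const", "properties", "additionalProperties",
    "required", "items", "uniqueItems", "minLength", "maxLength",
    "minimum", "maximum", "pattern", "title", "description",
])


def unknown_keywords(schema: dict) -> list[tuple[str, list[str]]]:
    """Return (key, suggestions) pairs for top-level keys not in KNOWN_KEYWORDS."""
    report = []
    for key in schema:
        if key not in KNOWN_KEYWORDS:
            # get_close_matches returns the best fuzzy match, if any is close enough.
            suggestions = difflib.get_close_matches(key, KNOWN_KEYWORDS, n=1)
            report.append((key, suggestions))
    return report


# A misspelled keyword that a validator would accept without complaint.
print(unknown_keywords({"type": "string", "minLenght": 3}))
```

A validator sees `minLenght` as an unknown keyword and skips it, so strings of any length are accepted; a lint pass like this one is one way to surface the typo.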
|
||
In the worst case, schemas may not be satisfiable at all. | ||
Consider for instance this schema extract (line 48037 of | ||
[Ansible 2.5](https://github.com/miniHive/schemastore-analysis/blob/master/JSON/Ansible_2.5.json)), | ||
where both allowed values are integers, which means that validation will always fail | |
when checking that they are also strings: | |
|
||
```json | ||
{ | ||
"type": "string", | ||
"enum": [ 80, 443 ] | ||
} | ||
``` | ||
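
This kind of contradiction can be detected mechanically. The following is a rough sketch, assuming the schema has already been decoded into Python objects (for instance with `json.load`); the helper names are hypothetical and only the `type`/`enum` pair is checked.

```python
def json_type(value) -> str:
    """Name the JSON Schema type of a decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        # Recent drafts treat 1.0 as an integer as well.
        return "integer" if value.is_integer() else "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def type_accepts(declared: str, actual: str) -> bool:
    """True if a value of type `actual` satisfies `"type": declared`."""
    return declared == actual or (declared == "number" and actual == "integer")


def unreachable_enum_values(schema: dict) -> list:
    """List the enum values that the schema's own `type` would reject."""
    declared = schema.get("type")
    if declared is None or "enum" not in schema:
        return []
    allowed = [declared] if isinstance(declared, str) else declared
    return [value for value in schema["enum"]
            if not any(type_accepts(t, json_type(value)) for t in allowed)]


print(unreachable_enum_values({"type": "string", "enum": [80, 443]}))
```

If every enum value is reported, as is the case for the extract above, the schema cannot be satisfied by any instance.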
|
||
Other defects often manifest themselves as ignored keywords. | ||
Consider the following schema extract (line 614 of | ||
[.NET Template](https://json.schemastore.org/template.json)), where `uniqueItems` | ||
applies to a string, thus is always ignored, and should have been attached to | ||
the upper level: | ||
|
||
```json | ||
{ | ||
"type": "array", | ||
"items": { | ||
"type": "string", | ||
"uniqueItems": true | ||
} | ||
} | ||
``` | ||
|
||
Or this extract (line 55 of | ||
[Azure Device Update Manifest](https://json.schemastore.org/azure-deviceupdate-manifest-definitions-4.0.json)), | ||
where `propertyNames` applies to a string thus is also always ignored, and | ||
should also be moved up to be effective. | ||
|
||
```json | ||
{ | ||
"type": "object", | ||
"additionalProperties": { | ||
"type": "string", | ||
"propertyNames": { | ||
"minLength": 1, | ||
"maxLength": 32 | ||
} | ||
} | ||
} | ||
``` | ||
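
Both the `uniqueItems` and the `propertyNames` extracts follow the same pattern: a type-specific keyword sits next to a `type` it cannot apply to. A minimal sketch of a detector for that pattern could look as follows; the `KEYWORD_TYPES` table is an illustrative subset, `ignored_keywords` is a hypothetical helper, and only a few subschema locations are traversed.

```python
# Illustrative subset: the instance type each keyword constrains.
KEYWORD_TYPES = {
    "minLength": "string", "maxLength": "string", "pattern": "string",
    "minimum": "number", "maximum": "number", "multipleOf": "number",
    "items": "array", "uniqueItems": "array",
    "minItems": "array", "maxItems": "array",
    "properties": "object", "propertyNames": "object",
    "required": "object", "additionalProperties": "object",
}


def ignored_keywords(schema, path="#"):
    """Yield (path, keyword) for keywords that can never apply given `type`."""
    if not isinstance(schema, dict):
        return
    declared = schema.get("type")
    if isinstance(declared, str):
        # Numeric keywords apply to integers too.
        base = "number" if declared == "integer" else declared
        for key in schema:
            if key in KEYWORD_TYPES and KEYWORD_TYPES[key] != base:
                yield (path, key)
    # Recurse into the subschema locations used by the extracts above.
    for key in ("items", "additionalProperties", "propertyNames"):
        if key in schema:
            yield from ignored_keywords(schema[key], f"{path}/{key}")
    properties = schema.get("properties")
    if isinstance(properties, dict):
        for name, sub in properties.items():
            yield from ignored_keywords(sub, f"{path}/properties/{name}")


template = {"type": "array", "items": {"type": "string", "uniqueItems": True}}
print(list(ignored_keywords(template)))
```

Run on the Azure extract, the same function reports `propertyNames` under `#/additionalProperties`.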
|
||
Or this extract (line 443 of [Fly](https://json.schemastore.org/fly.json)), where | ||
the misplaced `additionalProperties` is taken as a forbidden property name instead | ||
of applying to the surrounding object. | ||
|
||
```json | ||
{ | ||
"type": "object", | ||
"properties": { | ||
"image": { "type": "string" }, | ||
"additionalProperties": false | ||
} | ||
} | ||
``` | ||
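
A misplaced keyword of this kind can only be flagged as a *probable* defect, since a property may legitimately be named like a keyword. One possible heuristic, sketched below under that assumption, is to report property names that collide with keywords rarely used as data field names; the `SUSPICIOUS_NAMES` set and the `suspicious_properties` helper are hypothetical.

```python
# Keywords that are unlikely to be intended as property names in real data.
# This set is an assumption made for the example, not taken from the study.
SUSPICIOUS_NAMES = {
    "additionalProperties", "patternProperties", "propertyNames",
    "uniqueItems", "minItems", "maxItems", "minLength", "maxLength",
    "anyOf", "allOf", "oneOf",
}


def suspicious_properties(schema: dict) -> list[str]:
    """Return property names that look like misplaced schema keywords."""
    properties = schema.get("properties")
    if not isinstance(properties, dict):
        return []
    return [name for name in properties if name in SUSPICIOUS_NAMES]


fly = {
    "type": "object",
    "properties": {
        "image": {"type": "string"},
        "additionalProperties": False,
    },
}
print(suspicious_properties(fly))
```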
|
||
We have found many such issues in our corpus of *57,800* distinct schemas. | ||
This could be significantly improved with limited although bold changes to | ||
the spec. | ||
|
||
## Recommendations | ||
|
||
Based on this evidence, we recommend tightening the JSON Schema specification | |
by adding restrictions to keyword occurrences. The strictest version of these | |
proposed changes is: | |
|
||
- type declarations, either explicit (`type`), implicit (`enum`, `const`, `$ref`), | ||
or through combinators (`allOf`, `anyOf`, `oneOf`) should be **mandatory** and appear | ||
only **once**, i.e. these keywords should be **exclusive**. | ||
- type declarations should be simple scalars, i.e. unions could only be expressed | |
with combinators. | |
- type-specific keywords must appear only with their `type`, at the same level. | |
- unknown keywords must be rejected, although there should be some allowance for extensions, | |
e.g. with prefixed property names such as `x-*`. | |
- about 20 seldom-used keywords could be removed, for various reasons: | |
implementation complexity for `$dynamicRef` and `$dynamicAnchor`, | |
understanding complexity for `if`/`then`/`else` (which can in most cases be removed), | |
underusage for some others. | |
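
As an illustration of how the first, second and fourth rules above might be enforced on a single schema object, here is a minimal sketch. It is one reading of the rules, not the authors' tooling: `strict_violations` and its `known` parameter are hypothetical, and nested subschemas are not traversed.

```python
TYPE_DECLARATIONS = {"type", "enum", "const", "$ref", "allOf", "anyOf", "oneOf"}


def strict_violations(schema: dict, known: set[str]) -> list[str]:
    """Check one schema object against a strict reading of the rules above."""
    problems = []
    # Rule 1: exactly one type declaration per schema object.
    declared = sorted(key for key in schema if key in TYPE_DECLARATIONS)
    if len(declared) != 1:
        problems.append(f"expected exactly one type declaration, found {declared}")
    # Rule 2: `type` must be a single name, not a list of alternatives.
    elif declared == ["type"] and not isinstance(schema["type"], str):
        problems.append("type must be a single scalar name")
    # Rule 4: unknown keywords are errors, except `x-` prefixed extensions.
    for key in schema:
        if key not in known and not key.startswith("x-"):
            problems.append(f"unknown keyword: {key}")
    return problems


KNOWN = TYPE_DECLARATIONS | {"minLength", "maxLength", "description"}
print(strict_violations({"type": "string", "enum": [80, 443]}, KNOWN))
print(strict_violations({"type": "string", "x-note": "ok", "minLenght": 3}, KNOWN))
```

Under this reading, the Ansible extract is rejected because it carries both `type` and `enum`, and a misspelled keyword becomes an error instead of being ignored.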
|
||
Note that other syntactic and semantic changes could help reduce the number of defects | |
by ruling out some cases but allowing others. Our proposal is simple (constraints | |
are in the syntax, so all conformant tools would enforce them) and effective (most | |
defects are ruled out). | |
|
||
With these rules, the first three examples above become illegal. | ||
We think that such changes result in schema descriptions which are easier to | ||
understand and maintain, and that validation could be more efficient. | ||
|
||
Although some description tricks are not possible anymore with these restrictions, | ||
we believe that they bring a significant overall software engineering benefit. | ||
Moreover, many existing schemas already conform to these restrictive rules and | ||
would not need to be changed at all. | ||
|
||
## Caveats | ||
|
||
This is an academic study, done by people who are fully *independent* from the | ||
JSON Schema community and the companies that support services around it. | ||
|
||
From an academic perspective, it is very hard to dismiss any data, because | |
doing so could be interpreted as keeping only the data which supports some point of view, | |
which would constitute a bias. Thus we collected and analyzed all the schemas we could find. | |
If someone can provide other public sources, we will be very happy to rerun our | |
analysis and update our figures. In particular, we would love to extract schemas | |
from OpenAPI and other specs, but we have not found a simple way to scrape these yet. | |
|
||
Note that there is no magic: we can only analyse data that we can access. | ||
Maybe the public schemas we found are somehow not representative, and the | ||
picture could be different if we could access privately held schemas. | ||
Well, we cannot say anything about what we cannot see! | ||
|
||
Our study provides a first analysis of the causes of defects (say a typo or | |
a misplacement) which we believe go undetected in projects because they are | |
*allowed* by the spec, thus we tackle the issue from this perspective. | |
The spec changes we propose to rule these out may possibly break some use cases. | |
However, it is unclear which use cases, if any, would be broken without a possible solution. | |
|
||
## References | ||
|
||
- [Research Paper](https://www.cri.minesparis.psl.eu/classement/doc/A-794.pdf) | ||
- [Corpus](https://github.com/clairey-zx81/yac) | ||
- [Tools](https://github.com/clairey-zx81/json-schema-stats) | |
|
||
_Cover photo by [Arnold Francisca](https://unsplash.com/@clark_fransa) on [Unsplash](https://unsplash.com/photos/f77Bh3inUpE)_ |
They could also be found with a linter mode, which has been proposed here - https://github.com/orgs/json-schema-org/discussions/323 and json-schema-org/json-schema-spec#1079
Thanks for the pointers.