Skip to content

Clarify how JSON Schema works with a superset of the defined data model #970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 6, 2020

Conversation

awwright
Copy link
Member

Successor to #942

@awwright awwright force-pushed the 970 branch 4 times, most recently from c1602b8 to a7ab9fe Compare August 17, 2020 06:15
@awwright awwright requested a review from handrews August 17, 2020 06:19
(similar to how "number" is a superset of "integer"),
or a type disjoint from the core data model altogether (like "binary").
Existing keywords may be updated to define behavior for the new range of values,
but vocabularies that intend to be compatible with JSON Schema Validation should not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/should not/MUST NOT/ ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to review the spec again, but I'd guess vocabularies are allowed to not use the validation vocabulary.

I wonder if it would save implementation effort to not allow this?

Copy link
Member

@Relequestual Relequestual Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is true. My feeling is that the basic use of type should be pretty universal, and SHOULD NOT makes it clear you better have a good reason to change the behavior of existing types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought @handrews had shown elsewhere in this thread that we would not be permitting updating existing keywords, so MUST NOT is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I read this comment, made a comment, before reading the rest of the discussion related to this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awwright it is definitely true that a dialect can omit the validation vocabulary. (I'm trying to promoted the "dialect" term, see #902).

We need to be careful here of the distinction between type (the keyword in the validation vocabulary) and the more generic notion of JSON Schema data model. I'm a little confused by:

New data model types may be defined as either supersets of an existing type (similar to how "number" is a superset of "integer")

as "number" is both a data model type and a value for type, while "integer" is not a data model type, just a value for type. "integer" is a refinement/shortcut type assertion for the data model type "number" plus "multipleOf": 1 (more or less, I don't recall the exact language for "integer" offhand).

A binary value would be outside of the JSON Schema data model, and therefore be an extension (not refinement or shortcut) to that data model. Because the type keyword only has keywords relevant to the JSON data model, you would not be able to use type with an extension to the data model and have the assertion pass.


An important consideration here is that when we say "this keyword applies to numbers" or "this keyword applies to strings" we are talking about the JSON data model. We can't say "this keyword applies to integers" because the only way to determine that something is in integer is to use type: integer. The data model is not aware of integers.

My slight preference after reading all this is to basically add one additional thing to the JSON Schema data model: none of the above. Let's not get into allowing people to make specific extensions to the data model because the implications of that are potentially complex. But we can explicitly allow people to use JSON Schema with data that doesn't fit the data model, assuming their implementation supports it somehow.

With this approach, you could add a keyword that works for "values outside the primary JSON Schema data model" or whatever we want to call it. But you can't add keywords for various different sorts of things that might show up that way.

In practice with OpenAPI, these "instances" are being documented and handed off to something else to interpret based on the JSON Schema contentMediaType keyword. It's a way to say, from within JSON Schema, that the thing is not actually JSON and shouldn't be interpreted as such, but also shouldn't be an error.

(BTW, much thanks to @awwright for picking this up and to everyone for comments- this is going to be a much more solid change than my original idea when it's all done).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up thoughts, I often a use case for wanting to validate a JSON structure in javascript which sometimes contains a function... You can't convert them to JSON, but, in js, it could be part of the data you want to validate, where you don't care about what the function is, just that it's a function.

as they will always pass or always fail.
</t>
<t>
A custom vocabulary may define support for a superset of the core data model.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • does this paragraph belong in the section "Non-JSON Instances"?

  • does this paragraph mean that a custom vocabulary may be expressed in terms of another custom vocabulary? Or that this is not permitted and the core vocabulary only may be used? (I strongly prefer the latter, as implementations could otherwise be stuck in a recursive chain of "load vocabulary X first to understand vocabulary Y". I'd at least want someone to show an example of this being done (e.g. in an appendix) to demonstrate that it allowed and provide a potential test case for implementations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this paragraph belong in the section "Non-JSON Instances"?

If an implementations adds support for a superset of the data model, that means it's supporting data that cannot be represented by application/json. The way for a schema to declare using this superset would be with a custom $schema keyword.

does this paragraph mean that a custom vocabulary may be expressed in terms of another custom vocabulary

I haven't really thought that far out, but that seems like a concern for the vocabularies section.

(I hope I'm understanding the question correctly.)

for example, to make use of the "const" keyword.
</t>
<t>
New data model types may be defined as either supersets of an existing type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify if that means that existing keywords can be extended, or if new keywords must be created to extend functionality. This paragraph almost implies that a vocabulary might extend the existing 'type' keyword to accept more values than number, integer, boolean, string, null, object and array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph almost implies that a vocabulary might extend the existing 'type' keyword to accept more values than number, integer, boolean, string, null, object and array.

This is exactly the intent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think that's terrible. It means that the implementation of every core keyword now needs to check if another vocabulary has extended that keyword. For example: when the validator vocabulary evaluates "type": "my_new_type"`, it can no longer generate an error "unrecognized type 'my_new_type'" but instead.. what? silently ignore it? call the other vocabulary's evaluation code to handle it? Is it going to know what vocabulary that is?

Copy link
Member Author

@awwright awwright Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vocabulary behavior aside, it makes sense to me that if I'm going to produce an implementation that's a superset of JSON Schema, that implementation will allow schemas like { type: "binary" }. For example, CBOR has byte arrays, or support for CDDL.

I'll take another look and see if I can address this though.

I feel like this has come up once or twice before, or is this just déjà vu?

Copy link
Member

@karenetheridge karenetheridge Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense to me that if I'm going to produce an implementation that's a superset of JSON Schema...

I agree. Perhaps 'type' is the only keyword where it might make sense to expand the syntax of an existing keyword. If so, we can handle this thusly:

  • update the specification for 'type' to indicate it should explicitly ignore any unrecognized type names, to allow for new vocabularies to add their own 'type' keyword (and indicate that those vocabularies should also ignore all the core types and any other unrecognized values) - which would allow vocabularies to share the 'type' keyword between them)
  • explicitly document that vocabularies cannot use the same keyword name as a core vocabulary (with the single exception of 'type')
  • leave as undefined behaviour what happens when two custom vocabularies choose to use the same keyword (some vocabularies may choose to error on a collision; others may wish to have overlapping behaviour as with 'type').

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karenetheridge we didn't really carve out an exception for format. It's always been defined as having an open-ended set of values and we are forced to continue supporting that for the time being. No one should ever create a keyword like format again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@handrews I well understand the difficult nature of format :) I am saying we should be explicit in the spec (given the questions and different understandings shown earlier in this thread) that it is not generally permitted for a vocabulary to extend the syntax, range of permitted values or other functionality of an existing core keyword. Especially in the context of extending the defined data model set (the topic of the PR), the spec should give direction that new data types will need vocabularies to define new keywords to work with them, rather than extend existing keywords.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My read on the above comments are, this PR is off, and the key is...

To the extent that this interacts with type, it just removes any implicit assumption that the absence of type is the same as "type": ["null", "boolean", "number", "array", "object"]. If type is absent and you have a parsed instance, no further type checking is done as an assertion. - @handrews

In addition, +1 to what @handrews further said...

What that really means is that a keyword can't fail an assertion in one module and pass in another, and expect the result to be a pass. If any handler fails, the result is a failure. Just like adding keywords to a schema object, constraints can only be added, never removed.

@awwright do you feel you've misread the issue here?
Do you feel a new PR would make the most sense or do you want to continue to modify this one?

Copy link
Member Author

@awwright awwright Aug 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking, let's edit this down to make the minimal set of changes to satisfy OAI (they're not doing anything with validation keywords on binary data, as I understand). And then, I can take some more time and examine vocabularies better, and things like CBOR that claim to be supersets of JSON.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awwright sounds good. And yes, that's correct about OAI- this is the fundamental use case:

Its useful to be able to use "contentMediaType" in contexts where binary media types are present. It does not make sense to use "type": "string" with binary data, because such data is not a JSON string no matter how much you try to map it into the data model.

A schema without a type can effectively be a hint not to worry about a detailed mapping. It is a way to use JSON Schema across media types without excluding binary content.

contentMediaType is strictly an annotation as of 2019-09. Even contentSchema is an annotation. It's just that the value of annotation is a schema, so after you parse the media type (somehow, JSON Schema doesn't care how) you can apply that schema to it.

@awwright
Copy link
Member Author

awwright commented Aug 23, 2020

I took the validation paragraphs out, so the behavior of validation keywords remains undefined.

                         The schema itself may only be expressible in this superset;
                         for example, to make use of the "const" keyword.
                     </t>
-                    <t>
-                        New data model types may be defined as either supersets of an existing type
-                        (similar to how "number" is a superset of "integer"),
-                        or a type disjoint from the core data model altogether (like "binary").
-                        Existing keywords may be updated to define behavior for the new range of values,
-                        but vocabularies that intend to be compatible with JSON Schema Validation should not
-                        change the behavior for existing types.
-                    </t>
-                    <t>
-                        Drawing on the "binary" example,
-                        a custom vocabulary may define what it means to use a "binary" data type
-                        in the "const" keyword, or the meaning of "minLength" when the instance is a "binary",
-                        but for the sake of consistency, it should not change the behavior when the instance is a string.
-                    </t>
                 </section>
             </section>

@handrews
Copy link
Contributor

handrews commented Sep 6, 2020

Fixes #941 (linking issue). Can we merge this now?

@awwright
Copy link
Member Author

awwright commented Sep 6, 2020

@handrews Go ahead

@handrews handrews merged commit 30f3158 into json-schema-org:master Sep 6, 2020
@awwright awwright deleted the 970 branch January 14, 2022 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants