Clarify how JSON Schema works with a superset of the defined data model #970

awwright · 2020-08-17T05:39:26Z

Successor to #942

Relequestual · 2020-08-17T12:40:11Z

jsonschema-core.xml

+                        (similar to how "number" is a superset of "integer"),
+                        or a type disjoint from the core data model altogether (like "binary").
+                        Existing keywords may be updated to define behavior for the new range of values,
+                        but vocabularies that intend to be compatible with JSON Schema Validation should not


/should not/MUST NOT/ ?

I'll have to review the spec again, but I'd guess vocabularies are allowed to not use the validation vocabulary.

I wonder if it would save implementation effort to not allow this?

~~Yes this is true. My feeling is that the basic use of type should be pretty universal, and SHOULD NOT makes it clear you better have a good reason to change the behavior of existing types.~~

I thought @handrews had shown elsewhere in this thread that we would not be permitting updating existing keywords, so MUST NOT is correct.

Correct. I read this comment, made a comment, before reading the rest of the discussion related to this issue.

@awwright it is definitely true that a dialect can omit the validation vocabulary. (I'm trying to promoted the "dialect" term, see #902).

We need to be careful here of the distinction between type (the keyword in the validation vocabulary) and the more generic notion of JSON Schema data model. I'm a little confused by:

New data model types may be defined as either supersets of an existing type (similar to how "number" is a superset of "integer")

as "number" is both a data model type and a value for type, while "integer" is not a data model type, just a value for type. "integer" is a refinement/shortcut type assertion for the data model type "number" plus "multipleOf": 1 (more or less, I don't recall the exact language for "integer" offhand).

A binary value would be outside of the JSON Schema data model, and therefore be an extension (not refinement or shortcut) to that data model. Because the type keyword only has keywords relevant to the JSON data model, you would not be able to use type with an extension to the data model and have the assertion pass.

An important consideration here is that when we say "this keyword applies to numbers" or "this keyword applies to strings" we are talking about the JSON data model. We can't say "this keyword applies to integers" because the only way to determine that something is in integer is to use type: integer. The data model is not aware of integers.

My slight preference after reading all this is to basically add one additional thing to the JSON Schema data model: none of the above. Let's not get into allowing people to make specific extensions to the data model because the implications of that are potentially complex. But we can explicitly allow people to use JSON Schema with data that doesn't fit the data model, assuming their implementation supports it somehow.

With this approach, you could add a keyword that works for "values outside the primary JSON Schema data model" or whatever we want to call it. But you can't add keywords for various different sorts of things that might show up that way.

In practice with OpenAPI, these "instances" are being documented and handed off to something else to interpret based on the JSON Schema contentMediaType keyword. It's a way to say, from within JSON Schema, that the thing is not actually JSON and shouldn't be interpreted as such, but also shouldn't be an error.

(BTW, much thanks to @awwright for picking this up and to everyone for comments- this is going to be a much more solid change than my original idea when it's all done).

Follow up thoughts, I often a use case for wanting to validate a JSON structure in javascript which sometimes contains a function... You can't convert them to JSON, but, in js, it could be part of the data you want to validate, where you don't care about what the function is, just that it's a function.

jsonschema-core.xml

karenetheridge · 2020-08-17T18:50:29Z

jsonschema-core.xml

+                        as they will always pass or always fail.
+                    </t>
+                    <t>
+                        A custom vocabulary may define support for a superset of the core data model.


does this paragraph belong in the section "Non-JSON Instances"?

does this paragraph mean that a custom vocabulary may be expressed in terms of another custom vocabulary? Or that this is not permitted and the core vocabulary only may be used? (I strongly prefer the latter, as implementations could otherwise be stuck in a recursive chain of "load vocabulary X first to understand vocabulary Y". I'd at least want someone to show an example of this being done (e.g. in an appendix) to demonstrate that it allowed and provide a potential test case for implementations.

does this paragraph belong in the section "Non-JSON Instances"?

If an implementations adds support for a superset of the data model, that means it's supporting data that cannot be represented by application/json. The way for a schema to declare using this superset would be with a custom $schema keyword.

does this paragraph mean that a custom vocabulary may be expressed in terms of another custom vocabulary

I haven't really thought that far out, but that seems like a concern for the vocabularies section.

(I hope I'm understanding the question correctly.)

karenetheridge · 2020-08-17T18:53:24Z

jsonschema-core.xml

+                        for example, to make use of the "const" keyword.
+                    </t>
+                    <t>
+                        New data model types may be defined as either supersets of an existing type


Please clarify if that means that existing keywords can be extended, or if new keywords must be created to extend functionality. This paragraph almost implies that a vocabulary might extend the existing 'type' keyword to accept more values than number, integer, boolean, string, null, object and array.

This paragraph almost implies that a vocabulary might extend the existing 'type' keyword to accept more values than number, integer, boolean, string, null, object and array.

This is exactly the intent.

FWIW I think that's terrible. It means that the implementation of every core keyword now needs to check if another vocabulary has extended that keyword. For example: when the validator vocabulary evaluates "type": "my_new_type"`, it can no longer generate an error "unrecognized type 'my_new_type'" but instead.. what? silently ignore it? call the other vocabulary's evaluation code to handle it? Is it going to know what vocabulary that is?

Vocabulary behavior aside, it makes sense to me that if I'm going to produce an implementation that's a superset of JSON Schema, that implementation will allow schemas like { type: "binary" }. For example, CBOR has byte arrays, or support for CDDL.

I'll take another look and see if I can address this though.

I feel like this has come up once or twice before, or is this just déjà vu?

it makes sense to me that if I'm going to produce an implementation that's a superset of JSON Schema...

I agree. Perhaps 'type' is the only keyword where it might make sense to expand the syntax of an existing keyword. If so, we can handle this thusly:

update the specification for 'type' to indicate it should explicitly ignore any unrecognized type names, to allow for new vocabularies to add their own 'type' keyword (and indicate that those vocabularies should also ignore all the core types and any other unrecognized values) - which would allow vocabularies to share the 'type' keyword between them)

explicitly document that vocabularies cannot use the same keyword name as a core vocabulary (with the single exception of 'type')

leave as undefined behaviour what happens when two custom vocabularies choose to use the same keyword (some vocabularies may choose to error on a collision; others may wish to have overlapping behaviour as with 'type').

@karenetheridge we didn't really carve out an exception for format. It's always been defined as having an open-ended set of values and we are forced to continue supporting that for the time being. No one should ever create a keyword like format again.

@handrews I well understand the difficult nature of format :) I am saying we should be explicit in the spec (given the questions and different understandings shown earlier in this thread) that it is not generally permitted for a vocabulary to extend the syntax, range of permitted values or other functionality of an existing core keyword. Especially in the context of extending the defined data model set (the topic of the PR), the spec should give direction that new data types will need vocabularies to define new keywords to work with them, rather than extend existing keywords.

My read on the above comments are, this PR is off, and the key is...

To the extent that this interacts with type, it just removes any implicit assumption that the absence of type is the same as "type": ["null", "boolean", "number", "array", "object"]. If type is absent and you have a parsed instance, no further type checking is done as an assertion. - @handrews

In addition, +1 to what @handrews further said...

What that really means is that a keyword can't fail an assertion in one module and pass in another, and expect the result to be a pass. If any handler fails, the result is a failure. Just like adding keywords to a schema object, constraints can only be added, never removed.

@awwright do you feel you've misread the issue here?
Do you feel a new PR would make the most sense or do you want to continue to modify this one?

I'm thinking, let's edit this down to make the minimal set of changes to satisfy OAI (they're not doing anything with validation keywords on binary data, as I understand). And then, I can take some more time and examine vocabularies better, and things like CBOR that claim to be supersets of JSON.

@awwright sounds good. And yes, that's correct about OAI- this is the fundamental use case:

Its useful to be able to use "contentMediaType" in contexts where binary media types are present. It does not make sense to use "type": "string" with binary data, because such data is not a JSON string no matter how much you try to map it into the data model.

A schema without a type can effectively be a hint not to worry about a detailed mapping. It is a way to use JSON Schema across media types without excluding binary content.

contentMediaType is strictly an annotation as of 2019-09. Even contentSchema is an annotation. It's just that the value of annotation is a schema, so after you parse the media type (somehow, JSON Schema doesn't care how) you can apply that schema to it.

awwright · 2020-08-23T17:33:43Z

I took the validation paragraphs out, so the behavior of validation keywords remains undefined.

                         The schema itself may only be expressible in this superset;
                         for example, to make use of the "const" keyword.
                     </t>
-                    <t>
-                        New data model types may be defined as either supersets of an existing type
-                        (similar to how "number" is a superset of "integer"),
-                        or a type disjoint from the core data model altogether (like "binary").
-                        Existing keywords may be updated to define behavior for the new range of values,
-                        but vocabularies that intend to be compatible with JSON Schema Validation should not
-                        change the behavior for existing types.
-                    </t>
-                    <t>
-                        Drawing on the "binary" example,
-                        a custom vocabulary may define what it means to use a "binary" data type
-                        in the "const" keyword, or the meaning of "minLength" when the instance is a "binary",
-                        but for the sake of consistency, it should not change the behavior when the instance is a string.
-                    </t>
                 </section>
             </section>

handrews · 2020-09-06T03:10:49Z

Fixes #941 (linking issue). Can we merge this now?

awwright · 2020-09-06T03:42:11Z

@handrews Go ahead

awwright force-pushed the 970 branch 4 times, most recently from c1602b8 to a7ab9fe Compare August 17, 2020 06:15

awwright requested a review from handrews August 17, 2020 06:19

Relequestual reviewed Aug 17, 2020

View reviewed changes

jsonschema-core.xml Show resolved Hide resolved

karenetheridge reviewed Aug 17, 2020

View reviewed changes

Clarify how JSON Schema works with a superset of the defined data model

8afae04

awwright force-pushed the 970 branch from a7ab9fe to 8afae04 Compare August 23, 2020 06:52

Relequestual assigned handrews Aug 23, 2020

handrews approved these changes Aug 23, 2020

View reviewed changes

Relequestual approved these changes Aug 23, 2020

View reviewed changes

handrews merged commit 30f3158 into json-schema-org:master Sep 6, 2020

handrews mentioned this pull request Sep 16, 2020

Allow "contentMediaType" without "type" #941

Closed

awwright deleted the 970 branch January 14, 2022 21:56

Uh oh!

Clarify how JSON Schema works with a superset of the defined data model #970

Clarify how JSON Schema works with a superset of the defined data model #970

Uh oh!

Conversation

awwright commented Aug 17, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Relequestual Aug 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awwright Aug 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karenetheridge Aug 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awwright Aug 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awwright commented Aug 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

handrews commented Sep 6, 2020

Uh oh!

awwright commented Sep 6, 2020

Uh oh!

Uh oh!

Relequestual Aug 19, 2020 •

edited

Loading

awwright Aug 17, 2020 •

edited

Loading

karenetheridge Aug 17, 2020 •

edited

Loading

awwright Aug 22, 2020 •

edited

Loading

awwright commented Aug 23, 2020 •

edited

Loading