Skip to content

Split $anchor out of $id, define canonical vs shadowed URI behavior #780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Sep 14, 2019

Conversation

handrews
Copy link
Contributor

@handrews handrews commented Aug 16, 2019

[EDIT: I pushed a commit Saturday just before 1PM california time that I think simplifies things considerably- see the commit log message for details.]

THERE ARE THREE COMMITS (originally, and then more added)
While this is all one coherent change, and the commits don't fully stand alone, I have organized them to make reviewing a bit easier.

Fixes #729. Fixes #726.

This implements the recent decision to strongly discourage using fragments that cross a base URI change (e.g. an "$id" appearance).

This is done by describing schemas identified by absolute URIs as resources (per the literal definition of Univeral Resource Identifier), and declaring the URI resulting from the "$id" to be the resource's canonical URI.

While URIs for subschemas with "$id" and their children can be constructed using the base URI from a parent schema, these URIs are non-canonical, and their behavior is undefined. This is on the grounds that switching between embedding and referencing schema resources should behave essentially identically.

A difficulty with how annotation collection works in the event of such as switch is noted in a CREF (related to #779)

This also adds the $anchor keyword in place of the fragment-only form of $id, and updates the schema identification examples for $anchor and canonical vs non-canonical URIs.

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you were able to get this in!

I found a few errors in the example where fragments were incorrectly labeled as "undefined behavior".

@gregsdennis
Copy link
Member

gregsdennis commented Aug 16, 2019

Does the $anchor functionality differ from that of $id when $id's value is "anchor-like"? (i.e Is this existing functionality just split out to another keyword?) If so, does $id no longer support what $anchor now does?

@handrews
Copy link
Contributor Author

@gregsdennis "$anchor": "foo" has the same effect that "$id": "#foo" used to have.

You can no longer write "$id": "#foo", as the spec says that the value of $id MUST resolve to an absolute-URI (without a fragment).

But I agree that that's not entirely obvious from the current PR. I will add something to clarify the change. It might be a CREF rather than main text, and it will definitely be in the changelog... which I forgot to write. Lemme add that...

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing my request for changes, because it was a misunderstanding. I'm sure the misunderstanding was due more to my preconceived notions of the concept getting in the way than it was about failure to explain things well. Others will have to decide if it's explained clearly.

@handrews handrews changed the base branch from canon-base to master August 16, 2019 16:24
@handrews
Copy link
Contributor Author

@jdesrosiers thanks- you spotted a legitimately confusing case, and that is very helpful! I'm honestly not sure what the right thing to do is with the fragment URI that expresses the location of the embedded schema resource in the parent schema resource. It does not seem right to say that it is undefined, so let's see how this wording holds up and whether I need to tweak it further.

Copy link
Member

@Relequestual Relequestual left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this makes sense.

I've made a few comments which are semantics only.

In terms of killing URI shadowing (Resolving #726), could you clarify for me, given the example, a reference which would be invalid? I was trying to think of one, but I can't quite wrangle my head around this...

Additionally, do we want to explicitly "NOT RECOMMEND[ED]" using the non-canonical URI given support is optional (If I read this correctly)

This lets us talk more clearly about what an "$id" really is, as
will be seen in the next commit.  It also introduces the idea
of a schema resource having a canonical URI, which is important
for explaining why JSON Pointer fragments relative to parent
base URIs (meaning that they "cross" a subschema "$id") should
have undefined behavior.
This implements the recent decision to strongly discourage using
fragments that cross a base URI change (e.g. an "$id" appearance).

This is done by describing schemas identified by absolute URIs
as resources (per the literal definition of Univeral Resource
Identifier), which was done in the previous commit, and declaring
the URI resulting from the "$id" to be the resource's canonical
URI.

While URIs for subschemas with "$id" and their chidren can be
constructed using the base URI from a parent schema, these
URIs are non-canonical, and their behavior is undefined.  This is
on the grounds that switching between embedding and referencing
schema resources should behave essentially identically.

A difficulty with how annotation collection works in the event
of such as switch is noted in a CREF.
This adds the $anchor keyword in place of the fragment-only
form of $id, and updates the schema identification examples
for $anchor and canonical vs non-canonical URIs.
$id cannot contain a fragment except for an empty one,
mostly because a lot of people do that out of habit it seems.

This also reworks the wording around non-canonical URIs to avoid
an overly broad "undefined behavior."  This still doesn't feel
ideal but is hopefully an improvement.
Now that the core spec explicitly defines this as the default,
let's call it that.  Because calling it the "core and validation"
meta-schema is annoying.
Since we're now very strongly discouraging fragments in $id,
let's not use them in $schema either.  It works either way,
but I like the consistency.

Stylistically, referring to "#" internally makes sense,
while using an absolute-URI per RFC 3986 (no fragment)
makes sense externally.
@handrews
Copy link
Contributor Author

The update I just pushed is simply a rebase onto master to fix the conflicts from merging the other PRs. The only conflict was the $schema and $id values of the vocab example (they were totally changed in the example PR, and I just had the trailing # in this PR).

There shouldn't be any significant change to the diff, as I made a point to keep the PRs as non-overlapping as possible.

@handrews
Copy link
Contributor Author

@gregsdennis I answered the one question you have- are you OK with this change now or have you not had time to finish reviewing?

@johandorland @Julian @KayEss are any of you planning to weigh in? If so, can you give an estimated date of when you might get to it? If not I'll just wait for at least @Relequestual and @awwright (and @philsturgeon if he ever gets back to a laptop from his bike tour 🚴 ).

This is the last big mandatory thing, and is a prerequisite for the last optional proposal I'm still working on, and we'd really like to get to publishing, so... reviews are needed! I know it's summer and folks are busy and on vacation and stuff, so letting me know when you might get to it would be very helpful.

@Julian
Copy link
Member

Julian commented Aug 23, 2019

Thumbs up from me! Thanks as usual @handrews (both for a good change and for managing to catch my attention :/)

@gregsdennis
Copy link
Member

My question is answered

@handrews
Copy link
Contributor Author

handrews commented Aug 24, 2019

[EDIT: never mind, instead of a new PR I split out a smaller part and added it to this PR as another commit- here is the message]

This further clarifies the schema resource concept and
how it is used. It provides an example to simplify
the explanation of how non-canonical URIs can be
problematic.

This also relaxes a restriction on "$schema" that has no functional
impact but avoids requiring implementations to detect and handle
a rather complex case (embedding schema resources with different
"$schema" values). And also means that embedding can work without
having to change the embedded value by removing the "$schema"
keyword.

Additional best practices for embedding will follow in a separate
commit.

This further clarifies the schema resource concept and
how it is used.  It provides an example to simplify
the explanation of how non-canonical URIs can be
problematic.

This also relaxes a restriction on "$schema" that has no functional
impact but avoids requiring implementations to detect and handle
a rather complex case (embedding schema resources with different
"$schema" values).  And also means that embedding can work without
having to change the embedded value by removing the "$schema"
keyword.

Additional best practices for embedding will follow in a separate
commit.
@Relequestual
Copy link
Member

I'll plan to review changes this week.

@Relequestual
Copy link
Member

Relequestual commented Sep 3, 2019

My suggestion of explicitly suggesting schema authors avoid non-canonical URIs still stands.
Edit, I worked out how to add suggestions as part of doing a review!

handrews and others added 2 commits September 4, 2019 20:42
We use "interoperability" for this throughout the spec.
Copy link
Member

@Relequestual Relequestual left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bop
Looks like it's good to go! 🚀

@awwright
Copy link
Member

awwright commented Sep 10, 2019

Our naming of "$anchor" here is a little bit unfortunate, in the rest of the Web it's backwards: "id" is a plain label (like a CNAME) and "anchor" is a URI.

(edit) Except apparently in RFC XML.

@awwright
Copy link
Member

By and large this seems like it'll be more accessible to Web developers than it was before; but the biggest hurdle I have is figuring out the significance of a "canonical URI".

Why not define a rel=self link, instead of a rel=canonical relationship?

@handrews
Copy link
Contributor Author

@awwright regarding our $anchor vs HTML's id (and I guess XML's? I've never looked): Yeah, that's unfortunate, and I noticed that. I'm not sure what to do about it. This part of our $id was never as widely used so it feels safer to change.

Also, if we tried to change "$id": "#foo" to "$id": "foo" (note the change from URI-reference to plain label) while changing "$id": "https://example.com/bar" to "$somethingElse": "https://example.com/bar" that would seem to be even more confusing, as no current use of $id would be left unchanged.

I'm open to suggestion, including leaving $id the way this PR leaves it but using a different name for $anchor, but I couldn't come up with anything better than this. At least $anchor is reasonably accurate in the sense that in older versions of HTML the idiom <a name="bar"> was referred to as an anchor. But I did not want to use $name as that seemed bad for some other reason that escapes me at the moment.

Would $name be better? And if so, should $recursiveAnchor be changed as well?

@handrews
Copy link
Contributor Author

@awwright

By and large this seems like it'll be more accessible to Web developers than it was before; but the biggest hurdle I have is figuring out the significance of a "canonical URI".

Why not define a rel=self link, instead of a rel=canonical relationship?

Can you elaborate on what that would look like? I found the "canonical" language helpful as the specific issue is understanding that, out of multiple possibly usable URIs, this is the one that should be used and the only one that can really be relied on.

Do you think that discussing this as a "self" relation instead is more accurate? And (at least as important), do you think that more people will find discussion in terms of "self" intuitive?

@awwright
Copy link
Member

I'll sit down later and compare the two link relations, but off the top of my head, rel=self is more accurate.

My understanding of rel=canonical is more like "these are technically different resources, but the preferred way to access it is via that one over there". Mostly for search engine optimization, so you can tell that two different pages actually contain the same blog article.

@handrews
Copy link
Contributor Author

@awwright thanks, I'll also take a look at the wording and see if using "self" makes sense to me. I do follow your reasoning here.

@handrews
Copy link
Contributor Author

To follow up on a discussion with @awwright on slack, this was resolved in favor of keeping "canonical" through the following reasoning:

After a discussion of how "self" and "canonical" might differ (a resource accessed through URI X might give X as its "self" but Y as its "canonical", which seems to be a thing that happens with web pages at least based on SEO guidance), @awwright asked:

Do we want to convey a “most correct URI”, if any of the URIs can suffice?

To which my response was yes, that was ultimate the decision from issue #726: "$id": Eliminate base URI shadowing

@awwright I'm taking your "I suppose that makes sense" from yesterday as approval since you didn't follow up with any further question or response. If that is an error, please let me know and I can make a new PR to address any remaining concerns.

@handrews handrews merged commit 8f47368 into json-schema-org:master Sep 14, 2019
@handrews
Copy link
Contributor Author

Also, while I agree that our naming of $id and $anchor is somewhat unfortunate, that's come up before and no one has had a better idea then or now. I wish we could figure something out, but we do want to preserve the more common uses of $id.

@handrews handrews deleted the canon branch April 26, 2020 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make $id conform to RFC 3986 suggestion for base URI elements "$id": Eliminate base URI shadowing
7 participants