Skip to content

Conversation

khwilliamson
Copy link
Contributor

This set of commits targets S_scan_ident in toke.c.

I've added a bunch of comments, and done minor refactoring and simplification, pushing all the real parsing into parse_ident(), which this calls in several places. That makes it easier to maintain going forward.

I've noticed for some time what appeared to be a bug, and it turned out I was right; fixed with a test. It could not handle a UTF-8 identifier with the second character not being one that is suitable for being a first character.

The second bug is more subtle, also noticed from code reading. S_intuit_more() calls S_scan_ident() just to see if the string it is looking at might possibly be an identifier. The problem is that scan_ident didn't think anyone would call it without actually expecting it to take various actions. I've added a check-only mode, so that you can call scan_ident without side effects. I do not know what the potential harms were from those side effects.

  • This set of changes does not require a perldelta entry.

The soft hypen is treated specially in toke.c
This is in preparation for passing other options to this function
This function is complicated, without enough documentation for me to
understand the subtleties; I only studied it enough to change things I
needed to, or which became obvious to me in the process.  Other things
remain undocumented by this commit.

Some of the white space gives improper indentation which will fit a
future commit.

This commit also remove redundant parentheses in one statement
That this had to be true was not obvious to me without studying closely
the code before it.  Adding an assertion will result in others deciding
they don't have to figure it out.
It's clearer to handle the short case first, and put the much longer
case afterwards.
It's clearer to handle the short case first, and put the much longer
case afterwards.
These were declared far above, due to C89 that is no longer a
constraint.
This check that the code just below won't look beyond the end of the
buffer, is rendered redundant by the "_safe" macro which does the
check itself.
By setting a variable in advance, we can merge two loops into one.
Save the value from the first time into a variable
I don't know what I was thinking when I recently thought these needed to
be in a different order.  The conjuctions are all &&, so might as well
do the simpler things first
S_scan_ident would like to call this function, already having looked at
the first character of an identifier, and deciding it is legal.  It
wants this function to finish the scan.

This commit adds a flag to S_parse_ident to accommodate this.
This fixes a bug in this function, in which it required the second
character in an identifier to be IDStart, instead of IDCont.  This
hasn't been caught because most identifiers are ASCII, and generally for
the purposes of this function in the ASCII range, all \w characters can
be IDStart.
There is a bug here in which this function is called from S_intuit_more
just to see if there is an identifier in the string it is looking at.
But that call can have "subtle implications on parsing" (according to
the long-standing comments in it).  We need a way to call scan_ident
without side-effects.

This commit adds that capability.  The next will use it.
This fixes the bug that examining the parse buffer had side-effects.  I
don't know what the implications of that were.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant