S_scan_ident: Comments, refactor some; fix bugs #23866

khwilliamson · 2025-10-20T00:43:07Z

This set of commits targets S_scan_ident in toke.c.

I've added a bunch of comments, and done minor refactoring and simplification, pushing all the real parsing into parse_ident(), which this calls in several places. That makes it easier to maintain going forward.

For some time I had suspected a bug here, and it turned out to be real; it is now fixed, with a test. The function could not handle a UTF-8 identifier whose second character is not suitable as a first character.

The second bug is more subtle, and was also found by reading the code. S_intuit_more() calls S_scan_ident() just to see whether the string it is looking at might be an identifier. The problem is that scan_ident was written on the assumption that every caller wants it to take various actions. I've added a check-only mode, so that scan_ident can be called without side effects. I do not know what the potential harms from those side effects were.

This set of changes does not require a perldelta entry.

This function is complicated, without enough documentation for me to understand the subtleties; I only studied it enough to change the things I needed to, or which became obvious to me in the process. Other things remain undocumented by this commit. Some of the whitespace changes leave indentation that looks improper for now but will fit a future commit. This commit also removes redundant parentheses in one statement.

S_scan_ident would like to call this function, already having looked at the first character of an identifier, and deciding it is legal. It wants this function to finish the scan. This commit adds a flag to S_parse_ident to accommodate this.

This fixes a bug in this function, in which it required the second character in an identifier to be IDStart, instead of IDCont. This hasn't been caught because most identifiers are ASCII, and generally for the purposes of this function in the ASCII range, all \w characters can be IDStart.
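The difference between the two checks can be illustrated with a minimal, ASCII-only sketch. It is not the code in toke.c: `is_id_start`, `is_id_cont`, `ident_len` and `ident_len_buggy` are hypothetical names, and ASCII digits stand in for the non-ASCII characters that are IDCont but not IDStart.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical ASCII-only stand-ins for the IDStart and IDCont properties.
 * Digits may continue an identifier but may not start one. */
static int is_id_start(unsigned char c) { return isalpha(c) || c == '_'; }
static int is_id_cont(unsigned char c)  { return isalnum(c) || c == '_'; }

/* Correct shape: returns the length of the identifier at the start of s,
 * or 0 if s does not begin with one. */
static size_t ident_len(const char *s)
{
    assert(s != NULL);
    if (!is_id_start((unsigned char) s[0]))
        return 0;

    size_t len = 1;
    /* Every character after the first only needs to be IDCont. */
    while (is_id_cont((unsigned char) s[len]))
        len++;
    return len;
}

/* Buggy shape: the second character is wrongly held to IDStart, so the
 * scan stops early when it is IDCont only. */
static size_t ident_len_buggy(const char *s)
{
    assert(s != NULL);
    if (!is_id_start((unsigned char) s[0]))
        return 0;
    if (!is_id_start((unsigned char) s[1]))
        return 1;

    size_t len = 2;
    while (is_id_cont((unsigned char) s[len]))
        len++;
    return len;
}
```

With these definitions, `ident_len("a1b")` covers the whole string, while `ident_len_buggy("a1b")` stops after the first character.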

There is a bug here in which this function is called from S_intuit_more just to see if there is an identifier in the string it is looking at. But that call can have "subtle implications on parsing" (according to the long-standing comments in it). We need a way to call scan_ident without side-effects. This commit adds that capability. The next will use it.
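As a rough sketch of the idea, a check-only flag might look like the following. Everything here is an assumption made for illustration: `SCAN_CHECK_ONLY`, `struct parser_state` and `scan_ident_sketch` are hypothetical names, and the side effects shown (advancing a position and bumping a counter) are invented, since the real side effects in scan_ident are not described here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SCAN_CHECK_ONLY 0x1u   /* hypothetical flag name */

/* Hypothetical stand-in for parser state that a scan may update. */
struct parser_state {
    size_t pos;          /* current offset into the buffer */
    int    idents_seen;  /* an invented example of a side effect */
};

/* Returns the length of the ASCII identifier at buf + st->pos, or 0 if
 * there is none. With SCAN_CHECK_ONLY set, *st is left untouched. */
static size_t scan_ident_sketch(struct parser_state *st, const char *buf,
                                unsigned flags)
{
    assert(st != NULL && buf != NULL);

    const char *s = buf + st->pos;
    size_t len = 0;

    if (isalpha((unsigned char) s[0]) || s[0] == '_') {
        len = 1;
        while (isalnum((unsigned char) s[len]) || s[len] == '_')
            len++;
    }

    /* Only a real scan is allowed to change the parser state. */
    if (len > 0 && !(flags & SCAN_CHECK_ONLY)) {
        st->pos += len;
        st->idents_seen++;
    }
    return len;
}
```

A caller that only wants to peek passes `SCAN_CHECK_ONLY` and gets the same return value as a real scan, with the state unchanged.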

khwilliamson added 17 commits October 19, 2025 18:30

regen/unicode_constants: Create one for SHY

472c3d7

The soft hyphen is treated specially in toke.c.

S_scan_ident: Convert to flags parameter

a8bb10b

This is in preparation for passing other options to this function

S_scan_ident: Use mnemonic for soft hyphen code point

00a64da

S_scan_ident: Add a mnemonic instead of using -1

4398474

This makes things clearer.

S_scan_ident: Add an assertion

62b8c79

That this had to be true was not obvious to me without closely studying the code before it. Adding an assertion means others won't have to work it out for themselves.

S_scan_ident: Swap conditionals order

74c1c8e

It's clearer to handle the short case first, and put the much longer case afterwards.

S_scan_ident: Swap another set of conditionals order

6577cc4

It's clearer to handle the short case first, and put the much longer case afterwards.

S_scan_ident: Move declarations close to first use

9a733c9

These were declared far above because of C89, which is no longer a constraint.

S_scan_ident: Remove unnecessary complexity

6063f3d

The check that the code just below won't look beyond the end of the buffer is rendered redundant by the "_safe" macro, which does the check itself.

S_scan_ident: Collapse two loops

bca9430

By setting a variable in advance, we can merge two loops into one.

S_scan_ident: Avoid a recalculation

76a13ec

Save the value from the first time into a variable

S_parse_ident: Swap order of conditionals

a940e99

I don't know what I was thinking when I recently thought these needed to be in a different order. The conjunctions are all &&, so we might as well do the simpler things first.

S_intuit_more: Call scan_ident in check-only mode

7b4e630

This fixes the bug that examining the parse buffer had side-effects. I don't know what the implications of that were.
