IDLE: define word/id chars in one place. #89855

terryjreedy · 2021-11-02T16:45:22Z

BPO	45692
Nosy	@terryjreedy, @serhiy-storchaka, @AlexWaygood
PRs	gh-89855: Improve support of non-ASCII identifiers in IDLE #29381

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/terryjreedy'
closed_at = None
created_at = <Date 2021-11-02.16:45:21.519>
labels = ['expert-IDLE', 'type-feature', '3.11']
title = 'IDLE: define word/id chars in one place.'
updated_at = <Date 2021-11-03.11:41:19.491>
user = 'https://github.com/terryjreedy'

bugs.python.org fields:

activity = <Date 2021-11-03.11:41:19.491>
actor = 'AlexWaygood'
assignee = 'terry.reedy'
closed = False
closed_date = None
closer = None
components = ['IDLE']
creation = <Date 2021-11-02.16:45:21.519>
creator = 'terry.reedy'
dependencies = []
files = []
hgrepos = []
issue_num = 45692
keywords = ['patch']
message_count = 12.0
messages = ['405516', '405519', '405522', '405526', '405532', '405536', '405538', '405541', '405543', '405546', '405548', '405608']
nosy_count = 3.0
nosy_names = ['terry.reedy', 'serhiy.storchaka', 'AlexWaygood']
pr_nums = ['29381']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue45692'
versions = ['Python 3.11']

terryjreedy · 2021-11-02T16:45:21Z

IDLE currently defines the same set of chars in 5 places with 5 names. (Listed by Serhiy Storchaka in bpo-45669.)
Lib/idlelib/autoexpand.py:20: wordchars = string.ascii_letters + string.digits + "_"
Lib/idlelib/undo.py:254: alphanumeric = string.ascii_letters + string.digits + "_"
Lib/idlelib/editor.py:809: IDENTCHARS = string.ascii_letters + string.digits + "_"
Lib/idlelib/hyperparser.py:13:_ASCII_ID_CHARS = frozenset(string.ascii_letters + string.digits + "_")
Lib/idlelib/autocomplete.py:33:ID_CHARS = string.ascii_letters + string.digits + "_"

I suspect that either a string or frozenset would work everywhere (check). I will pick a name after checking this. The single definition would go in the proposed utils.py, which is part of another issue and PR.

(Note: the utility tk index functions should also go there.)

serhiy-storchaka · 2021-11-02T16:59:17Z

This set is mostly outdated. In Python 2 it was a set of characters composing identifiers, but in Python 3 identifiers can contain non-ASCII characters.

serhiy-storchaka · 2021-11-02T17:08:14Z

Complete sets of characters which can be used in identifiers are too large:

>>> allchars = ''.join(map(chr, range(0x110000)))
>>> identstartchars = ''.join(c for c in allchars if c.isidentifier())
>>> identcontchars = ''.join(c for c in allchars if ('a' + c).isidentifier())
>>> len(identstartchars), len(identcontchars)
(131975, 135053)

serhiy-storchaka · 2021-11-02T17:32:33Z

This in an interesting problem. I am going to work on it at the next weekends.

terryjreedy · 2021-11-02T18:49:14Z

I checked for other possible ascii only problems and only found

config_key.py: 14: ALPHANUM_KEYS = tuple(string.ascii_lowercase + string.digits)
config_key.py: 39: if 'Shift' in modifiers and key in string.ascii_lowercase:
config_key.py: 15: PUNCTUATION_KEYS = tuple('~!@#%^&*()_-+={}[]|;:,.<>/?')
config_key.py: 20: AVAILABLE_KEYS = (ALPHANUM_KEYS + PUNCTUATION_KEYS + FUNCTION_KEYS +

terryjreedy · 2021-11-02T19:20:08Z

There have been occasional discussions about IDLE not being properly unicode aware in some of its functions. Discussions have foundered on these facts and no fix made.

The direct replacement string, your 'identcontchars', seems too big. We have always assumed that O(n) linear scans would be too slow.
A frozen set should give O(1) lookup, like fast enough, but would be even bigger.
The string methods operate on and scan through multiple chars, whereas IDLE wants to test 1 char at a time.
Even if the O(n*n) behavior of multiple calls is acceptible, there is no function for unicode continuation chars. s.idchars requires that the first character be a start char, which is to say, not a digit. s.alnum is false for '_'. (Otherwise, it would work.)

I would like to better this time. Possible responses to the blockers:

Correct; reject.
Maybe adding an elephant is better than keeping multiple IDLE features disabled for non-ascii users. How big?

>>> import sys
>>> fz = frozenset(c for c in map(chr, range(0x110000)) if ('a'+c).isidentifier)
>>> sys.getsizeof(fz)
33554648

Whoops, each 2 or 4 byte slice of the underlying array becomes 76 bytes + 8 bytes * size of hash array. Not practical either.

For at least some of the uses, the repeated calls may be fast enough.
We can synthesize s.isidcontinue with "c.isalnum() or c == '_'". "c.isidentifier() or c.isdigit()" would also work but should be slower.

Any other ideas? I will look at the use cases next.

terryjreedy · 2021-11-02T19:32:19Z

autoexpand.py, line 20, wordchars

'wordchars' is correct here since words beginning with digits can be expanded.

>>> s = '0x4f334'
>>> 0x4f334  # Hit alt-/ after 0 and enter
324404

Used in line 89 in method getprevword

        while i > 0 and line[i-1] in self.wordchars:
            i = i-1

Proposed replacement seems to work.

>>> i = len(s)
... while i > 0 and (c := s[i-1] or c == '_'):
...     i -= 1
... 
>>> i,c
(0, '0')

terryjreedy · 2021-11-02T20:01:09Z

autocomplete.py, line 33, ID_CHARS is only used on line 137 to find the prefix of an identifier when completions have been explicitly requested.

            while i and (curline[i-1] in ID_CHARS or ord(curline[i-1]) > 127):
                i -= 1
            comp_start = curline[i:j]

The completion is for a name or attribute depending on whether the preceding char, if any, is '.'. Here, the unicode fix was to accept all non-ascii as possible id chars. There is no harm as the completion box only has valid completions, and if the prefix given does not match anything, nothing is highlighted.

ID_CHARS could be moved to utils if the same ascii string is used in another module.

terryjreedy · 2021-11-02T20:23:06Z

undo.py, line 254, alphanumeric

Used in immediately following lines to classify chars as 'alphanumeric', 'newline', or 'punctuation' (the default). I believe I have only ever looked at this module to add the test code at the bottom. In any case, I don't know the effect of calling non-ascii chars punctuation, but suspect it is not the best thing. So I suspect that the autoexpand fix would be the best.

The classify method is only used on line 248 in the merge method above.
self.classify(self.chars[-1]) != self.classify(cmd.chars)
merge() is only used on l.124 in addcmd.

To figure out more, I would experiment identifiers without and with non-ascii and undo and redo and see what difference there is.

terryjreedy · 2021-11-02T20:38:02Z

editor.py, line 809, IDENTCHARS

Used in the immediately following def colorize_syntax_error on line 814.
if char and char in self.IDENTCHARS:
text.tag_add("ERROR", pos + " wordstart", pos)
I believe the intent is to color the part of an identifier that precedes the character marked. At the moment, I cannot think of how to trigger such a situation. I would have to add some console prints to investigate. Maybe this should get the autoexpand fix, but maybe it is dead code.

terryjreedy · 2021-11-02T20:48:05Z

hyperparser.py, line 13, _ASCII_ID_CHARS
line 15, _ASCII_ID_FIRST_CHARS, frozenset(string.ascii_letters + "_")

Only used on line 18 and 21 to create 128 item lookup tables. The point is to be fast as hyperparser scans multiple chars when invoked. The expandword fix and c.isidentifier() could replace the lookup. But would they bog down response time? We need to look at hyperparser use cases and do some testing.

AlexWaygood · 2021-11-03T11:41:19Z

The PR that proposes creating a new utility.py file is mine, linked to #89610. Would it make things easier if I split it into two PRs: one adding an empty util.py file, and the other making my proposed changes to support syntax highlighting for .pyi files?

EDIT by tjr: issue closed and util.py added.

terryjreedy added the 3.11 only security fixes label Nov 2, 2021

terryjreedy self-assigned this Nov 2, 2021

terryjreedy added topic-IDLE type-feature A feature request or enhancement 3.11 only security fixes labels Nov 2, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

terryjreedy added this to IDLE Issues Jul 23, 2022

erlend-aasland moved this to In Progress in IDLE Issues Jul 27, 2022

bedevere-bot mentioned this issue Sep 18, 2022

gh-89855: Improve support of non-ASCII identifiers in IDLE #29381

Draft

python deleted a comment from RawanFahad3 May 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IDLE: define word/id chars in one place. #89855

IDLE: define word/id chars in one place. #89855

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

AlexWaygood commented Nov 3, 2021 •

edited by terryjreedy

Loading

IDLE: define word/id chars in one place. #89855

IDLE: define word/id chars in one place. #89855

Comments

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

serhiy-storchaka commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

terryjreedy commented Nov 2, 2021

AlexWaygood commented Nov 3, 2021 • edited by terryjreedy Loading

AlexWaygood commented Nov 3, 2021 •

edited by terryjreedy

Loading