Commit 9932a22

bpo-33416: Add end positions to Python AST (GH-11605)
The majority of this PR is tediously passing `end_lineno` and `end_col_offset` everywhere. Here are non-trivial points:

* It is not possible to reconstruct end positions in AST "on the fly", some information is lost after an AST node is constructed, so we need two more attributes for every AST node `end_lineno` and `end_col_offset`.
* I add end position information to both CST and AST. Although it may be technically possible to avoid adding end positions to CST, the code becomes more cumbersome and less efficient.
* Since the end position is not known for non-leaf CST nodes while the next token is added, this requires a bit of extra care (see `_PyNode_FinalizeEndPos`). Unless I made some mistake, the algorithm should be linear.
* For statements, I "trim" the end position of suites to not include the terminal newlines and dedent (this seems to be what people would expect), for example in
  ```python
  class C:
      pass

  pass
  ```
  the end line and end column for the class definition is (2, 8).
* For `end_col_offset` I use the common Python convention for indexing, for example for `pass` the `end_col_offset` is 4 (not 3), so that `[0:4]` gives one the source code that corresponds to the node.
* I added a helper function `ast.get_source_segment()`, to get source text segment corresponding to a given AST node. It is also useful for testing.

An (inevitable) downside of this PR is that AST now takes almost 25% more memory. I think however it is probably justified by the benefits.
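The two position conventions described in the commit message (trimmed suites and half-open column ranges) can be checked from Python. This is a minimal sketch, assuming an interpreter that includes this change (3.8 or later); the variable names are illustrative only.

```python
import ast

source = "class C:\n    pass\n\npass\n"
tree = ast.parse(source)
class_def, trailing_pass = tree.body

# The class ends where its last statement ends; the blank line and the
# dedent after the suite are not part of the span.
class_end = (class_def.end_lineno, class_def.end_col_offset)

# end_col_offset points one past the last character, so it works as a slice bound.
pass_span = (trailing_pass.col_offset, trailing_pass.end_col_offset)
pass_line = source.splitlines()[trailing_pass.lineno - 1]
pass_text = pass_line[pass_span[0]:pass_span[1]]

print(class_end, pass_span, pass_text)
```

Slicing a `str` directly is only safe here because the line is pure ASCII; the offsets are UTF-8 byte offsets in general.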
1 parent 7a23680 commit 9932a22

19 files changed: +1406 -395 lines changed

Doc/library/ast.rst

Lines changed: 31 additions & 9 deletions
@@ -61,13 +61,21 @@ Node classes
 
    .. attribute:: lineno
                   col_offset
+                  end_lineno
+                  end_col_offset
 
       Instances of :class:`ast.expr` and :class:`ast.stmt` subclasses have
-      :attr:`lineno` and :attr:`col_offset` attributes. The :attr:`lineno` is
-      the line number of source text (1-indexed so the first line is line 1) and
-      the :attr:`col_offset` is the UTF-8 byte offset of the first token that
-      generated the node. The UTF-8 offset is recorded because the parser uses
-      UTF-8 internally.
+      :attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`, and :attr:`end_col_offset`
+      attributes. The :attr:`lineno` and :attr:`end_lineno` are the first and
+      last line numbers of source text span (1-indexed so the first line is line 1)
+      and the :attr:`col_offset` and :attr:`end_col_offset` are the corresponding
+      UTF-8 byte offsets of the first and last tokens that generated the node.
+      The UTF-8 offset is recorded because the parser uses UTF-8 internally.
+
+      Note that the end positions are not required by the compiler and are
+      therefore optional. The end offset is *after* the last symbol, for example
+      one can get the source segment of a one-line expression node using
+      ``source_line[node.col_offset : node.end_col_offset]``.
 
    The constructor of a class :class:`ast.T` parses its arguments as follows:
 
@@ -162,6 +170,18 @@ and classes for traversing abstract syntax trees:
       :class:`AsyncFunctionDef` is now supported.
 
 
+.. function:: get_source_segment(source, node, *, padded=False)
+
+   Get source code segment of the *source* that generated *node*.
+   If some location information (:attr:`lineno`, :attr:`end_lineno`,
+   :attr:`col_offset`, or :attr:`end_col_offset`) is missing, return ``None``.
+
+   If *padded* is ``True``, the first line of a multi-line statement will
+   be padded with spaces to match its original position.
+
+   .. versionadded:: 3.8
+
+
 .. function:: fix_missing_locations(node)
 
    When you compile a node tree with :func:`compile`, the compiler expects
@@ -173,14 +193,16 @@ and classes for traversing abstract syntax trees:
 
 .. function:: increment_lineno(node, n=1)
 
-   Increment the line number of each node in the tree starting at *node* by *n*.
-   This is useful to "move code" to a different location in a file.
+   Increment the line number and end line number of each node in the tree
+   starting at *node* by *n*. This is useful to "move code" to a different
+   location in a file.
 
 
 .. function:: copy_location(new_node, old_node)
 
-   Copy source location (:attr:`lineno` and :attr:`col_offset`) from *old_node*
-   to *new_node* if possible, and return *new_node*.
+   Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
+   and :attr:`end_col_offset`) from *old_node* to *new_node* if possible,
+   and return *new_node*.
 
 
 .. function:: iter_fields(node)
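The documentation changes above say that column offsets are UTF-8 byte offsets and that `get_source_segment()` returns `None` when location data is missing. The following sketch, assuming Python 3.8 or later, shows why slicing a `str` with those offsets can go wrong and what *padded* changes. The sample sources and variable names are made up for illustration.

```python
import ast

# One-line case: "é" is one character but two UTF-8 bytes, so byte offsets
# and character indexes diverge after it.
line = 'print("héllo", name)'
name_node = ast.parse(line).body[0].value.args[1]
offsets = (name_node.col_offset, name_node.end_col_offset)

by_chars = line[offsets[0]:offsets[1]]                    # slices characters
by_bytes = line.encode()[offsets[0]:offsets[1]].decode()  # slices UTF-8 bytes
via_helper = ast.get_source_segment(line, name_node)

# Multi-line case: padded=True re-indents the first line of the segment.
block = "if x:\n    for i in y:\n        pass\n"
tree = ast.parse(block)
for_node = tree.body[0].body[0]
plain = ast.get_source_segment(block, for_node)
padded = ast.get_source_segment(block, for_node, padded=True)

# The Module node carries no location attributes, so the helper returns None.
missing = ast.get_source_segment(block, tree)
```

For non-ASCII sources, `ast.get_source_segment()` or explicit byte slicing is the safer choice over the plain `source_line[a:b]` idiom.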

Include/Python-ast.h

Lines changed: 160 additions & 114 deletions
Generated file; diff not rendered by default.

Include/node.h

Lines changed: 5 additions & 1 deletion
@@ -14,11 +14,14 @@ typedef struct _node {
     int n_col_offset;
     int n_nchildren;
     struct _node *n_child;
+    int n_end_lineno;
+    int n_end_col_offset;
 } node;
 
 PyAPI_FUNC(node *) PyNode_New(int type);
 PyAPI_FUNC(int) PyNode_AddChild(node *n, int type,
-                                char *str, int lineno, int col_offset);
+                                char *str, int lineno, int col_offset,
+                                int end_lineno, int end_col_offset);
 PyAPI_FUNC(void) PyNode_Free(node *n);
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
@@ -37,6 +40,7 @@ PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
 #define REQ(n, type) assert(TYPE(n) == (type))
 
 PyAPI_FUNC(void) PyNode_ListTree(node *);
+void _PyNode_FinalizeEndPos(node *n); // helper also used in parsetok.c
 
 #ifdef __cplusplus
 }
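The body of `_PyNode_FinalizeEndPos` is not part of the hunks shown here, so the following is only a Python sketch of the idea stated in the commit message: a non-leaf CST node cannot know its end position until no more tokens can be appended under it. `CSTNode`, `finalize_end_pos`, and `add_child` are hypothetical names for illustration and are not CPython APIs; the real C code may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Position = Tuple[int, int]  # (line, column)


@dataclass
class CSTNode:
    """Hypothetical stand-in for the C `node` struct; not a CPython API."""
    kind: str
    end: Optional[Position] = None  # known up front only for tokens
    children: List["CSTNode"] = field(default_factory=list)


def finalize_end_pos(node: CSTNode) -> None:
    """Take the end position from the last child, following the rightmost path."""
    if not node.children:
        return  # a token already knows where it ends
    last = node.children[-1]
    finalize_end_pos(last)
    node.end = last.end


def add_child(parent: CSTNode, child: CSTNode) -> None:
    """Append a child; the previous sibling can no longer grow, so finalize it."""
    if parent.children:
        finalize_end_pos(parent.children[-1])
    parent.children.append(child)


# Build the tree for "a + b\n": stmt -> [expr -> [a, +, b], NEWLINE]
stmt = CSTNode("stmt")
expr = CSTNode("expr")
add_child(stmt, expr)
for kind, end in [("NAME", (1, 1)), ("PLUS", (1, 3)), ("NAME", (1, 5))]:
    add_child(expr, CSTNode(kind, end))
end_before_sibling = expr.end                # still unknown: expr may keep growing
add_child(stmt, CSTNode("NEWLINE", (1, 6)))  # finalizes expr
finalize_end_pos(stmt)                       # the root is finalized last
```

In this sketch `expr` only receives its end position once a sibling follows it, which mirrors the "extra care" the commit message mentions.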

Lib/ast.py

Lines changed: 92 additions & 8 deletions
@@ -115,10 +115,10 @@ def _format(node):
 
 def copy_location(new_node, old_node):
     """
-    Copy source location (`lineno` and `col_offset` attributes) from
-    *old_node* to *new_node* if possible, and return *new_node*.
+    Copy source location (`lineno`, `col_offset`, `end_lineno`, and `end_col_offset`
+    attributes) from *old_node* to *new_node* if possible, and return *new_node*.
     """
-    for attr in 'lineno', 'col_offset':
+    for attr in 'lineno', 'col_offset', 'end_lineno', 'end_col_offset':
         if attr in old_node._attributes and attr in new_node._attributes \
            and hasattr(old_node, attr):
             setattr(new_node, attr, getattr(old_node, attr))
@@ -133,31 +133,44 @@ def fix_missing_locations(node):
     recursively where not already set, by setting them to the values of the
     parent node. It works recursively starting at *node*.
     """
-    def _fix(node, lineno, col_offset):
+    def _fix(node, lineno, col_offset, end_lineno, end_col_offset):
         if 'lineno' in node._attributes:
             if not hasattr(node, 'lineno'):
                 node.lineno = lineno
             else:
                 lineno = node.lineno
+        if 'end_lineno' in node._attributes:
+            if not hasattr(node, 'end_lineno'):
+                node.end_lineno = end_lineno
+            else:
+                end_lineno = node.end_lineno
         if 'col_offset' in node._attributes:
             if not hasattr(node, 'col_offset'):
                 node.col_offset = col_offset
             else:
                 col_offset = node.col_offset
+        if 'end_col_offset' in node._attributes:
+            if not hasattr(node, 'end_col_offset'):
+                node.end_col_offset = end_col_offset
+            else:
+                end_col_offset = node.end_col_offset
         for child in iter_child_nodes(node):
-            _fix(child, lineno, col_offset)
-    _fix(node, 1, 0)
+            _fix(child, lineno, col_offset, end_lineno, end_col_offset)
+    _fix(node, 1, 0, 1, 0)
     return node
 
 
 def increment_lineno(node, n=1):
     """
-    Increment the line number of each node in the tree starting at *node* by *n*.
-    This is useful to "move code" to a different location in a file.
+    Increment the line number and end line number of each node in the tree
+    starting at *node* by *n*. This is useful to "move code" to a different
+    location in a file.
     """
     for child in walk(node):
         if 'lineno' in child._attributes:
             child.lineno = getattr(child, 'lineno', 0) + n
+        if 'end_lineno' in child._attributes:
+            child.end_lineno = getattr(child, 'end_lineno', 0) + n
     return node
 
 
@@ -213,6 +226,77 @@ def get_docstring(node, clean=True):
     return text
 
 
+def _splitlines_no_ff(source):
+    """Split a string into lines ignoring form feed and other chars.
+
+    This mimics how the Python parser splits source code.
+    """
+    idx = 0
+    lines = []
+    next_line = ''
+    while idx < len(source):
+        c = source[idx]
+        next_line += c
+        idx += 1
+        # Keep \r\n together
+        if c == '\r' and idx < len(source) and source[idx] == '\n':
+            next_line += '\n'
+            idx += 1
+        if c in '\r\n':
+            lines.append(next_line)
+            next_line = ''
+
+    if next_line:
+        lines.append(next_line)
+    return lines
+
+
+def _pad_whitespace(source):
+    """Replace all chars except '\f\t' in a line with spaces."""
+    result = ''
+    for c in source:
+        if c in '\f\t':
+            result += c
+        else:
+            result += ' '
+    return result
+
+
+def get_source_segment(source, node, *, padded=False):
+    """Get source code segment of the *source* that generated *node*.
+
+    If some location information (`lineno`, `end_lineno`, `col_offset`,
+    or `end_col_offset`) is missing, return None.
+
+    If *padded* is `True`, the first line of a multi-line statement will
+    be padded with spaces to match its original position.
+    """
+    try:
+        lineno = node.lineno - 1
+        end_lineno = node.end_lineno - 1
+        col_offset = node.col_offset
+        end_col_offset = node.end_col_offset
+    except AttributeError:
+        return None
+
+    lines = _splitlines_no_ff(source)
+    if end_lineno == lineno:
+        return lines[lineno].encode()[col_offset:end_col_offset].decode()
+
+    if padded:
+        padding = _pad_whitespace(lines[lineno].encode()[:col_offset].decode())
+    else:
+        padding = ''
+
+    first = padding + lines[lineno].encode()[col_offset:].decode()
+    last = lines[end_lineno].encode()[:end_col_offset].decode()
+    lines = lines[lineno+1:end_lineno]
+
+    lines.insert(0, first)
+    lines.append(last)
+    return ''.join(lines)
+
+
 def walk(node):
     """
     Recursively yield all descendant nodes in the tree starting at *node*
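The three location helpers touched in `Lib/ast.py` now carry end positions as well. Here is a short sketch of their observable behavior, assuming Python 3.8 or later; `span` is a helper defined only for this example.

```python
import ast

tree = ast.parse("x = spam(1)\n")
assign = tree.body[0]


def span(node):
    """Return (lineno, col_offset, end_lineno, end_col_offset) for a node."""
    return (node.lineno, node.col_offset, node.end_lineno, node.end_col_offset)


# copy_location: the replacement inherits the full span of the call.
constant = ast.copy_location(ast.Constant(value=42), assign.value)
copied_span = span(constant)

# fix_missing_locations: a node without positions takes its parent's values.
assign.value = ast.Name(id="y", ctx=ast.Load())
ast.fix_missing_locations(tree)
filled_span = span(assign.value)

# increment_lineno: start and end lines move together.
ast.increment_lineno(tree, n=10)
shifted_lines = (assign.lineno, assign.end_lineno)
```

Note that `fix_missing_locations()` gives the new `Name` node the span of the whole assignment, not of the expression it replaced. When a precise span matters, `copy_location()` from the replaced node is the better fit.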

Lib/test/test_asdl_parser.py

Lines changed: 4 additions & 2 deletions
@@ -62,14 +62,16 @@ def test_product(self):
 
     def test_attributes(self):
         stmt = self.types['stmt']
-        self.assertEqual(len(stmt.attributes), 2)
+        self.assertEqual(len(stmt.attributes), 4)
         self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)')
         self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)')
+        self.assertEqual(str(stmt.attributes[2]), 'Field(int, end_lineno, opt=True)')
+        self.assertEqual(str(stmt.attributes[3]), 'Field(int, end_col_offset, opt=True)')
 
     def test_constructor_fields(self):
         ehandler = self.types['excepthandler']
         self.assertEqual(len(ehandler.types), 1)
-        self.assertEqual(len(ehandler.attributes), 2)
+        self.assertEqual(len(ehandler.attributes), 4)
 
         cons = ehandler.types[0]
         self.assertIsInstance(cons, self.asdl.Constructor)
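The ASDL test above expects four attributes, with the two end attributes marked `opt=True`. The same shape should be visible on the generated `ast` classes; this sketch assumes Python 3.8 or later and also checks that a tree without end positions still compiles.

```python
import ast

stmt_attributes = ast.stmt._attributes
handler_attributes = ast.excepthandler._attributes

# End positions are optional: this tree sets only lineno and col_offset.
constant = ast.Constant(value=1, lineno=1, col_offset=0)
module = ast.Module(
    body=[ast.Expr(value=constant, lineno=1, col_offset=0)],
    type_ignores=[],
)
code = compile(module, "<ast>", "exec")
```

This matches the documentation note that the compiler does not require end positions.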
