bpo-33416: Add end positions to Python AST #11605

Merged
merged 37 commits into from
Jan 22, 2019
37 commits
ba4ba82
Some initial infra
ilevkivskyi Jan 6, 2019
3e343e3
Regenerate nodes
ilevkivskyi Jan 6, 2019
1684c17
Mindless implementation: known bugs, notably in fstrings
ilevkivskyi Jan 6, 2019
514d4ea
Some test fixes
ilevkivskyi Jan 7, 2019
3ab2516
More test fixes
ilevkivskyi Jan 7, 2019
1d3e352
Add a TODO
ilevkivskyi Jan 8, 2019
dbf9cc9
Switch to better algorithm for finding end position
ilevkivskyi Jan 13, 2019
a44207b
Merge remote-tracking branch 'upstream/master' into add-end-line-col
ilevkivskyi Jan 13, 2019
5af33da
Be consistent for line_num
ilevkivskyi Jan 13, 2019
ce7f5ce
Minor fixes; start adding tests
ilevkivskyi Jan 13, 2019
2171eb9
Update two failing tests
ilevkivskyi Jan 13, 2019
10cf4bd
Fix multiline strings
ilevkivskyi Jan 14, 2019
ed05305
Fix end position for if statement
ilevkivskyi Jan 15, 2019
58fbfa6
Adjust end positions in while and for
ilevkivskyi Jan 15, 2019
f2589ff
Add also with
ilevkivskyi Jan 15, 2019
7d5ca5e
Fix try end position (concludes fixing suites)
ilevkivskyi Jan 15, 2019
aa62e3c
Some formatting plus minor fixes
ilevkivskyi Jan 15, 2019
96a0ec0
More formatting; fix import from
ilevkivskyi Jan 16, 2019
c169025
Fix f-strings
ilevkivskyi Jan 16, 2019
553a772
Add few more tests
ilevkivskyi Jan 16, 2019
5cc01e9
Add final bunch of tests
ilevkivskyi Jan 17, 2019
dce260e
Update docstrings
ilevkivskyi Jan 17, 2019
9ba6604
Update docs
ilevkivskyi Jan 17, 2019
69a6280
Add get_source_segment() helper
ilevkivskyi Jan 17, 2019
4af426f
Consistent formatting in docstring; use new helper in tests
ilevkivskyi Jan 17, 2019
e5a12c3
Fix bug
ilevkivskyi Jan 17, 2019
0275a93
Split few long lines
ilevkivskyi Jan 17, 2019
f20635b
Add tests and docs for the helper
ilevkivskyi Jan 18, 2019
c9da8f5
Fix missing comma
ilevkivskyi Jan 18, 2019
f97f38a
📜🤖 Added by blurb_it.
blurb-it[bot] Jan 18, 2019
70cc16c
Fix .rst warning and a smelly symbol
ilevkivskyi Jan 18, 2019
48936b9
Fix get_source_segment
ilevkivskyi Jan 19, 2019
ac5b5cb
More CR
ilevkivskyi Jan 19, 2019
4726f17
rst fixes
ilevkivskyi Jan 19, 2019
eeea87d
Remove unused vars
ilevkivskyi Jan 19, 2019
027a4ca
📜🤖 Added by blurb_it.
blurb-it[bot] Jan 19, 2019
ff361f2
Remove old NEWS file
ilevkivskyi Jan 19, 2019
40 changes: 31 additions & 9 deletions Doc/library/ast.rst
@@ -61,13 +61,21 @@ Node classes

    .. attribute:: lineno
                   col_offset
+                  end_lineno
+                  end_col_offset

       Instances of :class:`ast.expr` and :class:`ast.stmt` subclasses have
-      :attr:`lineno` and :attr:`col_offset` attributes. The :attr:`lineno` is
-      the line number of source text (1-indexed so the first line is line 1) and
-      the :attr:`col_offset` is the UTF-8 byte offset of the first token that
-      generated the node. The UTF-8 offset is recorded because the parser uses
-      UTF-8 internally.
+      :attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`, and :attr:`end_col_offset`
+      attributes. The :attr:`lineno` and :attr:`end_lineno` are the first and
+      last line numbers of source text span (1-indexed so the first line is line 1)
+      and the :attr:`col_offset` and :attr:`end_col_offset` are the corresponding
+      UTF-8 byte offsets of the first and last tokens that generated the node.
+      The UTF-8 offset is recorded because the parser uses UTF-8 internally.
+
+      Note that the end positions are not required by the compiler and are
+      therefore optional. The end offset is *after* the last symbol, for example
+      one can get the source segment of a one-line expression node using
+      ``source_line[node.col_offset : node.end_col_offset]``.

       The constructor of a class :class:`ast.T` parses its arguments as follows:
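
To make the byte-offset wording concrete, here is a minimal sketch, assuming Python 3.8 or later (where the end attributes exist). The sample source line is made up for illustration. Because the offsets count UTF-8 bytes, the one-line slice quoted in the docs is only exact on the encoded line once a non-ASCII character precedes the node.

```python
import ast

# Hypothetical one-line source with a two-byte UTF-8 character before `y`.
source_line = 'x = "é" + y'
tree = ast.parse(source_line)
node = tree.body[0].value.right  # the Name node for `y`

# Offsets are UTF-8 byte offsets, so slice the encoded line and decode.
segment = source_line.encode()[node.col_offset:node.end_col_offset].decode()

# Slicing the str directly misses the node here, because "é" is two bytes.
naive = source_line[node.col_offset:node.end_col_offset]

print(repr(segment), repr(naive))
```

For pure-ASCII lines the two slices agree, which is why the simpler form in the docs works in the common case.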

@@ -162,6 +170,18 @@ and classes for traversing abstract syntax trees:
       :class:`AsyncFunctionDef` is now supported.


+.. function:: get_source_segment(source, node, *, padded=False)
+
+   Get source code segment of the *source* that generated *node*.
+   If some location information (:attr:`lineno`, :attr:`end_lineno`,
+   :attr:`col_offset`, or :attr:`end_col_offset`) is missing, return ``None``.
+
+   If *padded* is ``True``, the first line of a multi-line statement will
+   be padded with spaces to match its original position.
+
+   .. versionadded:: 3.8
+
+
 .. function:: fix_missing_locations(node)

    When you compile a node tree with :func:`compile`, the compiler expects
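
A short usage sketch of the new helper, assuming Python 3.8 or later. The source string and the name `compute` are hypothetical. It shows a node spanning two lines, and the ``None`` result for a hand-built node that carries no location information.

```python
import ast

# Hypothetical two-line statement; the call spans both lines.
source = "result = compute(a,\n" + " " * 17 + "b)\n"
tree = ast.parse(source)
call = tree.body[0].value

# The segment starts at the call, not at the start of the statement.
segment = ast.get_source_segment(source, call)

# A node built by hand has no positions, so the helper returns None
# instead of raising.
missing = ast.get_source_segment(source, ast.Constant(value=1))

print(repr(segment), missing)
```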
@@ -173,14 +193,16 @@ and classes for traversing abstract syntax trees:

 .. function:: increment_lineno(node, n=1)

-   Increment the line number of each node in the tree starting at *node* by *n*.
-   This is useful to "move code" to a different location in a file.
+   Increment the line number and end line number of each node in the tree
+   starting at *node* by *n*. This is useful to "move code" to a different
+   location in a file.


 .. function:: copy_location(new_node, old_node)

-   Copy source location (:attr:`lineno` and :attr:`col_offset`) from *old_node*
-   to *new_node* if possible, and return *new_node*.
+   Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
+   and :attr:`end_col_offset`) from *old_node* to *new_node* if possible,
+   and return *new_node*.
Member
Does this only affect the node, or the whole tree? I've never used this so I don't know what to expect from context.

Member Author
This one (unlike some others) is non-recursive; one can even copy from a node of a different kind.
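
The non-recursive behaviour described in this reply can be checked with a small sketch, assuming Python 3.8 or later. The parsed snippet is arbitrary; the target node is deliberately of a different kind than the source node.

```python
import ast

tree = ast.parse("spam(eggs + 1)")
call = tree.body[0].value  # a Call node with full position info

# Hand-built node of a different kind (BinOp) with no positions yet.
new = ast.BinOp(left=ast.Constant(value=1), op=ast.Add(),
                right=ast.Constant(value=2))
ast.copy_location(new, call)

# All four position attributes are copied onto `new` itself...
copied = (new.lineno, new.col_offset, new.end_lineno, new.end_col_offset)

# ...but its children are left untouched.
child_lineno = getattr(new.left, "lineno", None)

print(copied, child_lineno)
```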



.. function:: iter_fields(node)
274 changes: 160 additions & 114 deletions Include/Python-ast.h

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion Include/node.h
@@ -14,11 +14,14 @@ typedef struct _node {
     int n_col_offset;
     int n_nchildren;
     struct _node *n_child;
+    int n_end_lineno;
+    int n_end_col_offset;
 } node;

 PyAPI_FUNC(node *) PyNode_New(int type);
 PyAPI_FUNC(int) PyNode_AddChild(node *n, int type,
-                                char *str, int lineno, int col_offset);
+                                char *str, int lineno, int col_offset,
+                                int end_lineno, int end_col_offset);
 PyAPI_FUNC(void) PyNode_Free(node *n);
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
@@ -37,6 +40,7 @@ PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
 #define REQ(n, type) assert(TYPE(n) == (type))

 PyAPI_FUNC(void) PyNode_ListTree(node *);
+void _PyNode_FinalizeEndPos(node *n); // helper also used in parsetok.c

 #ifdef __cplusplus
 }
100 changes: 92 additions & 8 deletions Lib/ast.py
@@ -115,10 +115,10 @@ def _format(node):

 def copy_location(new_node, old_node):
     """
-    Copy source location (`lineno` and `col_offset` attributes) from
-    *old_node* to *new_node* if possible, and return *new_node*.
+    Copy source location (`lineno`, `col_offset`, `end_lineno`, and `end_col_offset`
+    attributes) from *old_node* to *new_node* if possible, and return *new_node*.
     """
-    for attr in 'lineno', 'col_offset':
+    for attr in 'lineno', 'col_offset', 'end_lineno', 'end_col_offset':
         if attr in old_node._attributes and attr in new_node._attributes \
            and hasattr(old_node, attr):
             setattr(new_node, attr, getattr(old_node, attr))
@@ -133,31 +133,44 @@ def fix_missing_locations(node):
     recursively where not already set, by setting them to the values of the
     parent node. It works recursively starting at *node*.
     """
-    def _fix(node, lineno, col_offset):
+    def _fix(node, lineno, col_offset, end_lineno, end_col_offset):
         if 'lineno' in node._attributes:
             if not hasattr(node, 'lineno'):
                 node.lineno = lineno
             else:
                 lineno = node.lineno
+        if 'end_lineno' in node._attributes:
+            if not hasattr(node, 'end_lineno'):
+                node.end_lineno = end_lineno
+            else:
+                end_lineno = node.end_lineno
         if 'col_offset' in node._attributes:
             if not hasattr(node, 'col_offset'):
                 node.col_offset = col_offset
             else:
                 col_offset = node.col_offset
+        if 'end_col_offset' in node._attributes:
+            if not hasattr(node, 'end_col_offset'):
+                node.end_col_offset = end_col_offset
+            else:
+                end_col_offset = node.end_col_offset
         for child in iter_child_nodes(node):
-            _fix(child, lineno, col_offset)
-    _fix(node, 1, 0)
+            _fix(child, lineno, col_offset, end_lineno, end_col_offset)
+    _fix(node, 1, 0, 1, 0)
Member
This whole function looks a bit suspicious. Shouldn't it at least ensure that (end_lineno, end_col_offset) > (lineno, col_offset)? (Again, I guess I don't know the use case.)

Member Author
This function is used to compile manually generated AST to bytecode. Although end positions are technically not needed by the compiler (only start positions are recorded in the bytecode), it is probably good to have them fixed. The algorithm for fixing is quite naive (take the position of the nearest root-side node if known, otherwise use the start of the file), but it was never intended to be robust; it exists just to allow compilation in those cases where a user doesn't care about line numbers.
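
A minimal sketch of the behaviour described in this reply, assuming Python 3.8 or later. The nodes are built by hand and the position values are arbitrary: a child without positions inherits them from the nearest positioned ancestor, and a tree with no positions at all falls back to the start of the file.

```python
import ast

# Child has no positions; its parent has a full (arbitrary) set.
child = ast.Constant(value=1)
parent = ast.Expr(value=child, lineno=3, col_offset=4,
                  end_lineno=3, end_col_offset=5)
module = ast.Module(body=[parent], type_ignores=[])
ast.fix_missing_locations(module)
inherited = (child.lineno, child.col_offset,
             child.end_lineno, child.end_col_offset)

# No positions anywhere: everything falls back to (1, 0, 1, 0).
bare = ast.fix_missing_locations(ast.Expression(body=ast.Constant(value=7)))
fallback = (bare.body.lineno, bare.body.col_offset,
            bare.body.end_lineno, bare.body.end_col_offset)

# The use case: the fixed-up tree can now be compiled.
value = eval(compile(bare, "<demo>", "eval"))

print(inherited, fallback, value)
```

Note that the inherited end position equals the parent's, so no ordering between start and end is enforced, which is the point raised in the question above.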

     return node


 def increment_lineno(node, n=1):
     """
-    Increment the line number of each node in the tree starting at *node* by *n*.
-    This is useful to "move code" to a different location in a file.
+    Increment the line number and end line number of each node in the tree
+    starting at *node* by *n*. This is useful to "move code" to a different
+    location in a file.
     """
     for child in walk(node):
         if 'lineno' in child._attributes:
             child.lineno = getattr(child, 'lineno', 0) + n
+        if 'end_lineno' in child._attributes:
+            child.end_lineno = getattr(child, 'end_lineno', 0) + n
     return node
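
As an illustration of the updated `increment_lineno`, a small sketch assuming Python 3.8 or later. The two-line source is arbitrary. Both the start and end line numbers shift by `n`, so multi-line spans keep their height.

```python
import ast

# An assignment whose value spans two lines.
tree = ast.parse("x = (1 +\n     2)\n")
assign = tree.body[0]
before = (assign.lineno, assign.end_lineno)

ast.increment_lineno(tree, n=10)
after = (assign.lineno, assign.end_lineno)

# Nested nodes are shifted as well: the constant `2` was on line 2.
inner = assign.value.right.lineno

print(before, after, inner)
```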


@@ -213,6 +226,77 @@ def get_docstring(node, clean=True):
     return text


+def _splitlines_no_ff(source):
Member Author
Another option is to use .splitlines(keepends=True) (which treats \f as a line ending) and then re-join the pieces that don't end in \r or \n. I am not sure which approach is better.
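
For comparison, the alternative mentioned in this comment could look roughly like the sketch below. This is not the code in the PR; `splitlines_rejoin` is a hypothetical name, and the sample string is made up to include a form feed and a `\r\n` ending.

```python
def splitlines_rejoin(source):
    """Split on \\r, \\n and \\r\\n only (hypothetical alternative helper)."""
    lines = []
    pending = ""
    for piece in source.splitlines(keepends=True):
        pending += piece
        # str.splitlines also breaks on \f, \v and some Unicode separators;
        # glue those pieces back until a real newline ends the line.
        if pending.endswith(("\r", "\n")):
            lines.append(pending)
            pending = ""
    if pending:
        lines.append(pending)
    return lines


sample = "a = 1\n\fb = 2\r\nc = 3"
raw = sample.splitlines(keepends=True)
joined = splitlines_rejoin(sample)
print(raw)
print(joined)
```

The trade-off is between a character-by-character loop, as in the PR, and relying on `str.splitlines` followed by a repair pass.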

+    """Split a string into lines ignoring form feed and other chars.
+
+    This mimics how the Python parser splits source code.
+    """
+    idx = 0
+    lines = []
+    next_line = ''
+    while idx < len(source):
+        c = source[idx]
+        next_line += c
+        idx += 1
+        # Keep \r\n together
+        if c == '\r' and idx < len(source) and source[idx] == '\n':
+            next_line += '\n'
+            idx += 1
+        if c in '\r\n':
+            lines.append(next_line)
+            next_line = ''
+
+    if next_line:
+        lines.append(next_line)
+    return lines
+
+
+def _pad_whitespace(source):
+    """Replace all chars except '\f\t' in a line with spaces."""
+    result = ''
+    for c in source:
+        if c in '\f\t':
+            result += c
+        else:
+            result += ' '
+    return result
+
+
+def get_source_segment(source, node, *, padded=False):
+    """Get source code segment of the *source* that generated *node*.
+
+    If some location information (`lineno`, `end_lineno`, `col_offset`,
+    or `end_col_offset`) is missing, return None.
+
+    If *padded* is `True`, the first line of a multi-line statement will
+    be padded with spaces to match its original position.
+    """
+    try:
+        lineno = node.lineno - 1
+        end_lineno = node.end_lineno - 1
+        col_offset = node.col_offset
+        end_col_offset = node.end_col_offset
+    except AttributeError:
+        return None
+
+    lines = _splitlines_no_ff(source)
+    if end_lineno == lineno:
+        return lines[lineno].encode()[col_offset:end_col_offset].decode()
+
+    if padded:
+        padding = _pad_whitespace(lines[lineno].encode()[:col_offset].decode())
+    else:
+        padding = ''
+
+    first = padding + lines[lineno].encode()[col_offset:].decode()
+    last = lines[end_lineno].encode()[:end_col_offset].decode()
+    lines = lines[lineno+1:end_lineno]
+
+    lines.insert(0, first)
+    lines.append(last)
+    return ''.join(lines)
+
+
 def walk(node):
     """
     Recursively yield all descendant nodes in the tree starting at *node*
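
The effect of the `padded` flag is easiest to see on an indented multi-line node. A minimal sketch, assuming Python 3.8 or later, with a made-up class for illustration:

```python
import ast
import textwrap

source = "class A:\n    def f(self):\n        return 1\n"
func = ast.parse(source).body[0].body[0]  # the nested FunctionDef

# Without padding, the first line loses its indentation, so it no longer
# lines up with the rest of the body.
plain = ast.get_source_segment(source, func)

# With padding, the first line keeps its original column.
padded = ast.get_source_segment(source, func, padded=True)

# The padded form keeps relative indentation, so it dedents cleanly.
dedented = textwrap.dedent(padded)

print(repr(plain))
print(repr(padded))
```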
6 changes: 4 additions & 2 deletions Lib/test/test_asdl_parser.py
@@ -62,14 +62,16 @@ def test_product(self):

     def test_attributes(self):
         stmt = self.types['stmt']
-        self.assertEqual(len(stmt.attributes), 2)
+        self.assertEqual(len(stmt.attributes), 4)
         self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)')
         self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)')
Member
Add tests for stmt.attributes[2] and stmt.attributes[3]?

Member Author
Added.

+        self.assertEqual(str(stmt.attributes[2]), 'Field(int, end_lineno, opt=True)')
+        self.assertEqual(str(stmt.attributes[3]), 'Field(int, end_col_offset, opt=True)')

     def test_constructor_fields(self):
         ehandler = self.types['excepthandler']
         self.assertEqual(len(ehandler.types), 1)
-        self.assertEqual(len(ehandler.attributes), 2)
+        self.assertEqual(len(ehandler.attributes), 4)

         cons = ehandler.types[0]
         self.assertIsInstance(cons, self.asdl.Constructor)
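
The four attributes checked by this ASDL test are also visible at runtime on the generated node classes. A small sketch, assuming Python 3.8 or later; it only reads the documented `_attributes` tuple and does not touch the ASDL parser itself.

```python
import ast

# Runtime counterpart of the four ASDL attributes on stmt and excepthandler.
stmt_attrs = ast.stmt._attributes
handler_attrs = ast.excepthandler._attributes

# The end positions are optional: a node built with only start positions
# has no end position set.
node = ast.Pass(lineno=1, col_offset=0)
end_lineno = getattr(node, "end_lineno", None)

print(stmt_attrs, end_lineno)
```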