Add digit-separators specification v1.0 #3846

Merged
merged 6 commits into from
Jun 12, 2024
199 changes: 199 additions & 0 deletions working/digit-separators/feature-specification.md
@@ -0,0 +1,199 @@
# Digit Separators

Author: Lasse Nielsen, Sam Rawlins

Status: In-progress

Version 1.0

## Motivation

To make long number literals more readable, allow authors to inject [digit
group separators][] inside numbers. Examples with different possible separators:

```none
100 000 000 000 000 000 000 // space
100,000,000,000,000,000,000 // comma
100.000.000.000.000.000.000 // period
100'000'000'000'000'000'000 // apostrophe (C++)
100_000_000_000_000_000_000 // underscore (many programming languages).
```

## Proposal

### Digit separators in number literals

Allow one or more `_`s between any two otherwise adjacent _digits_ of a NUMBER
or HEX\_NUMBER token. The following are not digits: The leading `0x` or `0X` in
HEX\_NUMBER, and any `.`, `e`, `E`, `+` or `-` in NUMBER.

That means only allowing `_`s between two `0-9` digits in NUMBER and between
two `0-9`,`a-f`,`A-F` digits in HEX\_NUMBER.

The grammar changes `<DIGIT>+` to `<DIGITS>`, which is a sequence of `<DIGIT>`s
with optional `_`s between them, and likewise for hex digits:

```bnf
<NUMBER> ::= <DIGITS> (`.' <DIGITS>)? <EXPONENT>?
\alt `.' <DIGITS> <EXPONENT>?

<EXPONENT> ::= (`e' | `E') (`+' | `-')? <DIGITS>

<DIGITS> ::= <DIGIT> (`_'* <DIGIT>)*

<HEX\_NUMBER> ::= `0x' <HEX\_DIGITS>
\alt `0X' <HEX\_DIGITS>

<HEX\_DIGIT> ::= `a' .. `f'
\alt `A' .. `F'
\alt <DIGIT>

<HEX\_DIGITS> ::= <HEX\_DIGIT> (`_'* <HEX\_DIGIT>)*
```
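
As an illustration only, the grammar above can be transcribed into regular
expressions. The following is a Python sketch, not Dart, and not how any Dart
tool implements the feature. The names `NUMBER`, `HEX_NUMBER` and
`is_number_literal` are hypothetical. It checks whether a whole string is a
valid literal under the proposed grammar:

```python
import re

# <DIGITS> ::= <DIGIT> (`_'* <DIGIT>)*
DIGITS = r"[0-9](?:_*[0-9])*"
# <HEX_DIGITS> ::= <HEX_DIGIT> (`_'* <HEX_DIGIT>)*
HEX_DIGITS = r"[0-9a-fA-F](?:_*[0-9a-fA-F])*"
# <EXPONENT> ::= (`e' | `E') (`+' | `-')? <DIGITS>
EXPONENT = rf"[eE][+-]?{DIGITS}"

# <NUMBER>: digits with optional fraction and exponent, or a leading-dot form.
NUMBER = re.compile(
    rf"(?:{DIGITS}(?:\.{DIGITS})?(?:{EXPONENT})?|\.{DIGITS}(?:{EXPONENT})?)"
)
# <HEX_NUMBER>: `0x' or `0X' followed by hex digits.
HEX_NUMBER = re.compile(rf"0[xX]{HEX_DIGITS}")


def is_number_literal(text: str) -> bool:
    """Return True if the whole string matches NUMBER or HEX_NUMBER."""
    return bool(NUMBER.fullmatch(text) or HEX_NUMBER.fullmatch(text))
```

Under this sketch, `is_number_literal("1_000")` is true, while
`is_number_literal("100_")` and `is_number_literal("1.2e_3")` are false,
because an underscore is only accepted when it has a digit on both sides.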

### Examples

```none
100__000_000__000_000__000_000 // one hundred million million millions!
0x4000_0000_0000_0000
0.000_000_000_01
0x00_14_22_01_23_45 // MAC address
555_123_4567 // US Phone number
```

**Invalid** literals:

```none
100_
0x_00_14_22_01_23_45
0._000_000_000_1
100_.1
1.2e_3
```

A token like `_100` remains a valid identifier, and `_100._100` remains a valid
member access. If users learn the "separator only between digits" rule quickly,
this will likely not be an issue.

### Why choose underscores

The syntax must work even with just a single separator, so it can't be anything
that can already validly separate two expressions (excludes all infix operators
and comma) and should not already be part of a number literal (excludes decimal
point).

So, the comma and decimal point are probably never going to work, even if they
are already the standard "thousands separator" in text in different parts of
the world.

Space separation is dangerous because it's hard to see whether it's just space,
or it's an accidental tab character. If we allow spacing, should we allow
arbitrary whitespace, including line terminators? If so, then this suddenly
becomes quite dangerous. Forget a comma at the end of a line in a multiline
list, and two adjacent integers are automatically combined (we already have
that problem with strings). So, probably not a good choice, even if it is the
preferred formatting for printed text.

The apostrophe is also the string single-quote character. We don't currently
allow adjacent numbers and strings, but if we ever do, then this syntax becomes
ambiguous. It's still possible (we disambiguate by assuming it's a digit
separator). It is currently used by C++14 as a digit group separator, so it is
definitely possible.

That leaves underscore, which could be the start of an identifier. Currently
`100_000` would be tokenized as "integer literal 100" followed by "identifier
`_000`". However, users would never write an identifier adjacent to another
token that contains identifier-valid characters (unlike strings, which have
clear delimiters that do not occur anywhere else), so this is unlikely to happen
in practice. Underscore is already used by a large number of programming
languages including Java, Swift, and Python.
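
To make the tokenization change concrete, here is a toy lexer sketch in Python
(not Dart). It only knows integer literals and identifiers, and the names
`tokenize`, `OLD_INT`, `NEW_INT` and `IDENT` are hypothetical. The
`separators` flag switches between the current integer rule and the proposed
one:

```python
import re

# Simplified identifier rule, assumed here for illustration only.
IDENT = r"[A-Za-z_$][A-Za-z0-9_$]*"
# Current rule: a plain run of digits.
OLD_INT = r"[0-9]+"
# Proposed rule: digits with optional underscores between them.
NEW_INT = r"[0-9](?:_*[0-9])*"


def tokenize(source: str, separators: bool) -> list[tuple[str, str]]:
    """Return (kind, text) pairs for the integers and identifiers in source."""
    int_pattern = NEW_INT if separators else OLD_INT
    token_re = re.compile(rf"(?P<INT>{int_pattern})|(?P<IDENT>{IDENT})")
    return [(m.lastgroup, m.group()) for m in token_re.finditer(source)]
```

With this sketch, `tokenize("100_000", separators=False)` yields
`[("INT", "100"), ("IDENT", "_000")]`, while
`tokenize("100_000", separators=True)` yields `[("INT", "100_000")]`. An
input such as `_100` is an identifier in both modes.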

We also want to allow multiple separators for higher-level grouping, e.g.:

```none
100__000_000_000__000_000_000
```

For this purpose, the underscore extends gracefully. So does space, but it has
the disadvantage that it collapses when inserted into HTML, whereas `''` looks
odd.

### Related work

* [Java digit separators](https://docs.oracle.com/javase/8/docs/technotes/guides/language/underscores-literals.html)
* [Python PEP 515 - underscores in numeric literals](https://peps.python.org/pep-0515/)

### Possible new lint rules

There are some possible new lint rule considerations, but none of these are
considered vital to the usability or general success of the feature.

The feature is designed to help the readability of long numbers. But a
developer can still make a mistake about where to place separators. For example:

```dart
var one = 1_000_000;
var two = 2_000_000;
var three = 3_000_000;
var four = 4_0000_000; // Whoops!
```

If a developer uses the Dart formatter to format their code, they cannot try to
vertically align the numbers with whitespace (extra space characters are
removed by the formatter). So we could offer a lint rule to only place
separators every three digits of a decimal number. Also possibly a similar rule
for hexadecimal numbers. If a developer ever uses digit separators for a
different purpose (as in separating the digits of a phone number), the rule may
not prove useful.
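
A minimal sketch of the "every three digits" check, written in Python rather
than Dart. The function name `groups_of_three` is hypothetical, the input is
assumed to be a decimal integer literal with no fraction or exponent, and a run
of several underscores is assumed to count as one separator:

```python
import re


def groups_of_three(literal: str) -> bool:
    """True if the literal has no separators or uses thousands grouping."""
    if "_" not in literal:
        return True
    # Treat any run of underscores as a single separator (an assumption).
    groups = re.split(r"_+", literal)
    head, rest = groups[0], groups[1:]
    # The most significant group may be shorter; all others need 3 digits.
    return 1 <= len(head) <= 3 and all(len(g) == 3 for g in rest)
```

This accepts `1_000_000` and rejects `4_0000_000`. It also rejects
`555_123_4567`, which shows why such a rule would need to be optional.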

Member

Another lint option could be "consistent digit separators", which triggers if the digit groups do not have the same size (except the most significant one, which can be shorter).
If there are any __ separators, the number of _ separated groups between them should also be the same, and repeatedly for higher numbers of _s.

Member Author

I like it, added!


A separate lint rule could encourage _consistent_ digit separators, which
triggers if the digit groups do not have the same size (except the most
significant one, which can be shorter). If there are any `__` separators, the
number of `_`-separated groups between them should also be the same, and
repeatedly for higher numbers of `_`s.
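
One possible reading of the consistency rule, sketched in Python rather than
Dart. The name `is_consistent` is hypothetical, and the input is assumed to be
a single run of digits and underscores (any `0x` prefix, fraction or exponent
would be handled separately). The sketch takes the expected sizes from the
least significant end of the literal, rebuilds the literal with those sizes,
and compares the result with the original:

```python
import re


def is_consistent(literal: str) -> bool:
    """True if the separators in the literal follow one regular grouping."""
    runs = re.findall(r"_+", literal)
    if not runs:
        return True
    top = max(map(len, runs))
    # Expected digits per group, taken from the least significant group.
    group_size = len(re.split(r"_+", literal)[-1])
    # periods[k]: a run of more than k underscores is due every
    # periods[k] digit groups, counting from the right.
    periods = [1]
    for level in range(1, top):
        # Text after the last separator that is longer than `level`.
        tail = re.split(rf"_{{{level + 1},}}", literal)[-1]
        # Blocks in that tail, split on runs of exactly `level` underscores.
        blocks = len(re.split(rf"(?<!_)_{{{level}}}(?!_)", tail))
        periods.append(periods[-1] * blocks)
    # Rebuild the literal from right to left using the expected sizes.
    digits = literal.replace("_", "")
    pieces = []
    count = 0
    while digits:
        pieces.append(digits[-group_size:])
        digits = digits[:-group_size]
        count += 1
        if digits:
            width = max(k + 1 for k, p in enumerate(periods) if count % p == 0)
            pieces.append("_" * width)
    return "".join(reversed(pieces)) == literal
```

Under this reading, `100__000_000__000_000__000_000` and `1_000_000` are
consistent, while `4_0000_000` and `1_00__000_000` are not.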

### Possible new quick fixes

There are some possible new automated fix ("quick fix") considerations, but
none of these are considered vital to the usability or general success of the
feature.

#### Unexpected underscores

With the digit-separators feature, separators can be added between _digits_ of
a number literal, but nowhere else. In most error cases, the unexpected
underscore can be detected as such, and we can offer quick fixes to remove
unexpected underscores (for example, `100_`, `100_e1.2`, `100._00`). In a few
cases, the intention is not as straightforward, such as `100._100`, where
`_100` can be a legal name of an extension member (though the presence of such
a private extension member can be detected).

#### Unexpected commas

The only legal digit separator that is introduced with this feature is the
underscore character. If a developer attempts to use another character, for
example commas, as a separator, we may be able to detect this, and offer a
quick fix to convert the commas to underscores.

Contributor

Wouldn't this fall into the same problem that we have when we consider , as a separating digit? How can we be sure that the user intended to use , as a separating digit instead of two expressions separated by a comma?

Member Author

We can't. Many automated fixes are offered with the understanding that they are not the only fix, and may not be the desired fix.

I personally doubt we will offer this fix. But there might be some cases, like if the user pastes 100 lines of:

var x1 = 1,000,000;
var x2 = 2,000,000;
var x3 = 3,000,000;

into their editor, from another source text, we could offer a fix to convert the commas, because it looks more likely that they meant to write x1 = 1_000_000, since var x1 = 1, cannot be followed by an int literal (it can only be followed by an identifier).

Member

If the same person pasted 100 lines of var x = foo(1,000,000); we'd be having a harder time helping them.
(Not that we couldn't suggest 1_000_000, especially if we can see that foo takes only one argument, but it was not grammatically wrong to begin with.)

One thing that worries me is whether we can do the same fixes after the formatter has run.
If not, we may have a problem helping people who auto-format often.
If the code doesn't parse, it likely won't format either, but if it parses, and you get to var x1 = foo(1, 000, 000);, it would be sad if it's too late to help the user.
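
A rough sketch of the comma-to-underscore suggestion for the pasted
declaration case, in Python rather than Dart. The names `DECL` and
`suggest_underscores` are hypothetical, and the pattern is deliberately narrow:
it only looks at a single-line `var`, `final` or `const` declaration whose
whole initializer is a comma-grouped number, which is the case that cannot
parse today. It is one possible heuristic, not a proposed implementation:

```python
import re
from typing import Optional

# A declaration whose initializer looks like a comma-grouped number.
DECL = re.compile(
    r"^(\s*(?:var|final|const)\s+\w+\s*=\s*)(\d{1,3}(?:,\d{3})+)(\s*;)$"
)


def suggest_underscores(line: str) -> Optional[str]:
    """Return the line with commas replaced by underscores, or None."""
    m = DECL.match(line)
    if m is None:
        return None  # Not the narrow pattern; offer nothing.
    prefix, number, suffix = m.groups()
    return prefix + number.replace(",", "_") + suffix
```

For `var x1 = 1,000,000;` this returns `var x1 = 1_000_000;`. For
`var x = foo(1,000,000);` it returns `None`, because that line is already
valid code and the intent cannot be inferred from syntax alone.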

### Non-breaking change

This change is strictly non-breaking. The feature can be thought of as a single
change from previous Dart syntax: some syntax which was previously illegal
(producing compile-time errors) becomes legal.

(The feature is still introduced with a [Dart language version][], so that
packages that start using the feature declare that they require some new lower
bound of the Dart SDK.)

### Formatting

As any number literal remains a single token, there are no formatting
considerations.

## Changelog

### 1.0

- Initial version

[digit group separators]: https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping
[Dart language version]: https://github.com/dart-lang/language/blob/main/accepted/2.8/language-versioning/feature-specification.md