Skip to content

Fix #6542: Pickle line sizes in TASTy #10363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 30, 2020

Conversation

nicolasstucki
Copy link
Contributor

@nicolasstucki nicolasstucki commented Nov 17, 2020

This description is outdated, see comment for second commit bellow.

Line Sizes section

This PR add a new section called LineSizes to the tasty format. This section starts with an Int containing the number of lines followed by an This section h sequence of Int with the size of each line. It requires around one byte for each line (2 bytes if line longer than 127 characters) and extra section info.

This information is then used to compute the line offsets in SourceFile to be able to translate Span offsets into line/column numbers.

TASTy file size increase

Numbers before removing the initial Int for line count

The jar of the standard library increased form 6,647,453 bytes to 6,893,064 bytes. This is roughly a 3.7% increase in file size.

For sources in scala.collection.immutable the non-compressed TASTy files increased cumulative sizes increased from 964,421 bytes to 1,100,330 bytes (filesystem wise in Mac). This is roughly a 14% increase in file size.

Here is a subset of byte sizes of some sections (not including headers and names)

Source AST Positions Line sizes Docs
List.tasty 7343 7671 735 4609
Vector.tsaty 5542 5192 2617 2158
Map.tasty 14939 13566 844 1143
SortedSet.tasty 509 389 64 263
Set.tasty 6657 6237 456 675
Range.tasty 9321 8830 798 4531
Iterable.tasty 413 328 45 205
ArraySeq.tasty 15879 14674 851 1917

@nicolasstucki nicolasstucki self-assigned this Nov 17, 2020
@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch 2 times, most recently from b256b21 to 6e47481 Compare November 18, 2020 10:06
@nicolasstucki nicolasstucki changed the title Pickle line sizes in TASTy Fix #6542: Pickle line sizes in TASTy Nov 18, 2020
@nicolasstucki nicolasstucki linked an issue Nov 18, 2020 that may be closed by this pull request
@nicolasstucki nicolasstucki marked this pull request as ready for review November 18, 2020 12:16
@anatoliykmetyuk anatoliykmetyuk added this to the 3.0.0-RC1 milestone Nov 18, 2020
@nicolasstucki
Copy link
Contributor Author

test performance please

@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch from 71a6c12 to 65afab0 Compare November 20, 2020 08:26
@nicolasstucki
Copy link
Contributor Author

test performance please

@smarter
Copy link
Member

smarter commented Nov 21, 2020

I don't know if it would be better or worse in term of size, but it might be worth trying to store the line number as part of the PositionPickler, using delta-coding like it's already doing for the start/end/span.

@nicolasstucki
Copy link
Contributor Author

The delta coding wouldn't save us anything as we are already close to one byte per line. We could save on the section name and size of we join them.

@nicolasstucki
Copy link
Contributor Author

nicolasstucki commented Nov 21, 2020

The second commit puts the line sizes directly in the Positions section. In this version, we pickle the number of lines followed by the line sizes at the start of the Positions section. It requires around one byte for each line (2 bytes if line longer than 127 characters) and usually needs 2 bytes for the number of lines (1 if the source file has 127 lines or less).

This simplifies considerably the code required to implement this addition.

The jar of the standard library increased from 6,647,453 bytes 6,851,555 bytes (instead of the previous 6,893,064 bytes). This is a 3.1% (instead of 3.7%) increase in jar file size.

@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch 3 times, most recently from 45a2479 to 5db4c1f Compare November 21, 2020 16:02
@nicolasstucki
Copy link
Contributor Author

test performance please

@dottybot
Copy link
Member

performance test scheduled: 3 job(s) in queue, 1 running.

@smarter
Copy link
Member

smarter commented Nov 23, 2020

The delta coding wouldn't save us anything as we are already close to one byte per line.

Since the lien number delta between two trees should be 0 or 1 in most cases, it's possible we could find a better encoding that uses less than one byte per line, but I'm not sure.

@sjrd
Copy link
Member

sjrd commented Nov 23, 2020

One thing to pay attention to, if not already done: the different kinds of newlines and how they interact with offsets. In particular, the Windows \r\n takes 2 bytes, so it's a single line jump but with a shift of 2 bytes/Chars in offsets.

@liufengyun
Copy link
Contributor

It would be good to add neg test for inlined trees from separately compiled files to make sure it points to the correct source file and line.

@dottybot
Copy link
Member

Performance test finished successfully:

Visit http://dotty-bench.epfl.ch/10363/ to see the changes.

Benchmarks is based on merging with master (bbbcfde)

@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch from 5db4c1f to 9c326f2 Compare November 24, 2020 16:51
@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch from 9c326f2 to c7106ef Compare November 24, 2020 16:53
@nicolasstucki
Copy link
Contributor Author

Added a regression test in project/scripts/cmdTests.

sourceFile.setLineIndices(lineSizesUnpickler.lineIndices)
posUnpicklerOpt match
case Some(posUnpickler) =>
sourceFile.setLineIndicesFromLineSizes(posUnpickler.lineSizes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential issue here: the relative path of a file is not a unique identifier of the file, there might be path conflicts in the Scala ecosystem.

If we also store the hash of source files, that would be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you use that hash?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ctx.getSource(path) can take hash as an argument? I haven't thought through the details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hash would be on the contents of the source? Then we would need to read all the sources eagerly when we add them to the context. That may be quite expensive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we compile files, we already have the file contents in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it is only loaded when we first access something the depends on the source https://github.com/lampepfl/dotty/blob/master/compiler/src/dotty/tools/dotc/util/SourceFile.scala#L43.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we compile Scala files, the contents for the Scala files will be forced.

@nicolasstucki
Copy link
Contributor Author

Could someone with a Windows machine help me check if the following behaves the same?

Reproduction steps
# cp tests/neg/i6371/A_1.scala A.scala     
# cp tests/neg/i6371/B_2.scala B.scala    
# mkdir outA
# sbt   
sbt:scala3> scalac -d outA A.scala
sbt:scala3> scalac -print-tasty outA/A.tasty
Names:
  ...
  ...

Trees:
start = Addr(0), base = 300, current = Addr(0), end = Addr(146)
146 bytes of AST, base = Addr(0)
  ...
  ...

 97 position bytes:
   lines: 7
   line sizes: 10, 42, 18, 13, 3, 1, 0
   positions:
         0: 0 .. 92
         5: 0 .. 0
         8: 0 .. 0
        24: 0 .. 92
        27: 13 .. 90
        34: 7 .. 7
        41: 13 .. 13
        45: 13 .. 13
        49: 13 .. 13
        54: 7 .. 7
        58: 7 .. 7
        67: 7 .. 7
        71: 7 .. 7
        76: 13 .. 90
        79: 28 .. 34
        82: 31 .. 34
        88: 37 .. 41
        96: 44 .. 90
        98: 44 .. 45
       100: 58 .. 72
       104: 63 .. 64
       110: 66 .. 69
       116: 70 .. 72
       117: 77 .. 86
       119: 82 .. 83
       125: 84 .. 86


 0 comment bytes:

sbt:scala3> scalac -classpath outA -d outA B.scala
-- Error: B.scala:2:7 ----------------------------------------------------------
2 |  A.foo("aa") // error
  |  ^^^^^^^^^^^
  |this case is unreachable since type String is not a subclass of class Integer
  | This location contains code that was inlined from A.scala:3
1 error found

sbt:scala3> exit
# rm A.scala 
# sbt
sbt:scala3> scalac -classpath outA -d outA B.scala
-- Error: B.scala:2:7 ----------------------------------------------------------
2 |  A.foo("aa") // error
  |  ^^^^^^^^^^^
  |this case is unreachable since type String is not a subclass of class Integer
  | This location contains code that was inlined from A.scala:3
1 error found

@liufengyun
Copy link
Contributor

Could someone with a Windows machine help me check if the following behaves the same?

Reproduction steps

I just sent you by email.

@nicolasstucki
Copy link
Contributor Author

Executing the test on Windows yielded the same result

Reproduction steps (Windows)
dotty-2>copy tests\neg\i6371\A_1.scala A.scala
        1 file(s) copied.
dotty-2>copy tests\neg\i6371\B_2.scala B.scala
        1 file(s) copied.
dotty-2>mkdir outA
dotty-2>sbt
sbt:scala3> scalac -d outA A.scala

Names:
   
   ...

Trees:
start = Addr(0), base = 300, current = Addr(0), end = Addr(146)
146 bytes of AST, base = Addr(0)

   ...

 97 position bytes:
   lines: 7
   line sizes: 10, 42, 18, 13, 3, 1, 0
   positions:
         0: 0 .. 92
         5: 0 .. 0
         8: 0 .. 0
        24: 0 .. 92
        27: 13 .. 90
        34: 7 .. 7
        41: 13 .. 13
        45: 13 .. 13
        49: 13 .. 13
        54: 7 .. 7
        58: 7 .. 7
        67: 7 .. 7
        71: 7 .. 7
        76: 13 .. 90
        79: 28 .. 34
        82: 31 .. 34
        88: 37 .. 41
        96: 44 .. 90
        98: 44 .. 45
       100: 58 .. 72
       104: 63 .. 64
       110: 66 .. 69
       116: 70 .. 72
       117: 77 .. 86
       119: 82 .. 83
       125: 84 .. 86


 0 comment bytes:

sbt:scala3> scalac -classpath outA -d outA B.scala
-- Error: B.scala:2:7 ----------------------------------------------------------
2 |  A.foo("aa") // error
  |  ^^^^^^^^^^^
  |this case is unreachable since type String is not a subclass of class Integer
  | This location contains code that was inlined from A.scala:3
1 error found

sbt:scala3> exit

dotty-2>del A.scala

dotty-2>sbt
-- Error: B.scala:2:7 ----------------------------------------------------------
2 |  A.foo("aa") // error
  |  ^^^^^^^^^^^
  |this case is unreachable since type String is not a subclass of class Integer
  | This location contains code that was inlined from A.scala:3
1 error found

Copy link
Contributor

@odersky odersky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@nicolasstucki nicolasstucki force-pushed the add-line-sizes-in-TASTy branch from a9976e9 to 361a680 Compare November 27, 2020 09:30
@nicolasstucki
Copy link
Contributor Author

@odersky I also changed writeInt/readInt to writeNat/readNat, not only in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Line numbers from TASTy
7 participants