
Sockets: > 1 order magnitude slowdown when splitting message into two parts #52102

Closed
@jakemac53

Description


Overview

I have some serialization code that supports various communication channels and formats. It is a relatively chatty protocol that sends lots of fairly small messages.

When sending messages as byte data, every message is preceded by a 32-bit int describing its length, and then that many bytes are consumed. Input bytes are handled by a MessageGrouper, which produces its own stream of bytes consisting of just the fully formed messages.
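The grouping logic described above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the class and method names, the callback-based API, and the big-endian prefix are all assumptions.

```dart
// Minimal sketch (assumed names and big-endian prefix; the real
// MessageGrouper in the SDK may differ).
import 'dart:typed_data';

/// Groups a raw byte stream into fully formed messages, each framed
/// as a 32-bit length prefix followed by that many payload bytes.
class MessageGrouper {
  MessageGrouper(this.onMessage);

  /// Called once per complete message.
  final void Function(Uint8List message) onMessage;

  final _buffer = BytesBuilder();

  /// Feed a chunk of bytes from the socket; chunk boundaries need not
  /// line up with message boundaries.
  void addBytes(List<int> chunk) {
    _buffer.add(chunk);
    final bytes = _buffer.toBytes();
    var offset = 0;
    while (bytes.length - offset >= 4) {
      final length =
          ByteData.sublistView(bytes, offset, offset + 4).getUint32(0);
      if (bytes.length - offset - 4 < length) break; // incomplete message
      onMessage(Uint8List.sublistView(bytes, offset + 4, offset + 4 + length));
      offset += 4 + length;
    }
    // Keep any trailing partial message around for the next chunk.
    _buffer.clear();
    _buffer.add(Uint8List.sublistView(bytes, offset));
  }
}
```

Note that because the length prefix and the payload arrive as separate socket events, the grouper has to handle a message split across any number of chunks.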

These length bytes and message bytes are written back to back, but as independent writes to the socket/stdio sink. Note that in stdio mode these are actually grouped, and in practice we get only one list of bytes on the other end (not the case for sockets, where they arrive separately).

This performs roughly the same across all the supported channels (send ports, stdio) except for sockets, where it is significantly slower (~100x).

Note that this behaves the same in both AOT and JIT modes.

Repro steps

See https://github.com/dart-lang/sdk/blob/main/pkg/_fe_analyzer_shared/benchmark/macros/serialization_benchmark.dart - you can just execute this benchmark (note that the json version sometimes crashes, which is a separate issue; feel free to comment out the json serialization mode from the loop at the top).

When digging into this, I discovered that by combining the length bytes and the message bytes into a single call to Socket.add, the regression goes away and it is fast again.

I did that by modifying here and here to look like:

        final bytesBuilder = BytesBuilder(copy: false);
        _writeLength(result, bytesBuilder);
        bytesBuilder.add(result);
        client.add(bytesBuilder.takeBytes());

And modifying _writeLength to take a BytesBuilder.
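The modified _writeLength might look roughly like this. This is a hypothetical sketch: the real signature and the prefix endianness in the benchmark may differ, so only the shape of the change is illustrated.

```dart
// Hypothetical sketch of _writeLength after the change described
// above: it appends a big-endian 32-bit length prefix to the builder
// instead of writing it to the sink as a separate call.
import 'dart:typed_data';

void _writeLength(List<int> message, BytesBuilder bytesBuilder) {
  final prefix = ByteData(4)..setUint32(0, message.length);
  bytesBuilder.add(prefix.buffer.asUint8List());
}
```

This fits the snippet above: _writeLength fills the builder first, then the message bytes are appended, and a single client.add sends both in one write.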

(You also have to make similar updates to the client code, which is just a big string block at the bottom and kind of a pain, but doing just half of it will speed it up by half.)

CPU profile

I have some CPU profiles with and without the optimization; if you ping me directly I can send them (I am not sure what info is contained in them, so I don't want to post them here). The general trend is that everything is much slower. As an example, we spend 693ms just inside the Uint8List constructor, whereas the optimized benchmark runs in its entirety in under 100ms.

It seems like we are possibly being CPU starved or something; I am really not sure where to begin debugging this.

Metadata

Labels

P2 (a bug or feature request we're likely to work on), area-core-library (SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries), library-io, triaged (issue has been triaged by sub team)
