
Sockets: > 1 order magnitude slowdown when splitting message into two parts #52102

Closed
@jakemac53

Description


Overview

I have some serialization code that supports various communication channels and formats. It is a relatively chatty protocol that sends lots of fairly small messages.

When sending messages as byte data, every message is preceded by a 32-bit int describing its length, and then that many bytes are consumed. Input bytes are handled by a MessageGrouper, which produces its own stream of bytes consisting of just the fully formed messages.
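The grouping logic described above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the class and method names, the callback-based API, and the big-endian prefix are all assumptions.

```dart
// Minimal sketch (assumed names and big-endian prefix; the real
// MessageGrouper in the SDK may differ).
import 'dart:typed_data';

/// Groups a raw byte stream into fully formed messages, each framed
/// as a 32-bit length prefix followed by that many payload bytes.
class MessageGrouper {
  MessageGrouper(this.onMessage);

  /// Called once per complete message.
  final void Function(Uint8List message) onMessage;

  final _buffer = BytesBuilder();

  /// Feed a chunk of bytes from the socket; chunk boundaries need not
  /// line up with message boundaries.
  void addBytes(List<int> chunk) {
    _buffer.add(chunk);
    final bytes = _buffer.toBytes();
    var offset = 0;
    while (bytes.length - offset >= 4) {
      final length =
          ByteData.sublistView(bytes, offset, offset + 4).getUint32(0);
      if (bytes.length - offset - 4 < length) break; // incomplete message
      onMessage(Uint8List.sublistView(bytes, offset + 4, offset + 4 + length));
      offset += 4 + length;
    }
    // Keep any trailing partial message around for the next chunk.
    _buffer.clear();
    _buffer.add(Uint8List.sublistView(bytes, offset));
  }
}
```

Note that because the length prefix and the payload arrive as separate socket events, the grouper has to handle a message split across any number of chunks.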

These length bytes and message bytes are written back to back, but as independent writes to the socket/stdio sink. Note that in stdio mode these are actually grouped, and in practice we get only one list of bytes on the other end (not the case for sockets, where they arrive separately).

This performs roughly the same across all the supported channels (send ports, stdio) except for sockets, where it is significantly slower (~100x).

Note that this behaves the same in both AOT and JIT modes.

Repro steps

See https://github.com/dart-lang/sdk/blob/main/pkg/_fe_analyzer_shared/benchmark/macros/serialization_benchmark.dart - you can just execute this benchmark (note that the json version sometimes crashes, which is a separate issue; feel free to comment out the json serialization mode from the loop at the top).

When digging into this, I discovered that by combining the length bytes and the message bytes into a single call to Socket.add, the regression goes away and it is fast again.

I did that by modifying here and here to look like:

        final bytesBuilder = BytesBuilder(copy: false);
        _writeLength(result, bytesBuilder);
        bytesBuilder.add(result);
        client.add(bytesBuilder.takeBytes());

And modifying _writeLength to take a BytesBuilder.
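The modified _writeLength might look roughly like this. This is a hypothetical sketch: the real signature and the prefix endianness in the benchmark may differ, so only the shape of the change is illustrated.

```dart
// Hypothetical sketch of _writeLength after the change described
// above: it appends a big-endian 32-bit length prefix to the builder
// instead of writing it to the sink as a separate call.
import 'dart:typed_data';

void _writeLength(List<int> message, BytesBuilder bytesBuilder) {
  final prefix = ByteData(4)..setUint32(0, message.length);
  bytesBuilder.add(prefix.buffer.asUint8List());
}
```

This fits the snippet above: _writeLength fills the builder first, then the message bytes are appended, and a single client.add sends both in one write.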

(You also have to make similar updates to the client code, which is just a big string block at the bottom and kind of a pain, but doing just half of it will speed it up by half.)

CPU profile

I have some CPU profiles with and without the optimization; if you ping me directly I can send them (I am not sure what info is contained in them, so I don't want to post them here). The general trend is that everything is much slower. As an example, we spend 693ms just inside the Uint8List constructor, whereas the optimized benchmark runs in its entirety in under 100ms.

It seems like we are possibly being CPU starved or something; I am really not sure where to begin debugging this.

Metadata

Labels

P2 (a bug or feature request we're likely to work on), area-core-library (SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries), library-io, triaged (issue has been triaged by sub team)
