Revisit the UTF-8 environment code

At the moment, Git for Windows' strategy is to convert the entire environment to UTF-8 wholesale at startup and, for performance reasons, to keep the environment sorted so that lookups can be more efficient.

As we recently found out the hard way, the underlying assumption that Git's own code is the exclusive user of the environment is not only fragile but incorrect: for example, cURL uses and modifies the environment as well.

So let's revisit the strategy for modifying the environment. One viable option is to intercept both `getenv()` and `putenv()` in Git code to keep the _real_ environment encoded in the current code page, but convert transparently from/to UTF-8 so that Git itself only sees Unicode values. This could be sped up considerably by testing whether the value in question is pure ASCII (which it will be in most cases) and, if so, skipping the conversion altogether.
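To make the ASCII fast path concrete, here is a minimal sketch in portable C. All names (`is_pure_ascii()`, `latin1_to_utf8()`, `getenv_utf8()`) are hypothetical and not existing Git for Windows functions. The conversion helper assumes ISO-8859-1 purely as a stand-in for "the current code page"; real code would presumably go through the Win32 conversion routines instead. Only the `getenv()` direction is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise. */
static int is_pure_ascii(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 0;
	return 1;
}

/*
 * Stand-in for the code page conversion (assumption: the input is
 * ISO-8859-1). Returns a malloc()ed UTF-8 copy, or NULL if out of memory.
 */
static char *latin1_to_utf8(const char *in)
{
	char *out, *p;

	assert(in != NULL);
	/* Each input byte expands to at most two UTF-8 bytes. */
	out = p = malloc(2 * strlen(in) + 1);
	if (!out)
		return NULL;
	for (; *in; in++) {
		unsigned char c = (unsigned char)*in;
		if (c < 0x80) {
			*p++ = (char)c;
		} else {
			*p++ = (char)(0xC0 | (c >> 6));
			*p++ = (char)(0x80 | (c & 0x3F));
		}
	}
	*p = '\0';
	return out;
}

/*
 * Hypothetical getenv() wrapper: pure-ASCII values are returned as-is
 * (no allocation, no conversion); anything else is converted.
 * *needs_free is set to 1 when the caller owns the returned buffer.
 */
static char *getenv_utf8(const char *name, int *needs_free)
{
	char *value = getenv(name);

	*needs_free = 0;
	if (!value || is_pure_ascii(value))
		return value;
	*needs_free = 1;
	return latin1_to_utf8(value);
}
```

The `needs_free` flag is just one way to handle ownership without a cache; it changes the calling convention compared to plain `getenv()`, which is part of what a benchmark would have to justify.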

If necessary, the converted values could be held in a hash map, but for long-running Git processes this would require a least-recently-used eviction scheme, incurring quite a bit of complexity. So let's do that only if it turns out that the performance without this cache is not good enough.

The most important part of this ticket is to come up with a realistic benchmark. The best way in this developer's opinion would be to record all the calls to the environment conversion as well as to `getenv()` and `putenv()`, as performed by a complete test suite run, and condense those calls into a single benchmark program.
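One possible way to record those calls is sketched below: thin wrappers that append one line per call to a trace file, which could later be replayed by a benchmark program. The names `env_trace`, `traced_getenv()` and `traced_putenv()` and the line format are assumptions made up for illustration; how the wrappers would be hooked into Git (and how the trace file gets opened during a test suite run) is left open.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical trace sink; tracing is disabled while this is NULL. */
static FILE *env_trace;

/* Logs "getenv NAME" and forwards to the real getenv(). */
static char *traced_getenv(const char *name)
{
	assert(name != NULL);
	if (env_trace)
		fprintf(env_trace, "getenv %s\n", name);
	return getenv(name);
}

/* Logs "putenv NAME=VALUE" and forwards to the real putenv(). */
static int traced_putenv(char *assignment)
{
	assert(assignment != NULL);
	if (env_trace)
		fprintf(env_trace, "putenv %s\n", assignment);
	return putenv(assignment);
}
```

Logging the full `putenv()` assignment keeps the value lengths and the ASCII/non-ASCII mix in the trace, which matters if the benchmark is supposed to measure the conversion cost realistically.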


Revisit the UTF-8 environment code #49
