Description
At the moment, Git for Windows' strategy is to convert the entire environment to UTF-8 wholesale at startup and, for performance reasons, to keep the environment sorted so that lookups are more efficient.
As has been found out recently the hard way, the underlying assumption that Git's own code is the exclusive user of the environment is not only fragile but incorrect: for example, cURL uses and modifies the environment as well.
So let's revisit the strategy to modify the environment. One viable option is to intercept both getenv() and putenv() in Git code to keep the real environment encoded in the current code page, but convert transparently from/to UTF-8 so that Git itself only sees UTF-8 values. This could be sped up considerably by testing whether the value in question is pure ASCII (which it will be in most cases) and, if so, skipping the conversion altogether.
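A minimal sketch of the ASCII fast path on the getenv() side might look as follows. The names is_pure_ascii(), codepage_to_utf8() and sketch_getenv() are hypothetical and not taken from Git's code base, and the converter is only a placeholder that copies its input; on Windows a real one would presumably go through MultiByteToWideChar() and WideCharToMultiByte().

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise. */
static int is_pure_ascii(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 0;
	return 1;
}

/*
 * Hypothetical converter from the current code page to UTF-8.
 * Placeholder only: it duplicates the input so that the sketch
 * compiles and runs on any platform.
 */
static char *codepage_to_utf8(const char *value)
{
	size_t len = strlen(value) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, value, len);
	return copy;
}

/*
 * Hypothetical getenv() wrapper. Pure-ASCII values are returned
 * as-is (their bytes are identical in UTF-8); anything else is
 * converted. *to_free receives the buffer the caller must free(),
 * or NULL if no conversion took place.
 */
static const char *sketch_getenv(const char *name, char **to_free)
{
	const char *value = getenv(name);

	*to_free = NULL;
	if (!value || is_pure_ascii(value))
		return value;
	*to_free = codepage_to_utf8(value);
	return *to_free;
}
```

The fast path relies on the assumption that the active code page is ASCII-compatible, so a value without high-bit bytes reads the same in both encodings. A putenv() wrapper would apply the same test before converting in the opposite direction.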
If necessary, the converted values could be held in a hash map, but for long-running Git processes this would require a least-recently-used eviction scheme, incurring quite a bit of complexity. So let's do that only if it turns out that the performance without this cache is not good enough.
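To give a feel for the bookkeeping such a cache involves, here is a deliberately simplified sketch. It uses a small fixed array with linear search instead of a hash map, and a counter as the "clock" for least-recently-used eviction; CACHE_SLOTS, cache_lookup() and cache_store() are hypothetical names chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 4 /* arbitrary size, for illustration only */

struct cache_entry {
	char *key;               /* original (code page) value */
	char *value;             /* converted (UTF-8) value */
	unsigned long last_used; /* larger means more recently used */
};

static struct cache_entry cache[CACHE_SLOTS];
static unsigned long cache_clock;

static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Returns the cached conversion for key, or NULL on a miss. */
static const char *cache_lookup(const char *key)
{
	int i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].key && !strcmp(cache[i].key, key)) {
			cache[i].last_used = ++cache_clock;
			return cache[i].value;
		}
	return NULL;
}

/*
 * Stores a conversion. Reuses the slot of an identical key or an
 * empty slot if there is one; otherwise evicts the entry that was
 * used least recently.
 */
static void cache_store(const char *key, const char *value)
{
	int i, victim = 0;

	for (i = 0; i < CACHE_SLOTS; i++) {
		if (!cache[i].key || !strcmp(cache[i].key, key)) {
			victim = i;
			break;
		}
		if (cache[i].last_used < cache[victim].last_used)
			victim = i;
	}
	free(cache[victim].key);
	free(cache[victim].value);
	cache[victim].key = dup_string(key);
	cache[victim].value = dup_string(value);
	cache[victim].last_used = ++cache_clock;
}
```

Even this reduced version has to manage ownership of every string and keep usage stamps up to date on each lookup, which illustrates why the cache is worth adding only if measurements call for it.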
The most important part of this ticket is to come up with a realistic benchmark. The best way in this developer's opinion would be to record all the calls to the environment conversion as well as to getenv() and putenv(), as performed by a complete test suite run, and condense those calls into a single benchmark program.
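One way the recording and replaying could be approached is sketched below. The trace format (one "operation argument" pair per line) and the names record_env_call() and replay_trace() are assumptions made for this example, not an existing Git facility. The replay only re-issues getenv() calls, because putenv() is not part of ISO C and would need platform-specific handling.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Appends one call to the trace. op is "getenv" or "putenv";
 * arg is the variable name or the NAME=value string.
 */
static void record_env_call(FILE *out, const char *op, const char *arg)
{
	fprintf(out, "%s %s\n", op, arg);
}

/*
 * Replays the getenv() calls found in a trace and returns how many
 * were issued. *found receives the number of lookups that returned
 * a value. Lines with any other operation are skipped.
 */
static int replay_trace(FILE *in, int *found)
{
	char line[1024];
	int calls = 0;

	*found = 0;
	while (fgets(line, sizeof(line), in)) {
		line[strcspn(line, "\r\n")] = '\0';
		if (!strncmp(line, "getenv ", 7)) {
			if (getenv(line + 7))
				(*found)++;
			calls++;
		}
	}
	return calls;
}
```

A benchmark program built along these lines would time replay_trace() over a trace collected from a full test suite run, once with the current strategy and once with the intercepting wrappers.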