Skip to content

Revisit the UTF-8 environment code #49

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dscho opened this issue Mar 27, 2015 · 10 comments
Closed

Revisit the UTF-8 environment code #49

dscho opened this issue Mar 27, 2015 · 10 comments

Comments

@dscho
Copy link
Member

dscho commented Mar 27, 2015

At the moment, Git for Windows' strategy is to convert the entire environment to UTF-8 wholesale at startup, and for performance reasons, keep the environment sorted so that lookups can be more efficient.

As has been found out recently the hard way, the underlying assumption that Git's own code is the exclusive user of the environment is not only fragile but incorrect: for example, cURL uses and modifies the environment as well.

So let's revisit the strategy to modify the environment. One viable option is to intercept both getenv() and putenv() in Git code to keep the real environment encoded in the current code page, but convert transparently from/to UTF-8 so that Git itself only sees Unicode values. This could be sped up considerably by testing whether the value in question is pure ASCII (which it will be in most cases) and skip the conversion altogether.

If necessary, the converted values could be held in a hash map, but for long-running Git processes this would require a last-recently-used eviction scheme, incurring quite a bit of complexity. So let's do that only if it turns out that the performance without this cache is not good enough.

The most important part of this ticket is to come up with a realistic benchmark. The best way in this developer's opinion would be to record all the calls to the environment conversion as well as to getenv() and putenv(), as performed by a complete test suite run, and condense those calls into a single benchmark program.

@dscho dscho added the git label Mar 27, 2015
@dscho dscho self-assigned this Mar 27, 2015
@kblees
Copy link

kblees commented Mar 30, 2015

So let's revisit the strategy to modify the environment. One viable option is to intercept both getenv() and putenv() in Git code to keep the real environment encoded in the current code page, but convert transparently from/to UTF-8 so that Git itself only sees Unicode values.

Just for reference, this would be the 'wrapped' variant from this discussion [1]. The repo.or.cz links still work, but are easier to review in my github repo (e.g. [2]). The branches instrumented with performance measurements ('...-perf') are only on repo.or.cz, though.

IIRC, the main problems of the wrapped version were the lack of POSIX-compliant environ, putenv() and unsetenv(), as well as the added complexity of making getenv() return values permanent [3]. However, safe-getenv.c of that patch could nowadays be simplified greatly using strintern() from hashmap.[ch].

This could be sped up considerably by testing whether the value in question is pure ASCII (which it will be in most cases) and skip the conversion altogether.

I don't think so. We'd have to use _wgetenv() to check if its ASCII-only, then use _getenv() to get the char* version. I.e. adding another linear search of the environment, which will most likely be slower than just UTF-8-converting the _wgetenv() result.

[1] https://groups.google.com/forum/#!msg/msysgit/54h0ROyPTPU/nJPhRIOz5xAJ
[2] https://github.com/kblees/git/commits/kb/unicode-v9-wrapped
[3] kblees@71a975b

dscho referenced this issue Apr 7, 2015
Make sure that the nedmalloc patch is applied before.
@dscho
Copy link
Member Author

dscho commented Apr 8, 2015

We'd have to use _wgetenv() to check if its ASCII-only, then use _getenv() to get the char* version

Actually we can easily test whether the return value of _wgetenv() is ASCII only and then convert it trivially, i.e. no second linear search is required.

@dscho
Copy link
Member Author

dscho commented Apr 8, 2015

By the way, as we are talking performance already (in my mind, robustness must come first, and we are not there yet), I am starting to doubt that we query the environment often enough that this linear search is really hurting. In fact, I could imagine that we work way harder than necessary in most cases because we convert the environment wholesale to UTF-8, only to re-encode it to UTF-16 just before spawning a new process - which, if it is another Git process, re-re-encodes the environment to UTF-8. And with every encoding, we also sort the environment. Granted, it is almost linear after the first sort because mostly sorted arrays just sort faster than truly random ones. Still, a lot of churn for potentially no user at all...

@kblees
Copy link

kblees commented Apr 9, 2015

We'd have to use _wgetenv() to check if its ASCII-only, then use _getenv() to get the char* version

Actually we can easily test whether the return value of _wgetenv() is ASCII only and then convert it trivially, i.e. no second linear search is required.

But if you convert the string, you also need to pool the result so that it remains stable after getenv returns. Besides, UTF-16 to UTF-8 conversion is almost trivial and very fast (just shifting a few bits, no table lookups). So checking if its pure ASCII, then doing ASCII-only conversion may be slower than doing UTF-8 conversion right away.

I'll dig up that unicode-v9-wrapped branch to get some hard numbers...

@dscho
Copy link
Member Author

dscho commented Apr 9, 2015

UTF-16 to UTF-8 conversion is almost trivial and very fast (just shifting a few bits, no table lookups). So checking if its pure ASCII, then doing ASCII-only conversion may be slower than doing UTF-8 conversion right away.

Good point. My only worry there was that allocation of the result might be an issue (because you have to go through the string twice, but then, that's not a big deal now, is it?).

@kblees
Copy link

kblees commented Apr 9, 2015

My only worry there was that allocation of the result might be an issue

I'd convert on the stack (alloca()), then let strintern() decide whether it needs the value on the heap...

@dscho
Copy link
Member Author

dscho commented Apr 9, 2015

I'd convert on the stack (alloca()), then let strintern() decide whether it needs the value on the heap...

I guess that's good enough. As I told you in person, I thought that maybe a hashmap with the pointer returned by _wgetenv() would make sense, but that may very well be over-engineered.

kblees added a commit that referenced this issue May 24, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue May 27, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue Jun 10, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue Jun 17, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue Jun 25, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue Jul 18, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
kblees added a commit that referenced this issue Jul 30, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
@dscho dscho modified the milestone: Post-release cleanup Aug 24, 2015
kblees added a commit that referenced this issue Aug 31, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Sep 10, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Sep 18, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Sep 29, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Oct 5, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Oct 19, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Nov 9, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Dec 11, 2015
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
dscho pushed a commit that referenced this issue Jan 5, 2016
In test #49, $(pwd) must match $(readlink), which is an MSys utility.

Signed-off-by: Karsten Blees <[email protected]>
@dscho
Copy link
Member Author

dscho commented Dec 25, 2016

We will need to consolidate the new MSVC-specific environment handling with the current MINGW one anyway.

@dscho
Copy link
Member Author

dscho commented Oct 8, 2017

I guess this should get a higher priority.

@dscho
Copy link
Member Author

dscho commented Jun 4, 2019

I guess I did address this already, via fe21c6b

@dscho dscho closed this as completed Jun 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants