Revisit the UTF-8 environment code #49
Comments
Just for reference, this would be the 'wrapped' variant from this discussion [1]. The repo.or.cz links still work, but are easier to review in my github repo (e.g. [2]). The branches instrumented with performance measurements ('...-perf') are only on repo.or.cz, though. IIRC, the main problems of the wrapped version were the lack of POSIX-compliant
I don't think so. We'd have to use _wgetenv() to check if it's ASCII-only, then use _getenv() to get the char* version. I.e. adding another linear search of the environment, which will most likely be slower than just UTF-8-converting the _wgetenv() result.
[1] https://groups.google.com/forum/#!msg/msysgit/54h0ROyPTPU/nJPhRIOz5xAJ
Actually we can easily test whether the return value of |
By the way, as we are talking performance already (in my mind, robustness must come first, and we are not there yet), I am starting to doubt that we query the environment often enough that this linear search is really hurting. In fact, I could imagine that we work way harder than necessary in most cases because we convert the environment wholesale to UTF-8, only to re-encode it to UTF-16 just before spawning a new process, which, if it is another Git process, re-re-encodes the environment to UTF-8. And with every encoding, we also sort the environment. Granted, it is almost linear after the first sort because mostly sorted arrays just sort faster than truly random ones. Still, a lot of churn for potentially no user at all...
But if you convert the string, you also need to pool the result so that it remains stable after getenv returns. Besides, UTF-16 to UTF-8 conversion is almost trivial and very fast (just shifting a few bits, no table lookups). So checking if it's pure ASCII, then doing ASCII-only conversion may be slower than doing UTF-8 conversion right away. I'll dig up that unicode-v9-wrapped branch to get some hard numbers...
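To make the "just shifting a few bits" point concrete, here is a minimal portable sketch of such a conversion. The function name utf16_to_utf8() and its buffer-size convention are assumptions made for illustration; this is not the code from the branch mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not Git's actual code): convert a NUL-terminated
 * UTF-16 string to UTF-8. Returns the number of bytes written, excluding
 * the terminating NUL, or (size_t)-1 if `out` may be too small. The size
 * check is conservative: it demands room for 4 bytes plus NUL before
 * every code point. Unpaired surrogates are encoded as U+FFFD.
 */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t outlen)
{
	size_t n = 0;

	assert(in && out);
	if (!outlen)
		return (size_t)-1;
	while (*in) {
		uint32_t c = *in++;

		if (c >= 0xD800 && c <= 0xDBFF && *in >= 0xDC00 && *in <= 0xDFFF)
			/* combine a surrogate pair into one code point */
			c = 0x10000 + ((c - 0xD800) << 10) + (*in++ - 0xDC00);
		else if (c >= 0xD800 && c <= 0xDFFF)
			c = 0xFFFD; /* lone surrogate */

		if (n + 5 > outlen)
			return (size_t)-1;
		if (c < 0x80) {
			out[n++] = (char)c;
		} else if (c < 0x800) {
			out[n++] = (char)(0xC0 | (c >> 6));
			out[n++] = (char)(0x80 | (c & 0x3F));
		} else if (c < 0x10000) {
			out[n++] = (char)(0xE0 | (c >> 12));
			out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
			out[n++] = (char)(0x80 | (c & 0x3F));
		} else {
			out[n++] = (char)(0xF0 | (c >> 18));
			out[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
			out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
			out[n++] = (char)(0x80 | (c & 0x3F));
		}
	}
	out[n] = '\0';
	return n;
}
```

Note that the ASCII case is simply the first branch of the encoder, which is why a separate ASCII pre-check may not buy much.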
Good point. My only worry there was that allocation of the result might be an issue (because you have to go through the string twice, but then, that's not a big deal now, is it?).
I'd convert on the stack (alloca()), then let strintern() decide whether it needs the value on the heap...
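A rough sketch of the pooling half of that idea: the caller converts into a scratch buffer on the stack, and the pool only allocates when it has not seen the value before. intern_str() below is a hypothetical stand-in for strintern(), not its actual implementation, and the fixed bucket count is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_BUCKETS 64 /* arbitrary size for this sketch */

struct pool_entry {
	struct pool_entry *next;
	char str[]; /* flexible array member holding the pooled copy */
};

static struct pool_entry *pool[POOL_BUCKETS];

static unsigned hash_str(const char *s)
{
	unsigned h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % POOL_BUCKETS;
}

/*
 * Return a stable pointer to a pooled copy of `s`. The caller's buffer
 * (for example a stack buffer) can be discarded afterwards. Returns NULL
 * only if allocation fails.
 */
const char *intern_str(const char *s)
{
	struct pool_entry *e;
	unsigned h;
	size_t len;

	assert(s);
	h = hash_str(s);
	for (e = pool[h]; e; e = e->next)
		if (!strcmp(e->str, s))
			return e->str; /* already pooled: no allocation */

	len = strlen(s);
	e = malloc(sizeof(*e) + len + 1);
	if (!e)
		return NULL;
	memcpy(e->str, s, len + 1);
	e->next = pool[h];
	pool[h] = e;
	return e->str;
}
```

With this shape, repeated lookups of the same variable return the same pointer, so the result stays valid after the getenv() wrapper returns. Entries are never freed here, which is the growth problem a long-running process would eventually have to deal with.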
I guess that's good enough. As I told you in person, I thought that maybe a hashmap with the pointer returned by |
In test #49, $(pwd) must match $(readlink), which is an MSys utility. Signed-off-by: Karsten Blees <[email protected]>
We will need to consolidate the new MSVC-specific environment handling with the current MINGW one anyway.
I guess this should get a higher priority.
I guess I did address this already, via fe21c6b
At the moment, Git for Windows' strategy is to convert the entire environment to UTF-8 wholesale at startup, and for performance reasons, keep the environment sorted so that lookups can be more efficient.
As has been found out recently the hard way, the underlying assumption that Git's own code is the exclusive user of the environment is not only fragile but incorrect: for example, cURL uses and modifies the environment as well.
So let's revisit the strategy to modify the environment. One viable option is to intercept both getenv() and putenv() in Git code to keep the real environment encoded in the current code page, but convert transparently from/to UTF-8 so that Git itself only sees Unicode values. This could be sped up considerably by testing whether the value in question is pure ASCII (which it will be in most cases) and skipping the conversion altogether.
If necessary, the converted values could be held in a hash map, but for long-running Git processes this would require a least-recently-used eviction scheme, incurring quite a bit of complexity. So let's do that only if it turns out that the performance without this cache is not good enough.
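The pure-ASCII shortcut for an intercepted getenv() could be sketched as follows. The names getenv_utf8() and convert_fn are hypothetical, the actual code-page conversion is left to a caller-supplied hook, and the sketch assumes the current code page is ASCII-compatible in the range 0 to 127.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 if every byte of `s` is below 0x80, 0 otherwise. */
int is_pure_ascii(const char *s)
{
	assert(s);
	for (; *s; s++)
		if ((unsigned char)*s >= 0x80)
			return 0;
	return 1;
}

/*
 * Hypothetical conversion hook: takes a value in the current code page
 * and returns a stable UTF-8 string (e.g. one held in a pool).
 */
typedef const char *(*convert_fn)(const char *codepage_value);

/*
 * Sketch of an intercepted getenv(): pure-ASCII values are identical in
 * the code page and in UTF-8 (under the assumption stated above), so
 * they are returned as-is; only other values go through `to_utf8`.
 */
const char *getenv_utf8(const char *name, convert_fn to_utf8)
{
	const char *value = getenv(name);

	if (!value || is_pure_ascii(value))
		return value;
	return to_utf8(value);
}
```

In the common case this costs one pass over the value and no allocation, so a cache would only matter for the non-ASCII values.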
The most important part of this ticket is to come up with a realistic benchmark. The best way in this developer's opinion would be to record all the calls to the environment conversion as well as to getenv() and putenv(), as performed by a complete test suite run, and condense those calls into a single benchmark program.
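One possible way to do the recording, as a sketch only: wrap the calls and append one line per call to a trace stream, which can later be replayed by a benchmark program. All names here (env_trace_to(), record_env_call(), traced_getenv()) are made up for illustration, and the line format is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static FILE *env_trace; /* hypothetical: where calls get recorded */

/* Select the stream that receives the trace; NULL disables tracing. */
void env_trace_to(FILE *f)
{
	env_trace = f;
}

/* Append one "<operation> <argument>" line to the trace, if enabled. */
void record_env_call(const char *op, const char *arg)
{
	assert(op && arg);
	if (env_trace)
		fprintf(env_trace, "%s %s\n", op, arg);
}

/* Drop-in wrapper: records the lookup, then defers to getenv(). */
char *traced_getenv(const char *name)
{
	record_env_call("getenv", name);
	return getenv(name);
}
```

A putenv() wrapper and the wholesale conversion would call record_env_call() in the same way. Values containing newlines would need escaping before such a trace could be replayed reliably.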