Skip to content

Centralise code to map between UTF-8 and UTF-16 on Windows. #62605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 5, 2023

Conversation

al45tair
Copy link
Contributor

In various places we need to call the Windows API, and because Swift uses UTF-8 for its string representation, we can’t call the ANSI API functions (because the code page used for the ANSI functions varies depending on the system locale setting). Instead, we need to use the wide character APIs.

This means that we need to convert UTF-8 to wide character and vice-versa in various places in the runtime.

rdar://103397975

@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test

@al45tair
Copy link
Contributor Author

al45tair commented Dec 15, 2022

Since @grynspan and I both had essentially the same code in #62462 and #62577 respectively, it made sense to pull it out into a separate PR (before anyone else starts adding another copy to their PR :-D).

While @compnerd suggested using the LLVM ConvertUTF16ToUTF8 function, calling LLVM code from the runtime isn't possible unless we copy the code into the LLVMSupport folder, and I'm keen that we minimise the amount of that that happens. Another alternative that wasn't mentioned is using the C++ library to do the conversion. But both of those cause allocation using std::new which is undesirable in the runtime (for the reason that new can be overridden by the user, which can create problems where user code gets called in contexts where some things might not work).

@al45tair
Copy link
Contributor Author

(Worth noting that presently these functions generate a fatal error when they fail. I'm not sure if that's desirable everywhere; maybe they shouldn't.)

Copy link
Contributor

@mikeash mikeash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fatal error makes sense when malloc fails. It doesn't seem right when the conversion fails, though.

Copy link
Member

@compnerd compnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really centralise it? It seems that we should simultaneously update the symbolication codepath.

@al45tair
Copy link
Contributor Author

Does this really centralise it? It seems that we should simultaneously update the symbolication codepath.

@grynspan is doing that in a separate PR. The point of this is to provide the functions in one place, in one PR.

@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test

@al45tair
Copy link
Contributor Author

A fatal error makes sense when malloc fails. It doesn't seem right when the conversion fails, though.

Fixed it to return nullptr when it fails. malloc() failing in a recoverable way is in some sense more likely on Windows (it doesn't overcommit — at least, it isn't supposed to), so I think returning nullptr there too makes some sense.

@grynspan
Copy link
Contributor

A fatal error makes sense when malloc fails. It doesn't seem right when the conversion fails, though.

Fixed it to return nullptr when it fails. malloc() failing in a recoverable way is in some sense more likely on Windows (it doesn't overcommit — at least, it isn't supposed to), so I think returning nullptr there too makes some sense.

I think we should return nullptr and let the caller—oh, you just did that. Never mind!

@al45tair al45tair force-pushed the eng/PR-103397975 branch 2 times, most recently from 8ef7c84 to 18df4db Compare December 16, 2022 13:11
@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test

@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test Windows platform

In various places we need to call the Windows API, and because Swift uses UTF-8
for its string representation, we can’t call the ANSI API functions (because the
code page used for the ANSI functions varies depending on the system locale
setting). Instead, we need to use the wide character APIs.

This means that we need to convert UTF-8 to wide character and vice-versa in
various places in the runtime.

rdar://103397975
Instead of triggering a fatal error on failure, return `nullptr` and
let the caller decide what to do about it.  The CommandLine code should
trigger a fatal error, of course.

rdar://103397975
`SWIFT_RUNTIME_STDLIB_INTERNAL` does `extern "C"`, so we can't put these
in a namespace and have to use a C-style prefix instead.

Also slightly rearrange the code in `CommandLine.cpp`.

rdar://103397975
It makes sense to move this function into the new Win32 header.

rdar://103397975
@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test

They're useful functions to have; I don't think we want them to be API, but
having them as SPI is conceivably useful for other purposes, and avoids everyone
rolling their own copy.

rdar://103397975
@al45tair
Copy link
Contributor Author

@swift-ci Please smoke test

@grynspan
Copy link
Contributor

grynspan commented Dec 28, 2022

How would you feel about exposing String.init(_wcs:) and String._asWCS(_:) rethrows functions for slightly higher-level access to these interfaces? Obviously a permanent API would go through swift-evolution, but as an experimental/underbarred feature specific to Win32 interop, the utility of such interfaces cannot be understated.

Edit: If we wanted something that was platform-agnostic, we could instead expose cross-platform API for interacting with wide C strings the way a developer would interact with "narrow" C strings:

// MARK: - wchar_t

extension String {
  public init<S>(wideCharacters: S) where S: Sequence, S.Element == wchar_t
  public init(wideCString: UnsafePointer<wchar_t>)
  public init<C>(wideCString: C) where C: Collection, C.Element == wchar_t
}

// MARK: - CWideChar

extension String {
  public init<S>(wideCharacters: S) where S: Sequence, S.Element == CWideChar
  public init(wideCString: UnsafePointer<CWideChar>)
  public init<C>(wideCString: C) where C: Collection, C.Element == CWideChar
}

Edit 2: As I thought about it more, I realized that was just a long way of spelling String(decodingCString: $0, as: UTF16.self). Never mind!

@al45tair
Copy link
Contributor Author

al45tair commented Jan 3, 2023

How would you feel about exposing String.init(_wcs:) and String.asWCS(:) rethrows functions for slightly higher-level access to these interfaces?

That's more a question for the standard library folks, I think. I don't think that's necessarily the right thing anyway. IMO it'd be better to allow the Windows wide character APIs to be imported in a way such that we can pass Strings to them directly, as we can with the narrow string APIs. But that's probably a discussion for elsewhere.

@al45tair
Copy link
Contributor Author

al45tair commented Jan 3, 2023

@swift-ci Please smoke test

@al45tair al45tair merged commit bbe27f9 into swiftlang:main Jan 5, 2023
@grynspan
Copy link
Contributor

grynspan commented Jan 5, 2023

How would you feel about exposing String.init(_wcs:) and String.asWCS(:) rethrows functions for slightly higher-level access to these interfaces?

That's more a question for the standard library folks, I think. I don't think that's necessarily the right thing anyway. IMO it'd be better to allow the Windows wide character APIs to be imported in a way such that we can pass Strings to them directly, as we can with the narrow string APIs. But that's probably a discussion for elsewhere.

Since String(decodingCString: $0, as: UTF16.self) does the thing already, I retracted my previous statement. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants