| 1 | +# Understanding file encoding in VSCode and PowerShell |
| 2 | + |
| 3 | +When using VS Code to create and edit PowerShell scripts, it is important that your files are saved |
| 4 | +using the correct character encoding format. |
| 5 | + |
| 6 | +## What is file encoding and why is it important? |
| 7 | + |
| 8 | +VSCode manages the interface between a human entering strings of characters into a buffer and |
| 9 | +reading/writing blocks of bytes to the filesystem. When VSCode saves the file, it uses a text |
| 10 | +encoding to do this. |
| 11 | + |
| 12 | +Similarly, when PowerShell runs a script it must convert the bytes in a file to characters to |
| 13 | +reconstruct the file into a PowerShell program. Since VSCode writes the file and PowerShell reads |
| 14 | +the file, they need to use the same encoding system. This process of parsing a PowerShell script |
| 15 | +goes: *bytes* -> *characters* -> *tokens* -> *abstract syntax tree* -> *execution*. |
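
The stages above can be walked through by hand. This is only an illustrative sketch: the sample script text and variable names are made up for the example, and it calls the public `System.Management.Automation.Language.Parser` API directly rather than showing what PowerShell does internally when it loads a file.

```powershell
# Bytes as they would sit on disk (ASCII-only here, so UTF-8 and Windows-1252 agree on them)
$bytes = [System.Text.Encoding]::UTF8.GetBytes('Write-Output "hello"')

# bytes -> characters: this is the step where the encoding matters
$text = [System.Text.Encoding]::UTF8.GetString($bytes)

# characters -> tokens -> abstract syntax tree
$tokens = $null
$parseErrors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput($text, [ref]$tokens, [ref]$parseErrors)

$ast.GetType().Name    # ScriptBlockAst
$parseErrors.Count     # 0

# abstract syntax tree -> execution
$scriptBlock = $ast.GetScriptBlock()
& $scriptBlock         # hello
```

If the second step used a different encoding than the one the file was saved with, every later stage would operate on the wrong characters.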
| 16 | + |
| 17 | +Both VSCode and PowerShell are installed with a sensible default encoding configuration. However, |
| 18 | +the default encoding used by PowerShell has changed with the release of PowerShell Core (v6.x). To |
| 19 | +ensure you have no problems using PowerShell or the PowerShell extension in VSCode, you need to |
| 20 | +configure your VSCode and PowerShell settings properly. |
| 21 | + |
| 22 | +## Common causes of encoding issues |
| 23 | + |
| 24 | +Encoding problems occur when the encoding of VSCode or your script file does not match the expected |
| 25 | +encoding of PowerShell. When a file has no byte-order mark, PowerShell cannot determine its encoding automatically. |
| 26 | + |
| 27 | +You're more likely to have encoding problems when you're using characters not in the [7-bit ASCII character set](https://ascii.cl/), |
| 28 | +such as accented Latin characters (for example, `É`, `ü`), or non-Latin characters like Cyrillic (`Д`, `Ц`) |
| 29 | +or Han Chinese (`脚`, `本`). |
| 30 | + |
| 31 | +Common reasons for encoding issues are: |
| 32 | + |
| 33 | +- The encodings of VSCode and PowerShell have not been changed from their defaults. For PowerShell |
| 34 | + 5.1 and below, the default encoding is different from VSCode's. |
| 35 | +- Another editor has opened and overwritten the file in a new encoding. This often happens with the |
| 36 | + ISE. |
| 37 | +- The file is checked into source control (like git) in a different encoding to what VSCode or |
| 38 | + PowerShell expects. This can happen when collaborators use editors that have different |
| 39 | + encoding configurations. |
| 40 | + |
| 41 | +### How to tell when you have encoding issues |
| 42 | + |
| 43 | +Encoding errors often present themselves as parse errors in scripts. If you find strange character |
| 44 | +sequences in your script, this can be the problem. In the example below, an en-dash (`–`) appears as |
| 45 | +the characters `â€“`: |
| 46 | + |
| 47 | +```Output |
| 48 | +Send-MailMessage : A positional parameter cannot be found that accepts argument 'Testing FuseMail SMTP...'. |
| 49 | +At C:\Users\<User>\<OneDrive>\Development\PowerShell\Scripts\Send-EmailUsingSmtpRelay.ps1:6 char:1 |
| 50 | ++ Send-MailMessage â€“From $from â€“To $recipient1 â€“Subject $subject ... |
| 51 | ++ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 52 | + + CategoryInfo : InvalidArgument: (:) [Send-MailMessage], ParameterBindingException |
| 53 | + + FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SendMailMessage |
| 54 | +``` |
| 55 | + |
| 56 | +This problem occurs because VSCode encodes the character `–` in UTF-8 as the bytes `0xE2 0x80 0x93`. |
| 57 | +When these bytes are decoded as Windows-1252, they are interpreted as the characters `â€“`. |
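
The mismatch can be reproduced in memory, without touching any files. The following is a minimal sketch with illustrative variable names. It assumes code page 1252 is available to .NET in your session, which is the case in Windows PowerShell and may depend on the code-pages provider in other hosts.

```powershell
# U+2013 (en-dash), built from its code point so this snippet is itself encoding-proof
$enDash = [string][char]0x2013

# What an editor writes to disk when saving the character as UTF-8
$bytes = [System.Text.Encoding]::UTF8.GetBytes($enDash)
$hex = ($bytes | ForEach-Object { '0x{0:X2}' -f $_ }) -join ' '

# What a reader gets if it decodes those same bytes as Windows-1252 (assumed available)
$misread = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)
$codePoints = ($misread.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '

$hex          # 0xE2 0x80 0x93
$codePoints   # U+00E2 U+20AC U+201C
```

One character on the way in becomes three characters on the way out, which is why the parser no longer recognizes the dash as a parameter prefix.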
| 58 | + |
| 59 | +Some strange character sequences that you might see include: |
| 60 | + |
| 61 | +- `â€“` instead of `–` |
| 62 | +- `â€”` instead of `—` |
| 63 | +- `Ã„` instead of `Ä` |
| 64 | +- `Â` instead of ` ` (a non-breaking space) |
| 65 | +- `Ã©` instead of `é` |
| 66 | + |
| 67 | +This handy [reference](https://www.i18nqa.com/debug/utf8-debug.html) lists the common patterns that |
| 68 | +indicate a UTF-8/Windows-1252 encoding problem. |
| 69 | + |
| 70 | +## How the PowerShell extension in VSCode interacts with encodings |
| 71 | + |
| 72 | +The PowerShell extension interacts with scripts in a number of ways: |
| 73 | + |
| 74 | +1. When scripts are edited in VSCode, the contents are sent by VSCode to the extension. The [Language Server Protocol][] |
| 75 | + mandates that this content is transferred in UTF-8. Therefore, it is not possible for the |
| 76 | + extension to get the wrong encoding. |
| 77 | +2. When scripts are executed directly in the Integrated Console, they are read off the filesystem by |
| 78 | + PowerShell directly. This means that if PowerShell's encoding differs from VSCode's, characters |
| 79 | + outside the ASCII range may be decoded incorrectly. |
| 80 | +3. When a script that is open in VSCode references another script that is not open in VSCode, the |
| 81 | + extension falls back to loading that script's content from the file system. The extension defaults to |
| 82 | + UTF-8 encoding, but uses [byte-order mark][], or BOM, detection to select the correct encoding. |
| 83 | + |
| 84 | +The problem occurs when the extension has to assume the encoding of BOM-less formats (like [UTF-8] with no BOM and [Windows-1252]). |
| 85 | +In these cases, the extension defaults to UTF-8 rather than applying more complex detection logic. The PowerShell |
| 86 | +extension cannot change VSCode's encoding settings. For more information, see [issue #824](https://github.com/Microsoft/vscode/issues/824). |
| 87 | + |
| 88 | +## Choosing the right encoding |
| 89 | + |
| 90 | +Choosing an encoding depends on the platforms and applications you use to read and write your |
| 91 | +PowerShell files. |
| 92 | + |
| 93 | +On Windows, many applications have long used [Windows-1252]. Many .NET applications use [UTF-16]. In |
| 94 | +the Windows world, UTF-16 is often called "Unicode", although that term now refers to a broader [standard](https://en.wikipedia.org/wiki/Unicode). |
| 95 | + |
| 96 | +In the Linux world, on the web, and in .NET Standard, UTF-8 is now the dominant encoding. |
| 97 | + |
| 98 | +Unicode encodings also have the concept of a byte-order mark (BOM). BOMs occur at the beginning of |
| 99 | +text to tell a decoder which encoding the text is using. In the case of multi-byte encodings, the |
| 100 | +BOM also indicates [endianness](https://en.wikipedia.org/wiki/Endianness) of the encoding. BOMs are |
| 101 | +designed to be bytes that rarely occur in non-Unicode text, allowing a reasonable guess that text is |
| 102 | +Unicode when a BOM is present. |
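
To make the idea concrete, the sketch below checks a byte array for the three BOM signatures most relevant to scripts. `Get-BomName` is a hypothetical helper written for this example, not a built-in cmdlet, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16 LE.

```powershell
function Get-BomName {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [byte[]] $Bytes
    )

    # Compare the leading bytes against the well-known BOM signatures
    if ($Bytes.Length -ge 3 -and $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) {
        return 'UTF-8'
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) {
        return 'UTF-16 LE'
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFE -and $Bytes[1] -eq 0xFF) {
        return 'UTF-16 BE'
    }
    return 'none'
}

# GetPreamble() returns the BOM bytes that a .NET encoding would write
Get-BomName -Bytes ([System.Text.UTF8Encoding]::new($true).GetPreamble())    # UTF-8
Get-BomName -Bytes ([System.Text.Encoding]::Unicode.GetPreamble())           # UTF-16 LE
Get-BomName -Bytes ([System.Text.Encoding]::ASCII.GetBytes('Write-Output'))  # none
```

To inspect a real file, you could pass the result of `[System.IO.File]::ReadAllBytes()` called with the file's full path.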
| 103 | + |
| 104 | +BOMs are optional and their adoption has not caught on in the Linux world, due to a dependable |
| 105 | +convention of UTF-8 being used everywhere. This means that most Linux applications presume that text |
| 106 | +input is encoded in UTF-8. While many Linux applications will recognize and correctly handle a BOM, |
| 107 | +a number do not, leading to artifacts in text manipulated with those applications. |
| 108 | + |
| 109 | +**Therefore**: |
| 110 | + |
| 111 | +- If you work primarily with Windows applications and Windows PowerShell, you should prefer an |
| 112 | + encoding like UTF-8 with BOM or UTF-16. |
| 113 | +- If you work across platforms, you should prefer UTF-8 with BOM. |
| 114 | +- If you work mainly in Linux-associated contexts, you should prefer UTF-8 without BOM. |
| 115 | +- Windows-1252 and latin-1 are essentially legacy encodings that you should avoid if possible. |
| 116 | + However, some older Windows applications may depend on them. |
| 117 | +- It's also worth noting that script signing is [encoding-dependent](https://github.com/PowerShell/PowerShell/issues/3466), |
| 118 | + meaning a change of encoding on a signed script will require re-signing. |
| 119 | + |
| 120 | +## Configuring VSCode |
| 121 | + |
| 122 | +VSCode's default encoding is UTF-8 without BOM. |
| 123 | + |
| 124 | +To set [VSCode's encoding](https://code.visualstudio.com/docs/editor/codebasics#_file-encoding-support), |
| 125 | +go to the VSCode settings (Ctrl+,) and set the `"files.encoding"` setting: |
| 126 | + |
| 127 | +```json |
| 128 | +"files.encoding": "utf8bom" |
| 129 | +``` |
| 130 | + |
| 131 | +Some possible values are: |
| 132 | + |
| 133 | +- `utf8`: [UTF-8] without BOM |
| 134 | +- `utf8bom`: [UTF-8] with BOM |
| 135 | +- `utf16le`: Little endian [UTF-16] |
| 136 | +- `utf16be`: Big endian [UTF-16] |
| 137 | +- `windows1252`: [Windows-1252] |
| 138 | + |
| 139 | +You should get a dropdown for this in the GUI view, or completions for it in the JSON view. |
| 140 | + |
| 141 | +You can also add the following to auto-detect encoding when possible: |
| 142 | + |
| 143 | +```json |
| 144 | +"files.autoGuessEncoding": true |
| 145 | +``` |
| 146 | + |
| 147 | +If you don't want these settings to affect all file types, VSCode also allows per-language |
| 148 | +configurations. Create a language-specific setting by putting settings in a `[<language-name>]` |
| 149 | +field. For example: |
| 150 | + |
| 151 | +```json |
| 152 | +"[powershell]": { |
| 153 | +    "files.encoding": "utf8bom", |
| 154 | +    "files.autoGuessEncoding": true |
| 155 | +} |
| 156 | +``` |
| 157 | + |
| 158 | +## Configuring PowerShell |
| 159 | + |
| 160 | +PowerShell's default encoding varies depending on version: |
| 161 | + |
| 162 | +- In PowerShell 6+, the default encoding is [UTF-8] without BOM on all platforms. |
| 163 | +- In Windows PowerShell, the default encoding is usually [Windows-1252], an extension of [latin-1], |
| 164 | + also known as ISO 8859-1. |
| 165 | + |
| 166 | +In PowerShell 5+ you can find your default encoding with this: |
| 167 | + |
| 168 | +```powershell |
| 169 | +[psobject].Assembly.GetTypes() | Where-Object { $_.Name -eq 'ClrFacade'} | |
| 170 | +    ForEach-Object { |
| 171 | +        $_.GetMethod('GetDefaultEncoding', [System.Reflection.BindingFlags]'nonpublic,static').Invoke($null, @()) |
| 172 | +    } |
| 173 | +``` |
| 174 | + |
| 175 | +The following [script](https://gist.github.com/rjmholt/3d8dd4849f718c914132ce3c5b278e0e) can be |
| 176 | +used to determine what encoding your PowerShell session infers for a script without a BOM. |
| 177 | + |
| 178 | +```powershell |
| 179 | +$badBytes = [byte[]]@(0xC3, 0x80) |
| 180 | +$utf8Str = [System.Text.Encoding]::UTF8.GetString($badBytes) |
| 181 | +$bytes = [System.Text.Encoding]::ASCII.GetBytes('Write-Output "') + [byte[]]@(0xC3, 0x80) + [byte[]]@(0x22) |
| 182 | +$path = Join-Path ([System.IO.Path]::GetTempPath()) 'encodingtest.ps1' |
| 183 | + |
| 184 | +try |
| 185 | +{ |
| 186 | +    [System.IO.File]::WriteAllBytes($path, $bytes) |
| 187 | + |
| 188 | +    switch (& $path) |
| 189 | +    { |
| 190 | +        $utf8Str |
| 191 | +        { |
| 192 | +            return 'UTF-8' |
| 193 | +            break |
| 194 | +        } |
| 195 | + |
| 196 | +        default |
| 197 | +        { |
| 198 | +            return 'Windows-1252' |
| 199 | +            break |
| 200 | +        } |
| 201 | +    } |
| 202 | +} |
| 203 | +finally |
| 204 | +{ |
| 205 | +    Remove-Item $path |
| 206 | +} |
| 207 | +``` |
| 208 | + |
| 209 | +If you want to configure PowerShell to use a given encoding more generally, you can do so for |
| 210 | +some aspects with profile settings. See: |
| 211 | + |
| 212 | +- [@mklement0]'s [answer about PowerShell encoding on StackOverflow](https://stackoverflow.com/a/40098904). |
| 213 | +- [@rkeithhill]'s [blog post about dealing with BOM-less UTF-8 input in PowerShell](https://rkeithhill.wordpress.com/2010/05/26/handling-native-exe-output-encoding-in-utf8-with-no-bom/). |
| 214 | + |
| 215 | +It's not possible to force PowerShell to use a specific input encoding. PowerShell 5.1 and below |
| 216 | +default to Windows-1252 encoding when there is no BOM. For interoperability reasons, it's best to |
| 217 | +save scripts in a Unicode format with a BOM. |
| 218 | + |
| 219 | +> [!IMPORTANT] |
| 220 | +> Any other tools you have that touch PowerShell scripts may be affected by your |
| 221 | +> encoding choices or re-encode your scripts to another encoding. |
| 222 | + |
| 223 | +### Scripts |
| 224 | + |
| 225 | +Scripts already on the file system may need to be re-encoded to your new chosen encoding. In the |
| 226 | +bottom bar of VSCode, you'll see the label UTF-8. Click it to open the action bar and select |
| 227 | +**Save with encoding**. You can now pick a new encoding for that file. |
| 228 | + |
| 229 | +If you need to re-encode multiple files, you can use the following script: |
| 230 | + |
| 231 | +```powershell |
| 232 | +Get-ChildItem *.ps1 -Recurse | ForEach-Object { |
| 233 | +    $content = Get-Content -Path $_.FullName |
| 234 | +    Set-Content -Path $_.FullName -Value $content -Encoding UTF8 -PassThru -Force |
| 235 | +} |
| 236 | +``` |
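
Note that `Set-Content -Encoding UTF8` does not behave the same everywhere: Windows PowerShell writes a BOM, whereas PowerShell 6+ writes none unless you ask for `utf8BOM`. If you want the same result in both, one option is to go through .NET directly. The following is a sketch only: `ConvertTo-Utf8Bom` and its `SourceEncoding` parameter are hypothetical names introduced for this example, and the function assumes you know how each file is currently encoded.

```powershell
function ConvertTo-Utf8Bom {
    param(
        [Parameter(Mandatory)]
        [string] $Path,

        # Assumption: you know how the file is currently encoded
        [System.Text.Encoding] $SourceEncoding = [System.Text.Encoding]::UTF8
    )

    # .NET resolves relative paths against the process directory, not the
    # current PowerShell location, so resolve to a full path first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Decode with the stated source encoding (an existing BOM takes precedence)
    $text = [System.IO.File]::ReadAllText($fullPath, $SourceEncoding)

    # UTF8Encoding($true) emits the EF BB BF byte-order mark when writing
    [System.IO.File]::WriteAllText($fullPath, $text, [System.Text.UTF8Encoding]::new($true))
}
```

It could then be applied to a tree of scripts with `Get-ChildItem *.ps1 -Recurse | ForEach-Object { ConvertTo-Utf8Bom -Path $_.FullName }`. Try it on a copy first, since a wrong `SourceEncoding` will bake the misread characters into the file.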
| 237 | + |
| 238 | +### The PowerShell Integrated Scripting Environment (ISE) |
| 239 | + |
| 240 | +If you also edit scripts using the PowerShell ISE, you will need to synchronize your encoding |
| 241 | +settings there. |
| 242 | + |
| 243 | +The ISE should honor a BOM, but it is also possible to use reflection to |
| 244 | +[set the encoding](https://bensonxion.wordpress.com/2012/04/25/powershell-ise-default-saveas-encoding/). |
| 245 | +Note that this would not be persisted between startups. |
| 246 | + |
| 247 | +### Source control software |
| 248 | + |
| 249 | +Some source control tools, such as git, ignore encodings; git just tracks the bytes. |
| 250 | +Others, like TFS or Mercurial, may not. Even some git-based tools rely on decoding text. |
| 251 | + |
| 252 | +When this is the case, make sure you: |
| 253 | + |
| 254 | +- Configure the text encoding in your source control to match your VSCode configuration. |
| 255 | +- Ensure all your files are checked into source control in the relevant encoding. |
| 256 | +- Be wary of changes to the encoding received through source control. A key sign of this is a diff |
| 257 | + indicating changes but where nothing seems to have changed (because the bytes have changed but the |
| 258 | + characters have not). |
| 259 | + |
| 260 | +### Collaborators' environments |
| 261 | + |
| 262 | +On top of configuring source control, ensure that your collaborators on any files you share don't |
| 263 | +have settings that override your encoding by re-encoding PowerShell files. |
| 264 | + |
| 265 | +### Other programs |
| 266 | + |
| 267 | +Any other program that reads or writes a PowerShell script may re-encode it. |
| 268 | + |
| 269 | +Some examples are: |
| 270 | + |
| 271 | +- Using the clipboard to copy and paste a script. This is common in scenarios like: |
| 272 | +  - Copying a script into a VM |
| 273 | +  - Copying a script out of an email or webpage |
| 274 | +  - Copying a script into or out of a Microsoft Word or PowerPoint document |
| 275 | +- Other text editors, such as: |
| 276 | +  - Notepad |
| 277 | +  - vim |
| 278 | +  - Any other PowerShell script editor |
| 279 | +- Text editing utilities, like: |
| 280 | +  - `Get-Content`/`Set-Content`/`Out-File` |
| 281 | +  - PowerShell redirection operators like `>` and `>>` |
| 282 | +  - `sed`/`awk` |
| 283 | +- File transfer programs, like: |
| 284 | +  - A web browser, when downloading scripts |
| 285 | +  - A file share |
| 286 | + |
| 287 | +Some of these deal in bytes rather than text, but others offer encoding configurations. In those |
| 288 | +cases where you need to configure an encoding, you need to make it the same as your editor encoding |
| 289 | +to prevent problems. |
| 290 | + |
| 291 | +## Other resources on encoding in PowerShell |
| 292 | + |
| 293 | +There are a few other nice posts on encoding and configuring encoding in PowerShell that are worth a |
| 294 | +read: |
| 295 | + |
| 296 | +- [@mklement0]'s [summary of PowerShell encoding on StackOverflow](https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8) |
| 297 | +- Previous issues opened on vscode-PowerShell for encoding problems: |
| 298 | +  - [#1308](https://github.com/PowerShell/vscode-powershell/issues/1308) |
| 299 | +  - [#1628](https://github.com/PowerShell/vscode-powershell/issues/1628) |
| 300 | +  - [#1680](https://github.com/PowerShell/vscode-powershell/issues/1680) |
| 301 | +  - [#1744](https://github.com/PowerShell/vscode-powershell/issues/1744) |
| 302 | +  - [#1751](https://github.com/PowerShell/vscode-powershell/issues/1751) |
| 303 | +- [The classic *Joel on Software* writeup about Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) |
| 304 | +- [Encoding in .NET Standard](https://github.com/dotnet/standard/issues/260#issuecomment-289549508) |
| 305 | + |
| 306 | + |
| 307 | +[@mklement0]: https://github.com/mklement0 |
| 308 | +[@rkeithhill]: https://github.com/rkeithhill |
| 309 | +[Windows-1252]: https://wikipedia.org/wiki/Windows-1252 |
| 310 | +[latin-1]: https://wikipedia.org/wiki/ISO/IEC_8859-1 |
| 311 | +[UTF-8]: https://wikipedia.org/wiki/UTF-8 |
| 312 | +[byte-order mark]: https://wikipedia.org/wiki/Byte_order_mark |
| 313 | +[UTF-16]: https://wikipedia.org/wiki/UTF-16 |
| 314 | +[Language Server Protocol]: https://microsoft.github.io/language-server-protocol/ |