| 1 | +--- |
| 2 | +title: Understanding file encoding in VSCode and PowerShell |
| 3 | +description: Configure file encoding in VSCode and PowerShell |
| 4 | +ms.date: 02/28/2019 |
| 5 | +--- |
| 6 | +# Understanding file encoding in VSCode and PowerShell |
| 7 | + |
| 8 | +When using VS Code to create and edit PowerShell scripts, it is important that your files are saved |
| 9 | +using the correct character encoding format. |
| 10 | + |
| 11 | +## What is file encoding and why is it important? |
| 12 | + |
| 13 | +VSCode manages the interface between a human entering strings of characters into a buffer and |
| 14 | +reading/writing blocks of bytes to the filesystem. When VSCode saves a file, it uses a text |
| 15 | +encoding to do this. |
| 16 | + |
| 17 | +Similarly, when PowerShell runs a script it must convert the bytes in a file to characters to |
| 18 | +reconstruct the file into a PowerShell program. Since VSCode writes the file and PowerShell reads |
| 19 | +the file, they need to use the same encoding system. This process of parsing a PowerShell script |
| 20 | +goes: *bytes* -> *characters* -> *tokens* -> *abstract syntax tree* -> *execution*. |
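As a rough illustration of the first steps in that pipeline, the sketch below decodes the same bytes with two different encodings and hands the resulting characters to the PowerShell parser. The helper name `Get-StringTokenValue` is made up for this example, and the sketch assumes the Windows-1252 code page is available to `[System.Text.Encoding]::GetEncoding()` in your session.

```powershell
# Bytes of the script: Write-Output "É", as an editor using UTF-8 without a BOM would write it.
# [char]0xC9 is used so that this example does not itself depend on the file encoding.
$scriptText = 'Write-Output "' + [char]0xC9 + '"'
$bytes = [System.Text.Encoding]::UTF8.GetBytes($scriptText)

# Hypothetical helper: decode the bytes, parse the text, and return the string literal's value.
function Get-StringTokenValue {
    param(
        [byte[]]$Bytes,
        [System.Text.Encoding]$Encoding
    )

    # bytes -> characters
    $characters = $Encoding.GetString($Bytes)

    # characters -> tokens -> abstract syntax tree (nothing is executed here)
    $tokens = $null
    $errors = $null
    $null = [System.Management.Automation.Language.Parser]::ParseInput($characters, [ref]$tokens, [ref]$errors)

    $tokens | Where-Object { $_.Kind -eq 'StringExpandable' } | ForEach-Object { $_.Value }
}

# Same bytes, two decodings (assumption: code page 1252 is available in this session).
$asUtf8 = Get-StringTokenValue -Bytes $bytes -Encoding ([System.Text.Encoding]::UTF8)
$asWindows1252 = Get-StringTokenValue -Bytes $bytes -Encoding ([System.Text.Encoding]::GetEncoding(1252))
```

With matching encodings, `$asUtf8` holds the single character `É`. With the mismatched decoding, `$asWindows1252` holds two characters instead, so the parser sees a different program even though the file has not changed.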
| 21 | + |
| 22 | +Both VSCode and PowerShell are installed with a sensible default encoding configuration. However, |
| 23 | +the default encoding used by PowerShell has changed with the release of PowerShell Core (v6.x). To |
| 24 | +ensure you have no problems using PowerShell or the PowerShell extension in VSCode, you need to |
| 25 | +configure your VSCode and PowerShell settings properly. |
| 26 | + |
| 27 | +## Common causes of encoding issues |
| 28 | + |
| 29 | +Encoding problems occur when the encoding of VSCode or your script file does not match the expected |
| 30 | +encoding of PowerShell. PowerShell has no way to automatically determine the encoding of a file that lacks a byte-order mark (BOM). |
| 31 | + |
| 32 | +You're more likely to have encoding problems when you're using characters not in the [7-bit ASCII character set](https://ascii.cl/). For example: |
| 33 | + |
| 34 | +- Accented Latin characters (`É`, `ü`) |
| 35 | +- Non-Latin characters like Cyrillic (`Д`, `Ц`) |
| 36 | +- Han Chinese (`脚`, `本`) |
| 37 | + |
| 38 | +Common reasons for encoding issues are: |
| 39 | + |
| 40 | +- The encodings of VSCode and PowerShell have not been changed from their defaults. For PowerShell |
| 41 | + 5.1 and below, the default encoding is different from VSCode's. |
| 42 | +- Another editor has opened and overwritten the file in a new encoding. This often happens with the |
| 43 | + ISE. |
| 44 | +- The file is checked into source control in an encoding that is different from what VSCode or |
| 45 | + PowerShell expects. This can happen when collaborators use editors with different encoding |
| 46 | + configurations. |
| 47 | + |
| 48 | +### How to tell when you have encoding issues |
| 49 | + |
| 50 | +Often encoding errors present themselves as parse errors in scripts. If you find strange character |
| 51 | +sequences in your script, this can be the problem. In the example below, an en-dash (`–`) appears as |
| 52 | +the characters `â€“`: |
| 53 | + |
| 54 | +```Output |
| 55 | +Send-MailMessage : A positional parameter cannot be found that accepts argument 'Testing FuseMail SMTP...'. |
| 56 | +At C:\Users\<User>\<OneDrive>\Development\PowerShell\Scripts\Send-EmailUsingSmtpRelay.ps1:6 char:1 |
| 57 | ++ Send-MailMessage â€“From $from â€“To $recipient1 â€“Subject $subject ... |
| 58 | ++ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 59 | + + CategoryInfo : InvalidArgument: (:) [Send-MailMessage], ParameterBindingException |
| 60 | + + FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SendMailMessage |
| 61 | +``` |
| 62 | + |
| 63 | +This problem occurs because VSCode encodes the character `–` in UTF-8 as the bytes `0xE2 0x80 0x93`. |
| 64 | +When these bytes are decoded as Windows-1252, they are interpreted as the characters `â€“`. |
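To see this effect for yourself, the following minimal sketch encodes an en-dash as UTF-8 and then decodes the same bytes as Windows-1252. The variable names are illustrative only, and the sketch assumes that code page 1252 is available through `[System.Text.Encoding]::GetEncoding()` in your session.

```powershell
# Build the en-dash from its code point (U+2013) so this example is not
# affected by the encoding of the file it is saved in.
$enDash = [string][char]0x2013

# Characters -> bytes, the way an editor saving as UTF-8 would do it.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($enDash)
$hex = ($utf8Bytes | ForEach-Object { '0x{0:X2}' -f $_ }) -join ' '

# Bytes -> characters, using the wrong encoding (assumption: code page 1252 is available).
$windows1252 = [System.Text.Encoding]::GetEncoding(1252)
$mojibake = $windows1252.GetString($utf8Bytes)

# List the code points of the three characters that replace the single en-dash.
$codePoints = ($mojibake.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '
```

Here `$hex` is `0xE2 0x80 0x93` and `$codePoints` is `U+00E2 U+20AC U+201C`: one character has become three.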
| 65 | + |
| 66 | +Some strange character sequences that you might see include: |
| 67 | + |
| 68 | +- `â€“` instead of `–` |
| 69 | +- `â€”` instead of `—` |
| 70 | +- `Ã„` instead of `Ä` |
| 71 | +- `Â` instead of ` ` (a non-breaking space) |
| 72 | +- `Ã©` instead of `é` |
| 73 | + |
| 74 | +This handy [reference](https://www.i18nqa.com/debug/utf8-debug.html) lists the common patterns that |
| 75 | +indicate a UTF-8/Windows-1252 encoding problem. |
| 76 | + |
| 77 | +## How the PowerShell extension in VSCode interacts with encodings |
| 78 | + |
| 79 | +The PowerShell extension interacts with scripts in a number of ways: |
| 80 | + |
| 81 | +1. When scripts are edited in VSCode, the contents are sent by VSCode to the extension. The [Language Server Protocol][] |
| 82 | + mandates that this content is transferred in UTF-8. Therefore, it is not possible for the |
| 83 | + extension to get the wrong encoding. |
| 84 | +2. When scripts are executed directly in the Integrated Console, they're read from the file by |
| 85 | +   PowerShell directly. If PowerShell's encoding differs from VSCode's, something can go wrong here. |
| 86 | +3. When a script that is open in VSCode references another script that is not open in VSCode, the |
| 87 | + extension falls back to loading that script's content from the file system. The PowerShell |
| 88 | + extension defaults to UTF-8 encoding, but uses [byte-order mark][], or BOM, detection to select |
| 89 | + the correct encoding. |
| 90 | + |
| 91 | +The problem occurs when assuming the encoding of BOM-less formats (like [UTF-8][] with no BOM and [Windows-1252][]). |
| 92 | +The PowerShell extension defaults to UTF-8. The extension cannot change VSCode's encoding settings. |
| 93 | +For more information, see [issue #824](https://github.com/Microsoft/vscode/issues/824). |
| 94 | + |
| 95 | +## Choosing the right encoding |
| 96 | + |
| 97 | +Different systems and applications can use different encodings: |
| 98 | + |
| 99 | +- In .NET Standard, on the web, and in the Linux world, UTF-8 is now the dominant encoding. |
| 100 | +- Many .NET Framework applications use [UTF-16][]. For historical reasons, this is sometimes called |
| 101 | + "Unicode", a term that now refers to a broad [standard](https://en.wikipedia.org/wiki/Unicode) |
| 102 | + that includes both UTF-8 and UTF-16. |
| 103 | +- On Windows, many native applications that predate Unicode continue to use Windows-1252 by default. |
| 104 | + |
| 105 | +Unicode encodings also have the concept of a byte-order mark (BOM). BOMs occur at the beginning of |
| 106 | +text to tell a decoder which encoding the text is using. For multi-byte encodings, the BOM also |
| 107 | +indicates [endianness](https://en.wikipedia.org/wiki/Endianness) of the encoding. BOMs are designed |
| 108 | +to be bytes that rarely occur in non-Unicode text, allowing a reasonable guess that text is Unicode |
| 109 | +when a BOM is present. |
| 110 | + |
| 111 | +BOMs are optional and their adoption isn't as popular in the Linux world because a dependable |
| 112 | +convention of UTF-8 is used everywhere. Most Linux applications presume that text input is |
| 113 | +encoded in UTF-8. While many Linux applications will recognize and correctly handle a BOM, a number |
| 114 | +do not, leading to artifacts in text manipulated with those applications. |
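If you are unsure whether a file starts with a BOM, you can inspect its first few bytes. The following is a minimal sketch rather than a complete detector; the function name `Get-BomName` is hypothetical, and it only recognizes the UTF-8, UTF-16, and UTF-32 signatures.

```powershell
# Hypothetical helper: report which BOM, if any, a file starts with.
function Get-BomName {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    # Resolve to a full path because .NET methods don't follow PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read at most the first four bytes of the file.
    $buffer = New-Object byte[] 4
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $count = $stream.Read($buffer, 0, $buffer.Length)
    }
    finally {
        $stream.Dispose()
    }

    # Check the longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if ($count -ge 4 -and $buffer[0] -eq 0xFF -and $buffer[1] -eq 0xFE -and $buffer[2] -eq 0 -and $buffer[3] -eq 0) { return 'UTF-32 LE' }
    if ($count -ge 4 -and $buffer[0] -eq 0 -and $buffer[1] -eq 0 -and $buffer[2] -eq 0xFE -and $buffer[3] -eq 0xFF) { return 'UTF-32 BE' }
    if ($count -ge 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF) { return 'UTF-8' }
    if ($count -ge 2 -and $buffer[0] -eq 0xFF -and $buffer[1] -eq 0xFE) { return 'UTF-16 LE' }
    if ($count -ge 2 -and $buffer[0] -eq 0xFE -and $buffer[1] -eq 0xFF) { return 'UTF-16 BE' }
    return 'No BOM'
}
```

For example, `Get-BomName -Path .\MyScript.ps1` returns `No BOM` for a file saved by VSCode with its default settings. A result of `No BOM` does not tell you which encoding the file uses, only that a reader has to guess.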
| 115 | + |
| 116 | +**Therefore**: |
| 117 | + |
| 118 | +- If you work primarily with Windows applications and Windows PowerShell, you should prefer an |
| 119 | + encoding like UTF-8 with BOM or UTF-16. |
| 120 | +- If you work across platforms, you should prefer UTF-8 with BOM. |
| 121 | +- If you work mainly in Linux-associated contexts, you should prefer UTF-8 without BOM. |
| 122 | +- Windows-1252 and latin-1 are essentially legacy encodings that you should avoid if possible. |
| 123 | + However, some older Windows applications may depend on them. |
| 124 | +- It's also worth noting that script signing is [encoding-dependent](https://github.com/PowerShell/PowerShell/issues/3466), |
| 125 | +  meaning a change of encoding on a signed script will require re-signing. |
| 126 | + |
| 127 | +## Configuring VSCode |
| 128 | + |
| 129 | +VSCode's default encoding is UTF-8 without BOM. |
| 130 | + |
| 131 | +To set [VSCode's encoding][], go to the VSCode settings (<kbd>Ctrl</kbd>+<kbd>,</kbd>) and set the |
| 132 | +`"files.encoding"` setting: |
| 133 | + |
| 134 | +```json |
| 135 | +"files.encoding": "utf8bom" |
| 136 | +``` |
| 137 | + |
| 138 | +Some possible values are: |
| 139 | + |
| 140 | +- `utf8`: [UTF-8] without BOM |
| 141 | +- `utf8bom`: [UTF-8] with BOM |
| 142 | +- `utf16le`: Little endian [UTF-16] |
| 143 | +- `utf16be`: Big endian [UTF-16] |
| 144 | +- `windows1252`: [Windows-1252] |
| 145 | + |
| 146 | +You should get a dropdown for this in the GUI view, or completions for it in the JSON view. |
| 147 | + |
| 148 | +You can also add the following to autodetect encoding when possible: |
| 149 | + |
| 150 | +```json |
| 151 | +"files.autoGuessEncoding": true |
| 152 | +``` |
| 153 | + |
| 154 | +If you don't want these settings to affect all file types, VSCode also allows per-language |
| 155 | +configurations. Create a language-specific setting by putting settings in a `[<language-name>]` |
| 156 | +field. For example: |
| 157 | + |
| 158 | +```json |
| 159 | +"[powershell]": { |
| 160 | + "files.encoding": "utf8bom", |
| 161 | + "files.autoGuessEncoding": true |
| 162 | +} |
| 163 | +``` |
| 164 | + |
| 165 | +## Configuring PowerShell |
| 166 | + |
| 167 | +PowerShell's default encoding varies depending on version: |
| 168 | + |
| 169 | +- In PowerShell 6+, the default encoding is UTF-8 without BOM on all platforms. |
| 170 | +- In Windows PowerShell, the default encoding is usually Windows-1252, an extension of [latin-1][], |
| 171 | + also known as ISO 8859-1. |
| 172 | + |
| 173 | +In PowerShell 5+ you can find your default encoding with this: |
| 174 | + |
| 175 | +```powershell |
| 176 | +[psobject].Assembly.GetTypes() | Where-Object { $_.Name -eq 'ClrFacade'} | |
| 177 | + ForEach-Object { |
| 178 | + $_.GetMethod('GetDefaultEncoding', [System.Reflection.BindingFlags]'nonpublic,static').Invoke($null, @()) |
| 179 | + } |
| 180 | +``` |
| 181 | + |
| 182 | +The following [script](https://gist.github.com/rjmholt/3d8dd4849f718c914132ce3c5b278e0e) can be |
| 183 | +used to determine what encoding your PowerShell session infers for a script without a BOM. |
| 184 | + |
| 185 | +```powershell |
| 186 | +$badBytes = [byte[]]@(0xC3, 0x80) |
| 187 | +$utf8Str = [System.Text.Encoding]::UTF8.GetString($badBytes) |
| 188 | +$bytes = [System.Text.Encoding]::ASCII.GetBytes('Write-Output "') + [byte[]]@(0xC3, 0x80) + [byte[]]@(0x22) |
| 189 | +$path = Join-Path ([System.IO.Path]::GetTempPath()) 'encodingtest.ps1' |
| 190 | +
|
| 191 | +try |
| 192 | +{ |
| 193 | + [System.IO.File]::WriteAllBytes($path, $bytes) |
| 194 | +
|
| 195 | + switch (& $path) |
| 196 | + { |
| 197 | + $utf8Str |
| 198 | + { |
| 199 | + return 'UTF-8' |
| 200 | + break |
| 201 | + } |
| 202 | +
|
| 203 | + default |
| 204 | + { |
| 205 | + return 'Windows-1252' |
| 206 | + break |
| 207 | + } |
| 208 | + } |
| 209 | +} |
| 210 | +finally |
| 211 | +{ |
| 212 | + Remove-Item $path |
| 213 | +} |
| 214 | +``` |
| 215 | + |
| 216 | +It's possible to configure PowerShell to use a given encoding more generally using profile |
| 217 | +settings. See the following articles: |
| 218 | + |
| 219 | +- [@mklement0]'s [answer about PowerShell encoding on StackOverflow](https://stackoverflow.com/a/40098904). |
| 220 | +- [@rkeithhill]'s [blog post about dealing with BOM-less UTF-8 input in PowerShell](https://rkeithhill.wordpress.com/2010/05/26/handling-native-exe-output-encoding-in-utf8-with-no-bom/). |
| 221 | + |
| 222 | +It's not possible to force PowerShell to use a specific input encoding. PowerShell 5.1 and below |
| 223 | +default to Windows-1252 encoding when there's no BOM. For interoperability reasons, it's best to |
| 224 | +save scripts in a Unicode format with a BOM. |
| 225 | + |
| 226 | +> [!IMPORTANT] |
| 227 | +> Any other tools you have that touch PowerShell scripts may be affected by your |
| 228 | +> encoding choices or re-encode your scripts to another encoding. |
| 229 | +
|
| 230 | +### Existing scripts |
| 231 | + |
| 232 | +Scripts already on the file system may need to be re-encoded to your new chosen encoding. In the |
| 233 | +bottom bar of VSCode, you'll see the label UTF-8. Click it to open the action bar and select **Save |
| 234 | +with encoding**. You can now pick a new encoding for that file. See [VSCode's encoding][] for full |
| 235 | +instructions. |
| 236 | + |
| 237 | +If you need to re-encode multiple files, you can use the following script: |
| 238 | + |
| 239 | +```powershell |
| 240 | +Get-ChildItem *.ps1 -Recurse | ForEach-Object { |
| 241 | +    $content = Get-Content -Path $_.FullName |
| 242 | +    Set-Content -Path $_.FullName -Value $content -Encoding UTF8 -PassThru -Force |
| 243 | +} |
| 244 | +``` |
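Keep in mind that `-Encoding UTF8` writes a BOM in Windows PowerShell but not in PowerShell 6+, where `utf8BOM` is the value that adds one. If you want the same result on both, one option is to call .NET directly. The following is a sketch under stated assumptions: the function name `ConvertTo-Utf8Bom` is hypothetical, and files without a BOM are assumed to be Windows-1252 unless you pass a different `-SourceEncoding`.

```powershell
# Hypothetical helper: rewrite a file in place as UTF-8 with a BOM.
function ConvertTo-Utf8Bom {
    param(
        [Parameter(Mandatory)]
        [string]$Path,

        # Assumption: BOM-less input is Windows-1252. Pass another encoding if that's wrong.
        [System.Text.Encoding]$SourceEncoding = [System.Text.Encoding]::GetEncoding(1252)
    )

    # Resolve to a full path because .NET methods don't follow PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # ReadAllText honors a BOM if one is present and otherwise falls back to $SourceEncoding.
    $text = [System.IO.File]::ReadAllText($fullPath, $SourceEncoding)

    # The constructor argument $true asks the encoder to emit the UTF-8 BOM.
    $utf8WithBom = New-Object System.Text.UTF8Encoding $true
    [System.IO.File]::WriteAllText($fullPath, $text, $utf8WithBom)
}
```

You could then re-encode a folder with `Get-ChildItem *.ps1 -Recurse | ForEach-Object { ConvertTo-Utf8Bom -Path $_.FullName }`. Try it on a copy first: if the assumed source encoding is wrong, the garbled characters are written back to disk.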
| 245 | + |
| 246 | +### The PowerShell Integrated Scripting Environment (ISE) |
| 247 | + |
| 248 | +If you also edit scripts using the PowerShell ISE, you need to synchronize your encoding |
| 249 | +settings there. |
| 250 | + |
| 251 | +The ISE should honor a BOM, but it's also possible to use reflection to |
| 252 | +[set the encoding](https://bensonxion.wordpress.com/2012/04/25/powershell-ise-default-saveas-encoding/). |
| 253 | +Note that this wouldn't be persisted between startups. |
| 254 | + |
| 255 | +### Source control software |
| 256 | + |
| 257 | +Some source control tools, such as git, ignore encodings; git just tracks the bytes. |
| 258 | +Others, like TFS or Mercurial, may not. Even some git-based tools rely on decoding text. |
| 259 | + |
| 260 | +When this is the case, make sure you: |
| 261 | + |
| 262 | +- Configure the text encoding in your source control to match your VSCode configuration. |
| 263 | +- Ensure all your files are checked into source control in the relevant encoding. |
| 264 | +- Be wary of changes to the encoding received through source control. A key sign of this is a diff |
| 265 | +  indicating changes but where nothing seems to have changed (because the bytes have changed but the |
| 266 | +  characters have not). |
| 267 | + |
| 268 | +### Collaborators' environments |
| 269 | + |
| 270 | +On top of configuring source control, ensure that your collaborators on any files you share don't |
| 271 | +have settings that override your encoding by re-encoding PowerShell files. |
| 272 | + |
| 273 | +### Other programs |
| 274 | + |
| 275 | +Any other program that reads or writes a PowerShell script may re-encode it. |
| 276 | + |
| 277 | +Some examples are: |
| 278 | + |
| 279 | +- Using the clipboard to copy and paste a script. This is common in scenarios like: |
| 280 | + - Copying a script into a VM |
| 281 | + - Copying a script out of an email or webpage |
| 282 | + - Copying a script into or out of a Microsoft Word or PowerPoint document |
| 283 | +- Other text editors, such as: |
| 284 | + - Notepad |
| 285 | + - vim |
| 286 | + - Any other PowerShell script editor |
| 287 | +- Text editing utilities, like: |
| 288 | + - `Get-Content`/`Set-Content`/`Out-File` |
| 289 | + - PowerShell redirection operators like `>` and `>>` |
| 290 | + - `sed`/`awk` |
| 291 | +- File transfer programs, like: |
| 292 | + - A web browser, when downloading scripts |
| 293 | + - A file share |
| 294 | + |
| 295 | +Some of these tools deal in bytes rather than text, but others offer encoding configurations. In |
| 296 | +those cases where you need to configure an encoding, you need to make it the same as your editor |
| 297 | +encoding to prevent problems. |
| 298 | + |
| 299 | +## Other resources on encoding in PowerShell |
| 300 | + |
| 301 | +There are a few other nice posts on encoding and configuring encoding in PowerShell that are worth a |
| 302 | +read: |
| 303 | + |
| 304 | +- [@mklement0]'s [summary of PowerShell encoding on StackOverflow](https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8) |
| 305 | +- Previous issues opened on vscode-PowerShell for encoding problems: |
| 306 | + - [#1308](https://github.com/PowerShell/vscode-powershell/issues/1308) |
| 307 | + - [#1628](https://github.com/PowerShell/vscode-powershell/issues/1628) |
| 308 | + - [#1680](https://github.com/PowerShell/vscode-powershell/issues/1680) |
| 309 | + - [#1744](https://github.com/PowerShell/vscode-powershell/issues/1744) |
| 310 | + - [#1751](https://github.com/PowerShell/vscode-powershell/issues/1751) |
| 311 | +- [The classic *Joel on Software* write up about Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) |
| 312 | +- [Encoding in .NET Standard](https://github.com/dotnet/standard/issues/260#issuecomment-289549508) |
| 313 | + |
| 314 | + |
| 315 | +[@mklement0]: https://github.com/mklement0 |
| 316 | +[@rkeithhill]: https://github.com/rkeithhill |
| 317 | +[Windows-1252]: https://wikipedia.org/wiki/Windows-1252 |
| 318 | +[latin-1]: https://wikipedia.org/wiki/ISO/IEC_8859-1 |
| 319 | +[UTF-8]: https://wikipedia.org/wiki/UTF-8 |
| 320 | +[byte-order mark]: https://wikipedia.org/wiki/Byte_order_mark |
| 321 | +[UTF-16]: https://wikipedia.org/wiki/UTF-16 |
| 322 | +[Language Server Protocol]: https://microsoft.github.io/language-server-protocol/ |
| 323 | +[VSCode's encoding]: https://code.visualstudio.com/docs/editor/codebasics#_file-encoding-support |