Fix invalid or incorrect UTF-8 characters#2858
Fix invalid or incorrect UTF-8 characters#2858rmunn wants to merge 5 commits intoPowerShell:masterPowerShell/PowerShell:masterfrom rmunn:fix-unicode-in-rest-of-repormunn/PowerShell:fix-unicode-in-rest-of-repoCopy head branch name to clipboard
Conversation
This file contained many occurrences of U+F0A7, a character from the designated Private Use Area, so its meaning is not defined and fonts can assign any symbol to that character. Since the person who originally wrote this file works at Microsoft, it seems reasonable to assume that he was using the Wingdings font, which has a bullet character at U+F0A7. Thus, replacing these characters with the proper Unicode bullet (U+2022) seems the right way to fix this.
Problems fixed in this commit:
- Some curly-quotes encoded in CP-1252. Replaced with ASCII quotes.
- Some cases of – (U+2013 EN DASH) used in comments where a hyphen was
clearly intended (e.g., in the number -1). Replaced with hyphen.
- Many © characters (and one ® character) encoded in CP-1252 instead
of UTF-8. Replaced with correct UTF-8 characters.
- Two cases of ö encoded in CP-1252 instead of UTF-8. Fixed.
Problems NOT fixed in this commit:
- Microsoft.PowerShell.Utility.psm1 file contains some invalid UTF-8
bytes in comments on line 377 and lines 390-392. Since those are
meant to represent CP-1252 encoded output on the Windows command
prompt, it's less clear what to do here. Left them alone for now.
- Many files have a UTF-8 BOM that's no longer the first character
of the file. Usually this is because a "Copyright (C) Microsoft.
All rights reserved." comment has been automatically added at the
top of the file by some script which didn't properly handle the
existence of a BOM character. One example of this problem can be
seen in DefinitionCache.cs. This is a different issue and should
be tracked (and fixed) separately.
This should mostly fix PowerShell#2851, and any remaining Unicode problems (such
as the BOM issue above) should probably be tracked in a new issue.
|
Hi @rmunn, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution! TTYL, MSBOT; |
|
|
I've signed the CLA, but the MSBOT hasn't updated the PR yet. Do I need to email the signed PDF back to Microsoft or something? Update: Oh, I think I see what's going on. I signed it with my Gmail address, but the email address that GIthub has for me is my pobox.com forwarding address. I'll probably need to re-sign it with the pobox.com address before the MSBOT knows that it's me. |
| { | ||
| // we write null out because: | ||
| // PS C:\> �$null | ForEach-object {$_.aa} | ForEach-Object {$_ + 3}� | ||
| // PS C:\> "$null | ForEach-object {$_.aa} | ForEach-Object {$_ + 3}" |
There was a problem hiding this comment.
adding quotes here changes the meaning of that comment line from a powershell pipeline to a string
There was a problem hiding this comment.
The original file before my fixes contained a 0x93 byte at the start of the comment line, and a 0x94 byte at the end of the comment line. In CP-1252, those are opening and closing curly quotes, so I replaced them with straight quotes since that was the closest thing to the original intent of whoever wrote that comment.
But since that comment's original intent was apparently incorrect, I'll be happy to remove the quotes.
There was a problem hiding this comment.
In this case, the quotes shouldn't be there at all.
| // 3 | ||
| // so we also want | ||
| // PS C:\> �$null | ForEach-object aa | ForEach-Object {$_ + 3}� | ||
| // PS C:\> "$null | ForEach-object aa | ForEach-Object {$_ + 3}" |
There was a problem hiding this comment.
adding quotes here changes the meaning of that comment line from a powershell pipeline to a string
There was a problem hiding this comment.
Same comment as before: that was the intent of the original author, given the CP-1252-encoded curly quotes that I found. But I'll be happy to remove the quotes if they're actually incorrect.
There was a problem hiding this comment.
Same, quotes shouldn't have been there in the first place.
| protected override void EndProcessing() | ||
| { | ||
| // Win8: 264471 - Error message for New-ModuleManifest �ProcessorArchitecture is obsolete. | ||
| // Win8: 264471 - Error message for New-ModuleManifest - ProcessorArchitecture is obsolete. |
There was a problem hiding this comment.
ProcessorArchitecture is a parameter, so it should be written as:
// Win8: 264471 - Error message for New-ModuleManifest -ProcessorArchitecture is obsolete.
There was a problem hiding this comment.
It was an en-dash character, so I figured it was meant to be a dash. If it's supposed to be a single-hyphen prefix, then that's what I'll put in there.
There was a problem hiding this comment.
Should be single-hyphen indicating it's a parameter passed to new-modulemanifest cmdlet
|
|
||
| // When cmdlet metadata is generated for the 'HelpFile' field, we use the name assembly and we append '-Help.xml' to it. | ||
| // Because of this, if the cmdlet is part of an N�gen assembly, then 'HelpFile' field will be pointing to a help file which does not exist. | ||
| // Because of this, if the cmdlet is part of an N'gen assembly, then 'HelpFile' field will be pointing to a help file which does not exist. |
There was a problem hiding this comment.
Ngen refers to Ngen.exe which produces native code from managed code, so the apostrophe doesn't make sense
There was a problem hiding this comment.
The original comment had a 0x92 byte, which is a curly apostrophe in CP-1252. I did wonder about whether it made sense, but I wanted to stay as close to the original as possible. I'll be happy to fix it to what it should be, though.
|
Hi, I am closing and re-opening this PR to bump the CLA bot. Sorry for the inconvenience! |
|
Hi @rmunn, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution! The agreement was validated by Microsoft and real humans are currently evaluating your PR. TTYL, MSBOT; |
This implements all the changes requested by Steve Lee.
| Author="Microsoft Corporation" | ||
| CompanyName="Microsoft Corporation" | ||
| Copyright="� Microsoft Corporation. All rights reserved." | ||
| Copyright="© Microsoft Corporation. All rights reserved." |
There was a problem hiding this comment.
Some Windows tools will not handle this file correctly because there is no BOM. For example, gvim shows 2 characters for the copyright symbol, I assume because it guesses the file is Latin-1, not UTF8.
There was a problem hiding this comment.
Hmmm. This is a tricky question. The use of a BOM in UTF-8 files can cause problems, as we've seen in this very PR (where a naive script pasted copyright comments at the start of several files, pushing the BOM out of the first position in the file). And according to the Unicode spec, "[u]se of a BOM is neither required nor recommended for UTF-8".
Since UTF-8 is emerging (I'd say "has emerged") as the de facto standard encoding for text files, and other encodings are increasingly rare, the right answer in the long run is for tools like GVim to default to assuming UTF-8, and only try a non-Unicode encoding like Latin-1 if decoding the file as UTF-8 produces invalid byte sequences (like 0xA9 without a 0xC2 byte in front of it, as in this file). Another thing that needs to happen long-term is for Windows to make codepage 65001 (UTF-8) the default codepage unless users specifically select a different one. That would go a long way towards fixing encoding problems all over the world, and I hope Microsoft decides to move in that direction. (Actually, Microsoft has started to take steps towards making UTF-8 the default in some of their tools: I found out about the need this PR because of microsoft/vscode#15303, which was caused because VS Code assumes UTF-8 by default. I'd like to congratulate the VS Code team for getting it right.)
However, that's in the long run. In the short run, there are still tools in common use (like GVim) that don't assume UTF-8 out of the box but default to a non-Unicode encoding. So I guess we do need to add a BOM to these files, even though the UTF-8 BOM often causes problems of its own, so that these tools that don't yet have the right defaults will be able to figure it out. I won't have time to work on that until Monday, but I'll do it then. (And I'll also file a bug in the GVim issue tracker suggesting that they should make UTF-8 the default encoding. Because no software should be defaulting to a non-Unicode encoding these days.)
Although the BOM "is neither required nor recommended" for UTF-8 files by the Unicode standard, it is still necessary for some tools to process these files correctly now that they contain the © (copyright) character. In the absence of a BOM, some tools still default to Latin-1 instead of UTF-8; until such tools are changed to default to a Unicode encoding, putting a BOM in front of UTF-8 files remains an unfortunate necessity.
|
Why we use unwanted BOM instead of simply replace one © (copyright) character? Now 1457 file contain "Copyright (C)". |
|
I'm closing this PR as we're going a different route. |
Problems fixed in this commit:
Problems NOT fixed in this commit:
This should mostly fix #2851, and any remaining Unicode problems (such as the BOM issue mentioned above) should probably be tracked in a new issue.