Fix invalid or incorrect UTF-8 characters by rmunn · Pull Request #2858 · PowerShell/PowerShell

rmunn · Dec 8, 2016

Problems fixed in this commit:

Some curly-quotes encoded in CP-1252. Replaced with ASCII quotes.
Some cases of – (U+2013 EN DASH) used in comments where a hyphen was clearly intended (e.g., in the number -1). Replaced with hyphen.
Many © characters (and one ® character) encoded in CP-1252 instead of UTF-8. Replaced with correct UTF-8 characters.
Two cases of ö encoded in CP-1252 instead of UTF-8. Fixed.

Problems NOT fixed in this commit:

Microsoft.PowerShell.Utility.psm1 file contains some invalid UTF-8 bytes in comments on line 377 and lines 390-392. Since those are meant to represent CP-1252 encoded output on the Windows command prompt, it's less clear what to do here. Left them alone for now.
Many files have a UTF-8 BOM that's no longer the first character of the file. Usually this is because a "Copyright (C) Microsoft. All rights reserved." comment has been automatically added at the top of the file by some script which didn't properly handle the existence of a BOM character. One example of this problem can be seen in DefinitionCache.cs, where line 4 starts with the BOM that should be at the start of line 1. This is a different issue and should be tracked (and fixed) separately.

This should mostly fix #2851, and any remaining Unicode problems (such as the BOM issue mentioned above) should probably be tracked in a new issue.

This file contained many occurrences of U+F0A7, a character from the designated Private Use Area, so its meaning is not defined and fonts can assign any symbol to that character. Since the person who originally wrote this file works at Microsoft, it seems reasonable to assume that he was using the Wingdings font, which has a bullet character at U+F0A7. Thus, replacing these characters with the proper Unicode bullet (U+2022) seems the right way to fix this.

Problems fixed in this commit: - Some curly-quotes encoded in CP-1252. Replaced with ASCII quotes. - Some cases of – (U+2013 EN DASH) used in comments where a hyphen was clearly intended (e.g., in the number -1). Replaced with hyphen. - Many © characters (and one ® character) encoded in CP-1252 instead of UTF-8. Replaced with correct UTF-8 characters. - Two cases of ö encoded in CP-1252 instead of UTF-8. Fixed. Problems NOT fixed in this commit: - Microsoft.PowerShell.Utility.psm1 file contains some invalid UTF-8 bytes in comments on line 377 and lines 390-392. Since those are meant to represent CP-1252 encoded output on the Windows command prompt, it's less clear what to do here. Left them alone for now. - Many files have a UTF-8 BOM that's no longer the first character of the file. Usually this is because a "Copyright (C) Microsoft. All rights reserved." comment has been automatically added at the top of the file by some script which didn't properly handle the existence of a BOM character. One example of this problem can be seen in DefinitionCache.cs. This is a different issue and should be tracked (and fixed) separately. This should mostly fix PowerShell#2851, and any remaining Unicode problems (such as the BOM issue above) should probably be tracked in a new issue.

msftclas · Dec 8, 2016

Hi @rmunn, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes. I promise there's no faxing. https://cla.microsoft.com.

TTYL, MSBOT;

rmunn · Dec 8, 2016

Note that #2857 is a subset of these changes. I created that PR first, then decided to go ahead and fix the rest of the Unicode encoding problems I had found. So if this PR is accepted, #2857 will be moot.

iSazonov · Dec 8, 2016

I think we should fix Microsoft.PowerShell.Utility.psm1 here too because the problem is in comments , not in real data.
We should add tests. Perhaps we can add CI encoding checks as requested by @vors in CI check for files encoding #1559

rmunn · Dec 8, 2016

I've signed the CLA, but the MSBOT hasn't updated the PR yet. Do I need to email the signed PDF back to Microsoft or something?

Update: Oh, I think I see what's going on. I signed it with my Gmail address, but the email address that GIthub has for me is my pobox.com forwarding address. I'll probably need to re-sign it with the pobox.com address before the MSBOT knows that it's me.

SteveL-MSFT · Dec 8, 2016

                                {
                                    // we write null out because:
-                                    // PS C:\> �$null | ForEach-object {$_.aa} | ForEach-Object {$_ + 3}�
+                                    // PS C:\> "$null | ForEach-object {$_.aa} | ForEach-Object {$_ + 3}"


adding quotes here changes the meaning of that comment line from a powershell pipeline to a string

The original file before my fixes contained a 0x93 byte at the start of the comment line, and a 0x94 byte at the end of the comment line. In CP-1252, those are opening and closing curly quotes, so I replaced them with straight quotes since that was the closest thing to the original intent of whoever wrote that comment.

But since that comment's original intent was apparently incorrect, I'll be happy to remove the quotes.

In this case, the quotes shouldn't be there at all.

SteveL-MSFT · Dec 8, 2016

                                    // 3
                                    // so we also want
-                                    // PS C:\> �$null | ForEach-object aa | ForEach-Object {$_ + 3}�
+                                    // PS C:\> "$null | ForEach-object aa | ForEach-Object {$_ + 3}"


adding quotes here changes the meaning of that comment line from a powershell pipeline to a string

Same comment as before: that was the intent of the original author, given the CP-1252-encoded curly quotes that I found. But I'll be happy to remove the quotes if they're actually incorrect.

Same, quotes shouldn't have been there in the first place.

SteveL-MSFT · Dec 8, 2016

        protected override void EndProcessing()
        {
-            // Win8: 264471 - Error message for New-ModuleManifest �ProcessorArchitecture is obsolete.
+            // Win8: 264471 - Error message for New-ModuleManifest - ProcessorArchitecture is obsolete.


ProcessorArchitecture is a parameter, so it should be written as:

// Win8: 264471 - Error message for New-ModuleManifest -ProcessorArchitecture is obsolete.

It was an en-dash character, so I figured it was meant to be a dash. If it's supposed to be a single-hyphen prefix, then that's what I'll put in there.

Should be single-hyphen indicating it's a parameter passed to new-modulemanifest cmdlet

SteveL-MSFT · Dec 8, 2016


                // When cmdlet metadata is generated for the 'HelpFile' field, we use the name assembly and we append '-Help.xml' to it.
-                // Because of this, if the cmdlet is part of an N�gen assembly, then 'HelpFile' field will be pointing to a help file which does not exist.
+                // Because of this, if the cmdlet is part of an N'gen assembly, then 'HelpFile' field will be pointing to a help file which does not exist.


Ngen refers to Ngen.exe which produces native code from managed code, so the apostrophe doesn't make sense

The original comment had a 0x92 byte, which is a curly apostrophe in CP-1252. I did wonder about whether it made sense, but I wanted to stay as close to the original as possible. I'll be happy to fix it to what it should be, though.

Should be just Ngen

msftgits · Dec 8, 2016

Hi, I am closing and re-opening this PR to bump the CLA bot. Sorry for the inconvenience!

msftclas · Dec 8, 2016

Hi @rmunn, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution!
You've already signed the contribution license agreement. Thanks!

The agreement was validated by Microsoft and real humans are currently evaluating your PR.

TTYL, MSBOT;

This implements all the changes requested by Steve Lee.

SteveL-MSFT

LGTM

lzybkr · Dec 9, 2016

 Author="Microsoft Corporation"
 CompanyName="Microsoft Corporation"
-Copyright="� Microsoft Corporation. All rights reserved."
+Copyright="© Microsoft Corporation. All rights reserved."


Some Windows tools will not handle this file correctly because there is no BOM. For example, gvim shows 2 characters for the copyright symbol, I assume because it guesses the file is Latin-1, not UTF8.

Hmmm. This is a tricky question. The use of a BOM in UTF-8 files can cause problems, as we've seen in this very PR (where a naive script pasted copyright comments at the start of several files, pushing the BOM out of the first position in the file). And according to the Unicode spec, "[u]se of a BOM is neither required nor recommended for UTF-8".

Since UTF-8 is emerging (I'd say "has emerged") as the de facto standard encoding for text files, and other encodings are increasingly rare, the right answer in the long run is for tools like GVim to default to assuming UTF-8, and only try a non-Unicode encoding like Latin-1 if decoding the file as UTF-8 produces invalid byte sequences (like 0xA9 without a 0xC2 byte in front of it, as in this file). Another thing that needs to happen long-term is for Windows to make codepage 65001 (UTF-8) the default codepage unless users specifically select a different one. That would go a long way towards fixing encoding problems all over the world, and I hope Microsoft decides to move in that direction. (Actually, Microsoft has started to take steps towards making UTF-8 the default in some of their tools: I found out about the need this PR because of microsoft/vscode#15303, which was caused because VS Code assumes UTF-8 by default. I'd like to congratulate the VS Code team for getting it right.)

However, that's in the long run. In the short run, there are still tools in common use (like GVim) that don't assume UTF-8 out of the box but default to a non-Unicode encoding. So I guess we do need to add a BOM to these files, even though the UTF-8 BOM often causes problems of its own, so that these tools that don't yet have the right defaults will be able to figure it out. I won't have time to work on that until Monday, but I'll do it then. (And I'll also file a bug in the GVim issue tracker suggesting that they should make UTF-8 the default encoding. Because no software should be defaulting to a non-Unicode encoding these days.)

Although the BOM "is neither required nor recommended" for UTF-8 files by the Unicode standard, it is still necessary for some tools to process these files correctly now that they contain the © (copyright) character. In the absence of a BOM, some tools still default to Latin-1 instead of UTF-8; until such tools are changed to default to a Unicode encoding, putting a BOM in front of UTF-8 files remains an unfortunate necessity.

iSazonov · Dec 13, 2016

lzybkr · Feb 23, 2017

I'm closing this PR as we're going a different route.
First, some of these files will be removed when we remove Windows (full) PowerShell from this repo.
For other files, I think we want pure ASCII and remove any BOMs - necessary or not - which means we'll replace some of the characters, e.g. replace copyright with (c).

rmunn added 2 commits December 8, 2016 09:17

msftclas added the cla-required label Dec 8, 2016

PowerShellTeam added the Review - Needed The PR is being reviewed label Dec 8, 2016

SteveL-MSFT requested changes Dec 8, 2016

View reviewed changes

msftgits closed this Dec 8, 2016

msftgits reopened this Dec 8, 2016

msftclas added the cla-already-signed label Dec 8, 2016

msftgits removed the cla-required label Dec 8, 2016

Fix comments whose punctuation was incorrect

55ad5d6

This implements all the changes requested by Steve Lee.

SteveL-MSFT approved these changes Dec 9, 2016

View reviewed changes

lzybkr reviewed Dec 9, 2016

View reviewed changes

lzybkr self-assigned this Dec 9, 2016

rmunn added 2 commits December 13, 2016 12:08

Add UTF-8 BOM to two more files that will need it

0a1c7b1

lzybkr closed this Feb 23, 2017

iSazonov removed the Review - Needed The PR is being reviewed label Mar 27, 2017

Search code, repositories, users, issues, pull requests...

Conversation

rmunn commented Dec 8, 2016

Uh oh!

msftclas commented Dec 8, 2016

Uh oh!

rmunn commented Dec 8, 2016

Uh oh!

iSazonov commented Dec 8, 2016

Uh oh!

rmunn commented Dec 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msftgits commented Dec 8, 2016

Uh oh!

msftclas commented Dec 8, 2016

Uh oh!

SteveL-MSFT left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iSazonov commented Dec 13, 2016

Uh oh!

lzybkr commented Feb 23, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

rmunn commented Dec 8, 2016 •

edited

Loading