-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Compiler lex interpolations 2 #43132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compiler lex interpolations 2 #43132
Conversation
81b5559
to
2594bfe
Compare
@petebacondarwin FYI there are legit failures (related to i18n) in the presubmit. I will investigate further and try to come up with a repro within the next day or two. Will keep you updated. |
😭 - I am sorry @AndrewKushnir - this is becoming an epic. I just realised that I don't think I fixed the i18n problem for attributes, only for text nodes. Let me add tests and fix that. |
2594bfe
to
fe9d947
Compare
fe9d947
to
7b98356
Compare
One more global presubmit (after g3 cleanup). |
Quick update: most recent TGP is "green" and extra checks in g3 confirmed that extracted message IDs are the same before/after the fix, so we should be ready for merge. Adding "risk: medium" label to indicate that this PR may still have a potential for some breakages, so we should merge and sync the change individually (do not bundle it with other changes). |
7b98356
to
d09a05a
Compare
The lexer now splits interpolation tokens out from text tokens. Previously the contents of `<div>Hello, {{ name}}<div>` would be a single text token. Now it will be three tokens: ``` TEXT: "Hello, " INTERPOLATION: "{{", " name", "}}" TEXT: "" ``` - INTERPOLATION tokens have three parts, "start marker", "expression" and "end marker". - INTERPOLATION tokens are always preceded and followed by TEXT tokens, even if they represent an empty string. The HTML parser has been modified to recombine these tokens to allow this refactoring to have limited effect in this commit. Further refactorings to use these new tokens will follow in subsequent commits.
The lexer now splits encoded entity tokens out from text and attribute value tokens. Previously encoded entities would be decoded and the decoded value would be included as part of the text token of the surrounding text. Now the entities have their own tokens. There are two scenarios: text and attribute values. Previously the contents of `<div>Hello & goodbye</div>` would be a single TEXT token. Now it will be three tokens: ``` TEXT: "Hello " ENCODED_ENTITY: "&", "&" TEXT: " goodbye" ``` Previously the attribute value in `<div title="Hello & goodbye">` would be a single text token. Now it will be three tokens: ``` ATTR_VALUE_TEXT: "Hello " ENCODED_ENTITY: "&", "&" ATTR_VALUE_TEXT: " goodbye" ``` - ENCODED_ENTITY tokens have two parts: "decoded" and "encoded". - ENCODED_ENTITY tokens are always preceded and followed by either TEXT tokens or ATTR_VALUE_TEXT tokens, depending upon the context, even if they represent an empty string. The HTML parser has been modified to recombine these tokens to allow this refactoring to have limited effect in this commit. Further refactorings to use these new tokens will follow in subsequent commits.
When it was tokenized, text content is split into parts that can include interpolations and encoded entities tokens. To make this information available to downstream processing, this commit adds these tokens to the `Text` AST nodes, with suitable processing.
The tests were checking that the source-span of parsed HTML nodes were accurate, but they were not checking the span when it includes the "leading trivia", which are given by the `fullStart` rather than `start` location.
…sages Previously, the way templates were tokenized meant that we lost information about the location of interpolations if the template contained encoded HTML entities. This meant that the mapping back to the source interpolated strings could be offset incorrectly. Also, the source-span assigned to an i18n message did not include leading whitespace. This confused the output source-mappings so that the first text nodes of the message stopped at the first non-whitespace character. This commit makes use of the previous refactorings, where more fine grain information was provided in text tokens, to enable the parser to identify the location of the interpolations in the original source more accurately. Fixes angular#41034
…43132) The lexer now splits encoded entity tokens out from text and attribute value tokens. Previously encoded entities would be decoded and the decoded value would be included as part of the text token of the surrounding text. Now the entities have their own tokens. There are two scenarios: text and attribute values. Previously the contents of `<div>Hello & goodbye</div>` would be a single TEXT token. Now it will be three tokens: ``` TEXT: "Hello " ENCODED_ENTITY: "&", "&" TEXT: " goodbye" ``` Previously the attribute value in `<div title="Hello & goodbye">` would be a single text token. Now it will be three tokens: ``` ATTR_VALUE_TEXT: "Hello " ENCODED_ENTITY: "&", "&" ATTR_VALUE_TEXT: " goodbye" ``` - ENCODED_ENTITY tokens have two parts: "decoded" and "encoded". - ENCODED_ENTITY tokens are always preceded and followed by either TEXT tokens or ATTR_VALUE_TEXT tokens, depending upon the context, even if they represent an empty string. The HTML parser has been modified to recombine these tokens to allow this refactoring to have limited effect in this commit. Further refactorings to use these new tokens will follow in subsequent commits. PR Close #43132
When it was tokenized, text content is split into parts that can include interpolations and encoded entities tokens. To make this information available to downstream processing, this commit adds these tokens to the `Text` AST nodes, with suitable processing. PR Close #43132
The tests were checking that the source-span of parsed HTML nodes were accurate, but they were not checking the span when it includes the "leading trivia", which are given by the `fullStart` rather than `start` location. PR Close #43132
…sages (#43132) Previously, the way templates were tokenized meant that we lost information about the location of interpolations if the template contained encoded HTML entities. This meant that the mapping back to the source interpolated strings could be offset incorrectly. Also, the source-span assigned to an i18n message did not include leading whitespace. This confused the output source-mappings so that the first text nodes of the message stopped at the first non-whitespace character. This commit makes use of the previous refactorings, where more fine grain information was provided in text tokens, to enable the parser to identify the location of the interpolations in the original source more accurately. Fixes #41034 PR Close #43132
These token interfaces will make it easier to reason about tokens in the parser and in specs. Previously, it was never clear what items could appear in the `parts` array of a token given a particular `TokenType`. Now, each token interface declares a labelled tuple for the parts, which helps to document the token better. PR Close #43132
…43132) The lexer now splits interpolation tokens out from text tokens. Previously the contents of `<div>Hello, {{ name}}<div>` would be a single text token. Now it will be three tokens: ``` TEXT: "Hello, " INTERPOLATION: "{{", " name", "}}" TEXT: "" ``` - INTERPOLATION tokens have three parts, "start marker", "expression" and "end marker". - INTERPOLATION tokens are always preceded and followed by TEXT tokens, even if they represent an empty string. The HTML parser has been modified to recombine these tokens to allow this refactoring to have limited effect in this commit. Further refactorings to use these new tokens will follow in subsequent commits. PR Close #43132
…43132) The lexer now splits encoded entity tokens out from text and attribute value tokens. Previously encoded entities would be decoded and the decoded value would be included as part of the text token of the surrounding text. Now the entities have their own tokens. There are two scenarios: text and attribute values. Previously the contents of `<div>Hello & goodbye</div>` would be a single TEXT token. Now it will be three tokens: ``` TEXT: "Hello " ENCODED_ENTITY: "&", "&" TEXT: " goodbye" ``` Previously the attribute value in `<div title="Hello & goodbye">` would be a single text token. Now it will be three tokens: ``` ATTR_VALUE_TEXT: "Hello " ENCODED_ENTITY: "&", "&" ATTR_VALUE_TEXT: " goodbye" ``` - ENCODED_ENTITY tokens have two parts: "decoded" and "encoded". - ENCODED_ENTITY tokens are always preceded and followed by either TEXT tokens or ATTR_VALUE_TEXT tokens, depending upon the context, even if they represent an empty string. The HTML parser has been modified to recombine these tokens to allow this refactoring to have limited effect in this commit. Further refactorings to use these new tokens will follow in subsequent commits. PR Close #43132
When it was tokenized, text content is split into parts that can include interpolations and encoded entities tokens. To make this information available to downstream processing, this commit adds these tokens to the `Text` AST nodes, with suitable processing. PR Close #43132
The tests were checking that the source-span of parsed HTML nodes were accurate, but they were not checking the span when it includes the "leading trivia", which are given by the `fullStart` rather than `start` location. PR Close #43132
…sages (#43132) Previously, the way templates were tokenized meant that we lost information about the location of interpolations if the template contained encoded HTML entities. This meant that the mapping back to the source interpolated strings could be offset incorrectly. Also, the source-span assigned to an i18n message did not include leading whitespace. This confused the output source-mappings so that the first text nodes of the message stopped at the first non-whitespace character. This commit makes use of the previous refactorings, where more fine grain information was provided in text tokens, to enable the parser to identify the location of the interpolations in the original source more accurately. Fixes #41034 PR Close #43132
These token interfaces will make it easier to reason about tokens in the parser and in specs. Previously, it was never clear what items could appear in the `parts` array of a token given a particular `TokenType`. Now, each token interface declares a labelled tuple for the parts, which helps to document the token better. PR Close #43132
This issue has been automatically locked due to inactivity. Read more about our automatic conversation locking policy. This action has been performed automatically by a bot. |
Blocked behind #43129This PR is a replacement for the reverted #42062
The significant commit that should fix the i18n message ID problem is 359e043. I have checked that the compliance test passes before this PR.