If #region and #if are not comments, what are they? #63

qwertie
Jul 17, 2018
Maintainer

C# is really weird sometimes. @j4m3z0r @jonathanvdc

I was thinking about treating #region/#endregion like statements rather than like comments, so they would be processed by the parser as normal tokens. This is attractive because neither of them could be deleted inadvertently from the output (see #58), and they could have comments attached to them. The challenge is that normal C# allows bizarre code like this:

if (x >
#region What in tarnation? 
 10) {
  return x;
  #endregion
}

Realistically, the parser cannot handle #region if craziness like this is allowed. There is a completely independent system (AbstractTriviaInjector/StandardTriviaInjector/EcsTriviaInjector) for combining comments with the parser's output, so naturally I handled #region as if it were a comment. But probably no one uses #region like this in practice, so maybe it's okay to just parse it like a statement and make the above code illegal.

Currently #if is supported only on the input side: it gets deleted from the output. If we wanted to emit #if on the output side, the same problems would arise with #if/#endif as with #region/#endregion. If an #if false/#else/#endif block is stored as two or three trivia nodes (i.e. comment-like attributes), the first could be deleted if the node to which it is attached is deleted, leaving the #endif behind, which would cause a C# compiler error.
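
To make that failure mode concrete, here is a minimal C# sketch. The Node type and its LeadingTrivia list are hypothetical stand-ins, not Loyc's actual LNode API, and attaching #endif to the statement that follows it is an assumption made only for this demo.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical syntax node: a line of code plus the trivia attached in front of it.
public sealed class Node
{
    public readonly string Code;
    public readonly List<string> LeadingTrivia = new List<string>();

    public Node(string code, params string[] leadingTrivia)
    {
        Code = code;
        LeadingTrivia.AddRange(leadingTrivia);
    }
}

public static class TriviaLossDemo
{
    // Prints each node preceded by whatever trivia is attached to it.
    public static string Print(IEnumerable<Node> nodes)
    {
        var lines = new List<string>();
        foreach (var node in nodes)
        {
            lines.AddRange(node.LeadingTrivia);
            lines.Add(node.Code);
        }
        return string.Join("\n", lines);
    }

    // Net nesting of #if/#endif: 0 means balanced, negative means a stray #endif.
    public static int Balance(string text)
    {
        int depth = 0;
        foreach (var line in text.Split('\n'))
        {
            string t = line.TrimStart();
            if (t.StartsWith("#if", StringComparison.Ordinal)) depth++;
            else if (t.StartsWith("#endif", StringComparison.Ordinal)) depth--;
        }
        return depth;
    }
}
```

With two nodes, `Log("x");` carrying `#if DEBUG` and `Run();` carrying `#endif`, the printed output is balanced. If a macro deletes the first node, its `#if DEBUG` goes with it, and the output starts with a lone `#endif` (Balance returns -1).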

But #if seems like a thornier issue than #region, because you're very unlikely to see #region mid-statement, but you do sometimes see #if mid-statement. For example, our IListSource interface is defined as

#if !DotNet2 && !DotNet3
public interface IListSource<out T> : IReadOnlyList<T>
#else
public interface IListSource<T> : IReadOnlyList<T>
#endif
{
	...
}

So... hmm. I'm not sure what to do with this.

j4m3z0r
Jul 17, 2018

Would it be simpler to pass #region and comments straight through without making any attempt to associate them with another node in the tree? So just map them to whatever rawText uses internally, or similar?

For my use-case, I'm totally ok with disallowing #region mid-statement, but I agree that #if is more complex, and users are far more likely to run into limitations there.

I guess more important to me than the exact decision made here is that it be applied consistently: I can live with my #region markers being stripped, as long as they're stripped consistently. The issues I've been struggling with are where some are emitted and others not, leading to mismatched #region #endregion tags. Of course, I'd prefer they were preserved. :)

Thinking more about it, it seems that the correct representation would have C# preprocessor stuff be a level "above" the rest of EC#; I presume C# itself is doing something like that. However, that seems like a pretty significant undertaking, so I'm not sure how practical that would be.

(apologies for my somewhat hand-wavey thoughts here; I've not yet familiarized myself with the code)

qwertie
Jul 17, 2018
Maintainer Author

The idea of passing trivia "straight through" is not meaningful in the presence of macros, as the output syntax tree could potentially be completely different from the input tree. Some technique is needed to map the input trivia to a substantially altered output tree... we know that the current method is inadequate, but I haven't thought of anything better yet.

In terms of parsing, there is a separate layer for the preprocessor. EC# has five stages right now: lexer => preprocessor (which also separates comments and newlines into a separate stream) => tree parsing => normal parsing => trivia injection. This produces a single unified tree that includes comments, and if you print it immediately you get output that matches the input pretty well (with some exceptions, because I did not design the tree to represent the original code perfectly token-for-token).
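
As a rough illustration of the "separate stream" step, the C# sketch below splits one token stream into code tokens for the parser and trivia kept aside for later injection. The Token and TokenKind types and the Split method are hypothetical, not EC#'s actual lexer or preprocessor classes; keeping each token's offset is an assumption about how a later stage could match trivia back to nodes by position.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Code, Comment, Newline }

// Hypothetical token: its kind, its text, and where it started in the source.
public sealed class Token
{
    public readonly TokenKind Kind;
    public readonly string Text;
    public readonly int Offset;

    public Token(TokenKind kind, string text, int offset)
    {
        Kind = kind;
        Text = text;
        Offset = offset;
    }
}

public static class TriviaSplitter
{
    // Code tokens go on to the parser; comments and newlines are set aside,
    // in their original order, for a later trivia-injection pass.
    public static (List<Token> Code, List<Token> Trivia) Split(IEnumerable<Token> tokens)
    {
        var code = new List<Token>();
        var trivia = new List<Token>();
        foreach (var token in tokens)
        {
            if (token.Kind == TokenKind.Code) code.Add(token);
            else trivia.Add(token);
        }
        return (code, trivia);
    }
}
```

Because the parser never sees the trivia stream, it stays simple, but the injection stage then has to decide which node each comment belongs to, which is where the problems in this thread come from.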

The current representation works mostly intuitively since you can write, say,

unroll(x in (x,y,z)) {
    // A coordinate.
    int x;
}

So the output has three comments, which makes sense. But then there are these cases where comments get deleted because their associated node gets deleted... but perhaps that's okay, maybe sometimes the user wants the comment to be deleted:

// This unroll command produces 3 variables called x, y and z
unroll(x in (x,y,z)) {
    int x;
}

Ahh, but then there's the #region thing. What to do? One idea is to enhance AbstractTriviaInjector (which is language-agnostic) to allow a derived class to "reify" certain trivia, converting it from trivia into a statement. Then the derived class EcsTriviaInjector says "ahh, well this #region exists between braces (not some weird place), so I hereby want it to be converted to a statement". (Edit: on the other hand when an #if directive exists in a weird place - maybe in that case we'd just have to store it as trivia and accept the risk that it could get messed up and wait for complaints on github :O?)
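
A minimal C# sketch of what "reifying" could look like, under heavy assumptions: Stmt and Reifier are hypothetical types, not the real AbstractTriviaInjector/EcsTriviaInjector API, and the input is assumed to be the statements of one braced block, each with its leading trivia. Trivia that appears before a directive is attached to the new directive statement.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical statement node with the trivia attached in front of it.
public sealed class Stmt
{
    public readonly string Code;
    public readonly List<string> LeadingTrivia = new List<string>();

    public Stmt(string code, params string[] leadingTrivia)
    {
        Code = code;
        LeadingTrivia.AddRange(leadingTrivia);
    }
}

public static class Reifier
{
    static bool IsRegionDirective(string trivia)
    {
        string t = trivia.TrimStart();
        return t.StartsWith("#region", StringComparison.Ordinal)
            || t.StartsWith("#endregion", StringComparison.Ordinal);
    }

    // Promotes #region/#endregion trivia to statements of their own, so that
    // deleting a neighbouring statement no longer deletes the directive.
    // Limitation: trivia after the last statement of a block is not modelled here.
    public static List<Stmt> Reify(IEnumerable<Stmt> block)
    {
        var result = new List<Stmt>();
        foreach (var stmt in block)
        {
            var pending = new List<string>();
            foreach (var trivia in stmt.LeadingTrivia)
            {
                if (IsRegionDirective(trivia))
                {
                    var directive = new Stmt(trivia.Trim());
                    directive.LeadingTrivia.AddRange(pending); // comments above the directive stay with it
                    result.Add(directive);
                    pending.Clear();
                }
                else
                {
                    pending.Add(trivia);
                }
            }
            var copy = new Stmt(stmt.Code);
            copy.LeadingTrivia.AddRange(pending);
            result.Add(copy);
        }
        return result;
    }
}
```

After reification, a block like `#region Fields` / `int x;` / `#endregion` / `int y;` becomes four statements, so a macro that removes `int x;` leaves both directives in place.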

Also there are situations like this:

class Foo {
    // This comment is kind of ... alone.

    // Define 3 variables
    unroll(x in (x,y,z)) {
        int x;
    }
}

The "alone" comment is separated from unroll by a blank line, suggesting it isn't intended to be associated with unroll. However currently it will be associated with unroll anyway because there is no other child of {} with which it could be associated. I guess it could be associated with the opening brace... the braces' Target has a Range associated with { so this ought to work... hmm. But my initial idea was that this could be handled in a similar way: the derived class is somehow empowered to "reify" the newline (e.g. as #rawText("")) so that the "alone" comment can be associated with it.

jonathanvdc
Jul 19, 2018

Hi all.

I don't think this is a problem that should be—or indeed can be—solved by trying to be smart about where preprocessor directives are placed.

It is my understanding that the EC# compilation pipeline should look more or less like this:

EC# source code -> preprocessor -> parser -> LeMP -> compiler -> IL

That is, a clean insert of a macro processing (LeMP) stage just before the compilation stage in the pipeline of a regular C# compiler such as csc or mcs.

Given that pipeline, the snippet below must be legal EC# because preprocessing happens (once) before macro expansion and preprocessing is the only stage that can raise an error about mismatched #region statements.

unroll (x in (x, y, z)) {
    #region
    int x;
}
#endregion

But in reality, the toolchain rejects this code. (I think? I don't have access to a computer right now.) And it's easy to see why: the macro processor "unrolls" the #region directive (which is not "wrong" given the ideal pipeline) and then feeds the result to a full-stack C# compiler, which balks.

That final step is only supposed to implement the compilation stage in the ideal EC# pipeline, but it has the unfortunate side-effect of invoking another preprocessing step, which is decidedly undesirable because it may mark correct code as incorrect.

Ideally, we'd just turn off csc/mcs's preprocessing stage. Lots of compilers (gcc, clang, etc.) have a flag for that, but csc doesn't seem to include one according to this MSDN page.

So I guess the next best thing would be to delete all preprocessor directives from the output tree. That's essentially the same thing as turning off the csc/mcs preprocessor.

I know that's kind of a drastic measure, but OTOH there's nothing left to preprocess. So why keep the preprocessor directives?
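
A rough C# sketch of that idea, working on output text rather than on the output tree (which is a simplification of what is proposed above). DirectiveStripper is a hypothetical helper, not part of LeMP. It treats every line whose first non-blank character is '#' as a directive, so it would also drop things like #pragma, and it would misfire on such lines inside a multi-line verbatim string.

```csharp
using System;
using System.Linq;

public static class DirectiveStripper
{
    // Line-based approximation: removes every line that looks like a
    // preprocessor directive and keeps all other lines unchanged.
    public static string Strip(string source)
    {
        var kept = source
            .Replace("\r\n", "\n")
            .Split('\n')
            .Where(line => !line.TrimStart().StartsWith("#", StringComparison.Ordinal));
        return string.Join("\n", kept);
    }
}
```

Applied to the unrolled output, this removes the duplicated #region lines together with the single #endregion, so the C# compiler has nothing left to complain about.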

qwertie
Jul 19, 2018
Maintainer Author

That's good food for thought, Jonathan, thanks.

One extra nuance here is that comments aren't needed in the "ideal pipeline" either, yet I made the effort to preserve them (which isn't easy to do). Why? Because I thought it would be valuable to some users. For one thing I wanted people to feel comfortable using EC# on a trial basis - if its output faithfully preserves all normal C# code, you can always throw away EC# when the boss complains about your use of it or whatever.

I suppose the more important thing, though, was to preserve Doc Comments so that doc-generator tools (that only understand C#) still work when EC# code has doc comments. Indeed, I wish someone would make a macro that would let me write doc-comments in Markdown and support doc-comments that cover all overloads of a function rather than just one... but I digress.

Compared to doc comments, the loss of #region markers or #if directives is a small thing and perhaps not very important. If csc doesn't let us ignore mismatched #region, we could instead define an EC# setting to block their output; OTOH my idea of "reification" - empty statements to which trivia can be attached - would solve most of the problem (and improve comment handling), except that users would need to surround #region and #endregion with blank lines to guarantee no accidental deletion. A reliable way to avoid csc errors is to emit #region and #endregion as // comments in the output, though of course that's not ideal either. It seems like any "perfect" solution would be more trouble than it's worth.
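
The "emit as // comments" fallback could be sketched in C# as a pass over the printed output. RegionCommenter is a hypothetical helper, not an existing EC# setting, and this line-based version assumes directives never appear inside multi-line string literals.

```csharp
using System;

public static class RegionCommenter
{
    // Rewrites "#region X" / "#endregion" lines as "// #region X" / "// #endregion",
    // preserving indentation, so mismatched markers can no longer cause csc errors.
    public static string CommentOutRegions(string source)
    {
        string[] lines = source.Replace("\r\n", "\n").Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            string trimmed = lines[i].TrimStart();
            if (trimmed.StartsWith("#region", StringComparison.Ordinal) ||
                trimmed.StartsWith("#endregion", StringComparison.Ordinal))
            {
                string indent = lines[i].Substring(0, lines[i].Length - trimmed.Length);
                lines[i] = indent + "// " + trimmed;
            }
        }
        return string.Join("\n", lines);
    }
}
```

The markers stay readable for humans, but editors will no longer treat them as collapsible regions, which is the "not ideal" part.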

j4m3z0r
Jul 19, 2018

My 2c: preserving #region and comments is pretty important to me -- I'm generating code that forms an API I expect people to use. I want all the docstrings, etc to be preserved in the output, and the structure of #region greatly simplifies navigating the code when trying to debug everything.

I realize this is not a complete solution, but given the complexity of a "proper" solution, it might incrementally improve matters: what about adding a region directive to LeMP that emits regions? I imagine the code being something like this:

class C {
    region("Constructors") {
        public C(string s) { /*...*/ }
        public C(int i) { /* ... */ }
    }
}

If we do ever come up with a better solution that just preserves regions in the code, that's fine -- I see no harm in having 2 ways to emit these blocks. As I'm imagining it, this would promote regions to something akin to a class, and it would contain the elements inside of it.

I guess the question is to what extent do you want EC# to be a strict superset of C#?

Comments are trickier -- I can't see any easy way to constrain the problem space such that there's a better representation of them without eliminating an important use-case. The one thought I have there is that I feel like associating a comment with another node in the AST isn't the right approach. Since comments have no meaning to the compiler there isn't a way to do this reliably and still keep the full flexibility of the comment mechanism. I suppose it would not be unreasonable to introduce some constraints about comments to the language, but I feel like restructuring things to not rely on associating them with other items is the way forward.

qwertie
Jul 19, 2018
Maintainer Author

@j4m3z0r Well, you could use #rawText("#region foo"); today (this will not be deleted like in the problem you were having, since it is a proper statement) but you've reminded me that you can write a region macro:

define region($name, { $(..code); }) {
  #rawText("#region "); #rawText($name);
  $(..code);
  #rawText("#endregion "); 
}

(As I still haven't added a stringify alternative that can concatenate things, the concatenation here is implicit. It only works because the macro puts the two #rawText nodes on the same line. The implementation detail here is that it causes a #trivia_appendStatement trivia on the second #rawText, which suppresses the newline that ordinarily appears between statements.)

j4m3z0r
Jul 19, 2018

What a neat trick! I wasn't aware that you could have curly braces trailing a macro invocation pass that block as an argument to the macro. There's a broader discussion to be had here, but that solves my issue with regions fairly nicely, and lets me ensure that all these region markers are output consistently, which appeals to my inner* obsessive compulsive.

[*] Not actually that inner.

qwertie
Jul 20, 2018
Maintainer Author

How it works is that there is a syntactic sugar where f(a) {b} actually means f(a, {b}). So it's a feature of the parser, not the define macro itself.
