RFC 0550: macro-future-proofing

lang (macros | syntax)

Summary

Future-proof the allowed forms that input to an MBE can take by requiring certain delimiters following NTs in a matcher. In the future, it will be possible to lift these restrictions backwards compatibly if desired.

Key Terminology

Example:

macro_rules! i_am_an_mbe {
    (start $foo:expr $($i:ident),* end) => ($foo)
}

(start $foo:expr $($i:ident),* end) is a matcher. The whole matcher is a delimited sequence (with open- and close-delimiters ( and )), and $foo and $i are simple NT's with expr and ident as their respective fragment specifiers.

$(i:ident),* is also an NT; it is a complex NT that matches a comma-seprated repetition of identifiers. The , is the separator token for the complex NT; it occurs in between each pair of elements (if any) of the matched fragment.

Another example of a complex NT is $(hi $e:expr ;)+, which matches any fragment of the form hi <expr>; hi <expr>; ... where hi <expr>; occurs at least once. Note that this complex NT does not have a dedicated separator token.

(Note that Rust's parser ensures that delimited sequences always occur with proper nesting of token tree structure and correct matching of open- and close-delimiters.)

Motivation

In current Rust (version 0.12; i.e. pre 1.0), the macro_rules parser is very liberal in what it accepts in a matcher. This can cause problems, because it is possible to write an MBE which corresponds to an ambiguous grammar. When an MBE is invoked, if the macro parser encounters an ambiguity while parsing, it will bail out with a "local ambiguity" error. As an example for this, take the following MBE:

macro_rules! foo {
    ($($foo:expr)* $bar:block) => (/*...*/)
};

Attempts to invoke this MBE will never succeed, because the macro parser will always emit an ambiguity error rather than make a choice when presented an ambiguity. In particular, it needs to decide when to stop accepting expressions for foo and look for a block for bar (noting that blocks are valid expressions). Situations like this are inherent to the macro system. On the other hand, it's possible to write an unambiguous matcher that becomes ambiguous due to changes in the syntax for the various fragments. As a concrete example:

macro_rules! bar {
    ($in:ty ( $($arg:ident)*, ) -> $out:ty;) => (/*...*/)
};

When the type syntax was extended to include the unboxed closure traits, an input such as FnMut(i8, u8) -> i8; became ambiguous. The goal of this proposal is to prevent such scenarios in the future by requiring certain "delimiter tokens" after an NT. When extending Rust's syntax in the future, ambiguity need only be considered when combined with these sets of delimiters, rather than any possible arbitrary matcher.


Another example of a potential extension to the language that motivates a restricted set of "delimiter tokens" is (postponed) RFC 352, "Allow loops to return values other than ()", where the break expression would now accept an optional input expression: break <expr>.

Detailed design

We will tend to use the variable "M" to stand for a matcher, variables "t" and "u" for arbitrary individual tokens, and the variables "tt" and "uu" for arbitrary token trees. (The use of "tt" does present potential ambiguity with its additional role as a fragment specifier; but it will be clear from context which interpretation is meant.)

"SEP" will range over separator tokens, "OP" over the Kleene operators * and +, and "OPEN"/"CLOSE" over matching token pairs surrounding a delimited sequence (e.g. [ and ]).

We also use Greek letters "α" "β" "γ" "δ" to stand for potentially empty token-tree sequences. (However, the Greek letter "ε" (epsilon) has a special role in the presentation and does not stand for a token-tree sequence.)

Note that a matcher is merely a token tree. A "simple NT", as mentioned above, is an meta-variable NT; thus it is a non-repetition. For example, $foo:ty is a simple NT but $($foo:ty)+ is a complex NT.

Note also that in the context of this RFC, the term "token" generally includes simple NTs.

Finally, it is useful for the reader to keep in mind that according to the definitions of this RFC, no simple NT matches the empty fragment, and likewise no token matches the empty fragment of Rust syntax. (Thus, the only NT that can match the empty fragment is a complex NT.)

The Matcher Invariant

This RFC establishes the following two-part invariant for valid matchers

  1. For any two successive token tree sequences in a matcher M (i.e. M = ... tt uu ...), we must have FOLLOW(... tt) ⊇ FIRST(uu ...)

  2. For any separated complex NT in a matcher, M = ... $(tt ...) SEP OP ..., we must have SEP ∈ FOLLOW(tt ...).

The first part says that whatever actual token that comes after a matcher must be somewhere in the predetermined follow set. This ensures that a legal macro definition will continue to assign the same determination as to where ... tt ends and uu ... begins, even as new syntactic forms are added to the language.

The second part says that a separated complex NT must use a seperator token that is part of the predetermined follow set for the internal contents of the NT. This ensures that a legal macro definition will continue to parse an input fragment into the same delimited sequence of tt ...'s, even as new syntactic forms are added to the language.

(This is assuming that all such changes are appropriately restricted, by the definition of FOLLOW below, of course.)

The above invariant is only formally meaningful if one knows what FIRST and FOLLOW denote. We address this in the following sections.

FIRST and FOLLOW, informally

FIRST and FOLLOW are defined as follows.

A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M).

Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also contain a distinguished non-token element ε ("epsilon"), which indicates that M can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.)

Informally:

We use the shorthand ANYTOKEN to denote the set of all tokens (including simple NTs).

(To review one's understanding of the above informal descriptions, the reader at this point may want to jump ahead to the examples of FIRST/LAST before reading their formal definitions.)

FIRST, LAST

Below are formal inductive definitions for FIRST and LAST.

"A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B" denotes set difference (i.e. all elements of A that are not present in B).

FIRST(M), defined by case analysis on the sequence M and the structure of its first token-tree (if any):

Note: The ε-case above,

FIRST(M) = (FIRST(tt ...) \ { ε }) ∪ sep_set ∪ FIRST(α)

may seem complicated, so lets take a moment to break it down. In the ε case, the sequence tt ... may be empty. Therefore our first token may be SEP itself (if it is present), or it may be the first token of α); that's why the result is including "sep_set ∪ FIRST(α)". Note also that if α itself may match the empty fragment, then FIRST(α) will ensure that ε is included in our result, and conversely, if α cannot match the empty fragment, then we must ensure that ε is not included in our result; these two facts together are why we can and should unconditionally remove ε from FIRST(tt ...).


LAST(M), defined by case analysis on M itself (a sequence of token-trees):

NOTE: The presence or absence of SEP is relevant to the above definitions, but solely in the case where the interior of the complex NT could be empty (i.e. ε ∈ FIRST(interior)). (I overlooked this fact in my first round of prototyping.)

NOTE: The above definition for LAST assumes that we keep our pre-existing rule that the seperator token in a complex NT is solely for separating elements; i.e. that such NT's do not match fragments that end with the seperator token. If we choose to lift this restriction in the future, the above definition will need to be revised accordingly.

Examples of FIRST and LAST

Below are some examples of FIRST and LAST. (Note in particular how the special ε element is introduced and eliminated based on the interation between the pieces of the input.)

Our first example is presented in a tree structure to elaborate on how the analysis of the matcher composes. (Some of the simpler subtrees have been elided.)

INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
            ~~~~~~~~   ~~~~~~~                ~
                |         |                   |
FIRST:   { $d:ident }  { $e:expr }          { h }


INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+
            ~~~~~~~~~~~~~~~~~~             ~~~~~~~           ~~~
                       |                      |               |
FIRST:          { $d:ident }               { h, ε }         { f }

INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~~~~    ~~~~~~~~~   ~
                       |                       |              |       |
FIRST:        { $d:ident, ε }            {  h, ε, ;  }      { f }   { g }


INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                        |
FIRST:                       { $d:ident, h, ;,  f }

Thus:

Note however that:

Here are similar examples but now for LAST.

and again, changing the end part of matcher changes its last set considerably:

FOLLOW(M)

Finally, the definition for FOLLOW(M) is built up incrementally atop more primitive functions.

We first assume a primitive mapping, FOLLOW(NT) (defined below) from a simple NT to the set of allowed tokens for the fragment specifier for that NT.

Second, we generalize FOLLOW to tokens: FOLLOW(t) = FOLLOW(NT) if t is (a simple) NT. Otherwise, t must be some other (non NT) token; in this case FOLLOW(t) = ANYTOKEN.

Finally, we generalize FOLLOW to arbitrary matchers by composing the primitive functions above:

FOLLOW(M) = FOLLOW(t1) ∩ FOLLOW(t2) ∩ ... ∩ FOLLOW(tN)
            where { t1, t2, ..., tN } = (LAST(M) \ { ε })

Examples of FOLLOW (expressed as equality relations between sets, to avoid incorporating details of FOLLOW(NT) in these examples):

FOLLOW(NT)

Here is the definition for FOLLOW(NT), which maps every simple NT to the set of tokens that are allowed to follow it, based on the fragment specifier for the NT.

The current legal fragment specifiers are: item, block, stmt, pat, expr, ty, ident, path, meta, and tt.

(Note that close delimiters are valid following any NT.)

Examples of valid and invalid matchers

With the above specification in hand, we can present arguments for why particular matchers are legal and others are not.

Drawbacks

It does restrict the input to a MBE, but the choice of delimiters provides reasonable freedom and can be extended in the future.

Alternatives

  1. Fix the syntax that a fragment can parse. This would create a situation where a future MBE might not be able to accept certain inputs because the input uses newer features than the fragment that was fixed at 1.0. For example, in the bar MBE above, if the ty fragment was fixed before the unboxed closure sugar was introduced, the MBE would not be able to accept such a type. While this approach is feasible, it would cause unnecessary confusion for future users of MBEs when they can't put certain perfectly valid Rust code in the input to an MBE. Versioned fragments could avoid this problem but only for new code.
  2. Keep macro_rules unstable. Given the great syntactical abstraction that macro_rules provides, it would be a shame for it to be unusable in a release version of Rust. If ever macro_rules were to be stabilized, this same issue would come up.
  3. Do nothing. This is very dangerous, and has the potential to essentially freeze Rust's syntax for fear of accidentally breaking a macro.

Edit History

Appendices

Appendix A: Algorithm for recognizing valid matchers.

The detailed design above only sought to provide a specification for what a correct matcher is (by defining FIRST, LAST, and FOLLOW, and specifying the invariant relating FIRST and FOLLOW for all valid matchers.

The above specification can be implemented efficiently; we here give one example algorithm for recognizing valid matchers.

The algorithm for recognizing valid matchers M is named ValidMatcher.

To define it, we will need a mapping from submatchers of M to the FIRST set for that submatcher; that is handled by FirstSet.

Procedure FirstSet(M)

input: a token tree M representing a matcher

output: FIRST(M)

Let M = tts[1] tts[2] ... tts[n].
Let curr_first = { ε }.

For i in n down to 1 (inclusive):
  Let tt = tts[i].

  1. If tt is a token, curr_first := { tt }

  2. Else if tt is a delimited sequence `OPEN uu ... ClOSE`,
     curr_first := { OPEN }

  3. Else tt is a complex NT `$(uu ...) SEP OP`

     Let inner_first = FirstSet(`uu ...`) i.e. recursive call

     if OP == `*` or ε ∈ inner_first then
         curr_first := curr_first ∪ inner_first
     else
         curr_first := inner_first

return curr_first

(Note: If we were precomputing a full table in this procedure, we would need a recursive invocation on (uu ...) in step 2 of the for-loop.)

Predicate ValidMatcher(M)

To simplify the specification, we assume in this presentation that all simple NT's have a valid fragment specifier (i.e., one that has an entry in the FOLLOW(NT) table above.

This algorithm works by scanning forward across the matcher M = α β, (where α is the prefix we have scanned so far, and β is the suffix that remains to be scanned). We maintain LAST(α) as we scan, and use it to compute FOLLOW(α) and compare that to FIRST(β).

input: a token tree, M, and a set of tokens that could follow it, F.

output: LAST(M) (and also signals failure whenever M is invalid)

Let last_of_prefix = { ε }

Let M = tts[1] tts[2] ... tts[n].

For i in 1 up to n (inclusive):
  // For reference:
  // α = tts[1] .. tts[i]
  // β = tts[i+1] .. tts[n]
  // γ is some outer token sequence; the input F represents FIRST(γ)

  1. Let tt = tts[i].

  2. Let first_of_suffix; // aka FIRST(β γ)

  3. let S = FirstSet(tts[i+1] .. tts[n]);

  4. if ε ∈ S then
     // (include the follow information if necessary)

     first_of_suffix := S ∪ F

  5. else

     first_of_suffix := S

  6. Update last_of_prefix via case analysis on tt:

     a. If tt is a token:
        last_of_prefix := { tt }

     b. Else if tt is a delimited sequence `OPEN uu ... CLOSE`:

        i.  run ValidMatcher( M = `uu ...`, F = { `CLOSE` })

       ii. last_of_prefix := { `CLOSE` }

     c. Else, tt must be a complex NT,
        in other words, `NT = $( uu .. ) SEP OP` or `NT = $( uu .. ) OP`:

        i. If SEP present,
          let sublast = ValidMatcher( M = `uu ...`, F = first_of_suffix ∪ { `SEP` })

       ii. else:
          let sublast = ValidMatcher( M = `uu ...`, F = first_of_suffix)

      iii. If ε in sublast then:
           last_of_prefix := last_of_prefix ∪ (sublast \ ε)

       iv. Else:
           last_of_prefix := sublast

  7. At this point, last_of_prefix == LAST(α) and first_of_suffix == FIRST(β γ).

     For each simple NT token t in last_of_prefix:

     a. If first_of_suffix ⊆ FOLLOW(t), then we are okay so far. </li>

     b. Otherwise, we have found a token t whose follow set is not compatible
        with the FIRST(β γ), and must signal failure.

// After running the above for loop on all of `M`, last_of_prefix == LAST(M)

Return last_of_prefix

This algorithm should be run on every matcher in every macro_rules invocation, with F = { EOF }. If it rejects a matcher, an error should be emitted and compilation should not complete.