RFC 0326: restrict-xXX-to-ascii

lang (syntax | expressions)

Summary

In string literal contexts, restrict \xXX escape sequences to just the range of ASCII characters, \x00 -- \x7F. \xXX inputs in string literals with higher numbers are rejected (with an error message suggesting that one use an \uNNNN escape).

Motivation

In a string literal context, the current \xXX character escape sequence is potentially confusing when given inputs greater than 0x7F, because it does not encode that byte literally, but instead encodes whatever the escape sequence \u00XX would produce.

Thus, for inputs greater than 0x7F, \xXX will encode multiple bytes into the generated string literal, as illustrated in the Rust example appendix.

This is different from what C/C++ programmers might expect (see Behavior of xXX in C appendix).

(It would not be legal to encode the single byte literally into the string literal, since then the string would not be well-formed UTF-8.)

It has been suggested that the \xXX character escape should be removed entirely (at least from string literal contexts). This RFC is taking a slightly less aggressive stance: keep \xXX, but only for ASCII inputs when it occurs in string literals. This way, people can continue using this escape format (which shorter than the \uNNNN format) when it makes sense.

Here are some links to discussions on this topic, including direct comments that suggest exactly the strategy of this RFC.

Note in particular the meeting minutes bullet, where the team explicitly decided to keep things "as they are".

However, at the time of that meeting, Rust did not have byte string literals; people were converting string-literals into byte arrays via the bytes! macro. (Likewise, the rust-dev post is also from a time, summer 2012, when we did not have byte-string literals.)

We are in a different world now. The fact that now \xXX denotes a code unit in a byte-string literal, but in a string literal denotes a codepoint, does not seem elegant; it rather seems like a source of confusion. (Caveat: While Felix does believe this assertion, this context-dependent interpretation of \xXX does have precedent in both Python and Racket; see Racket example and Python example appendices.)

By restricting \xXX to the range 0x00--0x7F, we side-step the question of "is it a code unit or a code point?" entirely (which was the real context of both the rust-dev thread and the meeting minutes bullet). This RFC is a far more conservative choice that we can safely make for the short term (i.e. for the 1.0 release) than it would have been to switch to a "\xXX is a code unit" interpretation.

The expected outcome is reduced confusion for C/C++ programmers (which is, after all, our primary target audience for conversion), and any other language where \xXX never results in more than one byte. The error message will point them to the syntax they need to adopt.

Detailed design

In string literal contexts, \xXX inputs with XX > 0x7F are rejected (with an error message that mentions either, or both, of \uNNNN escapes and the byte-string literal format b"..").

The full byte range remains supported when \xXX is used in byte-string literals, b"..."

Raw strings by design do not offer escape sequences, so they are unchanged.

Character and string escaping routines (such as core::char::escape_unicode, and such as used by the "{:?}" formatter) are updated so that string inputs that previously would previously have printed \xXX with XX > 0x7F are updated to use \uNNNN escapes instead.

Drawbacks

Some reasons not to do this:

The thesis of this RFC is that the first bullet is a falsehood.

While there is some precedent for the "\xXX is code point" interpretation in some languages, the majority do seem to favor the "\xXX is code unit" point of view. The proposal of this RFC is side-stepping the distinction by limiting the input range for \xXX.

The third bullet is a strawman since we have not yet released 1.0, and thus everything is up for change.

This RFC makes no comment on the validity of the fourth bullet.

Alternatives

Unresolved questions

None.

Appendices

Behavior of xXX in C

Here is a C program illustrating how xXX escape sequences are treated in string literals in that context:

#include <stdio.h>

int main() {
    char *s;

    s = "a";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);

    s = "\x61";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);

    s = "\x7F";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);

    s = "\x80";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);
    return 0;
}

Its output is the following:

% gcc example.c && ./a.out
s[0]: 97
s[1]: 0
s[0]: 97
s[1]: 0
s[0]: 127
s[1]: 0
s[0]: -128
s[1]: 0

Rust example

Here is a Rust program that explores the various ways \xXX sequences are treated in both string literal and byte-string literal contexts.

 #![feature(macro_rules)]

fn main() {
    macro_rules! print_str {
        ($r:expr, $e:expr) => { {
            println!("{:>20}: \"{}\"",
                     format!("\"{}\"", $r),
                     $e.escape_default())
        } }
    }

    macro_rules! print_bstr {
        ($r:expr, $e:expr) => { {
            println!("{:>20}: {}",
                     format!("b\"{}\"", $r),
                     $e)
        } }
    }

    macro_rules! print_bytes {
        ($r:expr, $e:expr) => {
            println!("{:>9}.as_bytes(): {}", format!("\"{}\"", $r), $e.as_bytes())
        } }

    // println!("{}", b"\u0000"); // invalid: \uNNNN is not a byte escape.
    print_str!(r"\0", "\0");
    print_bstr!(r"\0", b"\0");
    print_bstr!(r"\x00", b"\x00");
    print_bytes!(r"\x00", "\x00");
    print_bytes!(r"\u0000", "\u0000");
    println!("");
    print_str!(r"\x61", "\x61");
    print_bstr!(r"a", b"a");
    print_bstr!(r"\x61", b"\x61");
    print_bytes!(r"\x61", "\x61");
    print_bytes!(r"\u0061", "\u0061");
    println!("");
    print_str!(r"\x7F", "\x7F");
    print_bstr!(r"\x7F", b"\x7F");
    print_bytes!(r"\x7F", "\x7F");
    print_bytes!(r"\u007F", "\u007F");
    println!("");
    print_str!(r"\x80", "\x80");
    print_bstr!(r"\x80", b"\x80");
    print_bytes!(r"\x80", "\x80");
    print_bytes!(r"\u0080", "\u0080");
    println!("");
    print_str!(r"\xFF", "\xFF");
    print_bstr!(r"\xFF", b"\xFF");
    print_bytes!(r"\xFF", "\xFF");
    print_bytes!(r"\u00FF", "\u00FF");
    println!("");
    print_str!(r"\u0100", "\u0100");
    print_bstr!(r"\x01\x00", b"\x01\x00");
    print_bytes!(r"\u0100", "\u0100");
}

In current Rust, it generates output as follows:

% rustc --version && echo && rustc example.rs && ./example
rustc 0.12.0-pre (d52d0c836 2014-09-07 03:36:27 +0000)

                "\0": "\x00"
               b"\0": [0]
             b"\x00": [0]
   "\x00".as_bytes(): [0]
 "\u0000".as_bytes(): [0]

              "\x61": "a"
                b"a": [97]
             b"\x61": [97]
   "\x61".as_bytes(): [97]
 "\u0061".as_bytes(): [97]

              "\x7F": "\x7f"
             b"\x7F": [127]
   "\x7F".as_bytes(): [127]
 "\u007F".as_bytes(): [127]

              "\x80": "\x80"
             b"\x80": [128]
   "\x80".as_bytes(): [194, 128]
 "\u0080".as_bytes(): [194, 128]

              "\xFF": "\xff"
             b"\xFF": [255]
   "\xFF".as_bytes(): [195, 191]
 "\u00FF".as_bytes(): [195, 191]

            "\u0100": "\u0100"
         b"\x01\x00": [1, 0]
 "\u0100".as_bytes(): [196, 128]
%

Note that the behavior of \xXX on byte-string literals matches the expectations established by the C program in Behavior of xXX in C; that is good. The problem is the behavior of \xXX for XX > 0x7F in string-literal contexts, namely in the fourth and fifth examples where the .as_bytes() invocations are showing that the underlying byte array has two elements instead of one.

Racket example

% racket
Welcome to Racket v5.93.
> (define a-string "\xbb\n")
> (display a-string)
»
> (bytes-length (string->bytes/utf-8 a-string))
3
> (define a-byte-string #"\xc2\xbb\n")
> (bytes-length a-byte-string)
3
> (display a-byte-string)
»
> (exit)
%

The above code illustrates that in Racket, the \xXX escape sequence denotes a code unit in byte-string context (#".." in that language), while it denotes a code point in string context ("..").

Python example

% python
Python 2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a_string = u"\xbb\n";
>>> print a_string
»

>>> len(a_string.encode("utf-8"))
3
>>> a_byte_string = "\xc2\xbb\n";
>>> len(a_byte_string)
3
>>> print a_byte_string
»

>>> exit()
%

The above code illustrates that in Python, the \xXX escape sequence denotes a code unit in byte-string context (".." in that language), while it denotes a code point in unicode string context (u"..").