RFC 0446: Unicode escape syntax

libs (syntax)

Summary

Remove the \u203D and \U0001F4A9 Unicode string escapes, and add ECMAScript 6-style \u{1F4A9} escapes instead.

Motivation

The syntax of \u followed by four hexadecimal digits dates from when Unicode was a 16-bit encoding and only went up to U+FFFF. \U followed by eight hex digits was added as a band-aid when Unicode was extended to U+10FFFF, but neither four nor eight digits makes much sense now: four is too few to reach every code point, and eight is more than the six that U+10FFFF requires.

Having two syntaxes with the same meaning that apply to different ranges of values is inconsistent and arbitrary. This proposal unifies them into a single syntax that has a precedent in ECMAScript (a.k.a. JavaScript).

Detailed design

In terms of the grammar in The Rust Reference, replace:

unicode_escape : 'u' hex_digit 4
               | 'U' hex_digit 8 ;

with

unicode_escape : 'u' '{' hex_digit+ 6 '}' ;

That is, \u{ followed by one to six hexadecimal digits, followed by }.
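As an illustration only, the rule above could be recognized by a small function like the sketch below. The name parse_unicode_escape is hypothetical and is not part of the compiler. Rejecting surrogate code points and values above U+10FFFF is an assumption made here, based on a Rust char being a Unicode scalar value.

```rust
/// Hypothetical helper, for illustration only: takes the text that follows
/// the backslash of an escape (for example `u{1F4A9}`) and returns the
/// character it denotes, or `None` if the text does not match the grammar.
fn parse_unicode_escape(s: &str) -> Option<char> {
    // Require the literal `u{` opener and the closing `}`.
    let digits = s.strip_prefix("u{")?.strip_suffix('}')?;

    // The grammar allows between one and six hexadecimal digits.
    if digits.is_empty()
        || digits.len() > 6
        || !digits.bytes().all(|b| b.is_ascii_hexdigit())
    {
        return None;
    }

    let value = u32::from_str_radix(digits, 16).ok()?;

    // Assumption: values that are not Unicode scalar values are rejected.
    // `char::from_u32` returns `None` for surrogates and for values
    // above 0x10FFFF.
    char::from_u32(value)
}
```

Under these assumptions, `u{203D}` and `u{1F4A9}` are accepted, while `u{}`, `u{0000041}` (seven digits) and `u{12G}` are rejected.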

The behavior would otherwise be identical.
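For a concrete picture, the short sketch below writes the two characters from the Summary in the braced form, as accepted by current Rust compilers. The constant names and the code_points helper are hypothetical and exist only for this example.

```rust
// U+203D INTERROBANG, written \u203D in the old four-digit syntax.
const INTERROBANG: char = '\u{203D}';

// U+1F4A9, written \U0001F4A9 in the old eight-digit syntax.
const PILE_OF_POO: &str = "\u{1F4A9}";

// Leading zeros are allowed as long as there are at most six digits,
// so this denotes the same character as '\u{41}'.
const LATIN_CAPITAL_A: char = '\u{000041}';

/// Hypothetical helper: lists the code points of a string as integers.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

The number of digits does not affect the value: `'\u{41}'`, `'\u{0041}'` and `'\u{000041}'` all denote the letter A.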

Migration strategy

In order to provide a graceful transition from the old \uDDDD and \UDDDDDDDD syntax to the new \u{DDDDD} syntax, this feature should be added in stages:

Drawbacks

Alternatives

Unresolved questions

None so far.