Quirks of Rust's token representation
Like most compilers, `rustc` (the Rust compiler) has a lexer that breaks source
code into tokens, which are small units such as identifiers, literals,
operators, and punctuation. The parser then checks that these tokens are
present in an order that satisfies Rust’s grammar.
Rust has many fixed-length tokens, mostly operators and punctuation:
- One-char: `;`, `,`, `.`, `(`, `)`, `{`, `}`, `[`, `]`, `@`, `#`, `~`, `?`, `:`, `$`, `=`, `!`, `<`, `>`, `-`, `&`, `|`, `+`, `*`, `/`, `^`, `%`.
- Two-char: `==`, `&&`, `||`, `+=`, `-=`, `*=`, `/=`, `%=`, `^=`, `&=`, `|=`, `<<`, `>>`, `..`, `::`, `<-`, `->`, `=>`.
- Three-char: `<<=`, `>>=`, `...`, `..=`.
Every two-char and three-char token starts with a character that is also a
legitimate one-char token. If the lexer sees two `&` characters in a row, is it seeing two `&` reference operators, or is it seeing a single `&&` logical AND operator? It can’t tell, because it doesn’t have enough information. Not until full parsing is done can this decision be made, based on the syntactic context.
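To make the ambiguity concrete, here is a small standalone program (my own illustration, not taken from rustc) in which the same two-character sequence `&&` must be read both ways:

```rust
fn main() {
    let x = 1;
    // Here `&&` must be read as two `&` reference operators:
    // `r` is a reference to a reference to `x`.
    let r: &&i32 = &&x;

    let a = true;
    let b = false;
    // Here the same two characters form a single `&&` logical AND operator.
    let c = a && b;

    println!("{} {}", **r, c);
}
```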
So the compiler designers must make a choice between two representations.
- Split: Represent a multi-char sequence like `&&` as multiple single-char tokens, and combine them when necessary.
- Joined: Represent a multi-char sequence like `&&` as a single token, and split it when necessary. (Both strategies are sketched below.)
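Here is that sketch, with hypothetical token types (not rustc’s actual definitions):

```rust
// Hypothetical token types illustrating the two strategies; rustc's real
// types are more involved.

// Split: every punctuation character is its own token, so `&&` lexes to
// two `And` tokens that may later be treated as one operator.
#[derive(Debug)]
enum SplitToken {
    And, // `&`
    // ... one variant per one-char token
}

// Joined: multi-char operators are single tokens, so `&&` lexes to one
// `AndAnd` token that may later need to be broken apart.
#[derive(Debug)]
#[allow(dead_code)]
enum JoinedToken {
    And,    // `&`
    AndAnd, // `&&`
    // ... one variant per fixed-length token
}

fn main() {
    // The two characters `&&` under each strategy:
    let split = [SplitToken::And, SplitToken::And];
    let joined = [JoinedToken::AndAnd];
    println!("split: {split:?}\njoined: {joined:?}");
}
```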
`rustc` does… a mixture of the two. It’s complicated.

`rustc` actually has two lexers.
- The first lexer is in the `rustc_lexer` crate, which is shared between `rustc` and `rust-analyzer`. Its tokens use a split representation.
- The second lexer is in the `lexer` module of the `rustc_parse` crate. It converts the first lexer’s tokens into AST tokens, which use a joined representation. The parser splits tokens when necessary using a method, `Parser::break_and_eat` (a toy version of this splitting is sketched after this list).
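To see why a joined representation needs splitting at all, consider parsing the type `&&i32`: the lexer produces a single `&&` token, but the grammar consumes one `&` at a time. Here is a toy sketch of that splitting step, with made-up types; the real `Parser::break_and_eat` is more general:

```rust
// Toy tokens and a toy "break and eat" step; not rustc's real data structures.
#[derive(Debug, PartialEq)]
enum Tok {
    And,    // `&`
    AndAnd, // `&&` stored as a single joined token
    Ident(String),
}

// Try to consume one `&`. If the front token is a joined `&&`, break it:
// consume one half and leave a plain `&` behind for the next call.
fn eat_amp(toks: &mut Vec<Tok>) -> bool {
    match toks.first() {
        Some(Tok::And) => {
            toks.remove(0);
            true
        }
        Some(Tok::AndAnd) => {
            toks[0] = Tok::And;
            true
        }
        _ => false,
    }
}

fn main() {
    // `&&i32` lexed with the joined strategy:
    let mut toks = vec![Tok::AndAnd, Tok::Ident("i32".into())];
    let mut depth = 0;
    while eat_amp(&mut toks) {
        depth += 1;
    }
    // Two levels of reference, with `i32` left to parse.
    println!("depth = {depth}, rest = {toks:?}");
}
```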
Plus there is a third token representation: proc macros receive and produce a `proc_macro::TokenStream`. These token streams use the split strategy for the `Punct` type, in combination with the `Spacing` type that indicates whether two adjacent tokens can be joined.
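For instance, using the `proc_macro2` crate (which mirrors the compiler’s `proc_macro` API and can run outside macro expansion), the split representation and the spacing information are easy to observe:

```rust
// Requires the `proc_macro2` crate as a dependency.
use proc_macro2::{TokenStream, TokenTree};

fn main() {
    // `&&` becomes two `Punct` tokens for '&'; the first is `Spacing::Joint`,
    // recording that it may be rejoined with the punct that follows it.
    let stream: TokenStream = "a && b".parse().unwrap();
    for tree in stream {
        match tree {
            TokenTree::Punct(p) => {
                println!("Punct {:?} with spacing {:?}", p.as_char(), p.spacing());
            }
            other => println!("{other}"),
        }
    }
}
```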
When converting from AST tokens to proc macro tokens, joined tokens are resplit. And when converting from proc macro tokens back to AST tokens, split tokens are rejoined.
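A rough sketch of those two conversions, using made-up minimal types (the real conversions in rustc cover every token kind, not just `&`):

```rust
// Made-up minimal types; rustc's real conversion code handles every token.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Spacing { Joint, Alone }

#[derive(Debug, Clone, Copy)]
struct Punct { ch: char, spacing: Spacing }

#[derive(Debug)]
enum AstTok { And, AndAnd }

// AST -> proc macro: resplit a joined `&&` into two puncts, marking the
// first `Joint` so the pair can later be put back together.
fn resplit(tok: &AstTok) -> Vec<Punct> {
    match tok {
        AstTok::And => vec![Punct { ch: '&', spacing: Spacing::Alone }],
        AstTok::AndAnd => vec![
            Punct { ch: '&', spacing: Spacing::Joint },
            Punct { ch: '&', spacing: Spacing::Alone },
        ],
    }
}

// Proc macro -> AST: rejoin a `Joint` '&' with the '&' that follows it.
// (This toy only knows about '&' puncts.)
fn rejoin(puncts: &[Punct]) -> Vec<AstTok> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < puncts.len() {
        if puncts[i].ch == '&'
            && puncts[i].spacing == Spacing::Joint
            && puncts.get(i + 1).map(|p| p.ch) == Some('&')
        {
            out.push(AstTok::AndAnd);
            i += 2;
        } else {
            out.push(AstTok::And);
            i += 1;
        }
    }
    out
}

fn main() {
    let split = resplit(&AstTok::AndAnd);
    println!("{split:?}");            // two '&' puncts, Joint then Alone
    println!("{:?}", rejoin(&split)); // back to [AndAnd]
}
```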
I finally got all this clear in my head just last week. I’m now wondering whether the AST tokens could be changed to a split representation. That way, split representations would be used everywhere. I haven’t yet investigated how difficult this would be, nor what the benefits might be.
Update: Aleksey Kladov pointed out an existing issue suggesting exactly this change to AST tokens.