Like most compilers, rustc (the Rust compiler) has a lexer that breaks source code into tokens, which are small units such as identifiers, literals, operators, and punctuation. The parser then checks that these tokens are present in an order that satisfies Rust’s grammar.

Rust has many fixed-length tokens, mostly operators and punctuation:

  • One-char: ;, ,, ., (, ), {, }, [, ], @, #, ~, ?, :, $, =, !, <, >, -, &, |, +, *, /, ^, %.
  • Two-char: ==, &&, ||, +=, -=, *=, /=, %=, ^=, &=, |=, <<, >>, .., ::, <-, ->, =>.
  • Three-char: <<=, >>=, ..., ..=.

Every two-char and three-char token starts with a character that is also a legitimate one-char token. If the lexer sees two && characters in a row, is it seeing two & reference operators, or is it seeing a && logical AND operator? It can’t tell, because it doesn’t have enough information. Not until full parsing is done can this decision be made, based on the syntactic context.

So the compiler designers must make a choice between two representations.

  • Split: Represent a multi-char sequence like && as multiple single-char tokens, and combine them when necessary.
  • Joined: Represent a multi-char sequence like && as a single token, and split it when necessary.

rustc does… a mixture of the two. It’s complicated.

rustc actually has two lexers.

Plus there is a third token representation: proc macros receive and produce a proc_macro::TokenStream. These token streams use the split strategy for the Punct type, in combination with the Spacing type that indicates if two adjacent tokens can be joined.

When converting from AST tokens to proc macro tokens, joined tokens are resplit. And when converting from proc macro tokens back to AST tokens, split tokens are rejoined.

I finally got all this clear in my head just last week. I’m now wondering whether the AST tokens could be changed to a split representation. That way, split representations would be used everywhere. I haven’t yet investigated how difficult this would be, nor what the benefits might be.

Update: Aleksey Kladov pointed out an existing issue suggesting exactly this change to AST tokens.