From Source Code to Tokens: A Lexer's Job

I finished reading the first chapter of Writing an Interpreter in Go by Thorsten Ball. This article is my attempt to walk through what I learned.

What is an Interpreter?#

An interpreter is a program which reads source code and executes it directly. It doesn’t produce a binary executable like a compiler would. It takes your source code and performs the actions defined by it.

To do this, an interpreter works in three stages:

Lexer: turns source code into tokens
Parser: turns tokens into an Abstract Syntax Tree (AST)
Evaluator: walks the AST and executes it

In this article, we’ll focus on the first stage, the lexer.

The Lexer#

The lexer reads the source code and converts it into tokens. The number and types of tokens depend on the programming language being interpreted.

For example, the following source code:

let age = 1;

gets tokenized to:

LET
IDENTIFIER("age")
ASSIGN
INTEGER(1)
SEMICOLON

Each token corresponds to a piece of the source code. The exact meaning of a token depends entirely on the interpreter’s implementation. Some other implementation might treat 1 as a string token in the lexer stage, and convert it to type integer only during parsing.

The Token#

In the book’s implementation, every token has just two fields:

type Token struct {
	Type    TokenType
	Literal string
}

Type tells the parser type of token it’s dealing with, an integer, an identifier, an operator. Literal preserves the original text, which matters for identifiers like age and values like 42 where the specific characters carry meaning.

Two special token types are worth calling out:

EOF: emitted once the lexer runs out of input, so the parser knows when to stop.
ILLEGAL: emitted when the lexer encounters a character it doesn’t recognize. Rather than crashing, it produces an ILLEGAL token and keeps going, which makes error reporting cleaner downstream.

Tracking Position#

The lexer walks through the input one character at a time. To do this, it tracks where it is:

type Lexer struct {
	input        string
	position     int
	readPosition int
	ch           byte
}

Two position fields might seem redundant, but they serve different purposes:

position points to the character we’re currently looking at.
readPosition points to the next character we’ll read.

This separation is what lets us “peek ahead” without committing to the next character, which turns out to matter for two character operators like == and !=.

The Interesting Problem: Two-Character Tokens#

When the lexer sees =, it can’t immediately decide what kind of token that is. It might be an ASSIGN (=), or it might be the start of an EQUALS (==). The same ambiguity exists for !, is it NEGATE, or the start of NOT_EQUALS?

This is where readPosition becomes handy:

func (l *Lexer) peekChar() byte {
	if l.readPosition >= len(l.input) {
		return 0
	}
	return l.input[l.readPosition]
}
 
// and we handle the case in the switch
case '=':
	if l.peekChar() == '=' {
		ch := l.ch
		l.readChar()
		literal := string(ch) + string(l.ch)
		tok = token.Token{Type: token.EQUALS, Literal: literal}
	} else {
		tok = newToken(token.ASSIGN, l.ch)
	}

If the next character is also =, we consume both and produce an EQUALS token. Otherwise, we emit ASSIGN and move on. Without the peek, we’d have to commit to ASSIGN first and then somehow walk it back which is exactly the kind of mess we’re trying to avoid.

Keywords vs. Identifiers#

Keywords look exactly like identifiers to the lexer. They’re just strings of letters. let and age are indistinguishable until the lexer has read the whole word.

So the lexer reads the full word first, then looks it up:

func LookupIdent(ident string) TokenType {
	if tok, ok := keywords[ident]; ok {
		return tok
	}
	return IDENT
}

If the word matches a reserved keyword, we return the corresponding token type. Otherwise, it’s a plain identifier. This is the reason you can’t name a variable let in Monkey. The lexer will categorize it as a LET token before the parser ever sees it.

One small implementation detail: the lexer treats _ as a letter, which means identifiers like my_var and _private are valid. This is a small language-design decision baked right into the lexer. Some programming language might not want to allow it. They do it here in the lexing stage.

Putting It All Together#

The test file exercises the lexer with a chunk of Monkey code that uses nearly every token type:

let five = 5;
let ten = 10;
 
let add = fn(x, y) {
    x + y;
}
 
let result = add(five, ten);
!-/*5;
5 < 10 > 5;
 
if (5 < 10) {
    return true;
} else {
    return false;
}
10 == 10;
10 != 9;

Running this through the lexer produces tokens LET, IDENT("five"), ASSIGN, INT("5"), SEMICOLON, and so on, all the way to EOF. No meaning yet. No idea that add is being bound to a function, or that fn(x, y) takes two parameters. Just categorized pieces, ready for the parser to assemble into something structured.

What’s Next#

The lexer’s output is a flat stream of tokens. The parser’s job is to take that stream and build a tree, the Abstract Syntax Tree. The AST captures the structure of the code: which expressions are nested inside which, which function calls take which arguments, which branches of an if go where.

That’s chapter two and the story for the next article.