Let’s LEX!
So this is actually one of the easier parts of a programming language: we need to turn the raw text of a program into a sequence of tokens (lexemes) that will be easier to parse. Specifically, we want to:
- Remove all whitespace and comments
- Store the row and column with the token to make debugging easier
So let’s do it!
First, the definitions of all the tokens we need to be able to handle:
export const TOKEN_PATTERNS = [
/^[\(\)\[\]\{\}]/, // Brackets
/^,/, // Commas
/^:/, // Colons
/^(false|true)/, // Booleans
/^0x[0-9a-fA-F]+/, // Hexadecimal literals
/^\-?\d+\/\d+/, // Fractions
/^\-?\d+\.\d+/, // Decimals
/^\-?\d+(deg|rad)/, // Angles with units
/^\-?\d+/, // Integers
/^[a-zA-Z][a-zA-Z0-9_-]*/, // Words
/^(\.\.|\+|\-|\/|\*)/, // Operators
/^"([^\\"]|\\.)*"/, // Strings (with escaping)
]
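Note that each pattern is anchored with ^, so it can only ever match at the very start of the (remaining) text. For example:

// Anchored patterns match at the start of the string or not at all
console.log("123abc".match(/^\-?\d+/)[0]) // "123"
console.log("abc123".match(/^\-?\d+/)) // null, the digits aren't at the start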
For the most part, they’re all unique. Brackets (()[]{}) are all treated as their own tokens, since those will define params, lists, and groups respectively (see my previous post). The comma (,) is a token that should only show up in params, but it does break up tokens in that case. Likewise, the colon (:) is only used in kwargs during param parsing. Then we have literal booleans. After that, we start getting slightly more interesting, with numbers!
- Hexadecimal literals: 0x__
- Fractions: _/_
- Decimals: _._
- Angles: _deg and _rad
- Integers: _
These are the main case where we have to match in order: if we matched integers before fractions (for example), 1/2 would lex as integer(1), operator(/), integer(2) (which would mostly be fine) instead of the single fraction we want. Likewise with angles, which would otherwise split into an integer and a word.
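To make the ordering concrete, here’s a minimal sketch using just the fraction and integer patterns from above:

// First match wins, so fractions have to be tried before integers
const FRACTION = /^\-?\d+\/\d+/
const INTEGER = /^\-?\d+/

console.log("1/2".match(FRACTION)[0]) // "1/2", the single token we want
console.log("1/2".match(INTEGER)[0]) // "1", what we'd get if integers were tried first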
After that, we have words, which are basically any name. They have to start with a letter, but after that can contain alphanumerics, underscores, and dashes. Words come before operators so that a-b is a single word, not word(a), operator(-), word(b).
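For example:

// Since words are matched before operators, dashes inside a name stay in the name
const WORD = /^[a-zA-Z][a-zA-Z0-9_-]*/
console.log("a-b".match(WORD)[0]) // "a-b", one word token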
Operators will be used in expressions, the mathematical parts of the language. The main interesting one here is .., which makes ranges.
And finally, strings. We allow double quoted strings with arbitrary escape sequences: a string is a pair of double quotes around any characters, where an escaped quote (\") does not count as the end.
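Here’s that pattern handling an embedded escaped quote (note the doubled backslashes below, since the input is itself written as a Javascript string literal):

const STRING = /^"([^\\"]|\\.)*"/
// The text being matched is: "hello \"world\""
console.log('"hello \\"world\\""'.match(STRING)[0]) // matches the whole string, escapes included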
And that’s actually it! Let’s use that to turn a raw string into a sequence of tokens:
import logging from "../lib/logging.js"
const log = logging.get("lexer")
import { TOKEN_PATTERNS } from "./constants.js"
export default function lex(text) {
    // row/col track where we are in the original text, for debugging
    let row = 0, col = 0
    let tokens = []
    while (text.length > 0) {
        // Match comments up to (but not including) the end of the line;
        // the newline case below handles the line break itself
        let match = text.match(/^#[^\n]*/)
        if (match !== null) {
            col += match[0].length
            text = text.substring(match[0].length)
            continue
        }
        // Try to match against known tokens
        let matched = false
        for (let token_pattern of TOKEN_PATTERNS) {
            let match = text.match(token_pattern)
            if (match === null) continue
            let token = match[0]
            let chars = token.length
            tokens.push({ row, col, token })
            col += chars
            text = text.substring(chars)
            matched = true
            break
        }
        if (matched) continue
        // Try to match against a newline: advance a row, reset the column
        if (text[0] === "\n") {
            row += 1
            col = 0
            text = text.substring(1)
            continue
        }
        // Try to match against other (non-newline) whitespace
        match = text.match(/^[^\S\n]+/)
        if (match !== null) {
            let chars = match[0].length
            col += chars
            text = text.substring(chars)
            continue
        }
        // Error case, no idea what's at the beginning; log it and bail,
        // since otherwise we'd loop forever without consuming anything
        let context = text.substring(0, 10).replace(/\n/g, "\\n")
        if (text.length > 10) context += "..."
        log.error("Lex error, unknown token at", row, ":", col, ":", context)
        return tokens
    }
    return tokens
}
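As a quick sanity check, here’s what that produces on a small input (the syntax here is just made up for illustration):

console.log(lex("circle(r: 1/2) # a comment"))
// [
//   { row: 0, col: 0, token: "circle" },
//   { row: 0, col: 6, token: "(" },
//   { row: 0, col: 7, token: "r" },
//   { row: 0, col: 8, token: ":" },
//   { row: 0, col: 10, token: "1/2" },
//   { row: 0, col: 13, token: ")" },
// ]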
I’ll admit, making substrings all the time is potentially a performance problem. Javascript strings are immutable, and I believe engines are free to share the underlying storage when you call substring, though the spec doesn’t guarantee it. (Java, with immutable strings by default, used to share the backing array for substrings, though newer versions make a copy.) Given the size of the strings we’ve dealt with so far though… it’s still not a problem. Fast enough and it’s functional!
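If it ever did become a problem, one common alternative (just a sketch, not what the code above does) is to keep the whole string and track an index into it, using sticky (y-flagged) regexes so each pattern anchors at the current position instead of requiring a slice:

// Sticky regexes match exactly at lastIndex, so we can scan without slicing
const INTEGER = /-?\d+/y

function matchAt(pattern, text, pos) {
    pattern.lastIndex = pos
    return pattern.exec(text)
}

console.log(matchAt(INTEGER, "12 34", 3)[0]) // "34", matched in place at index 3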
Here’s a demo! (Interactive on the original post: a Source input, the Lexed token output, and a Log panel showing the most recent messages first.)
Onward!
As before, here’s the current source: jpverkamp/runelang
And here is the entire series (as I write them):