StackLang, part 3: parsing. This is going to be the most complicated one thus far! Onward.
Full source code for StackLang: github:jpverkamp/stacklang
Types
Okay, things are getting more interesting. Let's start where I generally do, with the types. In this case, I'm going to split things into a few levels.

`Expression` - the AST itself
Starting at the top level, we have expressions. These are the actual AST and represent one level of the eventual parse tree:
/// An expression is a single unit of a program, part of the AST
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Expression {
/// An identifier/variable, used to lookup a named value or global
Identifier(String),
/// A literal value
Literal(Value),
/// A function definition, generally delimited with {}
Block(Vec<Expression>),
/// A list of values, generally delimited with []
List(Vec<Expression>),
/// A group of values, generally delimited with (), currently used only for clean code
Group(Vec<Expression>),
/// An @ prefixed expression, used to name values on the stack
/// If followed by [], an @list is multiple names
At(Box<Expression>),
/// A ! prefixed expression, used to set values by name
Bang(Box<Expression>),
/// A $ prefixed expression, used to pass to the stack (only really needed for blocks)
Dollar(Box<Expression>),
}
It’s actually for the most part simpler than a lot of ASTs, since (like S-Expression based languages) we really don’t have much complicated syntax. In this case, there are broadly three groups:
- Single tokens: literal values (like `42`, `"Hello world"`, etc; we'll get there) and identifiers (anything else)
- Groups of subexpressions (these could probably have been one expression with a node that stores which type it is, but I found this better since they work differently):
  - Blocks, defined with `{}`; essentially function definitions, important since these will have their own arity
  - Lists, defined with `[]`; currently only used in list naming: `@[a b c]`
  - Groups, defined with `()`; implicitly used for the 'main' block, but can be used to group code mostly for visual appeal; because of the stack nature, this doesn't actually modify control flow or introduce scopes
- Prefixed expressions:
  - `@` expressions, for naming values; the expression should be a single identifier, a list of identifiers, or an integer (arity in), but that isn't a parse time check yet!
  - `!` expressions, for setting values by name or setting arity out; should be a single identifier or integer (arity)
  - `$` expressions, for passing blocks as a parameter rather than applying them
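To make those three groups concrete, here's a rough sketch (my example, not from the post) of the tree that a snippet like `@[a b] a b +` should turn into, given the parser described below:

```rust
// Roughly what `@[a b] a b +` parses to (the whole program is wrapped in a Group)
let tree = Expression::Group(vec![
    Expression::At(Box::new(Expression::List(vec![
        Expression::Identifier(String::from("a")),
        Expression::Identifier(String::from("b")),
    ]))),
    Expression::Identifier(String::from("a")),
    Expression::Identifier(String::from("b")),
    Expression::Identifier(String::from("+")),
]);
```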
One interesting bit is the implementation of `Display` for `Expression`:
macro_rules! write_children {
($f:ident $prefix:literal $children:ident $suffix:literal) => {{
let mut s = String::new();
s.push($prefix);
for (i, child) in $children.iter().enumerate() {
s.push_str(&format!("{}", child));
if i != $children.len() - 1 {
s.push(' ');
}
}
s.push($suffix);
write!($f, "{}", s)
}};
}
impl Display for Expression {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Expression::Identifier(id) => write!(f, "{}", id),
Expression::Literal(value) => write!(f, "{}", value),
Expression::Block(children) => write_children! {f '{' children '}'},
Expression::List(children) => write_children! {f '[' children ']'},
Expression::Group(children) => write_children! {f '(' children ')'},
Expression::At(expr) => write!(f, "@{}", expr),
Expression::Bang(expr) => write!(f, "!{}", expr),
Expression::Dollar(expr) => write!(f, "${}", expr),
}
}
}
It should be possible to round trip this: `Display`ed code should be readable. Mostly the goal is to display expressions without using up quite so much screen real estate. I don't have tests for that though.
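If I did want one, a round trip test could look something like this sketch (not in the repo; it assumes the part 2 tokenizer splits delimiters even without surrounding whitespace, and it avoids strings since `Display` drops their quotes):

```rust
#[test]
fn test_display_round_trip() {
    // Parse, Display, and re-parse; the second parse wraps the printed `(...)`
    // in one extra implicit Group, so compare against that.
    let source = "{ @[a b] a b + } @add 1 2 add writeln";
    let once = parse(tokenize(source.as_bytes()));
    let again = parse(tokenize(once.to_string().as_bytes()));
    assert_eq!(again, Expression::Group(vec![once]));
}
```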
`Value` - values on the stack + literals

Okay, we have expressions, but we're missing `Value`. A `Value` is anything that can actually be stored on the stack, but the type is also used for literals.
/// A value is a literal value that has been evaluated.
#[derive(Clone, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Value {
Number(Number),
String(String),
Boolean(bool),
Block {
arity_in: usize,
arity_out: usize,
expression: Box<Expression>,
},
}
impl Display for Value {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"{}",
match self {
Value::Number(v) => v.to_string(),
Value::String(v) => v.to_string().trim_matches('"').to_string(),
Value::Boolean(v) => v.to_string(),
Value::Block {
arity_in,
arity_out,
..
} => format!("{{{}->{}}}", arity_in, arity_out),
}
)
}
}
Numbers are further distinguished (see below). Otherwise, it's strings, booleans, or blocks. Blocks shouldn't show up in the parser; they only appear when `Value`s are used as values on the stack, but they're defined here anyway.
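For example (my snippet, assuming the types above are in scope), a two-in/one-out block on the stack displays as `{2->1}`:

```rust
fn main() {
    let add_block = Value::Block {
        arity_in: 2,
        arity_out: 1,
        expression: Box::new(Expression::Group(vec![])), // body elided for the example
    };
    assert_eq!(format!("{}", add_block), "{2->1}");
}
```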
`Number` - the numeric tower
The numeric tower.
/// The numeric tower
#[derive(Copy, Clone, Debug)]
pub enum Number {
Integer(i64),
// Rational(i64, i64),
Float(f64),
// Complex(f64, f64),
}
So… I haven't actually written `Rational` or `Complex` values yet! But I'm getting there.
I do want to implement the various mathematical traits for numbers on these values though, so let’s give that a try:
macro_rules! do_op {
($trait:ty, $f:ident, $op:tt) => {
impl $trait for Number {
type Output = Number;
fn $f(self, rhs: Self) -> Self::Output {
let (a, b) = Number::coerce(self, rhs);
match (a, b) {
(Number::Integer(av), Number::Integer(bv)) => Number::Integer(av $op bv),
(Number::Float(av), Number::Float(bv)) => Number::Float(av $op bv),
_ => unreachable!(),
}
}
}
};
}
do_op!(Add, add, +);
do_op!(Sub, sub, -);
do_op!(Mul, mul, *);
do_op!(Div, div, /);
do_op!(Rem, rem, %);
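Both this macro and the comparison impls below lean on a `Number::coerce` helper that isn't shown in this post. A minimal sketch of what it presumably does, based on how it's used: if either operand is a `Float`, promote both to `Float`, otherwise leave both as `Integer`s.

```rust
impl Number {
    // Sketch only (the real helper isn't shown here): coerce both operands to
    // a common representation, promoting Integer to Float when they're mixed.
    fn coerce(a: Number, b: Number) -> (Number, Number) {
        match (a, b) {
            (Number::Integer(av), Number::Float(_)) => (Number::Float(av as f64), b),
            (Number::Float(_), Number::Integer(bv)) => (a, Number::Float(bv as f64)),
            _ => (a, b),
        }
    }
}
```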
impl PartialEq for Number {
fn eq(&self, other: &Self) -> bool {
let (a, b) = Number::coerce(*self, *other);
match (a, b) {
(Number::Integer(av), Number::Integer(bv)) => av == bv,
(Number::Float(av), Number::Float(bv)) => av == bv,
_ => unreachable!(),
}
}
}
impl Eq for Number {}
impl PartialOrd for Number {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
let (a, b) = Number::coerce(*self, *other);
match (a, b) {
(Number::Integer(av), Number::Integer(bv)) => av.partial_cmp(&bv),
(Number::Float(av), Number::Float(bv)) => av.partial_cmp(&bv),
_ => unreachable!(),
}
}
}
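Assuming that coercion behavior, mixed arithmetic and comparisons come out the way you'd expect; a quick example of my own:

```rust
fn main() {
    assert_eq!(Number::Integer(2) + Number::Integer(3), Number::Integer(5));
    assert_eq!(Number::Integer(2) + Number::Float(0.5), Number::Float(2.5));
    assert!(Number::Integer(1) < Number::Float(1.5));
}
```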
I don't know if this will actually work for `Rational` / `Complex`. We'll figure that out.

I think that what I'm going to have to do is to create a new `Rational` struct and then implement `Number::Rational` as `Rational(Rational)` (or something less confusing). That way, I can implement `Add` etc manually on that type and everything else will just work (tm).

We'll see.
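As a sketch of that direction (none of this exists yet; the names and layout are placeholders), it might end up looking something like:

```rust
// Hypothetical future code, not in StackLang yet
#[derive(Copy, Clone, Debug)]
pub struct Rational {
    numerator: i64,
    denominator: i64,
}

impl std::ops::Add for Rational {
    type Output = Rational;

    fn add(self, rhs: Self) -> Self::Output {
        // a/b + c/d = (a*d + c*b) / (b*d); reducing to lowest terms is left out
        Rational {
            numerator: self.numerator * rhs.denominator + rhs.numerator * self.denominator,
            denominator: self.denominator * rhs.denominator,
        }
    }
}

// ...and Number would grow a variant wrapping it:
// pub enum Number { Integer(i64), Rational(Rational), Float(f64), /* Complex(...) */ }
```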
Future work: type constructors / macros
One problem that I’ve been running into a bit is code like this:
#[test]
fn test_integer() {
let input = tokenize("123".as_bytes());
let output = parse(input);
assert_eq!(
output,
Expression::Group(vec![Expression::Literal(Value::Number(Number::Integer(
123
)))])
);
}
Having a `Group` of one `Literal`, which is a `Number`, specifically an `Integer`, is… annoying. At the very least, being able to write `Literal::make_integer(123)` would be nice.

Or alternatively, with macros: `group!{integer!(123)}`.

Or perhaps a single macro: `expr!{ group(integer(123)) }`. But at that point, we're re-writing the parser. We'll see.
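As a rough sketch (nothing like this exists in StackLang yet; the macro names are made up), the first two could be as simple as:

```rust
// Hypothetical helper macros, purely illustrative
macro_rules! integer {
    ($n:expr) => {
        Expression::Literal(Value::Number(Number::Integer($n)))
    };
}

macro_rules! group {
    ($($child:expr),* $(,)?) => {
        Expression::Group(vec![$($child),*])
    };
}

// The assertion in the test above would then shrink to:
// assert_eq!(output, group! {integer!(123)});
```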
Parsing
Okay. That's fun, but a worthwhile framework. Now that we have that, I'm going to have to start actually parsing things, probably with a `pub fn parse(tokens: Vec<Token>) -> Expression` function. But to do that, I think that I want to split it into two helper functions:

- `fn parse_one(tokens: &[Token]) -> (Expression, &[Token])`: given a slice of `Token`, consume enough to parse one expression and return it plus the remaining slice of `Token`
- `fn parse_until(tokens: &[Token], ending: Option<String>) -> (Vec<Expression>, &[Token])`: given a slice of `Token` and an `ending`, parse until we see the `ending` character; this is for parsing block etc. expressions until you reach the closing `}` / `]` / `)`; `ending` is an `Option` for the `main` block (which doesn't have a delimiter), in which case parse until the end of the `Token` slice
`parse_until` - for blocks of code

Since the latter is actually easier (and depends on `parse_one`), let's start with that:
// A helper to parse a list of expressions until a given ending token
// If ending is not set, parse until end of stream
fn parse_until(tokens: &[Token], ending: Option<String>) -> (Vec<Expression>, &[Token]) {
let mut tokens = tokens;
let mut expressions = vec![];
while !tokens.is_empty() && Some(tokens[0].token.clone()) != ending {
let (expression, next_tokens) = parse_one(tokens);
expressions.push(expression);
tokens = next_tokens;
}
if !tokens.is_empty() && Some(tokens[0].token.clone()) == ending {
(expressions, &tokens[1..])
} else {
(expressions, tokens)
}
}
That really is it. For each token (until we hit the ending), `parse_one` and collect the results. If we hit the ending (or the end of the slice), we're done.

Return the subslice: this advances past the tokens that we've already consumed, so we can start from where we left off. This is something I really like about Rust. Slices are a window (+ reference) into a section of memory, so it's cheap to get a subslice, rather than copying a `Vec` around.
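As a toy illustration of that (not StackLang code):

```rust
fn main() {
    let tokens = vec!["a", "b", "]", "c"];
    // A subslice is just a pointer + length into the same backing buffer;
    // nothing is copied and `tokens` itself is untouched.
    let rest: &[&str] = &tokens[3..];
    assert_eq!(rest, ["c"]);
}
```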
`parse_one` - for single expressions

This is the much more interesting one.
// A helper to parse a single expression from the current position in the token stream
fn parse_one(tokens: &[Token]) -> (Expression, &[Token]) {
if tokens[0].token == "@" {
// @ expressions prefix the next value (naming)
let (next, tokens) = parse_one(&tokens[1..]);
(Expression::At(Box::new(next)), tokens)
} else if tokens[0].token == "!" {
// ! expressions prefix the next value (assignment)
let (next, tokens) = parse_one(&tokens[1..]);
(Expression::Bang(Box::new(next)), tokens)
} else if tokens[0].token == "$" {
// $ expressions allow pushing a block to the stack
let (next, tokens) = parse_one(&tokens[1..]);
(Expression::Dollar(Box::new(next)), tokens)
} else if tokens[0].token == "{" {
// { expressions are blocks
let (children, tokens) = parse_until(&tokens[1..], Some(String::from("}")));
(Expression::Block(children), tokens)
} else if tokens[0].token == "[" {
// [ expressions are lists
let (children, tokens) = parse_until(&tokens[1..], Some(String::from("]")));
(Expression::List(children), tokens)
} else if tokens[0].token == "(" {
// ( expressions are groups
let (children, tokens) = parse_until(&tokens[1..], Some(String::from(")")));
(Expression::Group(children), tokens)
} else {
// Try to parse each literal value, if none match assume it's an identifier
if let Some(v) = tokens[0].token.parse::<i64>().ok() {
(
Expression::Literal(Value::Number(Number::Integer(v))),
&tokens[1..],
)
} else if let Some(v) = tokens[0].token.parse::<f64>().ok() {
(
Expression::Literal(Value::Number(Number::Float(v))),
&tokens[1..],
)
} else if tokens[0].token.starts_with("\"") {
(
Expression::Literal(Value::String(
tokens[0].token.trim_matches('"').to_string(),
)),
&tokens[1..],
)
} else if tokens[0].token == "true" || tokens[0].token == "false" {
(
Expression::Literal(Value::Boolean(tokens[0].token == "true")),
&tokens[1..],
)
} else {
(
Expression::Identifier(tokens[0].token.clone()),
&tokens[1..],
)
}
}
}
A bit more complicated. Similar to the breakdown of the `Expression` enum above, we have three parts (albeit in reverse order): prefixed expressions, nested expressions, and literals.
For prefixed expressions, they’re all the same:
if tokens[0].token == "@" {
// @ expressions prefix the next value (naming)
let (next, tokens) = parse_one(&tokens[1..]);
(Expression::At(Box::new(next)), tokens)
}
Take and consume the `@`, `parse_one` the expression it's prefixing, and return the new `Expression::At` plus the remaining `tokens` (from the subexpression).
For subexpressions, the handling is likewise all the same (and this is why we have `parse_until` as a separate function):
if tokens[0].token == "{" {
// { expressions are blocks
let (children, tokens) = parse_until(&tokens[1..], Some(String::from("}")));
(Expression::Block(children), tokens)
}
`parse_until` the matching delimiter and return the new `Expression::Block` and the point in `tokens` after that.
And finally, literals. These are a bit more interesting, but another reason that I'm really liking Rust. The `parse` function will try to parse the given token into whatever type we give it (using that type's `FromStr` trait). If it can, use `if let Some(v)` to unpack it and return. If not, fall through to the next case.
if let Some(v) = tokens[0].token.parse::<i64>().ok() {
(
Expression::Literal(Value::Number(Number::Integer(v))),
&tokens[1..],
)
}
All without exceptions! I like it.
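As an aside, `parse::<T>()` works for any type with a `FromStr` impl, so a new kind of literal would only need that trait to slot into this chain. A made-up `Symbol` type as an example (nothing like this is in StackLang):

```rust
use std::str::FromStr;

// A made-up literal type, only here to show how parse::<T>() picks up FromStr
#[derive(Debug, PartialEq)]
struct Symbol(String);

impl FromStr for Symbol {
    type Err = ();

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accept tokens like 'foo and reject everything else
        s.strip_prefix('\'')
            .map(|rest| Symbol(rest.to_string()))
            .ok_or(())
    }
}

fn main() {
    assert_eq!("'foo".parse::<Symbol>().ok(), Some(Symbol(String::from("foo"))));
    assert_eq!("foo".parse::<Symbol>(), Err(()));
}
```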
Starting the parser
And that’s it. Now just tie it together:
/// Parses a vector of tokens into a vector of expressions.
pub fn parse(tokens: Vec<Token>) -> Expression {
log::debug!("parse({:?})", tokens);
// A helper to parse a single expression from the current position in the token stream
fn parse_one(tokens: &[Token]) -> (Expression, &[Token]) {
// ...
}
// A helper to parse a list of expressions until a given ending token
// If ending is not set, parse until end of stream
fn parse_until(tokens: &[Token], ending: Option<String>) -> (Vec<Expression>, &[Token]) {
// ...
}
// Parse the entire stream
// TODO: This should be an exception if the stream is not empty after this
Expression::Group(parse_until(tokens.as_slice(), None).0)
}
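For that TODO, the check could look something like this sketch (not in the post). Worth noting: with `ending` set to `None` the current `parse_until` always consumes the whole slice (a stray `}` just becomes an identifier), so this would only start firing once unbalanced delimiters are treated as errors instead:

```rust
// Sketch of the leftover-token check the TODO is pointing at
let (expressions, remaining) = parse_until(tokens.as_slice(), None);
if !remaining.is_empty() {
    // Eventually this should be a real parse error with position information
    panic!("unexpected trailing tokens: {:?}", remaining);
}
Expression::Group(expressions)
```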
Testing
As before, I'm actually writing some unit tests! Thank you, GitHub Copilot.
#[cfg(test)]
mod test {
use crate::lexer::tokenize;
use crate::numbers::Number;
use crate::parser::parse;
use crate::types::{Expression, Value};
#[test]
fn test_integer() {
let input = tokenize("123".as_bytes());
let output = parse(input);
assert_eq!(
output,
Expression::Group(vec![Expression::Literal(Value::Number(Number::Integer(
123
)))])
);
}
// ...
#[test]
fn test_factorial() {
let input = tokenize(
"
{
@[n fact]
1
{ @0 n 1 - $fact fact n * }
n 1 < if
} @fact
5 $fact fact writeln"
.as_bytes(),
);
let output = parse(input);
assert_eq!(
output,
Expression::Group(vec![
Expression::Block(vec![
Expression::At(Box::new(Expression::List(vec![
Expression::Identifier(String::from("n")),
Expression::Identifier(String::from("fact")),
]))),
Expression::Literal(Value::Number(Number::Integer(1))),
Expression::Block(vec![
Expression::At(Box::new(Expression::Literal(Value::Number(
Number::Integer(0)
)),)),
Expression::Identifier(String::from("n")),
Expression::Literal(Value::Number(Number::Integer(1))),
Expression::Identifier(String::from("-")),
Expression::Dollar(Box::new(Expression::Identifier(String::from("fact")))),
Expression::Identifier(String::from("fact")),
Expression::Identifier(String::from("n")),
Expression::Identifier(String::from("*")),
]),
Expression::Identifier(String::from("n")),
Expression::Literal(Value::Number(Number::Integer(1))),
Expression::Identifier(String::from("<")),
Expression::Identifier(String::from("if")),
]),
Expression::At(Box::new(Expression::Identifier(String::from("fact")))),
Expression::Literal(Value::Number(Number::Integer(5))),
Expression::Dollar(Box::new(Expression::Identifier(String::from("fact")))),
Expression::Identifier(String::from("fact")),
Expression::Identifier(String::from("writeln")),
])
);
}
}
Writing that by hand would be a pain. But Copilot doesn't mind!
What’s next?
Onward!
Next up, I’m planning to:
- Type checking:
  - Automatically determine the arity of blocks when possible
  - Automatically determine specific types of expressions (including blocks)
- Numeric tower:
  - Implement rationals/complex numbers at the parser level + in any interpreter / compiler I have at that point
  - Implement automatic numeric coercion: if you try to add an integer to a complex number, the result should be a complex number
- Interpreters:
  - An AST-walking interpreter, directly evaluating the AST
  - A bytecode interpreter/compiler, evaluating at a lower level (I'm not sure how much this would gain, the AST is already fairly low level)
- Compilers:
  - Compile to C (and then pass that to GCC/Clang to compile)
  - Compile to WASM; since it's also stack based, this should be interesting
  - Compile to x86/ARM assembly