3

The Lox Language

3 . 0Lox 程序设计语言

有什么能比得上为饥肠辘辘的人端上一份丰盛早餐更好的事呢?

安东尼·波登(美国知名大厨)

本书接下来的部分将仔细探索 Lox 程序语言实现的角角落落,但是在真正动手编写解释器代码之前,还是让我们来好好地认识一下这个我们将要实现的程序语言吧。

在当下,你还没有真正上手编写代码之前,我不想先让你陷入到 Lox 程序语言的各种细节与语言规范里头去。所以本章是对 Lox 程序语言的一个温和、友好的介绍,大量语法细节与边界条件让我们放到后面具体实现时再详细阐述。

3 . 1你好,Lox

这是你首次接触 Lox 程序设计语言。

// Your first Lox program!
print "Hello, world!";

行注释//与句末分号;标示着 Lox 语法偏向 C 系,Lox 是一门类 C 的程序语言(字符串两边没有带小括号,这是因为print是一个内建语句,而不是一个库函数)。

我毫不否认,C 有着令人惊叹的语法设计。如果我们想要更加优雅的语法风格,Lox 可能偏向 Pascal 或者 Smalltalk ;如果我们想要极简主义的语法风格,Lox 可能更偏向 Scheme 。这些程序语言都有各自的可取之处。

类 C 语法最大的优势在于,它可以为读者带来:熟悉感。我知道你已经对我们将要用来实现 Lox 的两门程序语言 Java、C 非常熟悉了,Lox 采用与 C、Java 一脉相承的类 C 语法,这将为读者减少很多心智负担。

3 . 2一门高阶的程序设计语言

当我写完这本书时,书本的篇幅超出了我的预期,但这本书仍然没有达到够容纳类似 Java 这样大型程序语言实现的厚度。为了能在本书有限的篇幅里包含两种 Lox 语言的完整实现,Lox 语言本身必须被设计得小巧紧凑。

当我思考哪些程序语言既小巧又实用的时候,我脑海中浮现出像 JavaScript、Scheme、Lua 这样的高级“脚本”语言。在这三者之中,Lox 与 JavaScript 最为相像,大多数类 C 语法的脚本语言都像 JavaScript。当我们继续深入,就会发现 Lox 的作用域界定与 Scheme 相似。而在本书第三部分,我们使用 C 语言编写的 Lox 解释器,在具体实现细节上大量参考借鉴了 Lua 清晰高效的代码实现。

在以下两个方面,Lox 也与这三门程序语言有着同样的理念。

3 . 2 . 1动态类型

Lox 是一门动态类型的程序语言。变量可以存储任意类型的数据,也可以在程序运行的任意时刻改变其存储的数据类型。如果你在不兼容数据类型上执行了一个错误操作,比如说:将一个数字除以一枚字符串,那么该错误将在运行时被捕获和抛出。

静态类型有着许多令人喜爱的理由,但在实践上,我还是为 Lox 选择了动态类型。实现一个完备的静态类型系统需要大量的背景知识、深厚的代码功底。忽略静态类型,程序语言会变得更加简单,这本书也会变得更薄一些。将类型检查下推到运行时,可以让我们快速构建起一支可以执行代码的语言解释器。

3 . 2 . 2自动内存管理

高阶程序语言诞生的目的之一便是为了消除易出错的底层操作,还有什么比手动管理内存的分配释放更令人烦心的事情呢?没有人愿意早晨起来这么互相打招呼:“我迫不及待想为我今天申请的每块内存找个合适的地方调用free()函数啦!”

目前主要有两类技术用以管理内存:引用计数(Reference Counting)垃圾回收(Tracing Garbage Collection、Garbage Collection、GC)。引用计数在实现上更加简便,我想这也是为什么 Perl、PHP、Python 起初都使用引用计数管理内存的原因。但是随着语言的发展,引用计数的局限性越来越明显,所以到最后,这些程序语言都添加上了完整的垃圾回收器,用于在对象全流程生命周期里做好内存清理回收工作。

“垃圾回收(GC)”声名远播,让人心生畏惧,它需要开发者真正面向内存做一些细致的工作,调试垃圾回收器能让你做梦都能梦到程序的 16 进制转储调试信息。但是无需担心,本书就是为了揭开魔法、消灭怪兽的。我们将编写自己的垃圾回收器,你会发现,垃圾回收用到的算法其实非常简单,实现起来也非常有趣。

3 . 3数据类型

在 Lox 的小小宇宙中,构成宇宙的原子即是内置数据类型。Lox 目前只有几种简单的内置数据类型:

3 . 4表达式

If built-in data types and their literals are atoms, then expressions must be the molecules. Most of these will be familiar.

3 . 4 . 1Arithmetic

Lox features the basic arithmetic operators you know and love from C and other languages:

add + me;
subtract - me;
multiply * me;
divide / me;

The subexpressions on either side of the operator are operands. Because there are two of them, these are called binary operators. (It has nothing to do with the ones-and-zeroes use of “binary”.) Because the operator is fixed in the middle of the operands, these are also called infix operators (as opposed to prefix operators where the operator comes before the operands, and postfix where it comes after).

One arithmetic operator is actually both an infix and a prefix one. The - operator can also be used to negate a number.

-negateMe;

All of these operators work on numbers, and it’s an error to pass any other types to them. The exception is the + operatoryou can also pass it two strings to concatenate them.

3 . 4 . 2Comparison and equality

Moving along, we have a few more operators that always return a Boolean result. We can compare numbers (and only numbers), using Ye Olde Comparison Operators.

less < than;
lessThan <= orEqual;
greater > than;
greaterThan >= orEqual;

We can test two values of any kind for equality or inequality.

1 == 2;         // false.
"cat" != "dog"; // true.

Even different types.

314 == "pi"; // false.

Values of different types are never equivalent.

123 == "123"; // false.

I’m generally against implicit conversions.

3 . 4 . 3Logical operators

The not operator, a prefix !, returns false if its operand is true, and vice versa.

!true;  // false.
!false; // true.

The other two logical operators really are control flow constructs in the guise of expressions. An and expression determines if two values are both true. It returns the left operand if it’s false, or the right operand otherwise.

true and false; // false.
true and true;  // true.

And an or expression determines if either of two values (or both) are true. It returns the left operand if it is true and the right operand otherwise.

false or false; // false.
true or false;  // true.

The reason and and or are like control flow structures is that they short-circuit. Not only does and return the left operand if it is false, it doesn’t even evaluate the right one in that case. Conversely (contrapositively?), if the left operand of an or is true, the right is skipped.

3 . 4 . 4Precedence and grouping

All of these operators have the same precedence and associativity that you’d expect coming from C. (When we get to parsing, we’ll get way more precise about that.) In cases where the precedence isn’t what you want, you can use () to group stuff.

var average = (min + max) / 2;

Since they aren’t very technically interesting, I’ve cut the remainder of the typical operator menagerie out of our little language. No bitwise, shift, modulo, or conditional operators. I’m not grading you, but you will get bonus points in my heart if you augment your own implementation of Lox with them.

Those are the expression forms (except for a couple related to specific features that we’ll get to later), so let’s move up a level.

3 . 5Statements

Now we’re at statements. Where an expression’s main job is to produce a value, a statement’s job is to produce an effect. Since, by definition, statements don’t evaluate to a value, to be useful they have to otherwise change the world in some wayusually modifying some state, reading input, or producing output.

You’ve seen a couple of kinds of statements already. The first one was:

print "Hello, world!";

A print statement evaluates a single expression and displays the result to the user. You’ve also seen some statements like:

"some expression";

An expression followed by a semicolon (;) promotes the expression to statement-hood. This is called (imaginatively enough), an expression statement.

If you want to pack a series of statements where a single one is expected, you can wrap them up in a block.

{
  print "One statement.";
  print "Two statements.";
}

Blocks also affect scoping, which leads us to the next section . . . 

3 . 6Variables

You declare variables using var statements. If you omit the initializer, the variable’s value defaults to nil.

var imAVariable = "here is my value";
var iAmNil;

Once declared, you can, naturally, access and assign a variable using its name.

var breakfast = "bagels";
print breakfast; // "bagels".
breakfast = "beignets";
print breakfast; // "beignets".

I won’t get into the rules for variable scope here, because we’re going to spend a surprising amount of time in later chapters mapping every square inch of the rules. In most cases, it works like you would expect coming from C or Java.

3 . 7Control Flow

It’s hard to write useful programs if you can’t skip some code or execute some more than once. That means control flow. In addition to the logical operators we already covered, Lox lifts three statements straight from C.

An if statement executes one of two statements based on some condition.

if (condition) {
  print "yes";
} else {
  print "no";
}

A while loop executes the body repeatedly as long as the condition expression evaluates to true.

var a = 1;
while (a < 10) {
  print a;
  a = a + 1;
}

Finally, we have for loops.

for (var a = 1; a < 10; a = a + 1) {
  print a;
}

This loop does the same thing as the previous while loop. Most modern languages also have some sort of for-in or foreach loop for explicitly iterating over various sequence types. In a real language, that’s nicer than the crude C-style for loop we got here. Lox keeps it basic.

3 . 8Functions

A function call expression looks the same as it does in C.

makeBreakfast(bacon, eggs, toast);

You can also call a function without passing anything to it.

makeBreakfast();

Unlike in, say, Ruby, the parentheses are mandatory in this case. If you leave them off, it doesn’t call the function, it just refers to it.

A language isn’t very fun if you can’t define your own functions. In Lox, you do that with fun.

fun printSum(a, b) {
  print a + b;
}

Now’s a good time to clarify some terminology. Some people throw around “parameter” and “argument” like they are interchangeable and, to many, they are. We’re going to spend a lot of time splitting the finest of downy hairs around semantics, so let’s sharpen our words. From here on out:

The body of a function is always a block. Inside it, you can return a value using a return statement.

fun returnSum(a, b) {
  return a + b;
}

If execution reaches the end of the block without hitting a return, it implicitly returns nil.

3 . 8 . 1Closures

Functions are first class in Lox, which just means they are real values that you can get a reference to, store in variables, pass around, etc. This works:

fun addPair(a, b) {
  return a + b;
}

fun identity(a) {
  return a;
}

print identity(addPair)(1, 2); // Prints "3".

Since function declarations are statements, you can declare local functions inside another function.

fun outerFunction() {
  fun localFunction() {
    print "I'm local!";
  }

  localFunction();
}

If you combine local functions, first-class functions, and block scope, you run into this interesting situation:

fun returnFunction() {
  var outside = "outside";

  fun inner() {
    print outside;
  }

  return inner;
}

var fn = returnFunction();
fn();

Here, inner() accesses a local variable declared outside of its body in the surrounding function. Is this kosher? Now that lots of languages have borrowed this feature from Lisp, you probably know the answer is yes.

For that to work, inner() has to “hold on” to references to any surrounding variables that it uses so that they stay around even after the outer function has returned. We call functions that do this closures. These days, the term is often used for any first-class function, though it’s sort of a misnomer if the function doesn’t happen to close over any variables.

As you can imagine, implementing these adds some complexity because we can no longer assume variable scope works strictly like a stack where local variables evaporate the moment the function returns. We’re going to have a fun time learning how to make these work correctly and efficiently.

3 . 9Classes

Since Lox has dynamic typing, lexical (roughly, “block”) scope, and closures, it’s about halfway to being a functional language. But as you’ll see, it’s also about halfway to being an object-oriented language. Both paradigms have a lot going for them, so I thought it was worth covering some of each.

Since classes have come under fire for not living up to their hype, let me first explain why I put them into Lox and this book. There are really two questions:

3 . 9 . 1Why might any language want to be object oriented?

Now that object-oriented languages like Java have sold out and only play arena shows, it’s not cool to like them anymore. Why would anyone make a new language with objects? Isn’t that like releasing music on 8-track?

It is true that the “all inheritance all the time” binge of the ’90s produced some monstrous class hierarchies, but object-oriented programming (OOP) is still pretty rad. Billions of lines of successful code have been written in OOP languages, shipping millions of apps to happy users. Likely a majority of working programmers today are using an object-oriented language. They can’t all be that wrong.

In particular, for a dynamically typed language, objects are pretty handy. We need some way of defining compound data types to bundle blobs of stuff together.

If we can also hang methods off of those, then we avoid the need to prefix all of our functions with the name of the data type they operate on to avoid colliding with similar functions for different types. In, say, Racket, you end up having to name your functions like hash-copy (to copy a hash table) and vector-copy (to copy a vector) so that they don’t step on each other. Methods are scoped to the object, so that problem goes away.

3 . 9 . 2Why is Lox object oriented?

I could claim objects are groovy but still out of scope for the book. Most programming language books, especially ones that try to implement a whole language, leave objects out. To me, that means the topic isn’t well covered. With such a widespread paradigm, that omission makes me sad.

Given how many of us spend all day using OOP languages, it seems like the world could use a little documentation on how to make one. As you’ll see, it turns out to be pretty interesting. Not as hard as you might fear, but not as simple as you might presume, either.

3 . 9 . 3Classes or prototypes

When it comes to objects, there are actually two approaches to them, classes and prototypes. Classes came first, and are more common thanks to C++, Java, C#, and friends. Prototypes were a virtually forgotten offshoot until JavaScript accidentally took over the world.

In class-based languages, there are two core concepts: instances and classes. Instances store the state for each object and have a reference to the instance’s class. Classes contain the methods and inheritance chain. To call a method on an instance, there is always a level of indirection. You look up the instance’s class and then you find the method there:

How fields and methods are looked up on classes and instances

Prototype-based languages merge these two concepts. There are only objectsno classesand each individual object may contain state and methods. Objects can directly inherit from each other (or “delegate to” in prototypal lingo):

How fields and methods are looked up in a prototypal system

This means that in some ways prototypal languages are more fundamental than classes. They are really neat to implement because they’re so simple. Also, they can express lots of unusual patterns that classes steer you away from.

But I’ve looked at a lot of code written in prototypal languagesincluding some of my own devising. Do you know what people generally do with all of the power and flexibility of prototypes?  . . . They use them to reinvent classes.

I don’t know why that is, but people naturally seem to prefer a class-based (Classic? Classy?) style. Prototypes are simpler in the language, but they seem to accomplish that only by pushing the complexity onto the user. So, for Lox, we’ll save our users the trouble and bake classes right in.

3 . 9 . 4Classes in Lox

Enough rationale, let’s see what we actually have. Classes encompass a constellation of features in most languages. For Lox, I’ve selected what I think are the brightest stars. You declare a class and its methods like so:

class Breakfast {
  cook() {
    print "Eggs a-fryin'!";
  }

  serve(who) {
    print "Enjoy your breakfast, " + who + ".";
  }
}

The body of a class contains its methods. They look like function declarations but without the fun keyword. When the class declaration is executed, Lox creates a class object and stores that in a variable named after the class. Just like functions, classes are first class in Lox.

// Store it in variables.
var someVariable = Breakfast;

// Pass it to functions.
someFunction(Breakfast);

Next, we need a way to create instances. We could add some sort of new keyword, but to keep things simple, in Lox the class itself is a factory function for instances. Call a class like a function, and it produces a new instance of itself.

var breakfast = Breakfast();
print breakfast; // "Breakfast instance".

3 . 9 . 5Instantiation and initialization

Classes that only have behavior aren’t super useful. The idea behind object-oriented programming is encapsulating behavior and state together. To do that, you need fields. Lox, like other dynamically typed languages, lets you freely add properties onto objects.

breakfast.meat = "sausage";
breakfast.bread = "sourdough";

Assigning to a field creates it if it doesn’t already exist.

If you want to access a field or method on the current object from within a method, you use good old this.

class Breakfast {
  serve(who) {
    print "Enjoy your " + this.meat + " and " +
        this.bread + ", " + who + ".";
  }

  // ...
}

Part of encapsulating data within an object is ensuring the object is in a valid state when it’s created. To do that, you can define an initializer. If your class has a method named init(), it is called automatically when the object is constructed. Any parameters passed to the class are forwarded to its initializer.

class Breakfast {
  init(meat, bread) {
    this.meat = meat;
    this.bread = bread;
  }

  // ...
}

var baconAndToast = Breakfast("bacon", "toast");
baconAndToast.serve("Dear Reader");
// "Enjoy your bacon and toast, Dear Reader."

3 . 9 . 6Inheritance

Every object-oriented language lets you not only define methods, but reuse them across multiple classes or objects. For that, Lox supports single inheritance. When you declare a class, you can specify a class that it inherits from using a less-than (<) operator.

class Brunch < Breakfast {
  drink() {
    print "How about a Bloody Mary?";
  }
}

Here, Brunch is the derived class or subclass, and Breakfast is the base class or superclass.

Every method defined in the superclass is also available to its subclasses.

var benedict = Brunch("ham", "English muffin");
benedict.serve("Noble Reader");

Even the init() method gets inherited. In practice, the subclass usually wants to define its own init() method too. But the original one also needs to be called so that the superclass can maintain its state. We need some way to call a method on our own instance without hitting our own methods.

As in Java, you use super for that.

class Brunch < Breakfast {
  init(meat, bread, drink) {
    super.init(meat, bread);
    this.drink = drink;
  }
}

That’s about it for object orientation. I tried to keep the feature set minimal. The structure of the book did force one compromise. Lox is not a pure object-oriented language. In a true OOP language every object is an instance of a class, even primitive values like numbers and Booleans.

Because we don’t implement classes until well after we start working with the built-in types, that would have been hard. So values of primitive types aren’t real objects in the sense of being instances of classes. They don’t have methods or properties. If I were trying to make Lox a real language for real users, I would fix that.

3 . 10The Standard Library

We’re almost done. That’s the whole language, so all that’s left is the “core” or “standard” librarythe set of functionality that is implemented directly in the interpreter and that all user-defined behavior is built on top of.

This is the saddest part of Lox. Its standard library goes beyond minimalism and veers close to outright nihilism. For the sample code in the book, we only need to demonstrate that code is running and doing what it’s supposed to do. For that, we already have the built-in print statement.

Later, when we start optimizing, we’ll write some benchmarks and see how long it takes to execute code. That means we need to track time, so we’ll define one built-in function, clock(), that returns the number of seconds since the program started.

And . . . that’s it. I know, right? It’s embarrassing.

If you wanted to turn Lox into an actual useful language, the very first thing you should do is flesh this out. String manipulation, trigonometric functions, file I/O, networking, heck, even reading input from the user would help. But we don’t need any of that for this book, and adding it wouldn’t teach you anything interesting, so I’ve left it out.

Don’t worry, we’ll have plenty of exciting stuff in the language itself to keep us busy.

Challenges

  1. Write some sample Lox programs and run them (you can use the implementations of Lox in my repository). Try to come up with edge case behavior I didn’t specify here. Does it do what you expect? Why or why not?

  2. This informal introduction leaves a lot unspecified. List several open questions you have about the language’s syntax and semantics. What do you think the answers should be?

  3. Lox is a pretty tiny language. What features do you think it is missing that would make it annoying to use for real programs? (Aside from the standard library, of course.)

Design Note: Expressions and Statements

Lox has both expressions and statements. Some languages omit the latter. Instead, they treat declarations and control flow constructs as expressions too. These “everything is an expression” languages tend to have functional pedigrees and include most Lisps, SML, Haskell, Ruby, and CoffeeScript.

To do that, for each “statement-like” construct in the language, you need to decide what value it evaluates to. Some of those are easy:

  • An if expression evaluates to the result of whichever branch is chosen. Likewise, a switch or other multi-way branch evaluates to whichever case is picked.

  • A variable declaration evaluates to the value of the variable.

  • A block evaluates to the result of the last expression in the sequence.

Some get a little stranger. What should a loop evaluate to? A while loop in CoffeeScript evaluates to an array containing each element that the body evaluated to. That can be handy, or a waste of memory if you don’t need the array.

You also have to decide how these statement-like expressions compose with other expressionsyou have to fit them into the grammar’s precedence table. For example, Ruby allows:

puts 1 + if true then 2 else 3 end + 4

Is this what you’d expect? Is it what your users expect? How does this affect how you design the syntax for your “statements”? Note that Ruby has an explicit end to tell when the if expression is complete. Without it, the + 4 would likely be parsed as part of the else clause.

Turning every statement into an expression forces you to answer a few hairy questions like that. In return, you eliminate some redundancy. C has both blocks for sequencing statements, and the comma operator for sequencing expressions. It has both the if statement and the ?: conditional operator. If everything was an expression in C, you could unify each of those.

Languages that do away with statements usually also feature implicit returnsa function automatically returns whatever value its body evaluates to without need for some explicit return syntax. For small functions and methods, this is really handy. In fact, many languages that do have statements have added syntax like => to be able to define functions whose body is the result of evaluating a single expression.

But making all functions work that way can be a little strange. If you aren’t careful, your function will leak a return value even if you only intend it to produce a side effect. In practice, though, users of these languages don’t find it to be a problem.

For Lox, I gave it statements for prosaic reasons. I picked a C-like syntax for familiarity’s sake, and trying to take the existing C statement syntax and interpret it like expressions gets weird pretty fast.