Lang2 is a programming language. Its implementation is split into two major parts: the compiler and the virtual machine. There are also shared components.
The lang2 implementation uses a simple but effective I/O model defined in
include/lang2/io.h.
There are two basic “interface” types: the reader l2_io_reader
and the writer l2_io_writer
.
A reader is something which data can be read from. A reader must implement
the size_t read(struct l2_io_reader *self, void *buf, size_t len)
function,
which fills buf
with up to len
bytes, and returns the number of bytes written.
The reader should block until something is read into buf
. A return value of
0 should be interpreted as EOF.
A writer is something which data can be written to. A writer must impleemnt
the void write(struct l2_io_writer *self, void *buf, size_t len)
function,
which writes len
bytes from buf
. The writer should block until
all bytes are written.
For example writers, look at l2_io_mem_writer
(with its write
function
l2_io_mem_write
) and l2_io_file_writer
(with its write
function l2_io_file_write
).
For example readers, look at l2_io_mem_reader
(with its read
function
l2_io_mem_read
) and l2_io_file_reader
(with read
function l2_io_mem_read
).
Readers and writers aren’t necessarily meant to be fast. Instead, if you’re doing
a lot of reading or writing (especially short reads and writes), you should be
using the types l2_bufio_reader
and l2_bufio_writer
. These implement buffered
reading/writing, only calling the underlying read
or write
function when
the buffer gets full. The l2_bufio_reader
also has the ability to peek
some number (less than L2_IO_BUFSIZ
) forwards.
In lang2, every identifier is an “atom” (name borrowed from Erlang). An atom is just a number which represents the name. Every identifier gets its own ID, and every occurrence of a particular identifier name will be associated with the same ID. You can think of it like a hash, except that there are no collisions.
Every namespace (be that the namespace of local variables in a function, the namespace of global functions, or a namespace variable) is just a map from an integer ID (the atom) to a variable reference.
Atom literals such as 'foo
will evaluate to an atom variable which has the
same numeric ID as the identifier foo
. This makes it possible to pass
names around at runtime.
The compiler consists of the lexer, the parser and the code generator.
The lexer reads from a reader
and produces tokens as it reads.
A token carries information about what kind of token it is (identifier, number,
open paren, etc). In addition, the token might carry extra information;
number tokens contain a number, dot-number tokens contain an integer,
and identifiers and strings contain a string.
The parser reads tokens from the lexer and uses the code generator to generate bytecode. Unlike other languages, the parser doesn’t produce a syntax tree; it just emits bytecode as it goes. This is possible thanks to careful syntax design combined with careful bytecode design.
The code generator contains, for the most part, thin wrappers around bytecode
instructions. For example, the l2_gen_halt
function just contains one line
to write an L2_OP_HALT
instruction. However, it also has the responsibility
of keeping track of the atoms (so that the name foo
always gets the same ID
everywhere), and of keeping track of the string literals (so that multiple
string literals with the same content only show up once in the bytecode).
The virtual machine executes bytecode.
The central piece of the VM is the l2_vm_step
function, which looks at the
code at the instruction pointer and executes it with a giant switch statement.