3 years ago · 8f093cf96d
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -0,0 +1,77 @@
 # Lang2 Architecture

 Lang2 is a programming language. Its implementation is split into two major parts:
 the compiler and the virtual machine. There are also shared components.

 ## I/O

 The lang2 implementation uses a simple but effective I/O model defined in
 [include/lang2/io.h](https://git.mort.coffee/mort/lang2/src/branch/master/include/lang2/io.h).
 There are two basic "interface" types: the reader `l2_io_reader`
 and the writer `l2_io_writer`.

 A reader is something which data can be read from. A reader must implement
 the `size_t read(struct l2_io_reader *self, void *buf, size_t len)` function,
 which fills `buf` with up to `len` bytes, and returns the number of bytes written.
 The reader should block until something is read into `buf`. A return value of
 0 should be interpreted as EOF.

 A writer is something which data can be written to. A writer must impleemnt
 the `void write(struct l2_io_writer *self, void *buf, size_t len)` function,
 which writes `len` bytes from `buf`. The writer should block until
 all bytes are written.

 For example writers, look at `l2_io_mem_writer` (with its `write` function
 `l2_io_mem_write`) and `l2_io_file_writer` (with its `write` function `l2_io_file_write`).
 For example readers, look at `l2_io_mem_reader` (with its `read` function
 `l2_io_mem_read`) and `l2_io_file_reader` (with `read` function `l2_io_mem_read`).

 Readers and writers aren't necessarily meant to be fast. Instead, if you're doing
 a lot of reading or writing (especially short reads and writes), you should be
 using the types `l2_bufio_reader` and `l2_bufio_writer`. These implement buffered
 reading/writing, only calling the underlying `read` or `write` function when
 the buffer gets full. The `l2_bufio_reader` also has the ability to peek
 some number (less than `L2_IO_BUFSIZ`) forwards.

 ## Atoms

 In lang2, every identifier is an "atom" (name borrowed from Erlang). An atom is just
 a number which represents the name. Every identifier gets its own ID, and every
 occurrence of a particular identifier name will be associated with the same ID.
 You can think of it like a hash, except that there are no collisions.

 Every namespace (be that the namespace of local variables in a function, the
 namespace of global functions, or a namespace variable) is just a map from
 an integer ID (the atom) to a variable reference.

 Atom literals such as `'foo` will evaluate to an atom variable which has the
 same numeric ID as the identifier `foo`. This makes it possible to pass
 names around at runtime.

 ## Compiler

 The compiler consists of the lexer, the parser and the code generator.

 The lexer reads from a `reader` and produces tokens as it reads.
 A token carries information about what kind of token it is (identifier, number,
 open paren, etc). In addition, the token might carry extra information;
 number tokens contain a number, dot-number tokens contain an integer,
 and identifiers and strings contain a string.

 The parser reads tokens from the lexer and uses the code generator to generate
 bytecode. Unlike other languages, the parser doesn't produce a syntax tree;
 it just emits bytecode as it goes. This is possible thanks to careful syntax
 design combined with careful bytecode design.

 The code generator contains, for the most part, thin wrappers around bytecode
 instructions. For example, the `l2_gen_halt` function just contains one line
 to write an `L2_OP_HALT` instruction. However, it also has the responsibility
 of keeping track of the atoms (so that the name `foo` always gets the same ID
 everywhere), and of keeping track of the string literals (so that multiple
 string literals with the same content only show up once in the bytecode).

 ## Virtual Machine

 The virtual machine executes bytecode.
 The central piece of the VM is the `l2_vm_step` function, which looks at the
 code at the instruction pointer and executes it with a giant switch statement.