Assemble

Thursday 12 August 2021

Even I don’t enjoy the low level so much that I want to remember which numbers represent which instructions, so I started writing an assembler for my 8-bit machine. Assembly code is pretty close to machine code, just in a more readable format. Here’s my little demo program in machine code.

2 0 3 1 4 5 1 4

And here it is in assembly code.

:start
  set_r0 0
  set_r1 1

:loop
  add
  swap
  jump @loop

Opcodes and machine instructions

To keep the assembler simple, I decided to have a one-to-one mapping between opcodes and machine instructions. An opcode is the command you write in assembly, like add or set_r0. In general, opcodes don’t have to map one-to-one to machine instructions. For example, I could write the first few lines like this.

:start
  set r0, 0
  set r1, 1

In that case I would have one opcode (set) and it would take two operands: the register and the value. As far as I know, this is what assembly typically looks like and I want to support that syntax, too. I wouldn’t even have to change the machine for it; the assembler just needs to translate it to the same machine code as before. The only reason I haven’t implemented this yet is that I wanted to start simple.
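To make the translation concrete, here is a small Python sketch of how the two-operand form could be rewritten into the existing one-opcode-per-instruction form before assembling. This is not part of the assembler described here; the function name desugar and the rewriting rule are assumptions made purely for illustration.

```python
def desugar(line):
    """Rewrite 'set r0, 0' into the tokens ['set_r0', '0'].

    Hypothetical helper: any line that is not a three-part 'set'
    is returned as plain whitespace-separated tokens.
    """
    parts = line.replace(",", " ").split()
    if len(parts) == 3 and parts[0] == "set":
        register, value = parts[1], parts[2]
        return ["set_" + register, value]
    return parts


print(desugar("  set r0, 0"))  # ['set_r0', '0']
print(desugar("  jump @loop"))  # ['jump', '@loop']
```

Because the rewrite happens on the text, the machine and the rest of the assembler would not need to know that the set syntax exists.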

Parsing made simple

Another thing I did to keep the assembler simple is to have every token type start in a different way.

An opcode always starts with a letter.
A number always starts with a digit.
A label always starts with a :.
A reference always starts with an @.

There’s never any ambiguity. Seriously, if you can parse INI files, you can parse assembly code. Actually, parsing the assembly code is currently so simple that I didn’t even bother with a tokenizer. That’s likely to change as I make the assembler more convenient to use, but for now it works just fine.
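As an illustration of how little is needed, the four rules above can be sketched in Python as a single function that looks only at the first character. The function name classify and the type names it returns are assumptions for this example, not taken from the actual source code.

```python
def classify(token):
    """Return the token type, decided by the first character alone."""
    first = token[0]
    if first == ":":
        return "label"
    if first == "@":
        return "reference"
    if first.isdigit():
        return "number"
    if first.isalpha():
        return "opcode"
    raise ValueError("unrecognized token: " + repr(token))


# Splitting on whitespace is enough; no tokenizer required.
source = ":loop add jump @loop 16"
kinds = [classify(token) for token in source.split()]
print(kinds)  # ['label', 'opcode', 'opcode', 'reference', 'number']
```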

Output

The output of the assembler is a memory map for the 8-bit machine. If you remember, the machine has 256 bytes of memory, so the assembler creates a dump that the machine can read on startup. More precisely, the dump is at most 256 bytes long: if the program is smaller than that, the 8-bit machine just leaves the rest of the memory uninitialized.

Creating the memory map is straightforward since every opcode and every number directly translates into one byte in the memory map. I don’t even bother checking whether they are in an order that makes sense. The assembler is perfectly fine translating a program that looks like this.

add jump 15 8 7 6 swap set_r0

Garbage in, garbage out.
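A rough Python sketch of that direct translation, handling only opcodes and numbers, could look like the following. The instruction codes in OPCODES are inferred from the demo program at the top of this post; the function name translate and the size check are assumptions for the example.

```python
# Instruction codes inferred from the demo program "2 0 3 1 4 5 1 4".
OPCODES = {"jump": 1, "set_r0": 2, "set_r1": 3, "add": 4, "swap": 5}

MEMORY_SIZE = 256  # the machine has 256 bytes of memory


def translate(source):
    """Turn every opcode and number into exactly one byte, in order."""
    memory = bytearray()
    for token in source.split():
        if token[0].isdigit():
            memory.append(int(token))  # ValueError if not in 0..255
        else:
            memory.append(OPCODES[token])
    if len(memory) > MEMORY_SIZE:
        raise ValueError("program does not fit in memory")
    return bytes(memory)


# No check that operands follow the right opcodes: garbage in, garbage out.
garbage = translate("add jump 15 8 7 6 swap set_r0")
print(list(garbage))  # [4, 1, 15, 8, 7, 6, 5, 2]
```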

Labels and references

The only part that’s slightly tricky is dealing with jumps. The jump machine instruction expects an address to jump to as its operand. If you write the machine code directly, you can only get that address by counting bytes. This is particularly inconvenient if you add more instructions later, because now all of your jump targets shift. In assembly, you solve this by adding a label. You can give jump the name of the label instead of an address and the assembler will figure out what to replace it with.

I call the name after jump a reference, because it’s convenient to distinguish between labels and references when implementing the assembler, but I won’t be offended if you call those names after jump labels as well. Really, try me. See, not offended.

My first thought on implementing this was to keep a table of labels. Every time I encounter a label, I would add the name as a key and the address as a value. Then when I encounter a reference, I just look up the address in the table and output that. That works fine, until you want to jump to a label that’s defined after the reference, and that’s definitely something I want to support.
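To show where the single-table idea breaks down, here is a hedged Python sketch of it. The function name assemble_one_table is made up for this example, and the instruction codes are the ones inferred from the demo program.

```python
OPCODES = {"jump": 1, "set_r0": 2, "set_r1": 3, "add": 4, "swap": 5}


def assemble_one_table(source):
    """Single pass with only a label table; references resolve on sight."""
    memory = bytearray()
    labels = {}  # label name -> address
    for token in source.split():
        if token[0] == ":":
            labels[token[1:]] = len(memory)
        elif token[0] == "@":
            # Fails with KeyError when the label has not been seen yet.
            memory.append(labels[token[1:]])
        elif token[0].isdigit():
            memory.append(int(token))
        else:
            memory.append(OPCODES[token])
    return bytes(memory)


# A backward jump works: the label is already in the table.
backward = assemble_one_table(":loop add jump @loop")

# A forward jump does not: @end is read before :end is defined.
try:
    assemble_one_table("jump @end add :end swap")
    forward_ok = True
except KeyError:
    forward_ok = False
```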

So instead, I keep two tables: one to keep track of the labels, one to keep track of the references. When I encounter a reference, I write a 0 to the memory map as a placeholder, and I also record the address of that byte in the reference table. Then, when the entire assembly code has been processed, I go back to the reference table and for every entry, I look up the jump target in the label table. Then I replace the 0 I wrote earlier with the jump target. For example, let’s assemble the following code.

  set_r1 16

:loop
  add
  jump @loop

The instruction pointer starts at 0.
set_r1 is an opcode, so we write its instruction code (3) to the memory map.
The instruction pointer is now 1.
16 is a number, so we write it to the memory map.
The instruction pointer is now 2.
:loop is a label, so we add an entry to the label table. Since the instruction pointer is 2, the entry is loop, 2.
The instruction pointer is still 2, because the label produces no output in the memory map.
add is an opcode, so we write its instruction code (4) to the memory map.
The instruction pointer is now 3.
jump is an opcode, so we write its instruction code (1) to the memory map.
The instruction pointer is now 4.
@loop is a reference, so we add a 0 to the memory map and we add an entry to the reference table. Since the instruction pointer is 4, the entry is loop, 4.

That’s all the assembly code dealt with. Now it’s time to resolve all the references.

We take the entry from the reference table. It says loop, 4.
We look up the label with the name loop in the label table. It gives back the address 2.
We take the byte in the memory map with address 4 (where 4 is the value stored with the reference) and set that byte to 2 (where 2 is the value stored with the label).

Write the memory map to file and you’re done.
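The whole walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not the actual source code: the function name assemble is an assumption, the instruction codes are the ones that appear in this post, and the reference table is a list of (name, address) pairs so that the same label can be referenced more than once.

```python
OPCODES = {"jump": 1, "set_r0": 2, "set_r1": 3, "add": 4, "swap": 5}


def assemble(source):
    """Assemble source into a memory map, patching references at the end."""
    memory = bytearray()
    labels = {}       # label name -> address it points at
    references = []   # (label name, address of the placeholder byte)

    for token in source.split():
        first = token[0]
        if first == ":":
            # A label produces no output; it just names the next address.
            labels[token[1:]] = len(memory)
        elif first == "@":
            # Remember where the placeholder is, then write a 0 for now.
            references.append((token[1:], len(memory)))
            memory.append(0)
        elif first.isdigit():
            memory.append(int(token))
        else:
            memory.append(OPCODES[token])

    # All labels are known now, so every placeholder can be replaced.
    for name, address in references:
        memory[address] = labels[name]
    return bytes(memory)


program = """
  set_r1 16

:loop
  add
  jump @loop
"""
image = assemble(program)
print(list(image))  # [3, 16, 4, 1, 2]
```

Because references are only resolved after the last token has been read, a jump to a label that is defined further down works the same way as a jump backwards.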

Source code

If you’re a Patreon supporter on the Source Code tier, you can grab the source code from GitHub. You can create your own assembly program, run it through the assembler, and then pass it to the 8-bit machine. Instructions are in the readme.