# Harmonized Assembly Layer

The Harmonized Assembly Layer is a set of words implemented by all Dusk kernels
which have the same semantics and compile native code that has consistent
results on all architectures. For example, "RSP) 2 +) 16b) +," will, on all
arches, compile a set of instructions that will result in the 16-bit addition
of RSP+2 into the Work register. On i386, this is the same as
"ax sp 2 d) 16b) add,".

This layer allows us to generate performant code in a cross-arch manner. It is
also what compilers such as the C compiler rely on to generate code.

Of course, as with any abstraction, we sometimes lose a little bit in speed and
binary space compared to direct assembler instructions, but in general, the
result is pretty good and direct assembler should be needed only in the tightest
of the loops.

The HAL is implemented at the kernel level and is available from the very
beginning of the boot sequence, which makes extensive use of it to bootstrap
into a usable system.

The HAL is always for the "live" system. It has not been designed with cross-
compiling in mind.

## Concepts

### Register allocation

The HAL has 5 virtual registers: W, A, S, PSP, RSP. Each architecture
implementing the HAL will need to map those virtual registers to actual
registers. Here is the list of mappings for all supported CPUs:

i386  W=eax A=ebx S=edx PSP=esi RSP=esp
ARM   W=r9  A=r11 S=r8  PSP=r10 RSP=r13

### RSP and PSP registers

RSP and PSP registers map directly to Forth's RS pointer and PS pointer. They're
the same.

### W register

The W register is the "work" register and the default destination of all HAL
instructions. When we say that "@," means "fetch", we mean "fetch into the
destination", which is the W register by default.

The W register is PS's top of stack. This means that, for example, increasing
the W register by 1 is the exact same thing as executing the "1+" word.

### A and S registers

The HAL has two extra registers that regular Forth doesn't have: the A (Address)
register and the S (Scratch) register.

Both are general purpose registers that can be used both as source and
destination. To use them as a source, there are the A) and S) words, which
behave like W).

To have an operand target one of those registers as its destination, you use
the A>) and S>) words.

In terms of capabilities, the A register has the exact same ones as W. The S
register can also be used in the same way, but it's used by the "/mod,"
operation as a result register, so it's a tiny bit less "permanent".
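
As a rough illustration of the words above, here is a small sketch built only
from operand words and instructions documented in this file. It is meant to be
compiled from within a "code" word and is not taken from an actual kernel.

```forth
PSP) A>) @,   \ A = 32-bit value at PSP (the second PS item)
A) +,         \ W = W + the 32-bit value A points to
A) &) +,      \ W = W + A itself (register value, not memory)
A) &) S>) @,  \ S = A
```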

### Register permanence

The A and S registers cannot be expected to keep their value across word calls.
As soon as another word is called, we must consider those values destroyed.

However, all HAL operations and macros must preserve A's value (unless A is the
destination, of course). Therefore, we can rely on A's value as long as the code
we generate doesn't branch to other words.

This applies to all "compiling" words such as dup, drop, if, then, etc. Those
words are supposed to leave the A register alone.

The S register doesn't have the same guarantees and is used by some HAL
operations and macros as temporary storage. That's why it's called "Scratch".
Operations and macros using it will mention it in their documentation.

Rule of thumb: if you're coding a HAL macro, use the S register. If you're
coding a regular word, use A because the S register might be pulled out from
under your feet.

It goes without saying that W, being PS's top of stack, is preserved at all
times.

### Operands

All HAL instructions take either no operand (inherent) or one operand parameter.
That operand parameter is a 32-bit number with an arch-specific (that is,
opaque) bit structure. It contains all the information the instruction needs to
know its source and destination.

Operand words all end with ")". For example, "A) +," means "add the 32-bit value
that the A register points to into the W register".

Some operand words are not directly operands, but operand modifiers. For
example, "+)" adds a numerical offset to an operand. "W) 4 +)" refers to the
memory location W points to, with a 4-byte displacement. The "8b)" modifier
transforms the operand into an 8-bit operand.

By default, all operands refer to a memory location. Only through the "&)"
operand modifier (see below) can we refer directly to a value in a register.

### &) operand modifier

The &) word takes an input operand and returns its dereferenced counterpart. For
example, m) becomes i), W) becomes a direct reference to W, etc. This also works
with displacements. For example, "RSP) 4 +) &)" yields an operand that points
to RSP+4.

This operand might not be addressable directly by the host CPU. In that case,
the HAL operator will compile two instructions. For example, "RSP) 4 +) &) +,"
under i386 would yield "di sp 4 d) lea, ax di add," ("di" being any unallocated
register).

The "&)" word never writes instructions directly, only operator words do. The
"lea," above wouldn't be written when "&)" is called, but when "+," is.

The &) operand always results in a 32-bit operation. Don't try to apply 16b) or
8b) afterwards, as this results in undefined behavior.

&) can't be used with i).
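
To make the difference between a memory operand and its "&)" counterpart more
concrete, here is a short sketch using only words described in this file. The
comments restate what the text above says each operand refers to.

```forth
RSP) 4 +) @,     \ W = 32-bit value stored at address RSP+4
RSP) 4 +) &) @,  \ W = RSP+4 itself, an address
A) &) @,         \ W = A, the register value
$1234 m) &) @,   \ W = $1234, like "$1234 i) @,"
```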

### <>) operand modifier

The <>) word inverts the destination and the source of the HAL instruction,
allowing an arithmetic result to be stored directly in memory. For example,
"$1234 m) 8b) <>) +," adds W to the 8-bit value at address $1234 and stores the
result directly at address $1234 without affecting W.
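
Side by side, the normal and inverted forms of the same operation could look
like the sketch below. It only reuses words documented in this file. The last
line relies on the "!," macro being a shortcut for "<>) @,", as stated in the
"HAL macros" section.

```forth
$1234 m) 8b) +,      \ W = W + byte at $1234; memory is untouched
$1234 m) 8b) <>) +,  \ byte at $1234 = byte at $1234 + W; W is untouched
PSP) <>) @,          \ write W to the cell at PSP; same as "PSP) !,"
```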

### 8b) and 16b) arithmetics

8b) and 16b) modifiers only apply to memory accesses and all arithmetic is
"upscaled" to 32-bit with regard to flag settings.

This also applies to compare, which means that, for example,
"$4242 i) @, RSP) 8b) compare," will never set the Z flag because even if the
byte at RSP) is $42, the comparison is done on the whole W register.
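
One possible way to test a single byte is to fetch it into W first and compare
the whole register against the expected value. This is only a sketch: it
assumes that an 8-bit fetch leaves the upper bits of W cleared, which is what
the "c@" implementation in the "Examples" section suggests, and it overwrites
the previous value of W.

```forth
RSP) 8b) @,      \ W = byte at RSP (this overwrites W)
$42 i) compare,  \ 32-bit compare of W against $42
Z) C>W,          \ W = 1 if the byte was $42, else 0
```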

### RSP) and [rcnt]

The only HAL operation that automatically adjusts [rcnt] (see "Local variables"
in [doc/usage/rs]) is rs+,. Other HAL operations don't touch [rcnt]. Therefore,
special care must be taken when using the RSP) operand.

If you're inside of a regular "code" word, you don't care about [rcnt], so you
can ignore this warning.

However, if you're writing HAL as part of a macro that could be used in a word
that has local variables, then every time you write a HAL operation that
modifies RSP ("RSP) @+," for example), you need to adjust [rcnt] accordingly or
else you'll break local variables.

### Branching and flags

The HAL can generate branching, conditional or not, through its "branch"
instructions. "branchC,", the conditional branching generator, takes a "cond"
argument. This argument is generated by words like "Z)", ">)", etc. and the
number they yield is arch-specific. The idea is that through this number, the
"branchC," instruction knows the kind of native branch instruction to generate.

These conditions depend on flags being set (or not) and the conditions under
which these flags are set are not exactly the same across architectures.

As of now, when we refer to "flags", there's actually only the Z flag involved,
which is set when the operation yields a zero result. One day, maybe we'll add
the C flag (Carry), so that's why we refer to "flags" as plural. Of course, CPUs
have more flags than that, but they are opaque to the HAL.

To be able to rely on consistent conditional branching, HAL instructions make
guarantees on the flags set by certain instructions. If an instruction has a "Z"
next to it in the listing below, it's safe to conditionally branch using "Z)" or
"NZ)" right after having called it. Even if the native instruction for a
particular HAL word doesn't supply that flag, the HAL instruction will generate
the necessary native instructions to make it so, at the cost of speed. For this
reason, we minimize flag guarantees in HAL words.

Condition flags are only valid right after the instruction that's supposed to
set them. Flags are considered destroyed as soon as you compile another
instruction... with one exception: branching preserves flags. This means that
after a "branch,", "branchC," or "branchR,", flags are the same as they were
before.

Arithmetic conditions (">)", "<=)", etc.) have no associated flag and can only
be used after a "compare,".

If you look at the signatures of branching words, you'll notice something weird:
they take an address parameter and yield an address result. This is because
those words can be used for either backward or forward branching. What they do
is write a branch to the supplied address, but also yield the address of a
memory location that can then be used by "branch!".

Therefore, a backward branch looks like "begin .. branch, drop" and a forward
branch looks like "0 branch, .. here branch!"

All addresses passed to branching words are absolute addresses. If the native
instructions use relative branching addressing, the HAL takes care of the
translation.
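
The two branching patterns can be sketched as complete "code" words. The word
names "countdown" and "?1+" are hypothetical and only serve the illustration.
Everything else comes from the API described in this file.

```forth
\ Backward conditional branch: decrement W until it reaches zero.
code countdown
  here               \ loop start, kept on PS while compiling
  1 i) -,            \ W = W - 1; "-," guarantees the Z flag
  NZ) branchC, drop  \ branch back to loop start while W is not zero
  exit,

\ Forward conditional branch: skip "1 i) +," when W is zero.
code ?1+
  0 i) compare,      \ sets Z when W = 0
  0 Z) branchC,      \ placeholder target, fixed up below
  1 i) +,            \ only runs when W is not zero
  here branch!       \ make the branch target this address
  exit,
```

In the backward case, the address yielded by "branchC," is not needed, hence
the "drop". In the forward case, that address stays on PS until "branch!"
consumes it.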

## pushret, popret, and popexit,

In Dusk, "Call" means "Push the address of the instruction following the current
one to RSP and then jump to the address being called". "Return" means "Pop RSP
and jump to that address".

On "traditional" CPU architectures, this maps exactly to the behavior of the
native "call" and "return" instructions, so we can live a happy life of
blissful ignorance when using these CPUs.

On some CPUs such as ARM, the native "call" model is to save the return address
in a register and leave the task of pushing it to and popping it from RSP to the
programmer.

Of course, one thing we could do is to simply wrap all calls and returns in Dusk
into RSP push/pop operations, but that would squander a wonderful speedup
opportunity: with the register-based calling model, we can avoid one push and
one pop on each "leaf" routine call, that is, on each call to a routine that
doesn't call any other routine. That adds up to quite a lot of pushes and pops.

To grab this opportunity, the HAL has three words: "pushret,", "popret," and
"popexit,".

On "traditional" CPUs, these are hollow shells. The first two are noops and the
last one is an alias to "exit,". On ARM, these words push and pop the
return address register to and from RSP.

Words defined through "high level" mechanisms such as ":" call those words
automatically, no need to worry. However, words created with "code" don't
because they could be "leaf" words.

This means that if you create a "code" word that happens to not be a "leaf"
(that calls another word), it needs to call "pushret," as a prelude and to call
"popret," before it returns. Leaf words don't need to do that, which makes them
faster.
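
As a sketch of that rule, here is a leaf word next to a non-leaf word that calls
it. The names "inc1" and "inc2" are hypothetical. On "traditional" CPUs,
"pushret," compiles nothing and "popexit," behaves like "exit,", so the same
source stays correct on every architecture.

```forth
\ Leaf word: calls nothing, so it needs neither pushret, nor popret,
code inc1 1 i) +, exit,
\ Non-leaf word: calls inc1 twice, so it saves the return address first
code inc2 pushret, ' inc1 branchR, ' inc1 branchR, popexit,
```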

## Word marks

On some architectures (WASM, for instance), there is a strong separation between
"code" and "data" and memory areas containing executable code have to be
"marked" as such. We do so with "wordmark,". Calling this results in an
arbitrary number of bytes being written to "here" to serve as such a mark (it's
a noop on architectures not needing it).

This mark also serves as JIT status, which means that when a piece of code
changes (for example in "realias"), its word mark should be re-written.

These marks apply to
- every word on sysdict (as well as their code16 and code8 metadata)
- every location that is targeted by a CALL instruction

The "code", "code8b", "code16b" words will automatically write a word mark,
while the "entry" word will not.

## HAL API

Operand words:

W)    -- op          Indirect W register
A)    -- op          Indirect A register
S)    -- op          Indirect S register
PSP)  -- op          Indirect PSP register
RSP)  -- op          Indirect RSP register
i)    n -- op        Immediate operand. Can't use with <>)
m)    addr -- op     Absolute address
+)    op disp -- op  Apply displacement to op. Can be applied multiple times.
                     Displacement can be negative.
W>)   op -- op       Set destination to W
A>)   op -- op       Set destination to A
S>)   op -- op       Set destination to S
&)    op -- op       Dereference operand (see above)
<>)   op -- op       Direction of the operation is inverted (see above)
8b)   op -- op       Make op 8-bit
16b)  op -- op       Make op 16-bit
32b)  op -- op       Make op 32-bit (default)

Operand query words:

(W? ( op -- f )
  Yields whether "op"'s base register is W, regardless of its
  direct/displacement/invert flags. Because W is the top of stack, there's often
  special processing to do in that case.
(split  ( op -- width dst src )
  Split "op" in three components. The broad idea is that "or"-ring those
  components together will yield "op" back.
  "src" is the "heaviest" component. It includes displacement/invert flags.
  "src" is "destination-less" and "width-less" (*not* the same thing as 32-bit)
  and *cannot* be used as-is with an operation. Either "or" it back with a
  "dst/width" or re-apply explicit destination/width words on it. Conversely,
  you shouldn't "or" a component back with a "full" op, only a "split" one.
  "dst" only includes the destination register.
  "width" is a *set of flags*, not a number of bytes. To get a number of bytes,
  apply "(sz" to it.
(sz ( op -- n )
  Yields "op" width in bytes, that is, 4, 2 or 1.
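
A short sketch of how these query words could be exercised, with the expected
PS contents in comments. The values follow from the descriptions above, not
from a run on an actual kernel. "(width" is one of the macros listed in the
"HAL macros" section.

```forth
A) 8b) (split       \ PS: width dst src
or or               \ PS: op, the same value "A) 8b)" yielded
A) 16b) (sz         \ PS: op 2
A) 16b) (width (sz  \ PS: op 2 2
```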

Branching and conditions:

Z)   "Zero" flag set. On "compare,", this means "equal".
NZ)  "Zero" flag not set. On "compare," this means "not equal".
<)
<=)
>)
>=)
s<)   Signed comparison
s<=)
s>)
s>=)

C>W,       cond --
  If cond is met, W=1. Otherwise, W=0.

branch,    a -- a
  Branch to address a, yielding a "forward" address for "branch!"
branchC,   a cond -- a
  Branch to address a if condition is met, yielding "a" for "branch!"
branch!    braddr tgtaddr --
  Given "braddr" yielded by a previous "branch" instruction, change the
  reference at the address so that it targets "tgtaddr". Used for forward
  branching.
branchR,   a --
  Compile a branch to address a while at the same time setting the "return
  address" (commonly, that means pushing to RSP, but not always) to the
  instruction directly following this one. This is commonly called a "call".
branchA,   --
  Branch to the address held in the A register.
exit,      --
  Compile a return from a call.
pushret,   --
  Push the current return address to RSP (on relevant CPUs)
popret,    --
  Pop RSP in return address register (on relevant CPUs)
popexit,   --
  Equivalent to "popret, exit," but faster.
wordmark,  --
  Write a "word mark". See section above.

Instructions:

@,       op --      Read source into dest
@!,      op --      Swap dest and source
+,       op --   Z  dest + source
-,       op --   Z  dest - source
*,       op --      dest * source.
/mod,    op --      divide dest by source and put remainder in S register.
                    Can't be used with S>).
<<,      op --      dest lshift source
>>,      op --      dest rshift source
s>>,     op --      Arithmetic ("signed") shift right. Instead of filling the
                    "right" part of dest with zeroes, it fills it with its b31.
&,       op --   Z  dest and source
|,       op --   Z  dest or source
^,       op --   Z  dest xor source
@+,      op --      Read source into dest and then add 4/2/1 to operand's
                    dereferenced source. Cannot be used with m) i) &).
                    If source is the same as dest, behavior is undefined.
-@,      op --      Subtract 4/2/1 from operand's dereferenced source and then
                    read source into dest. Decrement happens before fetch,
                    hence the symbol order being the opposite of "@+,".
                    Cannot be used with m) i) &).
compare, op --   *  Compare source to dest (all flags set)
                    example: if W=1 and A=2, "A) &) compare," makes "<)"
                    condition true.
+n,      n op -- Z  Add n to source without affecting dest
                    Can't use with i) or <>)
-W,     --          W = -W

## HAL macros

The words below aren't implemented in kernels and are combinations of the
words above, but they're pretty useful nonetheless.

(src    op -- src   Same as "(split rot> 2drop"
(dst    op -- dst   Same as "(split rot 2drop"
(width  op -- width Same as "(split 2drop"
ps+,    n --        Add n to PSP
rs+,    n --        Add n to RSP

!,      op --       Write dest to source. Shortcut for "<>) @,"
!+,     op --       Equivalent to "<>) @+,". Source==dest is weird, but fine.
-!,     op --       Equivalent to "<>) -@,".
field+) op "x" --   Equivalent to "0 to' +)" with "x" input stream.
                    In other words: add the offset of the typed field to the HAL
                    operand. Doesn't work with methods.

[@+], ( op -- )
  Do an indirect fetch+increase, that is: Fetch a 32-bit address at op's src
  and perform a "@," on that address. Then, increase op's src by op's size in
  bytes.

  Cannot be used with "S)", "&)", "i)" or "<>)". Destroys the S register, but
  using it with "S>)" is fine.

[!+], ( op -- )
  Do an indirect store+increase, that is: Fetch a 32-bit address at op's src
  and perform a "!," on that address. Then, increase op's src by op's size in
  bytes.

  Cannot be used with "S)", "S>)", "&)", "i)" or "<>)". Destroys the S register.

## Examples

To give a better idea of how the HAL works, here are examples with their
corresponding i386 instructions:

PSP) @,                                ax si 0 d) mov,
A) 8b) !,                              bx 0 d) al mov,
RSP) 4 +) A>) +,                       bx sp 4 d) add,
PSP) &) A>) @!,                        bx si xchg,
PSP) <>) <<,                           cx ax mov,
                                       si 0 d) cl shl,
RSP) @+,                               ax sp 0 d) mov,
                                       sp 4 i) add,
A) 16b) !+,                            bx 0 d) 16b) ax mov,
                                       bx 2 i) add,
A) 16 +) &) @,                         ax bx 16 d) lea,
42 $1234 m) +n,                        $1234 m) 42 i) add,
42 PSP) &) +n,                         si 42 i) add,
54 i) -,                               ax 54 i) sub,

Here are actual word implementations:

code drop PSP) @+, exit,
code dup PSP) -!, exit,
code swap PSP) @!, exit,
code nip 4 ps+, exit,
code over PSP) -!, PSP) 4 +) @, exit,
code 1+ 1 i) +, exit,
code lshift PSP) <>) <<, PSP) @+, exit,
code c@ W) 8b) @, exit,
code , HERE m) A>) @, 4 HERE m) +n, A) !, PSP) @+, exit,
code not 0 i) compare, Z) C>W, exit,
code execute A) &) !, PSP) @+, branchA,

Branching:

' foo branchR, \ call "foo"
' foo branch, drop \ jump to "foo"
42 i) compare, ' foo Z) branchC, drop \ jump to "foo" if W=42
here branch, drop \ infinite loop
\ Execute code "..." only if W <= A
A) &) compare, 0 >) branchC, ... here branch!

## HAL number bank

Numbers supplied to i) m) and +) can be any number of the 32-bit range.
Nevertheless, as per HAL API constraints, all operands occupy only one PS slot.

Therein lies a problem: how can a 32-bit operand include its necessary metadata
along with a possible offset that can be anything in the 32-bit range? It does
so through a number bank mechanism.

The number bank is a global and static rolling buffer of 16 slots of 4 bytes
each. This allows us to assign arbitrary numbers to slots numbered from 0 to 15.
This slot number occupies only 4 bits in our HAL operand, which is much more
manageable.

This allows up to 16 operands associated with numbers to coexist at once on PS,
making HAL and assemblers (which piggy-back on this API) pretty macro-able.

Every kernel implements this number bank and exposes this API:

hbank' ( slot -- a )
  Get address associated to bank slot.

hbank! ( n -- slot )
  Reserve a new slot and write "n" to it. Yield the ID of the new slot.

hbank@ ( slot -- n )
  Yield number in slot.
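
A minimal round trip through this API could look like the sketch below, with
the PS contents in comments. The last line assumes that a slot is a plain
32-bit cell that can be read with "@", which the 4-byte slot size suggests but
this file doesn't state explicitly.

```forth
$12345678 hbank!  \ PS: slot
dup hbank@        \ PS: slot $12345678
swap hbank' @     \ PS: $12345678 $12345678
```

Because the buffer is rolling, a slot is presumably reused once enough further
"hbank!" calls have been made, so a number should be consumed before that
happens.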