Dusk OS C compiler

The C compiler is a central piece of Dusk OS. It's written in Forth and is loaded very early in the boot process so that it can compile the drivers it's about to use.

This compiler needs to meet two primary design goals:

  1. Be as elegant and expressive as possible in the context of a Forth, that is, be an elegant fallback to Forth's shortcomings.
  2. Minimize the work needed to port existing C applications.

It is not a design goal of this C compiler to be able to compile POSIX applications without changes. It is expected that a significant porting effort will be needed each time.

Because of the first goal, we have to diverge from ANSI C. The standard library will likely be significantly different, the macro system too. Both will hopefully fit better with Forth than their ANSI counterparts.

But because of the second goal, we do want to stay reasonably close to ANSI. The idea is that porting should be mostly mechanical and as unlikely as possible to introduce subtle logic changes.

For this reason, the core of the language is very close to ANSI C.

Other pages

Usage

There are three ways to compile C code with this compiler:

  1. cc<< ( -- )
  2. ccc<< ( -- )
  3. :c

The regular method is through "cc<<", which compiles the specified file. For example, "cc<< foo.c" reads the file "foo.c" as a unit and compiles every element in it. Functions will be added to the system dictionary unless they have the "static" storage class.

"cc<<" clears symbols, types and macros from the previous unit before it proceeds. When you don't want that, you can use "ccc<<" (continue cc<<), which doesn't clear this data.

Another method is to compile in an "inline" manner with ":c". This word reads a single "unit element" (a function definition, a global variable or a typedef) from the input stream and compiles it. It returns to normal Forth interpretation after it parses the last token of the element. Example:

:c int foo() { return 42; } foo . \ prints 42

As with "ccc<<", ":c" doesn't clear temporary data before it compiles.

Differences in the core language

Writing for DuskCC is the same as writing for another ANSI C compiler, but there are a few differences:

General

Literals

Structures, unions, typedefs, enums

Defined behavior

These are not really differences with ANSI C, but rather definition of the "implementation defined" part of C:

Planned, but not implemented yet

Calling convention

Unless the "static" modifier is used, this C compiler produces words that can be called from Forth like any other word. C code can also call upon Forth words through function prototypes and "calias" (see below).

To be able to do so, it adheres to Dusk's calling convention, that is, it passes arguments on PS. All arguments in a function signature use 4 bytes, regardless of their type. In a function signature, the leftmost argument is the top of stack and the rightmost is the deepest argument. Example:

:c int foo(int a, int b, int c) {
    return a-c;
}
1 2 3 foo . --> prints "2"
8 6 4 foo . --> prints "-4"
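
To make the argument order concrete, here is a minimal model of the convention written in standard C rather than DuskCC code. The "ps" array and the "push", "pop", "call_foo" and "forth_foo" helpers are hypothetical names invented for this sketch: they simulate PS with an array and bind the leftmost parameter to the top of stack.

```c
/* Toy model of Dusk's parameter stack (PS). All names here are
   hypothetical; the real PS lives in the Dusk runtime, not in a C array. */
int ps[16];
int psp = 0;                   /* number of cells currently on PS */

void push(int v) { ps[psp++] = v; }
int pop(void) { return ps[--psp]; }

/* Body of "int foo(int a, int b, int c)" from the example above. */
int foo_body(int a, int b, int c) {
    (void)b;                   /* unused, as in the original example */
    return a - c;
}

/* Simulates calling "foo" from Forth: the leftmost parameter is taken
   from the top of stack, the rightmost from the deepest cell. */
void call_foo(void) {
    int a = pop();             /* top of stack */
    int b = pop();
    int c = pop();             /* deepest argument */
    push(foo_body(a, b, c));   /* the single return value goes back on PS */
}

/* Equivalent of the Forth line "x y z foo": push three cells, then call. */
int forth_foo(int x, int y, int z) {
    push(x); push(y); push(z);
    call_foo();
    return pop();
}
```

In this model, "forth_foo(1, 2, 3)" plays the role of "1 2 3 foo": the last value pushed (3) becomes "a" and the first (1) becomes "c".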

Caller save

Native words don't save the registers they use. Whenever another word is called, one must assume that every register except PSP, RSP and W is trashed.

Therefore, it's important to remember that in Dusk OS, it's the caller's responsibility to save/restore registers around a call.

pspush() and pspop()

Builtin functions "void pspush(int val)" and "int pspop()" allow for direct control over PS. This gives you the ability to pop or push a variable number of arguments from/to PS.

These functions, however, mess with arguments and return values. You can't use "pspop()" if your function has arguments and you can't use "pspush()" if it has a return value. Also, "pspop()" shouldn't be used in complex expressions, only "straight". Example:

void foo() {
    int a = pspop();
    int b = pspop();
    int x = 42;
    pspush(a-b+x);
}
1 2 foo . --> prints "43"

Variable arguments

Function signatures can end with a "..." (example: "int foo(int a, int b, ...)") to indicate that they can receive a variable number of arguments. Accessing those arguments doesn't work like in ANSI C though, because they live directly on PS.

For a "straight ..." function such as "int foo(...)", you can use "pspop()" to access those arguments.
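
As a rough illustration of the "straight ..." pattern, the sketch below models it in standard C. The "pspush" and "pspop" functions here are array-backed stand-ins for the builtins, and "sum" and "demo_sum" are hypothetical. It also assumes a convention invented for this sketch: the caller pushes the values first and their count last.

```c
/* Stand-ins for DuskCC's pspush()/pspop() builtins, modeled with an
   array so that the sketch compiles as standard C. */
int ps_cells[32];
int ps_depth = 0;

void pspush(int val) { ps_cells[ps_depth++] = val; }
int pspop(void) { return ps_cells[--ps_depth]; }

/* Hypothetical "straight ..." function. Under DuskCC the signature would
   be "int sum(...)"; the model uses (void) to stay standard C.
   Assumed convention: the caller pushes the values, then their count. */
int sum(void) {
    int count = pspop();       /* top of stack: how many values follow */
    int total = 0;
    while (count > 0) {
        int v = pspop();       /* used "straight", not inside an expression */
        total = total + v;
        count = count - 1;
    }
    return total;
}

/* Equivalent of the Forth line "10 20 30 3 sum". */
int demo_sum(void) {
    pspush(10); pspush(20); pspush(30); pspush(3);
    return sum();
}
```

Each "pspop()" result is stored in a variable before being used, which matches the advice to keep "pspop()" out of complex expressions.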

Function signatures that have other arguments prior to the "..." cannot be implemented in C, only in Forth. But the interfacing works just as you'd expect. For example, "void printf(char *fmt, ...)" is implemented in Forth, but can be called as you would call it in ANSI C, for example with "printf("foo %d", 42)".

Struct alignment

As per general Dusk alignment discipline policy, it is the responsibility of the developer to align their structs. A struct resulting in unaligned memory accesses will generate an error during its definition.

A struct can have an unaligned size (end with a field that has a size that isn't a multiple of 4). When that happens, the struct will automatically align its size for array and embedding purposes. For example, an array of 10-byte structs will have 2 bytes of padding between each element. If another struct embeds this 10-byte struct, it will also add 2 bytes of padding after it.
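
The padding arithmetic can be sketched in standard C as follows. The helper names "padded_size" and "element_offset" are hypothetical, and the sketch assumes the alignment unit is 4 bytes, as the 10-byte example implies.

```c
/* Rounds a struct size up to the next multiple of 4, the padded size
   the text describes for arrays and embedding. Assumes 4-byte alignment. */
unsigned padded_size(unsigned size) {
    return (size + 3u) & ~3u;
}

/* Byte offset of element "index" in an array of structs whose unpadded
   size is "size". */
unsigned element_offset(unsigned size, unsigned index) {
    return padded_size(size) * index;
}
```

For a 10-byte struct, "padded_size(10)" is 12, so each array element is followed by 2 bytes of padding and element 3 starts at offset 36.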

Strict type matching

DuskCC will error out whenever it sees a binary operator where the two operands don't have the same type. Therefore, when combining two different types in a binary operator, explicit typecasting needs to be used.

Some types are "weak", that is, they yield to the "other" side of the binary operator. A binary operator with a "weak" side has the type of its "strong" side. These yield weak types:

Automatic type matching encourages sloppy typing of variables and explicit typecasting helps to notice type-related bugs before they arrive. With proper choice of types, the noise that this explicit typecasting represents is minimal.
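
As a small illustration of what that explicit typecasting looks like, here is a sketch that is valid standard C. It assumes that "short" and "int" count as distinct, non-weak types under DuskCC; the function "scale" is hypothetical.

```c
/* Assumes "short" and "int" are distinct, non-weak types under DuskCC.
   Standard C would also accept "count * factor" without the cast. */
int scale(short count, int factor) {
    return (int)count * factor;   /* both operands are now "int" */
}
```

The cast costs a few characters, but it makes the widening visible at the point where the two types meet.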

The #define pre-processor directive

The pre-processor allows you to define text expansion macros which can then be used directly in C code. You define a macro with #define:

#define FOO bar

The #define directive reads the next identifier, considers it the macro name, and then reads the rest of the line and associates it verbatim with that name.

Then, the tokenizer, whenever it encounters an identifier, checks if it corresponds to a macro name before sending it to the parser. If it does, that name is not sent to the parser. Instead, the content of the associated macro is processed through lib/macro (see doc/lib/macro) and fed back into the tokenizer as if that content had textually replaced the macro name. Parameters to feed to the macro must follow the macro name exactly as described in doc/lib/macro. Therefore:

#define RET return %<;
int foo() { RET "42" }

is exactly equivalent to:

int foo() { return 42; }

The #const pre-processor directive

The #const directive works a bit like #define, except that it takes a line of Forth source code (like with #define, it's always one line). It interprets it immediately and expects this code to have a ( -- n ) signature. Then, it creates a constant with the specified name and assigns it this value. Example usage:

#const FOO 20 1+ <<

In addition to being a bit faster (in terms of compile speed) than #define because the value is yielded at pre-processing time, it allows us to go "fetch" values living in the Forth world from within C source.

Effectively, there is no other straightforward way to fetch a Forth constant from within a C unit. For example, let's say you want to have a "LF" constant like in Forth, what would you do? This:

#const LF LF

The created constant behaves exactly the same as an enum constant. In fact, they live in the same dictionary so a const will shadow an enum of the same name and vice-versa.

The #forth pre-processor directive

The #forth pre-processor directive executes the following line of Forth source as is, right now.

Because this C compiler generates binary code for the C code it parses right as it encounters it (with statements being its granularity level), this allows you to insert some fancy code generation yourself. In short, if it's not in the middle of a statement/expression and its ";" character, it's probably alright to insert #forth there. Go crazy!

You have to make sure that the stack effect of your Forth line is ( -- ) or you're going to break the compiling process.

The #include directive

The #include directive opens the file described by the following path (no quoting) and interprets it as C code as if it was inlined. The line

#include /foo/bar.h

is effectively the same as:

#forth ccc<< /foo/bar.h

".h" files in Dusk OS are used in a manner similar to other OSes, that is, they contain structure and function definitions to be shared across multiple C units.

The #include directive doesn't do any kind of auto-loading of the associated unit. Therefore, before you #include a header file with the intention of calling function prototypes defined in it, you must ensure that the associated unit has been loaded.

Function prototypes

A function prototype is a function without a body. Examples:

static int foo(short a, char b);
uint max(uint a, uint b);

There are two types of prototypes, static and non-static, each with a completely different usage.

Static prototypes are for forward declarations. When you declare one, memory wide enough for a jump is allocated. Then, when the real function is declared, it writes a jump to itself in that reserved space.

The static attribute of the prototype is not carried to the implementation function. You can very well have a prototype (which uses "static") that is a forward declaration to a non-static implementation.
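
A typical use of a forward declaration is mutual recursion. The sketch below is standard C with hypothetical function names; it pairs a static prototype with an implementation that omits "static", as described above. Note one difference: in standard C the function keeps internal linkage in that case, while under DuskCC, per the text, the implementation is non-static.

```c
/* Static prototype: forward declaration of a function defined later. */
static int is_odd(int n);

/* Expects n >= 0. */
int is_even(int n) {
    if (n == 0) return 1;
    return is_odd(n - 1);      /* call through the forward declaration */
}

/* The implementation omits "static", as the text says it may. */
int is_odd(int n) {
    if (n == 0) return 0;
    return is_even(n - 1);
}
```

Without the prototype, "is_even" could not reference "is_odd" because it hasn't been defined yet at that point in the unit.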

The non-static prototype allows C to call a word from the system dictionary. By default, C has no visibility into the Forth dictionary because Forth words don't have C function signatures. When you declare a non-static function prototype, DuskCC looks into the system dictionary for a word of that name and links that symbol to the found word.

If the word is not found, it is not an error... yet. It's possible that this function prototype is part of a header file that is included by the unit that is about to implement it. In this case, DuskCC silently creates a broken function reference. If that broken function is then referenced before the actual function is defined, then it becomes an error.

You can only link words that have a signature compatible with C, that is: 0 or 1 return value.

Be aware that if you link words that do fancy things like shrinking PS or modifying RS, you are on risky ground and you should know what you are doing. The best approach in these situations is to proxy the word as "void()" and use "pspop()" and "pspush()" for argument passing.

Also, note that CC's lib [comp/c/lib] already proxies quite a few system words.

The #calias directives family

When a function prototype needs to interface with a Forth word whose name can't be expressed as a C identifier, we have to resort to creating a proxy Forth word for it, which is tedious (and pollutes the Forth namespace).

To alleviate this, we have the "#calias" directive. It allows the creation of a prototype linking to a Forth word of a different name. Like "alias", it first takes the name of the target word, but instead of being followed by the name of the word to create, it's followed by a C function prototype.

For example, if you wanted to interface "0-9?" in C, you would do:

#calias 0-9? int isdigit(char c);

You can also alias the "8b" or "16b" versions of a word with #calias16 and #calias8.

If you want to target a word in a namespace, you can use #caliasns:

#caliasns MyNS :myword int myns_myword(int a);

Symbols, types and macro visibility and lifetime

The C compiler creates four kinds of artifacts: types, symbols, constants, macros.

Types are what is created by "struct" and "typedef". Those artifacts bind a name to type information. Once they're created, the following C code can use these names to refer to these types.

Symbols are declarations of functions and variables at different offsets. A variable declared outside a function in a C unit is a global variable and generates a Forth word that acts like a "create" word: it yields the variable's address.

A function declaration also generates a Forth word that calls into the generated function.

If you don't want to generate a Forth word for your declaration, begin the declaration with the "static" keyword. Then, the function or variable will only be available to C code.

Constants are what is created by #const and enum {}. They are not exposed as Forth words.

Macros are what is created by #define. They are not exposed as Forth words.

Except for base types (int, uint, etc.), all of those artifacts are cleared at the beginning of a "cc<<" call. Therefore, if you want to share structures and function prototypes across separate C units, you have to put them in a separate header file and #include them.

Note that clearing occurs at the beginning of cc<<, allowing you to fiddle with your artifacts with Forth code after having parsed a C unit. You can also have C units share the same "universe" by loading them with ccc<< instead of cc<<. The ":c" word doesn't clear anything either.

Creating a type doesn't create a Forth word, so by default, types are invisible to Forth. It is possible, however, to export a type to a Forth structure with ":export ( self -- )". This word generates fields [doc/usage/bind] for each of the struct's members. You'll typically want to wrap an :export into "struct[ ... ]struct". The code looks like this:

struct[ MyType S" MyType" findType CType :export ]struct

After that, a "MyType" struct with the same fields as the C type is available to Forth.

Symbols declared in function bodies are local variables and are cleared at the end of the function body. During symbol lookups, local symbols are searched before all global symbols, so a global symbol cannot shadow a local one.
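
The lookup order matches what ANSI C does, so it can be shown with a short standard C sketch. The names "level", "read_global" and "read_local" are hypothetical.

```c
int level = 1;                 /* global symbol */

int read_global(void) {
    return level;              /* no local named "level": finds the global */
}

int read_local(void) {
    int level = 2;             /* local symbol, found first during lookup */
    return level;
}
```

The local "level" disappears at the end of "read_local", and the global one is left untouched.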

Speed considerations

This C compiler is a simple one. It doesn't try to second guess the code you write, so the binary code it will generate is rather predictable. It does resolve constant expressions at compile time, but it doesn't do the more complex "0 + a = a", "1 * a = a" analysis. This means that some C idioms will generate inefficient code.

For example, "i++" is inherently slower than "++i" because a copy of "i" has to be kept in a register before the increment is done. If you don't need the result of the expression, the first form is wasteful.
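
In practice, this means preferring the prefix form wherever the result is discarded. A minimal sketch in standard C, with hypothetical function names:

```c
/* Counts loop iterations using the prefix form, since the results of
   the increment expressions are discarded. */
int count_steps(int n) {
    int steps = 0;
    int i;
    for (i = 0; i < n; ++i) {  /* "++i": no copy of the old value needed */
        ++steps;
    }
    return steps;
}

/* The postfix form is still the right tool when the old value is used. */
int take_then_advance(int *cursor) {
    return (*cursor)++;        /* yields the value before the increment */
}
```

Both forms leave the variable with the same final value; only the value of the expression itself differs.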

List of known idioms and their faster replacement:

It's a good idea to keep those idioms in mind, but it's not worth going over the top either. The penalties are small.