Dusk OS C compiler

The C compiler is a central piece of Dusk OS. It's written in Forth and is loaded very early in the boot process so that it can compile the drivers it's about to use.

This compiler needs to meet two primary design goals:

Be as elegant and expressive as possible in the context of a Forth, that is, be an elegant fallback to Forth's shortcomings.
Minimize the work needed to port existing C applications.

It is not a design goal of this C compiler to be able to compile POSIX code without changes.

Because of the first goal, we have to diverge from ANSI C. The standard library is much closer to Dusk OS than to ANSI C. Porting anything related to I/Os involves rethinking.

But because of the second goal, we do want to stay reasonably close to ANSI. The idea is that porting should be a mostly mechanical effort, as little prone as possible to subtle logic changes. Purely logical code can therefore mostly be ported as is or with purely mechanical changes. This property matters because it minimizes the risk that a porting effort introduces subtle bugs that are hard to pinpoint.

Other pages

Implementation details [doc/comp/c/impl]
Standard Library [doc/comp/c/lib]

Usage

There are three ways to compile C code with this compiler:

cc<< ( -- )
ccc<< ( -- )
:c

The regular method is through "cc<<", which compiles the specified file. For example, "cc<< foo.c" reads the file "foo.c" as a unit and compiles every element in it. Functions will be added to the system dictionary unless they have the "static" storage type.

"cc<<" clears symbols, types and macros from the previous unit before it proceeds. In some cases, you don't want that. In this case, you can use "ccc<<" (continue cc<<) which doesn't clear this data.

Another method is to compile in an "inline" manner with ":c". This word reads a single "unit element" (a function definition, a global variable or a typedef) from the input stream and compiles it. It returns to normal Forth interpretation after it parses the last token of the element. Example:

:c int foo() { return 42; } foo . \ prints 42

As with "ccc<<", ":c" doesn't clear temporary data before it compiles.

Differences in the core language

Writing for DuskCC is mostly the same as writing for another ANSI C compiler, but there are a few differences:

General

The C preprocessor is different, see below.
No 64-bit types.
No long, redundant with int.
No double, float is always 32b.
No "signed" or "unsigned" keywords. char/short/int are signed, uchar/ushort/uint are unsigned.
Variable arguments work differently, see below.
The keyword "static" has a slightly different meaning. See below.
No multi-dimensional arrays.
No "const" attribute.
Type matching is strict. See below.
Elements larger than 32 bits cannot be used as return values or arguments.
Pointers to arrays are unrepresentable. Declarations such as int (*x)[42] (with which the CC can do nothing fancy anyway) won't work. Use int **x instead. Arrays of pointers (int *x[42]) do work, however.
This C compiler has "immediate" superpowers. See below.
Name clashes are resolved exactly as in Forth: the newer identifier shadows the old one.

Literals

Number literals are the same as in Dusk OS, so 12345, $1234 and 'X'. As a special case, the "0x" prefix is supported as a "$" alternative. No octal support. No '\n' '\r' '\0' '\t' char literals.
String literals are not null-terminated, but "counted strings", in the exact same format as system strings [doc/usage/lit]. You can create a null-terminated string literal by appending a "0" next to the closing quote, for example "hello"0. This 0 shifts the address of the literal by 1 (to skip the count) and adds an extra 0 at the end.
String literal parsing behaves like the Forth one, which means that it supports \n \r \0 \", but not \t.
The list literal {..., ..., ...} is, in a way, more flexible than in regular C and in another way, less flexible. It can be used in any expression, not just initialization. For example, "return {42, 54, $1234}[idx];" works just fine. This list, however, needs to be static. You can include global symbol references, but not arguments or local variables.
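
The memory layout described above for a literal such as "hello"0 can be modeled in standard C (not DuskCC). This is only a sketch: the function name make_counted_z is hypothetical, and the one-byte count (hence the 255 character limit) is an assumption inferred from the "shifts the address by 1" wording.

```c
#include <stddef.h>
#include <string.h>

/* Builds [count][chars...][0] in buf and returns a pointer just past the
 * count byte, which is usable as a null-terminated string.
 * Returns NULL if the text is too long or buf is too small.
 * Assumption: the count occupies a single byte. */
const char *make_counted_z(unsigned char *buf, size_t bufsize, const char *text)
{
    size_t len = strlen(text);
    if (len > 255 || bufsize < len + 2)
        return NULL;
    buf[0] = (unsigned char)len;     /* count byte of the counted string */
    memcpy(buf + 1, text, len);      /* the characters themselves */
    buf[1 + len] = 0;                /* extra 0 added by the trailing "0" */
    return (const char *)(buf + 1);  /* address shifted by 1 to skip the count */
}
```

With "hello" as input, buf[0] holds 5 and the returned pointer compares equal to "hello" under strcmp().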

Structures, unions, typedefs, enums

"struct MyStruct {...};" automatically creates a "typedef" to the struct.
The "struct" keyword can't be used to reference structs, only to define them.
No bit fields. Not worth their complexity weight.
"enum" is not a type, is always anonymous and has no other purpose than to declare the constants it contains. In other words, "enum" is a shortcut for repeated "#const".
"enum" can only be declared at unit top level.

Defined behavior

These are not really differences with ANSI C, but rather definitions of the "implementation defined" parts of C:

char is always 8 bit, short is always 16 bit, int is always 32 bit.
Doing ">>" on signed operands does an arithmetic shift right, that is, the vacated bits on the left are filled with copies of bit 31 instead of 0 (which is what happens on unsigned ">>").
struct fields are never padded. See below.
in an "enum {}" each explicit value sets the "running" counter. In "enum { FOUR=4, FIVE }", FIVE is 5. That counter implicitly begins at 0.
Typecasting to a smaller type performs explicit masking of the underlying value.
Typecasting to a larger type when the origin type is signed performs sign extension on the value.
The "&&", "||" and "?:" operators perform short-circuit evaluation. That is, they jump over code that cannot change the outcome of the expression.
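
The shift and typecast rules above can be modeled in portable standard C (not DuskCC), without relying on the host compiler's own implementation-defined behavior. This is a sketch only: the names to_i32, asr32, mask8 and sext16 are hypothetical helpers, not part of Dusk OS.

```c
#include <stdint.h>

/* Reinterprets 32 bits as a two's-complement signed value. */
static int32_t to_i32(uint32_t u)
{
    return (u & 0x80000000u) ? -(int32_t)(~u & 0x7FFFFFFFu) - 1 : (int32_t)u;
}

/* Arithmetic shift right: bit 31 is copied into the vacated high bits.
 * n must be in the range 0..31. */
int32_t asr32(int32_t v, unsigned n)
{
    uint32_t u = (uint32_t)v;
    uint32_t shifted = u >> n;            /* logical shift, zeros come in */
    if ((u & 0x80000000u) && n > 0)
        shifted |= ~(0xFFFFFFFFu >> n);   /* refill the top n bits with ones */
    return to_i32(shifted);
}

/* Typecast to a smaller (8-bit) type: the value is masked. */
uint32_t mask8(uint32_t v)
{
    return v & 0xFFu;
}

/* Typecast from a signed 16-bit type to a 32-bit type: sign extension. */
int32_t sext16(uint32_t v)
{
    uint32_t low = v & 0xFFFFu;
    if (low & 0x8000u)
        low |= 0xFFFF0000u;
    return to_i32(low);
}
```

For instance, asr32(-8, 1) yields -4, mask8(0x1234) yields 0x34 and sext16(0xFFFE) yields -2.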

Planned, but not implemented yet

float
All global variables are effectively "static". No "extern" mechanism yet.
Compile time string literal concatenation
struct initialization with "{}" literal

Calling convention

Unless the "static" modifier is used, this C compiler produces words that can be called from Forth like any other word. C code can also call upon Forth words through function prototypes and "calias" (see below).

To be able to do so, it adheres to Dusk's calling convention, that is, it passes arguments on PS. All arguments in a function signature use 4 bytes, regardless of their type. In a function signature, the leftmost argument is the top of stack and the rightmost is the deepest argument. Example:

:c int foo(int a, int b, int c) {
    return a-c;
}
1 2 3 foo . --> prints "2"
8 6 4 foo . --> prints "-4"

The return value is pushed to the stack. A function with a "void" return value doesn't push anything to the stack. Thus, "int foo(int x)" has the signature "( n -- n )" and "void foo(int x)" has the signature "( n -- )".
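
The argument order of the "foo" example above can be modeled in standard C (not DuskCC) with an explicit array standing in for PS. This is a sketch under assumptions: ps_push, ps_pop, foo_word and run_foo are hypothetical names, and the model does no overflow or underflow checking.

```c
#define PS_DEPTH 16

static int ps[PS_DEPTH];   /* stand-in for the parameter stack */
static int ps_count = 0;   /* number of items currently on it */

void ps_push(int v) { ps[ps_count++] = v; }
int  ps_pop(void)   { return ps[--ps_count]; }

/* Model of "int foo(int a, int b, int c) { return a-c; }" seen as a word. */
void foo_word(void)
{
    int a = ps_pop();   /* leftmost argument: top of stack */
    int b = ps_pop();
    int c = ps_pop();   /* rightmost argument: deepest */
    (void)b;            /* unused, as in the original example */
    ps_push(a - c);     /* the return value is pushed back */
}

/* Equivalent of typing "x y z foo" at the Forth prompt. */
int run_foo(int x, int y, int z)
{
    ps_push(x);
    ps_push(y);
    ps_push(z);
    foo_word();
    return ps_pop();
}
```

run_foo(1, 2, 3) gives 2 and run_foo(8, 6, 4) gives -4, matching the two lines of the example.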

Caller save

In general, words don't save registers they use. Whenever another word is called, one must assume that every register except PSP, RSP and W is trashed.

Therefore, it's important to remember that in Dusk OS, it's the caller's responsibility to save/restore registers around a call.

Variable arguments

Function signatures can end with a "..." (example: "int foo(int a, int b, ...)") to indicate that they can receive a variable number of arguments. Accessing those variables doesn't work like in ANSI C though because those arguments live directly on PS.

For a "straight ..." function such as "int foo(...)", you can use "pspop()" from [doc/comp/c/lib] to access those arguments.

Function signatures that have other arguments prior to the "..." cannot be implemented in C, only in Forth. But the interfacing works just as you'd expect. For example, "void printf(char *fmt, ...)" is implemented in Forth, but can be called as you would call it in ANSI C, for example with "printf("foo %d", 42)".
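
To illustrate what "arguments live directly on PS" means for a straight "..." function, here is a model in standard C (not DuskCC). Everything in it is an assumption made for illustration: model_pspush and model_pspop are local stand-ins (the real "pspop()" is documented in [doc/comp/c/lib]), and the convention that the top of stack holds the argument count is a hypothetical protocol chosen for this sketch.

```c
#define MODEL_PS_DEPTH 16

static int model_ps[MODEL_PS_DEPTH];   /* stand-in for PS */
static int model_ps_count = 0;

void model_pspush(int v) { model_ps[model_ps_count++] = v; }
int  model_pspop(void)   { return model_ps[--model_ps_count]; }

/* Model of a hypothetical "int sum(...)": pops a count, then that many
 * values, and returns their total. No underflow checking is done. */
int sum_model(void)
{
    int n = model_pspop();
    int total = 0;
    while (n-- > 0)
        total += model_pspop();
    return total;
}
```

Pushing 30, 20, 10 and then the count 3 before calling sum_model() returns 60 and leaves the model stack empty.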

Struct alignment

As per the general Dusk alignment policy [doc/usage/mem], it is the responsibility of the developer to align their structs. A struct resulting in unaligned memory accesses will generate an error during its definition.

A struct can end with a field that has a size that isn't a multiple of 4. When that happens, the struct will automatically align its size. The "sizeof()" of "struct { int a; short b; }" is 8, not 6.
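
The size rule above (no padding between fields, total rounded up to a multiple of 4) can be sketched in standard C (not DuskCC). The function name struct_size is hypothetical, and the sketch does not model the alignment error check.

```c
#include <stddef.h>

/* Sums the field sizes without any padding between fields, then rounds
 * the total up to the next multiple of 4. */
size_t struct_size(const size_t *field_sizes, size_t nfields)
{
    size_t raw = 0;
    for (size_t i = 0; i < nfields; i++)
        raw += field_sizes[i];
    return (raw + 3) & ~(size_t)3;
}
```

For fields of 4 and 2 bytes (an int followed by a short), the raw size is 6 and the function returns 8.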

Strict type matching

DuskCC will error out whenever it sees a binary operator where the two operands don't have the same type. Therefore, when combining two different types in a binary operator, explicit typecasting needs to be used.

Some types are "weak", that is, they yield to the "other" side of the binary operator. A binary operator with a "weak" side has the type of its "strong" side. These yield weak types:

Integer literals (uint)
Pointer subtraction (int)
void*

Automatic type matching encourages sloppy typing of variables and explicit typecasting helps to notice type-related bugs before they arrive. With proper choice of types, the noise that this explicit typecasting represents is minimal.
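
The matching rule can be summarized by a small decision function, sketched here in standard C (not DuskCC). It is not the compiler's actual code: the Type and Operand definitions and the name binop_type are hypothetical, and the case where both sides are weak with different types is not covered by the text above, so this sketch conservatively reports it as an error.

```c
typedef enum {
    T_CHAR, T_SHORT, T_INT, T_UCHAR, T_USHORT, T_UINT,
    T_CHARPTR, T_VOIDPTR, T_ERROR
} Type;

typedef struct {
    Type type;
    int weak;   /* nonzero for literals, pointer subtraction and void* */
} Operand;

/* Returns the type of "l op r", or T_ERROR on a type mismatch. */
Type binop_type(Operand l, Operand r)
{
    if (l.type == r.type)
        return l.type;      /* same type on both sides: no conflict */
    if (l.weak && !r.weak)
        return r.type;      /* the weak left side yields to the right */
    if (r.weak && !l.weak)
        return l.type;      /* the weak right side yields to the left */
    return T_ERROR;         /* two different strong types (or, by this
                               sketch's assumption, two different weak ones) */
}
```

A short variable combined with an integer literal gives a short, while a short combined with an int is an error until one side is typecast.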

The #define pre-processor directive

The #define directive allows you to define text expansion macros which can then be used directly in C code:

#define FOO bar

The #define directive reads the next identifier, considers it the macro name, and then reads the rest of the line and associates it verbatim with that name.

The tokenizer, whenever it encounters an identifier, checks if it corresponds to a macro name before sending it to the parser. If it does, that name is not sent to the parser. Instead, the content of the associated macro is processed through [doc/lib/macro] and fed back into the tokenizer as if that content had textually replaced the macro name. Parameters to feed to the macro must follow the macro name exactly as described in [doc/lib/macro]. Therefore:

#define RET return %<;
int foo() { RET 42 }

is equivalent to:

int foo() { return 42; }

For clarity, an opening parenthesis can directly (no whitespace allowed) follow the macro name. If it does, then it must be accompanied by a matching closing parenthesis. There's a caveat though: you must not forget that macro arguments are parsed using "wordorquote", which means it doesn't follow C tokenization rules. If your last macro argument is unquoted, it must be followed by a whitespace. For example, 'RET 42' above could also be written as 'RET(42 )' or 'RET("42")'.

The #const pre-processor directive

The #const directive works a bit like #define, except that it takes a Forth line of source code (like with #define, it's always one line). It interprets it immediately and expects this code to have a ( -- n ) signature. Then, it creates a constant with the specified name and assigns it this value. Example usage:

#const FOO 20 1+ <<

In addition to being a bit faster (in terms of compile speed) than #define because the value is yielded at pre-processing time, it allows us to go "fetch" values living in the Forth world from within C source.

Effectively, there is no other straightforward way to fetch a Forth constant from within a C unit. For example, let's say you want to have a "LF" constant like in Forth, what would you do? This:

#const LF LF

The created constant behaves exactly the same as an enum constant.

The #forth pre-processor directive

The #forth pre-processor directive executes the following line of Forth source as is, right now.

Because this C compiler generates binary code for the C code it parses right as it encounters it (with statements being its granularity level), this allows you to insert some fancy code generation yourself. In short, as long as you're not between the start of a statement/expression and its terminating ";" character, it's probably alright to insert a #forth line with some assembler in it. Go crazy!

You have to make sure that the stack effect of your Forth line is ( -- ) or you're going to break the compiling process.

At the moment, the way DuskCC generates code is not formalized and documented, making the writing of such inline assembler a bit difficult, but such formalization will come soon enough, making this feature reliable.

The #include directive

The #include directive opens the file described by the following path (no quoting) and interprets it as C code as if it was inlined. The line

#include /foo/bar.h

is effectively the same as:

#forth ccc<< /foo/bar.h

".h" files in Dusk OS are used in a manner similar to other OSes, that is, they contain structure and function definitions to be shared across multiple C units.

The #include directive doesn't do any kind of auto-loading of the associated unit. Therefore, before you #include a header file with the intention of calling function prototypes defined in it, you must ensure that the associated unit has been loaded.

The #if/#else/#endif directives

The C compiler can skip parts of the code it feeds from, regardless of its syntax, based on the value of a Forth expression. You do that with the #if directive.

When a #if is encountered, the remaining content of the line is parsed as a Forth expression with the expected signature ( -- f ).

If the expression yields 0, the following code will be ignored until a #else or a #endif is encountered.

If the expression is not zero, we parse normally. If a #else is encountered, we skip until a #endif is encountered.

Simple, right? Yeah it is, but there's a caveat. The "skipping" part completely ignores tokenization and parsing rules. Anything can be in there, it's going to be ignored. This means that if your #else and #endif live inside a /* */ comment, it's going to be effective!

As with any # directive, however, we still make sure that it is preceded by a LF character, so we limit the possibilities of the "accidental" #else or #endif to beginnings of lines.
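
The skipping caveat can be reproduced with a small line-based model in standard C (not DuskCC). This is a sketch under assumptions: the name if_filter is hypothetical, the #if condition is passed in already evaluated, nested #if blocks are not handled, and the kept part is copied verbatim with directives recognized only at the beginning of a line.

```c
#include <stddef.h>
#include <string.h>

static int starts_with(const char *line, const char *word)
{
    return strncmp(line, word, strlen(word)) == 0;
}

/* `body` is the text following an #if line. Copies into `out` the lines
 * that are not skipped, stopping at #endif. Lines that would overflow
 * `out` are dropped. Returns the number of bytes written. */
size_t if_filter(const char *body, int cond, char *out, size_t outsize)
{
    int keeping = (cond != 0);
    size_t n = 0;
    const char *line = body;

    if (outsize == 0)
        return 0;
    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) + 1 : strlen(line);
        if (starts_with(line, "#endif")) {
            break;
        } else if (starts_with(line, "#else")) {
            keeping = (cond == 0);   /* the other branch becomes active */
        } else if (keeping && n + len < outsize) {
            memcpy(out + n, line, len);
            n += len;
        }
        line += len;
    }
    out[n] = '\0';
    return n;
}
```

Given a body where "#else" sits at the start of a line inside a /* */ comment, a false condition makes the model resume copying right after that "#else", leaving a stray "*/" in the output, which is the kind of accident the caveat warns about.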

Function prototypes

A function prototype is a function without a body. Examples:

static int foo(short a, char b);
uint max(uint a, uint b);

There are two types of prototypes, static and non-static, each with a completely different usage.

Static prototypes are for forward declarations. When you declare one, memory wide enough for a jump is allocated. Then, when the real function is declared, it writes a jump to itself in that reserved space.

The static attribute of the prototype is not carried to the implementation function. You can very well have a prototype (which uses "static") that is a forward declaration to a non-static implementation.

The non-static prototype is to allow C to call a word from the system dictionary. By default, C has no visibility to the Forth dictionary because Forth words don't have C function signatures. When you declare a non-static function prototype, the CC looks into the system dictionary for a word of that name and links that symbol to the found word.

If the word is not found, it is not an error... yet. It's possible that this function prototype is part of a header file that is included by the unit that is about to implement it. In this case, DuskCC silently creates a broken function reference. If that broken function is then referenced before the actual function is defined, then it becomes an error.
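
The idea of a reserved slot that is filled in once the real function is defined can be illustrated with a function pointer in standard C (not DuskCC). This is only an analogy: DuskCC is described as writing a jump into reserved memory, not as using a pointer, and the names fwd_slot, call_fwd, twice and define_twice are hypothetical.

```c
#include <stddef.h>

typedef int (*fn_t)(int);

/* Models the space reserved when the prototype is declared. */
static fn_t fwd_slot = NULL;

/* Callers always go through the slot. Sets *ok to 0 and returns 0 when
 * the slot has not been resolved yet. */
int call_fwd(int arg, int *ok)
{
    if (fwd_slot == NULL) {
        *ok = 0;
        return 0;
    }
    *ok = 1;
    return fwd_slot(arg);
}

static int twice(int x) { return 2 * x; }

/* Models the real definition writing a jump to itself into the slot. */
void define_twice(void)
{
    fwd_slot = twice;
}
```

Code written against the slot before define_twice() runs does not need to change afterwards: the same call_fwd(21, &ok) starts returning 42 once the slot is filled.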

You can only link words that have a signature compatible with C, that is: 0 or 1 return value.

Be aware that if you link words that do fancy things like shrinking PS or modifying RS, you are on risky ground and should know what you are doing. The best approach in these situations is to proxy the word as "void()" and use "pspop()" and "pspush()" for argument passing.

Also, note that CC's lib [comp/c/lib] already proxies quite a few system words.

The #calias directives family

When a prototype function wants to interface a Forth word whose name can't be expressed as a C identifier, we have to resort to creating a proxy Forth word for it, which is tedious (and pollutes the Forth namespace).

To alleviate this, we have the "#calias" directive. It allows the creation of a prototype linking to a Forth word of a different name. Like "alias", it first takes the name of the target word, but instead of being followed by the name of the word to create, it's followed by a C function prototype.

For example, if you wanted to interface "0-9?" in C, you would do:

#calias 0-9? int isdigit(char c);

You can also alias the "8b" or "16b" versions of a word with #calias8/#calias16.

If you want to target a word in a namespace, you can use #caliasns:

#caliasns MyNS :myword int myns_myword(int a);

Immediate superpowers with #immediate

This C compiler can have immediate function signatures with the #immediate directive used like #calias. When it compiles a function call to such a signature, that function will be called immediately, at compile time.

The arguments and return value of such words are "Symbol" structures [comp/c/sym.fs]. Such a function can receive only 0 or 1 argument (more than that and the argument has already been compiled as an "argpush", making it incompatible with immediateness). It must always return a result (a Symbol).

Therefore, the signature of such a word is either "( -- res )" or "( arg -- res )". Argument count is checked by the compiler.

The types in the C signature of such a word have no meaning and will typically be just "int" (the "Symbol" struct isn't exposed to C).

Implementing such words is a bit messy at the moment as it requires using an API (that is, [comp/c/sexpr.fs]) that isn't documented or stable. Also, because of the way code generation works, expressions used as arguments of a compile time function might generate spurious (but harmless) code. For example:

if (n < nbelem(a->b))

will generate code for the "a->b" part (which, by the way, is generated only because of an underoptimization; soon enough, this will not generate anything).

You can look at examples of such immediate words in [comp/c/lib.fs].

Symbols, types and macro visibility and lifetime

The C compiler creates four kinds of artifacts: types, symbols, constants, macros.

Types are what is created by "struct" and "typedef". Those artifacts bind a name to type information. Once they're created, the following C code can use these names to refer to these types.

Symbols are declarations of functions and variables stored in a particular place. A variable declared outside a function in a C unit is a global variable and generates a Forth word that acts like a "create" word: it yields the variable's address.

A function declaration also generates a Forth word that calls into the generated function.

If you don't want to generate a Forth word for your declaration, begin the declaration with the "static" keyword. Then, the function or variable will only be available to C code.

Constants are what is created by #const and enum {}. They are not exposed as Forth words.

Macros are what is created by #define. They are not exposed as Forth words.

Except for base types (int, uint, etc.), all of those artifacts are cleared at the beginning of a "cc<<" call. Therefore, if you want to share structures and function prototypes across separate C units, you have to put them in a separate header file and #include them.

Note that clearing occurs at the beginning of cc<<, allowing you to fiddle with your artifacts with Forth code after having parsed a C unit. You can also have C units share the same "universe" by loading them with ccc<< instead of cc<<. The ":c" word doesn't clear anything either.

Creating a type doesn't create a Forth word, so by default, types are invisible to Forth. It is possible, however, to export a type to a Forth structure with ":export ( self -- )". This word generates fields [doc/usage/bind] for each of the struct's members. You'll typically want to wrap an :export into "struct[ ... ]struct". The code looks like this:

struct[ MyType " MyType" findType CType :export ]struct

After that, a "MyType" struct with the same fields as the C type is available to Forth.

Symbols declared in function bodies are local variables and are cleared at the end of the function body. During lookup, they are searched before all global symbols, so a global symbol cannot shadow a local one.

Speed considerations

This C compiler is a simple one. It doesn't try to second guess the code you write, so the binary code it will generate is rather predictable. It does resolve constant expressions at compile time, but it doesn't do the more complex "0 + a = a", "1 * a = a" analysis. This means that some C idioms will generate inefficient code.

For example, "i++" is inherently slower than "++i" because a copy of "i" has to be kept in register before the increase is done. If you don't need the result of the expression, the first form is wasteful.
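
The extra copy that "i++" requires can be made visible by spelling both forms out as functions in standard C. This is an illustration of the semantics only, not of the code DuskCC emits; the names post_increment and pre_increment are hypothetical.

```c
/* "i++": the old value has to be kept aside so that it can be the
 * result of the expression. */
int post_increment(int *i)
{
    int old = *i;
    *i = old + 1;
    return old;
}

/* "++i": the new value is the result, so no copy is needed. */
int pre_increment(int *i)
{
    *i = *i + 1;
    return *i;
}
```

Starting from i = 5, post_increment(&i) returns 5 and leaves i at 6, and a following pre_increment(&i) returns 7.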

List of known idioms and their faster replacement:

"i++;" --> "++i;"
"a = b + c; return a + d;" --> "return b + c + d;"
"if (x!=0)" --> "if (x)"
"1 + x + 2" --> "x + 1 + 2"

It's a good idea to keep those idioms in mind, but it's not worth going over the top either. The penalties are small.