Overview

What is ELLCC Exactly?

ELLCC is a set of tools and files that let you create programs. A set of tools and files like this is often called a toolchain. The programs you create can run on a PC or on a development board like the Raspberry Pi. You run ELLCC on a host. A host is a computer that typically has a keyboard and display that you can use to write and build your programs. When you build your program, you can tell ELLCC what your target is. A target is a computer on which you’ll run your program. Often, the host and target are the same, and that’s the default way ELLCC works: you write, build, and run your program on one computer.

ELLCC can also work as a cross compiler. A cross compiler, or cross toolchain, can be used to write programs on one computer that will be deployed on another, typically a small system with limited resources. A great example is the Raspberry Pi mentioned above. While you can run ELLCC on the Pi, building large programs will take much longer than if you were to build them on a larger PC. If the host and target are both running Linux, you can even build and run the program on the host for testing and send it to the target when you’re reasonably confident that it is going to work.

Toolchain Basics

When you write a program, you’re typically writing it in what’s called a high level language. A high level language is a human readable language like C or C++ [1]. Before you can run your program on the target, it has to be translated into something that the target can understand, which is often called machine code. Machine code is a string of zeros and ones that is placed in the target’s memory and then executed. Machine code is made up of instructions, which are very low level bit patterns that the target understands. The machine code of different targets is usually very different. This means that the toolchain has to know the details of each target’s instruction set to translate the high level language into the target’s instructions.

There are two main steps that a typical toolchain uses to make an executable program out of high level program source files: compiling and linking. These steps are described in the following sections.

Compiling a Source File

The first step is to translate each source file into a relocatable object file, usually named after the source file with a .o extension, so that main.c is translated into main.o. A relocatable object file is a file containing the machine code that implements the functions that the source file contains. In addition, the relocatable object file records, by name, which functions and variables the source file defines and uses. This information is kept in a part of the .o file called the symbol table. The “relocatable” part of the name means that the file also contains information that allows the machine code to be placed anywhere in the target’s memory.

Here’s an example with hello.c:

[test@main ~]$ cat hello.c 
#include <stdio.h>

int main()
{
  printf("hello world\n");
}
[test@main ~]$ ecc -c hello.c
[test@main ~]$ 

The -c option tells ecc to compile hello.c to an object file which by default will be named hello.o. You can examine the contents of hello.o with tools that come with ELLCC. The first one we’ll try is ecc-nm, which prints out symbol table information:

[test@main ~]$ ecc-nm hello.o
0000000000000000 T main
                 U printf
[test@main ~]$ 

This shows that hello.o has two symbols in its table: main and printf. The 0000000000000000 is the offset of the symbol main in the object file and the T says main is defined and is in the text section. There are three main sections in an object file. The text section contains executable code and read only data, the data section contains writable initialized data, and the bss section contains all the variables declared in a program that are not initialized. The symbol main is in the text section since it names a function that will be executed. Notice that the printf symbol does not have an offset and is marked with a U. This means that printf is not defined by the hello.o object file but is needed by it.
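As a rough sketch of how source constructs map onto those sections, here is a small file you could compile with ecc -c and inspect with ecc-nm. The names next_id, history, banner, take_id, and show_id are made up for this illustration, and the exact letters a symbol lister prints for each one can vary with the compiler and its options.

```c
#include <stdio.h>

int next_id = 100;            /* initialized and writable: data section */
int history[8];               /* no initializer: bss section, zero at startup */
const char banner[] = "ids";  /* read only data, grouped with text in the
                                 simplified three-section view above */

/* Defined in this file, so it appears as a defined text symbol. */
int take_id(void)
{
    int id = next_id++;
    history[id % 8] = id;     /* remember the most recent id in each slot */
    return id;
}

/* Uses printf, which this file does not define: an undefined (U) symbol. */
void show_id(int id)
{
    printf("%s: %d\n", banner, id);
}
```

Since this file only uses printf and never defines it, its symbol table should list printf as undefined, just as hello.o does.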

You can see the machine code in the object file by disassembling it with ecc-objdump:

[test@main ~]$ ecc-objdump -d hello.o
hello.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	48 83 ec 10          	sub    $0x10,%rsp
   8:	48 bf 00 00 00 00 00 	movabs $0x0,%rdi
   f:	00 00 00 
  12:	b0 00                	mov    $0x0,%al
  14:	e8 00 00 00 00       	callq  19 <main+0x19>
  19:	31 c9                	xor    %ecx,%ecx
  1b:	89 45 fc             	mov    %eax,-0x4(%rbp)
  1e:	89 c8                	mov    %ecx,%eax
  20:	48 83 c4 10          	add    $0x10,%rsp
  24:	5d                   	pop    %rbp
  25:	c3                   	retq   
[test@main ~]$ 

Notice that the symbol main is at the beginning of the object file and that the instruction at offset 14 is the call to printf. The e8 is the callq opcode and the four zero bytes after it are where the relative offset to printf will be placed when it is finally known.

As mentioned earlier, the machine code for different targets is usually very different, which is why we need a cross compiler to build a program that will run on a target that is different from our host. Here’s an example of cross building hello.c for a 64-bit ARM target and running ecc-objdump on the object file:

[test@main ~]$ ecc -c hello.c -target arm64v8-linux
[test@main ~]$ ecc-objdump -d hello.o

hello.o:     file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <main>:
   0:	d10083ff 	sub	sp, sp, #0x20
   4:	a9017bfd 	stp	x29, x30, [sp, #16]
   8:	910043fd 	add	x29, sp, #0x10
   c:	90000000 	adrp	x0, 0 <main>
  10:	91000000 	add	x0, x0, #0x0
  14:	94000000 	bl	0 <printf>
  18:	2a1f03e8 	mov	w8, wzr
  1c:	b81fc3a0 	stur	w0, [x29, #-4]
  20:	2a0803e0 	mov	w0, w8
  24:	a9417bfd 	ldp	x29, x30, [sp, #16]
  28:	910083ff 	add	sp, sp, #0x20
  2c:	d65f03c0 	ret
[test@main ~]$ 
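The same source file was built for both targets without changes. When a program does need to know which target it is being compiled for, Clang-based compilers predefine macros such as __x86_64__ and __aarch64__. The sketch below assumes those macros are available; target_name is a hypothetical helper, not something ELLCC provides.

```c
/* Return a name for the target this file was compiled for.
   The choice is made at compile time, not at run time. */
const char *target_name(void)
{
#if defined(__x86_64__)
    return "x86_64";
#elif defined(__aarch64__)
    return "aarch64";
#elif defined(__arm__)
    return "arm (32 bit)";
#else
    return "some other target";
#endif
}
```

Because the preprocessor picks one branch before any machine code is generated, each object file contains only the string for its own target.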

Linking Object Files into a Program

The second step in building a program is to take all the object files and put them together to make a program. The linker does this job. The linker takes all the object files in your program, looks for any unresolved symbols, like printf in the example above, and tries to resolve them. The linker resolves symbols by first looking at all the object files in your program to see what symbols are defined and using those definitions. What happens if the linker can’t resolve all the symbols in the object files you’ve created? That’s where libraries come in.

An include file, like stdio.h in the example above, is just a file that declares things that you use in your programs so the compiler can see if you’re using them correctly. Some include files, like stdio.h, are provided by a toolchain to declare things that are available in libraries that come with the toolchain. A library is just a collection of object files that provide the definitions of functions like printf so your program can use them. When your program’s object files have unresolved references, the linker will look in one or more libraries to see if the needed symbols can be resolved. If so, the definitions in the libraries will be used to resolve the symbols.
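The split between what an include file supplies and what a library supplies can be sketched in a single file. Everything here is hypothetical: clamp_percent stands in for a library function and average_percent for your program. In a real project the three parts would live in a header, your source file, and a library, respectively.

```c
/* Header part: a declaration only. It gives the compiler the name,
   parameter types, and return type, but no machine code. */
int clamp_percent(int value);

/* Program part: compiles using nothing but the declaration above.
   On its own, this would leave clamp_percent unresolved in the .o file. */
int average_percent(int a, int b)
{
    return (clamp_percent(a) + clamp_percent(b)) / 2;
}

/* Library part: the definition the linker has to find somewhere. */
int clamp_percent(int value)
{
    if (value < 0)
        return 0;
    if (value > 100)
        return 100;
    return value;
}
```

If the definition were removed, the file would still compile with -c, and the missing function would only be reported when the program was linked.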

Most of the time you don’t run the linker directly. A common way to build a program is to use the compiler to run the linker, like this:

[test@main ~]$ ecc -c hello.c
[test@main ~]$ ecc -o hello hello.o
[test@main ~]$ ./hello
hello world
[test@main ~]$ 

When you provide .o files to the compiler it sends them to the linker with instructions on which libraries to use. You can see this by invoking ecc with the -v option:

[test@main ~]$ ecc -o hello hello.o -v
ecc version 2017-07-29 (http://ellcc.org) based on clang version 6.0.0 (trunk 309487)
Target: x86_64-ellcc-linux
Thread model: posix
InstalledDir: /home/test/ellcc/bin
 "/home/test/ellcc/bin/ecc-ld" -nostdlib -L/home/test/ellcc/libecc/lib/x86_64-linux -m elf_x86_64 --build-id --hash-style=gnu --eh-frame-hdr --gc-sections --defsym __dso_handle=42 -o hello -e _start -Bdynamic -dynamic-linker /home/test/ellcc/libecc/lib/x86_64-linux/libc.so /home/test/ellcc/libecc/lib/x86_64-linux/Scrt1.o hello.o -( -lc -lcompiler-rt -)
[test@main ~]$

Notice that the linker, ecc-ld, is invoked with a bunch of options. The -L option tells the linker where to look for the libraries for the target and the -l options tell the linker which libraries to search. -lc says to search the standard C library, where printf and all the other standard C functions reside. If you use #include files that declare functions not in the standard C library, you may have to add -l options to the ecc command line. For example, using functions from #include <curses.h> would require -lcurses at link time:

[test@main ~]$ ecc -o hello hello.o -lcurses -v
ecc version 2017-07-29 (http://ellcc.org) based on clang version 6.0.0 (trunk 309487)
Target: x86_64-ellcc-linux
Thread model: posix
InstalledDir: /home/test/ellcc/bin
 "/home/test/ellcc/bin/ecc-ld" -nostdlib -L/home/test/ellcc/libecc/lib/x86_64-linux -m elf_x86_64 --build-id --hash-style=gnu --eh-frame-hdr --gc-sections --defsym __dso_handle=42 -o hello -e _start -Bdynamic -dynamic-linker /home/test/ellcc/libecc/lib/x86_64-linux/libc.so /home/test/ellcc/libecc/lib/x86_64-linux/Scrt1.o hello.o -lcurses -( -lc -lcompiler-rt -)
[test@main ~]$ 

ecc places any -l options you give it after any object files on the command line but before the standard C library’s -lc. The placement is important because the linker processes object files and libraries in command line order in a single pass. If the curses library were placed after the standard library and curses used functions from the standard library that were not already resolved, the linker would not go back and try to resolve them. It would just report them as unresolved [2].
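The effect of command line order can be imitated with a toy model. This is only a sketch of the single pass idea, not how ecc-ld is implemented: each library member is treated as a separate input that is loaded only if it defines a symbol that is undefined at that moment, and the model ignores grouping options such as the -( and -) seen above. All type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 32
#define MAX_PER_FILE 4

/* One input on the toy link line: an object file or one library member. */
struct input {
    const char *name;
    bool from_library;                  /* library members load only on demand */
    const char *defines[MAX_PER_FILE];  /* NULL-terminated unless full */
    const char *needs[MAX_PER_FILE];    /* NULL-terminated unless full */
};

struct symtab {
    const char *name[MAX_SYMS];
    bool defined[MAX_SYMS];
    int count;
};

/* Return the index of sym in the table, or -1 if it has not been seen. */
static int lookup(const struct symtab *t, const char *sym)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->name[i], sym) == 0)
            return i;
    return -1;
}

/* Record sym; once a symbol is defined it stays defined. */
static void record(struct symtab *t, const char *sym, bool defined)
{
    int i = lookup(t, sym);
    if (i < 0) {
        if (t->count == MAX_SYMS)
            return;                     /* toy model: ignore overflow */
        i = t->count++;
        t->name[i] = sym;
        t->defined[i] = false;
    }
    if (defined)
        t->defined[i] = true;
}

/* True if this input defines a symbol that is currently undefined. */
static bool is_wanted(const struct symtab *t, const struct input *in)
{
    for (int k = 0; k < MAX_PER_FILE && in->defines[k]; k++) {
        int i = lookup(t, in->defines[k]);
        if (i >= 0 && !t->defined[i])
            return true;
    }
    return false;
}

/* Walk the inputs once, left to right; return how many symbols stay unresolved. */
int link_single_pass(const struct input *in, int n)
{
    struct symtab t = { .count = 0 };
    int unresolved = 0;

    for (int f = 0; f < n; f++) {
        if (in[f].from_library && !is_wanted(&t, &in[f]))
            continue;                   /* skipped, and never revisited */
        for (int k = 0; k < MAX_PER_FILE && in[f].defines[k]; k++)
            record(&t, in[f].defines[k], true);
        for (int k = 0; k < MAX_PER_FILE && in[f].needs[k]; k++)
            record(&t, in[f].needs[k], false);
    }
    for (int i = 0; i < t.count; i++)
        if (!t.defined[i])
            unresolved++;
    return unresolved;
}
```

Suppose hello.o needs printf and a curses function, and that curses function in turn needs malloc. With the curses member listed before the C library members, the model resolves everything. With the C library members listed first, the malloc member is skipped because nothing needs it yet, and malloc is left unresolved at the end.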

Running ecc-objdump on the executable program gives:

[test@main ~]$ ecc-objdump -d hello

hello:     file format elf64-x86-64


Disassembly of section .plt:

0000000000400430 <.plt>:
  400430:       ff 35 d2 0b 20 00       pushq  0x200bd2(%rip)        # 601008 <_GLOBAL_OFFSET_TABLE_+0x8>
  400436:       ff 25 d4 0b 20 00       jmpq   *0x200bd4(%rip)        # 601010 <_GLOBAL_OFFSET_TABLE_+0x10>
  40043c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000400440 <printf@plt>:
  400440:       ff 25 d2 0b 20 00       jmpq   *0x200bd2(%rip)        # 601018 <printf>
  400446:       68 00 00 00 00          pushq  $0x0
  40044b:       e9 e0 ff ff ff          jmpq   400430 <.plt>

0000000000400450 <__libc_start_main@plt>:
  400450:       ff 25 ca 0b 20 00       jmpq   *0x200bca(%rip)        # 601020 <__libc_start_main>
  400456:       68 01 00 00 00          pushq  $0x1
  40045b:       e9 d0 ff ff ff          jmpq   400430 <.plt>

Disassembly of section .text:

0000000000400460 <_start>:
  400460:       48 31 ed                xor    %rbp,%rbp
  400463:       48 89 e7                mov    %rsp,%rdi
  400466:       48 8d 35 13 0a 20 00    lea    0x200a13(%rip),%rsi        # 600e80 <_DYNAMIC>
  40046d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
  400471:       e8 00 00 00 00          callq  400476 <_start_c>

0000000000400476 <_start_c>:
  400476:       50                      push   %rax
  400477:       8b 37                   mov    (%rdi),%esi
  400479:       48 8d 57 08             lea    0x8(%rdi),%rdx
  40047d:       48 8d 3d 1c 00 00 00    lea    0x1c(%rip),%rdi        # 4004a0 <main>
  400484:       48 8b 0d 65 0b 20 00    mov    0x200b65(%rip),%rcx        # 600ff0 <_init>
  40048b:       4c 8b 05 66 0b 20 00    mov    0x200b66(%rip),%r8        # 600ff8 <_fini>
  400492:       45 31 c9                xor    %r9d,%r9d
  400495:       e8 b6 ff ff ff          callq  400450 <__libc_start_main@plt>
  40049a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

00000000004004a0 <main>:
  4004a0:       55                      push   %rbp
  4004a1:       48 89 e5                mov    %rsp,%rbp
  4004a4:       48 83 ec 10             sub    $0x10,%rsp
  4004a8:       48 bf c6 04 40 00 00    movabs $0x4004c6,%rdi
  4004af:       00 00 00 
  4004b2:       b0 00                   mov    $0x0,%al
  4004b4:       e8 87 ff ff ff          callq  400440 <printf@plt>
  4004b9:       31 c9                   xor    %ecx,%ecx
  4004bb:       89 45 fc                mov    %eax,-0x4(%rbp)
  4004be:       89 c8                   mov    %ecx,%eax
  4004c0:       48 83 c4 10             add    $0x10,%rsp
  4004c4:       5d                      pop    %rbp
  4004c5:       c3                      retq   
[test@main ~]$ 

Notice that the code at main has been relocated to address 4004a0 and that the linker has edited the instructions at addresses 4004a8 and 4004b4. In this example, hello has been linked to use the standard C library as a shared library. A shared library is a library that is loaded along with the program at run time by the dynamic linker. That’s why the code in the .plt section exists. If the program had been linked with the -static option, hello would have been created as a statically linked file. In that case the ecc-objdump output would have been much larger, as it would contain the code for the printf function and any functions that printf calls.
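The edit at 4004b4 can be checked by hand. A callq with opcode e8 is five bytes long, and its four-byte operand is a little-endian signed offset measured from the address of the next instruction. The sketch below does that arithmetic; read_rel32 and call_target are hypothetical helper names, and the sketch assumes the usual two's complement conversion from unsigned to signed.

```c
#include <stdint.h>

/* Decode the four little-endian bytes that follow an e8 opcode. */
int32_t read_rel32(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0]
               | (uint32_t)p[1] << 8
               | (uint32_t)p[2] << 16
               | (uint32_t)p[3] << 24;
    return (int32_t)v;
}

/* Given the address of an e8 call and its five bytes, return the
   address being called: next instruction address plus the offset. */
uint64_t call_target(uint64_t call_addr, const uint8_t *insn)
{
    return call_addr + 5 + (int64_t)read_rel32(insn + 1);
}
```

With the bytes e8 87 ff ff ff at address 4004b4, the result is 400440, the address of printf@plt. With the unlinked bytes e8 00 00 00 00 at offset 14, the result is 19, which is why the disassembly of hello.o showed callq 19 <main+0x19>.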

Other interesting code in the output is at the beginning of the .text section with the symbol _start. This is called the startup code. The startup code changes depending on the environment in which the program will be run. In this case, the startup code is pulled in from the file Scrt1.o, which is the startup code for a shared library program on Linux. It was specified on the linker command line above. In addition, the -e _start option given on the link line tells the linker to mark the _start symbol as the program entry point.

A simplified description of what the startup code is doing here is that it makes sure the stack pointer is aligned and calls __libc_start_main to initialize the C library, execute any constructors, and finally call main.
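One way to observe that work happens before main is to register a constructor. The sketch below uses __attribute__((constructor)), which is a GCC and Clang extension rather than standard C; the function and variable names are hypothetical, and exactly which startup routine runs the constructor depends on the C library.

```c
/* Set by the constructor below; stays 0 if constructors never ran. */
static int initialized_before_main;

/* The startup sequence runs functions marked this way before main. */
__attribute__((constructor))
static void early_setup(void)
{
    initialized_before_main = 1;
}

/* Callable from main (or anywhere later) to check what happened. */
int was_initialized_before_main(void)
{
    return initialized_before_main;
}
```

Calling was_initialized_before_main from main should return 1 with such compilers, because the flag is set before main is entered.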

  1. It is a stretch to call some actual programs “human readable”.
  2. This behavior is probably the result of linkers first being developed when memory and CPU power were scarce and it would have been infeasible to keep track of all the symbols that the linker saw.