Cgo(tchas)

May 9th, 2015

When working an existing C library into a Go project using Cgo, there are some C language features that are tricky to handle. This post will work through several examples, the highlights involving Cgo’s handling of C unions and variadic arguments.

Background

With both LambdaConf and Gophercon being held in Colorado for another year, I wanted to embark on a project that would combine the ideas of functional programming with Go. Eventually, I decided on working through the book Build Your Own Lisp, using the Go programming language. Because the material was written using the C programming language as a base, I thought translating the examples into Go would be an interesting test of Go’s reputation as a modern C. (As a C/C++ user during the day, the idea of Go as a replacement for C seemed a bit unusual given some of Go’s design decisions, but that is a discussion for another day.) The exercise involved a few technical speed bumps that led to the impetus for this series:

The book makes use of the Micro Parser Combinators (mpc) library to define and parse a grammar for a Lisp-like language. Because this library is written in C, there was an opportunity to use Cgo to integrate it into my Go-based project. While working with the API provided by mpc, I ran into a few C code examples that were unexpectedly tricky for Cgo to handle. Hopefully these examples will provide some useful tips for working with C and Go interoperability.

Function calls (and Go/C String conversions)

Suppose after reading through the parsing chapter of Build Your Own Lisp, you want to integrate mpc into a Go program that creates a parser for a Polish notation grammar. The first steps involve initializing the individual parsers for numbers, operators and expressions, which will be combined to create the overall “Lispy” parser.

We’ll need to use the following C API:

mpc_parser_t *mpc_new(const char *name);

… which is a function that takes in a name string, and returns a pointer to a new parser:

mpc_parser_t* Number = mpc_new("number");

To start the project, let’s create a main program loop (main.go) in the same directory that the mpc.h and mpc.c files are copied to.

As usual, start off with the package declaration:

package main

… followed by a block of Cgo-related code to import the header for mpc:

/*
#cgo LDFLAGS: -lm
#include "mpc.h"
*/
import "C"

The #cgo directive passes the linker flag -lm, which links the C math library that mpc uses in its internal workings. This line is optional on some setups, but not on others. For example, Travis CI will fail until that dependency is resolved, while my local environment is indifferent to this setting.

To make use of the mpc_new function, we need to convert a Go string into a C type that is compatible with the input parameter:

name := "Name of parser defined"
cName := C.CString(name)
defer C.free(unsafe.Pointer(cName))
return C.mpc_new(cName)

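Because C.CString copies the string into memory allocated on the C heap, which Go’s garbage collector does not manage, we need to remember to free that memory ourselves, with defer simplifying the question of when. Note that C.free requires stdlib.h to be visible in the Cgo preamble, and that the Go file needs to import unsafe.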

As we’ll be using the function mpc_new several times, wrapping the above pattern into a function will save keystrokes in the long run:

func mpcNew(name string) *C.mpc_parser_t {
	cName := C.CString(name)
	defer C.free(unsafe.Pointer(cName))
	return C.mpc_new(cName)
}

By this point, we can now initialize a pointer to an mpc parser. (example)

Abstracting Cgo usage (Undefined symbols when linking with “go run”)

If the previous example is run with “go run main.go”, you may run into issues getting the C compiler to link the mpc files:

> go run main.go
# command-line-arguments
Undefined symbols for architecture x86_64:
  "_mpc_new", referenced from:
      __cgo_2ab80fd5553f_Cfunc_mpc_new in main.cgo2.o
     (maybe you meant: __cgo_2ab80fd5553f_Cfunc_mpc_new)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Instead, you can try building an executable first, before running it:

> go build
> ./cgotchas

We can actually get “go run” working again by moving our Cgo code into its own package, which can then be imported into a main program loop. This also has the advantage of providing a generic interface to the mpc library, that we can use in other Go projects.

Here’s a rough outline of the steps:

Move the mpc header and source files into their own package location (perhaps into a folder called “mpc”).
Declare the name of the Go package.
Export any functions or types to be used outside the package.

The goal of the last step is to minimize the amount of direct contact with Cgo. For example, instead of passing a *C.mpc_parser_t around the code base, we could declare a new type that hides the Cgo usage:

type Parser *C.mpc_parser_t
func New(name string) Parser {
     // return a parser with the given name
}

Then after importing our new package:

import "github.com/<github_username>/cgotchas/mpc"

We are now able to call the mpc types and functions defined:

var number mpc.Parser
number = mpc.New("number")

In fact, now that the mpc-related Cgo code has been isolated into its own package, “go run main.go” will now work. The likely reason is that “go run main.go” compiles only the files named on the command line, so mpc.c was never built or linked in the original layout, whereas an imported package is built as a whole, C sources included.

By now, we should be ready to continue building up the interface to mpc, while minimizing exposure to C code with the rest of the project. (example)

Memory Cleanup (Handling variadic-argument functions from C)

Because C does not manage memory automatically, take care when calling any C code that may allocate memory on the heap. The mpc_new function from our library is a convenient example of this memory allocation, and has a corresponding utility function that frees the resources taken:

void mpc_cleanup(int n, ...);

Go has support for variadic argument functions too, so we may try something along these lines:

func Cleanup(p Parser) {
	C.mpc_cleanup(1, p)
}

But that would result in this compilation error:

mpc/mpc.go:22:2: unexpected type: ...

While Go has support for variadic arguments, Cgo does not support calling variadic C functions, and its documentation suggests using a C function wrapper instead. In order to use mpc_cleanup, we will need to create our own wrapper in C to call that function with a fixed number of arguments.

Let’s create a new C header file, called mpc_interface.h, with a wrapper function:

#include "mpc.h"

inline void mpc_cleanup_if(mpc_parser_t* parser) {
  mpc_cleanup(1, parser);
}

We’ll also need to edit the mpc.go file to include this new header:

/*
#cgo LDFLAGS: -lm
#include "mpc_interface.h"
*/
import "C"

Because mpc_interface.h already includes mpc.h, we don’t need to include the latter again.

Finally, we can make use of our new wrapper:

package main

import "github.com/sunzenshen/cgotchas/mpc"

func main() {
	var number mpc.Parser
	number = mpc.New("number")
	mpc.Cleanup(number)
}

The project now has a way to call mpc_cleanup with 1 argument. (example)

If any C programmers in the audience are getting agitated about a critical and missing protection, hold on a second:

C macro guards

When importing our new header file, we need to watch for accidentally duplicating the contents of that file through multiple inclusion. For example, if we import the header file twice:

/*
#cgo LDFLAGS: -lm
#include "mpc_interface.h"
#include "mpc_interface.h"
*/
import "C"

My example as is would result in the following perplexing error:

In file included from mpc/mpc.go:6:
./mpc_interface.h:3:13: error: redefinition of 'mpc_cleanup_if'
inline void mpc_cleanup_if(mpc_parser_t* parser) {
            ^
./mpc_interface.h:3:13: note: previous definition is here
inline void mpc_cleanup_if(mpc_parser_t* parser) {
            ^
1 error generated.

While your code may return a slightly different error depending on the ordering of your function definitions, the key highlight is on the redefinition of a function… at the same place it was originally defined.

The (partially inaccurate to spec) way I like to think of this scenario is to consider C code as being manipulated by another set of instructions focused on the text processing of C source files. These instructions are the C preprocessor commands, indicated by the #-symbol:

#cgo LDFLAGS: -lm
#include "mpc_interface.h"

The #cgo line is a special case: it is not a C preprocessor command at all, but a directive read by the go tool to set build flags, so my metaphor doesn’t describe it.

But the #include command can be thought of as an instruction saying:

This is a text function
That takes a “local” header name as an argument
And then inserts the entire contents of that file at this location

These instructions work towards the end goal of creating a single source listing of all the C code for a program. That is, the separation of source files is only an abstraction that helps organize a C project, but everything ends up in a single location.

(Ignoring the differences between static and dynamic linking. This story is getting shakier by the second!)

This creation of a single source file is the reason why “warning : No new line at end of file” is a thing in C, because if two different #include(s) were placed right next to each other, there’s a risk that the first instruction of the second header would be pasted onto the same line as the last line of the first header.

Back to the error example: stacking the same #include directive caused the source code for “mpc_interface.h” to be inserted twice in a row, which led to the duplicate definition that was reported.

We can use C preprocessor commands to avoid this duplication: (example)

#ifndef MPC_INTERFACE_H
#define MPC_INTERFACE_H

// The rest of the contents in mpc_interface.h

#endif

Let’s break down the preprocessor commands:

#ifndef

#ifndef looks for a definition of a preprocessor symbol MPC_INTERFACE_H. If a definition is found, skip over this entire code block terminated by #endif. Otherwise, if a definition was not found, read in the next lines of code…

#define

#define, in a sense, takes in 2 arguments: a symbol name, and what that symbol represents.

This is how constants are often created in C: every place that symbol is seen, it is replaced by the second argument.

#define ANSWER 42
// Any instance of ANSWER in this file will now be replaced with 42...

A common convention is to make symbol names all capitalized, but this is not enforced by the C compiler. The relative lack of all caps in conventional variable names makes such a naming scheme safer to use with the preprocessor.

In our example, this defines a symbol MPC_INTERFACE_H that is defined to be nothing. If we had used MPC_INTERFACE_H anywhere in the C code, that text would essentially be deleted. Preprocessor symbols persist until #undef is called on them (including when files are read or re-read), so the next time #ifndef checks this symbol, the condition will be false.

#endif

As mentioned earlier, #endif closes the conditional block initiated by #ifndef. If that condition evaluates to false, file reading will skip to after this line.

These commands avoid the error of the repeated #include instructions with the following behavior:

#include "mpc_interface.h" (first pass)

Begin reading mpc_interface.h
Look up symbol MPC_INTERFACE_H
Symbol MPC_INTERFACE_H was not found
Continue reading the incoming lines
Define symbol MPC_INTERFACE_H with nothing
Continue reading in the rest of the lines until end of file

#include "mpc_interface.h" (second pass)

Begin reading mpc_interface.h
Look up symbol MPC_INTERFACE_H
Symbol MPC_INTERFACE_H is defined
Skip until #endif
Continue reading in the rest of the lines until end of file

Maintaining wrappers for C variadic arguments

There’s a major inefficiency with this approach of creating C wrappers, as new wrapper functions need to be defined for different numbers of arguments. For example, let’s revisit the other parsers needed for the project:

number := mpc.New("number")
operator := mpc.New("operator")
expr := mpc.New("expr")
lispy := mpc.New("lispy")

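To handle 4 parsers with the mpc_cleanup function, another wrapper is needed. Since neither C nor Go supports function overloading, it either replaces the one-argument version or needs a different name: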

inline void mpc_cleanup_if
(
  mpc_parser_t* p1,
  mpc_parser_t* p2,
  mpc_parser_t* p3,
  mpc_parser_t* p4
)
{
  mpc_cleanup(4, p1, p2, p3, p4);
}

As well as a way to call it in Go:

func Cleanup(p1, p2, p3, p4 Parser) {
	C.mpc_cleanup_if(p1, p2, p3, p4)
}

This plumbing work was not that tedious for this project, as C variadic function uses were not so numerous and required only infrequent updates, much like this example.

Intermission (Hand waving turns to hand flailing)

Next I would like to cover how Go handles C’s union feature, but a good chunk of additional work is needed to get to that point.

Specifically, we need to define the grammar for our parser as this C example demonstrates:

mpca_lang(MPCA_LANG_DEFAULT,
  "                                                     \
    number   : /-?[0-9]+/ ;                             \
    operator : '+' | '-' | '*' | '/' ;                  \
    expr     : <number> | '(' <operator> <expr>+ ')' ;  \
    lispy    : /^/ <operator> <expr>+ /$/ ;             \
  ",
  Number, Operator, Expr, Lispy);

Making use of that interface involves a number of steps similar to those covered earlier. For now, I’ll leave the adaptation to Go as an exercise for the reader, as well as provide an example of such work. If you would like me to expand on this section, feel free to let me know.

Accessing the fruits of labor (from a C union)

At some point, we’ll have successfully used the mpc library to extract a structured result into the following C union:

typedef union {
  mpc_err_t *error;
  mpc_val_t *output; // void pointer, interpreted as a mpc_ast_t
} mpc_result_t;

… where depending on the result of mpc’s parsing functions, mpc_result_t holds either a pointer to a struct containing error information, or a pointer to the parsing output.

Normally, the fields of C unions are accessed in this manner:

mpc_result_t r;
// attempt to fill out the mpc result "r" with parsed output
if (mpc_parse("<stdin>", input, Lispy, &r)) { 
  // Do something with: r.output
} else {
  // Do something with: r.error
}

However, Go does not support such syntax, and if the above syntax is emulated, the compiler reports the following error:

r.output undefined (type C.mpc_result_t has no field or method output)

Instead, the contents of the union need to be accessed through casting of the pointer:

// Assuming that we have a pointer to an mpc result:
var result *C.mpc_result_t
// Accessing the error field (dereference with *errorPointer)
errorPointer := (**C.mpc_err_t)(unsafe.Pointer(result))
// Remember that mpc_val_t is a void pointer, reinterpreted as an AST
// (dereference with *astPointer)
astPointer := (**C.mpc_ast_t)(unsafe.Pointer(result))

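Why does this work? In Cgo, C unions are represented as a Go byte array with the same length as the union, which is at least as large as its largest member. Since every member of a union starts at the same address, a pointer to that byte array can be reinterpreted as a pointer to whichever member is in use.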

For example, in the following union:

union foo {
    char   c;
    int    i;
    double d;
};

… the double variable accounts for an 8 byte allocation on my machine.

Print that C union, and Go will print something like this:

// Minimum value
&[0 0 0 0 0 0 0 0]
// Maximum value
&[255 255 255 255 255 255 255 255]

Meanwhile this union is sized based on the int variable, at 4 bytes:

union foo {
    char   c;
    int    i;
};

It’s also worth pointing out that the number of members does not affect the size of the byte array representing the union.

An example demonstrating union access in a project can be found here.

Final thoughts

By this point, the chapter has been completed using Go, and the rest of the book can be tackled using the approaches we just covered.

Have thoughts or feedback regarding the explanations? Could a particular section be expanded? If so, feel free to contact me @sunzenshen on Twitter, or sunzenshen @ Gmail.
