C Versus Reality

So it looks like polls across the country, from all races (presidential, Senate, House), have swung violently towards the Democrats. And I do mean violently. The Georgia Senate race, for example, has swung 15 points over two weeks, leaving a normally solid Republican state tied between the two candidates. I guess the economic meltdown of the country isn’t a total loss.

I originally intended for my next technical post to be a continuation of the test-driven development discussion, but motivated by this thread, I’d like to take a moment and examine just how native code (the kind a C or C++ compiler generates) relates to the reality of how things work. People (who I think are idiots and should feel free to get the hell out of that thread) like to say that C++ is closer to the hardware or more low level or other similar kinds of bullshit. I want to dissect that in a more methodical, fair manner.

The basic misconception seems to be centered around the mysterious gods that C++ noobs worship, pointers. I believe that pointers are basically taught the same way to everybody: they store memory addresses of things, so you use the dereference operator to access what is actually at that memory address. This isn’t wrong, exactly. The problem is that it ignores a number of important details. C (and by extension C++) is very careful to avoid specifying any kind of detail about how the underlying memory architecture works, or saying much of anything about pointers beyond what is necessary to actually define the language behavior.

So what are pointers in C++ land? I think ToohrVyk described it quite well:

A pointer-to-X rvalue, where X is an actual first-class type, can be one of three distinct things:

1. the 'null pointer' for the type X, which represents the absence of any value. The null pointer evaluates to false in a boolean context (while all other pointers evaluate to true), and the integer constant zero evaluates to the null pointer in a pointer context.
2. an lvalue of the type X. This is the usual 'points at an object of type X'. The definition of lvalue says everything there is to know here. The lvalue and its corresponding rvalue can be accessed through dereferencing (*ptr).
3. a past-the-end pointer. Unlike the null pointer, past-the-end pointers are many, and they differ from each other through '==' comparison. They cannot be dereferenced.

Then, there's the grouping of lvalues: they are grouped in buffers containing zero or more lvalues. A pointer to an lvalue can be incremented or decremented, changing its rvalue the previous or next lvalue in the buffer if it exists, otherwise resulting in either that buffer's past-the-end pointer (if incrementing) or in undefined behaviour (if decrementing). Decrementing a past-the-end pointer yields the last lvalue in the associated buffer, or undefined behaviour if the buffer is empty. Such buffers are created every time you allocate data on the stack or heap, with pointers to the first lvalue being returned in the latter case, or obtained with &var in the former.

The matters are further complexified by the notion of memory layout compatibility, which allows one to see a buffer of X lvalues as a buffer of Y lvalues, under certain conditions of alignment, padding and size. These, I will not go into here, but they are the fundamental element behind casting structures to a buffer of bytes, or behind unions.

The usual 'pointers are addresses' works fine, as long as you consider an address to be a synonym for an lvalue or past-the-end, though it does miss on a lot of subtleties described above.
And as soon as you get the strange notion that addresses are numbers, which is almost universally inflicted upon beginners by tutorials and books, you're off course. Unlike numbers, pointers can only be compared for order in very specific cases: when they're within the same buffer. Unlike numbers, pointers cannot reliably be converted to and from numbers (though C99 has done some efforts to solve this) and can certainly not respond correctly to arithmetics on numbers. The list of discrepancies goes on. Ultimately, code such as z[1337]++; actually consists in incrementing an lvalue, not accessing a memory address and incrementing the value found there.
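
To make the three categories in the quote a bit more concrete, here is a minimal C++ sketch. The helper names (`sum_range`, `points_into`, `ordered_before`) are made up for illustration and don't come from the thread. The code only relies on what the language promises: a past-the-end pointer can be compared but not dereferenced, `==` works between any two pointers of the same type, and `std::less` gives a total order where the built-in `<` does not.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: sums a buffer by walking from its first element up to
// its past-the-end pointer. The end pointer is only compared, never dereferenced.
int sum_range(const int* first, const int* last) {
    int total = 0;
    for (const int* p = first; p != last; ++p) {
        total += *p;  // p designates an element of the buffer here
    }
    return total;
}

// Hypothetical helper: true when p designates one of the n elements of buf.
// It uses == only, so it stays well defined even when p belongs to some
// unrelated object (where the built-in < would not be specified).
bool points_into(const int* buf, std::size_t n, const int* p) {
    assert(buf != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (buf + i == p) return true;
    }
    return false;
}

// std::less is required to provide a strict total order over pointers,
// consistent with < wherever < is specified (i.e. within one buffer).
bool ordered_before(const int* a, const int* b) {
    return std::less<const int*>()(a, b);
}
```

Calling `sum_range(buf, buf + 3)` on a three-element array is fine because `buf + 3` is that buffer's past-the-end pointer; writing `*(buf + 3)` would not be.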

And that doesn’t even touch on the complexities of pointer-to-member constructs, function pointers, and other wonky details of the language. (And don’t forget that the 0 value for a pointer is symbolic and the real address assigned to it may be something else!) Even once you get past all that, you’ve got yet another hurdle: this flat memory layout is a lie. Your “memory address” could be a reference to a value in a register, or it could be replaced outright by a constant expression. And even if it is a memory address, it will be a virtual address, which could refer to main memory, a page file, memory on some other device (video card, memory mapped file), or really anything else a driver or the kernel chooses to expose. It might not be memory; it might not even exist. (Consider the case of memory mapping /dev/zero on a Linux machine.)
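
As a rough illustration of the "wonky" pointer kinds mentioned above, the following sketch uses a pointer to a data member, a pointer to a member function, and a plain function pointer. The `Point` type and the helper names are hypothetical. Note that a pointer to member does not name a location at all; it names "which member", and only becomes something you can read once it is combined with an object.

```cpp
#include <cassert>

// Hypothetical type used only for this example.
struct Point {
    int x;
    int y;
    int sum() const { return x + y; }
};

int twice(int v) { return 2 * v; }

// Pointer to data member: selects a field, but is useless without an object.
int read_field(const Point& p, int Point::*field) {
    return p.*field;
}

// Pointer to member function: also needs an object to be invoked on.
int call_method(const Point& p, int (Point::*method)() const) {
    return (p.*method)();
}

// Ordinary function pointer: refers to code, not to a data object.
int apply(int (*fn)(int), int v) {
    assert(fn != nullptr);
    return fn(v);
}
```

For example, `read_field(p, &Point::y)` reads `p.y`, and the same `&Point::y` value works with any other `Point`. How any of these are represented in memory is left to the implementation.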

That doesn’t leave us with much. Nearly all of the things you can abuse pointers for are either implementation-defined or outright illegal, and could do damn near anything outside carefully controlled circumstances. That’s probably why modern C++ code doesn’t really use pointers much anymore, preferring auto_ptr, shared_ptr, intrusive_ptr, weak_ptr, vector, or whatever standard class is most appropriate for representing a certain object. The simple fact is that if you are using pointers to any great effect in your code, you are probably Doing It Wrong, and creating opportunities for subtle and dangerous bugs to wreak havoc throughout your code. Here at Day 1, any use of raw pointers is immediately suspect and examined carefully in code reviews. (We do occasionally use and store raw pointers, but it’s really not a preferred approach.)
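
Here is a small sketch of what leaning on those classes instead of raw pointers can look like. It assumes a compiler with the C++11 `std::` versions of `shared_ptr` and `weak_ptr` (older code would get them from Boost or TR1), and the `Texture` type and function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical resource type, used only for this example.
struct Texture {
    int id;
    explicit Texture(int i) : id(i) {}
};

// A vector owns its buffer and knows its size, so there is no manual
// new[]/delete[] and no separate length to keep in sync.
std::vector<int> make_squares(int n) {
    assert(n >= 0);
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) out.push_back(i * i);
    return out;
}

// A weak_ptr observes an object without keeping it alive, and can tell
// whether the object is gone, which a raw pointer cannot.
bool still_alive(const std::weak_ptr<Texture>& observer) {
    return !observer.expired();
}

// Copying a shared_ptr shares ownership; the object is destroyed when the
// last owner goes away. Returns the owner count while both copies exist.
long owners_after_copy() {
    std::shared_ptr<Texture> a = std::make_shared<Texture>(1);
    std::shared_ptr<Texture> b = a;
    return a.use_count();
}
```

The design point is that ownership and lifetime are written down in the types, so a code review can check them without tracing every `delete`.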
