I’ve been eyeing Rust for about a year now. Here and there, I tried to use it to make a silly little program, or to implement some simple function in it to see for myself how ergonomic it really was, and what sort of machine code rustc spit out. But last weekend I found a need for a tool to clean up some preprocessor mess, and so instead of hacking together some combination of shell and Python, I decided to write it in Rust.
From my earlier attempts, I knew that there are a lot of different “pointers” but I found all the descriptions of them lacking or confusing. Specifically, Rust calls itself a systems programming language, yet I found no clear description of how the different pointers map to C—the systems programming language. Eventually, I stumbled across The Periodic Table of Rust Types, which made things a bit clearer, but I still didn’t feel like I truly understood.
During my weekend expedition to Rust land, I think I’ve grokked things enough to write this explanation of how Rust does things. As always, feedback is welcomed.
I’ll describe what happens in terms of C. To keep things simple, I will:
- assume that you are well-versed in C
- assume that you can read Rust (any intro will teach you enough)
- not bother with const for the C snippets
- not talk about mutability
In the following text, I assume that we have some struct T. The actual contents don’t matter. In other words:
struct T {
/* some members */
};
With that out of the way, let’s dive in!
*const T and *mut T
These are raw pointers. In general, you shouldn’t use them since only unsafe code can dereference them, and the whole point of Rust is to write as much safe code as possible.
Raw pointers are just like what you have in C. If you make a pointer, you end up using sizeof(struct T *) bytes for the pointer. In other words:
struct T *ptr;
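To check this from the Rust side, here is a minimal sketch. It assumes the 100-byte T used in the summary at the end of this post; the helper names raw_pointer_size and write_five are made up for illustration.

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct T {
    stuff: [u8; 100],
}

/// Size of a raw pointer to T, in bytes.
fn raw_pointer_size() -> usize {
    size_of::<*const T>()
}

/// Writes through a raw pointer, much like `*p = 5;` in C.
/// The caller is trusted to pass a valid pointer, as in C.
fn write_five(p: *mut usize) {
    // Dereferencing a raw pointer is only allowed inside `unsafe`.
    unsafe { *p = 5 };
}

fn main() {
    // A raw pointer to T is one machine word, no matter how big T is.
    assert_eq!(raw_pointer_size(), size_of::<usize>());
    assert_eq!(size_of::<*mut T>(), size_of::<usize>());

    let mut n: usize = 0;
    write_five(&mut n); // `&mut usize` coerces to `*mut usize`
    assert_eq!(n, 5);
    println!("raw pointer: {} bytes", raw_pointer_size());
}
```

Creating and passing the raw pointer needs no unsafe block; only the dereference does.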
&T and &mut T
These are borrowed references. They use the same amount of space as raw pointers and behave the exact same way in the generated machine code. Consider this trivial example:
#[no_mangle]
pub fn raw(p: *mut usize) {
unsafe {
*p = 5;
}
}
#[no_mangle]
pub fn safe(p: &mut usize) {
*p = 5;
}
A rustc invocation later, we have:
raw()
raw: 55 pushq %rbp
raw+0x1: 48 89 e5 movq %rsp,%rbp
raw+0x4: 48 c7 07 05 00 00 movq $0x5,(%rdi)
00
raw+0xb: 5d popq %rbp
raw+0xc: c3 ret
safe()
safe: 55 pushq %rbp
safe+0x1: 48 89 e5 movq %rsp,%rbp
safe+0x4: 48 c7 07 05 00 00 movq $0x5,(%rdi)
00
safe+0xb: 5d popq %rbp
safe+0xc: c3 ret
Note that the two functions are bit-for-bit identical.
The only differences between borrowed references and raw pointers are:
- references will never point at bogus addresses (i.e., they are never NULL or uninitialized),
- the compiler doesn’t let you do arbitrary pointer arithmetic on references,
- the borrow checker will make you question your life choices for a while.
(The last one gets better over time.)
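A small sketch of the "same thing as a raw pointer" claim, again assuming a 100-byte T. The helper address_of is a name I made up for this example.

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct T {
    stuff: [u8; 100],
}

/// Returns the address held by a shared reference, as an integer.
fn address_of(r: &T) -> usize {
    // Casting a reference to a raw pointer is safe; only dereferencing
    // the resulting raw pointer would need `unsafe`.
    r as *const T as usize
}

fn main() {
    // References take the same space as raw pointers.
    assert_eq!(size_of::<&T>(), size_of::<*const T>());
    assert_eq!(size_of::<&mut T>(), size_of::<*mut T>());

    let t = T { stuff: [0; 100] };
    let p: *const T = &t;
    // The reference and the raw pointer carry the same address.
    assert_eq!(address_of(&t), p as usize);
}
```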
Box<T>
These are owned “pointers”. If you are a C++ programmer, you are already familiar with them. Never having truly worked with C++, I had to think about this a bit until it clicked, but it is really easy.
No matter what all the documentation and tutorials out there say, Box<T> is not a pointer but rather a structure containing a pointer to heap-allocated memory just big enough to hold T. The heap allocation and freeing are handled automatically. (Allocation is done in the Box::new function, while freeing is done via the Drop trait, but that’s not relevant as far as the memory layout is concerned.) In other words, Box<T> is something like:
struct box_of_T {
struct T *heap_ptr;
};
Then, when you make a new box, you end up putting only what amounts to sizeof(struct T *) on the stack and it magically starts pointing to somewhere on the heap. In other words, Rust code like this:
let x = Box::new(T { ... });
is roughly equivalent to:
struct box_of_T x;
x.heap_ptr = malloc(sizeof(struct T));
if (!x.heap_ptr)
oom();
*x.heap_ptr = ...;
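To see the "struct holding a heap pointer" view from Rust, this sketch pulls the pointer out of a box and puts it back. It assumes the 100-byte T from the summary; round_trip is a hypothetical helper, not anything from the standard library.

```rust
use std::mem::size_of;

struct T {
    stuff: [u8; 100],
}

/// Moves a boxed T out to a raw heap pointer and back again.
fn round_trip(b: Box<T>) -> Box<T> {
    // `into_raw` hands over the heap pointer without freeing it.
    let heap_ptr: *mut T = Box::into_raw(b);
    // Safety: `heap_ptr` came from `Box::into_raw` just above and has
    // not been freed or turned back into a Box anywhere else.
    unsafe { Box::from_raw(heap_ptr) }
}

fn main() {
    // Only a pointer's worth of space lives on the stack.
    assert_eq!(size_of::<Box<T>>(), size_of::<*mut T>());

    let x = Box::new(T { stuff: [7; 100] });
    let x = round_trip(x);
    assert_eq!(x.stuff[99], 7);
    // `x` goes out of scope here and Drop frees the heap allocation.
}
```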
&[T] and &mut [T]
These are borrowed slices. This is where things get interesting. Even though they look like plain references (which, as stated earlier, translate into simple C-style pointers), they are much more. These types of references use fat pointers, that is, a combination of a pointer and a length.
struct fat_pointer_to_T {
struct T *ptr;
size_t nelem;
};
This is incredibly powerful, since it allows bounds checking at runtime and getting a subset of a slice is essentially free!
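Here is a rough sketch of both points. To keep the values readable I use u32 elements instead of T, and parts is a helper name I invented. I only check the total size of the fat pointer, since I am not sure the order of its two fields is guaranteed.

```rust
use std::mem::size_of;

/// Splits a slice reference into its two pieces: data pointer and length.
fn parts(s: &[u32]) -> (*const u32, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    // Two words: a pointer plus an element count.
    assert_eq!(size_of::<&[u32]>(), 2 * size_of::<usize>());

    let data = [10u32, 20, 30, 40, 50];
    let whole: &[u32] = &data;
    let middle = &whole[1..4];

    let (whole_ptr, whole_len) = parts(whole);
    let (mid_ptr, mid_len) = parts(middle);
    assert_eq!(whole_len, 5);
    assert_eq!(mid_len, 3);
    // Sub-slicing only bumps the pointer and shrinks the length.
    assert_eq!(mid_ptr as usize, whole_ptr as usize + size_of::<u32>());

    // Bounds are checked at run time against the stored length.
    assert_eq!(middle.get(2), Some(&40));
    assert_eq!(middle.get(3), None);
}
```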
&[T; n] and &mut [T; n]
These are borrowed references to arrays. They are different from borrowed slices. Since the length of an array is a compile-time constant (the compiler will yell at you if n is not a constant), it is part of the type itself. Any bounds check can compare against that constant, and checks on constant indices can be performed statically. Therefore, there is no need to pass around the length in a fat pointer, and these references are passed around as plain ol’ pointers.
struct T *ptr;
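A quick sketch of the difference, using u32 elements for readability. The sum function is just an illustrative consumer of a slice.

```rust
use std::mem::size_of;

/// Accepts any slice; the length arrives at run time in the fat pointer.
fn sum(s: &[u32]) -> u32 {
    s.iter().sum()
}

fn main() {
    // Thin pointer: the length 3 lives in the type, not next to the pointer.
    assert_eq!(size_of::<&[u32; 3]>(), size_of::<usize>());

    let arr = [1u32, 2, 3];
    let r: &[u32; 3] = &arr;

    // Passing it where a slice is expected makes the compiler attach
    // the length, turning the thin pointer into a fat one.
    assert_eq!(sum(r), 6);
    assert_eq!(r[2], 3);
}
```

The conversion only goes one way for free: getting back from &[u32] to &[u32; 3] needs a run-time length check.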
T, [T; n], and [T]
While these aren’t pointers, I thought I’d include them here for completeness’s sake.
T
Just like in C, a struct uses as much space as its type requires (i.e., sum of the sizes of its members plus padding).
[T; n]
Just like in C, an array of structs uses n times the size of the struct.
[T]
The simple answer here is that you cannot make a [T]. That actually makes perfect sense when you consider what that type means. It is saying that we have some variable-sized slice of memory that we want to access as elements of type T. Since it is variable-sized, the compiler cannot possibly reserve space for it at compile time, and so we get a compiler error.
The more complicated answer involves the Sized trait, which I’ve skillfully managed to avoid thus far and so you are on your own.
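For what it’s worth, here is a small sketch of the three cases, assuming the 100-byte T from the summary. slice_bytes is a helper name I made up; it shows that the size of a [T] can only be asked of a value at run time, through a reference.

```rust
use std::mem::{size_of, size_of_val};

#[allow(dead_code)]
#[derive(Clone, Copy)]
struct T {
    stuff: [u8; 100],
}

/// Size in bytes of the elements a slice reference points at.
fn slice_bytes(s: &[T]) -> usize {
    // `[T]` has no compile-time size, so ask the value at run time.
    size_of_val(s)
}

fn main() {
    assert_eq!(size_of::<T>(), 100);
    assert_eq!(size_of::<[T; 2]>(), 200);

    // `let x: [T];` does not compile; a `[T]` is only reachable
    // through some kind of pointer, such as `&[T]`.
    let backing = [T { stuff: [0; 100] }; 3];
    assert_eq!(slice_bytes(&backing[..]), 300);
    assert_eq!(slice_bytes(&backing[..1]), 100);
}
```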
Summary
That was a lot of text, so I decided to compact it and make the following table. In the table, I assume that our T struct is 100 bytes in size. In other words:
/* Rust */
struct T {
stuff: [u8; 100],
}
/* C */
struct T {
uint8_t stuff[100];
};
Now, the table in its full glory:
| | Rust | C | Size on ILP32/LP64 (bytes) |
|---|---|---|---|
| Value | let x: T; | struct T x; | 100/100 |
| Raw pointer | let x: *const T; let x: *mut T; | struct T *x; | 4/8 |
| Reference | let x: &T; let x: &mut T; | struct T *x; | 4/8 |
| Box | let x: Box<T>; | struct box_of_T { struct T *heap_ptr; }; struct box_of_T x; | 4/8 |
| Array of 2 | let x: [T; 2]; | struct T x[2]; | 200/200 |
| Reference to an array of 2 | let x: &[T; 2]; | struct T *x; | 4/8 |
| A slice | let x: [T]; | struct T x[]; | unknown at compile time |
| A reference to a slice | let x: &[T]; | struct fat_ptr_to_T { struct T *ptr; size_t nelem; }; struct fat_ptr_to_T x; | 8/16 |
A word of caution: I assume that the sizes of the various pointers are actually implementation details and shouldn’t be relied on to be that way. (The exception is raw pointers: if their size weren’t fixed, FFI would be unnecessarily complicated.)
I didn’t cover str, &str, String, and Vec<T> since I don’t consider them fundamental types, but rather convenience types built on top of slices, structs, references, and boxes.
Anyway, I hope you found this useful. If you have any feedback (good or bad), let me know.