TLDR
Porting a segmented list from C to Rust and sharing the process. Both are written from scratch and without any dependencies:
```c
Allocator *a = bump_init(1024, 0);
LIST_TYPE(size_t);
LIST_size_t list = LIST_new(size_t);
size_t count = 64 * 1000 * 1000;
for (size_t i = 0; i < count; i++) {
  LIST_append(&list, a, i);
}
ASSERT(list.len == count, "list not of correct length");
CALL(a, destroy);
free(a);
```
to
```rust
let mut list = SegmentedList::new();
let count = 64 * 1000 * 1000;
for i in 0..count {
    list.push(i);
}
assert_eq!(list.len(), count);
```
I needed to replace my growing list of three different dynamic array implementations in my bytecode interpreter: one for Nodes, one for runtime Values and one for bytecode. Building the AST and the bytecode required a lot of malloc, realloc and memcpy. I hacked a workaround together by introducing a segmented bump allocator backed by mmap, omitting most of the syscalls, but the memcpy still remained.
So I wrote (and debugged) a generic segmented list in C, unifying those implementations, with the following features:
- zero realloc
- zero copy
- lazy allocation on grow
- backed by a bump allocator
I now want to port this to Rust and compare semantics, implementation, performance and how pretty they look.
This isn’t a “C sucks because it’s unsafe and Rust is super duper great because it borrow checks and it’s safe” article, but rather a “I love C but lately I want to use Rust to help me get out of the pain of debugging my lexer, parser and compiler interacting with my segmented bump allocator backed by mmap, my handrolled generic segmented list and everything around that”.
Segmented Lists vs Dynamic Arrays
Vectors suck at growing a lot, because:
- They have to allocate a new and larger block
- They have to copy their contents into the new space
- They require "updating" all references into the previous space so they point at the new space (see the sketch at the end of this section)
- For large mem::size_of::<T>() / sizeof(T), copies are costly and move a lot of memory
On the other hand, segmented lists suck due to:
- The underlying storage being non-contiguous
- Indexing requiring addition, subtraction and bitshifts for finding the segment and its offset
The tradeoff is yours to make. In my C benchmarks with many large AST nodes, segmented lists beat dynamic arrays by 2-4x.
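To make the reference-invalidation point concrete, here is a small sketch of my own (not from the original code base) showing that Vec's backing storage can move when it grows, so any raw pointer taken earlier may dangle; a segmented list never moves existing segments:

```rust
fn main() {
    let mut v: Vec<u64> = Vec::with_capacity(4);
    v.extend(0..4);
    let before = v.as_ptr();

    // One element more than the capacity forces a reallocation: the contents
    // are copied into a new, larger block and the old block is freed.
    v.push(4);
    let after = v.as_ptr();

    // The backing storage may have moved, so any raw pointer taken before the
    // push could now dangle. Segmented lists avoid exactly this.
    println!("storage moved: {}", before != after);
}
```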
Design
As the name implies, the list consists of segments. Each segment's size is that of its predecessor multiplied by an implementation-defined growth factor, often 1.5 or 2. This results in geometric growth, reducing the number of allocations (and thus syscalls) while the segments grow progressively larger.
The first segment size is also implementation-defined, but often chosen to be 8; a power of two is, as always, favorable for aligned access and storage. Segments are lazily allocated when they are about to be used. There is no deallocation per element, only per segment, at least in this example. The main upside of segmented lists is their zero-copy growth: where other dynamic array implementations often have to move the previously stored elements, a segmented list never does, which also gives it stable pointers.
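As a quick sanity check of the geometric growth (my own sketch, not part of the implementation below), the total capacity after \(n\) segments is \(s_0 (\lambda^n - 1)\) for \(\lambda = 2\):

```rust
// Capacity after n segments with first segment size s0 and growth factor 2:
// s0 + 2*s0 + 4*s0 + ... + s0 * 2^(n-1) = s0 * (2^n - 1)
fn capacity_after(s0: usize, segments: u32) -> usize {
    s0 * ((1usize << segments) - 1)
}

fn main() {
    // with s0 = 8: one segment holds 8 elements, two hold 24, four hold 120
    assert_eq!(capacity_after(8, 1), 8);
    assert_eq!(capacity_after(8, 2), 24);
    assert_eq!(capacity_after(8, 4), 120);
}
```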
Indexing
Since a segmented list is based on a group of segments containing its elements, indexing isn't as easy as incrementing a pointer by sizeof(T). Instead we have to compute the segment and the offset for any given index \(i\) into the list. Since a list starts with a default size \(s_0\) and grows at a geometric rate \(\lambda\), we can compute the segment / block index with \(b(i)\), the segment start with \(S(b)\) and the offset into said segment with \(o(i)\).
Therefore, for our segmented list, given \(s_0 := 8\), \(\lambda := 2\) (I use 2, since geometric growth means fewer syscalls for fast-growing lists; the equations hold for any \(\lambda > 1\)) and an index \(i := 18\), we can calculate:
$$ \begin{align} b(i) &= \left\lfloor \log_{\lambda} \left( \frac{i}{s_0} + 1 \right) \right\rfloor \\ b(18)&= \left\lfloor \log_{2} \left( \frac{18}{8} + 1 \right) \right\rfloor \\ &= \left\lfloor \log_{2} \left( 3.25 \right) \right\rfloor \\ &= \left\lfloor 1.7004 \right\rfloor \\ &= \underline{1} \\ S(b) &= s_0 \cdot (\lambda^{b} - 1) \\ S(1) &= 8 \cdot (2^{1} - 1) \\ &= \underline{8} \\ o(i) &= i - S(b(i)) \\ o(18)&= 18 - 8 \\ &= \underline{10} \\ \end{align} $$

Thus, attempting to index at position \(18\) requires us to access the segment at position \(1\) with an inner offset of \(10\).
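The same calculation in code, as a small sketch of my own (the C and Rust implementations below use the equivalent most-significant-bit trick instead of a floating point \(\log_2\)):

```rust
// b(i) = floor(log2(i/s0 + 1)) for λ = 2, S(b) = s0 * (2^b - 1), o(i) = i - S(b(i))
fn block_index(i: usize, s0: usize) -> usize {
    // integer equivalent of floor(log2(i/s0 + 1)) for s0 a power of two
    (i / s0 + 1).ilog2() as usize
}

fn segment_start(b: usize, s0: usize) -> usize {
    s0 * ((1usize << b) - 1)
}

fn main() {
    let (s0, i) = (8usize, 18usize);
    let b = block_index(i, s0);
    let o = i - segment_start(b, s0);
    assert_eq!((b, o), (1, 10)); // matches the worked example above
}
```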
C Implementation
This is my starting point, whipped up in an afternoon. Please keep in mind: this is my first allocator in any sense, I'm still fiddling around, and thus far asan, valgrind, gcc, clang and my unit tests tell me this is fine.
Bump allocator
In purple garden all allocators are based on the Allocator "interface" to allow for dynamically replacing the bump allocator with other allocators, for instance a garbage collector.
```c
// mem.h
typedef struct {
  size_t current;
  size_t allocated;
} Stats;

// CALL is used to emulate method calls by calling a METHOD on SELF with
// SELF->ctx and __VA_ARGS__, this is useful for interface interaction, such as
// Allocator, which reduces alloc_bump.request(alloc_bump.ctx, 64); to
// CALL(alloc_bump, request, 64), removing the need for passing the context in
// manually
#ifdef VERBOSE_ALLOCATOR
#include <stdio.h>
#define CALL(SELF, METHOD, ...)                                                \
  (fprintf(stderr, "[ALLOCATOR] %s@%s::%d: %s->%s(%s)\n", __FILE__, __func__,  \
           __LINE__, #SELF, #METHOD, #__VA_ARGS__),                            \
   (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__))
#else
#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)
#endif

// Allocator defines an interface abstracting different allocators, so the
// runtime of the virtual machine does not need to know about implementation
// details, can be used like this:
//
// #define ALLOC_HEAP_SIZE 1024
// Allocator *alloc_bump = bump_init(ALLOC_HEAP_SIZE, ALLOC_HEAP_SIZE*2);
//
// size_t some_block_size = 16;
// void *some_block = alloc_bump->request(alloc_bump->ctx, some_block_size);
//
// alloc_bump->destroy(alloc_bump->ctx);
//
typedef struct {
  // Allocator::ctx refers to an internal allocator state and owned memory
  // areas, for instance, a bump allocator would attach its meta data (current
  // position, cap, etc) here
  void *ctx;

  // Allocator::stats is expected to return the current statistics of the
  // underlying allocator
  Stats (*stats)(void *ctx);
  // Allocator::request returns a handle to a block of memory of size `size`
  void *(*request)(void *ctx, size_t size);
  // Allocator::destroy cleans state up and deallocates any owned memory areas
  void (*destroy)(void *ctx);
} Allocator;

Allocator *bump_init(uint64_t min_size, uint64_t max_size);
```
The segmented bump allocator itself is of course pretty simple: allocate a page-aligned block, hand out memory by bumping a pointer and, once the current block runs out of space, allocate the next one:
```c
#define _GNU_SOURCE
#include "common.h"
#include "mem.h"
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BUMP_MIN_START 4096
// geometric series result for max amount of bytes fitting into uint64_t used to
// count the totally allocated bytes; amounts to something like (2^64)-
#define BUMP_MAX_BLOCKS 55
#define BUMP_GROWTH 2

// BumpResize allocator header
//
// The bump allocator is implemented as such, so a "regrow" (needing the next
// block) doesn't invalidate all previously handed out pointers. ALWAYS zero all
// handed out memory yourself
typedef struct {
  // the current block we are in, max is BUMP_MAX_BLOCKS
  uint64_t pos;

  // the size of the current allocated block
  uint64_t size;

  // the amount of bytes in the current block in use
  uint64_t len;

  // the max amount the bump alloc should grow to
  uint64_t max;

  // kept for Allocator->stats
  uint64_t total_used;
  uint64_t total_allocated;

  // List of blocks the bump allocator uses to hand out memory
  void *blocks[BUMP_MAX_BLOCKS];
  uint64_t block_sizes[BUMP_MAX_BLOCKS];
} BumpCtx;

void *bump_request(void *ctx, size_t size) {
  BumpCtx *b_ctx = ctx;
  size_t align = sizeof(void *);
  uint64_t aligned_pos = (b_ctx->len + align - 1) & ~(align - 1);

  if (b_ctx->max > 0) {
    ASSERT(b_ctx->total_allocated < b_ctx->max,
           "Bump allocator exceeded max_size");
  }

  if (aligned_pos + size > b_ctx->size) {
    ASSERT(b_ctx->pos + 1 < BUMP_MAX_BLOCKS, "Out of block size");
    uint64_t new_size = b_ctx->size * BUMP_GROWTH;

    void *new_block = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    ASSERT(new_block != MAP_FAILED, "Failed to mmap new block");

    b_ctx->blocks[++b_ctx->pos] = new_block;
    b_ctx->block_sizes[b_ctx->pos] = new_size;
    b_ctx->size = new_size;
    b_ctx->len = 0;
    aligned_pos = 0;
    b_ctx->total_allocated += new_size;
  }

  void *ptr = (char *)b_ctx->blocks[b_ctx->pos] + aligned_pos;
  b_ctx->total_used += (aligned_pos - b_ctx->len) + size;
  b_ctx->len = aligned_pos + size;
  return ptr;
}

Allocator *bump_init(uint64_t min_size, uint64_t max_size) {
  BumpCtx *ctx = malloc(sizeof(BumpCtx));
  ASSERT(ctx != NULL, "failed to alloc bump allocator context");
  *ctx = (BumpCtx){0};
  ctx->size = min_size < BUMP_MIN_START ? BUMP_MIN_START : min_size;
  ctx->max = max_size;
  void *first_block = mmap(NULL, ctx->size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  ASSERT(first_block != MAP_FAILED, "Failed to mmap initial block");
  ctx->blocks[0] = first_block;
  ctx->total_allocated = ctx->size;

  Allocator *a = malloc(sizeof(Allocator));
  ASSERT(a != NULL, "failed to alloc bump allocator");
  *a = (Allocator){0};
  a->ctx = (void *)ctx;
  a->destroy = bump_destroy;
  a->request = bump_request;
  a->stats = bump_stats;

  return a;
}
```
Deallocation is done by unmapping each allocated block.
```c
void bump_destroy(void *ctx) {
  ASSERT(ctx != NULL, "bump_destroy on already destroyed allocator");
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  for (size_t i = 0; i <= b_ctx->pos; i++) {
    if (b_ctx->blocks[i]) {
      munmap(b_ctx->blocks[i], b_ctx->block_sizes[i]);
      b_ctx->blocks[i] = NULL;
    }
  }
  free(ctx);
}
```
With the interface and this implementation, all functions in the purple garden code base that allocate take an Allocator *a (or Allocator *alloc) and request memory with CALL(a, request, 1024). For instance in the virtual machine, when creating a new space for the globals:
```c
vm.globals = CALL(alloc, request, (sizeof(Value) * GLOBAL_SIZE));
```
List macros and C “Generics”
My first step was to make use of the macro system to create a “generic” container:
```c
#include "mem.h"
#include "strings.h"
#include <string.h>

#define LIST_DEFAULT_SIZE 8
#define LIST_BLOCK_COUNT 24

#define LIST_TYPE(TYPE)                                                        \
  typedef struct {                                                             \
    TYPE **blocks;                                                             \
    size_t len;                                                                \
    size_t type_size;                                                          \
  } LIST_##TYPE

#define LIST_new(TYPE)                                                         \
  ({                                                                           \
    LIST_##TYPE l = {0};                                                       \
    l.type_size = sizeof(TYPE);                                                \
    l;                                                                         \
  })
```
These macros are then used to define and create a list via:
```c
LIST_TYPE(char);
LIST_char cl = LIST_new(char);
```
Now we need a way to interact with the list; for this we need append, get and insert (plus their non-bounds-checking equivalents), allowing for:
```c
Allocator *a = bump_init(1024, 0);
LIST_TYPE(char);
LIST_char cl = LIST_new(char);
LIST_append(&cl, a, 'H');
LIST_append(&cl, a, 'E');
LIST_append(&cl, a, 'L');
assert(LIST_get(&cl, 0) == 'H');
assert(LIST_get(&cl, 1) == 'E');
assert(LIST_get(&cl, 2) == 'L');
LIST_insert_UNSAFE(&cl, 1, 'O');
assert(LIST_get_UNSAFE(&cl, 1) == 'O');
```
All of these macros require the index into the block and the block itself. ListIdx and idx_to_block_idx do exactly that:
```c
struct ListIdx {
  // which block to use for the indexing
  size_t block;
  // the idx into said block
  size_t block_idx;
};

struct ListIdx idx_to_block_idx(size_t idx);
```
The implementation of idx_to_block_idx works in constant time and is therefore a bit harder to read than one working in linear time, because it exploits the geometric series. It does however follow Indexing, just with a few changes to omit heavy computations (\(\log_2\), etc.).
```c
inline __attribute__((always_inline, hot)) struct ListIdx
idx_to_block_idx(size_t idx) {
  if (idx < LIST_DEFAULT_SIZE) {
    return (struct ListIdx){.block_idx = idx, .block = 0};
  }

  size_t adjusted = idx + LIST_DEFAULT_SIZE;
  size_t msb_pos = 63 - __builtin_clzll(adjusted);

  // This optimizes the block index lookup to be constant time
  //
  //  block 0 size = LIST_DEFAULT_SIZE
  //  block 1 size = LIST_DEFAULT_SIZE*2
  //  block 2 size = LIST_DEFAULT_SIZE*4
  //  block 3 size = LIST_DEFAULT_SIZE*8
  //
  // The starting index of each block is a geometric series:
  //
  //  s(i) = LIST_DEFAULT_SIZE * (2^i - 1)
  //
  // We solve for i, so the following stands:
  //
  //  s(i) <= idx < s(i+1)
  //
  //  2^i - 1 <= idx / LIST_DEFAULT_SIZE < 2^(i+1) - 1
  //  idx / LIST_DEFAULT_SIZE + 1 >= 2^i
  //
  // Thus: adding LIST_DEFAULT_SIZE to idx shifts the series so the msb of
  // idx + LIST_DEFAULT_SIZE corresponds to the block number
  //
  // Visually:
  //
  //  Global index: 0 1 2 3 4 5 6 7 | 8 9 10 ... 23 | 24 25 ... 55 | 56 ...
  //  Block:        0               | 1             | 2            | 3 ...
  //  Block size:   8               | 16            | 32           | 64 ...
  //
  //  idx + LIST_DEFAULT_SIZE:
  //    0+8=8   -> MSB pos 3 -> block 0
  //    7+8=15  -> MSB pos 3 -> block 0
  //    8+8=16  -> MSB pos 4 -> block 1
  //   23+8=31  -> MSB pos 4 -> block 1
  //   24+8=32  -> MSB pos 5 -> block 2

  // shifting the geometric series so 2^i aligns with idx

  // log2(LIST_DEFAULT_SIZE) = 3 for LIST_DEFAULT_SIZE = 8
#define LOG2_OF_LIST_DEFAULT_SIZE 3
  // first block is LIST_DEFAULT_SIZE wide, this normalizes
  size_t block = msb_pos - LOG2_OF_LIST_DEFAULT_SIZE;
  size_t start_index_of_block =
      (LIST_DEFAULT_SIZE << block) - LIST_DEFAULT_SIZE;
  size_t block_idx = idx - start_index_of_block;

  return (struct ListIdx){.block_idx = block_idx, .block = block};
}
```
LIST_append allocates on demand, based on the segment computed by idx_to_block_idx:
```c
#define LIST_append(LIST, ALLOC, ELEM)                                         \
  {                                                                            \
    /* allocate block array if not yet allocated */                            \
    if ((LIST)->blocks == NULL) {                                              \
      (LIST)->blocks =                                                         \
          CALL(ALLOC, request, LIST_BLOCK_COUNT * sizeof(void *));             \
      ASSERT((LIST)->blocks != NULL,                                           \
             "LIST_append: block array allocation failed");                    \
    }                                                                          \
                                                                               \
    struct ListIdx bi = idx_to_block_idx((LIST)->len);                         \
                                                                               \
    /* allocate the specific block if needed */                                \
    if ((LIST)->blocks[bi.block] == NULL) {                                    \
      uint64_t block_size = LIST_DEFAULT_SIZE << bi.block;                     \
      (LIST)->blocks[bi.block] =                                               \
          CALL(ALLOC, request, block_size * (LIST)->type_size);                \
      ASSERT((LIST)->blocks[bi.block] != NULL,                                 \
             "LIST_append: block allocation failed");                          \
    }                                                                          \
                                                                               \
    (LIST)->blocks[bi.block][bi.block_idx] = (ELEM);                           \
    (LIST)->len++;                                                             \
  }
```
The non-allocating interactions like _get, _get_UNSAFE and _insert_UNSAFE do a bounds check if applicable and afterwards use the segment and the offset computed by idx_to_block_idx:
```c
#define LIST_get(LIST, IDX)                                                    \
  ({                                                                           \
    ASSERT(IDX < (LIST)->len, "List_get out of bounds");                       \
    struct ListIdx b_idx = idx_to_block_idx(IDX);                              \
    (LIST)->blocks[b_idx.block][b_idx.block_idx];                              \
  })

#define LIST_get_UNSAFE(LIST, IDX)                                             \
  ({                                                                           \
    struct ListIdx b_idx = idx_to_block_idx(IDX);                              \
    (LIST)->blocks[b_idx.block][b_idx.block_idx];                              \
  })

#define LIST_insert_UNSAFE(LIST, IDX, VAL)                                     \
  {                                                                            \
    struct ListIdx __idx = idx_to_block_idx(IDX);                              \
    (LIST)->blocks[__idx.block][__idx.block_idx] = VAL;                        \
  }
```
Rust Implementation
“If you wish to make an apple pie from scratch, you must first invent the universe.”
- Carl Sagan
In this fashion we will:
- Implement mmap and munmap in assembly using the x86 Linux syscall ABI
- Implement a std::alloc::GlobalAlloc compatible allocator based on that
- Implement the segmented list using the allocator
- Profit.
Handrolling x86 mmap & munmap syscalls
About using the libc crate: this would be against my mentality of not using dependencies if possible, and libc is a large one, since it pulls in the C runtime for two syscalls I can just wrap myself using asm!. Of course the above is only fully correct on Linux with #[no_std], since in some cases Rust's std depends on libc, but you get what I'm trying to convey.
My first step was porting their arguments into the Rust type system. I couldn't use Rust enums, since they require variants to be exclusive, which the bit flags for READ, WRITE, … and PRIVATE, ANONYMOUS, … aren't. So I had to settle for a struct containing an i32. To support OR-ed flags for the arguments I also had to quickly implement std::ops::BitOr:
```rust
//! constants taken from https://github.com/openbsd/src/blob/master/sys/sys/mman.h
#[cfg(target_os = "openbsd")]
const MMAP_SYSCALL: i64 = 197;
#[cfg(target_os = "openbsd")]
const MUNMAP_SYSCALL: i64 = 73;

#[cfg(target_os = "linux")]
const MMAP_SYSCALL: i64 = 9;
#[cfg(target_os = "linux")]
const MUNMAP_SYSCALL: i64 = 11;

// Not an enum, since NONE, READ, WRITE and EXEC aren't mutually exclusive
pub struct MmapProt(i32);
impl MmapProt {
    /// no permissions
    pub const NONE: MmapProt = MmapProt(0x00);
    /// pages can be read
    pub const READ: MmapProt = MmapProt(0x01);
    /// pages can be written
    pub const WRITE: MmapProt = MmapProt(0x02);
    /// pages can be executed
    pub const EXEC: MmapProt = MmapProt(0x04);
    pub fn bits(self) -> i32 {
        self.0
    }
}

impl std::ops::BitOr for MmapProt {
    type Output = Self;
    fn bitor(self, rhs: Self) -> Self::Output {
        MmapProt(self.0 | rhs.0)
    }
}

pub struct MmapFlags(i32);

impl MmapFlags {
    /// share changes
    pub const SHARED: MmapFlags = MmapFlags(0x0001);
    /// changes are private
    pub const PRIVATE: MmapFlags = MmapFlags(0x0002);
    /// map addr must be exactly as requested
    pub const FIXED: MmapFlags = MmapFlags(0x0010);

    /// fail if address not available
    #[cfg(target_os = "openbsd")]
    pub const NOREPLACE: MmapFlags = MmapFlags(0x0800); // __MAP_NOREPLACE
    #[cfg(target_os = "linux")]
    pub const NOREPLACE: MmapFlags = MmapFlags(0x100000); // MAP_FIXED_NOREPLACE (Linux ≥ 5.4)

    /// allocated from memory, swap space
    #[cfg(target_os = "openbsd")]
    pub const ANONYMOUS: MmapFlags = MmapFlags(0x1000);
    /// allocated from memory, swap space
    #[cfg(target_os = "linux")]
    pub const ANONYMOUS: MmapFlags = MmapFlags(0x20);

    /// mapping is used for stack
    pub const STACK: MmapFlags = MmapFlags(0x4000);

    /// omit from dumps
    pub const CONCEAL: MmapFlags = MmapFlags(0x8000);

    // OpenBSD-only: avoid faulting in pages initially
    #[cfg(target_os = "openbsd")]
    pub const NOFAULT: MmapFlags = MmapFlags(0x2000);
    pub fn bits(self) -> i32 {
        self.0
    }
}
```
The functions themselves are pretty straightforward if you have ever called a syscall from assembly. We use the syscall instruction and fill some registers with the arguments according to the x86 Linux syscall ABI:
| Register | Value |
|---|---|
| rax | syscall kind (9 for mmap) |
| rdi | pointer, or null for anonymous mappings (Option::None for the latter) |
| rsi | region length |
| rdx | protection modifiers |
| r10 | flags |
| r8 | file descriptor (-1 for anonymous) |
| r9 | offset |
The only things tripping me up were the possible values for options and lateout, since these weren't obvious to me. After a lot of searching I found some docs:
| Option | Description |
|---|---|
| nostack | the asm does not modify the stack via push, pop or the red zone |
lateout writes the register contents to its argument and doesn't care about overwriting inputs. The std::ptr::NonNull return type is applicable, since on success mmap can only return valid non-null memory; otherwise rax holds an error value (libc reports this as MAP_FAILED, i.e. (void *) -1; here we just check for a negative return) and we abort. I only use nostack, since we don't fiddle with the stack. Other options like readonly or preserves_flags aren't applicable, since syscall writes to rax and modifies RFLAGS:
```rust
#[inline(always)]
pub fn mmap(
    ptr: Option<std::ptr::NonNull<u8>>,
    length: usize,
    prot: MmapProt,
    flags: MmapFlags,
    fd: i32,
    offset: i64,
) -> std::ptr::NonNull<u8> {
    let ret: isize;

    unsafe {
        core::arch::asm!(
            "syscall",
            in("rax") MMAP_SYSCALL,
            in("rdi") ptr.map(|nn| nn.as_ptr()).unwrap_or(std::ptr::null_mut()),
            in("rsi") length,
            in("rdx") prot.bits(),
            in("r10") flags.bits(),
            in("r8") fd,
            in("r9") offset,
            lateout("rax") ret,
            clobber_abi("sysv64"),
            options(nostack)
        );
    }
    if ret < 0 {
        let errno = -ret;
        eprintln!(
            "mmap failed (errno {}): {}",
            errno,
            std::io::Error::from_raw_os_error(errno as i32)
        );
        std::process::abort()
    }

    unsafe { std::ptr::NonNull::new_unchecked(ret as *mut u8) }
}
```
Unmapping is easier than mmap, since it only requires a pointer to the mapped region in rdi and its size in rsi. We again use rax to check whether the kernel complained about our syscall parameters.
```rust
#[inline(always)]
pub fn munmap(ptr: std::ptr::NonNull<u8>, size: usize) {
    let ret: isize;
    unsafe {
        core::arch::asm!(
            "syscall",
            in("rax") MUNMAP_SYSCALL,
            in("rdi") ptr.as_ptr(),
            in("rsi") size,
            lateout("rax") ret,
            clobber_abi("sysv64"),
            options(nostack)
        );
    }

    if ret < 0 {
        let errno = -ret;
        eprintln!(
            "munmap failed (errno {}): {}",
            errno,
            std::io::Error::from_raw_os_error(errno as i32)
        );
        std::process::abort()
    }
}
```
Bump allocator
The bump allocator now uses these safe syscall wrappers to allocate and deallocate memory chunks. It also implements std::alloc::GlobalAlloc and needs to implement Send and Sync, since it is used via #[global_allocator], which is a static and requires the corresponding traits.
The main component of the allocator is its metadata, SegmentedAllocCtx:
```rust
const MIN_SIZE: usize = 4096;
const MAX_BLOCKS: usize = 55;
const GROWTH: usize = 2;

#[derive(Debug)]
struct SegmentedAllocCtx {
    /// idx into self.blocks
    cur_block: usize,
    /// size of the current block
    size: usize,
    /// bytes in use of the current block
    pos: usize,
    blocks: [Option<NonNull<u8>>; MAX_BLOCKS],
    block_sizes: [usize; MAX_BLOCKS],
}

impl SegmentedAllocCtx {
    const fn new() -> Self {
        SegmentedAllocCtx {
            size: MIN_SIZE,
            cur_block: 0,
            pos: 0,
            blocks: [const { None }; MAX_BLOCKS],
            block_sizes: [0; MAX_BLOCKS],
        }
    }
}
```
Since I don't care about thread safety, and this is just a comparison with my already thread-unsafe C code, this context is wrapped in an UnsafeCell:
```rust
/// Implements a variable size bump allocator, employing mmap to allocate a starting block of
/// 4096B, once a block is exceeded by a request, the allocator mmaps a new block double the size
/// of the previously allocated block
pub struct SegmentedAlloc {
    ctx: UnsafeCell<SegmentedAllocCtx>,
}
```
To cement this unsafeness, Sync and Send are implemented as no-ops; this also requires never using SegmentedAlloc in multithreaded contexts.
```rust
unsafe impl Send for SegmentedAlloc {}
unsafe impl Sync for SegmentedAlloc {}
```
I ported the bump allocator line by line to Rust. It uses the mmap and munmap wrappers, keeps track of its state via SegmentedAllocCtx and hands out memory via SegmentedAlloc::request. SegmentedAlloc::free is the basis for implementing Drop for SegmentedList, which will come in the next section.
```rust
#[inline(always)]
fn align_up(val: usize, align: usize) -> usize {
    (val + align - 1) & !(align - 1)
}

impl SegmentedAlloc {
    pub const fn new() -> Self {
        Self {
            ctx: UnsafeCell::new(SegmentedAllocCtx::new()),
        }
    }

    pub fn request(&self, layout: std::alloc::Layout) -> NonNull<u8> {
        assert!(layout.size() > 0, "Zero-size allocation is not allowed");

        let ctx = unsafe { &mut *self.ctx.get() };

        if ctx.blocks[0].is_none() {
            ctx.size = MIN_SIZE;
            ctx.cur_block = 0;
            ctx.pos = 0;
            ctx.block_sizes[0] = MIN_SIZE;
            ctx.blocks[0] = Some(mmap(
                None,
                MIN_SIZE,
                mmap::MmapProt::READ | mmap::MmapProt::WRITE,
                mmap::MmapFlags::PRIVATE | mmap::MmapFlags::ANONYMOUS,
                -1,
                0,
            ));
        }

        loop {
            let block_capacity = ctx.block_sizes[ctx.cur_block];
            debug_assert!(
                block_capacity >= ctx.size,
                "block_capacity should be >= ctx.size"
            );

            let offset = align_up(ctx.pos, layout.align());
            let end_offset = offset
                .checked_add(layout.size())
                .expect("Allocation size overflow");

            if end_offset >= block_capacity {
                assert!(ctx.cur_block + 1 < MAX_BLOCKS, "Exceeded MAX_BLOCKS");
                let new_size = ctx.size * GROWTH;
                ctx.cur_block += 1;
                ctx.block_sizes[ctx.cur_block] = new_size;
                ctx.size = new_size;
                ctx.pos = 0;
                ctx.blocks[ctx.cur_block] = Some(mmap(
                    None,
                    new_size,
                    mmap::MmapProt::READ | mmap::MmapProt::WRITE,
                    mmap::MmapFlags::PRIVATE | mmap::MmapFlags::ANONYMOUS,
                    -1,
                    0,
                ));
                continue;
            }

            let block_ptr = ctx.blocks[ctx.cur_block].unwrap();

            let ptr_addr = unsafe { block_ptr.as_ptr().add(offset) };
            debug_assert!(
                (ptr_addr as usize) % layout.align() == 0,
                "Returned pointer is not aligned to {}",
                layout.align()
            );

            ctx.pos = end_offset;

            return NonNull::new(ptr_addr)
                .expect("Failed to create NonNull from allocation pointer");
        }
    }

    pub fn free(&mut self) {
        let ctx = unsafe { &mut *self.ctx.get() };
        for i in 0..MAX_BLOCKS {
            let size = ctx.block_sizes[i];
            if size == 0 {
                break;
            }

            let Some(block) = ctx.blocks[i] else {
                break;
            };
            munmap(block, size);
        }
    }
}
```
The std::alloc::GlobalAlloc implementation is just a thin shim: alloc forwards to SegmentedAlloc::request and dealloc is a no-op:
```rust
unsafe impl GlobalAlloc for SegmentedAlloc {
    unsafe fn alloc(&self, layout: std::alloc::Layout) -> *mut u8 {
        self.request(layout).as_ptr()
    }
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: std::alloc::Layout) {}
}
```
We can now use the following to make our complete program use the allocator:
```rust
use segmented_rs::alloc;

#[global_allocator]
static A: alloc::SegmentedAlloc = alloc::SegmentedAlloc::new();
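```

With the global allocator registered, every heap allocation in the binary is served from the mmap-backed bump blocks; since dealloc is a no-op, memory is only given back when the process exits (or the allocator is freed). A small hypothetical smoke test of my own:

```rust
// Hypothetical check: std collections now allocate through SegmentedAlloc.
fn main() {
    let v: Vec<u64> = (0..1024u64).collect();
    let s = String::from("allocated through SegmentedAlloc");
    assert_eq!(v.len(), 1024);
    assert!(!s.is_empty());
    // Dropping v and s does not return their memory to the OS, because
    // GlobalAlloc::dealloc is a no-op for this allocator.
}
```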
Or we inject the allocator into our list, so that only the list uses it.
Segmented List
As introduced in Segmented Lists vs Dynamic Arrays, a segmented list consists of segments. Each segment is lazily allocated and is double the size of the previous one. The first segment is of size 8, the next 16, the next 32, and so on. Keeping track of these sizes and their starting points is crucial for indexing.
```rust
// 24 blocks means around 134mio elements, that's enough I think
pub const BLOCK_COUNT: usize = 24;
pub const START_SIZE: usize = 8;
pub const BLOCK_STARTS: [usize; BLOCK_COUNT] = {
    let mut arr = [0usize; BLOCK_COUNT];
    let mut i = 0;
    while i < BLOCK_COUNT {
        arr[i] = START_SIZE * ((1 << i) - 1);
        i += 1;
    }
    arr
};

/// SegmentedIdx represents a cached index lookup into the segmented list, computed with
/// `SegmentedList::compute_segmented_idx`, can be used with `SegmentedList::get_with_segmented_idx`
/// and `SegmentedList::get_mut_with_segmented_idx`.
///
/// Primary usecase is to cache the lookup of many idxes, thus omitting the lookup computation which
/// can be too heavy in intensive workloads.
#[derive(Copy, Clone)]
struct SegmentedIdx(usize, usize);

/// SegmentedList is a drop in `std::vec::Vec` replacement providing zero cost growing and stable
/// pointers even after growing with `::push`.
///
/// The list is implemented by chaining blocks of memory to store its elements. Each block is
/// allocated on demand when an index falls into it (for instance during appends), starting at
/// `START_SIZE` elements in the first block and doubling the block size for each subsequent
/// allocation. This continues until `BLOCK_COUNT` is reached. Existing blocks are never moved or
/// reallocated, so references into the list remain valid across growth operations.
///
/// This makes the SegmentedList an adequate replacement for `std::vec::Vec` when dealing with
/// heavy and unpredictable growth workloads, due to the omission of copy/move overhead on expansion.
pub struct SegmentedList<T> {
    blocks: [*mut std::mem::MaybeUninit<T>; BLOCK_COUNT],
    block_lengths: [usize; BLOCK_COUNT],
    allocator: SegmentedAlloc,
    cur_block: usize,
    offset_in_block: usize,
    len: usize,
}
```
SegmentedList holds an array of segments (blocks), an array of their lengths (block_lengths), the allocator used for all allocations and the count of currently contained elements across all segments, len.
The main logic for indexing is encoded in the SegmentedIdx(segment, offset) struct and its producer, idx_to_block_idx.
```rust
impl<T> SegmentedList<T> {

    // [...]

    #[inline(always)]
    fn idx_to_block_idx(&self, idx: usize) -> SegmentedIdx {
        if idx < START_SIZE {
            return SegmentedIdx(0, idx);
        }
        let adjusted = idx + START_SIZE;
        let msb_pos = core::mem::size_of::<usize>() * 8 - 1 - adjusted.leading_zeros() as usize;
        let block = msb_pos - (START_SIZE.trailing_zeros() as usize);
        SegmentedIdx(block, idx - BLOCK_STARTS[block])
    }

    // [...]

}
```
In comparison to the C implementation, the Rust one doesn't lazily allocate the first chunk, but eagerly allocates its first segment in Self::new (I don't remember why I did that):
```rust
impl<T> SegmentedList<T> {
    pub fn new() -> Self {
        let mut s = Self {
            blocks: [std::ptr::null_mut(); BLOCK_COUNT],
            block_lengths: [0; BLOCK_COUNT],
            allocator: SegmentedAlloc::new(),
            cur_block: 0,
            len: 0,
            offset_in_block: 0,
        };

        let element_count = START_SIZE;
        let as_bytes = element_count * size_of::<T>();
        s.blocks[0] = s
            .allocator
            .request(Layout::from_size_align(as_bytes, align_of::<T>()).unwrap())
            .as_ptr() as *mut MaybeUninit<T>;
        s.block_lengths[0] = element_count;
        s
    }

    // [...]

}
```
Allocating a new segment (Self::alloc_block) outside of Self::new happens when the current segment is out of space in any appending method, for instance Self::push(T):
```rust
impl<T> SegmentedList<T> {
    // [...]

    #[inline(always)]
    fn alloc_block(&mut self, block: usize) {
        use std::alloc::Layout;
        use std::mem::{MaybeUninit, align_of, size_of};

        let elems = START_SIZE << block;
        let bytes = elems * size_of::<T>();
        let layout = Layout::from_size_align(bytes, align_of::<T>())
            .expect("Invalid layout for SegmentedList block");

        let ptr = self.allocator.request(layout).as_ptr() as *mut MaybeUninit<T>;
        debug_assert!(!ptr.is_null(), "SegmentedAlloc returned null");

        self.blocks[block] = ptr;
        self.block_lengths[block] = elems;
    }

    pub fn push(&mut self, v: T) {
        if self.block_lengths[self.cur_block] == 0 {
            self.alloc_block(self.cur_block);
        }

        unsafe {
            (*self.blocks[self.cur_block].add(self.offset_in_block)).write(v);
        }

        self.len += 1;
        self.offset_in_block += 1;

        if self.offset_in_block == self.block_lengths[self.cur_block] {
            self.cur_block += 1;
            self.offset_in_block = 0;
        }
    }

    pub fn get(&self, idx: usize) -> Option<&T> {
        if idx >= self.len {
            return None;
        }
        let SegmentedIdx(block, block_index) = self.idx_to_block_idx(idx);
        Some(unsafe { (*self.blocks[block].add(block_index)).assume_init_ref() })
    }

    // [...]
}
```
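A short usage sketch of push and get (my own example, assuming the crate layout used in the benchmarks below):

```rust
use segmented_rs::list::SegmentedList;

fn main() {
    let mut list = SegmentedList::new();
    for i in 0..1_000usize {
        list.push(i);
    }
    // get() is bounds checked and returns Option<&T>
    assert_eq!(list.get(42), Some(&42));
    assert_eq!(list.get(1_000), None);
}
```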
Since I want to provide somewhat of a std::vec::Vec drop-in replacement, I added a truckload of methods Vec also supports. Due to the already way too large nature of this article I'll restrict myself to to_vec, capacity, clear and impl<T: Clone + Copy> Clone for SegmentedList<T>, since these are somewhat non-trivial:
```rust
impl<T> SegmentedList<T> {
    /// Collects self and its contents into a vec
    pub fn to_vec(mut self) -> Vec<T> {
        let mut result = Vec::with_capacity(self.len);
        let mut remaining = self.len;

        for block_idx in 0..BLOCK_COUNT {
            if remaining == 0 {
                break;
            }

            let len = self.block_lengths[block_idx];
            if len == 0 {
                break;
            }

            let ptr = self.blocks[block_idx];
            let take = remaining.min(len);
            for i in 0..take {
                let value = unsafe { (*ptr.add(i)).assume_init_read() };
                result.push(value);
            }
            remaining -= take;
            // We "forget" the block, no dealloc, bump allocator manages memory
            self.blocks[block_idx] = std::ptr::null_mut();
        }
        result
    }

    pub fn capacity(&self) -> usize {
        self.block_lengths.iter().copied().sum()
    }

    pub fn clear(&mut self) {
        let mut remaining = self.len;
        for block_idx in 0..BLOCK_COUNT {
            if remaining == 0 {
                break;
            }
            let len = self.block_lengths[block_idx];
            let ptr = self.blocks[block_idx];
            if len == 0 {
                break;
            }
            let take = remaining.min(len);
            for i in 0..take {
                unsafe { (*ptr.add(i)).assume_init_drop() };
            }
            remaining -= take;
        }
        self.len = 0;
    }
}

impl<T: Clone + Copy> Clone for SegmentedList<T> {
    fn clone(&self) -> Self {
        let mut new_list = SegmentedList::new();
        new_list.len = self.len;

        for block_idx in 0..BLOCK_COUNT {
            if self.block_lengths[block_idx] == 0 {
                break;
            }
            let src_ptr = self.blocks[block_idx];
            let elems = self.block_lengths[block_idx];
            if elems == 0 {
                continue;
            }
            new_list.alloc_block(block_idx);
            let dst_ptr = new_list.blocks[block_idx];

            for i in 0..elems {
                unsafe {
                    let val = (*src_ptr.add(i)).assume_init();
                    (*dst_ptr.add(i)).write(val);
                }
            }
            new_list.block_lengths[block_idx] = elems;
        }

        new_list
    }
}
```
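And a small usage sketch of these methods (again my own example, assuming a len() accessor like the one used in the TLDR):

```rust
use segmented_rs::list::SegmentedList;

fn main() {
    let mut list = SegmentedList::new();
    for i in 0..100u64 {
        list.push(i);
    }

    // Clone duplicates every block into a fresh list (T must be Clone + Copy)
    let copy = list.clone();

    // clear() drops the contained elements and resets the length
    list.clear();
    assert_eq!(list.len(), 0);

    // to_vec() consumes the list and collects the elements into a Vec
    let v = copy.to_vec();
    assert_eq!(v, (0..100).collect::<Vec<u64>>());
}
```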
The attentive reader will have noticed I snuck some small optimisations in:
- Inline alloc_block and idx_to_block_idx (-2% runtime)
- Inline mmap and munmap (-4% runtime)
- Precompute block boundaries in BLOCK_STARTS (-41% runtime)
- Cache the SegmentedIdx computation for Self::push via cur_block and offset_in_block
- Remove unnecessary indirections (-8% runtime)
segmented_rs::list::SegmentedList vs std::vec::Vec
Benchmarks are of course done with criterion:
```rust
// benches/list.rs
use criterion::{BatchSize, Criterion, black_box, criterion_group, criterion_main};
use segmented_rs::list::SegmentedList;

pub fn bench_segmented_list(c: &mut Criterion) {
    fn bench_push<T: Clone>(c: &mut Criterion, name: &str, template: T, count: usize) {
        c.bench_function(name, |b| {
            b.iter_batched(
                || SegmentedList::new(),
                |mut list| {
                    for _ in 0..count {
                        list.push(black_box(template.clone()));
                    }
                    black_box(list)
                },
                BatchSize::SmallInput,
            )
        });
    }

    bench_push(c, "segmented_list_push_u64", 123u64, 10_000);

    #[derive(Clone)]
    struct MediumElem([u8; 40]);
    bench_push(c, "segmented_list_push_medium", MediumElem([42; 40]), 1_000);

    #[derive(Clone)]
    struct HeavyElem(Box<[u8]>);
    bench_push(
        c,
        "segmented_list_push_heavy_1MiB",
        HeavyElem(vec![161u8; 1 * 1024 * 1024].into_boxed_slice()),
        50,
    );
}

criterion_group!(benches, bench_segmented_list);
criterion_main!(benches);
```
The same for std::vec::Vec:
```rust
// benches/vec.rs
use criterion::{BatchSize, Criterion, black_box, criterion_group, criterion_main};

pub fn bench_vec(c: &mut Criterion) {
    fn bench_push<T: Clone>(c: &mut Criterion, name: &str, template: T, count: usize) {
        c.bench_function(name, |b| {
            b.iter_batched(
                || Vec::new(),
                |mut vec| {
                    for _ in 0..count {
                        vec.push(black_box(template.clone()));
                    }
                    black_box(vec)
                },
                BatchSize::SmallInput,
            )
        });
    }

    bench_push(c, "vec_push_u64", 123u64, 10_000);

    #[derive(Clone)]
    struct MediumElem([u8; 40]);
    bench_push(c, "vec_push_medium", MediumElem([42; 40]), 1_000);

    #[derive(Clone)]
    struct HeavyElem(Box<[u8]>);
    bench_push(
        c,
        "vec_push_heavy_1MiB",
        HeavyElem(vec![161u8; 1 * 1024 * 1024].into_boxed_slice()),
        50,
    );
}

criterion_group!(benches, bench_vec);
criterion_main!(benches);
```
Both are runnable via cargo bench --bench list:
```text
segmented_list_push_u64         time: [39.502 µs 40.156 µs 40.860 µs]
segmented_list_push_medium      time: [35.512 µs 35.901 µs 36.306 µs]
segmented_list_push_heavy_1MiB  time: [3.0590 ms 3.0932 ms 3.1345 ms]
segmented_list_push_heavy_10MiB time: [3.3591 ms 3.3934 ms 3.4299 ms]
segmented_list_push_heavy_50MiB time: [19.895 ms 20.425 ms 21.353 ms]
```
and cargo bench --bench vec:
```text
vec_push_u64         time: [32.955 µs 33.463 µs 33.961 µs]
vec_push_medium      time: [28.725 µs 29.058 µs 29.435 µs]
vec_push_heavy_1MiB  time: [3.3439 ms 3.3816 ms 3.4236 ms]
vec_push_heavy_10MiB time: [3.7747 ms 3.8124 ms 3.8548 ms]
vec_push_heavy_50MiB time: [21.718 ms 21.865 ms 22.018 ms]
```
So it beats Vec on larger elements (starting from 1 MiB), but that only matters when moving large amounts of giant blobs. The Rust team did a great job at optimising std::vec::Vec; my "naive" implementation comes close, but only outperforms it on very large workloads.
Rust Pain Points
- Sync and Send have to be implemented for std::alloc::GlobalAlloc. I get that the allocator has to be shared between threads, but it's weird to have a no-op impl.
- cargo test executes tests concurrently and therefore crashes if not run with --test-threads=1, which was fucking hard to debug, since these failures are flaky as hell; sometimes it happens, sometimes it doesn't: illegal memory access:

```text
running 24 tests
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully:
  `/home/teo/programming/segmented-rs/target/debug/deps/segmented_rs-68ce766f62589be2`
  (signal: 11, SIGSEGV: invalid memory reference)
```

- SendError, since SegmentedAlloc couldn't be sent from one thread to another:

```text
thread 'main' panicked at library/test/src/lib.rs:463:73:
called `Option::unwrap()` on a `None` value
[Thread 0x7ffff691f6c0 (LWP 22634) exited]

thread 'list::tests::stress_test_large_fill' panicked at library/test/src/lib.rs:686:30:
called `Result::unwrap()` on an `Err` value: SendError { .. }
```

- Segfaults in Rust are worse to debug with gdb than in C; unaligned memory issues, segfaults and other invalid memory accesses are hard to pinpoint.
- No stack traces for panic! when implementing std::alloc::GlobalAlloc, just:

```text
panic:
```

Gdb (most of the time) helps with the stack trace when compiling with RUSTFLAGS="-C debuginfo=2".
What Rust does better than C (at least in this case)
- Generics. Looking at you, _Generic: Rust just did it better. To be fair, even Java did it better. What even is this:

```c
#define DBG(EXPR) //...
  ({
    _Pragma("GCC diagnostic push")
    _Pragma("GCC diagnostic ignored \"-Wformat\"") __auto_type _val = (EXPR);
    fprintf(stderr, "[%s:%d] %s = ", __FILE__, __LINE__, #EXPR);
    _Generic((_val),
        int: fprintf(stderr, "%d\n", _val),
        long: fprintf(stderr, "%ld\n", _val),
        long long: fprintf(stderr, "%lld\n", _val),
        unsigned: fprintf(stderr, "%u\n", _val),
        unsigned long: fprintf(stderr, "%lu\n", _val),
        unsigned long long: fprintf(stderr, "%llu\n", _val),
        float: fprintf(stderr, "%f\n", _val),
        double: fprintf(stderr, "%f\n", _val),
        const char *: fprintf(stderr, "\"%s\"\n", _val),
        char *: fprintf(stderr, "\"%s\"\n", _val),
        default: fprintf(stderr, "<unprintable>\n"));
    _Pragma("GCC diagnostic pop") _val;
  })
```

  There isn't even a way to run different functions for differing datatypes, since each path has to compile and you can't pass something like a double to strlen; even if that path never actually runs, gcc still complains, see Workarounds for C11 _Generic.
- Drop implementations and traits in general are so much better than any C alternative, even though I can think of at least 3 shittier ways to emulate traits in C.
- Enums are so much fun in Rust, I love variant "bodies". The only thing tripping me up was that one can't use them to mirror the C flag behaviour for bit-ORing arguments.
- Builtin testing and benchmarking (the latter at least somewhat). I really miss Go's testing behaviour in Rust, but the default "workflow" is fine for my needs (table driven tests for the win).
- Compile time constructs: precomputing the starts of blocks allowed me to simplify my segment and offset lookup code by a whole lot, and it's faster.
Of course this comparison isn't that fair, since C is from the 70s and Rust has learned from the shortcomings of the systems programming languages that came before it. I still like to program in C, particularly for the challenge and the level of control it allows, but only with -fsanitize=address,undefined and a whole lot of valgrind runs.