Building a Minimal Viable Armv7 Emulator from Scratch

TL;DR: I built a tiny, zero-dependency armv7 userspace emulator in Rust

I wrote a minimal viable armv7 emulator in 1.3k lines of Rust without any dependencies. It parses and validates a 32-bit arm binary, maps its segments, decodes a subset of arm instructions, translates guest and host memory interactions and forwards arm Linux syscalls into x86-64 System V syscalls.

It can run an armv7 hello world binary in 1.9ms (0.015ms for raw emulation without setup), while qemu takes 12.3ms (stinkarm is thus ~100-1000x slower than native armv7 execution).

After reading about the process the Linux kernel performs to execute binaries, I thought: I want to write an armv7 emulator - stinkarm. Mostly to understand the ELF format, the encoding of arm 32bit instructions, the execution of arm assembly and how it all fits together (this will help me with the JIT for my programming language I am currently designing). To fully understand everything: no dependencies. And of course Rust, since I already have enough C projects going on.

So I wrote the smallest binary I could think of:

ARMASM
    .global _start  @ declare _start as a global
_start:             @ _start is the de facto entry point
    mov r0, #161    @ first and only argument to the exit syscall
    mov r7, #1      @ syscall number 1 (exit)
    svc #0          @ trapping into the kernel (that's us, since we are translating)

To execute this arm assembly on my x86 system, I need to:

  1. Parse the ELF, validate it is armv7 and a static executable (I don’t want to write a dynamic dependency resolver and loader)
  2. Map the segments defined in the ELF into host memory and forward memory accesses
  3. Decode armv7 instructions and convert them into a nice Rust enum
  4. Emulate the CPU, its state and registers
  5. Execute the instructions and apply their effects to the CPU state
  6. Translate and forward syscalls

Sounds easy? It is!

Open below if you want to see me write a build script and a nix flake:

Minimalist arm setup and smallest possible arm binary

Before I start parsing ELF I’ll need a binary to emulate, so let’s create a build script called bld_exmpl (so I can write a lot less) and a nix flake, so the asm is converted into armv7 machine code in an armv7 binary on my non-armv7 system :^)

RUST
// tools/bld_exmpl.rs
use clap::Parser;
use std::fs;
use std::path::Path;
use std::process::Command;

/// Build all ARM assembly examples into .elf binaries
#[derive(Parser)]
struct Args {
    /// Directory containing .S examples
    #[arg(long, default_value = "examples")]
    examples_dir: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    let dir = Path::new(&args.examples_dir);

    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.extension().and_then(|s| s.to_str()) == Some("S") {
            let name = path.file_stem().unwrap().to_str().unwrap();
            let output = dir.join(format!("{}.elf", name));
            build_asm(&path, &output)?;
        }
    }

    Ok(())
}

fn build_asm(input: &Path, output: &Path) -> Result<(), Box<dyn std::error::Error>> {
    println!("Building {} -> {}", input.display(), output.display());

    let obj_file = input.with_extension("o");

    let status = Command::new("arm-none-eabi-as")
        .arg("-march=armv7-a")
        .arg(input)
        .arg("-o")
        .arg(&obj_file)
        .status()?;

    if !status.success() {
        return Err(format!("Assembler failed for {}", input.display()).into());
    }

    let status = Command::new("arm-none-eabi-ld")
        .arg("-Ttext=0x8000")
        .arg(&obj_file)
        .arg("-o")
        .arg(output)
        .status()?;

    if !status.success() {
        return Err(format!("Linker failed for {}", output.display()).into());
    }

    Ok(fs::remove_file(obj_file)?)
}

TOML
# Cargo.toml
[package]
name = "stinkarm"
version = "0.1.0"
edition = "2024"
default-run = "stinkarm"

[dependencies]
clap = { version = "4.5.51", features = ["derive"] }

[[bin]]
name = "stinkarm"
path = "src/main.rs"

[[bin]]
name = "bld_exmpl"
path = "tools/bld_exmpl.rs"

NIX
{
  description = "stinkarm — ARMv7 userspace binary emulator for x86 linux systems";
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, nixpkgs, flake-utils, ... }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = import nixpkgs { inherit system; };
      in {
        devShells.default = pkgs.mkShell {
          buildInputs = with pkgs; [
            gcc-arm-embedded
            binutils
            qemu
          ];
        };
      }
  );
}

Parsing ELF

So there are some resources for parsing ELF, two of them I used a whole lot:

  1. man elf (remember to export MANPAGER='nvim +Man!')
  2. gabi.xinuos.com

At a high level, an ELF file (32-bit, for armv7) consists of an ELF header, one or more program headers describing the segments, and the rest (section headers and friends), which I don’t care about: this emulator only supports static binaries, not dynamically linked ones.

Elf32_Ehdr

The ELF header is exactly 52 bytes long and holds all the data I need to find the program headers and to decide whether I even want to emulate the binary I’m currently parsing. These criteria are defined as members of the identifier at the beginning of the header.

In terms of byte layout:

TEXT
+------------------------+--------+--------+----------------+----------------+----------------+----------------+----------------+--------+---------+--------+---------+--------+--------+
|       identifier       |  type  |machine |    version     |     entry      |     phoff      |     shoff      |     flags      | ehsize |phentsize| phnum  |shentsize| shnum  |shstrndx|
|          16B           |   2B   |   2B   |       4B       |       4B       |       4B       |       4B       |       4B       |   2B   |   2B    |   2B   |   2B    |   2B   |   2B   |
+------------------------+--------+--------+----------------+----------------+----------------+----------------+----------------+--------+---------+--------+---------+--------+--------+
           \|/
            |
            |
            v
+----------------+------+------+-------+------+-----------+------------------------+
|     magic      |class | data |version|os_abi|abi_version|          pad           |
|       4B       |  1B  |  1B  |  1B   |  1B  |    1B     |           7B           |
+----------------+------+------+-------+------+-----------+------------------------+
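
The byte offsets used when slicing the header follow directly from the field sizes in the layout above. As a small sanity check, here is a sketch that sums them up; EHDR_FIELDS, field_offset and header_size are names made up for this illustration and are not part of stinkarm:

```rust
/// (field name, size in bytes) in the order the fields appear in Elf32_Ehdr
const EHDR_FIELDS: [(&str, usize); 14] = [
    ("ident", 16),
    ("type", 2),
    ("machine", 2),
    ("version", 4),
    ("entry", 4),
    ("phoff", 4),
    ("shoff", 4),
    ("flags", 4),
    ("ehsize", 2),
    ("phentsize", 2),
    ("phnum", 2),
    ("shentsize", 2),
    ("shnum", 2),
    ("shstrndx", 2),
];

/// byte offset of `name` inside the header, or None for an unknown field
fn field_offset(name: &str) -> Option<usize> {
    let mut offset = 0;
    for (field, size) in EHDR_FIELDS {
        if field == name {
            return Some(offset);
        }
        offset += size;
    }
    None
}

/// total header size, the sum of all field sizes
fn header_size() -> usize {
    EHDR_FIELDS.iter().map(|(_, size)| size).sum()
}
```

header_size() comes out at 52, and field_offset("phoff") at 28, which is where the program header table offset is read from.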

Most resources show C based examples, the rust ports are below:

RUST
/// Representing the ELF Object File Format header in memory, equivalent to Elf32_Ehdr in 2. ELF
/// header in https://gabi.xinuos.com/elf/02-eheader.html
///
/// Types are taken from https://gabi.xinuos.com/elf/01-intro.html#data-representation Table 1.1
/// 32-Bit Data Types:
///
/// | Elf32_ | Rust |
/// | ------ | ---- |
/// | Addr   | u32  |
/// | Off    | u32  |
/// | Half   | u16  |
/// | Word   | u32  |
/// | Sword  | i32  |
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    /// initial bytes mark the file as an object file and provide machine-independent data with
    /// which to decode and interpret the file’s contents
    pub ident: Identifier,
    pub r#type: Type,
    pub machine: Machine,
    /// identifies the object file version, always EV_CURRENT (1)
    pub version: u32,
    /// the virtual address to which the system first transfers control, thus starting
    /// the process. If the file has no associated entry point, this member holds zero
    pub entry: u32,
    /// the program header table’s file offset in bytes. If the file has no program header table,
    /// this member holds zero
    pub phoff: u32,
    /// the section header table’s file offset in bytes. If the file has no section header table, this
    /// member holds zero
    pub shoff: u32,
    /// processor-specific flags associated with the file
    pub flags: u32,
    /// the ELF header’s size in bytes
    pub ehsize: u16,
    /// the size in bytes of one entry in the file’s program header table; all entries are the same
    /// size
    pub phentsize: u16,
    /// the number of entries in the program header table. Thus the product of e_phentsize and e_phnum
    /// gives the table’s size in bytes. If a file has no program header table, e_phnum holds the value
    /// zero
    pub phnum: u16,
    /// section header’s size in bytes. A section header is one entry in the section header table; all
    /// entries are the same size
    pub shentsize: u16,
    /// number of entries in the section header table. Thus the product of e_shentsize and e_shnum
    /// gives the section header table’s size in bytes. If a file has no section header table,
    /// e_shnum holds the value zero.
    pub shnum: u16,
    /// the section header table index of the entry associated with the section name string table.
    /// If the file has no section name string table, this member holds the value SHN_UNDEF
    pub shstrndx: u16,
}

The identifier is 16 bytes long and holds the previously mentioned info I use to decide whether I want to emulate the binary, for instance the endianness and the bit class. In the TryFrom implementation I strictly check what is parsed:

RUST
/// 2.2 ELF Identification: https://gabi.xinuos.com/elf/02-eheader.html#elf-identification
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Identifier {
    /// 0x7F, 'E', 'L', 'F'
    pub magic: [u8; 4],
    /// file class or capacity
    ///
    /// | Name          | Value | Meaning       |
    /// | ------------- | ----- | ------------- |
    /// | ELFCLASSNONE  | 0     | Invalid class |
    /// | ELFCLASS32    | 1     | 32-bit        |
    /// | ELFCLASS64    | 2     | 64-bit        |
    pub class: u8,
    /// data encoding, endian
    ///
    /// | Name         | Value |
    /// | ------------ | ----- |
    /// | ELFDATANONE  | 0     |
    /// | ELFDATA2LSB  | 1     |
    /// | ELFDATA2MSB  | 2     |
    pub data: u8,
    /// file version, always EV_CURRENT (1)
    pub version: u8,
    /// operating system identification
    ///
    /// - if no extensions are used: 0
    /// - meaning depends on e_machine
    pub os_abi: u8,
    /// value depends on os_abi
    pub abi_version: u8,
    // padding bytes (9-15)
    _pad: [u8; 7],
}

impl TryFrom<&[u8]> for Identifier {
    type Error = &'static str;

    fn try_from(bytes: &[u8]) -> Result<Self, Self::Error> {
        if bytes.len() < 16 {
            return Err("e_ident too short for ELF");
        }

        // I don't want to cast via unsafe as_ptr and as Header because the header could outlive the
        // source slice, thus we just do it the old plain indexing way
        let ident = Self {
            magic: bytes[0..4].try_into().unwrap(),
            class: bytes[4],
            data: bytes[5],
            version: bytes[6],
            os_abi: bytes[7],
            abi_version: bytes[8],
            _pad: bytes[9..16].try_into().unwrap(),
        };

        if ident.magic != [0x7f, b'E', b'L', b'F'] {
            return Err("Unexpected EI_MAG0 to EI_MAG3, wanted 0x7f E L F");
        }

        const ELFCLASS32: u8 = 1;
        const ELFDATA2LSB: u8 = 1;
        const EV_CURRENT: u8 = 1;

        if ident.version != EV_CURRENT {
            return Err("Unsupported EI_VERSION value");
        }

        // anything but ELFCLASS32 is rejected, not only ELFCLASS64
        if ident.class != ELFCLASS32 {
            return Err("Unexpected EI_CLASS, wanted ELFCLASS32 (ARMv7)");
        }

        // anything but ELFDATA2LSB is rejected, not only big-endian
        if ident.data != ELFDATA2LSB {
            return Err("Unexpected EI_DATA, wanted ELFDATA2LSB (little-endian)");
        }

        Ok(ident)
    }
}

Type and Machine are just enums encoding meaning in the Rust type system:

RUST
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Type {
    None = 0,
    Relocatable = 1,
    Executable = 2,
    SharedObject = 3,
    Core = 4,
    LoOs = 0xfe00,
    HiOs = 0xfeff,
    LoProc = 0xff00,
    HiProc = 0xffff,
}

impl TryFrom<u16> for Type {
    type Error = &'static str;

    fn try_from(value: u16) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Type::None),
            1 => Ok(Type::Relocatable),
            2 => Ok(Type::Executable),
            3 => Ok(Type::SharedObject),
            4 => Ok(Type::Core),
            0xfe00 => Ok(Type::LoOs),
            0xfeff => Ok(Type::HiOs),
            0xff00 => Ok(Type::LoProc),
            0xffff => Ok(Type::HiProc),
            _ => Err("Invalid u16 value for e_type"),
        }
    }
}


#[repr(u16)]
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Machine {
    EM_ARM = 40,
}

impl TryFrom<u16> for Machine {
    type Error = &'static str;

    fn try_from(value: u16) -> Result<Self, Self::Error> {
        match value {
            40 => Ok(Machine::EM_ARM),
            _ => Err("Unsupported machine"),
        }
    }
}

Since Header’s non-integer members (Identifier, Type and Machine) implement TryFrom, we can implement TryFrom<&[u8]> for Header and cleanly propagate any error from member parsing via ?:

RUST
impl TryFrom<&[u8]> for Header {
    type Error = &'static str;

    fn try_from(b: &[u8]) -> Result<Self, Self::Error> {
        if b.len() < 52 {
            return Err("not enough bytes for Elf32_Ehdr (ELF header)");
        }

        let header = Self {
            ident: b[0..16].try_into()?,
            r#type: le16!(b[16..18]).try_into()?,
            machine: le16!(b[18..20]).try_into()?,
            version: le32!(b[20..24]),
            entry: le32!(b[24..28]),
            phoff: le32!(b[28..32]),
            shoff: le32!(b[32..36]),
            flags: le32!(b[36..40]),
            ehsize: le16!(b[40..42]),
            phentsize: le16!(b[42..44]),
            phnum: le16!(b[44..46]),
            shentsize: le16!(b[46..48]),
            shnum: le16!(b[48..50]),
            shstrndx: le16!(b[50..52]),
        };

        match header.r#type {
            Type::Executable => (),
            _ => {
                return Err("Unsupported ELF type, only ET_EXEC (static executables) is supported");
            }
        }

        Ok(header)
    }
}

The attentive reader will see me using le16! and le32! for parsing bytes into unsigned integers of different widths (le is short for little endian):

RUST
#[macro_export]
macro_rules! le16 {
    ($bytes:expr) => {{
        let b: [u8; 2] = $bytes
            .try_into()
            .map_err(|_| "Failed to create u16 from 2*u8")?;
        u16::from_le_bytes(b)
    }};
}

#[macro_export]
macro_rules! le32 {
    ($bytes:expr) => {{
        let b: [u8; 4] = $bytes
            .try_into()
            .map_err(|_| "Failed to create u32 from 4*u8")?;
        u32::from_le_bytes(b)
    }};
}
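
One detail worth spelling out: both macros expand to an expression containing ?, so they only work inside a function whose error type can be built from a &'static str. A minimal sketch of that, with the macros repeated (without #[macro_export]) so the snippet stands on its own; machine_and_entry is a hypothetical helper for illustration only:

```rust
// only needed on editions before 2021, where TryInto is not in the prelude
#[allow(unused_imports)]
use std::convert::TryInto;

macro_rules! le16 {
    ($bytes:expr) => {{
        let b: [u8; 2] = $bytes
            .try_into()
            .map_err(|_| "Failed to create u16 from 2*u8")?;
        u16::from_le_bytes(b)
    }};
}

macro_rules! le32 {
    ($bytes:expr) => {{
        let b: [u8; 4] = $bytes
            .try_into()
            .map_err(|_| "Failed to create u32 from 4*u8")?;
        u32::from_le_bytes(b)
    }};
}

/// reads e_machine (offset 18) and e_entry (offset 24) from a raw ELF32 header;
/// the `?` hidden in the macros returns early from this function on failure
fn machine_and_entry(b: &[u8]) -> Result<(u16, u32), &'static str> {
    if b.len() < 28 {
        return Err("need at least 28 bytes");
    }
    let machine = le16!(b[18..20]);
    let entry = le32!(b[24..28]);
    Ok((machine, entry))
}
```

The up-front length check matters: slicing with b[18..20] panics on short input before try_into ever gets a chance to return an error.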

Elf32_Phdr

TEXT
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|      type      |     offset     |     vaddr      |     paddr      |     filesz     |     memsz      |     flags      |     align      |
|       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

For me, the most important fields in Header are phoff, phentsize and phnum, since we can use these to index into the binary and locate the program headers (Phdr).

RUST
/// Phdr, equivalent to Elf32_Phdr, see: https://gabi.xinuos.com/elf/07-pheader.html
///
/// All of its members are u32, be it Elf32_Word, Elf32_Off or Elf32_Addr
#[derive(Debug)]
pub struct Pheader {
    pub r#type: Type,
    pub offset: u32,
    pub vaddr: u32,
    pub paddr: u32,
    pub filesz: u32,
    pub memsz: u32,
    pub flags: Flags,
    pub align: u32,
}

impl Pheader {
    /// extracts Pheader from raw, starting from offset
    pub fn from(raw: &[u8], offset: usize) -> Result<Self, String> {
        let end = offset.checked_add(32).ok_or("Offset overflow")?;
        if raw.len() < end {
            return Err("Not enough bytes to parse Elf32_Phdr, need at least 32".into());
        }

        let p_raw = &raw[offset..end];
        let r#type = p_raw[0..4].try_into()?;
        let flags = p_raw[24..28].try_into()?;
        let align = le32!(p_raw[28..32]);

        if align > 1 && !align.is_power_of_two() {
            return Err(format!("Invalid p_align: {}", align));
        }

        Ok(Self {
            r#type,
            offset: le32!(p_raw[4..8]),
            vaddr: le32!(p_raw[8..12]),
            paddr: le32!(p_raw[12..16]),
            filesz: le32!(p_raw[16..20]),
            memsz: le32!(p_raw[20..24]),
            flags,
            align,
        })
    }
}

Type holds info about what type of segment the header defines:

RUST
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub enum Type {
    NULL = 0,
    LOAD = 1,
    DYNAMIC = 2,
    INTERP = 3,
    NOTE = 4,
    SHLIB = 5,
    PHDR = 6,
    TLS = 7,
    LOOS = 0x60000000,
    HIOS = 0x6fffffff,
    LOPROC = 0x70000000,
    HIPROC = 0x7fffffff,
}

Flags defines the permission flags the segment should have once it is mapped into memory:

RUST
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
pub struct Flags(u32);

impl Flags {
    pub const NONE: Self = Flags(0x0);
    pub const X: Self = Flags(0x1);
    pub const W: Self = Flags(0x2);
    pub const R: Self = Flags(0x4);
}
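
Since the flags are a bit set, a segment usually carries several of them at once, for example Flags(5) for a readable and executable text segment. A small sketch of how such a value could be tested and printed; the contains method and the Display impl are assumptions for illustration, the original only shows the constants:

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
pub struct Flags(u32);

impl Flags {
    pub const X: Self = Flags(0x1);
    pub const W: Self = Flags(0x2);
    pub const R: Self = Flags(0x4);

    /// hypothetical helper: true if every bit of `other` is set in `self`
    pub fn contains(self, other: Flags) -> bool {
        (self.0 & other.0) == other.0
    }
}

impl fmt::Display for Flags {
    /// renders the set bits in R, W, X order, joined by `|`
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let names: Vec<&str> = [(Flags::R, "R"), (Flags::W, "W"), (Flags::X, "X")]
            .iter()
            .filter(|(flag, _)| self.contains(*flag))
            .map(|(_, name)| *name)
            .collect();
        write!(f, "{}", names.join("|"))
    }
}
```

With this, Flags(5) formats as R|X: bit 0x4 (R) and bit 0x1 (X) are set, while 0x2 (W) is not.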

Full ELF parsing

Putting Elf32_Ehdr and Elf32_Phdr parsing together:

RUST
/// Representing an ELF32 binary in memory
///
/// This does not include section headers (Elf32_Shdr), but only program headers (Elf32_Phdr), see either `man elf` and/or https://gabi.xinuos.com/elf/03-sheader.html
#[derive(Debug)]
pub struct Elf {
    pub header: header::Header,
    pub pheaders: Vec<pheader::Pheader>,
}

impl TryFrom<&[u8]> for Elf {
    type Error = String;

    fn try_from(b: &[u8]) -> Result<Self, String> {
        let header = header::Header::try_from(b).map_err(|e| e.to_string())?;

        let mut pheaders = Vec::with_capacity(header.phnum as usize);
        for i in 0..header.phnum {
            let offset = header.phoff as usize + i as usize * header.phentsize as usize;
            let ph = pheader::Pheader::from(b, offset)?;
            pheaders.push(ph);
        }

        Ok(Elf { header, pheaders })
    }
}
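
The loop locates program header i at phoff + i * phentsize. Here is that arithmetic on its own, written with checked operations so a hostile header cannot overflow on a 32-bit host; phdr_offset and phdr_table_fits are hypothetical helpers for this sketch and not part of the parser above:

```rust
/// file offset of the program header at `index`, None if the arithmetic overflows
fn phdr_offset(phoff: u32, phentsize: u16, index: u16) -> Option<usize> {
    (index as usize)
        .checked_mul(phentsize as usize)?
        .checked_add(phoff as usize)
}

/// true if a table of `phnum` entries ends within a file of `file_len` bytes
fn phdr_table_fits(phoff: u32, phentsize: u16, phnum: u16, file_len: usize) -> bool {
    // the offset of the (non-existent) entry number phnum is the end of the table
    match phdr_offset(phoff, phentsize, phnum) {
        Some(end) => end <= file_len,
        None => false,
    }
}
```

For the example binary (phoff 52, phentsize 32, phnum 1) the only program header starts at byte 52 and the table ends at byte 84.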

Debug-printing the parsed Elf gives roughly the equivalent of readelf -l:

TEXT
Elf {
    header: Header {
        ident: Identifier {
            magic: [127, 69, 76, 70],
            class: 1,
            data: 1,
            version: 1,
            os_abi: 0,
            abi_version: 0,
            _pad: [0, 0, 0, 0, 0, 0, 0]
        },
        type: Executable,
        machine: EM_ARM,
        version: 1,
        entry: 32768,
        phoff: 52,
        shoff: 4572,
        flags: 83886592,
        ehsize: 52,
        phentsize: 32,
        phnum: 1,
        shentsize: 40,
        shnum: 8,
        shstrndx: 7
    },
    pheaders: [
        Pheader {
            type: LOAD,
            offset: 4096,
            vaddr: 32768,
            paddr: 32768,
            filesz: 12,
            memsz: 12,
            flags: Flags(5),
            align: 4096
        }
    ]
}

Or in the debug output of stinkarm:

TEXT
[     0.613ms] opening binary "examples/asm.elf"
[     0.721ms] parsing ELF...
[     0.744ms] \
ELF Header:
  Magic:              [7f, 45, 4c, 46]
  Class:              ELF32
  Data:               Little endian
  Type:               Executable
  Machine:            EM_ARM
  Version:            1
  Entry point:        0x8000
  Program hdr offset: 52 (32 bytes each)
  Section hdr offset: 4572
  Flags:              0x05000200
  EH size:            52
  # Program headers:  1
  # Section headers:  8
  Str tbl index:      7

Program Headers:
  Type       Offset   VirtAddr   PhysAddr   FileSz    MemSz  Flags  Align
  LOAD     0x001000 0x00008000 0x00008000 0x00000c 0x00000c    R|X 0x1000

Dumping ELF segments into memory

Since the only reason for parsing the ELF headers is to know where to put which segment with which permissions, I want to quickly interject on why we have to put said segments at these specific addresses. The main reason is that the linker resolved all absolute addresses, pointers and pc-relative offsets under the assumption that each segment is loaded at its Pheader::vaddr, here 0x8000 (which for this binary is also Elf32_Ehdr.entry).

Before mapping each segment at its Pheader::vaddr, we have to understand: one does not simply mmap with MAP_FIXED or MAP_FIXED_NOREPLACE at the virtual address 0x8000. The Linux kernel won’t let us, and rightfully so. man mmap says:

If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there.

And /proc/sys/vm/mmap_min_addr on my system is u16::MAX (2^16 - 1 = 65535), so mapping our segment at 0x8000 (32768) is not allowed:

RUST
let segment = sys::mmap::mmap(
    // this is only UB if dereferenced, it's just a hint, so it's safe here
    Some(unsafe { std::ptr::NonNull::new_unchecked(0x8000 as *mut u8) }),
    4096,
    sys::mmap::MmapProt::WRITE,
    sys::mmap::MmapFlags::ANONYMOUS
        | sys::mmap::MmapFlags::PRIVATE
        | sys::mmap::MmapFlags::NOREPLACE,
    -1,
    0,
)
.unwrap();

Running the above with our vaddr of 0x8000 results in:

TEXT
thread 'main' panicked at src/main.rs:33:6:
called `Result::unwrap()` on an `Err` value: "mmap failed (errno 1): Operation not permitted
(os error 1)"

It only works with elevated privileges, which is something I don’t want to run my emulator with.
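
The limit can also be checked up front instead of running into EPERM. A minimal sketch, assuming a Linux host with procfs mounted; parse_min_addr, mmap_min_addr and can_map_fixed are names invented for this example, and the check only models the mmap_min_addr rule, not every other reason a fixed mapping might fail:

```rust
use std::fs;

/// parses the content of /proc/sys/vm/mmap_min_addr: a decimal number plus newline
fn parse_min_addr(content: &str) -> Option<u64> {
    content.trim().parse().ok()
}

/// reads the limit from procfs; None if the file is missing or unparsable
#[allow(dead_code)]
fn mmap_min_addr() -> Option<u64> {
    let content = fs::read_to_string("/proc/sys/vm/mmap_min_addr").ok()?;
    parse_min_addr(&content)
}

/// true if an unprivileged fixed mapping at `addr` is not below the limit
fn can_map_fixed(addr: u64, min_addr: u64) -> bool {
    addr >= min_addr
}
```

With a limit around 64 KiB, can_map_fixed(0x8000, limit) is false, which matches the panic above.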

Translating guest memory access to host memory access

The obvious fix is to not mmap below u16::MAX and let the kernel choose where we dump our segment:

RUST
let segment = sys::mmap::mmap(
    None,
    4096,
    MmapProt::WRITE,
    MmapFlags::ANONYMOUS | MmapFlags::PRIVATE,
    -1,
    0,
).unwrap();

But this means the segment of the process to emulate is not at 0x8000, but wherever the kernel decides to put it. So we need to add a translation layer between guest and host memory (if you’re familiar with how virtual memory works: it’s similar, with one more indirection):

TEXT
+--guest--+
| 0x8000  | ------------+
+---------+             |
                        |
                    Mem::translate
                        |
+------host------+      |
| 0x7f5b4b8f8000 | <----+
+----------------+

Putting this into rust:

  • map_region registers a region of memory and allows Mem to take ownership for calling munmap on these segments once it goes out of scope
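  • translate takes a guest addr and translates it to a host addr

RUST
use std::collections::BTreeMap;

struct MappedSegment {
    host_ptr: *mut u8,
    len: u32,
}

pub struct Mem {
    /// guest base address -> host mapping
    maps: BTreeMap<u32, MappedSegment>,
}

impl Mem {
    pub fn new() -> Self {
        Self {
            maps: BTreeMap::new(),
        }
    }

    pub fn map_region(&mut self, guest_addr: u32, len: u32, host_ptr: *mut u8) {
        self.maps
            .insert(guest_addr, MappedSegment { host_ptr, len });
    }

    /// finds the segment containing guest_addr and the offset into it
    fn lookup(&self, guest_addr: u32) -> Option<(&MappedSegment, u32)> {
        // Find the greatest key <= guest_addr.
        let (&base, seg) = self.maps.range(..=guest_addr).next_back()?;
        // base <= guest_addr, so this cannot underflow; comparing the offset (instead of
        // base + len) also avoids wrapping at the top of the address space
        let offset = guest_addr - base;
        if offset < seg.len {
            Some((seg, offset))
        } else {
            None
        }
    }

    /// translate a guest addr to a host addr we can write and read from
    pub fn translate(&self, guest_addr: u32) -> Option<*mut u8> {
        let (seg, offset) = self.lookup(guest_addr)?;
        Some(unsafe { seg.host_ptr.add(offset as usize) })
    }

    pub fn read_u32(&self, guest_addr: u32) -> Option<u32> {
        let (seg, offset) = self.lookup(guest_addr)?;
        // all four bytes of the word have to live inside the segment
        if seg.len - offset < 4 {
            return None;
        }
        let ptr = unsafe { seg.host_ptr.add(offset as usize) } as *const u32;
        // the guest address is not guaranteed to be 4 byte aligned
        Some(u32::from_le(unsafe { ptr.read_unaligned() }))
    }
}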
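
To play with the lookup without any mmap or unsafe, the same range query works on plain Vecs. This is a simplified stand-in, assuming one Vec per segment; VecMem is a made-up type for illustration and not how stinkarm stores guest memory:

```rust
use std::collections::BTreeMap;

/// guest memory backed by plain Vecs instead of mmap'ed host pointers
struct VecMem {
    /// guest base address -> backing bytes
    maps: BTreeMap<u32, Vec<u8>>,
}

impl VecMem {
    fn new() -> Self {
        Self { maps: BTreeMap::new() }
    }

    fn map_region(&mut self, guest_addr: u32, bytes: Vec<u8>) {
        self.maps.insert(guest_addr, bytes);
    }

    /// same range query as in Mem, but yields the bytes from guest_addr to segment end
    fn translate(&self, guest_addr: u32) -> Option<&[u8]> {
        // greatest base <= guest_addr
        let (&base, seg) = self.maps.range(..=guest_addr).next_back()?;
        seg.get((guest_addr - base) as usize..)
            .filter(|rest| !rest.is_empty())
    }

    /// little-endian word at guest_addr, None if it is not fully mapped
    fn read_u32(&self, guest_addr: u32) -> Option<u32> {
        let w = self.translate(guest_addr)?.get(..4)?;
        Some(u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
    }
}
```

Mapping the bytes a1 00 a0 e3 at guest address 0x8000 and calling read_u32(0x8000) yields 0xe3a000a1, while any address outside the mapped bytes yields None.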

This fix has the added benefit of allowing us to sandbox guest memory fully, so we can validate each memory access before we allow a guest to host memory interaction.

Mapping segments with their permissions

The basic idea is similar to the way a JIT compiler works:

  1. create a mmap section with W permissions
  2. write bytes from elf into section
  3. zero rest of defined size
  4. change permission of section with mprotect to the permissions defined in the Pheader

RUST
/// mapping applies the configuration of self to the current memory context by creating the
/// segments with the corresponding permission bits, vaddr, etc
pub fn map(&self, raw: &[u8], guest_mem: &mut mem::Mem) -> Result<(), String> {
    // zero memory needed case, no clue if this actually ever happens, but we support it
    if self.memsz == 0 {
        return Ok(());
    }

    if self.vaddr == 0 {
        return Err("program header has a zero virtual address".into());
    }

    // we need page alignment, so either Elf32_Phdr.p_align or 4096
    let (start, _end, len) = self.alignments();

    // Instead of mapping at the guest vaddr (Linux doesn't allow for low addresses),
    // we allocate memory wherever the host kernel gives us.
    // This keeps guest memory sandboxed: guest addr != host addr.
    let segment = mem::mmap::mmap(
        None,
        len as usize,
        MmapProt::WRITE,
        MmapFlags::ANONYMOUS | MmapFlags::PRIVATE,
        -1,
        0,
    )?;

    let segment_ptr = segment.as_ptr();
    let segment_slice = unsafe { std::slice::from_raw_parts_mut(segment_ptr, len as usize) };

    // a malformed p_offset or p_filesz should be an error, not a panic
    let file_end = self
        .offset
        .checked_add(self.filesz)
        .ok_or("p_offset + p_filesz overflows")?;
    let file_slice: &[u8] = raw
        .get(self.offset as usize..file_end as usize)
        .ok_or("segment exceeds the file")?;

    // compute offset inside the mmapped slice where the segment should start
    let offset = (self.vaddr - start) as usize;

    // copy the segment contents to the mmaped segment
    segment_slice[offset..offset + file_slice.len()].copy_from_slice(file_slice);

    // we need to zero the remaining bytes
    if self.memsz > self.filesz {
        segment_slice
            [offset.wrapping_add(file_slice.len())..offset.wrapping_add(self.memsz as usize)]
            .fill(0);
    }

    // record mapping in guest memory table, so CPU can translate guest vaddr to host pointer;
    // segment_ptr corresponds to the aligned start, not to vaddr itself
    guest_mem.map_region(start, len, segment_ptr);

    // we change the permissions for our segment from W to the segments requested bits
    mem::mmap::mprotect(segment, len as usize, self.flags.into())
}

/// returns (start, end, len)
fn alignments(&self) -> (u32, u32, u32) {
    // we need page alignment, so either Elf32_Phdr.p_align or 4096
    let align = match self.align {
        0 => 0x1000,
        _ => self.align,
    };
    let start = self.vaddr & !(align - 1);
    let end = (self.vaddr.wrapping_add(self.memsz).wrapping_add(align) - 1) & !(align - 1);
    let len = end - start;
    (start, end, len)
}
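
The alignment arithmetic is easier to follow with concrete numbers. A standalone sketch of the same computation as a free function (aligned_span is a name chosen for this example; like the method above it assumes align is zero or a power of two):

```rust
/// (start, end, len) of the aligned span covering vaddr..vaddr + memsz
fn aligned_span(vaddr: u32, memsz: u32, align: u32) -> (u32, u32, u32) {
    // an alignment of 0 falls back to a 4096 byte page
    let align = if align == 0 { 0x1000 } else { align };
    let mask = !(align - 1);
    // round the start down and the end up to the next multiple of align
    let start = vaddr & mask;
    let end = (vaddr.wrapping_add(memsz).wrapping_add(align) - 1) & mask;
    (start, end, end - start)
}
```

For the example binary, aligned_span(0x8000, 12, 0x1000) gives (0x8000, 0x9000, 0x1000): the 12 bytes of code occupy a single page. A segment at 0x8abc with memsz 0x2000 would start at 0x8000, end at 0xb000 and span three pages, with the file content placed 0xabc bytes into the mapping.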

map is called in the emulator’s entry point:

RUST
let elf: elf::Elf = (&buf as &[u8]).try_into().expect("Failed to parse binary");
let mut mem = mem::Mem::new();
for phdr in elf.pheaders {
    if phdr.r#type == elf::pheader::Type::LOAD {
        phdr.map(&buf, &mut mem)
            .expect("Mapping program header failed");
    }
}

Decoding armv7

We can now request a word (32bit) from our LOAD segment which contains the .text section bytes one can inspect via objdump:

TEXT
$ arm-none-eabi-objdump -d examples/exit.elf

examples/exit.elf:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000:       e3a000a1        mov     r0, #161        @ 0xa1
    8004:       e3a07001        mov     r7, #1
    8008:       ef000000        svc     0x00000000

So we use Mem::read_u32(0x8000) and get 0xe3a000a1.

Decoding armv7 instructions seems doable at a glance, but it is a deeper rabbit hole than I expected. Prepare for a section heavy on bit shifting, implicit behaviour and intertwined meaning:

Instructions more or less fall into four groups:

  1. Branch and control
  2. Data processing
  3. Load and store
  4. Other (syscalls & stuff)

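Each armv7 instruction is 32 bits wide. The exact layout differs per instruction group; for data-processing instructions such as mov it is as follows (bit 31 on the left):

TEXT
+--------+------+-----+----------+-----+------+------+------------+
|  cond  |  00  |  I  |  opcode  |  S  |  Rn  |  Rd  |  operand2  |
|   4b   |  2b  | 1b  |    4b    | 1b  |  4b  |  4b  |    12b     |
+--------+------+-----+----------+-----+------+------+------------+

| bit range | name     | description                                                    |
| --------- | -------- | -------------------------------------------------------------- |
| 31..28    | cond     | contains EQ, NE, etc                                           |
| 27..26    | 00       | marks the data-processing group                                |
| 25        | I        | set if operand2 is an immediate                                |
| 24..21    | opcode   | for instance 0b1101 for mov                                    |
| 20        | S        | set if the instruction updates the condition flags             |
| 19..16    | Rn       | first source register                                          |
| 15..12    | Rd       | destination register                                           |
| 11..0     | operand2 | rotated immediate (I=1) or register with shift amount (I=0)    |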
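
To make the field boundaries concrete, here is a sketch that pulls a data-processing word apart by shifting and masking. It follows the encoding from the Arm Architecture Reference Manual (cond in bits 31..28, I in bit 25, opcode in bits 24..21, S in bit 20, Rn in bits 19..16, Rd in bits 15..12, operand2 in bits 11..0); DataProcFields and split_fields are names invented for this example:

```rust
/// the fields of a data-processing instruction
#[derive(Debug, PartialEq, Eq)]
struct DataProcFields {
    cond: u8,      // bits 31..28
    i_bit: bool,   // bit 25
    opcode: u8,    // bits 24..21
    s_bit: bool,   // bit 20
    rn: u8,        // bits 19..16
    rd: u8,        // bits 15..12
    operand2: u16, // bits 11..0
}

/// shift the wanted field down to bit 0, then mask off everything above it
fn split_fields(word: u32) -> DataProcFields {
    DataProcFields {
        cond: ((word >> 28) & 0xF) as u8,
        i_bit: ((word >> 25) & 0x1) != 0,
        opcode: ((word >> 21) & 0xF) as u8,
        s_bit: ((word >> 20) & 0x1) != 0,
        rn: ((word >> 16) & 0xF) as u8,
        rd: ((word >> 12) & 0xF) as u8,
        operand2: (word & 0xFFF) as u16,
    }
}
```

For 0xe3a000a1 (mov r0, #161) this yields cond 0xE (always), I set, opcode 0b1101 (mov), Rd 0 and operand2 0x0a1, which is 161.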

Rust representation

Since cond decides whether or not the instruction is executed, I decided on the following struct to be the decoded instruction:

RUST
#[derive(Debug, Copy, Clone)]
pub struct InstructionContainer {
    pub cond: u8,
    pub instruction: Instruction,
}

#[derive(Debug, Copy, Clone)]
pub enum Instruction {
    MovImm { rd: u8, rhs: u32 },
    Svc,
    LdrLiteral { rd: u8, addr: u32 },
    Unknown(u32),
}
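
How cond gates execution is not needed for the two example binaries (every instruction in them uses 0xE, always), but for reference, here is a sketch of the check against the N, Z, C and V flags as defined in the Arm Architecture Reference Manual. CondFlags and cond_passes are hypothetical names, and treating 0xF like "always" is a simplification made for this example:

```rust
/// the four condition flags of the CPSR
#[derive(Debug, Clone, Copy, Default)]
struct CondFlags {
    n: bool, // negative
    z: bool, // zero
    c: bool, // carry
    v: bool, // overflow
}

/// true if an instruction with this condition field should execute
fn cond_passes(cond: u8, f: CondFlags) -> bool {
    match cond & 0xF {
        0x0 => f.z,                  // EQ
        0x1 => !f.z,                 // NE
        0x2 => f.c,                  // CS / HS
        0x3 => !f.c,                 // CC / LO
        0x4 => f.n,                  // MI
        0x5 => !f.n,                 // PL
        0x6 => f.v,                  // VS
        0x7 => !f.v,                 // VC
        0x8 => f.c && !f.z,          // HI
        0x9 => !f.c || f.z,          // LS
        0xA => f.n == f.v,           // GE
        0xB => f.n != f.v,           // LT
        0xC => !f.z && f.n == f.v,   // GT
        0xD => f.z || f.n != f.v,    // LE
        _ => true,                   // 0xE is AL; 0xF is treated the same here (assumption)
    }
}
```

An execute loop would call this before applying the instruction's effects and simply advance the program counter when it returns false.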

These four variants (three real instructions plus a catch-all) are enough to support both the minimal binary from the intro and the asm hello world:

ARMASM
    .global _start
_start:
    mov r0, #161
    mov r7, #1
    svc #0

ARMASM
    .section .rodata
msg:
    .asciz "Hello, world!\n"

    .section .text
    .global _start
_start:
    ldr r0, =1
    ldr r1, =msg
    mov r2, #14
    mov r7, #4
    svc #0

    mov r0, #0
    mov r7, #1
    svc #0

General instruction detection

Our decoder is a function accepting a word, the program counter (we need this later for decoding the offset for ldr) and returning the aforementioned instruction container:

RUST
pub fn decode_word(word: u32, caddr: u32) -> InstructionContainer

Referring to the diagram shown before, I know the top 4 bits (31..28) are the condition, so I can extract these first. I also take the 3 bits right below them (27..25) to identify the instruction class (load and store, branch or data processing immediate):

RUST
// ...
let cond = ((word >> 28) & 0xF) as u8;
let top = ((word >> 25) & 0x7) as u8;

Immediate mov

Since there are immediate moves and non immediate moves, both 0b000 and 0b001 are valid top values we want to support.

RUST
// ...
if top == 0b000 || top == 0b001 {
    let i_bit = ((word >> 25) & 0x1) != 0;
    let opcode = ((word >> 21) & 0xF) as u8;
    if i_bit {
        // ...
    }
}

If the i bit is set, we can convert the opcode from its bits into something I can read a lot better:

RUST
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Op {
    // ...
    Mov = 0b1101,
}

static OP_TABLE: [Op; 16] = [
    // ...
    Op::Mov,
];

#[inline(always)]
fn op_from_bits(bits: u8) -> Op {
    debug_assert!(bits <= 0b1111);
    unsafe { *OP_TABLE.get_unchecked(bits as usize) }
}

We can now plug this in, match on the only ddi (data processing immediate) we know and extract both the destination register (rd) and the raw immediate value:

RUST
if top == 0b000 || top == 0b001 {
    // Data-processing: top is 0b001 when I == 1 (immediate operand, ddi)
    // and 0b000 when I == 0 (register operand)
    let i_bit = ((word >> 25) & 0x1) != 0;
    let opcode = ((word >> 21) & 0xF) as u8;
    if i_bit {
        match op_from_bits(opcode) {
            Op::Mov => {
                let rd = ((word >> 12) & 0xF) as u8;
                let imm12 = word & 0xFFF;
                // ...
            }
            _ => todo!(),
        }
    }
}

From the examples before one can see the immediate value is prefixed with #. To move the value 161 into r0 we do:

ASM
mov r0, #161

Since only 12 bits are available for the immediate, the arm engineers split them: the low 8 bits hold a value and the upper 4 bits hold a rotation. The value is rotated right by twice the rotation field, which lets 12 bits reach constants all over the 32-bit range:

RUST
#[inline(always)]
fn decode_rotated_imm(imm12: u32) -> u32 {
    let rotate = ((imm12 >> 8) & 0b1111) * 2;
    (imm12 & 0xff).rotate_right(rotate)
}
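
A few hand-encoded values make the rotation more tangible. The sketch below repeats the function so it compiles on its own; `rotated_imm_examples` and the three imm12 values are my own picks for illustration, not taken from stinkarm.

```rust
/// Same logic as decode_rotated_imm, repeated so the sketch is self-contained.
pub fn decode_rotated_imm(imm12: u32) -> u32 {
    let rotate = ((imm12 >> 8) & 0b1111) * 2;
    (imm12 & 0xff).rotate_right(rotate)
}

/// Hypothetical helper: pairs of (imm12, decoded value).
pub fn rotated_imm_examples() -> [(u32, u32); 3] {
    [
        (0x0A1, decode_rotated_imm(0x0A1)), // rotation field 0:  0xA1 stays 161
        (0x4FF, decode_rotated_imm(0x4FF)), // rotation field 4:  0xFF rotated right by 8
        (0xA01, decode_rotated_imm(0xA01)), // rotation field 10: 0x01 rotated right by 20
    ]
}
```

The second entry decodes to 0xFF000000 and the third to 0x1000, so small immediates can still describe large constants as long as all set bits fit into one rotated byte.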

Plugging this back in results in us being able to fully decode mov r0, #161:

RUST
if top == 0b000 || top == 0b001 {
    let i_bit = ((word >> 25) & 0x1) != 0;
    let opcode = ((word >> 21) & 0xF) as u8;
    if i_bit {
        match op_from_bits(opcode) {
            Op::Mov => {
                let rd = ((word >> 12) & 0xF) as u8;
                let imm12 = word & 0xFFF;
                let rhs = decode_rotated_imm(imm12);
                return InstructionContainer {
                    cond,
                    instruction: Instruction::MovImm { rd, rhs },
                };
            }
            _ => todo!(),
        }
    }
}

As seen when dbg!-ing the cpu steps:

TEXT
[src/cpu/mod.rs:114:13] decoder::decode_word(word, self.pc()) =
InstructionContainer {
    cond: 14,
    instruction: MovImm {
        rd: 0,
        rhs: 161,
    },
}

Load and Store

ldr is part of the load and store instruction group. It is needed to access Hello, world! in .rodata and to put a pointer to it into a register.

In comparison to immediate mov we have to do a little trick, since we only want to match loads and stores that are:

  • single register transfers
  • load and store with an immediate offset

So we only decode:

ARMASM
LDR Rd, [Rn, #imm]
LDR Rd, [Rn], #imm
@ etc

Thus we match with (top >> 1) & 0b11 == 0b01 and start extracting a whole bucket load of bit flags:

RUST
if (top >> 1) & 0b11 == 0b01 {
    let p = ((word >> 24) & 1) != 0;
    let u = ((word >> 23) & 1) != 0;
    let b = ((word >> 22) & 1) != 0;
    let w = ((word >> 21) & 1) != 0;
    let l = ((word >> 20) & 1) != 0;
    let rn = ((word >> 16) & 0xF) as u8;
    let rd = ((word >> 12) & 0xF) as u8;
    let imm12 = (word & 0xFFF) as u32;

    // Literal-pool version
    if l && rn == 0b1111 && p && u && !w && !b {
        let pc_seen = caddr.wrapping_add(8);
        let literal_addr = pc_seen.wrapping_add(imm12);

        return InstructionContainer {
            cond,
            instruction: Instruction::LdrLiteral {
                rd,
                addr: literal_addr,
            },
        };
    }

    todo!("only LDR with p&u&!w&!b is implemented")
}

| bit | description |
| --- | --- |
| p | pre-indexed addressing, offset applied before the load |
| u | add (1) vs subtract (0) offset |
| b | word (0) or byte (1) sized access |
| w | write back to base (1) or not (0) |
| l | load (1) or store (0) |

ldr Rd, <addr> matches exactly: load, base register is PC (rn == 0b1111), pre-indexed addressing, added offset, no write back and no byte sized access (l && rn == 0b1111 && p && u && !w && !b).
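
To double check the pc + 8 arithmetic, here is a standalone sketch of the literal-pool branch. `literal_addr` is a hypothetical helper, and both the instruction word 0xE59F1014 (`ldr r1, [pc, #20]` encoded by hand) and the fetch address 0x8004 are assumptions I derived from the hello world layout, not values printed by stinkarm.

```rust
/// Hypothetical standalone version of the literal-pool check: returns the
/// address of the literal for `ldr rd, [pc, #imm12]`, None for any other word.
pub fn literal_addr(word: u32, caddr: u32) -> Option<u32> {
    let bit = |n: u32| ((word >> n) & 1) != 0;
    let is_load_store = ((word >> 26) & 0b11) == 0b01;
    let rn = (word >> 16) & 0xF;
    // l && rn == pc && p && u && !w && !b, same condition as in the decoder
    if is_load_store && bit(20) && rn == 0b1111 && bit(24) && bit(23) && !bit(21) && !bit(22) {
        // in arm state the pc reads as the address of the current instruction + 8
        let pc_seen = caddr.wrapping_add(8);
        Some(pc_seen.wrapping_add(word & 0xFFF))
    } else {
        None
    }
}
```

With the word at 0x8004 the pc reads as 0x800C, and adding the offset of 20 gives 0x8020 (32800), which is the address the decoder reports for the hello world binary.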

Syscalls

Syscalls are the only way to interact with the Linux kernel (as far as I know), so we definitely need to implement both decoding and forwarding. Bits 27-24 are 1111 for system calls. The immediate value is irrelevant for us, since the Linux syscall handler discards it either way:

RUST
if ((word >> 24) & 0xF) as u8 == 0b1111 {
    return InstructionContainer {
        cond,
        // technically arm says svc has a 24bit immediate but we don't care
        // about it, since the Linux kernel also doesn't
        instruction: Instruction::Svc,
    };
}

We can now fully decode all instructions for both the simple exit and the more advanced hello world binary:

TEXT
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 161, }
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 1, }
[src/cpu/mod.rs:121:15] instruction = Svc

TEXT
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 1, }
[src/cpu/mod.rs:121:15] instruction = LdrLiteral { rd: 1, addr: 32800, }
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 2, rhs: 14, }
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 4, }
[src/cpu/mod.rs:121:15] instruction = Svc
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 0, }
[src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 1, }
[src/cpu/mod.rs:121:15] instruction = Svc

Emulating the CPU

This is by FAR the easiest part. I only struggled with the double indirection for ldr (I simply didn’t know about it), but one problem at a time :^).

RUST
pub struct Cpu<'cpu> {
    /// r0-r15 (r13=SP, r14=LR, r15=PC)
    pub r: [u32; 16],
    pub cpsr: u32,
    pub mem: &'cpu mut mem::Mem,
    /// only set by ArmSyscall::Exit to propagate exit code to the host
    pub status: Option<i32>,
}

impl<'cpu> Cpu<'cpu> {
    pub fn new(mem: &'cpu mut mem::Mem, pc: u32) -> Self {
        let mut s = Self {
            r: [0; 16],
            cpsr: 0x60000010,
            mem,
            status: None,
        };
        s.r[15] = pc;
        s
    }
}

Instantiating the cpu:

RUST
let mut cpu = cpu::Cpu::new(&mut mem, elf.header.entry);

Conditional Instructions?

When writing the decoder I was confused by the 4 condition bits. I always thought one does conditional execution by using a branch to jump over instructions that shouldn't be executed. That was before I learned that arm supports both ways (the armv7 reference says this feature should only be used if there aren't multiple instructions depending on the same condition, otherwise one should use branches), so I need to support this too:

RUST
impl<'cpu> Cpu<'cpu> {
    #[inline(always)]
    fn cond_passes(&self, cond: u8) -> bool {
        match cond {
            0x0 => (self.cpsr >> 30) & 1 == 1, // EQ: Z == 1
            0x1 => (self.cpsr >> 30) & 1 == 0, // NE: Z == 0
            0xE => true,                       // AL (always)
            // NV (never) on old cores; armv7 reserves 0b1111 for
            // unconditional encodings, none of which we decode
            0xF => false,
            _ => false,                        // strict false
        }
    }
}
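
The remaining conditions all derive from the N, Z, C and V flags in the top four bits of the cpsr. The following is a sketch of how the full table could look, not what stinkarm ships: `cond_passes_full` is a hypothetical free function that takes the cpsr as an argument so it compiles without the Cpu struct.

```rust
/// Hypothetical extension of cond_passes covering all condition codes.
pub fn cond_passes_full(cpsr: u32, cond: u8) -> bool {
    let n = (cpsr >> 31) & 1 == 1; // negative
    let z = (cpsr >> 30) & 1 == 1; // zero
    let c = (cpsr >> 29) & 1 == 1; // carry
    let v = (cpsr >> 28) & 1 == 1; // overflow
    match cond {
        0x0 => z,              // EQ
        0x1 => !z,             // NE
        0x2 => c,              // CS/HS
        0x3 => !c,             // CC/LO
        0x4 => n,              // MI
        0x5 => !n,             // PL
        0x6 => v,              // VS
        0x7 => !v,             // VC
        0x8 => c && !z,        // HI
        0x9 => !c || z,        // LS
        0xA => n == v,         // GE
        0xB => n != v,         // LT
        0xC => !z && (n == v), // GT
        0xD => z || (n != v),  // LE
        0xE => true,           // AL
        _ => false,            // 0xF: kept as "never", like the original
    }
}
```

With the initial cpsr of 0x60000010 (Z and C set), EQ, CS, LS, GE and LE pass while NE, HI, LT and GT do not.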

Instruction dispatch

After implementing the necessary checks and setup for emulating the cpu, the CPU can now check if an instruction is to be executed, match on the decoded instruction and run the associated logic:

RUST
impl<'cpu> Cpu<'cpu> {
    #[inline(always)]
    fn pc(&self) -> u32 {
        self.r[15] & !0b11
    }

    /// moves pc forward a word
    #[inline(always)]
    fn advance(&mut self) {
        self.r[15] = self.r[15].wrapping_add(4);
    }

    pub fn step(&mut self) -> Result<bool, err::Err> {
        let Some(word) = self.mem.read_u32(self.pc()) else {
            return Ok(false);
        };

        if word == 0 {
            // zero instruction means we hit zeroed out rest of the page
            return Ok(false);
        }

        let InstructionContainer { instruction, cond } = decoder::decode_word(word, self.pc());

        if !self.cond_passes(cond) {
            self.advance();
            return Ok(true);
        }

        match instruction {
            decoder::Instruction::MovImm { rd, rhs } => {
                self.r[rd as usize] = rhs;
            }
            decoder::Instruction::Unknown(w) => {
                return Err(err::Err::UnknownOrUnsupportedInstruction(w));
            }
            i => {
                stinkln!(
                    "found unimplemented instruction, exiting: {:#x}:={:?}",
                    word,
                    i
                );
                self.status = Some(1);
            }
        }

        self.advance();

        Ok(true)
    }
}
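
To see fetch, decode and execute in one place, here is a heavily cut-down sketch that runs the three-instruction exit example from a plain slice. Everything in it (`Ins`, `decode`, `run`) is hypothetical and much simpler than stinkarm: there is no memory translation, the condition bits are ignored and only exit is recognised as a syscall. The instruction words are hand-assembled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ins {
    MovImm { rd: u8, rhs: u32 },
    Svc,
    Unknown(u32),
}

/// Decodes only what the exit example needs (cond is assumed to be AL).
pub fn decode(word: u32) -> Ins {
    if ((word >> 24) & 0xF) == 0b1111 {
        return Ins::Svc;
    }
    let is_mov_imm = ((word >> 25) & 0x7) == 0b001 && ((word >> 21) & 0xF) == 0b1101;
    if is_mov_imm {
        let rotate = ((word >> 8) & 0xF) * 2;
        return Ins::MovImm {
            rd: ((word >> 12) & 0xF) as u8,
            rhs: (word & 0xFF).rotate_right(rotate),
        };
    }
    Ins::Unknown(word)
}

/// Runs `code` starting at index 0. Returns the exit code once svc is hit
/// with r7 == 1 (exit), None on anything else.
pub fn run(code: &[u32]) -> Option<u32> {
    let mut r = [0u32; 16];
    loop {
        // fetch: r15 is the pc, one instruction per 4 bytes
        let word = *code.get((r[15] / 4) as usize)?;
        // decode and execute
        match decode(word) {
            Ins::MovImm { rd, rhs } => r[rd as usize] = rhs,
            Ins::Svc if r[7] == 1 => return Some(r[0]),
            _ => return None,
        }
        // advance the pc by one word
        r[15] = r[15].wrapping_add(4);
    }
}
```

Calling `run(&[0xE3A000A1, 0xE3A07001, 0xEF000000])` walks through mov r0, #161, mov r7, #1 and svc #0 and hands back the guest's exit code.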

LDR and addresses in literal pools

While the chapter "Translating guest memory access to host memory access" goes into depth on translating and forwarding guest memory accesses to host memory addresses, this chapter focuses on the layout of literals in armv7 and how ldr indirects memory access.

Let's first take a look at the ldr instruction of our hello world example:

ARMASM
    .section .rodata
    @ define a string with the `msg` label
msg:
    @ asciz is like ascii but zero terminated
    .asciz "Hello, world!\n"

    .section .text
    .global _start
_start:
    @ load the addr of msg into r1, via its literal pool entry
    ldr r1, =msg

The as documentation says:

LDR

ARMASM
ldr <register>, = <expression>

If expression evaluates to a numeric constant then a MOV or MVN instruction will be used in place of the LDR instruction, if the constant can be generated by either of these instructions. Otherwise the constant will be placed into the nearest literal pool (if it not already there) and a PC relative LDR instruction will be generated.

Now this may not make sense at first glance: why would =msg be assembled into the address of a slot that in turn holds the address of msg? Because an armv7 instruction can not encode a full 32-bit address. The instruction itself is only 32 bits wide and its immediate operand is restricted to an 8-bit value rotated right by an even number of bits. Instead, the argument of ldr points to a literal pool entry. This entry is a 32-bit value and reading it produces the actual address of msg.
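
One way to convince yourself is to try squeezing the address of msg (0x8024 in the hello world binary) into the rotated immediate form. The encoder below is a sketch I wrote for illustration; `encode_rotated_imm` is hypothetical and not how the assembler or stinkarm does it, it simply inverts the decoding rule by brute force.

```rust
/// Hypothetical helper: tries to express `value` as an 8-bit constant rotated
/// right by an even amount. Returns the 12-bit field on success.
pub fn encode_rotated_imm(value: u32) -> Option<u32> {
    for rot in 0..16u32 {
        // undo the decoder's rotate_right(2 * rot)
        let imm8 = value.rotate_left(rot * 2);
        if imm8 <= 0xFF {
            return Some((rot << 8) | imm8);
        }
    }
    None
}
```

161 and 0xFF000000 both fit, while 0x8024 has set bits spread over 14 bit positions and yields None, so the assembler has to fall back to a literal pool entry.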

When decoding we can see ldr points to a memory address (32800 or 0x8020) in the segment we mmapped earlier:

TEXT
[src/cpu/mod.rs:121:15] instruction = LdrLiteral { rd: 1, addr: 32800 }

Before accessing guest memory, we must translate said guest address to a host address:

TEXT
+--ldr.addr--+
|   0x8020   |
+------------+
      |
      |             +-------------Mem::read_u32(addr)-------------+
      |             |                                             |
      |             |   +--guest--+                               |
      |             |   |  0x8020 | ------------+                 |
      |             |   +---------+             |                 |
      |             |                           |                 |
      +-----------> |                       Mem::translate        |
                    |                           |                 |
                    |   +------host------+      |                 |
                    |   | 0x7ffff7f87020 | <----+                 |
                    |   +----------------+                        |
                    |                                             |
                    +---------------------------------------------+
                                           |
+--literal-ptr--+                          |
|     0x8024    | <------------------------+
+---------------+

Or in code:

RUST
impl<'cpu> Cpu<'cpu> {
    pub fn step(&mut self) -> Result<bool, err::Err> {
        // ...
        match instruction {
            decoder::Instruction::LdrLiteral { rd, addr } => {
                // addr is the literal pool slot, the value stored there is
                // the address of msg
                self.r[rd as usize] = self.mem.read_u32(addr).expect("Segfault");
            }
            // ...
        }
        // ...
    }
}

Any other instruction using an address will also have to go through the Mem::translate indirection.
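
The double indirection is easier to follow with a toy memory. The sketch below is not stinkarm's Mem: `GuestMem` and `hello_segment` are hypothetical, the segment lives in a Vec<u8> instead of an mmap-ed region, and translate returns an offset into that Vec instead of a host pointer. The layout (literal pool entry at 0x8020, string at 0x8024) follows the addresses from the logs.

```rust
/// Hypothetical stand-in for stinkarm's Mem: a single segment in a Vec<u8>.
pub struct GuestMem {
    base: u32,      // guest address of the first byte
    bytes: Vec<u8>, // host storage
}

impl GuestMem {
    pub fn new(base: u32, bytes: Vec<u8>) -> Self {
        Self { base, bytes }
    }

    /// Guest address -> offset into host storage, None if out of range.
    pub fn translate(&self, guest: u32) -> Option<usize> {
        let off = guest.checked_sub(self.base)? as usize;
        (off < self.bytes.len()).then_some(off)
    }

    /// Reads a little-endian u32 at a guest address.
    pub fn read_u32(&self, guest: u32) -> Option<u32> {
        let off = self.translate(guest)?;
        let b = self.bytes.get(off..off + 4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Rebuilds the data part of the hello world segment by hand.
pub fn hello_segment() -> GuestMem {
    let mut bytes = vec![0u8; 0x20]; // placeholder for the 8 instructions
    bytes.extend_from_slice(&0x8024u32.to_le_bytes()); // literal pool entry
    bytes.extend_from_slice(b"Hello, world!\n\0"); // msg
    GuestMem::new(0x8000, bytes)
}
```

Reading a u32 at 0x8020 yields 0x8024, and only translating that second address gets you to the bytes of the string. Addresses outside the 51 byte segment come back as None, which is where the real emulator reports a segfault.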

Forwarding Syscalls and other feature flag based logic

Since stinkarm has three ways of dealing with syscalls (deny, sandbox, forward), I decided on handling the selection of the appropriate logic at cpu creation time via a function pointer attached to the CPU as the syscall_handler field:

RUST
type SyscallHandlerFn = fn(&mut Cpu, ArmSyscall) -> i32;

pub struct Cpu<'cpu> {
    /// r0-r15 (r13=SP, r14=LR, r15=PC)
    pub r: [u32; 16],
    pub cpsr: u32,
    pub mem: &'cpu mut mem::Mem,
    syscall_handler: SyscallHandlerFn,
    pub status: Option<i32>,
}

impl<'cpu> Cpu<'cpu> {
    pub fn new(conf: &'cpu config::Config, mem: &'cpu mut mem::Mem, pc: u32) -> Self {
        // ...

        // simplified, in stinkarm this gets wrapped if the user specifies
        // syscall traces via -lsyscalls or -v
        s.syscall_handler = match conf.syscalls {
            SyscallMode::Forward => translation::syscall_forward,
            SyscallMode::Sandbox => sandbox::syscall_sandbox,
            SyscallMode::Deny => sandbox::syscall_deny,
        };
        // ...
    }
}
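
Stripped of everything else, the pattern looks roughly like the sketch below. All names in it (`MiniCpu`, `SyscallMode` as defined here, the three stub handlers) are hypothetical stand-ins and the handlers do no real work; the point is only that the mode is matched once, at construction, and every svc afterwards is a plain indirect call.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyscallMode {
    Forward,
    Sandbox,
    Deny,
}

pub struct MiniCpu {
    pub r: [u32; 16],
    handler: fn(&mut MiniCpu, u32) -> i32,
}

// stub handlers, 38 is ENOSYS on Linux
fn forward(_cpu: &mut MiniCpu, _nr: u32) -> i32 {
    0 // a real handler would call into the host here
}
fn sandbox(cpu: &mut MiniCpu, _nr: u32) -> i32 {
    if cpu.r[0] > 2 { -38 } else { 0 }
}
fn deny(_cpu: &mut MiniCpu, _nr: u32) -> i32 {
    -38
}

impl MiniCpu {
    pub fn new(mode: SyscallMode) -> Self {
        // the mode is inspected exactly once
        let handler: fn(&mut MiniCpu, u32) -> i32 = match mode {
            SyscallMode::Forward => forward,
            SyscallMode::Sandbox => sandbox,
            SyscallMode::Deny => deny,
        };
        Self { r: [0; 16], handler }
    }

    /// What an svc boils down to: call the handler, store the result in r0.
    pub fn svc(&mut self) -> i32 {
        let (handler, nr) = (self.handler, self.r[7]);
        let ret = handler(self, nr);
        self.r[0] = ret as u32;
        ret
    }
}
```

Compared to matching on the mode inside every svc, this keeps the hot path free of configuration checks, at the cost of an indirect call.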

Calling conventions, armv7 vs x86

In our examples I obviously used the armv7 syscall calling convention. But this convention differs a lot from the calling convention of our x86 host (technically it's the x86-64 System V AMD64 ABI).

While armv7 uses r7 for the syscall number and r0-r5 for the syscall arguments, x86 uses rax for the syscall id and rdi, rsi, rdx, r10, r8 and r9 for the syscall arguments (rcx can’t be used since the syscall instruction clobbers it, thus Linux goes with r10).

The syscall numbers also differ between armv7 and x86-64: sys_write is 1 on x86-64 and 4 on armv7. If you are interested in calling conventions, syscall ids and documentation, do visit The Chromium Projects - Linux System Call Table, it is generated from Linux headers and fairly readable.

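Table version:

| usage | armv7 | x86-64 |
| --- | --- | --- |
| syscall id | r7 | rax |
| return | r0 | rax |
| arg0 | r0 | rdi |
| arg1 | r1 | rsi |
| arg2 | r2 | rdx |
| arg3 | r3 | r10 |
| arg4 | r4 | r8 |
| arg5 | r5 | r9 |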
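
The number mismatch can be captured in a small lookup. This is a sketch rather than stinkarm's approach (stinkarm calls a dedicated host wrapper per syscall instead of translating numbers generically); `arm_to_x86_64_nr` is a hypothetical name and the table only covers a few of the syscalls named in this post.

```rust
/// Hypothetical lookup: armv7 Linux syscall number -> x86-64 Linux syscall
/// number. Returns None for everything not listed.
pub fn arm_to_x86_64_nr(arm_nr: u32) -> Option<u64> {
    Some(match arm_nr {
        1 => 60, // exit
        2 => 57, // fork
        3 => 0,  // read
        4 => 1,  // write
        5 => 2,  // open
        6 => 3,  // close
        _ => return None,
    })
}
```

The register side of the table needs no lookup at all: r0-r5 simply become rdi, rsi, rdx, r10, r8 and r9 in that order.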

So something like writing TEXT123 to stdout looks like this on arm:

ARMASM
    .section .rodata
txt:
    .asciz "TEXT123\n"

    .section .text
    .global _start
_start:
    ldr r0, =1
    ldr r1, =txt
    mov r2, #8
    mov r7, #4
    svc #0

While it looks like the following on x86:

ASM
    .section .rodata
txt:
    .string "TEXT123\n"

    .section .text
    .global _start
_start:
    movq $1, %rax
    movq $1, %rdi
    leaq txt(%rip), %rsi
    movq $8, %rdx
    syscall

Hooking the syscall handler up

With the calling convention differences made clear, handling a syscall comes down to decoding the armv7 syscall number in r7, calling the handler with it and storing the result in r0:

RUST
impl<'cpu> Cpu<'cpu> {
    pub fn step(&mut self) -> Result<bool, err::Err> {
        // ...

        match instruction {
            // ...
            decoder::Instruction::Svc => {
                let syscall = ArmSyscall::try_from(self.r[7])?;
                self.r[0] = (self.syscall_handler)(self, syscall) as u32;
            }
            // ...
        }
        // ...
    }
}

Of course for this to work the syscall has to be implemented and, before that, decodable. For the decoding, there is the ArmSyscall enum:

RUST
#[derive(Debug)]
#[allow(non_camel_case_types)]
pub enum ArmSyscall {
    restart = 0x00,
    exit = 0x01,
    fork = 0x02,
    read = 0x03,
    write = 0x04,
    open = 0x05,
    close = 0x06,
}

impl TryFrom<u32> for ArmSyscall {
    type Error = err::Err;

    fn try_from(value: u32) -> Result<Self, Self::Error> {
        Ok(match value {
            0x00 => Self::restart,
            0x01 => Self::exit,
            0x02 => Self::fork,
            0x03 => Self::read,
            0x04 => Self::write,
            0x05 => Self::open,
            0x06 => Self::close,
            _ => return Err(err::Err::UnknownSyscall(value)),
        })
    }
}

By default the sandboxing mode is selected. I will go into detail on both sandboxing and denying syscalls later, first I want to focus on the implementation of the translation layer from armv7 to x86 syscalls:

RUST
pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
    match syscall {
        // none are implemented, dump debug print
        c => todo!("{:?}", c),
    }
}

Handling the only exception: exit

Since exit means the guest wants to exit, we can’t just forward it to the host system: that would terminate the emulator before it is able to clean up and unmap the memory regions it allocated.

RUST
pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
    match syscall {
        ArmSyscall::exit => {
            cpu.status = Some(cpu.r[0] as i32);
            0
        }
        // ...
    }
}

To both know we hit the exit syscall (we need to, otherwise the emulator executes further) and propagate the exit code to the host system, we set the Cpu::status field to Some(r0), which is the argument to the syscall.

This field is then used in the emulator entry point / main loop:

RUST
fn main() {
    let mut cpu = cpu::Cpu::new(&conf, &mut mem, elf.header.entry);

    loop {
        match cpu.step() { /**/ }

        // Cpu::status is only some if sys_exit was called, we exit the
        // emulation loop
        if cpu.status.is_some() {
            break;
        }
    }

    let status = cpu.status.unwrap_or(0);
    // cleaning up used memory via munmap
    mem.destroy();
    // propagating the status code to the host system
    exit(status);
}

Implementing: sys_write

The write syscall is not as spectacular as sys_exit: it writes a buffer of a given length to a file descriptor. On the x86-64 host side the arguments are:

| register | description |
| --- | --- |
| rax | syscall number (1 for write) |
| rdi | file descriptor (0 for stdin, 1 for stdout, 2 for stderr) |
| rsi | a pointer to the buffer |
| rdx | the length of the buffer rsi is pointing to |

It is necessary for doing the O of I/O tho, otherwise there won’t be any Hello, World!s on the screen.

RUST
use crate::{cpu, sys};

pub fn write(cpu: &mut cpu::Cpu, fd: u32, buf: u32, len: u32) -> i32 {
    // fast path for zero length buffer
    if len == 0 {
        return 0;
    }

    // Option::None returned from translate indicates invalid memory access
    let Some(buf_ptr) = cpu.mem.translate(buf) else {
        // so we return 'Bad Address'
        return -(sys::Errno::EFAULT as i32);
    };

    let ret: i64;
    unsafe {
        core::arch::asm!(
            "syscall",
            // syscall number
            in("rax") 1_u64,
            in("rdi") fd as u64,
            in("rsi") buf_ptr as u64,
            in("rdx") len as u64,
            lateout("rax") ret,
            // we clobber rcx
            out("rcx") _,
            // and r11
            out("r11") _,
            // we don't modify the stack
            options(nostack),
        );
    }

    ret.try_into().unwrap_or(i32::MAX)
}
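
The raw value coming back in rax deserves a second look: Linux signals failure by returning a negated errno, and that convention is passed on to the guest unchanged. The two helpers below are hypothetical and only illustrate the convention, they are not part of stinkarm.

```rust
/// Hypothetical helper: splits a raw Linux syscall return value into success
/// or errno. The kernel reports errors as values in -4095..=-1.
pub fn split_ret(ret: i64) -> Result<u64, i32> {
    if (-4095..=-1).contains(&ret) {
        Err((-ret) as i32)
    } else {
        Ok(ret as u64)
    }
}

/// What the guest ends up seeing in r0: the i32 result reinterpreted as u32.
pub fn to_guest_r0(ret: i32) -> u32 {
    ret as u32
}
```

A successful hello world write comes back as 14, a bad buffer as -14 (EFAULT), and a denied syscall returning -38 (ENOSYS) shows up in the guest's r0 as 0xFFFFFFDA.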

Adding it to translation::syscall_forward with its arguments according to the calling convention we established before:

RUST
pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
    match syscall {
        // ...
        ArmSyscall::write => {
            // read the registers first: cpu can't be borrowed mutably for the
            // call and read for its arguments at the same time
            let (fd, buf, len) = (cpu.r[0], cpu.r[1], cpu.r[2]);
            sys::write(cpu, fd, buf, len)
        }
        // ...
    }
}

Executing helloWorld.elf now results in:

SHELL
$ stinkarm -Cforward examples/helloWorld.elf
Hello, world!
$ echo $status
0

Deny and Sandbox - restricting syscalls

The simplest sandboxing mode is to deny everything. The more complex one is to allow some syscall interactions while others are denied, which requires checking the arguments to syscalls, not just the syscall kind.

Let's start with the easier syscall handler: deny. Deny simply returns ENOSYS to all invoked syscalls:

RUST
pub fn syscall_deny(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
    if let ArmSyscall::exit = syscall {
        cpu.status = Some(cpu.r[0] as i32)
    };

    -(sys::Errno::ENOSYS as i32)
}

Thus executing the hello world with syscall logs enabled shows that neither sys_write nor sys_exit is forwarded to the host and ENOSYS is returned in r0 for both. The exit code is still recorded in Cpu::status, so the emulator stops instead of running past the end of the program:

TEXT
$ stinkarm -Cdeny -lsyscalls examples/helloWorld.elf
148738 write(fd=1, buf=0x8024, len=14) [deny]
=ENOSYS
148738 exit(code=0) [deny]
=ENOSYS

At a high level, sandbox works like deny: it checks conditions before executing a syscall and disallows the syscall if they don’t match:

RUST
pub fn syscall_sandbox(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
    match syscall {
        ArmSyscall::exit => {
            cpu.status = Some(cpu.r[0] as i32);
            0
        }
        ArmSyscall::write => {
            let (r0, r1, r2) = (cpu.r[0], cpu.r[1], cpu.r[2]);
            // only allow writing to stdin, stdout and stderr
            if r0 > 2 {
                return -(sys::Errno::ENOSYS as i32);
            }

            sys::write(cpu, r0, r1, r2)
        }
        _ => todo!("{:?}", syscall),
    }
}

For instance, we only allow writing to stdin, stdout and stderr, no other file descriptors. One could also add pointer range checks, buffer length checks and other hardening measures here. Emulating the hello world example with this mode (which is the default mode):

TEXT
$ stinkarm -Csandbox -lsyscalls examples/helloWorld.elf
150147 write(fd=1, buf=0x8024, len=14) [sandbox]
Hello, world!
=14
150147 exit(code=0) [sandbox]
=0
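
The pointer range and buffer length checks mentioned above could look roughly like the following sketch. It is not implemented in stinkarm as far as this post shows: `check_write`, its parameters and the choice of EBADF and EFAULT (instead of the ENOSYS used by the sandbox handler) are all my assumptions, and it presumes a single mapped guest segment described by `seg_base` and `seg_len`.

```rust
pub const EBADF: i32 = 9;
pub const EFAULT: i32 = 14;

/// Hypothetical policy check, meant to run before a write is forwarded.
/// Returns the errno to report to the guest on failure.
pub fn check_write(fd: u32, buf: u32, len: u32, seg_base: u32, seg_len: u32) -> Result<(), i32> {
    // only stdin, stdout and stderr
    if fd > 2 {
        return Err(EBADF);
    }
    // buf..buf+len must lie inside the segment without wrapping around
    let end = buf.checked_add(len).ok_or(EFAULT)?;
    let seg_end = seg_base.checked_add(seg_len).ok_or(EFAULT)?;
    if buf < seg_base || end > seg_end {
        return Err(EFAULT);
    }
    Ok(())
}
```

For the hello world binary (segment at 0x8000, 0x33 bytes long) the write of 14 bytes at 0x8024 passes, while a length of 64 would reach past the mapped segment and be rejected before the host ever sees it.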

Fin

So there you have it, emulating armv7 in six steps:

  1. parsing and validating a 32-bit armv7 ELF binary
  2. mapping segments into host address space
  3. decoding a non-trivial subset of armv7 instructions
  4. handling program counter relative literal loads
  5. translating memory interactions from guest to host
  6. forwarding armv7 Linux syscalls into their x86-64 System V counterparts

Diving into the ELF and armv7 specs without any previous relevant experience, except the asm module I had in uni, was a bit overwhelming at first. Armv7 decoding was by far the most annoying part of the project and I still don’t like the bizarre argument ordering for x86-64 syscalls.

The whole project is about 1284 lines of Rust, has zero dependencies[1] and is, as far as I know, working correctly[2].

Microbenchmark Performance

It executes a real armv7 hello world binary in ~0.015ms of guest execution-only time, excluding process startup and parsing. The end-to-end execution, with all the stages I outlined before, takes about 2ms.

TEXT
$ stinkarm -v examples/helloWorld.elf
[     0.070ms] opening binary "examples/helloWorld.elf"
[     0.097ms] parsing ELF...
[     0.101ms] \
ELF Header:
  Magic:              [7f, 45, 4c, 46]
  Class:              ELF32
  Data:               Little endian
  Type:               Executable
  Machine:            EM_ARM
  Version:            1
  Entry point:        0x8000
  Program hdr offset: 52 (32 bytes each)
  Section hdr offset: 4696
  Flags:              0x05000200
  EH size:            52
  # Program headers:  1
  # Section headers:  9
  Str tbl index:      8

Program Headers:
  Type       Offset   VirtAddr   PhysAddr   FileSz    MemSz  Flags  Align
  LOAD     0x001000 0x00008000 0x00008000 0x000033 0x000033    R|X 0x1000

[     0.126ms] mapped program header `LOAD` of 51B (G=0x8000 -> H=0x7ffff7f87000)
[     0.129ms] jumping to entry G=0x8000 at H=0x7ffff7f87000
[     0.131ms] starting the emulator
153719 write(fd=1, buf=0x8024, len=14) [sandbox]
Hello, world!
=14
153719 exit(code=0) [sandbox]
=0
[     0.149ms] exiting with `0`

Comparing the whole pipeline (parsing elf, segment mapping, cpu setup, etc) to qemu we arrive at the following micro benchmark results. To be fair, qemu does a whole lot more than stinkarm, it has a jit, a full linux-user runtime, a dynamic loader, etc.

TEXT
$ hyperfine "./target/release/stinkarm examples/helloWorld.elf" -N --warmup 10
Benchmark 1: ./target/release/stinkarm examples/helloWorld.elf
  Time (mean ± σ):       1.9 ms ±   0.3 ms    [User: 0.2 ms, System: 1.4 ms]
  Range (min … max):     1.6 ms …   3.4 ms    1641 runs

$ hyperfine "qemu-arm ./examples/helloWorld.elf" -N --warmup 10
Benchmark 1: qemu-arm ./examples/helloWorld.elf
  Time (mean ± σ):      12.3 ms ±   1.5 ms    [User: 3.8 ms, System: 8.0 ms]
  Range (min … max):     8.8 ms …  19.8 ms    226 runs

  1. except clap, I don't want to parse cli flags for the 20th time this year

  2. afaik, please write me an email (contact at xnacly.me) if you found a bug