TL;DR: I built a tiny, zero dependency armv7 userspace emulator in Rust
I wrote a minimal viable armv7 emulator in 1.3k lines of Rust without any dependencies. It parses and validates a 32-bit arm binary, maps its segments, decodes a subset of arm instructions, translates between guest and host memory and forwards arm Linux syscalls as x86-64 System V syscalls.
It runs an armv7 hello world binary in 1.9ms (0.015ms for raw emulation without setup), while qemu takes 12.3ms. Compared to native armv7 execution, stinkarm is still ~100-1000x slower.
After reading about the process the Linux kernel performs to execute binaries,
I thought: I want to write an armv7 emulator - stinkarm. Mostly to understand the ELF
format, the encoding of arm 32bit instructions, the execution of arm assembly
and how it all fits together (this will help me with the JIT for my programming
language I am currently designing). To fully understand everything: no
dependencies. And of course Rust, since I already have enough C projects going
on.
So I wrote the smallest binary I could think of:

        .global _start  @ declare _start as a global
    _start:             @ _start is the de facto entry point
        mov r0, #161    @ first and only argument to the exit syscall
        mov r7, #1      @ syscall number 1 (exit)
        svc #0          @ trapping into the kernel (that's us, since we are translating)

To execute this arm assembly on my x86 system, I need to:
- Parse the ELF, validate it is armv7 and statically executable (I don’t want to write a dynamic dependency resolver and loader)
- Map the segments defined in ELF into the host memory, forward memory access
- Decode armv7 instructions and convert them into a nice Rust enum
- Emulate the CPU, its state and registers
- Execute the instructions and apply their effects to the CPU state
- Translate and forward syscalls
Sounds easy? It is!
Minimalist arm setup and smallest possible arm binary
Before I start parsing ELF I’ll need a binary to emulate, so let’s create a
build script called bld_exmpl (so I can write a lot less) and a nix flake, so
the asm is converted into armv7 machine code in an armv7 binary on my non armv7
system :^)

    // tools/bld_exmpl
    use clap::Parser;
    use std::fs;
    use std::path::Path;
    use std::process::Command;

    /// Build all ARM assembly examples into .elf binaries
    #[derive(Parser)]
    struct Args {
        /// Directory containing .S examples
        #[arg(long, default_value = "examples")]
        examples_dir: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let args = Args::parse();
        let dir = Path::new(&args.examples_dir);

        for entry in fs::read_dir(dir)? {
            let entry = entry?;
            let path = entry.path();
            if path.extension().and_then(|s| s.to_str()) == Some("S") {
                let name = path.file_stem().unwrap().to_str().unwrap();
                let output = dir.join(format!("{}.elf", name));
                build_asm(&path, &output)?;
            }
        }

        Ok(())
    }

    fn build_asm(input: &Path, output: &Path) -> Result<(), Box<dyn std::error::Error>> {
        println!("Building {} -> {}", input.display(), output.display());

        let obj_file = input.with_extension("o");

        let status = Command::new("arm-none-eabi-as")
            .arg("-march=armv7-a")
            .arg(input)
            .arg("-o")
            .arg(&obj_file)
            .status()?;

        if !status.success() {
            return Err(format!("Assembler failed for {}", input.display()).into());
        }

        let status = Command::new("arm-none-eabi-ld")
            .arg("-Ttext=0x8000")
            .arg(&obj_file)
            .arg("-o")
            .arg(output)
            .status()?;

        if !status.success() {
            return Err(format!("Linker failed for {}", output.display()).into());
        }

        Ok(fs::remove_file(obj_file)?)
    }

    # Cargo.toml
    [package]
    name = "stinkarm"
    version = "0.1.0"
    edition = "2024"
    default-run = "stinkarm"

    [dependencies]
    clap = { version = "4.5.51", features = ["derive"] }

    [[bin]]
    name = "stinkarm"
    path = "src/main.rs"

    [[bin]]
    name = "bld_exmpl"
    path = "tools/bld_exmpl.rs"

    {
      description = "stinkarm - ARMv7 userspace binary emulator for x86 linux systems";
      inputs = {
        nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
        flake-utils.url = "github:numtide/flake-utils";
      };
      outputs = { self, nixpkgs, flake-utils, ... }:
        flake-utils.lib.eachDefaultSystem (system:
          let
            pkgs = import nixpkgs { inherit system; };
          in {
            devShells.default = pkgs.mkShell {
              buildInputs = with pkgs; [
                gcc-arm-embedded
                binutils
                qemu
              ];
            };
          }
        );
    }

Parsing ELF
There are some resources for parsing ELF, two of which I used a whole lot:
- man elf (remember to export MANPAGER='nvim +Man!')
- gabi.xinuos.com
At a high level, ELF (32bit, for armv7) consists of headers and segments: it holds an ELF header and multiple program headers. The rest I don’t care about, since this emulator only supports static binaries and not dynamically linked ones.
Elf32_Ehdr
The ELF header is exactly 52 bytes long and holds all data I need to find the
program headers and whether I even want to emulate the binary I’m currently
parsing. These criteria are defined as members of the Identifier at the beginning
of the header.
In terms of byte layout:

    +------------------------+--------+--------+----------------+----------------+----------------+----------------+----------------+--------+---------+--------+---------+--------+--------+
    |       identifier       |  type  |machine |    version     |     entry      |     phoff      |     shoff      |     flags      | ehsize |phentsize| phnum  |shentsize| shnum  |shstrndx|
    |          16B           |   2B   |   2B   |       4B       |       4B       |       4B       |       4B       |       4B       |   2B   |   2B    |   2B   |   2B    |   2B   |   2B   |
    +------------------------+--------+--------+----------------+----------------+----------------+----------------+----------------+--------+---------+--------+---------+--------+--------+
               \|/
                |
                |
                v
    +----------------+------+------+-------+------+-----------+------------------------+
    |     magic      |class | data |version|os_abi|abi_version|          pad           |
    |       4B       |  1B  |  1B  |  1B   |  1B  |    1B     |           7B           |
    +----------------+------+------+-------+------+-----------+------------------------+

Most resources show C based examples, the Rust ports are below:

    /// Representing the ELF Object File Format header in memory, equivalent to Elf32_Ehdr in 2. ELF
    /// header in https://gabi.xinuos.com/elf/02-eheader.html
    ///
    /// Types are taken from https://gabi.xinuos.com/elf/01-intro.html#data-representation Table 1.1
    /// 32-Bit Data Types:
    ///
    /// | Elf32_ | Rust |
    /// | ------ | ---- |
    /// | Addr   | u32  |
    /// | Off    | u32  |
    /// | Half   | u16  |
    /// | Word   | u32  |
    /// | Sword  | i32  |
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct Header {
        /// initial bytes mark the file as an object file and provide machine-independent data with
        /// which to decode and interpret the file’s contents
        pub ident: Identifier,
        pub r#type: Type,
        pub machine: Machine,
        /// identifies the object file version, always EV_CURRENT (1)
        pub version: u32,
        /// the virtual address to which the system first transfers control, thus starting
        /// the process. If the file has no associated entry point, this member holds zero
        pub entry: u32,
        /// the program header table’s file offset in bytes. If the file has no program header table,
        /// this member holds zero
        pub phoff: u32,
        /// the section header table’s file offset in bytes. If the file has no section header table, this
        /// member holds zero
        pub shoff: u32,
        /// processor-specific flags associated with the file
        pub flags: u32,
        /// the ELF header’s size in bytes
        pub ehsize: u16,
        /// the size in bytes of one entry in the file’s program header table; all entries are the same
        /// size
        pub phentsize: u16,
        /// the number of entries in the program header table. Thus the product of e_phentsize and e_phnum
        /// gives the table’s size in bytes. If a file has no program header table, e_phnum holds the value
        /// zero
        pub phnum: u16,
        /// section header’s size in bytes. A section header is one entry in the section header table; all
        /// entries are the same size
        pub shentsize: u16,
        /// number of entries in the section header table. Thus the product of e_shentsize and e_shnum
        /// gives the section header table’s size in bytes. If a file has no section header table,
        /// e_shnum holds the value zero.
        pub shnum: u16,
        /// the section header table index of the entry associated with the section name string table.
        /// If the file has no section name string table, this member holds the value SHN_UNDEF
        pub shstrndx: u16,
    }

The identifier is 16 bytes long and holds the previously mentioned info I use to
check whether I want to emulate the binary, for instance the endianness and the
bit class. In the TryFrom implementation I strictly check what is parsed:

    /// 2.2 ELF Identification: https://gabi.xinuos.com/elf/02-eheader.html#elf-identification
    #[repr(C)]
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct Identifier {
        /// 0x7F, 'E', 'L', 'F'
        pub magic: [u8; 4],
        /// file class or capacity
        ///
        /// | Name          | Value | Meaning       |
        /// | ------------- | ----- | ------------- |
        /// | ELFCLASSNONE  | 0     | Invalid class |
        /// | ELFCLASS32    | 1     | 32-bit        |
        /// | ELFCLASS64    | 2     | 64-bit        |
        pub class: u8,
        /// data encoding, endian
        ///
        /// | Name         | Value |
        /// | ------------ | ----- |
        /// | ELFDATANONE  | 0     |
        /// | ELFDATA2LSB  | 1     |
        /// | ELFDATA2MSB  | 2     |
        pub data: u8,
        /// file version, always EV_CURRENT (1)
        pub version: u8,
        /// operating system identification
        ///
        /// - if no extensions are used: 0
        /// - meaning depends on e_machine
        pub os_abi: u8,
        /// value depends on os_abi
        pub abi_version: u8,
        // padding bytes (9-15)
        _pad: [u8; 7],
    }

    impl TryFrom<&[u8]> for Identifier {
        type Error = &'static str;

        fn try_from(bytes: &[u8]) -> Result<Self, Self::Error> {
            if bytes.len() < 16 {
                return Err("e_ident too short for ELF");
            }

            // I don't want to cast via unsafe as_ptr and as Header because the header could outlive the
            // source slice, thus we just do it the old plain indexing way
            let ident = Self {
                magic: bytes[0..4].try_into().unwrap(),
                class: bytes[4],
                data: bytes[5],
                version: bytes[6],
                os_abi: bytes[7],
                abi_version: bytes[8],
                _pad: bytes[9..16].try_into().unwrap(),
            };

            if ident.magic != [0x7f, b'E', b'L', b'F'] {
                return Err("Unexpected EI_MAG0 to EI_MAG3, wanted 0x7f E L F");
            }

            const ELFCLASS32: u8 = 1;
            const ELFDATA2LSB: u8 = 1;
            const EV_CURRENT: u8 = 1;

            if ident.version != EV_CURRENT {
                return Err("Unsupported EI_VERSION value");
            }

            // anything but ELFCLASS32 is rejected, not only ELFCLASS64
            if ident.class != ELFCLASS32 {
                return Err("Unexpected EI_CLASS, wanted ELFCLASS32 (ARMv7)");
            }

            // anything but little endian is rejected, not only ELFDATA2MSB
            if ident.data != ELFDATA2LSB {
                return Err("Unexpected EI_DATA, wanted ELFDATA2LSB (little endian)");
            }

            Ok(ident)
        }
    }

Type and Machine are just enums encoding meaning in the Rust type system:

    #[repr(u16)]
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Type {
        None = 0,
        Relocatable = 1,
        Executable = 2,
        SharedObject = 3,
        Core = 4,
        LoOs = 0xfe00,
        HiOs = 0xfeff,
        LoProc = 0xff00,
        HiProc = 0xffff,
    }

    impl TryFrom<u16> for Type {
        type Error = &'static str;

        fn try_from(value: u16) -> Result<Self, Self::Error> {
            match value {
                0 => Ok(Type::None),
                1 => Ok(Type::Relocatable),
                2 => Ok(Type::Executable),
                3 => Ok(Type::SharedObject),
                4 => Ok(Type::Core),
                0xfe00 => Ok(Type::LoOs),
                0xfeff => Ok(Type::HiOs),
                0xff00 => Ok(Type::LoProc),
                0xffff => Ok(Type::HiProc),
                _ => Err("Invalid u16 value for e_type"),
            }
        }
    }

    #[repr(u16)]
    #[allow(non_camel_case_types)]
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Machine {
        EM_ARM = 40,
    }

    impl TryFrom<u16> for Machine {
        type Error = &'static str;

        fn try_from(value: u16) -> Result<Self, Self::Error> {
            match value {
                40 => Ok(Machine::EM_ARM),
                _ => Err("Unsupported machine"),
            }
        }
    }

Since all of Header’s members implement TryFrom we can implement
TryFrom<&[u8]> for Header and propagate all occurring errors in member parsing
cleanly via ?:

    impl TryFrom<&[u8]> for Header {
        type Error = &'static str;

        fn try_from(b: &[u8]) -> Result<Self, Self::Error> {
            if b.len() < 52 {
                return Err("not enough bytes for Elf32_Ehdr (ELF header)");
            }

            let header = Self {
                ident: b[0..16].try_into()?,
                r#type: le16!(b[16..18]).try_into()?,
                machine: le16!(b[18..20]).try_into()?,
                version: le32!(b[20..24]),
                entry: le32!(b[24..28]),
                phoff: le32!(b[28..32]),
                shoff: le32!(b[32..36]),
                flags: le32!(b[36..40]),
                ehsize: le16!(b[40..42]),
                phentsize: le16!(b[42..44]),
                phnum: le16!(b[44..46]),
                shentsize: le16!(b[46..48]),
                shnum: le16!(b[48..50]),
                shstrndx: le16!(b[50..52]),
            };

            match header.r#type {
                Type::Executable => (),
                _ => {
                    return Err("Unsupported ELF type, only ET_EXEC (static executables) is supported");
                }
            }

            Ok(header)
        }
    }

The attentive reader will see me using le16! and le32! for parsing bytes
into unsigned integers of different widths (le is short for little endian):

    #[macro_export]
    macro_rules! le16 {
        ($bytes:expr) => {{
            let b: [u8; 2] = $bytes
                .try_into()
                .map_err(|_| "Failed to create u16 from 2*u8")?;
            u16::from_le_bytes(b)
        }};
    }

    #[macro_export]
    macro_rules! le32 {
        ($bytes:expr) => {{
            let b: [u8; 4] = $bytes
                .try_into()
                .map_err(|_| "Failed to create u32 from 4*u8")?;
            u32::from_le_bytes(b)
        }};
    }

Elf32_Phdr

    +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
    |      type      |     offset     |     vaddr      |     paddr      |     filesz     |     memsz      |     flags      |     align      |
    |       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |       4B       |
    +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

For me, the most important fields in Header are phoff and phentsize,
since we can use these to index into the binary to locate the program headers (Phdr).
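
As a small illustration of that indexing, the following sketch computes the file offset of every program header from phoff, phentsize and phnum. The function name `pheader_offsets` is made up for this example and is not part of stinkarm. Rejecting tables that run past the end of the file is an assumption about how strict one wants to be.

```rust
/// Hypothetical helper (not from stinkarm): yields the file offset of each
/// program header, or an error if the table does not fit into the file.
pub fn pheader_offsets(
    file_len: usize,
    phoff: u32,
    phentsize: u16,
    phnum: u16,
) -> Result<Vec<usize>, String> {
    // an Elf32_Phdr is 32 bytes, a smaller entry size cannot hold one
    if phnum > 0 && (phentsize as usize) < 32 {
        return Err(format!("e_phentsize {} is smaller than 32", phentsize));
    }
    let mut offsets = Vec::with_capacity(phnum as usize);
    for i in 0..phnum as usize {
        // entry i starts at e_phoff + i * e_phentsize
        let start = (phoff as usize)
            .checked_add(i * phentsize as usize)
            .ok_or("program header offset overflow")?;
        let end = start
            .checked_add(32)
            .ok_or("program header offset overflow")?;
        if end > file_len {
            return Err(format!(
                "program header {} ends at byte {}, but the file has only {} bytes",
                i, end, file_len
            ));
        }
        offsets.push(start);
    }
    Ok(offsets)
}
```

With phoff = 52, phentsize = 32 and phnum = 1 this yields the single offset 52, directly after the 52 byte ELF header.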

    /// Phdr, equivalent to Elf32_Phdr, see: https://gabi.xinuos.com/elf/07-pheader.html
    ///
    /// All of its members are u32, be it Elf32_Word, Elf32_Off or Elf32_Addr
    #[derive(Debug)]
    pub struct Pheader {
        pub r#type: Type,
        pub offset: u32,
        pub vaddr: u32,
        pub paddr: u32,
        pub filesz: u32,
        pub memsz: u32,
        pub flags: Flags,
        pub align: u32,
    }

    impl Pheader {
        /// extracts Pheader from raw, starting from offset
        pub fn from(raw: &[u8], offset: usize) -> Result<Self, String> {
            let end = offset.checked_add(32).ok_or("Offset overflow")?;
            if raw.len() < end {
                return Err("Not enough bytes to parse Elf32_Phdr, need at least 32".into());
            }

            let p_raw = &raw[offset..end];
            let r#type = p_raw[0..4].try_into()?;
            let flags = p_raw[24..28].try_into()?;
            let align = le32!(p_raw[28..32]);

            // p_align of 0 or 1 means no alignment, everything else has to be a power of two
            if align > 1 && !align.is_power_of_two() {
                return Err(format!("Invalid p_align: {}", align));
            }

            Ok(Self {
                r#type,
                offset: le32!(p_raw[4..8]),
                vaddr: le32!(p_raw[8..12]),
                paddr: le32!(p_raw[12..16]),
                filesz: le32!(p_raw[16..20]),
                memsz: le32!(p_raw[20..24]),
                flags,
                align,
            })
        }
    }

Type holds info about what type of segment the header defines:

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    #[repr(C)]
    pub enum Type {
        NULL = 0,
        LOAD = 1,
        DYNAMIC = 2,
        INTERP = 3,
        NOTE = 4,
        SHLIB = 5,
        PHDR = 6,
        TLS = 7,
        LOOS = 0x60000000,
        HIOS = 0x6fffffff,
        LOPROC = 0x70000000,
        HIPROC = 0x7fffffff,
    }

Flags defines the
permission flags the segment should have once it is dumped into memory:

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    #[repr(transparent)]
    pub struct Flags(u32);

    impl Flags {
        pub const NONE: Self = Flags(0x0);
        pub const X: Self = Flags(0x1);
        pub const W: Self = Flags(0x2);
        pub const R: Self = Flags(0x4);
    }

Full ELF parsing
Putting Elf32_Ehdr and Elf32_Phdr parsing together:

    /// Representing an ELF32 binary in memory
    ///
    /// This does not include section headers (Elf32_Shdr), but only program headers (Elf32_Phdr), see either `man elf` and/or https://gabi.xinuos.com/elf/03-sheader.html
    #[derive(Debug)]
    pub struct Elf {
        pub header: header::Header,
        pub pheaders: Vec<pheader::Pheader>,
    }

    impl TryFrom<&[u8]> for Elf {
        type Error = String;

        fn try_from(b: &[u8]) -> Result<Self, String> {
            let header = header::Header::try_from(b).map_err(|e| e.to_string())?;

            let mut pheaders = Vec::with_capacity(header.phnum as usize);
            for i in 0..header.phnum {
                let offset = header.phoff as usize + i as usize * header.phentsize as usize;
                let ph = pheader::Pheader::from(b, offset)?;
                pheaders.push(ph);
            }

            Ok(Elf { header, pheaders })
        }
    }

The equivalent to readelf -l:

    Elf {
        header: Header {
            ident: Identifier {
                magic: [127, 69, 76, 70],
                class: 1,
                data: 1,
                version: 1,
                os_abi: 0,
                abi_version: 0,
                _pad: [0, 0, 0, 0, 0, 0, 0]
            },
            type: Executable,
            machine: EM_ARM,
            version: 1,
            entry: 32768,
            phoff: 52,
            shoff: 4572,
            flags: 83886592,
            ehsize: 52,
            phentsize: 32,
            phnum: 1,
            shentsize: 40,
            shnum: 8,
            shstrndx: 7
        },
        pheaders: [
            Pheader {
                type: LOAD,
                offset: 4096,
                vaddr: 32768,
                paddr: 32768,
                filesz: 12,
                memsz: 12,
                flags: Flags(5),
                align: 4096
            }
        ]
    }

Or in the debug output of stinkarm:

    [ 0.613ms] opening binary "examples/asm.elf"
    [ 0.721ms] parsing ELF...
    [ 0.744ms] \
    ELF Header:
      Magic:              [7f, 45, 4c, 46]
      Class:              ELF32
      Data:               Little endian
      Type:               Executable
      Machine:            EM_ARM
      Version:            1
      Entry point:        0x8000
      Program hdr offset: 52 (32 bytes each)
      Section hdr offset: 4572
      Flags:              0x05000200
      EH size:            52
      # Program headers:  1
      # Section headers:  8
      Str tbl index:      7

    Program Headers:
      Type  Offset   VirtAddr   PhysAddr   FileSz   MemSz    Flags Align
      LOAD  0x001000 0x00008000 0x00008000 0x00000c 0x00000c R|X   0x1000

Dumping ELF segments into memory
Since the only reason for parsing the ELF headers is to know where to put which
segment with which permissions, I want to quickly interject on why we have to
put said segments at these specific addresses. The main reason is that the linker
resolved all pointers, offsets and pc relative targets under the assumption that
each segment is loaded at its p_vaddr, here 0x8000 (which is also
Elf32_Ehdr.entry). All instruction arguments were generated according to this value.
Before mapping each segment at its Pheader::vaddr, we have to understand:
one doesn’t simply mmap with MAP_FIXED or MAP_FIXED_NOREPLACE into the virtual
address 0x8000. The Linux kernel won’t let us, and rightfully so. man mmap
says:
If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there.
And /proc/sys/vm/mmap_min_addr on my system is u16::MAX, i.e. (2^16)-1 = 65535. So
mapping our segment to 0x8000 (32768) is not allowed:

    let segment = sys::mmap::mmap(
        // this is only UB if dereferenced, it's just a hint, so it's safe here
        Some(unsafe { std::ptr::NonNull::new_unchecked(0x8000 as *mut u8) }),
        4096,
        sys::mmap::MmapProt::WRITE,
        sys::mmap::MmapFlags::ANONYMOUS
            | sys::mmap::MmapFlags::PRIVATE
            | sys::mmap::MmapFlags::NOREPLACE,
        -1,
        0,
    )
    .unwrap();

Running the above with our vaddr of 0x8000 results in:

    thread 'main' panicked at src/main.rs:33:6:
    called `Result::unwrap()` on an `Err` value: "mmap failed (errno 1): Operation not permitted
    (os error 1)"

It only works with elevated privileges, which is something I don’t want to run my emulator with.
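
To check this limit up front instead of waiting for mmap to fail, one could read /proc/sys/vm/mmap_min_addr and compare it with the segment's vaddr. The helpers below are only a sketch: their names are invented for this post and stinkarm does not contain them.

```rust
use std::fs;

/// Parses the content of /proc/sys/vm/mmap_min_addr (a decimal number
/// followed by a newline). Hypothetical helper, not part of stinkarm.
pub fn parse_min_addr(content: &str) -> Option<u64> {
    content.trim().parse().ok()
}

/// Reads the limit from procfs; None if the file is missing or unreadable.
pub fn mmap_min_addr() -> Option<u64> {
    parse_min_addr(&fs::read_to_string("/proc/sys/vm/mmap_min_addr").ok()?)
}

/// An unprivileged process can only create mappings at or above the limit.
pub fn mappable_unprivileged(guest_vaddr: u32, min_addr: u64) -> bool {
    u64::from(guest_vaddr) >= min_addr
}
```

With the limit of 65535 from above, `mappable_unprivileged(0x8000, 65535)` is false, which matches the EPERM returned by the kernel.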
Translating guest memory access to host memory access
The obvious fix is to not mmap below u16::MAX and let the kernel choose where
we dump our segment:

    let segment = sys::mmap::mmap(
        None,
        4096,
        MmapProt::WRITE,
        MmapFlags::ANONYMOUS | MmapFlags::PRIVATE,
        -1,
        0,
    ).unwrap();

But this means the segment of the process to emulate is not at 0x8000, but
anywhere the kernel allows. So we need to add a translation layer between guest
and host memory (if you’re familiar with how virtual memory works, it’s similar,
just with one more indirection):

    +--guest--+
    | 0x8000  | ------------+
    +---------+             |
                            |
                     Mem::translate
                            |
    +------host------+      |
    | 0x7f5b4b8f8000 | <----+
    +----------------+

Putting this into Rust:
- map_region registers a region of memory and allows Mem to take ownership, so it can call munmap on these segments once it goes out of scope
- translate takes a guest addr and translates it to a host addr

    use std::collections::BTreeMap;

    struct MappedSegment {
        host_ptr: *mut u8,
        len: u32,
    }

    pub struct Mem {
        maps: BTreeMap<u32, MappedSegment>,
    }

    impl Mem {
        pub fn map_region(&mut self, guest_addr: u32, len: u32, host_ptr: *mut u8) {
            self.maps
                .insert(guest_addr, MappedSegment { host_ptr, len });
        }

        /// translate a guest addr to a host addr we can write and read from
        pub fn translate(&self, guest_addr: u32) -> Option<*mut u8> {
            // Find the greatest key <= guest_addr.
            let (&base, seg) = self.maps.range(..=guest_addr).next_back()?;
            // base <= guest_addr, so this can not underflow (and base + len can not wrap)
            let offset = guest_addr - base;
            if offset < seg.len {
                Some(unsafe { seg.host_ptr.add(offset as usize) })
            } else {
                None
            }
        }

        pub fn read_u32(&self, guest_addr: u32) -> Option<u32> {
            let first = self.translate(guest_addr)?;
            // the last byte of the word has to be mapped directly behind the first one
            let last = self.translate(guest_addr.checked_add(3)?)?;
            if last != first.wrapping_add(3) {
                return None;
            }
            // the guest address may be unaligned, so no plain dereference
            Some(u32::from_le(unsafe { first.cast::<u32>().read_unaligned() }))
        }
    }

This fix has the added benefit of allowing us to sandbox guest memory fully, so we can validate each memory access before we allow a guest to host memory interaction.
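
One way to do that validation is to check the whole access range, not only its first byte, against the segment it falls into. The sketch below models guest memory with plain `Vec<u8>` buffers instead of mmap-ed pointers, so it runs without unsafe. `SafeMem` and its methods are hypothetical names and not stinkarm's API.

```rust
use std::collections::BTreeMap;

/// Guest memory model backed by owned buffers, keyed by guest base address.
/// Illustrative only: stinkarm stores raw host pointers instead.
#[derive(Default)]
pub struct SafeMem {
    maps: BTreeMap<u32, Vec<u8>>,
}

impl SafeMem {
    pub fn map_region(&mut self, guest_addr: u32, bytes: Vec<u8>) {
        self.maps.insert(guest_addr, bytes);
    }

    /// Returns `len` bytes starting at `guest_addr`, or None if any byte of
    /// the range lies outside the segment containing `guest_addr`.
    pub fn slice(&self, guest_addr: u32, len: usize) -> Option<&[u8]> {
        // greatest base <= guest_addr
        let (&base, seg) = self.maps.range(..=guest_addr).next_back()?;
        let start = (guest_addr - base) as usize;
        let end = start.checked_add(len)?;
        // `get` returns None instead of panicking when the range is out of bounds
        seg.get(start..end)
    }

    /// Reads a little endian word, alignment does not matter here.
    pub fn read_u32(&self, guest_addr: u32) -> Option<u32> {
        let b = self.slice(guest_addr, 4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}
```

Every access that is not fully covered by a registered segment comes back as None, which the emulator could turn into a guest segfault instead of touching host memory.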
Mapping segments with their permissions
The basic idea is similar to the way a JIT compiler works:
- create an mmap section with W permissions
- write bytes from elf into section
- zero rest of defined size
- change permission of section with mprotect to the permissions defined in the Pheader
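
The mapping has to cover whole pages, so the start of the segment is rounded down and its end rounded up to the alignment. A minimal sketch of that rounding, assuming the alignment is a power of two (`page_span` is a hypothetical name, not a stinkarm function):

```rust
/// Rounds [vaddr, vaddr + memsz) outwards to multiples of `align`.
/// Returns (start, end, len), or None on overflow or an invalid alignment.
pub fn page_span(vaddr: u32, memsz: u32, align: u32) -> Option<(u32, u32, u32)> {
    if !align.is_power_of_two() {
        return None;
    }
    let mask = align - 1;
    // round down: clear the low bits
    let start = vaddr & !mask;
    // round up: add align - 1, then clear the low bits
    let end = vaddr.checked_add(memsz)?.checked_add(mask)? & !mask;
    Some((start, end, end - start))
}
```

For the exit binary (vaddr 0x8000, memsz 12, align 0x1000) this gives the span 0x8000..0x9000, so one 4096 byte page is mapped for 12 bytes of code.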

    /// mapping applies the configuration of self to the current memory context by creating the
    /// segments with the corresponding permission bits, vaddr, etc
    pub fn map(&self, raw: &[u8], guest_mem: &mut mem::Mem) -> Result<(), String> {
        // zero memory needed case, no clue if this actually ever happens, but we support it
        if self.memsz == 0 {
            return Ok(());
        }

        if self.vaddr == 0 {
            return Err("program header has a zero virtual address".into());
        }

        if self.filesz > self.memsz {
            return Err("p_filesz is larger than p_memsz".into());
        }

        // we need page alignment, so either Elf32_Phdr.p_align or 4096
        let (start, _end, len) = self.alignments();

        // Instead of mapping at the guest vaddr (Linux doesn't allow for low addresses),
        // we allocate memory wherever the host kernel gives us.
        // This keeps guest memory sandboxed: guest addr != host addr.
        let segment = mem::mmap::mmap(
            None,
            len as usize,
            MmapProt::WRITE,
            MmapFlags::ANONYMOUS | MmapFlags::PRIVATE,
            -1,
            0,
        )?;

        let segment_ptr = segment.as_ptr();
        let segment_slice = unsafe { std::slice::from_raw_parts_mut(segment_ptr, len as usize) };

        // a malformed binary could point outside of the file, so no unchecked indexing
        let file_end = self
            .offset
            .checked_add(self.filesz)
            .ok_or("p_offset + p_filesz overflows")?;
        let file_slice: &[u8] = raw
            .get(self.offset as usize..file_end as usize)
            .ok_or("segment lies outside of the file")?;

        // compute offset inside the mmapped slice where the segment should start
        let offset = (self.vaddr - start) as usize;

        // copy the segment contents to the mmapped segment
        segment_slice[offset..offset + file_slice.len()].copy_from_slice(file_slice);

        // we need to zero the remaining bytes
        if self.memsz > self.filesz {
            segment_slice[offset + file_slice.len()..offset + self.memsz as usize].fill(0);
        }

        // record mapping in guest memory table, so CPU can translate guest vaddr to host pointer;
        // the host mapping begins at the page aligned start, not at vaddr
        guest_mem.map_region(start, len, segment_ptr);

        // we change the permissions for our segment from W to the segment's requested bits
        mem::mmap::mprotect(segment, len as usize, self.flags.into())
    }

    /// returns (start, end, len)
    fn alignments(&self) -> (u32, u32, u32) {
        // we need page alignment, so either Elf32_Phdr.p_align or 4096
        let align = match self.align {
            0 => 0x1000,
            _ => self.align,
        };
        let start = self.vaddr & !(align - 1);
        let end = (self.vaddr.wrapping_add(self.memsz).wrapping_add(align) - 1) & !(align - 1);
        let len = end - start;
        (start, end, len)
    }

Map is called in the emulator’s entry point:

    let elf: elf::Elf = (&buf as &[u8]).try_into().expect("Failed to parse binary");
    let mut mem = mem::Mem::new();
    for phdr in elf.pheaders {
        if phdr.r#type == elf::pheader::Type::LOAD {
            phdr.map(&buf, &mut mem)
                .expect("Mapping program header failed");
        }
    }

Decoding armv7
We can now request a word (32bit) from our LOAD segment which contains
the .text section bytes one can inspect via objdump:

    $ arm-none-eabi-objdump -d examples/exit.elf

    examples/exit.elf: file format elf32-littlearm


    Disassembly of section .text:

    00008000 <_start>:
        8000: e3a000a1 mov r0, #161 @ 0xa1
        8004: e3a07001 mov r7, #1
        8008: ef000000 svc 0x00000000

So we use Mem::read_u32(0x8000) and get 0xe3a000a1.
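
The byte order matters here: the file stores that word as the bytes a1 00 a0 e3, and only reading them as a little endian u32 gives the value objdump prints. A tiny sketch of this (`word_at` is a made-up helper, not from stinkarm):

```rust
/// Reads a little endian u32 at `offset` from a byte slice, if in bounds.
pub fn word_at(bytes: &[u8], offset: usize) -> Option<u32> {
    let b = bytes.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}
```

Feeding it the 12 bytes of the .text section shown above returns the three instruction words at offsets 0, 4 and 8.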
Decoding armv7 instructions seems doable at a glance, but it is a deeper rabbit hole than I expected. Prepare for a section heavy on bit shifting, implicit behaviour and intertwined meaning:
Instructions are more or less grouped into four groups:
- Branch and control
- Data processing
- Load and store
- Other (syscalls & stuff)
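Each armv7 instruction is 32 bit in size. The layout depends on the instruction class; for data processing instructions (like mov) it is as follows:

    +--------+-----+---+--------+---+------+------+--------------+
    |  cond  | 0 0 | I | opcode | S |  Rn  |  Rd  |   operand2   |
    | 31..28 |27,26| 25| 24..21 | 20|19..16|15..12|    11..0     |
    |   4b   | 2b  | 1b|   4b   | 1b|  4b  |  4b  |     12b      |
    +--------+-----+---+--------+---+------+------+--------------+

| bit range | name | description |
|---|---|---|
| 31..28 | cond | condition (EQ, NE, etc) deciding whether the instruction is executed |
| 27..26 | class | 0b00 for data processing |
| 25 | I | operand2 is an immediate (1) or a shifted register (0) |
| 24..21 | opcode | for instance 0b1101 for mov |
| 20 | S | update the condition flags in cpsr |
| 19..16 | rn | first operand register |
| 15..12 | rd | destination register |
| 11..0 | operand2 | rotated immediate value or shifted register |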
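
To make the layout concrete, the sketch below pulls the fields out of 0xe3a000a1 (mov r0, #161), using the shifts and masks the decoder further down relies on (cond in bits 31..28, I in bit 25, opcode in bits 24..21, rd in bits 15..12, operand2 in bits 11..0). The `Fields` struct and `split` function are hypothetical and only exist for this example.

```rust
/// Raw fields of a data processing instruction (illustrative, not stinkarm's type).
#[derive(Debug, PartialEq, Eq)]
pub struct Fields {
    pub cond: u8,      // bits 31..28
    pub i_bit: bool,   // bit 25
    pub opcode: u8,    // bits 24..21
    pub s_bit: bool,   // bit 20
    pub rn: u8,        // bits 19..16
    pub rd: u8,        // bits 15..12
    pub operand2: u16, // bits 11..0
}

/// Splits a word into its fields without interpreting them.
pub fn split(word: u32) -> Fields {
    Fields {
        cond: ((word >> 28) & 0xF) as u8,
        i_bit: (word >> 25) & 1 == 1,
        opcode: ((word >> 21) & 0xF) as u8,
        s_bit: (word >> 20) & 1 == 1,
        rn: ((word >> 16) & 0xF) as u8,
        rd: ((word >> 12) & 0xF) as u8,
        operand2: (word & 0xFFF) as u16,
    }
}
```

For 0xe3a000a1 this yields cond 0b1110 (always), the I bit set, opcode 0b1101 (mov), rd 0 and an operand2 of 0x0a1, which is 161.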
Rust representation
Since cond decides whether or not the instruction is
executed, I decided on the following struct to be the decoded
instruction:

    #[derive(Debug, Copy, Clone)]
    pub struct InstructionContainer {
        pub cond: u8,
        pub instruction: Instruction,
    }

    #[derive(Debug, Copy, Clone)]
    pub enum Instruction {
        MovImm { rd: u8, rhs: u32 },
        Svc,
        LdrLiteral { rd: u8, addr: u32 },
        Unknown(u32),
    }

These 4 instructions are enough to support both the minimal binary at the intro and the asm hello world:

        .global _start
    _start:
        mov r0, #161
        mov r7, #1
        svc #0

        .section .rodata
    msg:
        .asciz "Hello, world!\n"

        .section .text
        .global _start
    _start:
        ldr r0, =1
        ldr r1, =msg
        mov r2, #14
        mov r7, #4
        svc #0

        mov r0, #0
        mov r7, #1
        svc #0

General instruction detection
Our decoder is a function accepting a word and the program counter (we need
this later for decoding the offset for ldr) and returning the
aforementioned instruction container:

    pub fn decode_word(word: u32, caddr: u32) -> InstructionContainer

The top 4 bits (31..28) of the word are the condition, so I can extract these first. I also take the next 3 bits (27..25) to identify the instruction class (load and store, branch or data processing immediate):

    // ...
    let cond = ((word >> 28) & 0xF) as u8;
    let top = ((word >> 25) & 0x7) as u8;

Immediate mov
Since there are immediate moves and non immediate moves, both 0b000 and
0b001 are valid top values we want to support.

    // ...
    if top == 0b000 || top == 0b001 {
        let i_bit = ((word >> 25) & 0x1) != 0;
        let opcode = ((word >> 21) & 0xF) as u8;
        if i_bit {
            // ...
        }
    }

If the i bit is set, we can convert the opcode from its bits into something I can read a lot better:

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    #[repr(u8)]
    enum Op {
        // ...
        Mov = 0b1101,
    }

    static OP_TABLE: [Op; 16] = [
        // ...
        Op::Mov,
    ];

    #[inline(always)]
    fn op_from_bits(bits: u8) -> Op {
        debug_assert!(bits <= 0b1111);
        unsafe { *OP_TABLE.get_unchecked(bits as usize) }
    }

We can now plug this in, match on the only ddi (data processing immediate) we know and extract both the destination register (rd) and the raw immediate value:

    if top == 0b000 || top == 0b001 {
        // Data-processing (top is 0b000 for register operands and 0b001 for
        // immediates, since its lowest bit is the I bit)
        let i_bit = ((word >> 25) & 0x1) != 0;
        let opcode = ((word >> 21) & 0xF) as u8;
        if i_bit {
            match op_from_bits(opcode) {
                Op::Mov => {
                    let rd = ((word >> 12) & 0xF) as u8;
                    let imm12 = word & 0xFFF;
                    // ...
                }
                _ => todo!(),
            }
        }
    }

From the examples before one can see the immediate value is prefixed with
#. To move the value 161 into r0 we do:

    mov r0, #161

Since only 12 bits are available for the immediate, the arm engineers split them into an 8 bit value (bits 7..0) and a 4 bit rotation (bits 11..8). The value is rotated right by twice the rotation field:

    #[inline(always)]
    fn decode_rotated_imm(imm12: u32) -> u32 {
        let rotate = ((imm12 >> 8) & 0b1111) * 2;
        (imm12 & 0xff).rotate_right(rotate)
    }

Plugging this back in results in us being able to fully decode mov r0, #161:

    if top == 0b000 || top == 0b001 {
        let i_bit = ((word >> 25) & 0x1) != 0;
        let opcode = ((word >> 21) & 0xF) as u8;
        if i_bit {
            match op_from_bits(opcode) {
                Op::Mov => {
                    let rd = ((word >> 12) & 0xF) as u8;
                    let imm12 = word & 0xFFF;
                    let rhs = decode_rotated_imm(imm12);
                    return InstructionContainer {
                        cond,
                        instruction: Instruction::MovImm { rd, rhs },
                    };
                }
                _ => todo!(),
            }
        }
    }

As seen when dbg!-ing the cpu steps:

    [src/cpu/mod.rs:114:13] decoder::decode_word(word, self.pc()) =
    InstructionContainer {
        cond: 14,
        instruction: MovImm {
            rd: 0,
            rhs: 161,
        },
    }

Load and Store
ldr is part of the load and store instruction group and is needed for
accessing "Hello, world!" in .rodata and putting a pointer to it
into a register.
In comparison to immediate mov we have to do a little trick, since we only want to match load and store instructions with:
- single register modification
- load and store with immediate
So we only decode:

    LDR Rd, [Rn, #imm]
    LDR Rd, [Rn], #imm
    @ etc

Thus we match with (top >> 1) & 0b11 == 0b01 and start extracting a
whole bucketload of bit flags:

    if (top >> 1) & 0b11 == 0b01 {
        let p = ((word >> 24) & 1) != 0;
        let u = ((word >> 23) & 1) != 0;
        let b = ((word >> 22) & 1) != 0;
        let w = ((word >> 21) & 1) != 0;
        let l = ((word >> 20) & 1) != 0;
        let rn = ((word >> 16) & 0xF) as u8;
        let rd = ((word >> 12) & 0xF) as u8;
        let imm12 = word & 0xFFF;

        // Literal-pool version
        if l && rn == 0b1111 && p && u && !w && !b {
            // in arm state, reading the pc yields the current instruction + 8
            let pc_seen = caddr.wrapping_add(8);
            let literal_addr = pc_seen.wrapping_add(imm12);

            return InstructionContainer {
                cond,
                instruction: Instruction::LdrLiteral {
                    rd,
                    addr: literal_addr,
                },
            };
        }

        todo!("only LDR with p&u&!w&!b is implemented")
    }

| bit | description |
|---|---|
| p | pre-indexed addressing, offset applied before the access (1) or after it (0) |
| u | add (1) vs subtract (0) offset |
| b | word (0) or byte (1) sized access |
| w | write back to base register (1) or not (0) |
| l | load (1), or store (0) |
ldr Rd, <label> matches exactly: load, base register is PC (rn==0b1111), pre-indexed
addressing, added offset, no write back and no byte sized access (l && rn == 0b1111 && p && u && !w && !b).
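
A worked example: assume the `ldr r1, =msg` of the hello world binary sits at 0x8004 and was assembled to 0xe59f1014. That word is constructed here for illustration and not copied from the binary. The literal address is then 0x8004 + 8 + 0x14 = 0x8020 (32800). The function name below is hypothetical, and unlike the decoder above it also requires bit 25 to be clear (immediate offset form).

```rust
/// Address of the literal pool entry for `ldr rd, [pc, #imm12]`, or None if
/// `word` is not that exact form. `caddr` is the address of the instruction
/// itself; in arm state the pc reads as caddr + 8.
pub fn ldr_literal_addr(word: u32, caddr: u32) -> Option<u32> {
    let class = (word >> 25) & 0b111; // 0b010: load/store with immediate offset
    let p = (word >> 24) & 1 == 1; // pre-indexed
    let u = (word >> 23) & 1 == 1; // add offset
    let b = (word >> 22) & 1 == 1; // byte access
    let w = (word >> 21) & 1 == 1; // write back
    let l = (word >> 20) & 1 == 1; // load
    let rn = (word >> 16) & 0xF; // base register
    let imm12 = word & 0xFFF;

    if class == 0b010 && l && rn == 0b1111 && p && u && !w && !b {
        Some(caddr.wrapping_add(8).wrapping_add(imm12))
    } else {
        None
    }
}
```

The word stored at that address is itself an address (that of msg), so executing the instruction needs one more guest memory read before the value lands in the register.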
Syscalls
Syscalls are the only way to interact with the Linux kernel (as far as I
know), so we definitely need to implement both decoding and forwarding.
Bits 27-24 are 1111 for system calls. The immediate value is
irrelevant for us, since the Linux syscall handler discards
the value either way:

    if ((word >> 24) & 0xF) as u8 == 0b1111 {
        return InstructionContainer {
            cond,
            // technically arm says svc has a 24bit immediate but we don't care about it, since the
            // Linux kernel also doesn't
            instruction: Instruction::Svc,
        };
    }

We can now fully decode all instructions for both the simple exit and the more advanced hello world binary:

    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 161, }
    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 1, }
    [src/cpu/mod.rs:121:15] instruction = Svc

    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 1, }
    [src/cpu/mod.rs:121:15] instruction = LdrLiteral { rd: 1, addr: 32800, }
    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 2, rhs: 14, }
    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 4, }
    [src/cpu/mod.rs:121:15] instruction = Svc
    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 0, rhs: 0, }
    [src/cpu/mod.rs:121:15] instruction = MovImm { rd: 7, rhs: 1, }
    [src/cpu/mod.rs:121:15] instruction = Svc

Emulating the CPU
This is by FAR the easiest part. I only struggled with the double
indirection for ldr (I simply didn’t know about it), but one problem
at a time :^).

    pub struct Cpu<'cpu> {
        /// r0-r15 (r13=SP, r14=LR, r15=PC)
        pub r: [u32; 16],
        pub cpsr: u32,
        pub mem: &'cpu mut mem::Mem,
        /// only set by ArmSyscall::exit to propagate exit code to the host
        pub status: Option<i32>,
    }

    impl<'cpu> Cpu<'cpu> {
        pub fn new(mem: &'cpu mut mem::Mem, pc: u32) -> Self {
            let mut s = Self {
                r: [0; 16],
                cpsr: 0x60000010,
                mem,
                status: None,
            };
            s.r[15] = pc;
            s
        }
    }

Instantiating the cpu:

    let mut cpu = cpu::Cpu::new(&mut mem, elf.header.entry);

Conditional Instructions?
When writing the decoder I was confused by the 4 condition bits. I always thought one does conditional execution by using a branch to jump over instructions that shouldn't be executed. That was before I learned that arm supports both ways (the armv7 reference says this feature should only be used if there aren't multiple instructions depending on the same condition, otherwise one should use branches), so I need to support this too:

    impl<'cpu> Cpu<'cpu> {
        #[inline(always)]
        fn cond_passes(&self, cond: u8) -> bool {
            match cond {
                0x0 => (self.cpsr >> 30) & 1 == 1, // EQ: Z == 1
                0x1 => (self.cpsr >> 30) & 1 == 0, // NE: Z == 0
                0xE => true,                       // AL (always)
                0xF => false,                      // NV (never)
                _ => false,                        // strict false
            }
        }
    }

Instruction dispatch
After implementing the necessary checks and setup for emulating the cpu, the CPU can now check if an instruction is to be executed, match on the decoded instruction and run the associated logic:

    impl<'cpu> Cpu<'cpu> {
        #[inline(always)]
        fn pc(&self) -> u32 {
            self.r[15] & !0b11
        }

        /// moves pc forward a word
        #[inline(always)]
        fn advance(&mut self) {
            self.r[15] = self.r[15].wrapping_add(4);
        }

        pub fn step(&mut self) -> Result<bool, err::Err> {
            let Some(word) = self.mem.read_u32(self.pc()) else {
                return Ok(false);
            };

            if word == 0 {
                // zero instruction means we hit zeroed out rest of the page
                return Ok(false);
            }

            let InstructionContainer { instruction, cond } = decoder::decode_word(word, self.pc());

            if !self.cond_passes(cond) {
                self.advance();
                return Ok(true);
            }

            match instruction {
                decoder::Instruction::MovImm { rd, rhs } => {
                    self.r[rd as usize] = rhs;
                }
                decoder::Instruction::Unknown(w) => {
                    return Err(err::Err::UnknownOrUnsupportedInstruction(w));
                }
                i => {
                    stinkln!(
                        "found unimplemented instruction, exiting: {:#x}:={:?}",
                        word,
                        i
                    );
                    self.status = Some(1);
                }
            }

            self.advance();

            Ok(true)
        }
    }

LDR and addresses in literal pools
While the section "Translating guest memory access to host memory access"
goes into depth on translating / forwarding guest memory accesses to host
memory addresses, this chapter focuses on the layout of literals in armv7
and how ldr indirects memory access.
Let's first take a look at the ldr instruction of our hello world example:

        .section .rodata
        @ define a string with the `msg` label
    msg:
        @ asciz is like ascii but zero terminated
        .asciz "Hello, world!\n"

        .section .text
        .global _start
    _start:
        @ load the literal pool addr of msg into r1
        ldr r1, =msg

The as documentation says:

LDR

    ldr <register>, = <expression>

If expression evaluates to a numeric constant then a MOV or MVN instruction will be used in place of the LDR instruction, if the constant can be generated by either of these instructions. Otherwise the constant will be placed into the nearest literal pool (if it not already there) and a PC relative LDR instruction will be generated.
Now this may not make sense at first glance: why would =msg be assembled
into an address of the address of the literal? An armv7 instruction is only
32 bits wide, so it can not carry a full 32-bit address next to its opcode and
registers; an immediate operand is restricted to an 8-bit value rotated right
by an even number of bits. The argument of the ldr instruction therefore points
to a literal pool entry. This entry is a 32-bit value and reading it produces
the actual address of msg.
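The "8-bit value rotated right by an even number of bits" rule can be checked mechanically. The following is a minimal sketch, not stinkarm code; `encode_arm_immediate` and `decode_arm_immediate` are hypothetical names. It tries all 16 even rotations and reports whether a constant fits into a MOV immediate, which is the decision the assembler has to make before it falls back to a literal pool entry.

```rust
/// Tries to express `value` as an 8-bit constant rotated right by `2 * rot` bits.
/// Returns (imm8, rot) for the first rotation that works, or None if the value
/// has to live in a literal pool instead.
pub fn encode_arm_immediate(value: u32) -> Option<(u8, u8)> {
    for rot in 0..16u32 {
        // value == imm8.rotate_right(2 * rot)  <=>  imm8 == value.rotate_left(2 * rot)
        let imm8 = value.rotate_left(2 * rot);
        if imm8 <= 0xFF {
            return Some((imm8 as u8, rot as u8));
        }
    }
    None
}

/// Inverse direction: expands an (imm8, rot) pair back into the 32-bit constant.
pub fn decode_arm_immediate(imm8: u8, rot: u8) -> u32 {
    u32::from(imm8).rotate_right(2 * u32::from(rot & 0xF))
}
```

With this check, 161 (the exit code from the first example) is encodable without rotation, while 0x8024, the address of msg in the hello world binary, is not, so it ends up in the literal pool.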
When decoding we can see ldr points to a memory address (32800 or 0x8020) in
the segment we mmapped earlier:

    [src/cpu/mod.rs:121:15] instruction = LdrLiteral { rd: 1, addr: 32800 }

Before accessing guest memory, we must translate said addr to a host addr:

    +--ldr.addr--+
    |   0x8020   |
    +------------+
           |
           |             +-------------Mem::read_u32(addr)-------------+
           |             |                                             |
           |             |  +--guest--+                                |
           |             |  | 0x8020  | ------------+                  |
           |             |  +---------+             |                  |
           |             |                          |                  |
           +-----------> |                    Mem::translate           |
                         |                          |                  |
                         |  +------host------+      |                  |
                         |  | 0x7ffff7f87020 | <----+                  |
                         |  +----------------+                         |
                         |                                             |
                         +---------------------------------------------+
                                               |
    +--literal-ptr--+                          |
    |    0x8024     | <------------------------+
    +---------------+

Or in code:

    impl<'cpu> Cpu<'cpu> {
        pub fn step(&mut self) -> Result<bool, err::Err> {
            // ...
            match instruction {
                decoder::Instruction::LdrLiteral { rd, addr } => {
                    self.r[rd as usize] = self.mem.read_u32(addr).expect("Segfault");
                }
                // ...
            }
            // ...
        }
    }

Any other instruction using an addr also has to go through the
Mem::translate indirection.
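The double indirection is easier to see with a toy memory. The sketch below is not stinkarm's Mem: `ToyMem` and `demo_mem` are hypothetical, and the single guest segment is backed by a Vec instead of an mmapped region. The layout (32 bytes of code, the literal pool entry at 0x8020, the string at 0x8024) is assumed from the decoder output and the program header shown in this article.

```rust
/// Toy stand-in for the emulator memory: one guest segment backed by a Vec.
pub struct ToyMem {
    guest_base: u32,
    host: Vec<u8>,
}

impl ToyMem {
    pub fn new(guest_base: u32, host: Vec<u8>) -> Self {
        Self { guest_base, host }
    }

    /// Maps a guest address to an index into the host buffer.
    /// None means the access [guest, guest + len) leaves the mapped segment.
    pub fn translate(&self, guest: u32, len: u32) -> Option<usize> {
        let off = guest.checked_sub(self.guest_base)?;
        let end = off.checked_add(len)?;
        if end as usize <= self.host.len() {
            Some(off as usize)
        } else {
            None
        }
    }

    /// Reads a little endian word at a guest address.
    pub fn read_u32(&self, guest: u32) -> Option<u32> {
        let i = self.translate(guest, 4)?;
        let b = &self.host[i..i + 4];
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// Borrows `len` bytes starting at a guest address.
    pub fn read_bytes(&self, guest: u32, len: u32) -> Option<&[u8]> {
        let i = self.translate(guest, len)?;
        Some(&self.host[i..i + len as usize])
    }
}

/// Builds a 51 byte segment at guest address 0x8000 (assumed layout).
pub fn demo_mem() -> ToyMem {
    let mut bytes = vec![0u8; 0x20]; // eight instruction words, zeroed here
    bytes.extend_from_slice(&0x8024u32.to_le_bytes()); // literal pool entry at 0x8020
    bytes.extend_from_slice(b"Hello, world!\n\0"); // msg at 0x8024
    ToyMem::new(0x8000, bytes)
}
```

Reading the word at 0x8020 gives 0x8024 (first indirection, what ldr puts into r1), and only reading the bytes at 0x8024 gives the text itself (second indirection, what sys_write does later).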
Forwarding Syscalls and other feature flag based logic
Since stinkarm has three ways of dealing with syscalls (deny, sandbox,
forward), I decided to select the appropriate logic at cpu creation time
via a function pointer attached to the CPU as the syscall_handler field:

    type SyscallHandlerFn = fn(&mut Cpu, ArmSyscall) -> i32;

    pub struct Cpu<'cpu> {
        /// r0-r15 (r13=SP, r14=LR, r15=PC)
        pub r: [u32; 16],
        pub cpsr: u32,
        pub mem: &'cpu mut mem::Mem,
        syscall_handler: SyscallHandlerFn,
        pub status: Option<i32>,
    }

    impl<'cpu> Cpu<'cpu> {
        pub fn new(conf: &'cpu config::Config, mem: &'cpu mut mem::Mem, pc: u32) -> Self {
            // ...

            // simplified, in stinkarm this gets wrapped if the user specifies
            // syscall traces via -lsyscalls or -v
            s.syscall_handler = match conf.syscalls {
                SyscallMode::Forward => translation::syscall_forward,
                SyscallMode::Sandbox => sandbox::syscall_sandbox,
                SyscallMode::Deny => sandbox::syscall_deny,
            };
            // ...
        }
    }

Calling conventions, armv7 vs x86
In our examples I obviously used the armv7 syscall calling convention. But this convention differs by a lot from the calling convention of our x86 host (technically the x86-64 System V AMD64 ABI).
While armv7 uses r7 for the syscall number and r0-r5 for the syscall
arguments, x86 uses rax for the syscall id and rdi, rsi, rdx, r10,
r8 and r9 for the syscall arguments (rcx can’t be used since the syscall
instruction clobbers it, thus Linux goes with r10).
The syscall numbers differ between armv7 and x86 as well: sys_write is 1 on
x86 and 4 on armv7. If you are interested in calling conventions,
syscall ids and documentation, do visit The Chromium Projects - Linux System
Call Table; it is generated from Linux headers and fairly readable.
Table version:
| usage | armv7 | x86-64 |
|---|---|---|
| syscall id | r7 | rax |
| return | r0 | rax |
| arg0 | r0 | rdi |
| arg1 | r1 | rsi |
| arg2 | r2 | rdx |
| arg3 | r3 | r10 |
| arg4 | r4 | r8 |
| arg5 | r5 | r9 |
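The table can be turned into code fairly directly. This is a minimal sketch under assumptions, not the translation layer stinkarm ships: `X86Syscall`, `arm_to_x86_64_nr` and `marshal` are hypothetical names, and only the seven syscalls that appear later in this article are mapped. Pointer arguments are copied as-is here; a real forwarder still has to translate them from guest to host addresses.

```rust
/// x86-64 register image for one syscall (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct X86Syscall {
    /// syscall number, goes into rax
    pub rax: u64,
    /// arguments in order: rdi, rsi, rdx, r10, r8, r9
    pub args: [u64; 6],
}

/// Maps an armv7 syscall number to the x86-64 number of the same syscall.
pub fn arm_to_x86_64_nr(arm_nr: u32) -> Option<u64> {
    Some(match arm_nr {
        0 => 219, // restart_syscall
        1 => 60,  // exit
        2 => 57,  // fork
        3 => 0,   // read
        4 => 1,   // write
        5 => 2,   // open
        6 => 3,   // close
        _ => return None,
    })
}

/// Builds the x86-64 view of the syscall described by the guest registers:
/// r7 holds the number, r0-r5 hold the arguments.
pub fn marshal(r: &[u32; 16]) -> Option<X86Syscall> {
    let rax = arm_to_x86_64_nr(r[7])?;
    let mut args = [0u64; 6];
    for (slot, reg) in args.iter_mut().zip(r.iter().take(6)) {
        *slot = u64::from(*reg);
    }
    Some(X86Syscall { rax, args })
}
```

Keeping the number mapping separate from the argument copy makes it easy to reject unknown syscalls before touching any register.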
So something like writing TEXT123 to stdout looks like this on arm:

        .section .rodata
    txt:
        .asciz "TEXT123\n"

        .section .text
        .global _start
    _start:
        ldr r0, =1
        ldr r1, =txt
        mov r2, #8
        mov r7, #4
        svc #0

While it looks like the following on x86:

        .section .rodata
    txt:
        .string "TEXT123\n"

        .section .text
        .global _start
    _start:
        movq $1, %rax
        movq $1, %rdi
        leaq txt(%rip), %rsi
        movq $8, %rdx
        syscall

Hooking the syscall handler up
With the calling convention differences made clear, handling a syscall
simply means reading the armv7 syscall number from r7, converting it and
executing the handler. The return value goes back into r0:

    impl<'cpu> Cpu<'cpu> {
        pub fn step(&mut self) -> Result<bool, err::Err> {
            // ...

            match instruction {
                // ...
                decoder::Instruction::Svc => {
                    self.r[0] = (self.syscall_handler)(self, ArmSyscall::try_from(self.r[7])?) as u32;
                }
                // ...
            }
            // ...
        }
    }

Of course for this to work the syscall has to be implemented and even
decodable. At least for the decoding, there is the ArmSyscall enum:

    #[derive(Debug)]
    #[allow(non_camel_case_types)]
    pub enum ArmSyscall {
        restart = 0x00,
        exit = 0x01,
        fork = 0x02,
        read = 0x03,
        write = 0x04,
        open = 0x05,
        close = 0x06,
    }

    impl TryFrom<u32> for ArmSyscall {
        type Error = err::Err;

        fn try_from(value: u32) -> Result<Self, Self::Error> {
            Ok(match value {
                0x00 => Self::restart,
                0x01 => Self::exit,
                0x02 => Self::fork,
                0x03 => Self::read,
                0x04 => Self::write,
                0x05 => Self::open,
                0x06 => Self::close,
                _ => return Err(err::Err::UnknownSyscall(value)),
            })
        }
    }

By default the sandboxing mode is selected, and I will go into detail on both sandboxing and denying syscalls later. First I want to focus on the implementation of the translation layer from armv7 to x86 syscalls:

    pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
        match syscall {
            // none are implemented, dump debug print
            c => todo!("{:?}", c),
        }
    }

Handling the only exception: exit

Since exit means the guest wants to exit, we can’t just forward this to the host system: doing so would exit the emulator before it gets to clean up and unmap the memory regions it allocated.

    pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
        match syscall {
            ArmSyscall::exit => {
                cpu.status = Some(cpu.r[0] as i32);
                0
            }
            // ...
        }
    }

To both know we hit the exit syscall (we need to, otherwise the emulator
executes further) and propagate the exit code to the host system, we set the
Cpu::status field to Some(r0), which is the argument to the syscall.
This field is then used in the emulator entry point / main loop:

    fn main() {
        let mut cpu = cpu::Cpu::new(&conf, &mut mem, elf.header.entry);

        loop {
            match cpu.step() { /**/ }

            // Cpu::status is only some if sys_exit was called, we exit the
            // emulation loop
            if cpu.status.is_some() {
                break;
            }
        }

        let status = cpu.status.unwrap_or(0);
        // cleaning up used memory via munmap
        mem.destroy();
        // propagating the status code to the host system
        exit(status);
    }

Implementing: sys_write
The write syscall is not as spectacular as sys_exit: writing a buf of len
to a file descriptor.
| register | description |
|---|---|
| rax | syscall number (1 for write) |
| rdi | file descriptor (0 for stdin, 1 for stdout, 2 for stderr) |
| rsi | a pointer to the buffer |
| rdx | the length of the buffer rsi is pointing to |
It is necessary for doing the O of I/O tho, otherwise there won’t be any
Hello, World!s on the screen.

    use crate::{cpu, sys};

    pub fn write(cpu: &mut cpu::Cpu, fd: u32, buf: u32, len: u32) -> i32 {
        // fast path for zero length buffer
        if len == 0 {
            return 0;
        }

        // Option::None returned from translate indicates invalid memory access
        let Some(buf_ptr) = cpu.mem.translate(buf) else {
            // so we return 'Bad Address'
            return -(sys::Errno::EFAULT as i32);
        };

        let ret: i64;
        unsafe {
            core::arch::asm!(
                "syscall",
                // syscall number
                in("rax") 1_u64,
                in("rdi") fd as u64,
                in("rsi") buf_ptr as u64,
                in("rdx") len as u64,
                lateout("rax") ret,
                // we clobber rcx
                out("rcx") _,
                // and r11
                out("r11") _,
                // we don't modify the stack
                options(nostack),
            );
        }

        ret.try_into().unwrap_or(i32::MAX)
    }

Adding it to translation::syscall_forward with its arguments according to the
calling convention we established before:

    pub fn syscall_forward(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
        match syscall {
            // ...
            ArmSyscall::write => sys::write(cpu, cpu.r[0], cpu.r[1], cpu.r[2]),
            // ...
        }
    }

Executing helloWorld.elf now results in:

    $ stinkarm -Cforward examples/helloWorld.elf
    Hello, world!
    $ echo $status
    0

Deny and Sandbox - restricting syscalls
The simplest sandboxing mode is to deny, the more complex is to allow some syscall interactions while others are denied. The latter requires checking arguments to syscalls, not just the syscall kind.
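Checking arguments instead of just the syscall kind could look like the sketch below. It is only an illustration under assumptions: `Verdict`, `GuestRange` and `check_write` are hypothetical names, the 1 MiB length cap is an arbitrary choice, and the mapped range is passed in explicitly instead of being read from the emulator's memory.

```rust
/// Outcome of a sandbox policy check (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Allow,
    Deny,
}

/// A mapped guest address range [base, base + len).
pub struct GuestRange {
    pub base: u32,
    pub len: u32,
}

impl GuestRange {
    /// True if [addr, addr + len) lies completely inside the mapped range.
    /// The arithmetic is done in u64 so that it can not overflow.
    pub fn contains(&self, addr: u32, len: u32) -> bool {
        let start = u64::from(addr);
        let end = start + u64::from(len);
        let lo = u64::from(self.base);
        let hi = lo + u64::from(self.len);
        start >= lo && end <= hi
    }
}

/// Policy for write(fd, buf, len): only the standard descriptors, a bounded
/// length and a buffer that lies fully inside mapped guest memory.
pub fn check_write(fd: u32, buf: u32, len: u32, mapped: &GuestRange) -> Verdict {
    const MAX_LEN: u32 = 1 << 20; // arbitrary cap, an assumption of this sketch

    if fd > 2 || len > MAX_LEN || !mapped.contains(buf, len) {
        return Verdict::Deny;
    }
    Verdict::Allow
}
```

For the hello world segment (base 0x8000, 51 bytes) the write of 14 bytes at 0x8024 to fd 1 passes, while a write to fd 3 or a buffer that runs past the end of the segment is denied.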
Let's start with the easier syscall handler: deny. Deny simply returns
ENOSYS for every invoked syscall:

    pub fn syscall_deny(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
        if let ArmSyscall::exit = syscall {
            cpu.status = Some(cpu.r[0] as i32);
        }

        -(sys::Errno::ENOSYS as i32)
    }

Thus executing the hello world and enabling syscall logs results in neither
sys_write nor sys_exit going through and ENOSYS being returned for both
in r0:

    $ stinkarm -Cdeny -lsyscalls examples/helloWorld.elf
    148738 write(fd=1, buf=0x8024, len=14) [deny]
    =ENOSYS
    148738 exit(code=0) [deny]
    =ENOSYS

At a high level, sandbox works like deny: it checks conditions before
executing a syscall and disallows the syscall if they don’t match:

    pub fn syscall_sandbox(cpu: &mut super::Cpu, syscall: ArmSyscall) -> i32 {
        match syscall {
            ArmSyscall::exit => {
                cpu.status = Some(cpu.r[0] as i32);
                0
            }
            ArmSyscall::write => {
                let (r0, r1, r2) = (cpu.r[0], cpu.r[1], cpu.r[2]);
                // only allow writing to stdin, stdout and stderr
                if r0 > 2 {
                    return -(sys::Errno::ENOSYS as i32);
                }

                sys::write(cpu, r0, r1, r2)
            }
            _ => todo!("{:?}", syscall),
        }
    }

For instance we only allow writing to stdin, stdout and stderr, no other file descriptors. One could also add pointer range checks, buffer length checks and other hardening measures here. Emulating the hello world example with this mode (which is the default mode):

    $ stinkarm -Csandbox -lsyscalls examples/helloWorld.elf
    150147 write(fd=1, buf=0x8024, len=14) [sandbox]
    Hello, world!
    =14
    150147 exit(code=0) [sandbox]
    =0

Fin
So there you have it, emulating armv7 in six steps:
- parsing and validating a 32-bit armv7 Elf binary
- mapping segments into host address space
- decoding a non-trivial subset of armv7 instructions
- handling program counter relative literal loads
- translating memory interactions from guest to host
- forwarding armv7 Linux syscalls into their x86-64 System V counterparts
Diving into the ELF and armv7 specs without any previous relevant experience, except the asm module I had in uni, was a bit overwhelming at first. Armv7 decoding was by far the most annoying part of the project and I still don’t like the bizarre argument ordering for x86-64 syscalls.
The whole project is about 1284 lines of Rust, has zero dependencies1 and is as far as I know working correctly2.
Microbenchmark Performance
It executes a real armv7 hello world binary in ~0.015ms of guest execution-only time, excluding process startup and parsing. The e2e execution, including all stages I outlined before, takes about 2ms.

    $ stinkarm -v examples/helloWorld.elf
    [ 0.070ms] opening binary "examples/helloWorld.elf"
    [ 0.097ms] parsing ELF...
    [ 0.101ms] \
    ELF Header:
      Magic:              [7f, 45, 4c, 46]
      Class:              ELF32
      Data:               Little endian
      Type:               Executable
      Machine:            EM_ARM
      Version:            1
      Entry point:        0x8000
      Program hdr offset: 52 (32 bytes each)
      Section hdr offset: 4696
      Flags:              0x05000200
      EH size:            52
      # Program headers:  1
      # Section headers:  9
      Str tbl index:      8

    Program Headers:
      Type  Offset    VirtAddr    PhysAddr    FileSz    MemSz     Flags  Align
      LOAD  0x001000  0x00008000  0x00008000  0x000033  0x000033  R|X    0x1000

    [ 0.126ms] mapped program header `LOAD` of 51B (G=0x8000 -> H=0x7ffff7f87000)
    [ 0.129ms] jumping to entry G=0x8000 at H=0x7ffff7f87000
    [ 0.131ms] starting the emulator
    153719 write(fd=1, buf=0x8024, len=14) [sandbox]
    Hello, world!
    =14
    153719 exit(code=0) [sandbox]
    =0
    [ 0.149ms] exiting with `0`

Comparing the whole pipeline (parsing elf, segment mapping, cpu setup, etc) to
qemu we arrive at the following micro benchmark results. To be fair, qemu
does a whole lot more than stinkarm, it has a jit, a full linux-user runtime, a
dynamic loader, etc.

    $ hyperfine "./target/release/stinkarm examples/helloWorld.elf" -N --warmup 10
    Benchmark 1: ./target/release/stinkarm examples/helloWorld.elf
      Time (mean ± σ):       1.9 ms ±   0.3 ms    [User: 0.2 ms, System: 1.4 ms]
      Range (min … max):     1.6 ms …   3.4 ms    1641 runs

    $ hyperfine "qemu-arm ./examples/helloWorld.elf" -N --warmup 10
    Benchmark 1: qemu-arm ./examples/helloWorld.elf
      Time (mean ± σ):      12.3 ms ±   1.5 ms    [User: 3.8 ms, System: 8.0 ms]
      Range (min … max):     8.8 ms …  19.8 ms    226 runs