touchHLE in depth, part 1: a day in the life of a function call

2023-04-13 (last correction: 2023-04-15)

This is the first in what will hopefully be a series of technical write-ups of various aspects of touchHLE, a high-level emulator for iPhone OS applications. (See also: the touchHLE announcement blog post from two months ago.) These write-ups will be aimed at an audience familiar with systems programming, but which isn't necessarily familiar with emulator development, ARMv6 assembly, Rust, Objective-C, and so on.

touchHLE's defining design choices

The goal I set for touchHLE was to run early iPhone and iPod touch games on modern computers, without requiring a copy of iPhone OS itself. This presents two major challenges:

1. iPhone OS apps consist of native code, initially for ARMv6 and later also for ARMv7-A. These are legacy, 32-bit Arm ISAs. Most modern computers' CPUs, even Apple's, can't directly run code written for these ISAs.
2. iPhone OS apps rely on the services provided by iPhone OS. Without them, apps will not be able to accomplish very much.

The first problem has various off-the-shelf solutions, but as it happens, one of those was written by a friend of mine, so I tried it first. Dynarmic is a dynamic recompiler that can translate code from several Arm ISAs to x86-64 or ARMv8-A (64-bit Arm) code, and it was very easy to integrate. touchHLE sets up a sandbox region of virtual memory that will belong to the app, loads the app's code into it, and tells Dynarmic to start execution at the app's entry point. Dynarmic then takes control until some event happens which touchHLE has to handle, e.g. hitting a time limit or encountering a special instruction. It works well, so I haven't had to try alternatives.

As for the second problem, thankfully every feature of iPhone OS is accessed through dynamically-linked libraries (most of which are “frameworks”), and they have stable ABIs, so they're practical to replace. One obvious approach would be to use the real Apple libraries, just from a different Apple platform, e.g. macOS or the iPhone Simulator. However, I really didn't want to rely on Apple keeping deprecated APIs around. After all, if Apple were more generous with backwards compatibility, this project would never have needed to exist. I also really wanted touchHLE to work on non-Apple platforms.

With that constraint in mind, the only option was to use an independent reimplementation of those libraries. Thankfully, the ones apps rely on are all publicly documented, so reimplementing them is realistic. I could have made use of the various open-source projects that reimplement bits and pieces of them, but touchHLE's requirements are very different from most of those projects (iPhone OS/iOS rather than macOS, ARMv6 rather than x86-64, binary rather than source compatibility, games rather than business or command-line apps, etc), so I decided to just do everything from scratch. Well, hubris and it being more fun this way were also factors. ^^

Then there's the question of how to reimplement them. On a real iPhone or iPod touch, system libraries look very much like the apps that use them: they're both native Arm binaries and they live in the same process. Since touchHLE has ARMv6 emulation, I could have chosen to write C or Objective-C code and compile it to an ARMv6 dynamic library, just like Apple did. In that case, I would have exposed a syscall interface so the libraries can perform I/O, just like iPhone OS does.

But that's not what I chose to do. I thought it would be fun to have the app binary be the only code that runs under emulation, and implement all those libraries in native code for the host machine (i.e. non-emulated, x86-64 or ARMv8-A code). I also decided to do it in Rust, because I like Rust.

These decisions are what really set touchHLE apart from other emulation projects. I don't think there's many projects out there that implement 32-bit C and Objective-C runtimes using 64-bit Rust code! They're also where touchHLE got its name from: emulating only the app code, not the libraries or OS, makes this a “high-level” emulator in the tradition of UltraHLE.

An example function

So, if the app binary is running under emulation, but the system libraries it's using aren't, how does it call functions in them? Let's look at a real example.

The executable for Super Monkey Ball (2008) version 1.3 contains a function called RandInit, which gets called during the game's initialisation. If we disassemble it with Ghidra, it looks like this:

00010330 80 b5           push       { r7, lr }
00010332 00 af           add        r7,sp,#0x0
00010334 00 20           movs       r0,#0x0
00010336 2f f0 be eb     blx        __symbol_stub4::_time
0001033a 2f f0 de ea     blx        __symbol_stub4::_mach_absolute_time
0001033e 2f f0 72 eb     blx        __symbol_stub4::_srand
00010342 80 bd           pop        { r7, pc }

(From left to right: instruction address, encoding, mnemonic, and operands.)

This is ARMv6 code (more specifically, Thumb code). The original C code would have been something like:

void RandInit(void) {
    time(NULL);
    int seed = mach_absolute_time();
    srand(seed);
}

Hopefully it's clear what this function is doing: it's getting some representation of the current time as an integer, and using that integer as a seed for the random number generator. (The fact it calls two different time functions and only uses the result of one of them is probably a mistake, don't worry about it.) In order to do that, it calls three functions, two from the C standard library (time and srand) and one that's specific to Mach (mach_absolute_time).

On Apple platforms, all of those functions live in a dynamic library called libSystem. As discussed earlier, touchHLE provides its own replacement implementations for proprietary Apple libraries like these, and it does them with un-emulated, 64-bit Rust code. So, to execute RandInit, execution will have to flow back and forth between the app's ARMv6 code emulated by dynarmic, and touchHLE's code. Since this distinction will be coming up a lot, let's call the former “guest code”, and the latter “host code”.

With that out of the way, let's start following the path of execution! The first two instructions of RandInit are calling-convention boilerplate that we don't need to worry about. The first interesting instructions are these two:

00010334 00 20           movs       r0,#0x0
00010336 2f f0 be eb     blx        __symbol_stub4::_time

This is the time(NULL); call. movs r0,#0x0 stores a value of 0 (i.e. NULL) into the register r0, which is used for the first argument of a function call. Then we have a blx, which is the normal branch instruction used for function calls on ARMv6. The destination is the _time symbol in the __symbol_stub4 section of the app binary, so after that instruction is executed, that's where the program counter (current instruction) will be.

Lazy dynamic linking on iPhone OS

You might expect that the function being branched to would be the actual time function, or at least that it would be on the real iPhone OS. But it isn't, and this is due to how Apple's lazy dynamic linking works. Before I continue, I want to acknowledge Alex Drummond's excellent blog post, Inside a Hello World executable on OS X, which was my introduction to this, and which I can recommend if you want a more in-depth understanding.

Lazy linking seems to be the default for dynamically-linked functions on iPhone OS. What that means in practice is that the actual linking is delayed until the function is first called. Presumably this is used to make applications start up faster.

In order for this to work effectively, there needs to be some kind of indirection, so that the first call will go to the dynamic linker (which will do the linking and then hand over to the intended function), but subsequent calls will bypass it. Apple achieves this by adding two special sections to the app binary: __symbol_stub4, which contains lots of near-identical tiny stub functions, and __la_symbol_ptr, which contains lots of function pointers. Each symbol (function or function pointer, respectively) in these sections corresponds to a dynamically-linked function.

In our example, we've just branched to the _time function in the __symbol_stub4 section. Let's see what it looks like:

0003fab4 00 c0 9f e5     ldr        r12,[pc]
0003fab8 00 f0 9c e5     ldr        pc,[r12]
0003fabc 00 d3 04 00     addr       __la_symbol_ptr::_time

I don't expect you to understand what this does unless you're very familiar with 32-bit Arm assembly and Ghidra's particular flavour of disassembler-speak. Long story short, it loads the function pointer at the _time symbol in the __la_symbol_ptr section, then immediately branches to it (more specifically, it does a tail call).
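To make that a bit more concrete, here is a minimal sketch that interprets the twelve stub bytes shown above by hand. It is not touchHLE code, and the function names are made up for illustration. It relies on one fact about Arm (non-Thumb) code: reading pc yields the address of the current instruction plus 8, so the ldr r12,[pc] at the start of the stub reads the word at offset 8 of the stub.

```rust
/// Reads a little-endian 32-bit word from `bytes` at `offset`.
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes([
        bytes[offset],
        bytes[offset + 1],
        bytes[offset + 2],
        bytes[offset + 3],
    ])
}

/// Given the raw bytes of a 12-byte lazy-linking stub, returns the value the
/// first instruction loads into r12: the address of the stub's entry in
/// __la_symbol_ptr. Returns None if the bytes don't match the expected
/// two-instruction pattern.
fn stub_lazy_pointer_address(stub: &[u8; 12]) -> Option<u32> {
    const LDR_R12_PC: u32 = 0xe59f_c000; // ldr r12,[pc]
    const LDR_PC_R12: u32 = 0xe59c_f000; // ldr pc,[r12]
    if read_u32_le(stub, 0) != LDR_R12_PC || read_u32_le(stub, 4) != LDR_PC_R12 {
        return None;
    }
    // pc reads as "address of this instruction + 8", and the ldr is the
    // first instruction of the stub, so the literal sits at offset 8.
    Some(read_u32_le(stub, 8))
}
```

Fed the bytes from the Ghidra listing, this yields 0x0004d300, which is the address Ghidra labels as __la_symbol_ptr::_time.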

So, whereas the source code had a call to the time function, Apple's compiler has actually generated a call to a corresponding stub function, and that stub function loads a function pointer and calls it. The same is done for all other calls to dynamically-linked functions in the app.

Why do things this way? As mentioned earlier, this provides indirection, but an equally important property is that it avoids obscuring which function the app wants to call. Initially, all of those function pointers will point to some function in the dynamic linker that does lazy linking. Since there's one pointer for each dynamically-linked function, and the stub function loads a pointer to that pointer into r12, the dynamic linker can infer which function it needs to do linking for by inspecting that register. (Edit 2023-04-15: In earlier versions of this post, I said it looks at the call stack to see which stub function it was called from, but that's obviously wrong because it's a tail call. Sorry about that!) Once the dynamic linker has done its thing, it'll update the function pointer so it points to the function the app wanted to call, and then call it.

Lazy dynamic linking in touchHLE: invoking the linker

That describes how lazy dynamic linking works on iPhone OS, but touchHLE deviates from it a bit.

The problem is that, because touchHLE's dynamic linker is host code, the function that does lazy linking doesn't have a true function pointer that can be called from guest code, and the same problem exists for all the functions in touchHLE's implementations of iPhone OS system libraries. It would be possible to synthesise a fake function pointer by allocating memory visible to guest code, and putting some kind of special instruction in there — in other words, by creating a stub function. But then there'd be an extra layer of stubs in every function call: not only would there be the dynamic linking stub, but also a “fake function pointer” stub. What if there's a better way?

Well, I think I came up with one. Instead of creating new stub functions, touchHLE rewrites the existing stubs when it loads an app binary. ^^

Let's have a look at how that works. Once again, in our example, we've just branched to the _time function in the __symbol_stub4 section. The disassembly I showed last time was done using Ghidra, which shows the code in the app binary. But touchHLE has rewritten it after loading the binary, so this time I'll use touchHLE's GDB support to disassemble the code in memory that will actually be executed:

(gdb) disass 0x3fab4,0x3fabc
Dump of assembler code from 0x3fab4 to 0x3fabc:
   0x0003fab4:	svc	0x00000000
   0x0003fab8:	bx	lr
End of assembler dump.

As you can see, the new stub is completely different! svc is ARMv6's instruction for doing syscalls, and it takes an immediate operand that can contain arbitrary information, usually the kind of syscall the app wants to do. So svc 0x00000000 could mean “do a syscall of kind 0”. Then there is bx lr, which is one of the standard ways to return from a function on ARMv6.
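For the curious, here is a rough sketch of what rewriting a stub involves at the byte level. This is an illustration rather than touchHLE's actual code: the function names are invented, and leaving the third word of the stub untouched is an assumption based on the GDB output above only covering eight bytes. The instruction encodings themselves are the standard Arm-state ones.

```rust
/// Encodes an Arm-state `svc #immediate` instruction (condition: always).
/// The immediate field is 24 bits wide.
fn encode_svc(immediate: u32) -> u32 {
    assert!(immediate <= 0x00ff_ffff, "svc immediate must fit in 24 bits");
    0xef00_0000 | immediate
}

/// Arm-state `bx lr` (condition: always).
const BX_LR: u32 = 0xe12f_ff1e;

/// Overwrites the first two instructions of a stub with `svc #imm; bx lr`.
/// Any bytes after the first eight are left as they were.
fn rewrite_stub(stub: &mut [u8], svc_immediate: u32) {
    stub[0..4].copy_from_slice(&encode_svc(svc_immediate).to_le_bytes());
    stub[4..8].copy_from_slice(&BX_LR.to_le_bytes());
}
```

Calling rewrite_stub with an immediate of 0 on the original _time stub bytes would produce the two instructions in the GDB listing.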

Normally a syscall is how a user-mode process (an app) interacts with an operating system kernel, in order to do things like manage memory or perform I/O, and iPhone OS apps are no different. An app will call a function that's part of the OS's syscall library, for example the POSIX write function, and then that function will use a special instruction sequence, perhaps using svc, in order to perform the actual syscall. On an OS like Linux, which has a stable syscall ABI, an app can safely skip the library function and directly use those special instructions, but Apple's OSes have an unstable syscall ABI, so an instruction like svc should never be present in an iPhone app binary.

In touchHLE's case, there is no iPhone OS kernel involved, and touchHLE makes no attempt to be compatible with its syscall ABI. Instead, touchHLE uses its own ad-hoc “syscall ABI” for signalling when execution of guest code should be suspended, and host code should take over. Since touchHLE is taking the place of the iPhone OS kernel here, I think these are still syscalls in spirit, even if they're sometimes doing things that aren't usually handled by an OS kernel. In this case, svc 0 means “do lazy dynamic linking for this function”.

Now that I've explained what that svc 0 should do, let's see how it actually does it. When dynarmic encounters an svc instruction, it knows this is a syscall instruction, so the guest code is requesting the OS kernel to step in. Dynarmic only emulates user-mode code, so it expects the emulator using it (touchHLE in this case) to handle it instead, and it lets the emulator provide a callback for doing so. touchHLE's callback for handling svc makes note of the immediate operand on the instruction (0 in this case) and then asks dynarmic to halt execution.

Dynarmic dutifully halts the execution of guest code, saves the state of the emulated CPU (the contents of registers), and then hands control back to touchHLE. Now we're firmly in the land of host code! When Dynarmic halts execution, it informs touchHLE of why it did so, and in this case it's because touchHLE requested it to halt when handling the svc instruction. touchHLE then looks at the halt reason, sees it's because of the svc, and refers back to the immediate operand it made note of before. Now touchHLE can decide what to do based on that immediate value. In this case, of course, it invokes the linker.
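The back-and-forth described above can be sketched with a toy stand-in for the recompiler. Everything here (ToyCpu, Callbacks, HaltReason, run_guest) is hypothetical and only mimics the shape of the interaction: it is not Dynarmic's API and not touchHLE's code, and the "instructions" are a simple enum rather than real machine code.

```rust
/// Why the toy CPU stopped running guest code.
#[derive(Debug, PartialEq)]
enum HaltReason {
    /// The svc callback asked for a halt.
    Svc,
    /// There was nothing left to execute.
    EndOfCode,
}

/// Stand-in for guest instructions.
#[derive(Clone, Copy)]
enum Instruction {
    Svc(u32),
    Other,
}

/// Emulator-provided callbacks, called by the CPU while it runs.
struct Callbacks {
    /// The immediate of the svc that caused the most recent halt.
    pending_svc: Option<u32>,
}

impl Callbacks {
    /// Notes the immediate and returns true to request a halt.
    fn on_svc(&mut self, immediate: u32) -> bool {
        self.pending_svc = Some(immediate);
        true
    }
}

struct ToyCpu {
    code: Vec<Instruction>,
    pc: usize,
    callbacks: Callbacks,
}

impl ToyCpu {
    /// Runs guest "code" until a callback requests a halt or the code ends.
    fn run(&mut self) -> HaltReason {
        while self.pc < self.code.len() {
            let instruction = self.code[self.pc];
            self.pc += 1;
            if let Instruction::Svc(immediate) = instruction {
                if self.callbacks.on_svc(immediate) {
                    return HaltReason::Svc;
                }
            }
        }
        HaltReason::EndOfCode
    }
}

/// Host side: resume the CPU repeatedly, handling each svc in host code.
/// Returns a description of what the host decided to do at each halt.
fn run_guest(cpu: &mut ToyCpu) -> Vec<String> {
    let mut actions = Vec::new();
    loop {
        match cpu.run() {
            HaltReason::EndOfCode => return actions,
            HaltReason::Svc => {
                let immediate = cpu
                    .callbacks
                    .pending_svc
                    .take()
                    .expect("svc halt without a recorded immediate");
                if immediate == 0 {
                    actions.push("invoke lazy linker".to_string());
                } else {
                    actions.push(format!("call linked function for svc {:#x}", immediate));
                }
            }
        }
    }
}
```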

Lazy dynamic linking in touchHLE: linking host code

When touchHLE's lazy linking handler is called, it knows the address of the svc 0 instruction. It can also assume that this instruction belongs to one of the stubs in the __symbol_stub4 section of the app binary. With this in mind, it can look up that stub in a part of the app binary called the “indirect symbol table” (see Alex Drummond's aforementioned post if you want to know more). That lookup returns a symbol with various information attached to it, but the only thing touchHLE's dynamic linker cares about is its name. If you remember, our example code was calling the _time stub in the __symbol_stub4 section, so the resulting string is "_time" in this case.
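Here is a simplified sketch of that lookup. The struct and function names are invented for this example, and it glosses over real Mach-O parsing: in an actual binary the names come from the string table via the symbol table, whereas here they are just a slice of strings. What it does reflect is how a stub section's header ties into the indirect symbol table: reserved1 gives the index of the section's first entry in that table, and reserved2 gives the size of each stub.

```rust
/// The parts of a Mach-O stub section header that matter for this lookup.
struct StubSection {
    /// Guest address of the first stub in the section.
    address: u32,
    /// Size of each stub in bytes (the header's `reserved2` field).
    stub_size: u32,
    /// Index of this section's first entry in the indirect symbol table
    /// (the header's `reserved1` field).
    indirect_start: u32,
}

/// Finds the name of the symbol whose stub begins at `svc_address`.
/// `indirect_symbols` maps indirect table slots to symbol table indices,
/// and `symbol_names` stands in for the symbol table plus string table.
fn stub_symbol_name<'a>(
    section: &StubSection,
    indirect_symbols: &[u32],
    symbol_names: &[&'a str],
    svc_address: u32,
) -> Option<&'a str> {
    if section.stub_size == 0 {
        return None;
    }
    let offset = svc_address.checked_sub(section.address)?;
    if offset % section.stub_size != 0 {
        // Not at the start of a stub.
        return None;
    }
    let stub_index = offset / section.stub_size;
    let slot = section.indirect_start.checked_add(stub_index)? as usize;
    let symbol_index = *indirect_symbols.get(slot)? as usize;
    symbol_names.get(symbol_index).copied()
}
```

With a made-up section layout in which the stub at 0x3fab4 is the third stub, and made-up table contents, the lookup walks from the address to the stub index, then to the symbol index, and finally to the string "_time".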

Now the actual linking can happen. Once it knows the name of the function, the linker can try to find an implementation of it to “link to”. In this case it will find a host code implementation. These implementations are really just ordinary Rust functions that are part of touchHLE's code, and they're exposed to the linker through simple tables of function names and (host code) function pointers. The following Rust code is simplified a bit for readability's sake, but it's representative of the actual algorithm:

fn find_function(name: &str) -> HostFunction {
    let functions = [
        ("_time", &time),
        ("_mach_absolute_time", &mach_absolute_time),
        ("_srand", &srand),
        // …
    ];
    for (function_name, function_pointer) in functions {
        if function_name == name {
            return function_pointer;
        }
    }
    panic!("Call to unimplemented function {}", name);
}

Notice that it can simply panic (abort) if the function can't be found. Since lazy linking only happens once a function is called for the first time, this panic only happens once the function gets called. This turns out to be extremely convenient for iterative development. When I was trying to get Super Monkey Ball working, I could run the app, wait for it to panic because of an unimplemented function, implement that function, see if it worked, and then repeat that process for the next function, and so on; the process has been similar for other apps too. It also meant I could avoid implementing some functions that the app could, but doesn't actually, call. I held off on implementing printf for a while, because when that game is starting up, it only calls that function if it encounters an error, and the errors were usually my fault. ^^

Anyway, assuming the dynamic linker does successfully look up a function pointer, it then needs to “link” the function. Recall that touchHLE rewrites the stubs to begin with svc 0, which invokes the dynamic linker. Linking the function works the same way: the linker maintains a big Vec (growable array) of function pointers, and whenever it links a new function, it pushes it to the back of that Vec, gets the index of the function pointer in that Vec, and rewrites the stub again to begin with an svc whose immediate is that index plus some offset.

If I use GDB again and set a breakpoint that is hit sometime after time is first called, we can see this has happened:

(gdb) disass 0x3fab4,0x3fabc
Dump of assembler code from 0x3fab4 to 0x3fabc:
   0x0003fab4:	svc	0x0000005e
   0x0003fab8:	bx	lr
End of assembler dump.

So svc 0x5e has been assigned to the time function. If time is called again, touchHLE can skip invoking the dynamic linker, instead just taking the value 0x5e, subtracting the offset, and using it to index into the Vec of function pointers.
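The bookkeeping for this can be sketched in a few lines. This is an illustration under assumptions, not touchHLE's real linker: the Linker type, its method names and the value of the offset are all invented here (the post doesn't say what the real offset is), and every host function is given the same trivial signature to keep things short.

```rust
/// In this sketch every host function has the same signature.
type HostFunction = fn() -> u32;

/// Hypothetical offset: svc immediates below this are reserved for other
/// purposes (0 means "do lazy linking").
const SVC_LINKED_FUNCTIONS_BASE: u32 = 1;

struct Linker {
    /// Every function linked so far, in the order it was linked.
    linked_functions: Vec<HostFunction>,
}

impl Linker {
    /// Records a newly linked function and returns the svc immediate that
    /// should be written into its stub.
    fn link(&mut self, function: HostFunction) -> u32 {
        let index = self.linked_functions.len() as u32;
        self.linked_functions.push(function);
        index + SVC_LINKED_FUNCTIONS_BASE
    }

    /// Maps an svc immediate back to the function it was assigned to, or
    /// None if the immediate doesn't belong to a linked function.
    fn resolve(&self, svc_immediate: u32) -> Option<HostFunction> {
        let index = svc_immediate.checked_sub(SVC_LINKED_FUNCTIONS_BASE)?;
        self.linked_functions.get(index as usize).copied()
    }
}
```

The nice property of this arrangement is that a repeat call costs one subtraction and one array index, with no name lookup involved.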

In any case, whether this is the first call or a subsequent call, the next step is to actually call that function pointer.

My Little Trait: Generics Are Magic

I mentioned that these host code implementations of functions are just ordinary Rust functions. To illustrate that point, here's a slightly simplified version of touchHLE's implementation of time:

use std::time::SystemTime;
type time_t = i32;
fn time(env: &mut Environment, out: MutPtr<time_t>) -> time_t {
    let time = SystemTime::now()
        .duration_since(SystemTime::UNIX_EPOCH)
        .unwrap()
        .as_secs() as time_t;
    if !out.is_null() {
        env.mem.write(out, time);
    }
    time
}

You may remember that the reconstructed original C code was trying to make the call time(NULL), with just one argument, but notice that this function takes two: env and out. Readers familiar with Rust will probably also have noticed the use of a strange MutPtr type and env.mem.write call, which is not what normal Rust code working with pointers looks like. Those quirks aside, this is a normal Rust function with no special awareness of the ARMv6 world that guest code lives in, and on my machine it is compiled to x86-64 code. How, then, can ARMv6 code call it?

As established earlier, touchHLE uses svc instructions to transfer control from guest code to host code, and to indicate which function to call. If all the functions implemented in host code took no arguments and returned nothing, then I could simply write function_pointer(); and be done with it. However, as in this example, many of these functions will have arguments and return values, so I'll need to do something about those.

⚠️ At this point, if you are afraid of generics or of learning what a Rust trait is, you should immediately close this tab and step away from your computer. ⚠️

More concretely, what I want is a magical function with a signature like this:

fn call_host_function(func: HostFunction, env: &mut Environment);

HostFunction would be a type that's like a function pointer, but can point to any of our host code implementations, regardless of what signature it might have (so it could point to any of the functions in our example, for instance). &mut Environment would be a mutable reference to a type, Environment, that contains the entire state of the emulated world: the contents of the registers and memory that the guest code works with, and perhaps other things. This function would call that function pointer, pass in the right arguments, and do the right thing with its return value. If I had a function like this, then I could simply write call_host_function(function_pointer, environment);.

Well, thankfully I chose Rust for this project, so I was actually able to implement this magical function, and it's called CallFromGuest::call_from_guest. It's written in very generics-, macro- and trait-heavy Rust code, but hopefully I can explain it fairly simply. ^^; By the way, HostFunction and Environment are also real types.

CallFromGuest is a trait. For the unfamiliar, a trait is similar to a Java-style interface or C++-style abstract class, insofar as it's a set of methods with a particular meaning that you can call on any type which implements it. But traits are more powerful: if you define a new trait, you can implement it not only on your own types, but also on types you didn't define yourself. In this case, CallFromGuest is a trait implemented on certain function pointer types, including the function pointer type that our time function has. Since function pointers are not a type defined by touchHLE, but rather a fundamental primitive type defined by the Rust language, this wouldn't have been possible without this extra power.

More specifically, CallFromGuest is implemented on all function pointer types where the first argument is &mut Environment, there are no more than nine subsequent parameters, all those subsequent parameters' types have GuestArg implemented on them, and GuestRet is implemented on the return type. (Nine is an arbitrary limit I can easily change, but it'd be nice to not need it.) GuestArg is a trait for types I want to use for arguments coming from guest code, and GuestRet is a trait for types I want to use for return values going to guest code.

Collectively, these three traits' job is to translate between Apple's ARMv6 ABI for function calls (which is AAPCS32 with a few quirks) and normal function calls in Rust host code. Thanks to the magic of monomorphisation, it should be as close to “zero cost” as a maintainable solution to this can be. They are also what enables that magical ability to have a single function pointer type that works for all these function signatures, with only a little nudging of the compiler required.

Actually making Rust speak Apple's ARMv6 ABI

Type system shenanigans aside, what these traits actually do is fairly simple. To go back to our example, we know the time implementation is a function with this signature:

fn time(env: &mut Environment, out: MutPtr<time_t>) -> time_t;

Once you expand the macros, remove the generics and make some other changes to improve readability, the CallFromGuest implementation for this function signature is no more complicated than this:

impl CallFromGuest for fn(&mut Environment, MutPtr<time_t>) -> time_t {
    fn call_from_guest(&self, env: &mut Environment) {
        let arg: MutPtr<time_t> = GuestArg::from_regs(&env.cpu.regs()[0..]);
        let retval: time_t = self(env, arg);
        GuestRet::to_regs(retval, env.cpu.regs_mut());
    }
}

This does just three things: it uses GuestArg to get the first argument from registers, starting at register zero; it calls the function pointer (the time implementation here) passing in that argument; and then it uses GuestRet to put the return value into registers. If there'd been more arguments, it would have used GuestArg once for each, and used a different register offset for each.

The GuestArg implementation for MutPtr<time_t> is also quite simple. The MutPtr type represents a mutable pointer to somewhere in memory accessible to guest code. Since the guest code is ARMv6 code, it's 32-bit, and therefore pointers are 32-bit. Once again, I'm simplifying for the sake of readability, but it's still representative of what it actually does:

struct MutPtr(u32);
impl GuestArg for MutPtr {
    const REG_COUNT: usize = 1;
    fn from_regs(regs: &[u32]) -> Self {
        MutPtr(regs[0])
    }
    fn to_regs(self, regs: &mut [u32]) {
        regs[0] = self.0;
    }
}

The 32-bit pointer type is just a wrapper around a 32-bit unsigned integer. Each register is treated as a 32-bit unsigned integer, so to read a pointer from registers, it just takes the first register value and wraps it up, and to write a pointer to registers, it just unwraps it and puts it in the first register. The REG_COUNT constant specifies how many registers are needed, and it only comes into play if there's several arguments, because it's used to compute the register offsets for subsequent arguments (argument 0 goes in register 0, argument 1 will go in register 1, etc).
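To show how REG_COUNT feeds into the register offsets, here is a self-contained sketch of a CallFromGuest implementation for functions with two guest arguments. It is a reduced imitation rather than touchHLE's code: Environment is cut down to a bare register array, the traits only have the methods this example needs, the add function is a hypothetical host function, and arguments that would spill onto the stack are ignored entirely.

```rust
/// Minimal stand-in for Environment: just the guest registers.
struct Environment {
    regs: [u32; 16],
}

trait GuestArg: Sized {
    /// How many registers an argument of this type occupies.
    const REG_COUNT: usize;
    fn from_regs(regs: &[u32]) -> Self;
}

trait GuestRet {
    fn to_regs(self, regs: &mut [u32]);
}

impl GuestArg for u32 {
    const REG_COUNT: usize = 1;
    fn from_regs(regs: &[u32]) -> Self {
        regs[0]
    }
}

impl GuestRet for u32 {
    fn to_regs(self, regs: &mut [u32]) {
        regs[0] = self;
    }
}

trait CallFromGuest {
    fn call_from_guest(&self, env: &mut Environment);
}

/// Two guest arguments: the second one starts where the first one ended.
impl<A: GuestArg, B: GuestArg, R: GuestRet> CallFromGuest
    for fn(&mut Environment, A, B) -> R
{
    fn call_from_guest(&self, env: &mut Environment) {
        let first_offset = 0;
        let second_offset = first_offset + A::REG_COUNT;
        let a = A::from_regs(&env.regs[first_offset..]);
        let b = B::from_regs(&env.regs[second_offset..]);
        let retval = (*self)(env, a, b);
        retval.to_regs(&mut env.regs);
    }
}

/// Hypothetical host function, used only to exercise the impl above.
fn add(_env: &mut Environment, a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}
```

Casting add to the type fn(&mut Environment, u32, u32) -> u32 and taking a reference to the result gives something usable as a &dyn CallFromGuest, which is one way a single function pointer type can cover many signatures.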

As for our return value, which is a time_t, that's just an alias of i32. The GuestRet implementation for i32 is not much different from the GuestArg implementation I just showed, so I'll leave it up to your imagination.
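For anyone who would rather not imagine it, one plausible shape is sketched below. This is a guess at the obvious implementation, not a copy of touchHLE's, and the trait is redeclared here in reduced form so that the example stands alone.

```rust
trait GuestRet: Sized {
    fn from_regs(regs: &[u32]) -> Self;
    fn to_regs(self, regs: &mut [u32]);
}

impl GuestRet for i32 {
    fn from_regs(regs: &[u32]) -> Self {
        // Reinterpret the register's bits as a signed integer.
        regs[0] as i32
    }
    fn to_regs(self, regs: &mut [u32]) {
        // Casting between i32 and u32 with `as` keeps the bit pattern.
        regs[0] = self as u32;
    }
}
```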

Finishing the example

Okay, at long last I'm at the point where I don't need to explain any more concepts, so we can zip through the rest of the example function.

So, CallFromGuest and GuestArg do their thing, and the implementation of time is called. It sees that its first argument is NULL, so it doesn't write anything to memory, but it does return some integer representing the current time. GuestRet does its thing, and so now r0 contains that return value. When I used GDB to set a breakpoint just now, r0 was 1681415614, for example.

Now that the host code has executed the time function as requested, it's time for touchHLE to tell dynarmic to resume execution of the guest code. After the svc 0x5e instruction we just executed, the next instruction is bx lr, which as you may recall, is one way to return from a function on ARMv6. Once dynarmic executes that… we're finally done with that _time stub! 🎉

We're back to the RandInit function. As a reminder, there's now two more function calls left:

0001033a 2f f0 de ea     blx        __symbol_stub4::_mach_absolute_time
0001033e 2f f0 72 eb     blx        __symbol_stub4::_srand

Everything is very similar for calling _mach_absolute_time: this is another dynamic linker stub, so the dynamic linker has to look up, link and call this function, and once again CallFromGuest is involved. The signature of this function is a bit different, though. Here's touchHLE's implementation of it, this time with no simplifications:

fn mach_absolute_time(env: &mut Environment) -> u64 {
    let now = Instant::now();
    now.duration_since(env.startup_time)
        .as_nanos()
        .try_into()
        .unwrap()
}

Notice that there's no arguments other than env, which means that we're reading no arguments from registers, and hence there's no use of GuestArg. Alas, this means that the time we just computed and put in r0 is going to go to waste. As I said earlier, I think RandInit calling two different time functions is a bug.

Once this function has done its thing, there's returning the value to registers to worry about. The return value in this case is also something new: u64 is a 64-bit unsigned integer, so it needs to be split across two 32-bit registers. Here's the GuestRet implementation for it, completely unsimplified:

impl GuestRet for u64 {
    fn from_regs(regs: &[u32]) -> Self {
        let mut bytes = [0u8; 8];
        bytes[0..4].copy_from_slice(&regs[0].to_le_bytes());
        bytes[4..8].copy_from_slice(&regs[1].to_le_bytes());
        u64::from_le_bytes(bytes)
    }
    fn to_regs(self, regs: &mut [u32]) {
        let bytes = self.to_le_bytes();
        regs[0] = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
        regs[1] = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    }
}

Anyway, now r0 and r1 contain two halves of some extremely precise time value. Once again, touchHLE tells dynarmic to resume execution, and it hits the next instruction, which is once again bx lr, so we're back to RandInit yet again.

Now there's just one more call left, namely to _srand. This, too, is a dynamic linker stub, and follows the usual procedure. Here's srand's implementation:

fn srand(env: &mut Environment, seed: u32) {
    env.libc_state.stdlib.rand = seed;
}

Notice this takes a u32, yet the previous function returned a u64. This kind of type mismatch isn't a bug, it's pretty normal in fact. The u32 argument will be read from r0, the same register that half of the time value from earlier was put in. Since iPhone OS is a little-endian system, that first register contains the least significant bits of that time value, and therefore we can be pretty sure they're non-zero and change frequently, so they're a perfectly good seed value.
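A tiny sketch makes the register hand-off explicit. The two helper functions are made up for this example and use shifts instead of the byte-slicing in the real implementation shown earlier, but the effect on r0 and r1 should be the same.

```rust
/// Splits a 64-bit return value across r0 (low half) and r1 (high half).
fn u64_to_regs(value: u64, regs: &mut [u32]) {
    regs[0] = value as u32; // truncation keeps the low 32 bits
    regs[1] = (value >> 32) as u32;
}

/// What a callee with a single u32 parameter sees: whatever is in r0.
fn u32_from_regs(regs: &[u32]) -> u32 {
    regs[0]
}
```

So the seed that srand receives is simply the 64-bit time value truncated to its low 32 bits.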

Anyway, this function returns nothing. In C terms, it “returns void”, but in Rust terms, it “returns ()”, though you don't have to write that out. This type does actually get a GuestRet implementation:

impl GuestRet for () {
    fn to_regs(self, _regs: &mut [u32]) {}
    fn from_regs(_regs: &[u32]) -> Self {}
}

Yep, it does absolutely nothing!

With that done, once again touchHLE tells dynarmic to resume execution, but this time, it never has to stop again for the remainder of RandInit. Super Monkey Ball has finally seeded the random number generator.

Closing words

Thank you for having the patience to keep reading to this point. This post took me more than three days to write! I really hope it'll have succeeded in shedding some light on how touchHLE works. If this has piqued your interest, you might want to have a look at touchHLE's source code, which has lots of comments explaining the what, why and how of various modules and functions. As I said at the beginning, this post is intended to be the beginning of a series, but considering how long this one took, please don't hold your breath for the next installment.

hikari's blog