+++ title = "Linux syscall hooking" date = 2024-09-26 +++

sometimes, you might want to hook Linux syscalls from a binary. it's probably not for a very elegant reason, but I'm not here to judge. in this article we discuss a few options for doing this. these will be focussed on x86 platforms running Linux, but many of these techniques should also be possible on other ISAs, with appropriate changes.

binary rewriting

one way to approach the problem is simply to rewrite binaries at runtime. one can set executable text pages to be writeable, swap out some instructions, and then set them back to be read-only and executable. how can this look in practice?

one has to consider the limitations of binary rewriting in this manner. we can only replace syscall instructions with instructions of the same length, otherwise we would overwrite the instructions that follow. thus, we have limited flexibility in how to implement a syscall hooking function. an absolute call to some custom handler address would need an extra instruction to load that address into a register first, and there is no room for one. we can do relative jumps, but we need to find some space within the maximum range to put our handler function.

zpoline

zpoline{{ref(n=1)}} is one approach to get around these limitations, on x86 platforms. zpoline replaces all syscall/sysenter instructions with call *%rax. this takes advantage of two things:

  1. rax will always contain the target syscall number, thus holding a bounded low number. this allows us to rewrite the virtual memory in the range [0x00, (SYSCALL_MAX + n)] to have some custom handler code of length n.
  2. call *%rax is the same size as syscall (2 bytes){{ref(n=2)}}. this means we can mmap(0) at the start of runtime, and create fall-through nops, followed by a jump to a syscall hook function. the performance of this should be pretty good, not much different to adding an extra function call to each syscall, which is all we want to do anyway.
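the layout described above can be sketched as a buffer builder: a run of nops, one per possible syscall number, followed by a stub that jumps to the hook. this is not zpoline's actual code. NR_SYSCALLS_MAX, build_trampoline() and the choice of r11 as a scratch register are assumptions for illustration (syscall itself clobbers rcx and r11, so callers can't rely on r11 surviving). in real use the buffer would be the page mapped at address zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* assumed upper bound on syscall numbers, for illustration only */
#define NR_SYSCALLS_MAX 512

/* movabs $imm64, %r11 (10 bytes) + jmp *%r11 (3 bytes) */
#define STUB_LEN 13

/* fills `buf` with a nop sled followed by a jump to `handler`.
 * returns the number of bytes written, or 0 if `buf` is too small. */
static size_t build_trampoline(uint8_t *buf, size_t buf_len, uint64_t handler)
{
    if (buf_len < NR_SYSCALLS_MAX + STUB_LEN)
        return 0;

    /* `call *%rax` with rax = N lands on byte N and slides to the stub */
    memset(buf, 0x90, NR_SYSCALLS_MAX);

    uint8_t *stub = buf + NR_SYSCALLS_MAX;
    stub[0] = 0x49;                                /* movabs $handler, %r11 */
    stub[1] = 0xbb;
    memcpy(stub + 2, &handler, sizeof handler);    /* little-endian imm64 */
    stub[10] = 0x41;                               /* jmp *%r11 */
    stub[11] = 0xff;
    stub[12] = 0xe3;

    return NR_SYSCALLS_MAX + STUB_LEN;
}
```

when the handler is reached, rax still holds the syscall number and the return address pushed by call *%rax points just after the rewritten instruction.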

okay, but is it ok to mmap() the zero virtual address? well, not really. for starters, you probably can't do it as a non-privileged user. Linux has a setting mmap_min_addr, which disallows any mmap()s below a given address{{ref(n=3)}}. secondly, this can cause certain null-pointer bugs to exhibit weird behaviour, instead of throwing segmentation faults.

thus, this option does work, and may be suitable for certain use cases, but it does have large flaws (like all of the options).

custom trampolines

an alternative that doesn't involve mmap()ing the zero page is used by AMD's Onload{{ref(n=4)}} - try to replace

b8 NN NN NN NN          mov    eax,0xN         ; N = syscall number
0f 05                   syscall

with

b8 TT TT TT TT          mov    eax,0xTTTTTTTT  ; 0xTTTTTTTT = trampoline func address
ff d0                   call   rax

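where 0xTTTTTTTT is the address of some trampoline function we have crafted. since mov eax only takes a 32-bit immediate, the trampoline must live at a low address. we can get such an address by using mmap(..., MAP_32BIT), which places the mapping in the first 2 GiB of the address space.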
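the rewrite above can be sketched as a pattern match on the 7-byte sequence. this is not Onload's implementation. patch_mov_syscall() and alloc_low_page() are hypothetical helpers, and the sketch assumes the code bytes are already writeable. since the patch destroys the immediate that held the syscall number, the sketch hands the original number back to the caller. one way to use it would be a separate trampoline per syscall number, but that is an assumption.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* rewrites `mov eax, imm32; syscall` into `mov eax, tramp; call rax`.
 * returns 1 and stores the old syscall number in *orig_nr if the pattern
 * matched, otherwise returns 0 and leaves `code` untouched. */
static int patch_mov_syscall(uint8_t *code, uint32_t tramp, uint32_t *orig_nr)
{
    if (code[0] != 0xb8 || code[5] != 0x0f || code[6] != 0x05)
        return 0;

    memcpy(orig_nr, code + 1, sizeof *orig_nr);  /* remember syscall number */
    memcpy(code + 1, &tramp, sizeof tramp);      /* new little-endian imm32 */
    code[5] = 0xff;                              /* call rax */
    code[6] = 0xd0;
    return 1;
}

/* returns a page mapped in the low part of the address space, or NULL.
 * MAP_32BIT only exists on x86-64, hence the guard. */
static void *alloc_low_page(size_t len)
{
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL;
#endif
}
```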

another option here is to try to get some virtual memory within a certain distance from the syscall code, and use relative call instructions.
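a relative call (e8 followed by a 32-bit displacement) is 5 bytes long, so it can't replace a lone 2-byte syscall, but it would fit over the 7-byte mov + syscall pair with 2 bytes of nop padding. the following sketch shows the displacement calculation and range check. encode_rel_call() is a hypothetical helper, and applying it to the 7-byte pattern is my assumption rather than something taken from Onload.

```c
#include <stdint.h>
#include <string.h>

/* encodes `call rel32` for a call instruction located at address `site`
 * that should reach `target`. returns 1 on success, or 0 if the target is
 * outside the signed 32-bit range of the displacement. */
static int encode_rel_call(uint8_t out[5], uint64_t site, uint64_t target)
{
    /* the displacement is relative to the end of the 5-byte instruction */
    int64_t disp = (int64_t)(target - (site + 5));
    int32_t rel;

    if (disp < INT32_MIN || disp > INT32_MAX)
        return 0;

    rel = (int32_t)disp;
    out[0] = 0xe8;                       /* call rel32 opcode */
    memcpy(out + 1, &rel, sizeof rel);   /* little-endian displacement */
    return 1;
}
```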

note that this method is not as reliable. it relies on being able to find syscall-performing code which can be modified in this way, which is not guaranteed. thus, it could fail in some situations. it does circumvent the requirement for an ugly mmap(0), however.

syscall user dispatch (SUD)

SUD{{ref(n=5)}} is probably the most "correct" way to go about doing this. this kernel feature was initially created for Wine. SUD allows a userspace program to request that the kernel trap all of its syscalls, raising a SIGSYS signal every time. a certain range of program memory can be excluded from this trapping, and a flag can be used to toggle trapping off or on at runtime (although the excluded range will never trap, regardless of this flag).

this is enabled via prctl():

int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, unsigned long off, unsigned long size, int8_t *switch);
int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);

where off is the start of the non-trapping memory region, and size is its length. switch is a pointer to the aforementioned toggleable flag.

here is a very basic example of using SUD:

#define _GNU_SOURCE

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* the kernel reads this byte on every syscall to decide whether to trap */
static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;

static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) {
    char buf[128];
    int len;

    (void)sig;
    (void)ucontext;

    len = snprintf(buf, sizeof(buf), "Caught syscall with number 0x%x\n", info->si_syscall);
    /* stop trapping, so that write() and the later sigreturn go through */
    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    write(STDOUT_FILENO, buf, len);
}

int main(void) {
    /* install SIGSYS signal handler */
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = handle_sigsys;
    act.sa_flags = SA_SIGINFO;

    if (sigaction(SIGSYS, &act, NULL) != 0) {
        perror("sigaction failed");
        exit(EXIT_FAILURE);
    }

    /* enable SUD, with no excluded memory region */
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &selector) != 0) {
        perror("prctl failed");
        exit(EXIT_FAILURE);
    }

    /* start trapping: the next syscall raises SIGSYS instead of executing */
    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    syscall(SYS_write);

    return 0;
}

running this will produce the following:

Caught syscall with number 0x38

where syscall number 0x38 is SYS_write (on my system).

{% note(header="note") %} we've had to disable syscall hooking within handle_sigsys(), before we perform the write syscall. this could also be achieved using the offset exclusion mechanism, with a caveat: SUD also needs to be disabled at signal handler return. this is because, depending on some factors, the signal handler will return to a signal trampoline, which exists somewhere else in memory (VDSO or libc). this trampoline will then perform the sigreturn syscall, which - if you had only excluded the handler function memory itself - would cause us to raise a SIGSYS inside a SIGSYS handler (oops). {% end %}

SUD should have pretty good performance, although it is still switching from user space to kernel space (on syscall), and back again (on SIGSYS handling) - unlike the binary rewriting methods.

rewriting the kernel syscall table

Linux kernel modules could also try to patch the kernel's syscall table.

it used to be the case that, if Linux was configured with CONFIG_KALLSYMS_ALL=y:

$ zgrep CONFIG_KALLSYMS_ALL /proc/config.gz
CONFIG_KALLSYMS_ALL=y

then we could find the address of the syscall table in kallsyms:

$ sudo grep sys_call_table /proc/kallsyms
ffffffff854017e0 D sys_call_table

and thus programmatically:

/* kallsyms_lookup_name() returns an unsigned long, so a cast is needed */
void **sys_call_table = (void **)kallsyms_lookup_name("sys_call_table");

once the syscall table address has been obtained, the handlers for specific syscall(s) could then be overwritten, in order to hook them. however, this kind of symbol lookup was generally considered very naughty, and kallsyms_lookup_name() was unexported in Linux 5.7{{ref(n=6)}}. there should be some feasible alternatives to do the same thing, but they would be slower, uglier, and more painful.

modifying IA32_LSTAR MSR

another method which has been used successfully{{ref(n=7)}} involves rewriting the x86 MSR IA32_LSTAR. this register holds the destination address which the CPU will load into RIP when taking a syscall{{ref(n=2)}}. one can write a kernel module which modifies this register to contain a custom syscall handler address. this syscall handler function can hook syscalls before dispatching as normal (or not).
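to get a feel for what this register holds, it can be inspected from userspace through the msr driver. the sketch below assumes the msr kernel module is loaded, that /dev/cpu/0/msr exists, and that the process is privileged enough to read it. read_lstar() is a name made up for this example. a kernel module would instead use the kernel's own MSR accessors (rdmsrl() and wrmsrl()).

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define MSR_LSTAR 0xC0000082u   /* architectural index of IA32_LSTAR */

/* reads IA32_LSTAR of CPU 0 into *out. returns 0 on success, -1 on any
 * failure (no msr driver, insufficient privileges, short read). */
static int read_lstar(uint64_t *out)
{
    ssize_t n;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0)
        return -1;

    /* the msr device uses the file offset as the MSR index */
    n = pread(fd, out, sizeof *out, MSR_LSTAR);
    close(fd);

    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

on success, the value is the kernel virtual address of the syscall entry point on that CPU.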

now that kallsyms_lookup_name() is no longer exported, this method also becomes problematic. as the example kernel module in {{ref(n=7,nosup=true)}} demonstrates, we still need to be able to grab some kernel symbol addresses in order to correctly implement the dispatcher.

however, this does suggest another idea for implementing the previous method: using IA32_LSTAR to ascertain the address of the Linux syscall handler, instead of kallsyms. indeed, Onload has also exploited this method as a workaround for the unexporting of kallsyms_lookup_name(){{ref(n=8)}}.

citations

{{ defref(n=1, url="https://www.usenix.org/conference/atc23/presentation/yasukata") }}

{{ defref(n=2, url="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html") }}

{{ defref(n=3, url="https://elixir.bootlin.com/linux/v6.11/source/arch/s390/mm/mmap.c#L137") }}

{{ defref(n=4, url="d5984bf52f/src/lib/transport/unix/fdtable.c (L311)") }}

{{ defref(n=5, url="https://docs.kernel.org/admin-guide/syscall-user-dispatch.html") }}

{{ defref(n=6, url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bd476e6c67190b5eb7b6e105c8db8ff61103281") }}

{{ defref(n=7, url="https://vvdveen.com/data/lstar.txt") }}

{{ defref(n=8, url="083e5959ef/src/driver/linux_resource/syscall_x86.c (L347)") }}