+++ title = "Linux syscall hooking" date = 2024-09-26 +++
sometimes, you might want to hook Linux syscalls from a binary. it's probably not for a very elegant reason, but I'm not here to judge. in this article we discuss a few options for doing this. these will be focussed on x86 platforms running Linux, but many of these techniques should also be possible on other ISAs, with appropriate changes.
binary rewriting
one way to approach the problem is simply to rewrite binaries at runtime. one can set executable text pages to be writeable, swap out some instructions, and then set them back to be read-only and executable. how can this look in practice?
one has to consider the limitations of binary rewriting in this manner. we can only replace syscall instructions with instructions of the same length. thus, we have limited flexibility in how to implement a syscall hooking function. absolute calls to some custom handler address would require further code modification than is possible, in order to prepare a register with the jump address. we can do relative jumps, but we need to find some space within the maximum range to put our handler function.
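as a rough illustration of the page-permission dance described above, here is a minimal sketch. the helper names patch_code() and demo_patch() are made up for this example, and the demo bytes assume x86-64; it patches a freshly mapped page rather than a real syscall site, and it ignores other threads that might be executing the page while it is briefly non-executable.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical helper: overwrite `len` bytes of code at `target`,
 * making the containing page(s) writable only while patching.
 * returns 0 on success, -1 on failure. */
static int patch_code(void *target, const void *patch, size_t len)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)target & ~(page_size - 1);
    size_t span = ((uintptr_t)target + len) - start;

    if (mprotect((void *)start, span, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy(target, patch, len);
    /* needed on some ISAs to flush the instruction cache; harmless on x86 */
    __builtin___clear_cache((char *)target, (char *)target + len);
    return mprotect((void *)start, span, PROT_READ | PROT_EXEC);
}

/* hypothetical demo: build `mov eax, 1; ret` in an executable page, then
 * patch the immediate to 2 and run it. returns the value computed by the
 * patched code, or -1 on failure (or on non-x86-64 builds). */
static int demo_patch(void)
{
#if defined(__x86_64__)
    static const uint8_t code[] = { 0xb8, 0x01, 0x00, 0x00, 0x00, 0xc3 };
    static const uint8_t new_imm[] = { 0x02, 0x00, 0x00, 0x00 };
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    int result = -1;

    uint8_t *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    memcpy(page, code, sizeof(code));
    if (mprotect(page, page_size, PROT_READ | PROT_EXEC) == 0 &&
        patch_code(page + 1, new_imm, sizeof(new_imm)) == 0) {
        int (*fn)(void) = (int (*)(void))(uintptr_t)page;
        result = fn();
    }
    munmap(page, page_size);
    return result;
#else
    return -1;
#endif
}
```

note that the replacement is the same length as what it overwrites, which is exactly the constraint discussed above.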
zpoline
zpoline{{ref(n=1)}} is one approach to get around these limitations, on x86 platforms. zpoline replaces all syscall/sysenter instructions with call *%rax. this takes advantage of two things:
- rax will always contain the target syscall number, thus holding a bounded low number. this allows us to rewrite the virtual memory in the range [0x00, (SYSCALL_MAX + n)] to have some custom handler code of length n.
- call *%rax is the same size as syscall (2 bytes){{ref(n=2)}}.
this means we can mmap(0) at the start of runtime, and create fall-through nops, followed by a jump to a syscall hook function. the performance of this should be pretty good, not much different to adding an extra function call to each syscall, which is all we want to do anyway.
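the two halves of this trick can be sketched on plain byte buffers. the function names below are invented for illustration, and this is not zpoline's actual code: in particular, a naive byte scan like this can match 0f 05 inside an immediate or a longer instruction, so a real rewriter would need to disassemble properly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* hypothetical: replace each `syscall` (0f 05) or `sysenter` (0f 34) byte
 * pair in `code` with `call *%rax` (ff d0). returns the number of sites
 * patched. assumes `code` is already writable. */
static size_t rewrite_syscalls(uint8_t *code, size_t len)
{
    size_t patched = 0;

    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0x0f && (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
            code[i] = 0xff;
            code[i + 1] = 0xd0;
            patched++;
            i++; /* skip the second byte of the pair we just rewrote */
        }
    }
    return patched;
}

/* hypothetical: lay out the page that would live at virtual address 0.
 * bytes [0, nr_syscalls) are single-byte nops (0x90), so a call to any
 * syscall number slides down into the hook stub copied right after them. */
static void build_sled(uint8_t *page, size_t nr_syscalls,
                       const uint8_t *stub, size_t stub_len)
{
    memset(page, 0x90, nr_syscalls);
    memcpy(page + nr_syscalls, stub, stub_len);
}
```

in a real setup, `page` would be the mapping at address zero and `stub` would jump to the hook function.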
okay, but is it ok to mmap() the zero virtual address? well, not really. for starters, you probably can't do it as a non-privileged user. Linux has a setting mmap_min_addr, which disallows any mmap()s below a given address{{ref(n=3)}}. secondly, this can cause certain null-pointer bugs to exhibit weird behaviour, instead of throwing segmentation faults.
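to check how a given machine is configured, something like the following sketch could be used. the helper names are hypothetical; it assumes the setting is exposed at /proc/sys/vm/mmap_min_addr, and that a MAP_FIXED request at address 0 is a fair probe for whether the zero page can be mapped.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical: returns the value of vm.mmap_min_addr, or -1 if unreadable */
static long read_mmap_min_addr(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* hypothetical: returns 1 if this process may map the zero page, else 0.
 * the mapping is removed again straight away. */
static int can_map_zero_page(void)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    if (p == MAP_FAILED)
        return 0;
    munmap(p, page_size);
    return 1;
}
```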
thus, this option does work, and may be suitable for certain usecases, but it does have large flaws (like all of the options).
custom trampolines
an alternative that doesn't involve mmap()ing the zero page is used by AMD's Onload{{ref(n=4)}}: try to replace
b8 NN NN NN NN mov eax,0xN ; N = syscall number
0f 05 syscall
with
b8 TT TT TT TT mov eax,0xTTTTTTTT ; 0xTTTTTTTT = trampoline func address
ff d0 call rax
where 0xTTTTTTTT is the address of some trampoline function we have crafted. since mov eax only takes a 32-bit immediate, the trampoline has to live in low memory; we can get an address to put this function at by using mmap(..., MAP_32BIT).
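the byte-level replacement shown above can be sketched as follows. the helper names are invented for this example and this is not Onload's code. one thing the sketch makes visible: after patching, eax holds the trampoline address rather than the syscall number, so the trampoline has to know which syscall it is standing in for.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* hypothetical: does `p` point at `mov eax, imm32; syscall` (7 bytes)? */
static int is_mov_eax_syscall(const uint8_t *p)
{
    return p[0] == 0xb8 && p[5] == 0x0f && p[6] == 0x05;
}

/* hypothetical: overwrite a 7-byte site with
 * `mov eax, trampoline_addr; call rax`. the immediate is stored
 * little-endian, as x86 expects. */
static void encode_trampoline_call(uint8_t site[7], uint32_t trampoline_addr)
{
    site[0] = 0xb8;
    site[1] = (uint8_t)(trampoline_addr & 0xff);
    site[2] = (uint8_t)((trampoline_addr >> 8) & 0xff);
    site[3] = (uint8_t)((trampoline_addr >> 16) & 0xff);
    site[4] = (uint8_t)((trampoline_addr >> 24) & 0xff);
    site[5] = 0xff;
    site[6] = 0xd0;
}

/* hypothetical: get memory whose address fits in 32 bits, to hold the
 * trampoline. MAP_32BIT only exists on x86-64, so return NULL elsewhere.
 * the caller would write the trampoline, then mprotect() it executable. */
static void *alloc_low_page(size_t size)
{
#ifdef MAP_32BIT
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)size;
    return NULL;
#endif
}
```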
another option here is to try to get some virtual memory within a certain distance from the syscall code, and use relative call instructions.
note that this method is not as reliable. it relies on being able to find syscall-performing code which can be modified in this way, which is not guaranteed. thus, it could fail in some situations. it does circumvent the requirement for an ugly mmap(0), however.
syscall user dispatch (SUD)
SUD{{ref(n=5)}} is probably the most "correct" way to go about doing this. this kernel feature was initially created for Wine. SUD allows a userspace program to request that the kernel trap all syscalls, raising a SIGSYS signal every time. a certain range of program memory can be excluded from this trapping, and a flag can be used to toggle trapping off or on at runtime (although the excluded range will never trap, regardless of this flag).
this is enabled via prctl():
int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, unsigned long off, unsigned long size, int8_t *switch);
int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);
where off is the start of the non-trapping memory region, and size is its length. switch is a pointer to the aforementioned toggleable flag.
here is a very basic example of using SUD:
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>

/* selector byte, read by the kernel on every syscall: ALLOW or BLOCK */
static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;

static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) {
    char buf[128];
    int len;

    (void)sig;
    (void)ucontext;

    /* stop trapping before the handler makes any syscalls of its own */
    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    len = snprintf(buf, sizeof(buf), "Caught syscall with number 0x%x\n",
                   (unsigned int)info->si_syscall);
    write(1, buf, len);
}

int main(void) {
    /* install SIGSYS signal handler */
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = handle_sigsys;
    act.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSYS, &act, NULL) != 0) {
        perror("sigaction failed");
        exit(EXIT_FAILURE);
    }

    /* enable SUD, with an empty excluded region */
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0L, 0L, &selector) != 0) {
        perror("prctl failed");
        exit(EXIT_FAILURE);
    }

    /* from here on, every syscall raises SIGSYS instead of running */
    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    syscall(SYS_write);
    return 0;
}
running this will produce the following:
Caught syscall with number 0x38
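where syscall number 0x38 is SYS_write (on my system). syscall numbers are ABI-specific, so the value printed elsewhere may differ (on x86-64, for example, SYS_write is 1).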
{% note(header="note") %}
we've had to disable syscall hooking within handle_sigsys(), before we perform the write syscall. this could also be achieved using the offset exclusion mechanism, with a caveat: SUD also needs to be disabled at signal handler return. this is because, depending on some factors, the signal handler will return to a signal trampoline, which exists somewhere else in memory (VDSO or libc). this trampoline will then perform the sigreturn syscall, which - if you had only excluded the handler function memory itself - would cause us to raise a SIGSYS inside a SIGSYS handler (oops).
{% end %}
SUD should have pretty good performance, although it is still switching from user space to kernel space (on syscall), and back again (on SIGSYS handling) - unlike the binary rewriting methods.
rewriting the kernel syscall table
Linux kernel modules could also try to patch the kernel's syscall table.
it used to be the case that if Linux is configured with CONFIG_KALLSYMS_ALL=y:
$ zgrep CONFIG_KALLSYMS_ALL /proc/config.gz
CONFIG_KALLSYMS_ALL=y
then we can find the address of the syscall table in kallsyms:
$ sudo grep sys_call_table /proc/kallsyms
ffffffff854017e0 D sys_call_table
and thus programmatically:
/* kallsyms_lookup_name() returns the address as an unsigned long */
unsigned long *sys_call_table = (unsigned long *)kallsyms_lookup_name("sys_call_table");
once the syscall table address has been obtained, the handlers for specific syscall(s) could then be overwritten, in order to hook them.
however, this kind of symbol lookup was generally considered very naughty, and kallsyms_lookup_name() was unexported in Linux 5.7{{ref(n=6)}}. there should be some feasible alternatives to do the same thing, but they would be slower, uglier, and more painful.
modifying IA32_LSTAR MSR
another method which has been used successfully{{ref(n=7)}} involves rewriting the x86 MSR IA32_LSTAR. this register holds the destination address which the CPU will load into RIP when taking a syscall{{ref(n=2)}}. one can write a kernel module which modifies this register to contain a custom syscall handler address. this syscall handler function can hook syscalls before dispatching as normal (or not).
now that kallsyms_lookup_name() is no longer exported, this method also becomes problematic. as the example kernel module in {{ref(n=7,nosup=true)}} demonstrates, we still need to be able to grab some kernel symbol addresses in order to correctly implement the dispatcher.
however, this does suggest another idea for implementing the previous method: using IA32_LSTAR to ascertain the address of the Linux syscall handler, instead of kallsyms. indeed, Onload has also exploited this method as a workaround for the unexporting of kallsyms_lookup_name(){{ref(n=8)}}.
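for a quick look at what IA32_LSTAR holds without writing a module, the value can be read from userspace. this is only a sketch under some assumptions: the msr driver is loaded (so /dev/cpu/N/msr exists), the process is privileged enough to open it, and the kernel isn't locked down. read_lstar() is a made-up name; a kernel module would use the kernel's own MSR accessors instead.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* architectural MSR number of IA32_LSTAR */
#define MSR_LSTAR 0xC0000082u

/* hypothetical: read IA32_LSTAR for the given CPU through the msr driver,
 * where the file offset selects the MSR number.
 * returns 0 and fills *value on success, -1 on any failure. */
static int read_lstar(unsigned int cpu, uint64_t *value)
{
    char path[64];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/dev/cpu/%u/msr", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), (off_t)MSR_LSTAR);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

on success, the value is the kernel-space address of the syscall entry point, which is the starting point for locating the dispatcher as described above.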
citations
{{ defref(n=1, url="https://www.usenix.org/conference/atc23/presentation/yasukata") }}
{{ defref(n=2, url="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html") }}
{{ defref(n=3, url="https://elixir.bootlin.com/linux/v6.11/source/arch/s390/mm/mmap.c#L137") }}
{{ defref(n=4, url="d5984bf52f/src/lib/transport/unix/fdtable.c (L311)") }}
{{ defref(n=5, url="https://docs.kernel.org/admin-guide/syscall-user-dispatch.html") }}
{{ defref(n=6, url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bd476e6c67190b5eb7b6e105c8db8ff61103281") }}
{{ defref(n=7, url="https://vvdveen.com/data/lstar.txt") }}
{{ defref(n=8, url="083e5959ef/src/driver/linux_resource/syscall_x86.c (L347)") }}