+++
title = "Linux syscall hooking"
date = 2024-09-26
+++
sometimes, you might want to hook Linux syscalls from a binary. it's probably not for a very elegant reason, but I'm not here to judge. in this article we discuss a few options for doing this. these will be focussed on x86 platforms running Linux, but many of these techniques should also be possible on other ISAs, with appropriate changes.
## binary rewriting
one way to approach the problem is simply to rewrite binaries at runtime. one can set executable text pages to be writeable, swap out some instructions, and then set them back to be read-only and executable. how can this look in practice?
one has to consider the limitations of binary rewriting in this manner. we can only replace syscall instructions with instructions of the same length, so we have limited flexibility in how to implement a syscall hooking function. an absolute call to some custom handler address would need extra instructions to load that address into a register first, and there is no room for them. we can do relative jumps, but a 2-byte short jump only reaches roughly 128 bytes in either direction, so we need to find some space within that range to put our handler function.
### zpoline
zpoline{{ref(n=1)}} is one approach to get around these limitations, on x86 platforms. zpoline replaces all `syscall`/`sysenter` instructions with `call *%rax`. this takes advantage of two things:
1. `rax` will always contain the target syscall number, thus holding a bounded low number. this allows us to rewrite the virtual memory in the range `[0x00, (SYSCALL_MAX + n)]` to have some custom handler code of length `n`.
2. `call *%rax` is the same size as `syscall` (2 bytes){{ref(n=2)}}.
this means we can `mmap(0)` at the start of runtime, and create fall-through nops, followed by a jump to a syscall hook function. the performance of this should be pretty good, not much different to adding an extra function call to each syscall, which is all we want to do anyway.
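the two halves of this idea can be sketched as plain byte manipulation. this is an illustration under stated assumptions, not zpoline's actual code: `rewrite_syscalls()` and `build_trampoline()` are hypothetical names, the byte scan is deliberately naive, and the trampoline tail uses `r11` as scratch on the grounds that the `syscall` instruction overwrites `r11` anyway.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Replace every `syscall` (0f 05) or `sysenter` (0f 34) byte pair in code
 * with `call *%rax` (ff d0). Returns the number of sites patched.
 * Assumption: a plain byte scan is used for brevity. It can match bytes that
 * are really part of another instruction's operand, so a real rewriter needs
 * to decode instruction boundaries first.
 */
size_t rewrite_syscalls(uint8_t *code, size_t len)
{
    size_t patched = 0;

    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0x0f && (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
            code[i] = 0xff;
            code[i + 1] = 0xd0;
            patched++;
            i++; /* skip the second byte of the pair just rewritten */
        }
    }
    return patched;
}

/* size of `movabs $imm64, %r11` (10 bytes) plus `jmp *%r11` (3 bytes) */
#define TRAMPOLINE_TAIL 13

/*
 * Fill buf with the code intended for virtual address 0: one nop per possible
 * syscall number, followed by a jump to the hook function.
 * buf must hold at least max_sysno + 1 + TRAMPOLINE_TAIL bytes.
 * Returns the number of bytes written.
 */
size_t build_trampoline(uint8_t *buf, size_t max_sysno, uint64_t hook)
{
    size_t n = max_sysno + 1;

    memset(buf, 0x90, n); /* nop sled: `call *%rax` lands at offset rax */
    buf[n++] = 0x49;      /* movabs $hook, %r11 */
    buf[n++] = 0xbb;
    for (int i = 0; i < 8; i++)
        buf[n++] = (uint8_t)(hook >> (8 * i)); /* little-endian imm64 */
    buf[n++] = 0x41;      /* jmp *%r11 */
    buf[n++] = 0xff;
    buf[n++] = 0xe3;
    return n;
}
```

the hook still has to recover the syscall number from `rax` and return to the address pushed by `call`, which is left out here.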
okay, but is it ok to `mmap()` the zero virtual address? well, not really. for starters, you probably can't do it as a non-privileged user. Linux has a setting `mmap_min_addr`, which disallows any `mmap()`s below a given address{{ref(n=3)}}. secondly, this can cause certain null-pointer bugs to exhibit weird behaviour, instead of throwing segmentation faults.
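to see whether the zero page is even a candidate on a given machine, the limit can be read from procfs. a small sketch, assuming procfs is mounted at `/proc` (`read_mmap_min_addr()` is a hypothetical helper name):

```c
#include <stdio.h>

/*
 * Read the current vm.mmap_min_addr value from procfs.
 * Returns the limit in bytes, or -1 if the file cannot be read or parsed.
 */
long read_mmap_min_addr(void)
{
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

a non-zero result means that an unprivileged `mmap()` at address zero is expected to fail.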
thus, this option does work, and may be suitable for certain usecases, but it does have large flaws (like all of the options).
### custom trampolines
an alternative that doesn't involve `mmap()`ing the zero page is used by AMD's Onload{{ref(n=4)}} - try to replace
```
b8 NN NN NN NN mov eax,0xN ; N = syscall number
0f 05 syscall
```
with
```
b8 TT TT TT TT mov eax,0xTTTTTTTT ; 0xTTTTTTTT = trampoline func address
ff d0 call rax
```
where `0xTTTTTTTT` is the address of some trampoline function we have crafted. since `mov eax` only takes a 32-bit immediate, this address has to fit in 32 bits. we can get such an address by using `mmap(..., MAP_32BIT)`.
another option here is to try to get some virtual memory within a certain distance from the syscall code, and use relative call instructions.
note that this method is not as reliable. it relies on being able to find syscall-performing code which can be modified in this way, which is not guaranteed. thus, it could fail in some situations. it does circumvent the requirement for an ugly `mmap(0)`, however.
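the pattern match and rewrite can be sketched as follows. this is an illustration only, not Onload's implementation, and `patch_mov_syscall()` is a hypothetical name. one thing the rewrite makes obvious is that the original syscall number is overwritten, so the sketch hands it back to the caller, on the assumption that the trampoline needs to learn it some other way.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look for `mov eax, imm32; syscall` (b8 NN NN NN NN 0f 05) in code.
 * On the first match, store the original syscall number in *orig_nr, rewrite
 * the sequence to `mov eax, tramp; call rax`, and return the match offset.
 * Returns -1 if the pattern is not present.
 */
long patch_mov_syscall(uint8_t *code, size_t len, uint32_t tramp, uint32_t *orig_nr)
{
    for (size_t i = 0; i + 7 <= len; i++) {
        uint8_t *imm = code + i + 1;

        if (code[i] != 0xb8 || code[i + 5] != 0x0f || code[i + 6] != 0x05)
            continue;
        /* the immediate is stored little-endian */
        *orig_nr = (uint32_t)imm[0] | ((uint32_t)imm[1] << 8) |
                   ((uint32_t)imm[2] << 16) | ((uint32_t)imm[3] << 24);
        for (int b = 0; b < 4; b++)
            imm[b] = (uint8_t)(tramp >> (8 * b));
        code[i + 5] = 0xff; /* call rax */
        code[i + 6] = 0xd0;
        return (long)i;
    }
    return -1;
}
```

a return value of `-1` corresponds to the failure case mentioned above: the syscall site does not have the expected shape, so it cannot be hooked this way.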
## syscall user dispatch (SUD)
SUD{{ref(n=5)}} is probably the most "correct" way to go about doing this. this kernel feature was initially created for Wine. SUD allows a userspace program to ask the kernel to trap all of its syscalls, raising a `SIGSYS` signal every time. a certain range of program memory can be excluded from this trapping, and a flag can be used to toggle trapping off or on at runtime (although the excluded range will never trap, regardless of this flag).
this is enabled via `prctl()`:
```c
int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, unsigned long off, unsigned long size, char *selector);
int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);
```
where `off` is the start of the non-trapping memory region, and `size` is its length. `selector` is a pointer to the aforementioned toggleable flag, a single byte which the kernel reads on each syscall.
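SUD is only available on kernels from Linux 5.11 onwards, so it can be worth probing for it first. a minimal sketch: `sud_supported()` is a hypothetical helper, and the fallback constants are assumed to match the values in `linux/prctl.h` for toolchains whose headers predate SUD.

```c
#include <sys/prctl.h>

/* fallback definitions for toolchains whose headers predate SUD */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* selector byte read by the kernel on every syscall while SUD is on */
static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/*
 * Returns 1 if SUD can be switched on for this thread, 0 otherwise.
 * The selector stays at ALLOW throughout, so no syscall traps and no SIGSYS
 * handler is needed for the probe.
 */
int sud_supported(void)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0L, 0L, &sud_selector) != 0)
        return 0;
    /* switch it straight back off; this call is itself allowed through */
    prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);
    return 1;
}
```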
here is a very basic example of using SUD:
```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* selector byte read by the kernel on every syscall while SUD is on */
static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;

static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) {
    char buf[128];
    int len;

    (void)sig;
    (void)ucontext;
    /* snprintf() is not formally async-signal-safe; tolerated in this demo */
    len = snprintf(buf, sizeof(buf), "Caught syscall with number 0x%x\n", info->si_syscall);
    /* stop trapping before making syscalls of our own (including sigreturn) */
    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    write(STDOUT_FILENO, buf, len);
}

int main(void) {
    /* install SIGSYS signal handler */
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = handle_sigsys;
    act.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSYS, &act, NULL) != 0) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }
    /* enable SUD, with no excluded memory region */
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0L, 0L, &selector) != 0) {
        perror("prctl");
        exit(EXIT_FAILURE);
    }
    /* from here on, every syscall raises SIGSYS instead of executing */
    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    syscall(SYS_write);
    return 0;
}
```
running this will produce the following:
```
Caught syscall with number 0x38
```
where syscall number `0x38` is `SYS_write` (on my system; syscall numbers are architecture-specific, so yours may differ).
{% note(header="note") %}
we've had to disable syscall hooking within `handle_sigsys()`, before we perform the `write` syscall. this could also be achieved using the offset exclusion mechanism, with a caveat: SUD also needs to be disabled at signal handler return. this is because, depending on some factors, the signal handler will return to a signal trampoline, which exists somewhere else in memory (VDSO or libc). this trampoline will then perform the `sigreturn` syscall, which - if you had only excluded the handler function memory itself - would cause us to raise a `SIGSYS` inside a `SIGSYS` handler (oops).
{% end %}
SUD should have pretty good performance, although it is still switching from user space to kernel space (on syscall), and back again (on SIGSYS handling) - unlike the binary rewriting methods.
## rewriting the kernel syscall table
Linux kernel modules could also try to patch the kernel's syscall table.
it used to be the case that, if Linux was configured with `CONFIG_KALLSYMS_ALL=y`:
```shell
$ zgrep CONFIG_KALLSYMS_ALL /proc/config.gz
CONFIG_KALLSYMS_ALL=y
```
then we could find the address of the syscall table in `kallsyms`:
```shell
$ sudo grep sys_call_table /proc/kallsyms
ffffffff854017e0 D sys_call_table
```
and thus programmatically:
```c
/* kallsyms_lookup_name() returns an unsigned long, so cast it explicitly */
void **sys_call_table = (void **)kallsyms_lookup_name("sys_call_table");
```
once the syscall table address has been obtained, the handlers for specific syscall(s) could then be overwritten, in order to hook them.
however, this kind of symbol lookup was generally considered very naughty, and `kallsyms_lookup_name()` was unexported in Linux 5.7{{ref(n=6)}}. there should be some feasible alternatives to do the same thing, but they would be slower, uglier, and more painful.
## modifying IA32_LSTAR MSR
another method which has been used successfully{{ref(n=7)}} involves rewriting the x86 MSR `IA32_LSTAR`. this register holds the destination address which the CPU will load into `RIP` when taking a syscall{{ref(n=2)}}. one can write a kernel module which modifies this register to contain a custom syscall handler address. this syscall handler function can hook syscalls before dispatching as normal (or not).
now that `kallsyms_lookup_name()` is no longer exported, this method also becomes problematic. as the example kernel module in {{ref(n=7,nosup=true)}} demonstrates, we still need to be able to grab some kernel symbol addresses in order to correctly implement the dispatcher.
however, this does suggest another idea for implementing the previous method: reading `IA32_LSTAR` to ascertain the address of the Linux syscall handler, instead of looking it up via kallsyms. indeed, Onload has also exploited this method as a workaround for the unexporting of `kallsyms_lookup_name()`{{ref(n=8)}}.
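to peek at the value of `IA32_LSTAR` without writing a kernel module, one option is the `msr` character device. note that this is a swapped-in user-space technique, not what the cited module or Onload do (they read the register from kernel code). the sketch assumes an x86 machine with the `msr` driver loaded, root privileges, and a 64-bit `off_t`. `read_msr()` is a hypothetical helper.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define MSR_IA32_LSTAR 0xc0000082u

/*
 * Read one 64-bit MSR through an msr character device such as
 * /dev/cpu/0/msr, where the file offset selects the register.
 * Returns 0 on success, -1 on failure.
 */
int read_msr(const char *dev, uint32_t msr, uint64_t *value)
{
    int fd = open(dev, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

calling `read_msr("/dev/cpu/0/msr", MSR_IA32_LSTAR, &v)` should then give the syscall entry point for CPU 0. this only reads the register; changing it, or walking from the entry point to the syscall table, still has to happen in kernel code.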
## citations
{{ defref(n=1, url="https://www.usenix.org/conference/atc23/presentation/yasukata") }}
{{ defref(n=2, url="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html") }}
{{ defref(n=3, url="https://elixir.bootlin.com/linux/v6.11/source/arch/s390/mm/mmap.c#L137") }}
{{ defref(n=4, url="https://github.com/Xilinx-CNS/onload/blob/d5984bf52fcfba0cd75df8a1f8828f3363f6e164/src/lib/transport/unix/fdtable.c#L311") }}
{{ defref(n=5, url="https://docs.kernel.org/admin-guide/syscall-user-dispatch.html") }}
{{ defref(n=6, url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bd476e6c67190b5eb7b6e105c8db8ff61103281") }}
{{ defref(n=7, url="https://vvdveen.com/data/lstar.txt") }}
{{ defref(n=8, url="https://github.com/Xilinx-CNS/onload/blob/083e5959ef76632ae3cd3d4356fb079ce0c570d1/src/driver/linux_resource/syscall_x86.c#L347") }}