From ff813fa8525236f41f8ff2c81bae0037d098421c Mon Sep 17 00:00:00 2001
From: Jack Bond-Preston
Date: Fri, 27 Sep 2024 00:03:54 +0100
Subject: [PATCH] add syscall article

---
 content/syscalls.md              | 164 +++++++++++++++++++++++++++++++
 content/学中文学了一年了.md      |   1 -
 sass/style/article.scss          |  24 +++++
 static/images/info.svg           |   2 +
 templates/shortcodes/defref.html |   1 +
 templates/shortcodes/note.html   |  12 +++
 templates/shortcodes/ref.html    |   1 +
 7 files changed, 204 insertions(+), 1 deletion(-)
 create mode 100644 content/syscalls.md
 create mode 100644 static/images/info.svg
 create mode 100644 templates/shortcodes/defref.html
 create mode 100644 templates/shortcodes/note.html
 create mode 100644 templates/shortcodes/ref.html

diff --git a/content/syscalls.md b/content/syscalls.md
new file mode 100644
index 0000000..1f7b833
--- /dev/null
+++ b/content/syscalls.md
@@ -0,0 +1,164 @@
++++
+title = "Linux syscall hooking"
+date = 2024-09-26
++++
+
+sometimes, you might want to hook Linux syscalls from a binary. it's probably not for a very elegant reason, but I'm not here to judge. in this article we discuss a few options for doing this. these will be focussed on x86 platforms running Linux, but many of these techniques should also be possible on other ISAs, with appropriate changes.
+
+## binary rewriting
+one way to approach the problem is simply to rewrite binaries at runtime. one can set executable text pages to be writeable, swap out some instructions, and then set them back to be read-only and executable. how can this look in practice?
+
+one has to consider the limitations of binary rewriting in this manner. we can only replace syscall instructions with instructions of the same length. thus, we have limited flexibility in how to implement a syscall hooking function. absolute calls to some custom handler address would require more code modification than is possible, in order to prepare a register with the jump address. we can do relative jumps, but we need to find some space within the maximum range to put our handler function.
+
+### zpoline
+zpoline{{ref(n=1)}} is one approach to get around these limitations, on x86 platforms. zpoline replaces all `syscall`/`sysenter` instructions with `call *%rax`. this takes advantage of two things:
+1. `rax` will always contain the target syscall number, thus holding a bounded low number. this allows us to rewrite the virtual memory in the range `[0x00, (SYSCALL_MAX + n)]` to have some custom handler code of length `n`.
+2. `call *%rax` is the same size as `syscall` (2 bytes){{ref(n=2)}}.
+this means we can `mmap(0)` at the start of runtime, and create fall-through nops, followed by a jump to a syscall hook function. the performance of this should be pretty good, not much different to adding an extra function call to each syscall, which is all we want to do anyway.
+
+okay, but is it ok to `mmap()` the zero virtual address? well, not really. for starters, you probably can't do it as a non-privileged user. Linux has a setting `mmap_min_addr`, which disallows any `mmap()`s below a given address{{ref(n=3)}}. secondly, this can cause certain null-pointer bugs to exhibit weird behaviour, instead of throwing segmentation faults.
+
+thus, this option does work, and may be suitable for certain use cases, but it does have large flaws (like all of the options).
+
+### custom trampolines
+an alternative that doesn't involve `mmap()`ing the zero page is used by AMD's Onload{{ref(n=4)}} - try to replace
+```
+b8 NN NN NN NN    mov eax,0xN           ; N = syscall number
+0f 05             syscall
+```
+with
+```
+b8 TT TT TT TT    mov eax,0xTTTTTTTT    ; 0xTTTTTTTT = trampoline func address
+ff d0             call rax
+```
+where `0xTTTTTTTT` is the address of some trampoline function we have crafted. we can get an address to put this function at by using `mmap(..., MAP_32BIT)`.
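
as a rough sketch of the byte-level rewrite described above (this is not Onload's code; the helper names `is_patchable_site`, `encode_trampoline_call` and `alloc_trampoline_page` are made up for illustration), the following C checks whether 7 bytes match `mov eax, imm32; syscall`, builds the replacement bytes, and asks for a low page with `MAP_32BIT`. it assumes x86-64 Linux, and it leaves out the `mprotect()` steps needed to actually write into text pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* `mov eax, imm32` (5 bytes) followed by `syscall` (2 bytes) */
#define SITE_LEN 7

/* hypothetical helper: does `p` point at `b8 NN NN NN NN 0f 05`? */
int is_patchable_site(const uint8_t *p) {
    return p[0] == 0xb8 && p[5] == 0x0f && p[6] == 0x05;
}

/* hypothetical helper: write `mov eax, addr; call rax` into out[0..6] */
void encode_trampoline_call(uint8_t out[SITE_LEN], uint32_t addr) {
    assert(out != NULL);
    out[0] = 0xb8;                          /* mov eax, imm32 */
    out[1] = (uint8_t)(addr & 0xff);        /* imm32, little-endian */
    out[2] = (uint8_t)((addr >> 8) & 0xff);
    out[3] = (uint8_t)((addr >> 16) & 0xff);
    out[4] = (uint8_t)((addr >> 24) & 0xff);
    out[5] = 0xff;                          /* call rax */
    out[6] = 0xd0;
}

/* hypothetical helper: get memory whose address fits in 32 bits.
 * the page is writable here; it would be made executable later. */
void *alloc_trampoline_page(size_t len) {
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL; /* MAP_32BIT is only defined on x86-64 */
#endif
}
```

note that `mov eax, imm32` zero-extends into `rax`, which is why the trampoline has to live below 4 GiB. the patched `mov` also overwrites the syscall number, so presumably each trampoline has to know which syscall it stands in for.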
+
+another option here is to try to get some virtual memory within a certain distance from the syscall code, and use relative call instructions.
+
+note that this method is not as reliable. it relies on being able to find syscall-performing code which can be modified in this way, which is not guaranteed. thus, it could fail in some situations. it does circumvent the requirement for an ugly `mmap(0)`, however.
+
+## syscall user dispatch (SUD)
+SUD{{ref(n=5)}} is probably the most "correct" way to go about doing this. this kernel feature was initially created for Wine. SUD allows the userspace program to request that the kernel trap all syscalls, raising a `SIGSYS` signal every time. a certain range of program memory can be excluded from this trapping, and a flag can be used to toggle trapping off or on at runtime (although the excluded range will never trap, regardless of this flag).
+
+this is enabled via `prctl()`:
+```c
+int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, unsigned long off, unsigned long size, int8_t *switch);
+int prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);
+```
+where `off` is the start of the non-trapping memory region, and `size` is its length. `switch` is a pointer to the aforementioned toggleable flag.
+
+here is a very basic example of using SUD:
+```c
+#define _GNU_SOURCE
+
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* selector byte, read by the kernel on every syscall: ALLOW or BLOCK */
+static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;
+
+static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) {
+    char buf[128];
+    int len;
+
+    (void)sig;
+    (void)ucontext;
+
+    len = snprintf(buf, sizeof(buf), "Caught syscall with number 0x%x\n", info->si_syscall);
+    /* stop trapping before we make syscalls of our own (write, then sigreturn) */
+    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
+    write(1, buf, len);
+}
+
+int main(void) {
+    int err;
+
+    /* install SIGSYS signal handler */
+    struct sigaction act;
+    memset(&act, 0, sizeof(act));
+
+    sigset_t mask;
+    sigemptyset(&mask);
+
+    act.sa_sigaction = handle_sigsys;
+    act.sa_flags = SA_SIGINFO;
+    act.sa_mask = mask;
+
+    if ((err = sigaction(SIGSYS, &act, NULL)) != 0) {
+        printf("sigaction failed: %d\n", err);
+        exit(-1);
+    }
+
+    /* enable SUD, with no excluded memory region */
+    if ((err = prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0L, 0L, &selector)) != 0) {
+        printf("prctl failed: %d\n", err);
+        exit(-1);
+    }
+
+    /* from here on, every syscall raises SIGSYS */
+    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
+    syscall(SYS_write);
+
+    return 0;
+}
+```
+running this will produce the following:
+```
+Caught syscall with number 0x38
+```
+where syscall number `0x38` is `SYS_write` (on my system).
+
+{% note(header="note") %}
+we've had to disable syscall hooking within `handle_sigsys()`, before we perform the `write` syscall. this could also be achieved using the offset exclusion mechanism, with a caveat: SUD also needs to be disabled at signal handler return. this is because, depending on some factors, the signal handler will return to a signal trampoline, which exists somewhere else in memory (vDSO or libc). this trampoline will then perform the `sigreturn` syscall, which - if you had only excluded the handler function memory itself - would cause us to raise a `SIGSYS` inside a `SIGSYS` handler (oops).
+{% end %}
+
+SUD should have pretty good performance, although it is still switching from user space to kernel space (on syscall), and back again (on SIGSYS handling) - unlike the binary rewriting methods.
+
+## rewriting the kernel syscall table
+Linux kernel modules could also try to patch the kernel's syscall table.
+
+it used to be the case that if Linux is configured with `CONFIG_KALLSYMS_ALL=y`:
+```shell
+$ zgrep CONFIG_KALLSYMS_ALL /proc/config.gz
+CONFIG_KALLSYMS_ALL=y
+```
+then we can find the address of the syscall table in `kallsyms`:
+```shell
+$ sudo grep sys_call_table /proc/kallsyms
+ffffffff854017e0 D sys_call_table
+```
+and thus programmatically:
+```c
+void **sys_call_table = (void **)kallsyms_lookup_name("sys_call_table");
+```
+once the syscall table address has been obtained, the handlers for specific syscalls could then be overwritten, in order to hook them.
+however, this kind of symbol lookup was generally considered very naughty, and `kallsyms_lookup_name()` was unexported in Linux 5.7{{ref(n=6)}}. there should be some feasible alternatives to do the same thing, but they would be slower, uglier, and more painful.
+
+## modifying IA32_LSTAR MSR
+
+another method which has been used successfully{{ref(n=7)}} involves rewriting the x86 MSR `IA32_LSTAR`. this register holds the destination address which the CPU will load into `RIP` when taking a syscall{{ref(n=2)}}. one can write a kernel module which modifies this register to contain a custom syscall handler address. this syscall handler function can hook syscalls before dispatching as normal (or not).
+
+now that `kallsyms_lookup_name()` is no longer exported, this method also becomes problematic. as the example kernel module in {{ref(n=7,nosup=true)}} demonstrates, we still need to be able to grab some kernel symbol addresses in order to correctly implement the dispatcher.
+
+however, this does suggest another idea for implementing the previous method: using `IA32_LSTAR` to ascertain the address of the Linux syscall handler, instead of kallsyms. indeed, Onload has also exploited this method as a workaround for the unexporting of `kallsyms_lookup_name()`{{ref(n=8)}}.
+
+
+## citations
+
+{{ defref(n=1, url="https://www.usenix.org/conference/atc23/presentation/yasukata") }}
+
+{{ defref(n=2, url="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html") }}
+
+{{ defref(n=3, url="https://elixir.bootlin.com/linux/v6.11/source/arch/s390/mm/mmap.c#L137") }}
+
+{{ defref(n=4, url="https://github.com/Xilinx-CNS/onload/blob/d5984bf52fcfba0cd75df8a1f8828f3363f6e164/src/lib/transport/unix/fdtable.c#L311") }}
+
+{{ defref(n=5, url="https://docs.kernel.org/admin-guide/syscall-user-dispatch.html") }}
+
+{{ defref(n=6, url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bd476e6c67190b5eb7b6e105c8db8ff61103281") }}
+
+{{ defref(n=7, url="https://vvdveen.com/data/lstar.txt") }}
+
+{{ defref(n=8, url="https://github.com/Xilinx-CNS/onload/blob/083e5959ef76632ae3cd3d4356fb079ce0c570d1/src/driver/linux_resource/syscall_x86.c#L347") }}
\ No newline at end of file
diff --git a/content/学中文学了一年了.md b/content/学中文学了一年了.md
index 5043914..5c22012 100644
--- a/content/学中文学了一年了.md
+++ b/content/学中文学了一年了.md
@@ -1,7 +1,6 @@
 +++
 title = "学中文学了一年了"
 date = 2024-09-23
-template = "article.html"
 +++
 
 大约一年前,我开始认真地学习中文。因此,我觉得我应该写一下我的经历。
diff --git a/sass/style/article.scss b/sass/style/article.scss
index 59665c8..d589452 100644
--- a/sass/style/article.scss
+++ b/sass/style/article.scss
@@ -76,4 +76,28 @@
             color: $body-color;
         }
     }
+
+    .note {
+        line-height: inherit;
+        margin: 1.25em 0;
+        padding: 6px 12px;
+        display: flex;
+        gap: 12px;
+        border-left: 1px solid $heading-color;
+
+        .icon svg {
+            height: 1.3em;
+            display: flex;
+            align-items: center;
+            color: $heading-color;
+        }
+
+        p {
+            margin-top: 0;
+        }
+
+        p:last-child {
+            margin-bottom: 0;
+        }
+    }
 }
diff --git a/static/images/info.svg b/static/images/info.svg
new file mode 100644
index 0000000..ecf7a42
--- /dev/null
+++ b/static/images/info.svg
@@ -0,0 +1,2 @@
+
+
\ No newline at end of file
diff --git a/templates/shortcodes/defref.html b/templates/shortcodes/defref.html
new file mode 100644
index 0000000..8e91721
--- /dev/null
+++ b/templates/shortcodes/defref.html
@@ -0,0 +1 @@
+

[{{ n }}] {{ url }} (↑)

\ No newline at end of file
diff --git a/templates/shortcodes/note.html b/templates/shortcodes/note.html
new file mode 100644
index 0000000..6566f23
--- /dev/null
+++ b/templates/shortcodes/note.html
@@ -0,0 +1,12 @@
+
+ {% set icon = load_data(path="static/images/info.svg") %} +
+ {{ icon | safe }} +
+
+ {% if header %} +

{{ header }}

+ {% endif %} + {{ body | markdown | safe }} +
+
\ No newline at end of file
diff --git a/templates/shortcodes/ref.html b/templates/shortcodes/ref.html
new file mode 100644
index 0000000..0871b3a
--- /dev/null
+++ b/templates/shortcodes/ref.html
@@ -0,0 +1 @@
+{% if not nosup %}{% endif %}[{{ n }}]{% if not nosup %}{% endif %}
\ No newline at end of file