kernel - Improve debugging of spurious interrupts

* Report spurious T_RESERVED interrupt vectors / trap numbers.  Report
  the actual trap number and try to ignore it.  If this occurs, it helps
  with debugging because a cold-boot 'vmstat -i -v' can be matched up
  against the spurious interrupt number (usually spurious# - 0x20) to
  locate the PCI device causing the problem.
Kernel - Additional cpu bug hardening part 2/2

* Due to speculative instruction execution, the kernel may speculatively
  execute instructions using data from registers that still contain
  userland-controlled content.  Reduce the chance of this situation
  arising by proactively clearing all user registers after saving them
  for syscalls, exceptions, and interrupts.

  In addition, for system calls, zero out any unrestored registers on
  return to avoid leaking kernel data back to userland.

* This was discussed over the last few months in various OS groups and
  I've decided to implement it.  After the FP debacle, it is prudent to
  give the general registers similar protections.
kernel - Fix CVE-2018-8897, debug register issue

* #DB can be delayed in a way that causes it to occur on the first
  instruction of the int $3 or syscall handlers.  These handlers must be
  able to detect and handle the condition.  This is a historical
  artifact of cpu operation that has existed for a very long time on
  both AMD and Intel CPUs.

* Fix by giving #DB its own trampoline stack and a way to load a
  deterministic %gs and %cr3 independent of the normal CS check.  This
  is CVE-2018-8897.

* Also fix the NMI trampoline while I'm here.

* Also fix an old issue with debug register trace traps which can occur
  when the kernel is accessing the user's address space.  This fix was
  lost years ago, now recovered.

Credits: Nick Peterson of Everdox Tech, LLC (original reporter)
Credits: Thanks to Microsoft for coordinating the OS vendor response
kernel - Implement spectre mitigations part 3 (stabilization)

* Fix a bug in the system call entry code.  The wrong stack pointer was
  being loaded for KMMUENTRY_SYSCALL, and KMMUENTRY_SYSCALL was using an
  offset that did not exist in certain situations.

* Load the correct stack pointer, but also change KMMUENTRY_CORE to not
  use stack-relative loads and stores.  Instead it accesses the
  trampframe directly via %gs:BLAH.

Reported-by: zrj
kernel - Implement spectre mitigations part 2

* NOTE: The last few commits may have said 'IBPB' but they really meant
  'IBRS'.  The last few commits added IBRS support; this one cleans that
  up and adds IBPB support.

* Intel says that for IBRS always-on mode (mode 2), SPEC_CTRL still has
  to be poked on every user->kernel entry as a barrier, even though the
  value is not being changed.  So make this change.  This actually
  improves performance somewhat on Skylake and later versus just setting
  it globally and leaving it that way.

* Implement IBPB detection and support on Intel.  At the moment we
  default to turning it off because the performance hit is pretty
  massive.  Currently the work on linux is only using IBPB for
  VMM-related operations and not for user->kernel entry.

* Enhance the machdep.spectre_mitigation sysctl to print out the mode
  matrix in human-readable terms whenever you change it:

    0   IBRS disabled                IBPB disabled
    1   IBRS mode 1 (kernel-only)    IBPB disabled
    2   IBRS mode 2 (at all times)   IBPB disabled
    4   IBRS disabled                IBPB enabled
    5   IBRS mode 1 (kernel-only)    IBPB enabled
    6   IBRS mode 2 (at all times)   IBPB enabled

  Currently we default to (1) instead of (5) when the microcode supports
  both features.  IBPB is not turned on by default (you can see why
  below).

* Haswell and Skylake performance loss matrix using the following test.
  This tests a high-concurrency compile, which is approximately a 5:1
  user:kernel test with high concurrency.

  The haswell box is: i3-4130 CPU @ 3.40GHz (2-core/4-thread)
  The skylake box is: i5-6500 CPU @ 3.20GHz (4-core/4-thread)

  This does not include MMU isolation losses, which will add another
  3-4% or so in losses.

  (/usr/obj on tmpfs)
  time make -j 8 nativekernel NO_MODULES=TRUE

  PERFORMANCE LOSS MATRIX

               HASWELL              SKYLAKE
               IBPB=0   IBPB=1      IBPB=0   IBPB=1
    IBRS=0      0%       12%         0%       17%
    IBRS=1    >12%<      21%       >2.4%<     15%
    IBRS=2     58%       60%        23%       32%

  Note that the automatic default when microcode support is detected is
  IBRS=1, IBPB=0 (12% loss on Haswell and 2.4% loss on Skylake for this
  test).  If we add 3-4% or so for MMU isolation, a Haswell cpu loses
  around 16% and a Skylake cpu loses around 6% or so in performance.

  PERFORMANCE LOSS MATRIX (including 3% MMU isolation losses)

               HASWELL              SKYLAKE
               IBPB=0   IBPB=1      IBPB=0   IBPB=1
    IBRS=0      3%       15%         3%       20%
    IBRS=1    >15%<      24%       >5.4%<     18%
    IBRS=2     61%       63%        26%       35%
kernel - Implement spectre mitigations part 1

* Implement machdep.spectre_mitigation.  This can be set as a tunable or
  changed later via sysctl.  The tunable is only applicable if the BIOS
  has the appropriate microcode; otherwise you have to update the
  microcode first and then use sysctl to set the mode.  This works
  similarly to Linux's IBRS support.

  mode 0 - Spectre IBPB MSRs disabled
  mode 1 - Set IBPB MSR on USER->KERN transition and clear it on
           KERN->USER.
  mode 2 - Leave IBPB set globally.  Do not toggle on USER->KERN or
           KERN->USER transitions.

* Retest spectre microcode MSRs on microcode update.

* Spectre mode 1 is enabled by default if the microcode supports it.
  (We might change this to disabled by default; I'm still mulling it
  over.)

* General performance effects (not counting the MMU separation mode,
  which is machdep.meltdown_mitigation and adds another 3% in overhead):

  Skylake loses around 5% for mode 1 and 12% for mode 2, versus mode 0.
  Haswell loses around 12% for mode 1 and 53% for mode 2, versus mode 0.

  Add another 3% if MMU separation is also turned on (aka
  machdep.meltdown_mitigation).

* General system call overhead effects on Skylake:

  machdep.meltdown_mitigation=0, machdep.spectre_mitigation=0    103ns
  machdep.meltdown_mitigation=1, machdep.spectre_mitigation=0    360ns
  machdep.meltdown_mitigation=1, machdep.spectre_mitigation=1    848ns
  machdep.meltdown_mitigation=1, machdep.spectre_mitigation=2    404ns

  Note that mode 1 has better overall performance for mixed user+kernel
  workloads despite having a much higher system call overhead, whereas
  mode 2 has lower system call overhead but generally lower overall
  performance because IBPB is enabled in usermode.
kernel - Intel user/kernel separation MMU bug fix part 5

* Fix iretq fault handling.  As I thought, I messed it up with the
  trampoline patches.  Fixing it involves issuing the correct KMMU*
  macros to ensure that the code is on the correct stack and has the
  correct mmu context.

  Revalidate with a test program that uses a signal handler to change
  the stack segment descriptor to something it shouldn't be.

* Get rid of the "kernel trap 9..." console message for the iretq fault
  case.
kernel - Intel user/kernel separation MMU bug fix part 3/3

* Implement the isolated pmap template, iso_pmap.  The pmap code will
  generate a dummy iso_pmap containing only the kernel mappings required
  for userland to be able to transition into the kernel and vice-versa.
  The mappings needed are:

  (1) The per-cpu trampoline area for our stack (rsp0)
  (2) The global descriptor table (gdt) for all cpus
  (3) The interrupt descriptor table (idt) for all cpus
  (4) The TSS block for all cpus (we store this in the trampoline page)
  (5) Kernel code addresses for the interrupt vector entry and exit

* In this implementation the 'kernel code' addresses are currently just
  btext to etext, that is, the kernel's primary text area.  Kernel data
  and bss are not part of the isolation map.

  TODO - just put the vector entry and exit points in the map, and not
  the entire kernel.

* System call performance is reduced when isolation is turned on, from
  around 100ns to 350ns or so.  However, typical workloads should not
  lose more than 5% performance or so.  System-call-heavy and
  interrupt-heavy workloads (network, database, high-speed storage,
  etc.) can lose a lot more performance.

  We leave the trampoline code in place whether isolation is turned on
  or not.  The trampoline overhead, without isolation, is only 5ns or
  so.

* Fix a missing exec-related trampoline initialization.

* Clean up kernel page table PTEs a bit.  PG_M is ignored on
  non-terminal PTEs, so don't set it.  Also don't set PG_U in
  non-terminal kernel page table pages (PG_U is never set on terminal
  PTEs so this wasn't a problem, but we should be correct).

* Fix a bug in fast_syscall's trampoline stack.  The wrong stack pointer
  was being loaded.

* Move mdglobaldata->gd_common_tss to privatespace->common_tss.  Place
  common_tss in the same page as the trampoline to reduce exposure to
  globaldata from the isolated MMU context.

* 16-byte align struct trampframe for convenience.

* Fix a bug in POP_FRAME.  Always cli in order to avoid getting an
  interrupt just at the iretq instruction, which might be
  misinterpreted.
kernel - Intel user/kernel separation MMU bug fix part 1/3

* Part 1/3 of the fix for the Intel user/kernel separation MMU bug.  It
  appears that it is possible to discern the contents of kernel memory
  with careful timing measurements of instructions due to speculative
  memory reads and speculative instruction execution by Intel cpus.
  This can happen because Intel will allow both to occur even when the
  memory access is later disallowed due to privilege separation in the
  PTE.

  Even though the execution is always aborted, the speculative reads and
  speculative execution result in timing artifacts which can be
  measured.  A speculative compare/branch can lead to timing artifacts
  that allow the actual contents of kernel memory to be discerned.

  While there are multiple speculative attacks possible, the Intel bug
  is particularly bad because it allows a user program to more or less
  effortlessly access kernel memory (and, if a DMAP is present, all of
  physical memory).

* Part 1 implements all the logic required to load an 'isolated' version
  of the user process's PML4e into %cr3 on all transitions to user mode,
  and to load the 'normal' U+K version into %cr3 on all transitions from
  user to kernel.

* Part 1 fully allocates, copies, and implements the %cr3 loads for the
  'isolated' version of the user process PML4e.

* Part 1 does not yet actually adjust the contents of this isolated
  version to replace the kernel map with just a trampoline map in kernel
  space.  It does remove the DMAP as a test, though.  The full
  separation will be done in part 3.
kernel: Remove the COMPAT_43 kernel option along with all related code.

It has been commented out in our default kernel config files for almost
five years now, since 9466f37df5258f3bc3d99ae43627a71c1c085e7d.

Approved-by: dillon
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/2946>
kernel - Misc fixes and debugging

* Add required CLDs in the exception paths.  The interrupt paths already
  CLD in PUSH_FRAME.

* Note that the fast_syscall (SYSENTER) path has an implied CLD due to
  the hardware mask applied to rflags.

* Add the IOPL bits to the set of bits set to 0 during a fast_syscall.

* When creating a dummy interrupt frame we don't have to push the actual
  %cs.  Just push $0 so the frame isn't misinterpreted as coming from
  userland.

* Additional debug verbosity for freeze_on_seg_fault.

* Reserve two void * fields for LWP debugging (for a later commit).
kernel - Add syscall quick return path for x86-64

* Flag the case where a sysretq can be performed to quickly return from
  a system call instead of having to execute the slower doreti code.

* This about halves syscall times for simple system calls such as
  getuid(), and reduces longer syscalls by ~80ns or so on a fast 3.4GHz
  SandyBridge, but does not seem to really affect performance a whole
  lot.

Taken-From: FreeBSD (loosely)