kernel - Use higher invltlb synchronization timeout if guest * Increase the invltlb synchronization timeout from 10 seconds to 60 seconds if running as a guest. Just in case the host is heavily paging the guest, 10 seconds might not be enough. Of course, performance will be absolutely terrible if broadcast IPIs take that long to synchronize, but there isn't anything the guest can do about it.
kernel - Expand GDT table to maximum size * Expand the GDT table from 9 entries to its maximum size of 65536 bytes (limit field 0xFFFF, i.e. 8192 descriptors). * This deals with an Intel quirk in VMX where the GDT limit field is not restored on a VM exit, but instead unconditionally set to 0xFFFF.
vm: Change 'kernel_map' global to type 'struct vm_map *' Change the global variable 'kernel_map' from type 'struct vm_map' to a pointer to this struct. This simplifies the code a bit, since all invocations previously had to take its address. This change also aligns with NetBSD's 'kernel_map', which is likewise a pointer, and helps the porting of NVMM. No functional changes.
flame_graph - Add initial code to support flame graphs * Add better PC sampling code to the kernel, capable of generating call stack traces. * Implement an initial flame_graph utility. Usage:
flame_graph > /tmp/x.out &
(let it run a while)
flame_graph -p < /tmp/x.out
Requested-by: mjg
kernel - Refactor malloc_type to reduce static data in image * malloc_type was embedding a SMP_MAXCPU array of kmalloc_use structures, which winds up being 16KB a pop x 400+ MALLOC_DEFINE() declarations. This was over 6MB of static data in the kernel binary, and it wasn't BSS because the declaration is initialized with some defaults. So this reduction is significant and directly impacts both memory use and kernel boot times. * Change malloc_type->ks_use from an array to a pointer. Embed a single kmalloc_use structure (ks_use0) as the default. When ncpus is probed, the kernel now goes through all malloc_type structures and dynamically allocates a properly-sized ks_use array. Any malloc types registered after that point will also dynamically allocate ks_use.
kernel - Zen 2 - Make sure %fs's selector is loaded in AP bootstrap * Issue load_fs() in the AP bootstrap. It appears that Zen 2 handles %fs in a weird way when the selector isn't loaded, causing the first wrmsr(MSR_FSBASE) to quietly fail, and possibly others too. * For good measure, also issue load_ds() and load_es(). * Fixes DragonFlyBSD's boot on Zen 2.
kernel - Fix SMAP/SMEP caught user mode access part 2/2. * Finish implementing SMAP exception handling support by properly detecting it in trap() and generating a panic(). Otherwise the cpu just locks up in a page-fault loop without any indication as to why on the console. * To properly support SMAP, make sure AC is cleared on system calls (it is already cleared on any interrupt or exception by the frame push code but I missed the syscall entry code).
kernel - Update AMD topology detection, scheduler NUMA work (TR2) * Update AMD topology detection to use the correct cpuid. It now properly detects the Threadripper 2990WX as having four nodes with 8 cores and 2 threads per core, per node. It previously detected the chip as one node with 32 cores and 2 threads per core. * Report the basic detected topology without requiring bootverbose. * Record information about how much memory is attached to each node. We previously just assumed that it was symmetric. This will be used by the scheduler. * Fix instability in the scheduler when running on a large number of cores. Flag 0x08 (on by default) is needed to actively schedule overloaded threads onto other cores, but this operation was being executed on all cores simultaneously, which throws the uload/ucount metrics into an unstable state, causing threads to bounce around longer than necessary. Fix by round-robining the operation based on something similar to sched_ticks % cpuid. This significantly improves heavy multi-tasking performance on systems with many cores. * Add memory-on-node weighting to the scheduler. This detects asymmetric NUMA configurations for situations where not all DIMM slots have been populated, and for CPUs which are naturally asymmetric, such as the 2990WX which only has memory directly connected to two of its four nodes. This change will preferentially schedule threads onto nodes with greater amounts of attached memory under light loads, and dig into the less desirable cpu nodes as the load increases.
x86_64: Implement x2apic support. Now LAPIC registers are accessed through MSRs at fixed locations, instead of going through the MMIO region. Most noticeably, ICR operation, i.e. IPI sending, is greatly simplified: - Reserved bits read as 0; there is no need to read the ICR first to OR in the new values. - No more pending bit, i.e. ICR writes are synchronous; there is no need to read the ICR to test the pending bit. - The ICR is 64 bits in x2apic mode, i.e. two 32-bit writes to ICR-low and ICR-high become one write to the ICR. NOTE: Though the Intel SDM says that wrmsr to LAPIC registers is relaxed, we don't need to put mfence or sfence before them, especially for sending IPIs, since the generic IPIQ and machdep code already use atomic operations before doing ICR operations. For the rest of the code, there really is no need to add mfence/sfence before rdmsr/wrmsr to LAPIC registers. As of this commit, x2apic mode is _not_ enabled by default. It can be enabled through the hw.x2apic_enable tunable, and a read-only sysctl node with the same name is available for debugging purposes. Based on work by ivadasz@.
x86_64: Prepare for x2apic support. - Use macros to access and modify LAPIC registers. - Use function pointers for hot LAPIC operations, i.e. IPI and timer. - Refactor the related code a bit. The global variable 'lapic' is renamed to 'lapic_mem' to ease code searching. Based on work by ivadasz@.
kernel - Fix CVE-2018-8897, debug register issue * #DB can be delayed in a way that causes it to occur on the first instruction of the int $3 or syscall handlers. These handlers must be able to detect and handle the condition. This is a historical artifact of cpu operation that has existed for a very long time on both AMD and Intel CPUs. * Fix by giving #DB its own trampoline stack and a way to load a deterministic %gs and %cr3 independent of the normal CS check. This is CVE-2018-8897. * Also fix the NMI trampoline while I'm here. * Also fix an old issue with debug register trace traps which can occur when the kernel is accessing the user's address space. This fix was lost years ago, now recovered. Credits: Nick Peterson of Everdox Tech, LLC (original reporter) Credits: Thanks to Microsoft for coordinating the OS vendor response
kernel - Intel user/kernel separation MMU bug fix part 3/3 * Implement the isolated pmap template, iso_pmap. The pmap code will generate a dummy iso_pmap containing only the kernel mappings required for userland to be able to transition into the kernel and vice versa. The mappings needed are: (1) The per-cpu trampoline area for our stack (rsp0) (2) The global descriptor table (gdt) for all cpus (3) The interrupt descriptor table (idt) for all cpus (4) The TSS block for all cpus (we store this in the trampoline page) (5) Kernel code addresses for the interrupt vector entry and exit * In this implementation the 'kernel code' addresses are currently just btext to etext. That is, the kernel's primary text area. Kernel data and bss are not part of the isolation map. TODO - just put the vector entry and exit points in the map, and not the entire kernel. * System call performance is reduced when isolation is turned on, from around 100ns to 350ns. However, typical workloads should not lose more than 5% performance or so. System-call-heavy and interrupt-heavy workloads (network, database, high-speed storage, etc) can lose a lot more performance. We leave the trampoline code in place whether isolation is turned on or not. The trampoline overhead, without isolation, is only 5ns or so. * Fix a missing exec-related trampoline initialization. * Clean up kernel page table PTEs a bit. PG_M is ignored on non-terminal PTEs, so don't set it. Also don't set PG_U in non-terminal kernel page table pages (PG_U is never set on terminal PTEs so this wasn't a problem, but we should be correct). * Fix a bug in fast_syscall's trampoline stack. The wrong stack pointer was being loaded. * Move mdglobaldata->gd_common_tss to privatespace->common_tss. Place common_tss in the same page as the trampoline to reduce exposure to globaldata from the isolated MMU context. * 16-byte align struct trampframe for convenience. * Fix a bug in POP_FRAME: always cli in order to avoid taking an interrupt just at the iretq instruction, which might be misinterpreted.
kernel - Intel user/kernel separation MMU bug fix part 1/3 * Part 1/3 of the fix for the Intel user/kernel separation MMU bug. It appears that it is possible to discern the contents of kernel memory with careful timing measurements of instructions due to speculative memory reads and speculative instruction execution by Intel cpus. This can happen because Intel will allow both to occur even when the memory access is later disallowed due to privilege separation in the PTE. Even though the execution is always aborted, the speculative reads and speculative execution results in timing artifacts which can be measured. A speculative compare/branch can lead to timing artifacts that allow the actual contents of kernel memory to be discerned. While there are multiple speculative attacks possible, the Intel bug is particularly bad because it allows a user program to more or less effortlessly access kernel memory (and if a DMAP is present, all of physical memory). * Part 1 implements all the logic required to load an 'isolated' version of the user process's PML4e into %cr3 on all user transitions, and to load the 'normal' U+K version into %cr3 on all transitions from user to kernel. * Part 1 fully allocates, copies, and implements the %cr3 loads for the 'isolated' version of the user process PML4e. * Part 1 does not yet actually adjust the contents of this isolated version to replace the kernel map with just a trampoline map in kernel space. It does remove the DMAP as a test, though. The full separation will be done in part 3.
kernel - Expand physical memory support to 64TB * Make NKPML4E truly programmable and change the default from 1 PDP page to 16 PDP pages. This increases KVM from 512G to 8TB, which should be enough to accommodate a maximal 64TB configuration. Note that e.g. 64TB of physical ram certainly requires more than one kernel PDP page, since the vm_page_array alone would require around 2TB, never mind everything else! PDP entries in the PML4E (512 total @ 512GB per entry):
  256  User space
  112  (unused, available for NKPML4E)
  128  DMAP (64TB max physical memory)
   16  KVM NKPML4E default (8TB) (recommend 64 max)
* Increase the DMAP from 64 PDP pages to 128 PDP pages, allowing support for up to 64TB of physical memory. * Changes the meaning of KPML4I from being 'the index of the only PDP page in the PML4e' to 'the index of the first PDP page in the PML4e'. There are NKPML4E PDP pages starting at index KPML4I. * NKPDPE can now exceed 512. This is calculated to be the maximum number of PD pages needed for KVM, which is now (NKPML4E*NPDPEPG-1). We now pre-allocate and populate only enough PD pages to accommodate the page tables we are pre-installing. Those, in turn, are calculated to be sufficient for bootstrapping mainly vm_page_array and a large initial set of pv_entry structures. * Remove nkpt; it was not being used any more.