kernel - Enhance the sniff code, refactor interrupt disablement for IPIs

* Add kern.sniff_enable, default to 1.  Allows the sysop to disable the feature if desired.

* Add kern.sniff_target, allows sniff IPIs to be targeted to all cpus (-1), or to a particular cpu (0...N).  This feature allows the sysop to test IPI delivery to particular CPUs (typically monitoring with systat -pv 0.1) to determine that delivery is working properly.

* Bring in some additional AMD-specific setup from FreeBSD, beginnings of support for the APIC Extended space.  For now just make sure the extended entries are masked.

* Change interrupt disablement expectations.  The caller of apic_ipi(), selected_apic_ipi(), and related macros is now required to hard-disable interrupts rather than these functions doing so.  This allows the caller to run certain operational sequences atomically.

* Use the TSC to detect IPI send stalls instead of a hard-coded loop count (see the sketch after this entry).

* Also set the APIC_LEVEL_ASSERT bit when issuing a directed IPI, though the spec says this is unnecessary.  Do it anyway.

* Remove unnecessary critical section in selected_apic_ipi().  We are in a hard-disablement and in particular we do not want to accidentally trigger a splz() due to the crit_exit() while in the hard-disablement.

* Enhance the IPI stall detection and recovery code.  Provide more information.  Also enable the LOOPMASK_IN debugging tracker by default.

* Add a testing feature to machdep.all_but_self_ipi_enable.  By setting this to 2, we force the smp_invltlb() to always use the ALL_BUT_SELF IPI.  For testing only.
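For illustration, a minimal user-space sketch of the TSC-bounded stall check described above.  The icr_delivery_pending() helper, the assumed 2 GHz TSC frequency, and the one-second limit are hypothetical stand-ins, not the actual apic_ipi() internals; the real code polls the LAPIC ICR delivery-status bit with interrupts already hard-disabled by the caller.

    /* Sketch only: TSC-bounded wait instead of a hard-coded loop count. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t tsc_frequency = 2000000000ULL;  /* assumed 2 GHz */

    static int
    icr_delivery_pending(void)
    {
        /* placeholder for (lapic->icr_lo & APIC_DELSTAT_MASK) */
        return 0;
    }

    static int
    ipi_wait_for_idle(void)
    {
        uint64_t base = __builtin_ia32_rdtsc();
        uint64_t limit = tsc_frequency;         /* roughly one second of cycles */

        /* Spin on the delivery-status bit, bounded by the TSC. */
        while (icr_delivery_pending()) {
            if (__builtin_ia32_rdtsc() - base > limit)
                return -1;                      /* stalled; caller reports/recovers */
        }
        return 0;
    }

    int
    main(void)
    {
        printf("ipi_wait_for_idle() -> %d\n", ipi_wait_for_idle());
        return 0;
    }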
vkernel - Fix a vkernel lockup on startup

- During ap_init() any pending IPIs are processed manually, so clear gd_npoll as the real kernel does.

- Do not disable interrupts for vkernels during lwkt_send_ipiq3() because they do not appear to be re-enabled afterwards as they should be.  I'm not entirely sure this is the right fix, more investigation is required.
kernel - Refactor cpu localization for VM page allocations (3)

* Instead of iterating the cpus in the mask starting at cpu #0, iterate starting at mycpu to the end, then from 0 to mycpu - 1.  This fixes random masked wakeups from favoring lower-numbered cpus (see the sketch after this entry).

* The user process scheduler (usched_dfly) was favoring lower-numbered cpus due to a bug in the simple selection algorithm, causing forked processes to be weighted improperly at startup.  A high fork or fork/exec rate skewed the way the cpus were loaded.  Fix this by correctly scanning cpus from the (scancpu) rover.

* For now, use a random 'previous' affinity for initially scheduling a fork.
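A minimal sketch, assuming a plain 64-bit mask and an ncpus count, of the wrap-around scan described in the first bullet; the real code operates on cpumask_t and the scheduler's own data structures.

    #include <stdint.h>
    #include <stdio.h>

    static int
    scan_cpumask_from(uint64_t mask, int mycpu, int ncpus)
    {
        for (int i = 0; i < ncpus; ++i) {
            int cpu = (mycpu + i) % ncpus;      /* mycpu..N-1, then 0..mycpu-1 */
            if (mask & (1ULL << cpu))
                return cpu;
        }
        return -1;                              /* no cpu set in the mask */
    }

    int
    main(void)
    {
        uint64_t mask = (1ULL << 2) | (1ULL << 30);
        printf("first candidate from cpu 28: %d\n",
               scan_cpumask_from(mask, 28, 32)); /* prints 30, not 2 */
        return 0;
    }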
kernel - Fix excessive ipiq recursion (3)

* Third try.  I'm not quite sure why we are still getting hard locks.  These changes (so far) appear to fix the problem, but I don't know why.  It is quite possible that the problem is still not fixed.

* Setting target->gd_npoll will prevent *all* other cpus from sending an IPI to that target.  This should have been ok because we were in a critical section and about to send the IPI to the target ourselves, after setting gd_npoll.  The critical section does not prevent Xinvltlb, Xsniff, Xspuriousint, or Xcpustop from running, but of these only Xinvltlb does anything significant and it should theoretically run at a higher level on all cpus than Xipiq (and thus complete without causing a deadlock of any sort).  So in short, it should have been ok to allow something like an Xinvltlb to interrupt the cpu in between setting target->gd_npoll and actually sending the Xipiq to the target.  But apparently it is not ok.

* Only clear mycpu->gd_npoll when we either (1) EOI and take the IPIQ interrupt, or (2) if the IPIQ is made pending via reqflags, when we clear the flag.  Previously we were clearing gd_npoll in the IPI processing loop itself, potentially racing new incoming interrupts before they get EOId by our cpu.  This also should have been just fine, because interrupts are enabled in the processing loop so nothing should have been able to back up in the LAPIC.  I can conjecture that possibly there was a race when we cleared gd_npoll multiple times, potentially clearing it a second (or later) time, allowing multiple incoming IPIs to be queued from multiple cpu sources but then cli'ing and entering e.g. an Xinvltlb processing loop before our cpu could acknowledge any of them.  And then, possibly, trying to issue an IPI with the system in this state.  I don't really see how this can cause a hard lock because I did not observe any loop/counter error messages on the console, which should have been triggered if other cpus got stuck trying to issue IPIs.  But LAPIC IPI interactions are not well documented so... perhaps they were being issued but blocked our local LAPIC from accepting a Xinvltlb due to having one extra unacknowledged Xipiq pending?  But then, our Xinvltlb processing loop *does* enable interrupts for the duration, so it should have drained if this were so.  In any case, we no longer gratuitously clear gd_npoll in the processing loop.  We only clear it when we know there isn't one in-flight heading to our cpu and none queued on our cpu.  What will happen now is that a second IPI can be sent to us once we've EOI'd the first one, and wind up in reqflags, but will not be acted upon until our current processing loop returns.  I will note that the gratuitous clearing we did before *could* have allowed substantially all other cpus to try to Xipiq us at nearly the same time, so perhaps the deadlock was related to that type of situation.

* When queueing an ipiq command from mycpu to a target, interrupts were enabled between our entry into the ipiq fifo, the setting of our cpu bit in the target gd_ipimask, the setting of target->gd_npoll, and our issuing of the actual IPI to the target.  We now disable interrupts across these four steps (see the sketch after this entry).  It should have been ok for interrupts to have been left enabled across these four steps.  It might still be, but I am not taking any chances now.
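A rough sketch of the send-side ordering only; fifo_enqueue(), set_ipimask_bit(), and send_hw_ipi() are placeholders rather than the actual lwkt_send_ipiq3() internals.  The point being illustrated is that the fifo entry, the gd_ipimask bit, gd_npoll, and the hardware IPI are now all handled with interrupts hard-disabled.

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int target_gd_npoll;          /* gates redundant IPIs to the target */

    static void cpu_disable_intr(void) { /* cli on the real hardware */ }
    static void cpu_enable_intr(void)  { /* sti on the real hardware */ }
    static void fifo_enqueue(int cmd)  { (void)cmd; }
    static void set_ipimask_bit(int mycpu) { (void)mycpu; }
    static void send_hw_ipi(int target) { (void)target; }

    static void
    ipiq_send(int mycpu, int target, int cmd)
    {
        cpu_disable_intr();                     /* the four steps now run atomically */
        fifo_enqueue(cmd);                      /* 1: queue the command */
        set_ipimask_bit(mycpu);                 /* 2: mark us in the target's gd_ipimask */
        if (atomic_exchange(&target_gd_npoll, 1) == 0)
            send_hw_ipi(target);                /* 3+4: set gd_npoll, issue the IPI */
        cpu_enable_intr();
    }

    int
    main(void)
    {
        ipiq_send(0, 1, 42);
        printf("gd_npoll=%d\n", atomic_load(&target_gd_npoll));
        return 0;
    }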
kernel - Fix excessive ipiq recursion

* Fix a situation where excessive IPIQ recursion can occur.  The problem was revealed by the previous commit when the passive signalling mechanism was changed.

* Passive IPI sends now signal at 1/4 full.

* Active IPI sends wait for the FIFO to be < 1/2 full only when the nesting level is 0, otherwise they allow it to become almost completely full.  This effectively gives IPI callbacks a buffer of roughly 1/2 the FIFO in which they can issue IPI sends without triggering the wait-process loop (which is the cause of the nesting).  IPI callbacks do not usually send more than one or two IPI sends to any given cpu target, which should theoretically guarantee that excessive stacking will not occur.  A small sketch of the threshold policy follows this entry.

Reported-by: marino
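A small sketch of the threshold policy, assuming a hypothetical MAXFIFO of 64 and simplified producer/consumer indices; the actual lwkt_ipiq layout and index wrap handling differ.

    #include <stdio.h>

    #define MAXFIFO 64

    struct ipiq {
        int windex;     /* producer index */
        int rindex;     /* consumer index */
    };

    static int ipiq_depth(const struct ipiq *q) { return q->windex - q->rindex; }

    /* Passive sends signal the target once the fifo reaches 1/4 full. */
    static int passive_should_signal(const struct ipiq *q)
    {
        return ipiq_depth(q) >= MAXFIFO / 4;
    }

    /* Active sends only wait for room below 1/2 full at nesting level 0;
     * nested sends (IPI callbacks sending IPIs) may fill it nearly full,
     * which avoids recursive wait-processing. */
    static int active_must_wait(const struct ipiq *q, int nest)
    {
        if (nest == 0)
            return ipiq_depth(q) >= MAXFIFO / 2;
        return ipiq_depth(q) >= MAXFIFO - 2;
    }

    int main(void)
    {
        struct ipiq q = { .windex = 20, .rindex = 0 };
        printf("signal=%d wait(nest0)=%d wait(nest1)=%d\n",
               passive_should_signal(&q), active_must_wait(&q, 0),
               active_must_wait(&q, 1));
        return 0;
    }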
kernel - Fix Xinvltlb issue, fix ipiq issue, add Xsniff

* The Xinvltlb IPI interrupt enables interrupts in smp_inval_intr(), which allows potentially pending interrupts and other things to happen.  We must use doreti instead of doreti_iret.

* Fix a reentrancy issue with lwkt_ipiq.  Reentrancy can occur when the ipi callback itself needs to issue an IPI, but the target cpu FIFO is full.  When this happens, the cpu mask may not be correct, so force a scan of all cpus in this situation.

* Add an infinite loop detection test to lwkt_process_ipiq() and jigger another IPI if it persists for more than 10 seconds, hopefully recovering the system if as-yet unknown IPI issues persist.

* Add the Xsniff IPI and augment systat -pv to use it.  This sniffs the %rip and %rsp on all cpus, allowing us to see where the kernel spends its time (see the sketch after this entry).
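A conceptual sketch of what an Xsniff-style handler records; the structure layouts below are placeholders, not the DragonFly trapframe/globaldata definitions.  The handler simply snapshots the interrupted program counter and stack pointer so a userland tool such as systat -pv can display where each cpu is spending its time.

    #include <stdint.h>
    #include <stdio.h>

    struct fake_trapframe { uintptr_t tf_rip; uintptr_t tf_rsp; };
    struct fake_globaldata { uintptr_t gd_sample_pc; uintptr_t gd_sample_sp; };

    static void
    sniff_handler(struct fake_globaldata *gd, const struct fake_trapframe *tf)
    {
        gd->gd_sample_pc = tf->tf_rip;      /* where the cpu was executing */
        gd->gd_sample_sp = tf->tf_rsp;      /* and its stack pointer */
    }

    int
    main(void)
    {
        struct fake_globaldata gd = { 0, 0 };
        struct fake_trapframe tf = { 0x80301234UL, 0xc0ffee00UL };
        sniff_handler(&gd, &tf);
        printf("pc=%#lx sp=%#lx\n", (unsigned long)gd.gd_sample_pc,
               (unsigned long)gd.gd_sample_sp);
        return 0;
    }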
kernel - Add lwkt_cpusync_quick()

* Add a quick one-stage cpusync function to complement our two-stage interlock/deinterlock cpusync functions.  The one-stage version doesn't have to spin the target cpus, only the originating cpu, but it can't quiesce the cpus either, whereas the two-stage version can.
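A conceptual sketch, using placeholder names rather than the actual lwkt API, contrasting the two-stage interlock/deinterlock form (targets spin between the stages and can be quiesced) with the new one-stage quick form (only the originator waits for the callback to complete).

    #include <stdio.h>

    typedef void (*sync_func_t)(void *arg);

    struct cpusync {
        sync_func_t func;
        void *arg;
    };

    static void run_on_targets(struct cpusync *cs) { cs->func(cs->arg); }
    static void wait_targets_spinning(struct cpusync *cs) { (void)cs; }
    static void release_targets(struct cpusync *cs) { (void)cs; }

    /* Two-stage form: targets are held spinning between the two calls, so
     * the originator can perform work while every other cpu is quiesced. */
    static void
    two_stage_sync(struct cpusync *cs)
    {
        wait_targets_spinning(cs);      /* interlock: all targets now spinning */
        run_on_targets(cs);             /* ... originator work happens here ... */
        release_targets(cs);            /* deinterlock */
    }

    /* One-stage form: targets run the callback and continue immediately;
     * only the originating cpu spins waiting for completion. */
    static void
    one_stage_quick_sync(struct cpusync *cs)
    {
        run_on_targets(cs);             /* originator waits, targets never spin */
    }

    static void say(void *arg) { puts((const char *)arg); }

    int
    main(void)
    {
        struct cpusync cs = { say, "callback ran" };
        two_stage_sync(&cs);
        one_stage_quick_sync(&cs);
        return 0;
    }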
kernel - Refactor cpumask_t to extend cpus past 64, part 1/2

* 64-bit systems only.  32-bit builds use the macros but cannot be expanded past 32 cpus.

* Change cpumask_t from __uint64_t to a structure.  This commit implements one 64-bit sub-element (the next one will implement four, for 256 cpus).

* Create a CPUMASK_*() macro API for non-atomic and atomic cpumask manipulation.  These macros generally take lvalues as arguments, allowing for a fairly optimal implementation (see the sketch after this entry).

* Change all C code operating on cpumasks to use the newly created CPUMASK_*() macro API.

* Compile-test 32 and 64-bit.  Run-test 64-bit.

* Adjust sbin/usched, usr.sbin/powerd.  usched currently needs more work.
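A minimal sketch of the struct-based cpumask and a few lvalue-style macros; the real cpumask definitions cover many more operations (including atomic variants), and part 2 widens the array to four 64-bit sub-elements.  This shows a single sub-element, as in this commit.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t ary[1];            /* one 64-bit chunk -> up to 64 cpus */
    } cpumask_t;

    #define CPUMASK_ASSZERO(mask)    ((mask).ary[0] = 0)
    #define CPUMASK_ORBIT(mask, i)   ((mask).ary[0] |= 1ULL << (i))
    #define CPUMASK_NANDBIT(mask, i) ((mask).ary[0] &= ~(1ULL << (i)))
    #define CPUMASK_TESTBIT(mask, i) (((mask).ary[0] & (1ULL << (i))) != 0)
    #define CPUMASK_TESTZERO(mask)   ((mask).ary[0] == 0)

    int
    main(void)
    {
        cpumask_t mask;

        CPUMASK_ASSZERO(mask);
        CPUMASK_ORBIT(mask, 5);
        printf("bit 5 set: %d\n", CPUMASK_TESTBIT(mask, 5));
        CPUMASK_NANDBIT(mask, 5);
        printf("empty: %d\n", CPUMASK_TESTZERO(mask));
        return 0;
    }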
hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc fixes also by Matthew Dillon.

* Special note on GSOC core: this is, needless to say, a huge amount of work compressed down into a few paragraphs of comments.  Adds the pc64/vmm subdirectory and tons of stuff to support hardware virtualization in guest-user mode, plus the ability for programs (vkernels) running in this mode to make normal system calls to the host.

* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c which vectors through a structure to machine-specific implementations.

  vmm_guest_ctl_args() - bootstrap VMM and EPT modes.  Copydown the original user stack for EPT (since EPT 'physical' addresses cannot reach that far into the backing store represented by the process's original VM space).  Also installs the GUEST_CR3 for the guest using parameters supplied by the guest.

  vmm_guest_sync_addr_args() - a host helper function that the vkernel can use to invalidate page tables on multiple real cpus.  This is a lot more efficient than having the vkernel try to do it itself with IPI signals via cpusync*().

* Add Intel VMX support to the host infrastructure.  Again, tons of work compressed down into a one paragraph commit message.  Intel VMX support added.  AMD SVM support is not part of this GSOC and not yet supported by DragonFly.

* Remove PG_* defines for PTEs and related mmu operations.  Replace with a table lookup so the same pmap code can be used for normal page tables and also EPT tables (see the sketch after this entry).

* Also include X86_PG_V defines specific to normal page tables for a few situations outside the pmap code.

* Adjust DDB to disassemble SVM related (intel) instructions.

* Add infrastructure to exit1() to deal with related structures.

* Optimize pfind() and pfindn() to remove the global token when looking up the current process's PID (Matt).

* Add support for EPT (double layer page tables).  This primarily required adjusting the pmap code to use a table lookup to get the PG_* bits.  Add an indirect vector for copyin, copyout, and other user address space copy operations to support manual walks when EPT is in use.  A multitude of system calls which manually looked up user addresses via the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations where the vkernel is running on more cpus than the host has, and to help with scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page() uses to try to shortcut operations and avoid locks.  Implement it for pc64.  This function checks whether the page is already faulted in as requested by looking up the PTE.  If not, it returns NULL and the full blown vm_fault_page() code continues running.

* (Matt) Remove the MP lock from most of the vkernel's trap() code.

* (Matt) Use a shared spinlock when possible for certain critical paths related to the copyin/copyout path.
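A sketch of the table-lookup idea from the PG_* bullet above: each pmap carries a small table of bit values so the same code can drive both normal x86 page tables and EPT tables.  The index names and bit values here are illustrative, not the actual pc64 definitions.

    #include <stdint.h>
    #include <stdio.h>

    enum { PG_V_IDX, PG_RW_IDX, PG_BITS };

    struct pmap {
        uint64_t pmap_bits[PG_BITS];
    };

    /* Normal page tables: valid=bit0, writable=bit1 (illustrative). */
    static const struct pmap pmap_std = { { 0x1, 0x2 } };
    /* EPT tables: "valid" is expressed via read/write/execute bits (illustrative). */
    static const struct pmap pmap_ept = { { 0x1 | 0x2 | 0x4, 0x2 } };

    static uint64_t
    pte_make(const struct pmap *pm, uint64_t pa, int writable)
    {
        uint64_t pte = pa | pm->pmap_bits[PG_V_IDX];    /* lookup, not a constant */
        if (writable)
            pte |= pm->pmap_bits[PG_RW_IDX];
        return pte;
    }

    int
    main(void)
    {
        printf("std pte %#llx, ept pte %#llx\n",
               (unsigned long long)pte_make(&pmap_std, 0x200000, 1),
               (unsigned long long)pte_make(&pmap_ept, 0x200000, 1));
        return 0;
    }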
kernel - Misc adjustments used by the vkernel and VMM, misc optimizations

* These changes are committed separately because they are basically independent of VMM.

* Improve pfind().  Don't get proc_token if the process being looked up is the current process (see the sketch after this entry).

* Improve kern_kill().  Do not obtain proc_token any more.  p->p_token is sufficient and the process group has its own lock now.

* Call pthread_yield() when spinning on various things:
  x Spinlocks
  x Tokens (spinning in lwkt_switch)
  x cpusync (ipiq)

* Rewrite sched_yield() -> dfly_yield().  dfly_yield() will unconditionally round-robin the LWP, ignoring estcpu.  It isn't perfect but it works fairly well.  The dfly scheduler will also no longer attempt to migrate threads across cpus when handling yields.  They migrate normally in all other circumstances.  This fixes situations where the vkernel is spinning waiting for multiple events from other cpus, and in particular when it is doing a global IPI for pmap synchronization of the kernel_pmap.
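A sketch of the pfind() fast path described above, using placeholder types and stubbed token calls: when a process looks up its own pid, curproc is returned without taking the global proc_token; everything else falls through to the locked lookup.

    #include <stddef.h>
    #include <stdio.h>

    struct proc { int p_pid; };

    static struct proc curproc_storage = { 1234 };
    static struct proc *curproc = &curproc_storage;

    static void lwkt_gettoken_stub(void)  { /* global proc_token acquire (stub) */ }
    static void lwkt_reltoken_stub(void)  { /* global proc_token release (stub) */ }
    static struct proc *hash_lookup(int pid) { (void)pid; return NULL; }

    static struct proc *
    pfind_sketch(int pid)
    {
        struct proc *p;

        if (curproc && curproc->p_pid == pid)
            return curproc;                 /* fast path: no proc_token needed */

        lwkt_gettoken_stub();               /* slow path: full locked lookup */
        p = hash_lookup(pid);
        lwkt_reltoken_stub();
        return p;
    }

    int
    main(void)
    {
        printf("lookup self: %s\n", pfind_sketch(1234) ? "hit" : "miss");
        return 0;
    }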