ioapic/icu: Rework PIC selection code

- In the early stage, before I/O APIC is detected and set up, ICU controls
  interrupts, so all IDT entries should be set to ICU's intr code.
- Switch to I/O APIC only after ICU is completely disconnected, i.e. after
  IMCR is set and LINT0 is masked.
kernel - scheduler adjustments for large ncpus / 48-core monster

* Change the LWKT scheduler's token spinning algorithm. It used to DELAY a
  short period of time and then simply retry, creating a lot of contention
  between cpus trying to acquire a token.

  Now the LWKT scheduler uses a FIFO index mechanic to resequence the
  contending cpus into 1uS retry slots using essentially just
  atomic_fetchadd_int(), so it is very cache friendly. The spin-retry thus
  has a bounded cache management traffic load regardless of the number of
  cpus, and contending cpus will not be tripping over each other.

  The new algorithm slightly regresses 4-cpu operation (~5% under heavy
  contention) but significantly improves 48-cpu operation. It is also
  flexible enough for further work down the road. The old algorithm simply
  did not scale very well.

* Add three sysctls:

  sysctl lwkt.spin_method=1

    0  Allow a user thread to be scheduled on a cpu while kernel threads
       are contended on a token, using the IPI mechanic to interrupt the
       user thread and reschedule on decontention. This can potentially
       result in excessive IPI traffic.

    1  Allow a user thread to be scheduled on a cpu while kernel threads
       are contended on a token, and reschedule on the next clock tick
       (100 Hz typically). Decontention will NOT generate any IPI
       traffic. DEFAULT.

    2  Do not allow a user thread to be scheduled on a cpu while kernel
       threads are contended. Should not be used normally, for debugging
       only.

  sysctl lwkt.spin_delay=1

    Slot time in microseconds, default 1uS. Recommended values are 1 or 2
    but not longer.

  sysctl lwkt.spin_loops=10

    Number of times the LWKT scheduler loops on contended threads before
    giving up and allowing an idle-thread HLT. In order to wake up from
    the HLT, decontention will cause an IPI, so you do not want to set
    this value too small. Values between 10 and 100 are recommended.

* Redo the token decontention algorithm. Use a new gd_reqflags flag,
  RQF_WAKEUP, coupled with RQF_AST_LWKT_RESCHED in the per-cpu globaldata
  structure to determine what cpus actually need to be IPI'd on token
  decontention (to wake up their idle threads stuck in HLT). This requires
  that all gd_reqflags operations use locked atomic instructions rather
  than non-locked instructions.

* Decontention IPIs are a last-gasp effort if the LWKT scheduler has spun
  too many times. Under normal conditions, even under heavy contention,
  actual IPIing should be minimal.
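
The FIFO slot mechanic described in this commit can be pictured roughly as
a ticket counter. The following is only a sketch under stated assumptions,
not the kernel code: the struct and function names are hypothetical, C11
atomic_fetch_add() stands in for atomic_fetchadd_int(), and slot_us plays
the role of lwkt.spin_delay.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical sketch: names are illustrative, not the kernel's.
 * C11 atomic_fetch_add() stands in for atomic_fetchadd_int().
 */
struct spin_fifo {
	atomic_uint windex;	/* next ticket to hand out */
	atomic_uint rindex;	/* ticket whose retry slot is current */
};

/* A contending cpu takes a ticket with a single atomic fetch-add. */
static unsigned
spin_fifo_enter(struct spin_fifo *f)
{
	return atomic_fetch_add(&f->windex, 1u);
}

/*
 * Microseconds this ticket should wait before its retry slot comes up.
 * Unsigned subtraction keeps the distance correct across index wrap.
 */
static unsigned
spin_fifo_wait_us(struct spin_fifo *f, unsigned ticket, unsigned slot_us)
{
	assert(slot_us > 0);	/* slot time must be positive */
	return (ticket - atomic_load(&f->rindex)) * slot_us;
}

/* After its retry attempt, a cpu advances the FIFO to the next ticket. */
static void
spin_fifo_leave(struct spin_fifo *f)
{
	atomic_fetch_add(&f->rindex, 1u);
}
```

Each contender touches the shared indices a fixed number of times per
attempt, which is one way to read the "bounded cache management traffic"
claim: the cost per retry does not grow with the number of cpus spinning.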
Major kernel build infrastructure changes, part 1/2 (sys).

These changes are primarily designed to create a 2-layer machine and cpu
build hierarchy in order to support virtual kernel builds in the near term
and future porting efforts in the long term.

* Split arch/ into a set of platform architectures under machine/ and a
  set of cpu architectures under cpu/. All platform and cpu header files
  will be accessible via <machine/*.h>. Platform header files may override
  cpu header files (the platform header file then typically #include's
  the cpu header file).

* Any cpu header files that are not overridden will be copied directly
  into /usr/include/machine/, allowing the platform to omit those header
  files (not have to create degenerate forwarding header files).

* All source files access platform and cpu architecture files via the
  <machine/*.h> path. The <cpu/*.h> path should only be used by platform
  header files when including the lower level cpu header files.

* Require both the 'machine' and the 'machine_arch' directives in the
  kernel config file.

* When building modules in the presence of a kernel config, use the IF
  files, use*.h files, and opt*.h files provided by the kernel config and
  do not generate them in each module's object directory. This streamlines
  the module build considerably.
Further normalize the _XXX_H_ symbols used to conditionalize header file inclusion.

Use _MACHINE_BLAH_H_ for headers found in "/usr/src/sys/arch/<arch>/include".
Most headers already did this, but some did not. Use _ARCH_SUBDIR_BLAH_H_
for headers found in "/usr/src/sys/arch/<arch>/subdir" instead of
_I386_SUBDIR_BLAH_H_.

Change #include's made in architecture-specific directories to use
<machine/blah.h> instead of "blah.h", allowing the included header files to
be overridden by another architecture. For example, a virtual kernel
architecture might include a header from arch/i386/include which then
includes some other header in arch/i386/include. But really we want that
other header to also go via arch/vkernel/include, so the header files in
arch/i386/include must use <machine/blah.h> instead of "blah.h" for most of
their sub-includes.

Change most architecture-specific includes such as <i386/icu/icu.h> to use
a generic path through the "arch" softlink, such as <arch/icu/icu.h>.
Remove the temporary -I@/arch shim made in a recent commit; the <arch/...>
mechanism replaces it.

These changes allow us to implement hierarchical architectural overrides,
primarily intended for virtual kernel support. A virtual kernel uses an
architecture of 'vkernel' but must be able to access actual cpu-specific
header files such as those found in arch/i386. It does this using a "cpu"
softlink. For example, someone including <machine/atomic.h> in a vkernel
build would hit the "arch/vkernel/include/atomic.h" header, and this header
could then #include <cpu/atomic.h> to access the actual cpu's atomic.h
file: "arch/i386/include/atomic.h".

The ultimate effect is that an architecture can build on another
architecture's header and source files.
Allow 'options SMP' *WITHOUT* 'options APIC_IO'.

That is, an ability to produce an SMP-capable kernel that uses the PIC/ICU
instead of the IO APICs for interrupt routing.

SMP boxes with broken BIOSes (namely my Shuttle XPC SN95G5) could very well
have serious interrupt routing problems when operating in IO APIC mode. One
solution is to not use the IO APICs. That is, to run only the Local APICs
for the SMP management.

* Don't conditionalize NIDT. Just set it to 256.

* Make the ICU interrupt code MP SAFE. This primarily means using the
  imen_spinlock to protect accesses to icu_imen.

* When running SMP without APIC_IO, set the LAPIC TPR to prevent
  unintentional interrupts. Leave LINT0 enabled (normally with APIC_IO
  LINT0 is disabled when the IO APICs are activated). LINT0 is the virtual
  wire between the 8259 and LAPIC 0.

* Get rid of NRSVIDT. Just use IDT_OFFSET instead.

* Clean up all the APIC_IO tests which should have been SMP tests, and all
  the SMP tests which should have been APIC_IO tests. Explicitly #ifdef out
  all code related to the IO APICs when APIC_IO is not set.
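
To illustrate what "MP SAFE" means for the interrupt mask, here is a
minimal user-space sketch, assuming a simple test-and-set lock. The
variable icu_imen_shadow and the helper names are hypothetical stand-ins
for the kernel's icu_imen and imen_spinlock; real code would also write
the new mask out to the 8259.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's imen_spinlock and icu_imen. */
static atomic_flag imen_lock = ATOMIC_FLAG_INIT;
static uint16_t icu_imen_shadow = 0xffff;	/* all 16 ICU lines masked */

static void
imen_lock_acquire(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&imen_lock,
	    memory_order_acquire))
		;
}

static void
imen_lock_release(void)
{
	atomic_flag_clear_explicit(&imen_lock, memory_order_release);
}

/* Clear the mask bit for one IRQ line under the lock. */
static void
icu_unmask_irq(unsigned irq)
{
	assert(irq < 16);
	imen_lock_acquire();
	icu_imen_shadow &= (uint16_t)~(1u << irq);
	/* Real code would write the updated mask to the 8259 here. */
	imen_lock_release();
}

/* Set the mask bit for one IRQ line under the lock. */
static void
icu_mask_irq(unsigned irq)
{
	assert(irq < 16);
	imen_lock_acquire();
	icu_imen_shadow |= (uint16_t)(1u << irq);
	imen_lock_release();
}
```

Without the lock, two cpus doing read-modify-write on the same mask word
at once could each lose the other's update.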
ICU/APIC cleanup part 9/many. Get rid of machine/smptests.h, remove or implement the related #defines. Distinguish between boot-time vector initialization and interrupt setup and teardown in MACHINTR ABI. Get rid of the ISR test for APIC-generated interrupts and all related support code. Just generate the EOI and pray. Document more of the IO APIC redirection register(s). Intel sure screwed up the LAPIC and IO APIC royally. There is no simple way to poll the actual signal level on a pin, no simple way to manually EOI interrupts or EOI them in the order we desire, no simple way to poll the LAPIC for the vector that will be EOI'd when we send the EOI. We can't mask the interrupt on the IO APIC without triggering stupid legacy code on some machines. We can't even program the IO APIC linearly, it uses a stupid register/data sequence that makes it impossible for access on an SMP system without serialization. It's a goddamn mess, and it is all Intel's fault.
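
The complaint about the register/data sequence can be made concrete with a
small simulation. This is a sketch, not driver code: struct fake_ioapic,
ioapic_lock and the helper names are hypothetical, and the device is
modelled in plain memory. What is taken from the hardware is only the
shape of the interface: a select register picks an index, a separate data
window then reads or writes it, and redirection entry N occupies indices
0x10 + 2N (low dword) and 0x11 + 2N (high dword).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define IOAPIC_REDTBL_LO(pin)	(0x10u + 2u * (pin))	/* low dword */
#define IOAPIC_REDTBL_HI(pin)	(0x11u + 2u * (pin))	/* high dword */

/* Simulated device: registers are reachable only via select + window. */
struct fake_ioapic {
	uint32_t regsel;	/* index chosen by the last select write */
	uint32_t regs[64];	/* register file behind the data window */
};

static atomic_flag ioapic_lock = ATOMIC_FLAG_INIT;

static void
ioapic_write(struct fake_ioapic *io, uint32_t index, uint32_t val)
{
	assert(index < 64);
	while (atomic_flag_test_and_set_explicit(&ioapic_lock,
	    memory_order_acquire))
		;
	io->regsel = index;		/* step 1: select the register */
	io->regs[io->regsel] = val;	/* step 2: write through the window */
	atomic_flag_clear_explicit(&ioapic_lock, memory_order_release);
}

static uint32_t
ioapic_read(struct fake_ioapic *io, uint32_t index)
{
	uint32_t val;

	assert(index < 64);
	while (atomic_flag_test_and_set_explicit(&ioapic_lock,
	    memory_order_acquire))
		;
	io->regsel = index;		/* step 1: select the register */
	val = io->regs[io->regsel];	/* step 2: read through the window */
	atomic_flag_clear_explicit(&ioapic_lock, memory_order_release);
	return val;
}
```

If another cpu changed regsel between step 1 and step 2, the access would
land on the wrong register, which is why the pair has to be serialized on
an SMP system.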
ICU/APIC cleanup part 7/many. Get rid of most of the dependencies on ICU_LEN, NSWI, and NHWI, by creating a generous system standard maximum for hardware and software interrupts in the MI sys/interrupt.h. The interrupt architecture can then further limit available hardware and software interrupts. For example, i386 uses 32-bit masks and so is limited to 32 hardware interrupts and 32 software interrupts. The name ICU_OFFSET is confusing, so rename it to IDT_OFFSET, which is what it really is. Note that this separation is possible due to recent work on the MI interrupt layer. Separate the software interrupt mask from the hardware interrupt mask in the i386 code. Get rid of rndcontrol's 16 irq limit by creating a new ioctl to iterate through interrupt numbers.
ICU/APIC cleanup part 6/many. Move the APIC and ICU vector arrays into the new machine interrupt ABI. Move the interrupt vector setup and teardown code into the new ABI. Make FAST_HI the default and remove the #define. Add a vector control function to the machine interrupt ABI. Start changing names of globals so we can eventually link both ICU and APIC interrupt code into the same binary. Note that 'fastunpend' has not yet been renamed.
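
One way to picture a machine interrupt ABI that lets both backends live in
the same binary is a table of function pointers. The sketch below is an
assumption for illustration only: struct machintr_sketch, its members and
the icu_/apic_ helpers are hypothetical names, not the kernel's actual
MACHINTR definitions, and the "vectors" are just bits in a word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ABI table: setup, teardown and a control/query hook. */
struct machintr_sketch {
	const char *name;
	void (*vector_setup)(int intr);
	void (*vector_teardown)(int intr);
	int  (*vector_ctl)(int intr);	/* here: 1 if intr is set up */
};

/* Distinct prefixes let both backends link into the same binary. */
static uint32_t icu_vectors;
static uint32_t apic_vectors;

static void icu_vector_setup(int intr)
{
	assert(intr >= 0 && intr < 16);
	icu_vectors |= 1u << intr;
}
static void icu_vector_teardown(int intr)
{
	assert(intr >= 0 && intr < 16);
	icu_vectors &= ~(1u << intr);
}
static int icu_vector_ctl(int intr)
{
	assert(intr >= 0 && intr < 16);
	return (icu_vectors >> intr) & 1u;
}

static void apic_vector_setup(int intr)
{
	assert(intr >= 0 && intr < 32);
	apic_vectors |= 1u << intr;
}
static void apic_vector_teardown(int intr)
{
	assert(intr >= 0 && intr < 32);
	apic_vectors &= ~(1u << intr);
}
static int apic_vector_ctl(int intr)
{
	assert(intr >= 0 && intr < 32);
	return (apic_vectors >> intr) & 1u;
}

static const struct machintr_sketch icu_abi = {
	"icu", icu_vector_setup, icu_vector_teardown, icu_vector_ctl
};
static const struct machintr_sketch apic_abi = {
	"apic", apic_vector_setup, apic_vector_teardown, apic_vector_ctl
};

/* The active backend is chosen at run time rather than by #ifdef. */
static const struct machintr_sketch *machintr = &icu_abi;
```

Callers go through machintr->vector_setup() and friends, so switching
backends is a pointer assignment instead of a rebuild.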