kernel - Add per-process capability-based restrictions * This new system allows userland to set capability restrictions that turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features. * Add new system calls and a manual page for syscap_get(2) and syscap_set(2) * Add sys/caps.h * Add the "setcaps" userland utility and manual page. * Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure. * The intention is to add path restriction lists and similar features to improve jail-less security in the near future, and to optimize the priv_check code.
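The "inherited, and never removable" rule described above can be pictured with a small model. This is only a sketch under stated assumptions: the `demo_*` names and the two restriction bits are hypothetical and are not taken from sys/caps.h or from the syscap_set(2) interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical restriction bits, for illustration only. */
enum {
    DEMO_RESTRICT_MOUNT = 1u << 0,
    DEMO_RESTRICT_ROOT  = 1u << 1
};

/* Hypothetical per-process state: a mask of active restrictions. */
struct demo_proc {
    uint32_t restrictions;
};

/* Restrictions can only be added (OR-ed in); there is no clear operation. */
static uint32_t demo_restrict(struct demo_proc *p, uint32_t bits)
{
    assert(p != NULL);
    p->restrictions |= bits;
    return p->restrictions;
}

/* A child starts with a copy of the parent's restrictions. */
static struct demo_proc demo_fork(const struct demo_proc *parent)
{
    struct demo_proc child = { parent->restrictions };
    return child;
}

/* A feature is allowed only while its restriction bit is not set. */
static int demo_allowed(const struct demo_proc *p, uint32_t bit)
{
    return (p->restrictions & bit) == 0;
}
```

Because the only mutator is an OR, a restriction set in a parent is visible in every descendant and no descendant can drop it.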
kernel - Refactor in-kernel system call API to remove bcopy() * Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only. The old prototype, int sy_call_t (void *); becomes int sy_call_t (struct sysmsg *sysmsg, const void *); * System calls with 6 arguments or fewer no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe. The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
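A minimal sketch of the "point into the trapframe" idea follows. All `demo_*` types and functions are hypothetical stand-ins: the real trapframe layout and the real `struct sysmsg` are not reproduced here, and the sketch assumes the argument registers sit contiguously in the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REGARGS 6

/* Hypothetical trapframe: the six argument registers, stored contiguously. */
struct demo_trapframe {
    int64_t regs[DEMO_REGARGS];
};

/* Hypothetical stand-in for the message that carries the return value. */
struct demo_sysmsg {
    int64_t result;
};

/* New-style prototype: message first, read-only argument pointer second. */
typedef int demo_sy_call_t(struct demo_sysmsg *sysmsg, const void *uap);

/* Argument layout of a hypothetical two-argument system call. */
struct demo_add_args {
    int64_t a;
    int64_t b;
};

static int demo_sys_add(struct demo_sysmsg *sysmsg, const void *uap)
{
    const struct demo_add_args *args = uap;  /* read-only view */

    sysmsg->result = args->a + args->b;
    return 0;
}

/* Dispatch without copying: hand the handler a pointer into the frame. */
static int demo_dispatch(demo_sy_call_t *call,
                         const struct demo_trapframe *tf,
                         struct demo_sysmsg *sysmsg)
{
    assert(call != NULL);
    return call(sysmsg, tf->regs);
}
```

The `const` qualifier is what makes this safe in the sketch: the handler cannot scribble on the saved registers through its argument pointer.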
kernel: Move semicolon from the definition of SYSINIT() to its invocations. This affected around 70 of our (more or less) 270 SYSINIT() calls. style(9) advocates the terminating semicolon to be supplied by the invocation too, because it can make life easier for editors and other source code parsing programs.
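The convention can be illustrated with a toy macro. `DEMO_SYSINIT` is a hypothetical, heavily simplified stand-in; the real SYSINIT() takes more arguments and registers its entry in a linker set rather than declaring a function pointer.

```c
#include <assert.h>
#include <stddef.h>

static int demo_counter;

static void demo_init(void)
{
    demo_counter++;
}

/*
 * The macro body deliberately ends WITHOUT a semicolon, so each
 * invocation reads like an ordinary declaration or statement.
 */
#define DEMO_SYSINIT(name, func) \
    static void (*const demo_sysinit_##name)(void) = (func)

/* The terminating semicolon is supplied here, at the invocation. */
DEMO_SYSINIT(counter, demo_init);

/* Helper that runs one registered init routine. */
static void demo_run(void (*fn)(void))
{
    assert(fn != NULL);
    fn();
}
```

With the semicolon at the call site, editors and source parsers see a normally terminated line instead of a bare macro call.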
kernel - Change time_second to time_uptime for all expiration calculations * Vet the entire kernel and change use cases for expiration calculations using time_second to use time_uptime instead. * Protects these expiration calculations from step changes in the wall time, particularly needed for route table entries. * Probably requires further variable type adjustments but the use of time_uptime instead of time_second is highly unlikely to ever overrun any demotions to int still present.
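The effect of a wall-clock step on an expiration check can be shown with a small simulation. This is a sketch only: `demo_clocks` is a hypothetical pair of counters standing in for time_second (wall) and time_uptime (uptime), not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins: wall ~ time_second, uptime ~ time_uptime. */
struct demo_clocks {
    int64_t wall;
    int64_t uptime;
};

/* Normal passage of time advances both clocks. */
static void demo_tick(struct demo_clocks *c, int64_t secs)
{
    assert(secs >= 0);
    c->wall += secs;
    c->uptime += secs;
}

/* A step change (manual set, large correction) moves only the wall clock. */
static void demo_step_wall(struct demo_clocks *c, int64_t delta)
{
    c->wall += delta;
}

/* An entry has expired once "now" reaches its deadline. */
static int demo_expired(int64_t now, int64_t deadline)
{
    return now >= deadline;
}
```

In this model, a 30 second timeout computed against the wall clock expires immediately if the clock is stepped forward by an hour, while the same timeout computed against uptime still lasts 30 seconds.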
kernel - Move mplock to machine-independent C * Remove the per-platform mplock code and move it all into machine-independent code: sys/mplock2.h and kern/kern_mplock.c. * Inline the critical path. * When a conflict occurs kern_mplock.c will KTR log the file and line number of both the holder and conflicting acquirer. Set debug.ktr.giant_enable=-1 to enable conflict logging.
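Recording the file and line of both the holder and the conflicting acquirer might look roughly like the following. This is a single-threaded sketch with hypothetical `demo_*` names; it uses no atomics, does no KTR logging, and is not the kern_mplock.c implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock that remembers where it was acquired. */
struct demo_mplock {
    int held;
    const char *file;
    int line;
};

/*
 * Try to take the lock. On success, record the acquirer's location and
 * return 1. On conflict, report the current holder's location through
 * hfile/hline and return 0 (the caller knows its own location).
 */
static int demo_try_lock(struct demo_mplock *l, const char *file, int line,
                         const char **hfile, int *hline)
{
    assert(l != NULL && hfile != NULL && hline != NULL);
    if (l->held) {
        *hfile = l->file;
        *hline = l->line;
        return 0;
    }
    l->held = 1;
    l->file = file;
    l->line = line;
    return 1;
}

/* The macro captures the call site so callers do not pass it by hand. */
#define DEMO_TRY_LOCK(l, hf, hl) \
    demo_try_lock((l), __FILE__, __LINE__, (hf), (hl))

static void demo_unlock(struct demo_mplock *l)
{
    l->held = 0;
    l->file = NULL;
    l->line = 0;
}
```

Wrapping the acquisition in a macro is what makes the call-site information available at no cost to the caller.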
kernel - Move MP lock inward, plus misc other stuff * Remove the MPSAFE flag from the syscalls.master file. All system calls are now called without the MP lock held and will acquire the MP lock if necessary. * Shift the MP lock inward. Try to leave most copyin/copyout operations outside the MP lock. Reorder some of the copyouts in the linux emulation code to suit. Kernel resource operations are MP safe. Process ucred access is now outside the MP lock but not quite MP safe yet (will be fixed in a followup). * Remove unnecessary KKASSERT(p) calls left over from the time before system calls were prefixed with sys_* * Fix a bunch of cases in the linux emulation code when setting groups where the ngrp range check is incorrect.
Revamp SYSINIT ordering. Relabel sysinit IDs (SI_* in sys/kernel.h) to make them less confusing, particularly with regard to the relative order init routines are called in. Reorder many sysinits. Reorder the SMP and CLOCK code to bring all the cpus up far earlier in the boot sequence and to make the full threading and clocking subsystems available for device config.
Make access to basetime MP safe and interrupt-race safe by using a simple tail-chasing FIFO for updates to basetime. Reorganize the PROC sysctl's. This actually undoes part of the last commit and redoes it, though there was nothing wrong with the last commit. Move the SYSCTL_OUT phase to *after* the SYSCTL_IN phase.
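One way to read "tail-chasing FIFO" is a small ring of values plus an index that is published only after the new slot has been filled. The sketch below uses C11 atomics and hypothetical `demo_*` names; it assumes a single (or externally serialized) writer and assumes readers finish with a slot long before the writer wraps around the ring. It is not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

#define DEMO_SLOTS 16

static struct timespec demo_basetime[DEMO_SLOTS];
static atomic_uint demo_index;  /* slot holding the current value */

/* Writer: fill the NEXT slot first, then publish its index. */
static void demo_set_basetime(const struct timespec *ts)
{
    unsigned next;

    assert(ts != NULL);
    next = (atomic_load_explicit(&demo_index, memory_order_relaxed) + 1)
           % DEMO_SLOTS;
    demo_basetime[next] = *ts;
    atomic_store_explicit(&demo_index, next, memory_order_release);
}

/* Reader: take the published index, then copy that slot. No lock needed. */
static struct timespec demo_get_basetime(void)
{
    unsigned cur = atomic_load_explicit(&demo_index, memory_order_acquire);

    return demo_basetime[cur];
}
```

A reader interrupted mid-copy still sees a fully written slot, because the writer never modifies the slot that is currently published.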
This commit represents a major revamping of the clock interrupt and timebase infrastructure in DragonFly. * Rip out the existing 8254 timer 0 code, and also disable the use of Timer 2 (which means that the PC speaker will no longer go beep). Timer 0 used to represent a periodic interrupt and a great deal of code was in place to attempt to obtain a timebase off of that periodic interrupt. Timer 0 is now used in software retriggerable one-shot mode to produce variable-delay interrupts. A new hardware interrupt clock abstraction called SYSTIMERS has been introduced which allows threads to register periodic or one-shot interrupt/IPI callbacks at approximately 1uS granularity. Timer 2 is now set in continuous periodic mode with a period of 65536 and provides the timebase for the system, abstracted to 32 bits. All the old platform-integrated hardclock() and statclock() code has been rewritten. The old IPI forwarding code has been #if 0'd out and will soon be entirely removed (the systimer abstraction takes care of multi-cpu registrations now). The architecture-specific clkintr() now simply calls an entry point into the systimer and provides a Timer 0 reload and Timer 2 timebase function API. * On both UP and SMP systems, cpus register systimer interrupts for the Hz interrupt, the stat interrupt, and the scheduler round-robin interrupt. The abstraction is carefully designed to allow multiple interrupts occurring at the same time to be processed in a single hardware interrupt. While we currently use IPI's to distribute requested interrupts from other cpu's, the intent is to use the abstraction to take advantage of per-cpu timers when available (e.g. on the LAPIC) in the future. systimer interrupts run OUTSIDE THE MP LOCK. Entry points may be called from the hard interrupt or via an IPI message (IPI messages have always run outside the MP lock). * Rip out timecounters and disable alternative timecounter code for other time sources. This is temporary.
Eventually other time sources, such as the TSC, will be reintegrated as independent, parallel-running entities. There will be no 'time switching' per se; subsystems will be able to select which timebase they wish to use. It is desirable to reintegrate at least the TSC to improve [get]{micro,nano}[up]time() performance. WARNING: PPS events may not work properly. They were not removed, but they have not been retested with the new code either. * Remove spl protection around [get]{micro,nano}[up]time() calls; they are now internally protected. * Use uptime instead of realtime in certain CAM timeout tests * Remove struct clockframe. Use struct intrframe everywhere where clockframe used to be used. * Replace most splstatclock() protections with crit_*() protections, because such protections must now also protect against IPI messaging interrupts. * Add fields to the per-cpu globaldata structure to access timebase related information using only a critical section rather than a mutex. However, the 8254 Timer 2 access code still uses spin locks. More work needs to be done here; the 'realtime' correction is still done in a single global 'struct timespec basetime' structure. * Remove the CLKINTR_PENDING icu and apic interrupt hacks. * Augment the IPI Messaging code to make an intrframe available to callbacks. * Document 8254 timing modes in i386/isa/timerreg.h. Note that at the moment we assume an 8254 instead of an 8253 as we are using TIMER_SWSTROBE mode. This may or may not have to be changed to an 8253 mode. * Integrate the NTP correction code into the new timebase subsystem. * Separate boottime from basetime. Once boottime is believed to be stable it is no longer affected by NTP or other time corrections. CAVEATS: * PC speaker no longer works * Profiling interrupt rate not increased (it needs work to be made operational on a per-cpu basis rather than system-wide). * The native timebase API is function-based, but currently hardwired.
* There might or might not be issues with 486 systems due to the timer mode I am using.
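The "period of 65536 ... abstracted to 32 bits" timebase from this commit can be sketched as a wrap-extending reader. The `demo_*` names are hypothetical, the raw value is assumed to count upward (a down-counting timer such as the 8254 would need its reading negated first), and the function must be called at least once per 65536 ticks for the result to be correct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for extending a 16-bit counter to 32 bits. */
struct demo_timebase {
    uint16_t last;  /* last raw 16-bit reading */
    uint32_t base;  /* accumulated wraps, in units of 65536 ticks */
};

/*
 * Fold one raw reading into the 32-bit timebase. A reading smaller
 * than the previous one means the 16-bit counter wrapped once.
 */
static uint32_t demo_timebase_read(struct demo_timebase *tb, uint16_t raw)
{
    assert(tb != NULL);
    if (raw < tb->last)
        tb->base += 0x10000u;
    tb->last = raw;
    return tb->base + raw;
}
```

The 32-bit result itself wraps eventually, so consumers in this model would compare timestamps by unsigned subtraction rather than by magnitude.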
syscall messaging 3: Expand the 'header' that goes in front of the syscall arguments in the kernel copy. The header was previously just an lwkt_msg. The header is now a 'union sysmsg'. 'union sysmsg' contains an lwkt_msg plus space for the additional metadata required to asynchronize various system calls. We haven't actually asynchronized anything yet and will not be able to until the reply port and abort processing infrastructure is in place. See sys/sysmsg.h for more information on the new header. Also clean up syscall generation somewhat and add some ibcs2 stuff I missed.
syscall messaging 2: Change the standard return value storage for system calls from proc->p_retval[] to the message structure embedded in the syscall. System calls used to set their non-error return value in p_retval[] but must now set it in the message structure. This is a necessary precursor to any sort of asynchronization, for obvious reasons. This work was particularly annoying because all the emulation code declares and manually fills in syscall argument structures. This commit could potentially destabilize some of the emulation code but I went through the most important Linux emulation code three times and tested it with linux-mozilla, so I am fairly confident that I got it right. Note: proper linux emulation requires setting the fallback elf brand to 3 or it will default to SVR4. It really ought to default to linux (3), not SVR4. sysctl -w kern.fallback_elf_brand=3
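Why per-message return storage is a precursor to asynchronous calls can be seen in a small model: with one result slot per process, two in-flight calls would overwrite each other, whereas a result stored in each call's own message cannot collide. The `demo_*` names below are hypothetical and the "system call" is a toy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical message embedded at the front of the argument structure. */
struct demo_sysmsg {
    int  error;
    long result;  /* non-error return value lives here, not in the proc */
};

struct demo_dup_args {
    struct demo_sysmsg msg;
    int fd;
};

/*
 * Toy system call: the error code is the function's return value,
 * while the success value is written into the call's own message.
 */
static int demo_sys_dup(struct demo_dup_args *uap)
{
    assert(uap != NULL);
    if (uap->fd < 0)
        return EBADF;
    uap->msg.result = uap->fd + 100;
    return 0;
}
```

Each caller reads the result back from the structure it passed in, so any number of calls can be outstanding at once in this model.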
Preliminary syscall messaging work. Adjust all <syscall>_args structures to include an lwkt_msg at their base which will eventually allow syscalls to run asynch. Note that this is for the kernel copy of the arguments, the userland argument format has not changed for the standard syscall entry point. Begin abstracting a messaging syscall interface (#if 0'd out at the moment). Change the syscall2 entry point to take the new expanded argument structure into account. Change sysent argument calculation (AS macro) to take the new expanded argument structure into account. Note: existing linux, svr4, and ibcs2 emulation may break with this commit, though it is not intentional.
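Placing a message header at the base of every argument structure means any count of "argument words" has to subtract the header. The sketch below shows that adjustment with hypothetical names: `demo_msg` stands in for lwkt_msg, and `DEMO_AS` is an illustrative macro, not the actual AS macro from the sysent code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the message header. */
struct demo_msg {
    int     opcode;
    int     error;
    int64_t result;
};

typedef int64_t demo_register_t;

/* Kernel-side copy of the arguments: header first, then the real args. */
struct demo_write_args {
    struct demo_msg header;
    demo_register_t fd;
    demo_register_t buf;
    demo_register_t nbyte;
};

/* The header must sit at offset 0 so a message pointer aliases the args. */
static_assert(offsetof(struct demo_write_args, header) == 0,
              "message header must be the first member");

/* Number of argument words, excluding the header that precedes them. */
#define DEMO_AS(name) \
    ((sizeof(struct name) - sizeof(struct demo_msg)) / sizeof(demo_register_t))
```

Only the kernel copy grows in this model; the words copied in from userland are still just the three arguments.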