kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
  turn off numerous kernel features and root accesses. These
  restrictions are inherited by sub-processes recursively. Once set,
  restrictions cannot be removed.

  Basic restrictions that mimic an unadorned jail can be enabled without
  creating a jail, but generally speaking real security also requires
  creating a chrooted filesystem topology, and a jail is still needed to
  really segregate processes from each other. If you do so, however, you
  can (for example) disable mount/umount and most global root-only
  features.

* Add new system calls and a manual page for syscap_get(2) and
  syscap_set(2)

* Add sys/caps.h

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with a
  newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features to
  improve jail-less security in the near future, and to optimize the
  priv_check code.

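The "inherited, and never removable" rule above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the demo_* names and the two restriction bits are hypothetical and are not the constants from sys/caps.h, and the real syscap_set(2) interface is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restriction bits; the real constants live in sys/caps.h. */
#define DEMO_RESTRICT_MOUNT   (UINT64_C(1) << 0)
#define DEMO_RESTRICT_ROOT    (UINT64_C(1) << 1)

/* Hypothetical per-process state: a set bit means the feature is off. */
struct demo_proc {
    uint64_t restricted;
};

/* Restrictions can only be added. There is deliberately no clear call. */
static void demo_restrict(struct demo_proc *p, uint64_t bits)
{
    uint64_t before = p->restricted;

    p->restricted |= bits;
    assert((p->restricted & before) == before);  /* nothing was removed */
}

/* A sub-process starts with a copy of its parent's restrictions. */
static struct demo_proc demo_fork(const struct demo_proc *parent)
{
    struct demo_proc child = { parent->restricted };

    return child;
}

/* Returns nonzero if the feature guarded by 'bit' is still available. */
static int demo_allowed(const struct demo_proc *p, uint64_t bit)
{
    return (p->restricted & bit) == 0;
}
```

In this model a child forked after demo_restrict() sees the same restrictions as its parent, and restricting the child further does not affect the parent.
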
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only.

    old: int sy_call_t (void *);
    new: int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with 6 arguments or fewer no longer need to copy the
  arguments from the trapframe to a holding structure. Instead, we
  simply point into the trapframe.

  The L1 cache footprint will be a bit smaller, but in simple tests the
  results are not noticeably faster... maybe 1ns or so (roughly 1%).

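The shape of the new calling convention can be sketched in user space as follows. The demo_* types are hypothetical stand-ins and do not match the kernel's struct sysmsg or trapframe layout. The point is only that the handler receives a const pointer aimed directly at the saved argument registers, so no copy is made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved user registers. */
struct demo_trapframe {
    uint64_t regs[6];       /* up to six register-passed arguments */
};

/* Hypothetical stand-in for struct sysmsg: holds the return value. */
struct demo_sysmsg {
    int64_t result;
};

/* New-style handler type: result via sysmsg, arguments read-only. */
typedef int demo_sy_call_t(struct demo_sysmsg *sysmsg, const void *uap);

/* A made-up system call that adds its first two arguments. */
static int demo_sys_add(struct demo_sysmsg *sysmsg, const void *uap)
{
    const uint64_t *args = uap;     /* points into the trapframe */

    sysmsg->result = (int64_t)(args[0] + args[1]);
    return 0;
}

/* Dispatch by pointing at the saved registers instead of copying them. */
static int demo_dispatch(demo_sy_call_t *fn, struct demo_sysmsg *sysmsg,
                         const struct demo_trapframe *tf)
{
    assert(fn != NULL);
    return fn(sysmsg, tf->regs);
}
```

Because the argument pointer is const, a handler in this model cannot scribble on the saved registers, which is what makes pointing into the trapframe safe.
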
jail - add jail.defaults.allow_listen_override

* Add jail.defaults.allow_listen_override (also per-jail settable).
  This feature is disabled by default. When enabled, this feature allows
  both wildcard and non-wildcard listen sockets in the jail to override
  wildcard listen sockets on the host. These sockets will be masked by
  the jail's IP list, meaning that a wildcard socket in the jail
  effectively covers just the jail's IP list.

  Non-wildcard listen sockets on the host are not overridden.

  Use of this feature allows the host to operate normally, without
  having to make its services jail-friendly. Only those services which
  bind to specific IPs that might conflict with the jail IPs will need
  modification, and only if the jail needs to have that service as well.

* In order to use the feature safely each jail should be given its own
  unique IPs for both localhost and its externally routable IP. For
  example:

    jail -u root / tr3990xJ 127.0.0.2,10.0.0.139 /bin/csh

  ifconfig can be used on the host to create multiple 127.0.0.X aliases
  on lo0 and to assign additional routable IPs to the machine for use in
  its jails. For example:

    ifconfig lo0 inet 127.0.0.2 alias
    ifconfig lo0 inet 127.0.0.3 alias
    ifconfig lo0 inet6 ::2 alias
    ifconfig lo0 inet6 ::3 alias
    ifconfig em0 inet 10.0.0.139 netmask 255.255.0.0 alias
    ifconfig em0 inet 10.0.0.140 netmask 255.255.0.0 alias
    ...

* Within a jail, use of localhost (127.0.0.1 or ::1) will automatically
  be converted to the jail's localhost IP (such as 127.0.0.2). Also,
  accept(), getsockname(), and getpeername() will translate the jail's
  localhost IP back to 127.0.0.1 or ::1. Most services within the jail
  can thus use localhost without being the wiser.

* Listen address/port pairs within a jail can now be overloaded with the
  same address/port pairs on the host, or overloaded versus other jails
  without generating an error.

  However, accessibility to these ports is governed by the
  'jail.defaults.allow_listen_override' sysctl setting for the jail (or
  the jail-specific version of the same sysctl).

  Any jail-to-jail overloading of identical address/port pairs is
  allowed, but operationally undefined. Only one jail will receive
  connections. It is best to supply each jail with its own unique local
  and routable IPs.

* IPv6 is now fully supported using the same mechanisms. You can supply
  a mix of IPv4 and IPv6 addresses in the jail command if desired. The
  overloading feature works the same.

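The localhost translation described above can be sketched as a pair of mapping functions. This is a minimal model and not the kernel code: the demo_* names are hypothetical, only IPv4 is covered, addresses are held in host byte order for readability, and it is an assumption that one mapping is applied when an address enters the kernel and the other when an address is reported back.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOOPBACK  UINT32_C(0x7f000001)     /* 127.0.0.1 */

/* Hypothetical per-jail state: the jail's private localhost address. */
struct demo_jail {
    uint32_t localhost_ip;      /* e.g. 127.0.0.2 */
};

/* Inbound direction: generic localhost becomes the jail's localhost. */
static uint32_t demo_addr_in(const struct demo_jail *j, uint32_t addr)
{
    /* Each jail is expected to have its own unique localhost IP. */
    assert(j->localhost_ip != DEMO_LOOPBACK);
    return (addr == DEMO_LOOPBACK) ? j->localhost_ip : addr;
}

/*
 * Outbound direction, as for accept(), getsockname() and getpeername():
 * the jail's localhost is reported back as generic localhost.
 */
static uint32_t demo_addr_out(const struct demo_jail *j, uint32_t addr)
{
    return (addr == j->localhost_ip) ? DEMO_LOOPBACK : addr;
}
```

Addresses other than localhost, such as the jail's routable IP, pass through both functions unchanged, which is why services inside the jail do not need to know about the substitution.
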
jail: Allow jails to mount nullfs(5) and tmpfs(5)

- The code is structured in a way that it should be easy to add more
  filesystems in the future.

- User mounts are disabled in jails for now.

- Jails are not allowed to unmount filesystems that were not mounted
  within the jail itself.

Reviewed by: dillon, mjg

jail: Simplify a bit by using the new BIT64 sysctl functions

- No functional changes.

- The per-jail settings have been renamed to match the new capability
  constants. The default settings will be renamed soon too.

- Fix a missing prison chflags check in ufs_setattr() and
  ext2fs_setattr().

jail - Rework sysctl configuration variables

- Jail sysctls are now jail-specific so that different jails can have
  different settings. Each jail will have its own subtree which can be
  operated directly with sysctl(8).

  Naming convention: jail.<n>.<setting>

- All previous sysctls are now moved to 'jail.defaults' and they are
  used as a template for any newly created jail.

Example:

    # jls
    JID Hostname  Path      IPs
    2   t02.local /jails/02 10.0.0.3
    1   t01.local /jails/01 10.0.0.2

    # sysctl jail
    jail.jailed: 0
    jail.list: 2 t02.local /jails/02 10.0.0.3
    1 t01.local /jails/01 10.0.0.2
    jail.defaults.allow_raw_sockets: 0
    jail.defaults.chflags_allowed: 0
    jail.defaults.sysvipc_allowed: 0
    jail.defaults.socket_unixiproute_only: 1
    jail.defaults.set_hostname_allowed: 1
    jail.1.set_hostname_allowed: 1
    jail.1.socket_unixiproute_only: 1
    jail.1.sysvipc_allowed: 0
    jail.1.chflags_allowed: 0
    jail.1.allow_raw_sockets: 0
    jail.2.set_hostname_allowed: 1
    jail.2.socket_unixiproute_only: 1
    jail.2.sysvipc_allowed: 0
    jail.2.chflags_allowed: 0
    jail.2.allow_raw_sockets: 0

    # sysctl jail.2.allow_raw_sockets=1
    jail.2.allow_raw_sockets: 0 -> 1

    # jexec 2 ping -q -c 1 10.0.0.1
    PING 10.0.0.1 (10.0.0.1): 56 data bytes
    --- 10.0.0.1 ping statistics ---
    1 packets transmitted, 1 packets received, 0.0% packet loss
    round-trip min/avg/max/stddev = 0.766/0.766/0.766/0.000 ms

    # jexec 1 ping -q -c 1 10.0.0.1
    ping: socket: Operation not permitted

    # service jail stop
    Stopping jails: t01.local t02.local.

    # sysctl jail
    jail.jailed: 0
    jail.defaults.allow_raw_sockets: 0
    jail.defaults.chflags_allowed: 0
    jail.defaults.sysvipc_allowed: 0
    jail.defaults.socket_unixiproute_only: 1
    jail.defaults.set_hostname_allowed: 1

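The "defaults as a template" behavior in this commit can be modeled in a few lines. This is a sketch only: the demo_* names and flag bits are hypothetical, and the kernel's actual per-jail storage and sysctl handlers are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical setting bits, loosely named after the sysctls above. */
#define DEMO_ALLOW_RAW_SOCKETS     (UINT64_C(1) << 0)
#define DEMO_SET_HOSTNAME_ALLOWED  (UINT64_C(1) << 1)

/* Hypothetical per-jail state. */
struct demo_jail {
    int      id;
    uint64_t settings;
};

/* Stand-in for the jail.defaults subtree. */
static uint64_t demo_jail_defaults = DEMO_SET_HOSTNAME_ALLOWED;

/* A new jail copies the defaults once, at creation time. */
static void demo_jail_create(struct demo_jail *j, int id)
{
    assert(j != NULL);
    j->id = id;
    j->settings = demo_jail_defaults;
}

/* Stand-in for writing jail.<n>.<setting>: only this jail changes. */
static void demo_jail_set(struct demo_jail *j, uint64_t bit, int on)
{
    if (on)
        j->settings |= bit;
    else
        j->settings &= ~bit;
}

/* Stand-in for reading jail.<n>.<setting>. */
static int demo_jail_get(const struct demo_jail *j, uint64_t bit)
{
    return (j->settings & bit) != 0;
}
```

This mirrors the example session: enabling raw sockets for jail 2 leaves jail 1 and the defaults untouched.
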
jail: Implement read-only sysctl "jail.jailed"

Implement the read-only sysctl entry 'jail.jailed', which can be used to
determine if a process is running inside a jail (value is 1) or not
(value is 0).

NOTE: The current FreeBSD has such a sysctl entry called
'security.jail.jailed'. However, DragonFly BSD does not have any
'security.jail.*' sysctl entries, only 'jail.*' ones.

Meanwhile, update /etc/rc to use this new sysctl entry to better deal
with the rc scripts with the 'nojail' keyword.

Also document this sysctl entry in the jail.8 man page.

This commit is based mostly on FreeBSD as well as the patch in bug
report #118.

Reviewed-by: dillon, mjg (Mateusz Guzik)
Bug-report: #118

kernel - Fix rare ucred race

* In a threaded program if one thread is modifying the ucred, e.g.
  changing the uid or gid or something like that, and another thread
  enters a system call at the same time, the second thread can wind up
  trying to hold a stale ucred kfree()'d by the first thread.

* Very rare race on top of a ~2-instruction window.

* Fix the problem by obtaining proc->p_spin when updating the per-thread
  ucred cache (td->td_ucred) from p->p_ucred, as well as when replacing
  p_ucred.

  These fixes do NOT impose any critical-path overhead. For the case
  where a thread already has the current p_ucred cached on entry to a
  system call, absolutely nothing needs to be done.

Reported-by: joris (Joris Giovannangeli)

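The locking pattern described above can be sketched in user space. This is a model under stated assumptions, not the kernel change itself: the demo_* names are hypothetical, a pthread mutex stands in for proc->p_spin, reference counts are plain integers that are only modified under that lock, and freeing a credential whose count reaches zero is left to the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical credential with a reference count. */
struct demo_cred {
    int refs;
    int uid;
};

struct demo_proc {
    pthread_mutex_t   spin;     /* stand-in for proc->p_spin */
    struct demo_cred *cred;     /* stand-in for p->p_ucred */
};

struct demo_thread {
    struct demo_proc *proc;
    struct demo_cred *cred;     /* stand-in for td->td_ucred */
};

/* Called on system call entry to refresh the per-thread cache. */
static void demo_sync_cred(struct demo_thread *td)
{
    struct demo_proc *p = td->proc;
    struct demo_cred *old;

    if (td->cred == p->cred)    /* fast path: cache is current, no lock */
        return;

    pthread_mutex_lock(&p->spin);
    assert(p->cred != NULL);
    old = td->cred;
    td->cred = p->cred;
    td->cred->refs++;           /* hold taken while p->cred cannot change */
    if (old != NULL)
        old->refs--;
    pthread_mutex_unlock(&p->spin);
}

/* Replace the process credential under the same lock. */
static struct demo_cred *demo_replace_cred(struct demo_proc *p,
                                           struct demo_cred *newcred)
{
    struct demo_cred *old;

    pthread_mutex_lock(&p->spin);
    old = p->cred;
    newcred->refs++;
    p->cred = newcred;
    old->refs--;
    pthread_mutex_unlock(&p->spin);
    return old;                 /* caller frees it once refs reaches zero */
}
```

Since both the replacement and the cache refresh take the same lock, a thread in this model can never take a hold on a credential after its last reference has been dropped. The unlocked comparison on the fast path only decides whether any work is needed.
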
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token. In
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks.

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between endtsleep()
  and lwkt_switch().

* Rename proc->p_flag to proc->p_flags

* Rename lwp->lwp_flag to lwp->lwp_flags

* Add lwp->lwp_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add some freeze testing code to the x86-64 trap code (default
  disabled).

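The split between ordinary flags and "mpflags" can be illustrated with C11 atomics. This is a user-space sketch with hypothetical demo_* names and flag values. The kernel uses its own atomic primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits for illustration only. */
#define DEMO_MPFLAG_WAKEUP  0x1u
#define DEMO_MPFLAG_SIGNAL  0x2u

struct demo_lwp {
    unsigned int flags;     /* only touched by the owning thread */
    atomic_uint  mpflags;   /* may be adjusted by other threads */
};

static void demo_lwp_init(struct demo_lwp *lp)
{
    lp->flags = 0;
    atomic_init(&lp->mpflags, 0u);
}

/* Atomic read-modify-write, safe when called from any thread. */
static void demo_mpflag_set(struct demo_lwp *lp, unsigned int bits)
{
    assert(bits != 0);
    atomic_fetch_or(&lp->mpflags, bits);
}

static void demo_mpflag_clear(struct demo_lwp *lp, unsigned int bits)
{
    assert(bits != 0);
    atomic_fetch_and(&lp->mpflags, ~bits);
}

static int demo_mpflag_test(struct demo_lwp *lp, unsigned int bits)
{
    return (atomic_load(&lp->mpflags) & bits) != 0;
}
```

Keeping the two kinds of flags in separate words means a plain, non-atomic update of the owner-only field cannot overwrite a bit that another thread set concurrently.
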