# DragonFly BSD 4.8

* Version 4.8.0 released 27 March 2017
* Version 4.8.1 released 02 August 2017

DragonFly version 4.8 brings EFI boot support in the installer, further speed improvements in the kernel, a new NVMe driver, a new eMMC driver, and Intel video driver updates.

The details of all commits between the 4.6 and 4.8 branches are available in the associated commit messages for [4.8RC](http://lists.dragonflybsd.org/pipermail/commits/2017-March/625576.html), [4.8.0](http://lists.dragonflybsd.org/pipermail/commits/2017-March/625648.html), and [4.8.1](http://lists.dragonflybsd.org/pipermail/commits/2017-August/626150.html).

## Big-ticket items

### Improved kernel performance

This release further localizes cache lines and reduces/removes cache ping-ponging on globals. For bulk builds on many-core or multi-socket systems, we have around a 5% improvement, and certain subsystems such as namecache lookups and exec()s see massive focused improvements. See the corresponding [mailing list post with details](http://lists.dragonflybsd.org/pipermail/users/2017-February/313242.html).

### Support for eMMC booting, and mobile and high-performance PCIe SSDs

This kernel release includes support for eMMC storage as the boot device. We also sport a brand new SMP-friendly, high-performance NVMe SSD driver (PCIe SSD storage). Initial device test results [are available](http://apollo.backplane.com/DFlyMisc/nvme_randread.txt).

### EFI support

The installer can now create an EFI or legacy installation. Numerous adjustments have been made to userland utilities and the kernel to support EFI as a mainstream boot environment. The /boot filesystem may now be placed either in its own GPT slice, or in a DragonFly disklabel inside a GPT slice.

DragonFly, by default, creates a GPT slice for all of DragonFly and places a DragonFly disklabel inside it with all the standard DFly partitions, such that the disk names are roughly the same as they would be in a legacy system.
### Improved graphics support

The i915 driver has been updated to match the version found in the Linux 4.6 kernel (Linux 4.7 in the DragonFly 4.8.1 release). Broadwell and Skylake processor users will see improvements.

### Other user-affecting changes

* Kernel is now built using -O2.
* VKernels now use COW, so multiple vkernels can share one disk image.
* powerd is now sensitive to time and temperature changes.
* Non-boot-filesystem kernel modules can be loaded in rc.conf instead of loader.conf.

## Details

### Checksums

    MD5 (dfly-x86_64-4.8.0_REL.img) = 7936811dc0113bb5a5c607d3bfd71917
    MD5 (dfly-x86_64-4.8.0_REL.iso) = e6811893c02e99ca7dd8f3c1d6e92ae3
    MD5 (dfly-x86_64-4.8.0_REL.img.bz2) = 0e0a426ea581b9057ef1277b2ba7167d
    MD5 (dfly-x86_64-4.8.0_REL.iso.bz2) = 54bd900737a32fab9939ec5fd1fd0d6d

### Upgrading

If you have an existing 4.6.x system and are running a generic kernel, the normal upgrade process, described below, will work.

*Note that DSA OpenSSH keys are now deprecated.* It's possible to [change your configuration](http://www.openssh.com/legacy.html) to allow DSA keys again, but we recommend moving to a new key when possible. If you only have DSA keys, change to another type before upgrading or you may lock yourself out. You may be able to use -oHostKeyAlgorithms=+ssh-dss to get in anyway, but we recommend changing to RSA keys ASAP.

*Note that OpenSSH HPN support has been removed.* You will need to remove it from your sshd config. This only affects you if you specifically enabled it in your configuration.

Change your local /usr/src to 4.8:

    cd /usr/src
    git fetch origin
    git branch DragonFly_RELEASE_4_8 origin/DragonFly_RELEASE_4_8
    git checkout DragonFly_RELEASE_4_8
    git pull

And then rebuild (in /usr/src):

    make buildworld
    make buildkernel
    make installkernel
    make installworld
    make upgrade

Don't forget to upgrade your existing packages. 4.8 packages have already been built and are immediately available.

    pkg upgrade

## All changes since DragonFly 4.6

### Kernel

* Refactor buffer cache code to remove dynamic KVA reservations. Instead, all KVA is reserved at boot time. Saves us from unnecessary IPIs and allows significant simplification of the buffer cache code.
* Add vfs.repurpose_enable (under test, disabled by default). This feature can be enabled to significantly reduce the IPI and VM management load on a machine which is doing huge amounts of file I/O, for example from an NVMe SSD, by bypassing the normal VM page recycling mechanism. When enabled, the feature only triggers under high I/O loads. It works by repurposing the VM pages underlying a buffer in-place (when possible) so as not to have to kremove/kenter the pages in the buffer's KVA. Normal VM page recycling (which would otherwise be overwhelmed by the I/O load) is bypassed as well.
* Change how the IPIQ is processed; in particular, create an independent Xinterrupt vector mechanism for page invalidations that ignores critical sections (it will operate even if one is held). Implement machdep.optimized_invltlb (disabled by default, under test), which avoids sending TLB invalidation IPIs to idle cpus.
* Fix numerous races that could occur under extreme loads. Most use cases would never trigger these but our build boxes did occasionally. For example, there was a two-instruction race where the cpu bit for a pmap would be cleared (for two instructions) and cause a TLB IPI occurring at the same time on another cpu for the same pmap to not realize that cpu was using the pmap. The fix is to disable the CR3 reload optimization for the LWP->LWP (same proc) switch case.
* Fix a HAMMER bug which could result in a DATA CRC error being improperly reported.
* Fix a double-write triggered by the way HAMMER uses cluster_write(). This significantly improves HAMMER's write performance.
* Numerous other HAMMER cleanups and fixes also went in.
* Fix a hard lock that could occur in getpbuf*() due to a misinterpretation of the return value of an atomic op.
* Fix a stacking interrupt that can occur in a 10-instruction window, potentially (but not found in the wild) running the kernel stack out.
* Cut pmap related IPIs in half for certain buffer-cache operations by not bothering to invalidate the TLB, and on the flip-side always invalidating the TLB when entering a new PTE even if the prior contents were invalid. This improves performance and also makes debugging easier by removing a problematic optimization.
* Fix a number of difficult-to-trigger SMP races, in particular one related to doing simultaneous umounts of different mount points which the bulk build could trigger. Also fix a mountctl vs umount race.
* Reduce the number of atomic ops in the switch path.
* Fix a namecache race/panic which could occur under extreme loads coupled with a lot of mount/umount activity.
* Restrict %rip sampling to root.
* Fix a getpid() issue in vfork() when threaded. In particular, concurrent vfork()s in a threaded program could cause the wrong PID to be returned by getpid() in the child prior to the exec.
* Fix a rare tsleep/callout race when the callout timer triggers before the tsleep() is completely done setting up.
* Clean up namecache stall messages on the console. In particular, report the proper elapsed time and the td_comm of the thread involved.
* Further reduce memory testing and early-boot zeroing to improve boot times on systems with large amounts of ram.
* Remove the idle page-zeroing code entirely. Zeroing a page on a modern cpu on-demand is better for many reasons, and may actually be faster when combined with the consumer accessing data in the page, due to cache effects. Remove PG_ZERO, because it is no longer needed. Removing PG_ZERO also makes the kernel more debuggable by removing another possible source of cross-contamination.
* Refactor and finish implementing CPU localization for kernel memory allocations. Combine with NUMA awareness. This works for cpu-localized or short-lived kernel data structures. The two are combined in our PQ_L2_SIZE abstraction that used to be the VM page coloring code. This code now also handles CPU localization and NUMA awareness.
* Fix many vkernel issues and significantly improve vkernel performance.
* Update kern.proc.pathname, a sysctl used by programs to find the path of the running program. This sysctl was originally implemented before we stored sufficient data to return a full, proper path.
* Sync ACPICA from Intel (this is a regular occurrence).
* Fix the memcpy() assembly ABI. The assembly was not returning the original (dst) argument. This doesn't fix any known issues, but it closes a hole for the cases where GCC decides to emit a call to memcpy while generating code.
* Many commits to clean up -O2 warnings and errors. The kernel is now compiled with -O2 by default.
* Add a workaround for an improper yield in the ACPI path (aka buggy ACPI code).
* Fix a STOP/CONT race that could be triggered by a pending signal at just the wrong time.
* Fix threaded coredumps, and fix a related lockup when multiple threads of the same process seg-fault at the same time.
* Fix a CAM/VM deadlock that could occur due to a bug in uiomove_nofault(). This could cause an 'indefinite wait buffer' during heavy paging/swapping.
* Add code to detect and deal with lost IPIs. This is primarily for vkernels where some virtual hosts can lose IPIs. Real CPUs are not supposed to lose IPIs.
* Various fixes to clock_gettime().
* Remove more vestiges of the MPLOCK. All critical paths have long since divested from this lock, but there are still a few non-critical places left that use it.
* Rework the low-memory process killing code and fix a number of races that could prevent the feature from working.
* Fix a system lockup with VMM and refactor the VMX code.
* Fix a deadlock when numvnodes reaches maxvnodes, which can occur under heavy loads. Also fix a minor kernel memory leak when 'df' or filesystem sync races a umount. Also reduce the maxvnodes calculation modestly. For example, a machine with 8GB of ram will now set maxvnodes to 478483 instead of 598103.
* Fix a rare panic which can be triggered by vm_object_page_remove() when user_yield() is improperly called while holding a spinlock, and then decides to deschedule.
* Reduce the size of some dynamically allocated kernel structures. In particular, excessively-sized inode hash table allocations are now smaller. Primarily affects UFS (which DragonFlyBSD doesn't use much).
* Add workaround for AMD erratum 793.
* Fix a deadlock which can occur in stacked cluster_*() I/O calls.
* Fix a bug where recursive module loading could deadlock.
* Fix a silly bug in the NFS sillyrename code (server side NFS) which could cause the NFS server's sillyrename code to never remove the silly-renamed file. How silly!
* Do a better job accommodating high-ncpu + low-memory configurations.
* Refactor shared spinlocks to reduce the amount of spinning which can occur when multiple cpus acquire a shared spinlock at the same time.
* Overhaul namecache operations to reduce SMP contention even further. This improves simultaneous non-conflicting single-component performance at least 25x on systems with many cores, and significantly reduces vnode and mount structure ref and unref operations.
* Overhaul numerous other kernel structures to improve cache locality and reduce cache line bouncing.
* Fix a bug in SMBFS's file rename code.
* Implement RLIMIT_RSS, a per-process RSS limiter which will force localized paging on a per-process basis. This feature can be used to prevent one process from turning the rest of the machine into a hard case.
* Increase the maximum supported swap space.
The maximum is now limited primarily by ram and will be in the tens of terabytes (if you have enough ram for the supporting management structures). Also increase the kernel's KVM from 128G to 511G.
* Implement dynamic pmap deletion (disabled by default). This directs the pmap code to delete intermediate page table pages and PDs from the pmap on the fly. It can be useful if memory is at a premium, but note that, if enabled, it will slow execution of programs which allocate and deallocate memory at a high rate.
* Refactor how user 'nice' levels work, making the selected nice values more significant than they used to be.
* Add a high-performance native NVMe driver to DragonFly, written by Matt Dillon. This driver will use MSI-X vectors and all available queues supported by the device, per-cpu localization with no locking or minimal locking (no SMP conflicts in most cases), and is capable of insane IOPS and throughput.

### Graphics

* Stabilize Broadwell and Skylake support, bringing us up to the Linux 4.6 equivalent DRM.
* Implement the Linux i2c API to make porting easier.
* Fix a few old bugs, including a lock order reversal, which could stall-out video playback (and the rest of X).
* Fix a kernel drm thread priority mistake that allowed user processes to have a higher priority than the drm helper thread. This fixes most temporary video stalls reported on browsers.
* Handle EFI framebuffer passing into DRM, improve syscons VT switching and fix a related deadlock. Also have the kernel try to switch back to the console VT from X when a panic occurs.

### Networking

* Many improvements across the board.
* iwm - Fix an issue caused by inverted logic, plus numerous other changes that significantly improve performance.
* wlan - Support for asynchronous bg scan and other features added.

### Other drivers

* nvme - Added to default kernel build, plus fixes and performance improvements.
* mmcsd - Significant eMMC support added to DragonFly.
* ahci - Some compatibility adjustments and more quirks added to support broken chipsets, in particular port multipliers. Also implement FBS (FIS-Based-Switching) when supported by the chipset.
* Trackpoint and Elantech support added.

### Userland

* systat enhanced to collapse multiple interrupts belonging to the same driver, as there are often too many to list now.
* systat -vm 1 significantly enhanced and revamped to report more useful information and to unpack fields so they don't run into each other. Also add 'nvme' to the block device match, adjust the extended vmstats display, and change how ozfod and nzfod are reported.
* 'vmstat 1' output refactored. All the fields were running into each other due to the high performance of a modern machine versus what existed 30 years ago.
* Change mount/mountd signalling to reduce unnecessary mountlist scans and commands from mount_null and mount_tmpfs operations. Only really matters under heavy concurrent use of mount/umount, but the bulk build actually creates that situation.
* Fix numerous fork/exec*() leaks that libc can trigger due to not using O_CLOEXEC in an atomic fashion. Add various O_CLOEXEC features to functions like popen() and mk*stemp*() (add mkostemp() and mkostemps()). Fix a file descriptor leak in popen() when running in a threaded environment.
* Be nicer to pthreads in vfork() by giving the new sub-process's lwp the same TID as the one that called vfork(). This allows pthread support functions to execute in the child during the vfork without imploding pthreads.
* Lots of compatibility fixes to headers to improve dports bulk builds.
* Several OpenSSL imports for security fixes.
* Resync OpenSSH to make it easier to keep it up to date.
* Separate out kernel C flags by having the kernel build use KCFLAGS instead of CFLAGS.
* Remove numerous old ISA drivers from the tree entirely. As DragonFlyBSD is now 64-bit only, we can begin to remove old drivers that do not exist on 64-bit platforms.
* Introduce WORLD_CFLAGS and WORLD_CCOPTLEVEL, defaulting to -O. This makes it easier to compile your world -O2 or whatever (e.g. WORLD_CCOPTLEVEL=2). However, we discourage use of 3 or higher. Valid values are 0, 1, 2, 3, s, g, and 'fast'.
* Adjust STATUS formatting for ps to make it more readable and to remove ancient flags that are no longer applicable and just create clutter.
* Fix malloc() alignment for small allocations. The minimum alignment is now 16 for allocations in the 16-128 byte range instead of 8. Note that power-of-2 allocations have always been naturally aligned, but some programs use multiples of (e.g.) 16, like '48', and assume 16-byte alignment.
* Fortunes refactored, added.
* powerd - Add temperature-based management to powerd with a new -H lotemp:hightemp option. This feature is extremely useful on laptops with poor cooling and whose BIOSes intentionally throttle at too high a temperature. Powerd now also detects power state changes (which can change the list of available frequencies) and properly transitions the service when a power state change occurs.
* Lots of libthread_xu / pthreads fixes and adjustments to improve dports compatibility.
* Add copy-on-write features to the vkernel. For example, allows multiple vkernels to use a single disk image by having each one COW modifications internally to ram.
* /usr/src/secure rewired, conflicts removed from libmd, libcrypt.

### Various tools have been upgraded in the base system:

* Compiler updated to GCC 5.4.1.
* We now have a gold linker with LTO.
* binutils 2.25
* less 481.
* OpenSSL / LibreSSL completely revamped. Base now uses LibreSSL.
* Multiple timezone updates.

### Hammer Status

Miscellaneous improvements. One thing that didn't make it into the release was a version bump to use a faster CRC algorithm with a different polynomial. This work will be MFC'd to -release once testing is complete.
However, users should not worry about it too much because the most serious performance fix *IS* in the release (a fix to the cluster_write() code for filesystem writes).

### Hammer2 Status

Development continues but no word yet on a first release.

### Clang status

A starting framework has been added for using clang as the alternate base compiler in DragonFly, to replace gcc 4.7. It's not yet complete. Clang can of course be added as a package.

### 64-bit status

* Note that DragonFly is a 64-bit-only operating system as of 4.6, and will not run on 32-bit hardware.
* AMD Ryzen support is in the release and further work will be brought in as new Ryzen developments occur. There are some cpu-reported-topology issues that will be fixed and MFC'd. There are some stability issues currently waiting on an AMD microcode update to resolve/retest. Ryzen users can be assured that we are staying on top of it!
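
As a footnote to the Checksums section above: a downloaded image can be compared against its published MD5 with a few lines of shell. This is only a sketch, not part of the release tooling. The helper names `file_md5` and `check_md5` are made up for illustration, and it assumes either a BSD-style `md5` that accepts `-q` or GNU `md5sum` is installed.

```shell
#!/bin/sh
# Sketch: compare a file's MD5 with an expected digest.
# file_md5 and check_md5 are hypothetical helper names.

# Print only the MD5 digest of the file given as $1.
file_md5() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                    # BSD md5(1): -q prints the bare digest
    else
        md5sum "$1" | cut -d ' ' -f 1  # fallback for systems with GNU coreutils
    fi
}

# Succeed (exit status 0) when the digest of file $1 equals digest $2.
check_md5() {
    [ "$(file_md5 "$1")" = "$2" ]
}
```

For example, `check_md5 dfly-x86_64-4.8.0_REL.iso e6811893c02e99ca7dd8f3c1d6e92ae3 && echo OK` tests the 4.8.0 ISO against the value listed in these notes. A non-zero exit status means the download should not be trusted.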