From 386e1b2199f5fad83631aed2702cbea534d233e8 Mon Sep 17 00:00:00 2001
From: sjg
Date: Wed, 7 Dec 2011 18:07:43 -0800
Subject: [PATCH]

---
 goals.mdwn | 570 -----------------------------------------------------
 1 file changed, 570 deletions(-)

diff --git a/goals.mdwn b/goals.mdwn
index 5f44351d..8b137891 100644
--- a/goals.mdwn
+++ b/goals.mdwn
@@ -1,571 +1 @@

# DragonFly Design Goals

Table of contents:

* [[Caching|goals#caching]]
* [[I/O Model|goals#iomodel]]
* [[Packages|goals#packages]]
* [[Messaging|goals#messaging]]
* [[Threads|goals#threads]]
* [[User API|goals#userapi]]
* [[VFS Model|goals#vfsmodel]]

----

## What is it?

DragonFly is going to be a multi-year project.  It will take a lot of
groundwork to even approach the goals we outline here.  The goal links below
bring up position papers on the nitty-gritty aspects of kernel and system
design that the project hopes to accomplish.

First and foremost among all of our goals is a desire to be able to implement
them in small, bite-sized chunks while maintaining good stability for the
system as a whole.  While the goals are not listed in any particular order,
there is a natural order to things which should allow us to advance piecemeal
without compromising the stability of the system as a whole.  It's a laudable
goal that will sit foremost in our minds, even though we know it is probably
not 100% achievable.  The messaging system is going to be key to the effort.
If we can get that in place we will have an excellent (and debuggable) API on
top of which the remainder of the work can be built.

## Caching Infrastructure Overview

Our goal is to create a flexible dual-purpose caching infrastructure which
mimics the well-known and mature MESI (Modified Exclusive Shared Invalid)
model over a broad range of configurations.  The primary purpose of this
infrastructure will be to protect I/O operations and live memory mappings.
For example, a range-based MESI model would allow multiple processes to
simultaneously perform reads and writes on different portions of a single
file.  If we implement the infrastructure properly we can extend it into a
networked-clustered environment, getting us a long way toward achieving a
single-system-image capability.

Such a caching infrastructure would, for example, protect a write() from a
conflicting ftruncate(), and would preserve atomicity between read() and
write().  The same caching infrastructure would actively invalidate or reload
memory mappings, effectively replacing most of what vnode locking is used
for now.

The contemplated infrastructure would utilize two-way messaging and focus on
the VM object rather than the vnode as the central manager of cached data.
Some operations, such as a read() or write(), would obtain the appropriate
range lock on the VM object, issue their I/O, then release the lock.
Long-term caching operations might collapse ranges together to bound the
number of range locks being maintained, which allows the infrastructure to
maintain locks between operations in a scalable fashion.  In such cases,
cache operations such as invalidation or Exclusive->Shared transitions would
generate a message to the holding entity asking it to downgrade or release
its range lock.  In other words, the caching system being contemplated is an
***actively managed*** system.

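To visualize the range-based MESI idea, here is a minimal sketch of what a
per-VM-object range record might look like.  The structure, its field names,
and the vm_obj_range_lock() helper are assumptions made up for illustration,
not an existing DragonFly API.

    /*
     * Hypothetical sketch: a range-granular MESI record hung off a VM
     * object.  Each record covers [start, end) bytes of the object and
     * carries one of the four MESI states.  All names are illustrative.
     */
    enum cache_state { CS_MODIFIED, CS_EXCLUSIVE, CS_SHARED, CS_INVALID };

    struct vm_object_range {
        off_t            start;     /* byte offset, inclusive */
        off_t            end;       /* byte offset, exclusive */
        enum cache_state state;     /* MESI state for this range */
        void            *holders;   /* entities to message on downgrade */
        struct vm_object_range *next;
    };

    /*
     * A read() would take a shared lock over its byte range and a write()
     * an exclusive one; a conflicting request generates a message to the
     * current holders asking them to downgrade or release.
     */
    struct vm_object_range *
    vm_obj_range_lock(struct vm_object *obj, off_t start, off_t end,
                      enum cache_state want);
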
## New I/O Device Model

I/O is considerably easier to fix than VFS because most devices already
operate asynchronously, despite having a semi-synchronous API.  The I/O model
being contemplated consists of three major pieces of work:

1. I/O data will be represented by ranges of VM objects instead of ranges of
   system or user addresses.  This allows I/O devices to operate entirely
   independently of the originating user process.

2. Device I/O will be handled through a port/messaging system
   (see the 'messaging' goal below).

3. Device I/O will typically be serialized through one or more threads.
   Each device will typically be managed by its own thread, but certain
   high-performance devices might be managed by multiple threads (up to one
   per cpu).  Multithreaded devices would not necessarily compete for
   resources.  For example, the TCP stack could be multithreaded with work
   split up by target port number mod N, and thus run on multiple threads
   (and thus multiple cpus) without contention; see the sketch after this
   list.

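To make item 3 concrete, here is a minimal sketch of how incoming TCP work
might be hashed to a per-cpu protocol thread by target port number.  The
names used here (netmsg, tcp_ports, tcp_dispatch) are illustrative
assumptions, not the actual DragonFly implementation.

    /*
     * Illustrative sketch only: route each packet to the protocol thread
     * that owns its target port (port mod N), so per-connection state is
     * touched by exactly one thread and needs no locks.
     */
    struct netmsg {
        struct lwkt_msg nm_lmsg;    /* embedded message header */
        struct mbuf    *nm_packet;  /* packet being handed to the stack */
    };

    extern int              ncpus;
    extern struct lwkt_port tcp_ports[];   /* one port per protocol thread */

    static void
    tcp_dispatch(struct netmsg *msg, uint16_t dport)
    {
        /* Asynchronous send; the owning thread processes it in order. */
        lwkt_sendmsg(&tcp_ports[dport % ncpus], &msg->nm_lmsg);
    }
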
As part of this work, I/O messages will utilize a flat 64-bit byte offset
rather than block numbers.

Note that device messages can be acted upon synchronously by the device.  Do
not make the mistake of assuming that messages are unconditionally serialized
to the device thread, because they aren't.  See the messaging section for
more information.

It should also be noted that the device interface is being designed with the
flexibility to allow devices to operate as user processes rather than as
kernel-only threads.  Though we probably will not achieve this capability for
some time, the intention is to eventually be able to do it.  There are
innumerable advantages to being able to transparently pull things like
virtual block devices and even whole filesystems into userspace.

## Dealing with Package Installation

Applications are such a god-awful mess these days that it is hard to come up
with a packaging and installation system that can achieve seamless
installation and flawless operation.  We have come to the conclusion that the
crux of the problem is that even seemingly minor updates to third-party
libraries (that we have no control over) can screw up an already-installed
application.  A packaging system **CAN** walk the dependency tree and upgrade
everything that needs upgrading.  The problem is that the packaging system
might not actually have a new version of the package or packages that need to
be upgraded when some third-party library changes.

We need to have the luxury and ability to upgrade only the particular package
we want, without blowing up applications that depend on it.  This isn't to
say that doing so is always desirable; rather, it is necessary because it
allows us to do piecemeal upgrades (as well as piecemeal updates to the
packaging system's database itself) without having to worry about blowing up
other things in the process.  ***Eventually*** we would synchronize
everything up, but there could be periods of a few days to a few months where
a few packages might not be, and certain very large packages could wind up
depending on an old version of some library for a very long time.  We need to
be able to support that.  We also need to be able to support versioned
support/configuration directories that might be hardwired by a port.
Whenever such conflicts occur, the packaging system needs to version the
supporting directories as well.  If two incompatible versions of package X
both need /usr/local/etc/X, we would wind up with /usr/local/etc/X:VERSION1
and /usr/local/etc/X:VERSION2.

It is possible to accomplish this goal by explicitly versioning dependencies
and tagging the package binary with an 'environment'... a filesystem overlay,
you could call it, which applies to supporting directories like /usr/lib,
/usr/local/lib, even /usr/local/etc, and makes only the particular versions
of the particular libraries and/or files the package needs visible to it.
Everything else would be invisible to that package.  By enforcing visibility
you would know very quickly if you specified your package dependencies
incorrectly, because your package would not be able to find libraries or
supporting files that were not made accessible when it was installed.  For
example, if the package says a program depends on version 1.5 of the ncurses
library, then version 1.5 is all that would be visible to the program (it
would appear as just libncurses.* to the program).

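As a purely illustrative aid, the per-binary 'environment' tag described
above can be imagined as a small table mapping the paths a program sees to
the versioned files the overlay exposes.  The structure and names here are
assumptions, not an existing DragonFly or packaging-system mechanism.

    /*
     * Hypothetical sketch of a package 'environment' tag: each entry maps
     * a path the program opens to the versioned file the overlay actually
     * exposes.  Names and layout are invented for illustration.
     */
    struct pkg_env_entry {
        const char *visible_path;   /* what the program sees */
        const char *real_path;      /* what the overlay resolves it to */
    };

    static const struct pkg_env_entry example_env[] = {
        { "/usr/local/lib/libncurses.so", "/usr/local/lib/libncurses.so.1.5" },
        { "/usr/local/etc/X",             "/usr/local/etc/X:VERSION1" },
        { NULL, NULL }
    };
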
With such a system we would be able to install multiple versions of anything,
whether said entities supported fine-grained version control or not, and even
if (in a normal system) there would be conflicts with other entities.  The
packaging system would be responsible for tagging the binaries and the
operating system would be responsible for the visibility issues.  The
packaging system, or possibly even just a cron job, would be responsible for
running through the system and locating all the 'cruft' that is removable
after you've updated all the packages that used to depend on it.

Another real advantage of enforced visibility is that it provides us with
proof-positive that a package does or does not need something.  We would not
have to rely on the packaging system to find out what the dependencies were;
we could just look at the environment tagged to the binary!

## The Port/Messaging Model

DragonFly will have a lightweight port/messaging API to go along with its
lightweight kernel threads.  The port/messaging API is very simple in
concept: you construct a message, you send it to a target port, and at some
point later you wait for a reply on your reply port.  On this simple
abstraction we intend to build a high level of capability and sophistication.
To understand the capabilities of the messaging system, you must first
understand how a message is dispatched.  It basically works like this:

    fubar()
    {
        FuMsg msg;
        int error;

        /* Build the message and tell it which port the reply goes to. */
        initFuMsg(&msg, replyPort, ...);

        /*
         * The target port's send function decides whether to act on the
         * message synchronously or to queue it and return EASYNC.
         */
        error = targetPort->mp_SendMsg(&msg);
        if (error == EASYNC) {
            /* now or at some later time, or wait on reply port */
            error = waitMsg(&msg);
        }
    }

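The synchronous and asynchronous wrappers described in the next paragraphs
might look roughly like the following.  This is a hedged sketch of the idea
only; the field and helper names (ms_flags, MSGF_SYNC, lwkt_waitmsg,
lwkt_replyport_queue) are assumptions rather than the actual LWKT code.

    /*
     * Hedged sketch of the wrapper behavior discussed below; names are
     * illustrative, not the real DragonFly implementation.
     */
    int
    lwkt_domsg(lwkt_port_t port, lwkt_msg_t msg)    /* synchronous */
    {
        int error;

        msg->ms_flags |= MSGF_SYNC;           /* hint: caller will block */
        error = port->mp_SendMsg(msg);
        if (error == EASYNC)
            error = lwkt_waitmsg(msg);        /* block until replied */
        return (error);
    }

    void
    lwkt_sendmsg(lwkt_port_t port, lwkt_msg_t msg)  /* asynchronous */
    {
        int error;

        error = port->mp_SendMsg(msg);
        if (error != EASYNC) {
            /*
             * The target completed the message synchronously, so queue
             * the now-complete message on the reply port ourselves.
             */
            msg->ms_error = error;
            lwkt_replyport_queue(msg->ms_reply_port, msg);
        }
    }
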
The messaging API will wrap this basic mechanism into synchronous and
asynchronous messaging functions.  For example, lwkt_domsg() will send a
message synchronously and wait for a reply.  It will set a flag hinting to
the target port that the caller intends to block on the message, and if the
target port returns EASYNC, lwkt_domsg() will block.  Likewise,
lwkt_sendmsg() would send a message asynchronously, but if the target port
returns a synchronous error code (i.e. anything other than EASYNC),
lwkt_sendmsg() will manually queue the now-complete message on the reply
port itself.

As you may have guessed, the target port's mp_SendMsg() function has total
control over how it deals with the message.  Regardless of any hints passed
to it in the messaging flags, the target port can decide to act on the
message synchronously (in the context of the caller) and return, or it may
decide to queue the message and return EASYNC.  Messaging operations
generally should not 'block' from the point of view of the initiator.  That
is, the target port should not try to run the message synchronously if doing
so would cause it to block.  Instead, it should queue the message to its own
thread (or to the message queue conveniently embedded in the target port
structure itself) and return EASYNC.

A target port might act on a message synchronously for any number of
reasons.  It is in fact precisely the mp_SendMsg() function for the target
port which deals with per-cpu caches and opportunistic locking such as
try_mplock() in order to handle the request without having to resort to more
expensive queueing and switching.

The key thing to remember here is that our best-case optimization is direct
execution by mp_SendMsg() with virtually no more overhead than a simple
subroutine call would otherwise entail.  No queueing, no messing around with
the reply port... if a message can be acted upon synchronously, then we are
talking about an extremely inexpensive operation.  It is this key feature
that allows us to use a messaging interface by design without having to worry
about performance issues.  We are explicitly NOT employing the type of
sophistication that, say, Mach uses.  We are not trying to track memory
mappings or pointers or anything like that, at least not in the low-level
messaging interface.  User<->Kernel messaging interfaces simply employ
mp_SendMsg() function vectors which do the appropriate translation, so as far
as the sender and recipient are concerned the message will be local to their
VM context.

## The Light Weight Kernel Threading Model

DragonFly employs a light weight kernel threading (LWKT) model at its core.
Each process in the system has an associated thread, and most kernel-only
processes are in fact just pure threads.  For example, the pageout daemon is
a pure thread and has no process context.

The LWKT model has a number of key features that can be counted on no matter
the architecture.  These features are designed to remove or reduce contention
between cpus:

1. Each cpu in the system has its own self-contained LWKT scheduler.
   Threads are locked to their cpus by design and can only be moved to other
   cpus under certain special circumstances.  Any LWKT scheduling operation
   on a particular cpu is only directly executed on that cpu.  This means
   that the core LWKT scheduler can schedule, deschedule, and switch between
   threads within a cpu's domain without any locking whatsoever.  No MP
   lock, nothing except a simple critical section.

2. A thread will never be preemptively moved to another cpu while it is
   running in the kernel, and a thread will never be moved between cpus
   while it is blocked.  The userland scheduler may migrate a thread that is
   running in usermode.  A thread will never be preemptively switched to a
   non-interrupt thread.  If an interrupt thread preempts the current
   thread, then the moment the interrupt thread completes or blocks, the
   preempted thread will resume regardless of its scheduling state.  For
   example, a thread might get preempted after calling
   lwkt_deschedule_self() but before it actually switches out.  This is OK
   because control will be returned to it directly after the interrupt
   thread completes or blocks.

3. Due to (2) above, a thread can cache information obtained through the
   per-cpu globaldata structure without having to obtain any locks and, if
   the information is known not to be touched by interrupts, without having
   to enter a critical section.  This allows per-cpu caches for various
   types of information to be implemented with virtually no overhead.

4. A cpu which attempts to schedule a thread belonging to another cpu will
   issue an IPI-based message to the target cpu to execute the operation.
   These messages are asynchronous by default and, while IPIs may entail
   some latency, they don't necessarily waste cpu cycles due to that fact.
   Threads can block such operations by entering a critical section and, in
   fact, that is what the LWKT scheduler does.  Entering and exiting a
   critical section are considered to be cheap operations and require no
   locking or locked bus instructions to accomplish.  (A sketch of this
   cross-cpu scheduling path follows the list.)

5. The IPI messaging subsystem deals with FIFO-full deadlocks by spinning
   and processing the incoming queue while waiting for its outgoing queue to
   unstall.  The IPI messaging subsystem specifically does not switch
   threads under these circumstances, which allows the software to treat it
   as a non-blocking API even though some spinning might occasionally occur.

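To make feature 4 concrete, here is a minimal sketch of how a cross-cpu
scheduling request might be shipped to the owning cpu as an asynchronous IPI
message.  The function names sketch_schedule(), schedule_remote(), and
local_enqueue() are hypothetical; critical sections and an IPI-queue send
primitive are the mechanisms the text describes.

    /*
     * Illustrative sketch of cross-cpu scheduling via the IPI messaging
     * subsystem.  All names here are invented for the example.
     */
    static void sketch_schedule(thread_t td);                 /* forward decl */
    static void local_enqueue(globaldata_t gd, thread_t td);  /* hypothetical */

    static void
    schedule_remote(void *arg)      /* runs on the thread's owning cpu */
    {
        sketch_schedule((thread_t)arg);
    }

    static void
    sketch_schedule(thread_t td)
    {
        globaldata_t gd = mycpu;

        if (td->td_gd == gd) {
            /*
             * The thread belongs to this cpu: a critical section is all
             * the protection the local run queue needs, no locks at all.
             */
            crit_enter();
            local_enqueue(gd, td);
            crit_exit();
        } else {
            /*
             * The thread belongs to another cpu: ship the request there
             * as an asynchronous IPI message and return immediately.
             */
            lwkt_send_ipiq(td->td_gd, schedule_remote, td);
        }
    }
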
In addition to these key features, the LWKT model allows for both FAST
interrupt preemption **AND** threaded interrupt preemption.  FAST interrupts
may preempt the current thread when it is not in a critical section.
Threaded interrupts may also preempt the current thread; the LWKT system will
switch to the threaded interrupt and then switch back to the original thread
when the threaded interrupt blocks or completes.  IPI functions operate in a
manner very similar to FAST interrupts and have the same trapframe
capability.  This is used heavily by DragonFly's SYSTIMERS API to distribute
hardclock() and statclock() interrupts to all cpus.

### The IPI Messaging Subsystem

The LWKT model implements an asynchronous messaging system for communication
between cpus.  You simply make a call providing the target cpu with a
function pointer and a data argument, which the target cpu executes
asynchronously.  Since this is an asynchronous model the caller does not wait
for synchronous completion, which greatly improves performance, and the
overhead on the target cpu is roughly equivalent to an interrupt.

IPI messages operate like FAST interrupts... meaning that they preempt
whatever is running on the target cpu (subject to a critical section), run,
and then whatever was running before resumes.  For this reason IPI functions
are not allowed to block in any manner whatsoever.  IPI messages are used to
do things like schedule threads and free memory belonging to other cpus.

IPI messaging is used heavily by at least half a dozen major LWKT subsystems,
including the per-cpu thread scheduler, the slab allocator, and the messaging
subsystems.  Since IPI messaging is a DragonFly-native subsystem, it does not
require and does not use the Big Giant Lock.  All IPI-based functions must
therefore be MP-safe (and they are).

### The IPI-based CPU Synchronization Subsystem

The LWKT model implements a generalized, machine-independent cpu
synchronization API.  The API may be used to place target cpu(s) into a known
state while one is operating on a sensitive data structure.  This interface
is primarily used to deal with MMU pagetable updates.  For example, it is not
safe to check and clear the modify bit on a page table entry and then remove
the page table entry, even if holding the proper lock, because a userland
process running on another cpu may be accessing or modifying that page.  That
would create a race between the TLB writeback on the target cpu and your
attempt to clear the page table entry.  The proper solution is to place all
cpus that might be able to issue a writeback on the page table entry (meaning
all cpus in the pmap's pm_active mask) into a known state first, then make
the modification, then release the cpus with a request to invalidate their
TLBs.

The API implemented by DragonFly is deadlock-free.  Multiple cpu
synchronization activities are allowed to operate in parallel, and this
includes threads which are themselves mastering a cpu synchronization event.
Even with this flexibility, since the cpu synchronization interface operates
in a controlled environment the callback functions tend to work just like the
callback functions used in the IPI messaging subsystem.

### Serializing Tokens

A serializing token may be held by any number of threads simultaneously.  A
thread holding a token is guaranteed that no other thread also holding that
same token will be running at the same time.

A thread may hold any number of serializing tokens.

A thread may hold serializing tokens through a thread yield or blocking
condition, but must understand that another thread holding those tokens may
be allowed to run while the first thread is not running (blocked or yielded
away).

There are theoretically no unresolvable deadlock situations that can arise
with the serializing token mechanism.  However, the initial implementation
can potentially get into livelock issues with multiply-held tokens.

Serializing tokens may also be used to protect threads from preempting
interrupts that attempt to obtain the same token.  This is a slightly
different effect from the Big Giant Lock (also known as the MP lock), which
does not interlock against interrupts on the same cpu.  ***It is important to
note that token atomicity is maintained through preemptive conditions, even
though preemption involves a temporary switch to another thread.  It is not
necessary to enter a spl() level or critical section to preserve token
atomicity.***

Holding a serializing token does not prevent preemptive interrupts from
occurring, though it might cause some of those interrupts to
block-reschedule.  Unthreaded FAST and IPI messaging interrupts are not
allowed to use tokens, as they have no thread context of their own to operate
in.  These subsystems are instead interlocked through the use of critical
sections.

## Creating a Portable User API

Most standard UNIX systems employ a system call table through which many
types of data, including raw structures, are passed back and forth.  The
biggest obstacle to user programs interoperating with kernels older or newer
than themselves is that these raw structures often change.  The worst
offenders are things like network interfaces, route table ioctls, ipfw, and
raw process structures for ps, vmstat, etc.  But even nondescript system
calls like stat() and readdir() have issues.  In more general terms, the
system call list itself can create portability problems.

It is a goal of this project to (1) make all actual system calls
message-based, (2) pass structural information through capability and element
lists instead of as raw structures, and (3) implement a generic 'middle
layer' that looks somewhat like an emulation layer, managed by the kernel but
loaded into userspace.  This layer implements all standard system call APIs
and converts them into the appropriate message(s).

For example, Linux emulation would operate in (kernel-protected) userland
rather than in kernelland.  FreeBSD emulation would work the same way.  In
fact, even 'native' programs will run through an emulation layer in order to
see the system calls we all know and love.  The only difference is that
native programs will know that the emulation layer exists and is directly
accessible from userland, and won't waste an extra INT 0x80 (or whatever) to
enter the kernel just to get spit back out again into the emulation layer.

Another huge advantage of converting system calls to message-based entities
is that it completely solves the userland threads issue.  One no longer needs
multiple kernel contexts or stacks to deal with multiple userland threads;
one needs only **one** kernel context and stack per user process.  Userland
threads would still use rfork() to create a real process for each CPU on the
system, but all other operations would use a thread-aware emulation layer.
In fact, nearly all userland upcalls would be issued by the emulation layer
in userland itself, not directly by the kernel.  Here is an example of how a
thread-aware emulation layer would work:

    ssize_t
    read(int fd, void *buf, size_t nbytes)
    {
        syscall_any_msg_t msg;
        int error;

        /*
         * Use a convenient mostly pre-built message stored in
         * the userthread structure for synchronous requests.
         */
        msg = &curthread->td_sysmsg;
        msg->fd = fd;
        msg->buf = buf;
        msg->nbytes = nbytes;
        if ((error = lwkt_domsg(&syscall_port, msg)) != 0) {
            curthread->td_errno = error;
            msg->result = -1;
        }
        return(msg->result);
    }

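For the kernel side of this (discussed in the next paragraph), here is a
hedged sketch of how the syscall port might dispatch on the message command.
The names syscall_dispatch, sc_table, sc_table_size, and msg->cmd are
assumptions made up for the illustration.

    /*
     * Hypothetical sketch of the kernel-side syscall message dispatch:
     * the message's command code indexes a table roughly analogous to
     * the old system call table.  All names are invented for illustration.
     */
    typedef int (*sysmsg_func_t)(syscall_any_msg_t msg);

    extern sysmsg_func_t sc_table[];    /* one handler per message command */
    extern int           sc_table_size;

    static int
    syscall_dispatch(syscall_any_msg_t msg)
    {
        if (msg->cmd >= sc_table_size || sc_table[msg->cmd] == NULL)
            return (ENOSYS);
        /*
         * The handler may complete synchronously and return an error
         * code, or queue the message to a kernel thread and return EASYNC.
         */
        return (sc_table[msg->cmd](msg));
    }
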
And there you have it.  The only 'real' system calls DragonFly would
implement would be message-passing primitives for sending, receiving, and
waiting.  Everything else would go through the emulation layer.  Of course,
on the kernel side the message command will wind up hitting a dispatch table
almost as big as the one that existed in FreeBSD 4.x.  But as more and more
subsystems become message-based, the syscall messages become more integrated
with those subsystems, and the overhead of dealing with a 'message' could
actually wind up being less than the overhead of dealing with discrete system
calls.  Portability becomes far easier to accomplish because the 'emulation
layer' provides a black box which separates what a userland program expects
from what the kernel expects, and the emulation layer can be updated along
with the kernel (or a backwards-compatible version can be created), which
makes portability issues transparent to a userland binary.

Plus, we get all the advantages that a message-passing model provides,
including a very easy way to wedge into system calls for debugging or other
purposes, and a very easy way to create a security layer in the kernel which
could, for example, disable or alter certain classes of system calls based on
the security environment.

## The New VFS Model

Fixing the VFS subsystem is probably the single largest piece of work we will
be doing.  VFS has two serious issues which will require a lot of reworking.
First, the current VFS API uses a massive reentrancy model all the way to its
core and we want to try to fit it into a threaded messaging API.  Second, the
current VFS API has one of the most complex interfaces in the system...
VOP_LOOKUP and friends, which resolve file paths.  Fixing VFS involves two
major pieces of work.

First, the VOP_LOOKUP interface and the VFS cache will be completely redone.
All file paths will be loaded in an unresolved state into the VFS cache by
the kernel before **ANY** VFS operation is initiated.  The kernel will
recurse down the VFS cache and, when it hits a leaf, start creating new
entries to represent the unresolved path elements.  The tail of the snake
will then be handed to VFS_LOOKUP() for resolution.  VFS_LOOKUP() will be
able to return a new VFS pointer if further resolution is required, for
example when it hits a mount point.  The kernel will then no longer pass
random user-supplied strings (and certainly not user address space!) to the
VFS subsystem.

Second, the VOP interface in general will be converted to a messaging
interface.  All direct userspace addresses will be resolved into VM object
ranges by the kernel; the VOP interface will **NOT** handle direct userspace
addresses any more.  As a messaging interface, VOPs can still operate
synchronously, and initially that is what we will do.  But the intention is
to thread most of the VOP interface (i.e. replace the massive reentrancy
model with a serialized, threaded messaging model).  For a high-performance
filesystem running multiple threads (one per cpu) we can theoretically
achieve the same level of performance that a massively reentrant model can
achieve.  However, blocking points, such as the bread()'s you see all over
filesystem code, would either have to be asynchronized, which is difficult,
or we would have to spawn a lot more threads to handle parallelism.
Initially we can take the (huge) performance hit and serialize the VOP
operations into a single thread, then optimize the filesystems we care about,
like UFS.  It should be noted that a massive reentrancy model is not going to
perform all that much better than, say, a 16-thread model for a filesystem,
because in both cases the bottleneck is the I/O.  As long as one thread is
free to handle non-blocking (cached) requests we can achieve 95% of the
performance of a massive reentrancy model.

A messaging interface is preferable for many reasons, not the least of which
is that it makes stacking actually work the way it should: as independent and
opaque elements which stack together to form a whole.  For example, with the
new API a capability layer could be slapped onto a filesystem that otherwise
doesn't implement one of its own, and the end user would not know the
difference.  Filesystems are almost universally self-contained entities.  A
message-based API would allow these entities to run in userspace for
debugging, or even in a deployment when one absolutely cannot afford a crash.
Why run msdosfs or cd9660 in the kernel and risk a crash when it would
operate just as well in userland?  Debugging and filesystem development are
other good reasons for having a messaging API rather than a massively
reentrant API.

-- 
2.41.0