1 # DragonFly Design Goals
5 * [[Caching|goals#caching]]
6 * [[I/O Model|goals#iomodel]]
7 * [[Packages|goals#packages]]
8 * [[Messaging|goals#messaging]]
9 * [[Threads|goals#threads]]
10 * [[User API|goals#userapi]]
11 * [[VFS Model|goals#vfsmodel]]
16 ## <a name="intro">What is it?</a>
20 DragonFly is going to be a multi-year project. It will take a lot of groundwork
21 to even approach the goals we outline here. By checking our various goal
22 links, you can bring up position papers on various nitty-gritty aspects
23 of kernel and system design which the project hopes to accomplish.
26 First and foremost among all of our goals is a desire to be able to
27 implement them in small bite-sized chunks, while at the same time maintaining
28 good stability for the system as a whole. While the goals are not listed
29 in any particular order, there is a natural order to things which should allow
30 us to advance piecemeal without compromising the stability of the
31 system as a whole. It's a laudable goal that will sit foremost in our
32 minds, even though we know it is probably not 100% achievable. The messaging
33 system is going to be key to the effort. If we can get that in-place we
34 will have an excellent (and debuggable) API on top of which the remainder of
35 the work can be built.
38 ## <a name="caching">Caching Infrastructure Overview</a>
41 Our goal is to create a flexible dual-purpose caching infrastructure which
42 mimics the well known and mature MESI (Modified Exclusive Shared Invalid)
43 model over a broad range of configurations. The primary purpose of this
44 infrastructure will be to protect I/O operations and live memory mappings.
45 For example, a range-based MESI model would allow multiple processes to
46 simultaneously operate both reads and writes on different portions of a single
47 file. If we implement the infrastructure properly we can extend it into
48 a networked-clustered environment, getting us a long ways towards achieving
49 a single-system-image capability.
52 Such a caching infrastructure would, for example, protect a write() from a
53 conflicting ftruncate(), and would preserve atomicity between read() and
54 write(). The same caching infrastructure would actively invalidate or
55 reload memory mappings, effectively replacing most of what VNODE locking
60 The contemplated infrastructure would utilize two-way messaging and focus
61 on the VM Object rather than the VNode as the central manager of cached
62 data. Some operations, such as a read() or write(), would obtain the
63 appropriate range lock on the VM object, issue their I/O, then release the
64 lock. Long-term caching operations might collapse ranges together to
65 bound the number of range locks being maintained, which allows the
66 infrastructure to maintain locks between operations in a scalable fashion.
67 In such cases cache operations such as invalidation or, say,
68 Exclusive->Shared transitions, would generate a message to the holding
69 entity asking it to downgrade or release its range lock.
70 In other words, the caching system being contemplated is
71 an ***actively managed*** system.
73 ## <a name="iomodel">New I/O Device Model</a>
76 I/O is considerably easier to fix than VFS owing to the fact that
77 most devices operate asynchronously already, despite having a semi-synchronous
78 API. The I/O model being contemplated consists of three major pieces of work:
81 <li>I/O Data will be represented by ranges of VM Objects instead of ranges of
82 system or user addresses. This allows I/O devices to operate entirely
83 independently of the originating user process.</li>
85 <li>Device I/O will be handled through a port/messaging system.
86 (See 'messaging' goal.)</li>
88 <li>Device I/O will typically be serialized through one or more threads.
89 Each device will typically be managed by its own thread but certain high
90 performance devices might be managed by multiple threads (up to one per cpu).
91 Multithreaded devices would not necessarily compete for resources. For
92 example, the TCP stack could be multithreaded with work split up by target
93 port number mod N, and thus run on multiple threads (and thus multiple cpus)
94 without contention.</li>
99 As part of this work I/O messages will utilize a flat 64 bit byte-offset
100 rather than block numbers.
103 Note that device messages can be acted upon synchronously by the device.
104 Do not make the mistake of assuming that messages are unconditionally
105 serialized to the device thread because they aren't. See the messaging
106 section for more information.
109 It should also be noted that the device interface is being designed with
110 the flexibility to allow devices to operate as user processes rather than
111 as kernel-only threads. Though we probably will not achieve this capability
112 for some time, the intention is to eventually be able to do it. There are
113 innumerable advantages to being able to transparently pull things like
114 virtual block devices and even whole filesystems into userspace.
116 ## <a name="packages">Dealing with Package Installation</a>
120 Applications are such a god-awful mess these days that it is hard to come
121 up with a packaging and installation system that can achieve seamless
122 installation and flawless operation. We have come to the conclusion that
123 the crux of the problem is that even seemingly minor updates to third
124 party libraries (that we have no control over) can screw up an
125 already-installed application. A packaging system **CAN** walk the
126 dependency tree and upgrade everything that needs upgrading. The
127 problem is that the packaging system might not actually have a new
128 version of the package or packages that need to be upgraded due to some
129 third party library being upgraded.
132 We need to have the luxury and ability to upgrade only the particular
133 package we want, without blowing up applications that depend on said package.
134 This isn't to say that it is desirable. Instead we say that it is
135 necessary because it allows us to do piecemeal upgrades (as well as
136 piecemeal updates to the packaging system's database itself)
137 without having to worry about blowing up other things in the process.
138 ***Eventually*** we would synchronize everything up, but there could be
139 periods of a few days to a few months where a few packages might not be,
140 and certain very large packages could wind up depending on an old version
141 of some library for a very long time. We need to be able to support that.
142 We also need to be able to support versioned support/configuration directories
143 that might be hardwired by a port. Whenever such conflicts occur, the
144 packaging system needs to version the supporting directories as well.
145 If two incompatible versions of package X both need /usr/local/etc/X
146 we would wind up with /usr/local/etc/X:VERSION1 and /usr/local/etc/X:VERSION2.
150 It is possible to accomplish this goal by explicitly versioning
151 dependencies and tagging the package binary with an 'environment'... A
152 filesystem overlay, you could call it, which applies to supporting
153 directories like /usr/lib, /usr/local/lib, even /usr/local/etc, which
154 makes only the particular version of the particular libraries
155 and/or files the package needs visible to it. Everything else would
156 be invisible to that package. By enforcing visibility you would
157 know very quickly if you specified your
158 package dependencies incorrectly, because your package would not
159 be able to find incorrectly placed libraries or supporting files,
160 because they were not made accessible when the package was installed.
161 For example, if the package says a program depends on version 1.5 of the
162 ncurses library, then version 1.5 is all that would be visible to the program
163 (it would appear as just libncurses.* to the program).
166 With such a system we would be able to install multiple versions of anything
167 whether said entities supported fine-grained version control or not, and
168 even if (in a normal sytem) there would be conflicts with other entities.
169 The packaging system would be responsible for tagging the binaries and
170 the operating system would be responsible for the visibility issues. The
171 packaging system or possibly even just a cron job would be responsible for
172 running through the system and locating all the 'cruft' that is removable
173 after you've updated all the packages that used to depend on it.
177 Another real advantage of enforced visibility is that it provides us
178 with proof-positive that a package does or does not need something. We
179 would not have to rely on the packaging system to find out what the
180 dependencies were; we could just look at the environment tagged to the
183 ## <a name="messaging">The Port/Messaging Model</a>
186 DragonFly will have a lightweight port/messaging API to go along with its
187 lightweight kernel threads. The port/messaging API is very simple
188 in concept; You construct a message, you send it to a target port, and
189 at some point later you wait for a reply on your reply port. On this
190 simple abstraction, we intend to build a high level of capability and
191 sophistication. To understand the capabilities of the messaging system,
192 you must first understand how a message is dispatched. It basically works
198 initFuMsg(&msg, replyPort, ...);
199 error = targetPort->mp_SendMsg(&msg);
200 if (error == EASYNC) {
201 /* now or at some later time, or wait on reply port */
202 error = waitMsg(&msg);
209 The messaging API will wrap this basic mechanism into synchronous and
210 asynchronous messaging functions. For example, lwkt_domsg() will send
211 a message synchronously and wait for a reply. It will set a flag to hint
212 to the target port that the message will be blocked on synchronously and
213 if the target port returns EASYNC, lwkt_domsg() will block. Likewise
214 lwkt_sendmsg() would send a message asynchronously, but if the target port
215 returns a synchronous error code (i.e. anything not EASYNC) lwkt_sendmsg()
216 will manually queue the now complete message on the reply port itself.
219 As you may have guessed, the target port's mp_SendMsg() function has total
220 control over how it deals with the message. Regardless of any hints passed
221 to it in the messaging flags, the target port can decide to act on the
222 message synchronously (in the context of the caller) and return, or it may
223 decide to queue the message and return EASYNC. Messaging operations generally
224 should not 'block' from the point of view of the initiator. That is, the
225 target port should not try to run the message synchronously if doing so would
226 cause it to block. Instead, it should queue it to its own thread (or to the
227 message queue conveniently embedded in the target port structure itself) and
231 A target port might act on a message synchronously for any
232 number of reasons. It is in fact precisely the mp_sendMsg() function for
233 the target port which deals with per-cpu caches and opportunistic locking
234 such as try_mplock() in order to deal with the request without having to
235 resort to more expensive queueing / switching.
238 The key thing to remember here is that our best case optimization is direct
239 execution by mp_SendMsg() with virtually no more overhead than a simple
240 subroutine call would otherwise entail. No queueing, no messing around
241 with the reply port... If a message can be acted upon synchronously, then we
242 are talking about an extremely inexpensive operation. It is this key feature
243 that allows us to use a messaging interface by design without having to worry
244 about performance issues. We are explicitly NOT employing the type of
245 sophistication that, say, Mach uses. We are not trying to track memory
246 mappings or pointers or anything like that, at least not in the low level
247 messaging interface. User<->Kernel messaging interfaces simply employ
248 mp_SendMsg() function vectors which do the appropriate translation, so as
249 far as the sender and recipient are concerned the message will be local to
253 ## <a name="threads">The Light Weight Kernel Threading Model</a>
257 DragonFly employs a light weight kernel threading (LWKT) model at its core.
258 Each process in the system has an associated thread, and most kernel-only
259 processes are in fact just pure threads. For example, the pageout daemon
260 is a pure thread and has no process context.
263 The LWKT model has a number of key features that can be counted on no matter
264 the architecture. These features are designed to remove or reduce contention
269 Each cpu in the system has its own self-contained LWKT scheduler.
270 Threads are locked to their cpus by design and can only be moved to other
271 cpus under certain special circumstances. Any LWKT scheduling operation
272 on a particular cpu is only directly executed on that cpu.
273 This means that the core LWKT scheduler can schedule, deschedule,
274 and switch between threads within a cpu's domain without any locking
275 whatsoever. No MP lock, nothing except a simple critical section.</li>
278 A thread will never be preemptively moved to another cpu while it is
279 running in the kernel, a thread will never be moved between cpus while
280 it is blocked. The userland scheduler may migrate a thread that is
281 running in usermode. A thread will never be preemptively switched
282 to a non-interrupt thread. If an interrupt thread preempts the current
283 thread, then the moment the interrupt thread completes or blocks, the
284 preempted thread will resume regardless of its scheduling state. For
285 example, a thread might get preempted after calling
286 lwkt_deschedule_self() but before it actually switches out. This is OK
287 because control will be returned to it directly after the interrupt
288 thread completes or blocks.</li>
292 Due to (2) above, a thread can cache information obtained through the
293 per-cpu globaldata structure without having to obtain any locks and, if
294 the information is known not to be touched by interrupts, without having to
295 enter a critical section. This allows per-cpu caches for various types of
296 information to be implemented with virtually no overhead.</li>
299 A cpu which attempts to schedule a thread belonging to another cpu
300 will issue an IPI-based message to the target cpu to execute the operation.
301 These messages are asynchronous by default and while IPIs may entail some
302 latency, they don't necessarily waste cpu cycles due to that fact. Threads
303 can block such operations by entering a critical section and, in fact,
304 that is what the LWKT scheduler does. Entering and exiting a critical
305 section are considered to be cheap operations and require no locking
306 or locked bus instructions to accomplish.</li>
309 The IPI messaging subsystem
310 deals with FIFO-full deadlocks by spinning and processing the incoming
311 queue while waiting for its outgoing queue to unstall. The IPI messaging
312 subsystem specifically does not switch threads under these circumstances
313 which allows the software to treat it as a non-blocking API even though
314 some spinning might occasionally occur.</li>
319 In addition to these key features, the LWKT model allows for both FAST
320 interrupt preemption **AND** threaded interrupt preemption. FAST
321 interrupts may preempt the current thread when it is not in a critical section.
322 Threaded interrupts may also preempt the current thread. The LWKT system
323 will switch to the threaded interrupt and then switch back to the original
324 when the threaded interrupt blocks or completes. IPI functions operate
325 in a manner very similar to FAST interrupts and have the same trapframe
326 capability. This is used heavily by DragonFly's SYSTIMERS API to distribute
327 hardclock() and statclock() interrupts to all cpus.
329 ### The IPI Messaging Subsystem
332 The LWKT model implements an asynchronous messaging system for communication
333 between cpus. Basically you simply make a call providing the target cpu with
334 a function pointer and data argument which the target cpu executes
335 asynchronously. Since this is an asynchronous model the caller does not wait
336 for a synchronous completion, which greatly improves performance, and the
337 overhead on the target cpu is roughly equivalent to an interrupt.
340 IPI messages operate like FAST Interrupts... meaning that they preempt
341 whatever is running on the target cpu (subject to a critical section), run,
342 and then whatever was running before resumes. For this reason IPI functions
343 are not allowed to block in any manner whatsoever. IPI messages are used
344 to do things like schedule threads and free memory belonging to other cpus.
347 IPI messaging is used heavily by at least half a dozen major LWKT
348 subsystems, including the per-cpu thread scheduler, the slab allocator,
349 and messaging subsystems. Since IPI messaging is a DragonFly-native
350 subsystem, it does not require and does not use the Big Giant Lock.
351 All IPI based functions must therefore be MP-safe (and they are).
353 ### The IPI-based CPU Synchronization Subsystem
356 The LWKT model implements a generalized, machine independent cpu
357 synchronization API. The API may be used to place target cpu(s) into a
358 known state while one is operating on a sensitive data structure. This
359 interface is primarily used to deal with MMU pagetable updates. For
360 example, it is not safe to check and clear the modify bit on a page table
361 entry and then remove the page table entry, even if holding the proper lock.
362 This is because a userland process running on another cpu may be accessing or
363 modifying that page, which will create a race between the TLB writeback on the
364 target cpu and your attempt to clear the page table entry. The proper
365 solution is to place all cpus that might be able to issue a writeback
366 on the page table entry (meaning all cpus in the pmap's pm_active mask)
367 into a known state first, then make the modification, then release the cpus
368 with a request to invalidate their TLB.
371 The API implemented by DragonFly is deadlock-free. Multiple cpu
372 synchronization activities are allowed to operate in parallel and this
373 includes any threads which are mastering a cpu synchronization event for
374 the duration of mastering. Even with this flexibility, since the cpu
375 synchronization interface operates in a controlled environment the callback
376 functions tend to work just like the callback functions used in the
377 IPI messaging subsystem.
379 ### Serializing Tokens
382 A serializing token may be held by any number of threads simultaneously.
383 A thread holding a token is guaranteed that no other thread also
384 holding that same token will be running at the same time.
387 A thread may hold any number of serializing tokens.
390 A thread may hold serializing tokens through a thread yield or blocking
391 condition, but must understand that another thread holding those tokens
392 may be allowed to run while the first thread is not running (blocked or
396 There are theoretically no unresolvable deadlock situations that can
397 arise with the serializing token mechanism. However, the initial
398 implementation can potentially get into livelock issues with multiply
402 Serializing tokens may also be used to protect threads from preempting
403 interrupts that attempt to obtain the same token. This is a slightly
404 different effect from the Big Giant Lock (also known as the MP lock),
405 which does not interlock against interrupts on the same cpu. ***It is
406 important to note that token atomicity is maintained through preemptive
407 conditions, even though preemption involves a temporary switch to another
408 thread. It is not necessary to enter a spl() level or critical section
409 to preserve token atomicity***.
413 Holding a serializing token does <b>not</b> prevent preemptive interrupts
414 from occuring, though it might cause some of those interrupts to
415 block-reschedule. Unthreaded FAST and IPI messaging interrupts are not
416 allowed to use tokens as they have no thread context of their own to operate
417 in. These subsystems are instead interlocked through the use of critical
420 ## <a name="userapi">Creating a Portable User API</a>
423 Most standard UNIX systems employ a system call table through which many types
424 of data, including raw structures, are passed back and forth. The biggest
425 obstacle to the ability for user programs to interoperate with kernels
426 which are older or newer than themselves is the fact that these raw structures
427 often change. The worst offenders are things like network interfaces, route
428 table ioctls, ipfw, and raw process structures for ps, vmstat, etc. But even
429 nondescript system calls like stat() and readdir() have issues. In more
430 general terms the system call list itself can create portability problems.
433 It is a goal of this project to (1) make all actual system calls message-based,
434 (2) pass structural information through capability and element lists
435 instead of as raw structures, and (3) implement a generic 'middle layer'
436 that looks somewhat like an emulation layer, managed by the kernel but loaded
437 into userspace. This layer implements all standard system call APIs
438 and converts them into the appropriate message(s).
440 For example, Linux emulation would
441 operate in (kernel-protected) userland rather then in kernelland. FreeBSD
442 emulation would work the same way. In fact, even 'native' programs will
443 run through an emulation layer in order to see the system call we all know
444 and love. The only difference is that native programs will know that the
445 emulation layer exists and is directly accessible from userland and won't
446 waste an extra INT0x80 (or whatever) to enter the kernel just to get spit
447 back out again into the emulation layer.
450 Another huge advantage of converting system calls to message-based entities
451 is that it completely solves the userland threads issue. One no longer needs
452 multiple kernel contexts or stacks to deal with multiple userland threads,
453 one needs only **one** kernel context and stack per user process. Userland
454 threads would still use rfork() to create a real process for each CPU on the
455 system, but all other operations would use a thread-aware emulation layer.
456 In fact, nearly all userland upcalls would be issued by the emulation layer
457 in userland itself, not directly by the kernel. Here is an example of how
458 a thread-aware emulation layer would work:
462 read(int fd, void *buf, size_t nbytes)
464 syscall_any_msg_t msg;
468 * Use a convenient mostly pre-built message stored in
469 * the userthread structure for synchronous requests.
471 msg = &curthread->td_sysmsg;
475 if ((error = lwkt_domsg(&syscall_port, msg)) != 0) {
476 curthread->td_errno = error;
484 And there you have it. The only 'real' system calls DragonFly would implement
485 would be message-passing primitives for sending, receiving, and waiting.
486 Everything else would go through the emulation layer. Of course, on the
487 kernel side the message command will wind up hitting a dispatch table almost
488 as big as the one that existed in FreeBSD 4.x. But as more and more
489 subsystems become message-based, the syscall messages
490 become more integrated with those subsystems
491 and the overhead of dealing with a 'message' could actually wind up being
492 less than the overhead of dealing with discrete system calls. Portability
493 becomes far easier to accomplish because the 'emulation layer' provides a
494 black box which separates what a userland program expects from what the
495 kernel expects, and the emulation layer can be updated along with the kernel
496 (or a backwards compatible version can be created) which makes portability
497 issues transparent to a userland binary.
500 Plus, we get all the advantages that a message-passing model provides,
501 including a very easy way to wedge into system calls for debugging or other
502 purposes, and a very easy way to create a security layer in the kernel
503 which could, for example, disable or alter certain classes of system calls
504 based on the security environment.
506 ## <a name="vfsmodel">The New VFS Model</a>
509 Fixing the VFS subsystem is probably the single largest piece of work
510 we will be doing. VFS has two serious issues which will require a lot
511 of reworking. First, the current VFS API uses a massive reentrancy model
512 all the way to its core and we want to try to fit it into a threaded
513 messaging API. Second, the current VFS API has one of the single most
514 complex interfaces in the system... VOP_LOOKUP and friends, which resolve
515 file paths. Fixing VFS involves two major pieces of work.
519 First, the VOP_LOOKUP interface and VFS cache will be completely redone. All
520 file paths will be loaded in an unresolved state into the VFS cache by the
521 kernel before **ANY** VFS operation is initiated. The kernel will
522 recurse down the VFS cache and when it hits a leaf it will start creating new
523 entries to represent the unresolved path elements. The tail of the snake
524 will then be handed to VFS_LOOKUP() for resolution. VFS_LOOKUP() will be
525 able to return a new VFS pointer if further resolution is required. For
526 example, it hits a mount point. The kernel will then no longer pass random
527 user supplied strings (and certainly not using user address space!) to the
532 Second, the VOP interface in general will be converted to a messaging
533 interface. All direct userspace addresses will be resolved into VM object
534 ranges by the kernel. The VOP interface will **NOT** handle direct
535 userspace addresses any more. As a messaging interface VOPs can still operate
536 synchronously, and initially that is what we will do. But the intention is
537 to thread most of the VOP interface (i.e. replace the massive reentrancy
538 model with a serialized threaded messaging model). For a high performance
539 filesystem running multiple threads (one per cpu) we can theoretically
540 achieve the same level of performance that a massively reentrant model can
541 achieve. However, blocking points, such as the bread()'s you see all over
542 filesystem code, would either have to be asynchronized, which is difficult,
543 or we would have to spawn a lot more threads to handle parallelism.
544 Initially we can take the (huge) performance hit and serialize the VOP
545 operations into a single thread, then we can optimize the filesystems we
546 care about like UFS. It should be noted that a massive reentrancy model
547 is not going to perform all that much better than, say, a 16-thread model
548 for a filesystem because in both cases the bottleneck is the I/O. As
549 long as one thread is free to handle non-blocking (cached) requests we can
550 achieve 95% of the performance of a massive reentrancy model.
553 A messaging interface is preferable for many reasons, not the least of
554 which being that it makes stacking actually work the way it should work,
555 as independent and opaque elements which stack together to form a whole.
556 For example, with the new API a capability layer could be slapped onto a
557 filesystem that otherwise doesn't implement one of its own, and the
558 enduser would not know the difference. Filesytems are almost universally
559 self-contained entities. A message-based API would allow these entities
560 to run in userspace for debugging or even in a deployment when one
561 absolutely cannot afford a crash. Why run msdosfs or cd9660 in the
562 kernel and risk a crash when it would operate just as well in userland?
563 Debugging and filesystem development are other good reasons for having a
564 messaging API rather than a massively reentrant API.