Adding 2.2.1 notes
[ikiwiki.git] / goals.mdwn
1# DragonFly Design Goals
3Table of contents:
95c2f0c6 5* [[Caching|goals#caching]]
6* [[I/O Model|goals#iomodel]]
7* [[Packages|goals#packages]]
8* [[Messaging|goals#messaging]]
9* [[Threads|goals#threads]]
10* [[User API|goals#userapi]]
11* [[VFS Model|goals#vfsmodel]]
16## <a name="intro">What is it?</a>
20DragonFly is going to be a multi-year project. It will take a lot of groundwork
21to even approach the goals we outline here. By checking our various goal
22links, you can bring up position papers on various nitty-gritty aspects
23of kernel and system design which the project hopes to accomplish.
26First and foremost among all of our goals is a desire to be able to
27implement them in small bite-sized chunks, while at the same time maintaining
28good stability for the system as a whole. While the goals are not listed
29in any particular order, there is a natural order to things which should allow
30us to advance piecemeal without compromising the stability of the
31system as a whole. It's a laudable goal that will sit foremost in our
32minds, even though we know it is probably not 100% achievable. The messaging
33system is going to be key to the effort. If we can get that in-place we
34will have an excellent (and debuggable) API on top of which the remainder of
35the work can be built.
38## <a name="caching">Caching Infrastructure Overview</a>
41Our goal is to create a flexible dual-purpose caching infrastructure which
42mimics the well known and mature MESI (Modified Exclusive Shared Invalid)
43model over a broad range of configurations. The primary purpose of this
44infrastructure will be to protect I/O operations and live memory mappings.
45For example, a range-based MESI model would allow multiple processes to
46simultaneously operate both reads and writes on different portions of a single
47file. If we implement the infrastructure properly we can extend it into
48a networked-clustered environment, getting us a long ways towards achieving
49a single-system-image capability.
52Such a caching infrastructure would, for example, protect a write() from a
53conflicting ftruncate(), and would preserve atomicity between read() and
54write(). The same caching infrastructure would actively invalidate or
55reload memory mappings, effectively replacing most of what VNODE locking
56is used for now.
60The contemplated infrastructure would utilize two-way messaging and focus
61on the VM Object rather than the VNode as the central manager of cached
62data. Some operations, such as a read() or write(), would obtain the
63appropriate range lock on the VM object, issue their I/O, then release the
64lock. Long-term caching operations might collapse ranges together to
65bound the number of range locks being maintained, which allows the
66infrastructure to maintain locks between operations in a scalable fashion.
67In such cases cache operations such as invalidation or, say,
68Exclusive->Shared transitions, would generate a message to the holding
69entity asking it to downgrade or release its range lock.
70In other words, the caching system being contemplated is
71an ***actively managed*** system.
73## <a name="iomodel">New I/O Device Model</a>
76I/O is considerably easier to fix than VFS owing to the fact that
77most devices operate asynchronously already, despite having a semi-synchronous
78API. The I/O model being contemplated consists of three major pieces of work:
81<li>I/O Data will be represented by ranges of VM Objects instead of ranges of
82system or user addresses. This allows I/O devices to operate entirely
83independently of the originating user process.</li>
85<li>Device I/O will be handled through a port/messaging system.
86(See 'messaging' goal.)</li>
88<li>Device I/O will typically be serialized through one or more threads.
89Each device will typically be managed by its own thread but certain high
90performance devices might be managed by multiple threads (up to one per cpu).
91Multithreaded devices would not necessarily compete for resources. For
92example, the TCP stack could be multithreaded with work split up by target
93port number mod N, and thus run on multiple threads (and thus multiple cpus)
94without contention.</li>
99As part of this work I/O messages will utilize a flat 64 bit byte-offset
100rather than block numbers.
103Note that device messages can be acted upon synchronously by the device.
104Do not make the mistake of assuming that messages are unconditionally
105serialized to the device thread because they aren't. See the messaging
106section for more information.
109It should also be noted that the device interface is being designed with
110the flexibility to allow devices to operate as user processes rather than
111as kernel-only threads. Though we probably will not achieve this capability
112for some time, the intention is to eventually be able to do it. There are
113innumerable advantages to being able to transparently pull things like
114virtual block devices and even whole filesystems into userspace.
116## <a name="packages">Dealing with Package Installation</a>
120Applications are such a god-awful mess these days that it is hard to come
121up with a packaging and installation system that can achieve seamless
122installation and flawless operation. We have come to the conclusion that
123the crux of the problem is that even seemingly minor updates to third
124party libraries (that we have no control over) can screw up an
125already-installed application. A packaging system **CAN** walk the
126dependency tree and upgrade everything that needs upgrading. The
127problem is that the packaging system might not actually have a new
128version of the package or packages that need to be upgraded due to some
129third party library being upgraded.
132We need to have the luxury and ability to upgrade only the particular
133package we want, without blowing up applications that depend on said package.
134This isn't to say that it is desirable. Instead we say that it is
135necessary because it allows us to do piecemeal upgrades (as well as
136piecemeal updates to the packaging system's database itself)
137without having to worry about blowing up other things in the process.
138***Eventually*** we would synchronize everything up, but there could be
139periods of a few days to a few months where a few packages might not be,
140and certain very large packages could wind up depending on an old version
141of some library for a very long time. We need to be able to support that.
142We also need to be able to support versioned support/configuration directories
143that might be hardwired by a port. Whenever such conflicts occur, the
144packaging system needs to version the supporting directories as well.
145If two incompatible versions of package X both need /usr/local/etc/X
146we would wind up with /usr/local/etc/X:VERSION1 and /usr/local/etc/X:VERSION2.
150It is possible to accomplish this goal by explicitly versioning
151dependencies and tagging the package binary with an 'environment'... A
152filesystem overlay, you could call it, which applies to supporting
153directories like /usr/lib, /usr/local/lib, even /usr/local/etc, which
154makes only the particular version of the particular libraries
155and/or files the package needs visible to it. Everything else would
156be invisible to that package. By enforcing visibility you would
157know very quickly if you specified your
158package dependencies incorrectly, because your package would not
159be able to find incorrectly placed libraries or supporting files,
160because they were not made accessible when the package was installed.
161For example, if the package says a program depends on version 1.5 of the
162ncurses library, then version 1.5 is all that would be visible to the program
163(it would appear as just libncurses.* to the program).
166With such a system we would be able to install multiple versions of anything
167whether said entities supported fine-grained version control or not, and
168even if (in a normal sytem) there would be conflicts with other entities.
169The packaging system would be responsible for tagging the binaries and
170the operating system would be responsible for the visibility issues. The
171packaging system or possibly even just a cron job would be responsible for
172running through the system and locating all the 'cruft' that is removable
173after you've updated all the packages that used to depend on it.
177Another real advantage of enforced visibility is that it provides us
178with proof-positive that a package does or does not need something. We
179would not have to rely on the packaging system to find out what the
180dependencies were; we could just look at the environment tagged to the
183## <a name="messaging">The Port/Messaging Model</a>
186DragonFly will have a lightweight port/messaging API to go along with its
187lightweight kernel threads. The port/messaging API is very simple
188in concept; You construct a message, you send it to a target port, and
189at some point later you wait for a reply on your reply port. On this
190simple abstraction, we intend to build a high level of capability and
191sophistication. To understand the capabilities of the messaging system,
192you must first understand how a message is dispatched. It basically works
193like this:
195 fubar()
196 {
197 FuMsg msg;
198 initFuMsg(&amp;msg, replyPort, ...);
199 error = targetPort->mp_SendMsg(&amp;msg);
200 if (error == EASYNC) {
201 /* now or at some later time, or wait on reply port */
202 error = waitMsg(&amp;msg);
203 }
204 }
209The messaging API will wrap this basic mechanism into synchronous and
210asynchronous messaging functions. For example, lwkt_domsg() will send
211a message synchronously and wait for a reply. It will set a flag to hint
212to the target port that the message will be blocked on synchronously and
213if the target port returns EASYNC, lwkt_domsg() will block. Likewise
214lwkt_sendmsg() would send a message asynchronously, but if the target port
215returns a synchronous error code (i.e. anything not EASYNC) lwkt_sendmsg()
216will manually queue the now complete message on the reply port itself.
219As you may have guessed, the target port's mp_SendMsg() function has total
220control over how it deals with the message. Regardless of any hints passed
221to it in the messaging flags, the target port can decide to act on the
222message synchronously (in the context of the caller) and return, or it may
223decide to queue the message and return EASYNC. Messaging operations generally
224should not 'block' from the point of view of the initiator. That is, the
225target port should not try to run the message synchronously if doing so would
226cause it to block. Instead, it should queue it to its own thread (or to the
227message queue conveniently embedded in the target port structure itself) and
228return EASYNC.
231A target port might act on a message synchronously for any
232number of reasons. It is in fact precisely the mp_sendMsg() function for
233the target port which deals with per-cpu caches and opportunistic locking
234such as try_mplock() in order to deal with the request without having to
235resort to more expensive queueing / switching.
238The key thing to remember here is that our best case optimization is direct
239execution by mp_SendMsg() with virtually no more overhead than a simple
240subroutine call would otherwise entail. No queueing, no messing around
241with the reply port... If a message can be acted upon synchronously, then we
242are talking about an extremely inexpensive operation. It is this key feature
243that allows us to use a messaging interface by design without having to worry
244about performance issues. We are explicitly NOT employing the type of
245sophistication that, say, Mach uses. We are not trying to track memory
246mappings or pointers or anything like that, at least not in the low level
247messaging interface. User&lt;-&gt;Kernel messaging interfaces simply employ
248mp_SendMsg() function vectors which do the appropriate translation, so as
249far as the sender and recipient are concerned the message will be local to
250their VM context.
253## <a name="threads">The Light Weight Kernel Threading Model</a>
257DragonFly employs a light weight kernel threading (LWKT) model at its core.
258Each process in the system has an associated thread, and most kernel-only
259processes are in fact just pure threads. For example, the pageout daemon
260is a pure thread and has no process context.
263The LWKT model has a number of key features that can be counted on no matter
264the architecture. These features are designed to remove or reduce contention
265between cpus.
267 <li>
269 Each cpu in the system has its own self-contained LWKT scheduler.
270 Threads are locked to their cpus by design and can only be moved to other
271 cpus under certain special circumstances. Any LWKT scheduling operation
272 on a particular cpu is only directly executed on that cpu.
273 This means that the core LWKT scheduler can schedule, deschedule,
274 and switch between threads within a cpu's domain without any locking
275 whatsoever. No MP lock, nothing except a simple critical section.</li>
276 <li>
278 A thread will never be preemptively moved to another cpu while it is
279 running in the kernel, a thread will never be moved between cpus while
280 it is blocked. The userland scheduler may migrate a thread that is
281 running in usermode. A thread will never be preemptively switched
282 to a non-interrupt thread. If an interrupt thread preempts the current
283 thread, then the moment the interrupt thread completes or blocks, the
284 preempted thread will resume regardless of its scheduling state. For
285 example, a thread might get preempted after calling
286 lwkt_deschedule_self() but before it actually switches out. This is OK
287 because control will be returned to it directly after the interrupt
288 thread completes or blocks.</li>
290 <li>
292 Due to (2) above, a thread can cache information obtained through the
293 per-cpu globaldata structure without having to obtain any locks and, if
294 the information is known not to be touched by interrupts, without having to
295 enter a critical section. This allows per-cpu caches for various types of
296 information to be implemented with virtually no overhead.</li>
297 <li>
299 A cpu which attempts to schedule a thread belonging to another cpu
300 will issue an IPI-based message to the target cpu to execute the operation.
301 These messages are asynchronous by default and while IPIs may entail some
302 latency, they don't necessarily waste cpu cycles due to that fact. Threads
303 can block such operations by entering a critical section and, in fact,
304 that is what the LWKT scheduler does. Entering and exiting a critical
305 section are considered to be cheap operations and require no locking
306 or locked bus instructions to accomplish.</li>
307 <li>
309 The IPI messaging subsystem
310 deals with FIFO-full deadlocks by spinning and processing the incoming
311 queue while waiting for its outgoing queue to unstall. The IPI messaging
312 subsystem specifically does not switch threads under these circumstances
313 which allows the software to treat it as a non-blocking API even though
314 some spinning might occasionally occur.</li>
319In addition to these key features, the LWKT model allows for both FAST
320interrupt preemption **AND** threaded interrupt preemption. FAST
321interrupts may preempt the current thread when it is not in a critical section.
322Threaded interrupts may also preempt the current thread. The LWKT system
323will switch to the threaded interrupt and then switch back to the original
324when the threaded interrupt blocks or completes. IPI functions operate
325in a manner very similar to FAST interrupts and have the same trapframe
326capability. This is used heavily by DragonFly's SYSTIMERS API to distribute
327hardclock() and statclock() interrupts to all cpus.
329### The IPI Messaging Subsystem
332The LWKT model implements an asynchronous messaging system for communication
333between cpus. Basically you simply make a call providing the target cpu with
334a function pointer and data argument which the target cpu executes
335asynchronously. Since this is an asynchronous model the caller does not wait
336for a synchronous completion, which greatly improves performance, and the
337overhead on the target cpu is roughly equivalent to an interrupt.
340IPI messages operate like FAST Interrupts... meaning that they preempt
341whatever is running on the target cpu (subject to a critical section), run,
342and then whatever was running before resumes. For this reason IPI functions
343are not allowed to block in any manner whatsoever. IPI messages are used
344to do things like schedule threads and free memory belonging to other cpus.
347IPI messaging is used heavily by at least half a dozen major LWKT
348subsystems, including the per-cpu thread scheduler, the slab allocator,
349and messaging subsystems. Since IPI messaging is a DragonFly-native
350subsystem, it does not require and does not use the Big Giant Lock.
351All IPI based functions must therefore be MP-safe (and they are).
353### The IPI-based CPU Synchronization Subsystem
356The LWKT model implements a generalized, machine independent cpu
357synchronization API. The API may be used to place target cpu(s) into a
358known state while one is operating on a sensitive data structure. This
359interface is primarily used to deal with MMU pagetable updates. For
360example, it is not safe to check and clear the modify bit on a page table
361entry and then remove the page table entry, even if holding the proper lock.
362This is because a userland process running on another cpu may be accessing or
363modifying that page, which will create a race between the TLB writeback on the
364target cpu and your attempt to clear the page table entry. The proper
365solution is to place all cpus that might be able to issue a writeback
366on the page table entry (meaning all cpus in the pmap's pm_active mask)
367into a known state first, then make the modification, then release the cpus
368with a request to invalidate their TLB.
371The API implemented by DragonFly is deadlock-free. Multiple cpu
372synchronization activities are allowed to operate in parallel and this
373includes any threads which are mastering a cpu synchronization event for
374the duration of mastering. Even with this flexibility, since the cpu
375synchronization interface operates in a controlled environment the callback
376functions tend to work just like the callback functions used in the
377IPI messaging subsystem.
379### Serializing Tokens
382A serializing token may be held by any number of threads simultaneously.
383A thread holding a token is guaranteed that no other thread also
384holding that same token will be running at the same time.
387A thread may hold any number of serializing tokens.
390A thread may hold serializing tokens through a thread yield or blocking
391condition, but must understand that another thread holding those tokens
392may be allowed to run while the first thread is not running (blocked or
393yielded away).
396There are theoretically no unresolvable deadlock situations that can
397arise with the serializing token mechanism. However, the initial
398implementation can potentially get into livelock issues with multiply
399held tokens.
402Serializing tokens may also be used to protect threads from preempting
403interrupts that attempt to obtain the same token. This is a slightly
404different effect from the Big Giant Lock (also known as the MP lock),
405which does not interlock against interrupts on the same cpu. ***It is
406important to note that token atomicity is maintained through preemptive
407conditions, even though preemption involves a temporary switch to another
408thread. It is not necessary to enter a spl() level or critical section
409to preserve token atomicity***.
413Holding a serializing token does <b>not</b> prevent preemptive interrupts
414from occuring, though it might cause some of those interrupts to
415block-reschedule. Unthreaded FAST and IPI messaging interrupts are not
416allowed to use tokens as they have no thread context of their own to operate
417in. These subsystems are instead interlocked through the use of critical
420## <a name="userapi">Creating a Portable User API</a>
423Most standard UNIX systems employ a system call table through which many types
424of data, including raw structures, are passed back and forth. The biggest
425obstacle to the ability for user programs to interoperate with kernels
426which are older or newer than themselves is the fact that these raw structures
427often change. The worst offenders are things like network interfaces, route
428table ioctls, ipfw, and raw process structures for ps, vmstat, etc. But even
429nondescript system calls like stat() and readdir() have issues. In more
430general terms the system call list itself can create portability problems.
433It is a goal of this project to (1) make all actual system calls message-based,
434(2) pass structural information through capability and element lists
435instead of as raw structures, and (3) implement a generic 'middle layer'
436that looks somewhat like an emulation layer, managed by the kernel but loaded
437into userspace. This layer implements all standard system call APIs
438and converts them into the appropriate message(s).
440For example, Linux emulation would
441operate in (kernel-protected) userland rather then in kernelland. FreeBSD
442emulation would work the same way. In fact, even 'native' programs will
443run through an emulation layer in order to see the system call we all know
444and love. The only difference is that native programs will know that the
445emulation layer exists and is directly accessible from userland and won't
446waste an extra INT0x80 (or whatever) to enter the kernel just to get spit
447back out again into the emulation layer.
450Another huge advantage of converting system calls to message-based entities
451is that it completely solves the userland threads issue. One no longer needs
452multiple kernel contexts or stacks to deal with multiple userland threads,
453one needs only **one** kernel context and stack per user process. Userland
454threads would still use rfork() to create a real process for each CPU on the
455system, but all other operations would use a thread-aware emulation layer.
456In fact, nearly all userland upcalls would be issued by the emulation layer
457in userland itself, not directly by the kernel. Here is an example of how
458a thread-aware emulation layer would work:
461 ssize_t
462 read(int fd, void *buf, size_t nbytes)
463 {
464 syscall_any_msg_t msg;
465 int error;
467 /*
468 * Use a convenient mostly pre-built message stored in
469 * the userthread structure for synchronous requests.
470 */
471 msg = &amp;curthread->td_sysmsg;
472 msg->fd = fd;
473 msg->buf = buf;
474 msg->nbytes = bytes;
475 if ((error = lwkt_domsg(&amp;syscall_port, msg)) != 0) {
476 curthread->td_errno = error;
477 msg->result = -1;
478 }
479 return(msg->result);
480 }
484And there you have it. The only 'real' system calls DragonFly would implement
485would be message-passing primitives for sending, receiving, and waiting.
486Everything else would go through the emulation layer. Of course, on the
487kernel side the message command will wind up hitting a dispatch table almost
488as big as the one that existed in FreeBSD 4.x. But as more and more
489subsystems become message-based, the syscall messages
490become more integrated with those subsystems
491and the overhead of dealing with a 'message' could actually wind up being
492less than the overhead of dealing with discrete system calls. Portability
493becomes far easier to accomplish because the 'emulation layer' provides a
494black box which separates what a userland program expects from what the
495kernel expects, and the emulation layer can be updated along with the kernel
496(or a backwards compatible version can be created) which makes portability
497issues transparent to a userland binary.
500Plus, we get all the advantages that a message-passing model provides,
501including a very easy way to wedge into system calls for debugging or other
502purposes, and a very easy way to create a security layer in the kernel
503which could, for example, disable or alter certain classes of system calls
504based on the security environment.
506## <a name="vfsmodel">The New VFS Model</a>
509Fixing the VFS subsystem is probably the single largest piece of work
510we will be doing. VFS has two serious issues which will require a lot
511of reworking. First, the current VFS API uses a massive reentrancy model
512all the way to its core and we want to try to fit it into a threaded
513messaging API. Second, the current VFS API has one of the single most
514complex interfaces in the system... VOP_LOOKUP and friends, which resolve
515file paths. Fixing VFS involves two major pieces of work.
519First, the VOP_LOOKUP interface and VFS cache will be completely redone. All
520file paths will be loaded in an unresolved state into the VFS cache by the
521kernel before **ANY** VFS operation is initiated. The kernel will
522recurse down the VFS cache and when it hits a leaf it will start creating new
523entries to represent the unresolved path elements. The tail of the snake
524will then be handed to VFS_LOOKUP() for resolution. VFS_LOOKUP() will be
525able to return a new VFS pointer if further resolution is required. For
526example, it hits a mount point. The kernel will then no longer pass random
527user supplied strings (and certainly not using user address space!) to the
528VFS subsystem.
532Second, the VOP interface in general will be converted to a messaging
533interface. All direct userspace addresses will be resolved into VM object
534ranges by the kernel. The VOP interface will **NOT** handle direct
535userspace addresses any more. As a messaging interface VOPs can still operate
536synchronously, and initially that is what we will do. But the intention is
537to thread most of the VOP interface (i.e. replace the massive reentrancy
538model with a serialized threaded messaging model). For a high performance
539filesystem running multiple threads (one per cpu) we can theoretically
540achieve the same level of performance that a massively reentrant model can
541achieve. However, blocking points, such as the bread()'s you see all over
542filesystem code, would either have to be asynchronized, which is difficult,
543 or we would have to spawn a lot more threads to handle parallelism.
544Initially we can take the (huge) performance hit and serialize the VOP
545operations into a single thread, then we can optimize the filesystems we
546care about like UFS. It should be noted that a massive reentrancy model
547is not going to perform all that much better than, say, a 16-thread model
548for a filesystem because in both cases the bottleneck is the I/O. As
549long as one thread is free to handle non-blocking (cached) requests we can
550achieve 95% of the performance of a massive reentrancy model.
553A messaging interface is preferable for many reasons, not the least of
554which being that it makes stacking actually work the way it should work,
555as independent and opaque elements which stack together to form a whole.
556For example, with the new API a capability layer could be slapped onto a
557filesystem that otherwise doesn't implement one of its own, and the
558enduser would not know the difference. Filesytems are almost universally
559self-contained entities. A message-based API would allow these entities
560to run in userspace for debugging or even in a deployment when one
561absolutely cannot afford a crash. Why run msdosfs or cd9660 in the
562kernel and risk a crash when it would operate just as well in userland?
563Debugging and filesystem development are other good reasons for having a
564messaging API rather than a massively reentrant API.