#TITLE: The Budget Fair Queueing Disk Scheduler for DragonFlyBSD
#INCLUDE: note.sty
* Introduction

The BFQ disk scheduler was invented by Paolo Valente. The current version
of BFQ in DragonFlyBSD is implemented according to his technical report[1].
Some additional features, inspired by the Linux version[2], have also been
added to the current version, but they are written totally from scratch.

Like the CFQ (Completely Fair Queuing) disk scheduler under Linux, BFQ is a
fair queueing scheduler that aims to improve the interactivity and lower
the latency of the system. Maximizing throughput, however, is not a major
design goal of BFQ. It is therefore better to switch to BFQ if the computer
is for desktop usage, where interactivity eclipses throughput in general.
* Basic Principles of the BFQ Scheduler

** Budget

The core concept of BFQ is the "budget" of every thread: the maximum
amount of service (measured by the size of the I/O requests) that a thread
can receive while the scheduler is serving it exclusively. Once a thread
uses up its budget, it is taken off the scheduler, assigned a new budget,
and queued (again) into the fair queue. BFQ then selects another thread to
serve exclusively.
** The WF^2Q+ fair queueing algorithm

BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was
first used in routers to fairly dispatch network packets from various
connections. If we replace the terms "packets" and "connections" with "I/O
requests" and "threads (or processes)", we arrive at the basic idea of how
this algorithm is applied to the BFQ scheduler.
The WF^2Q+ algorithm decides which thread is selected to be served by BFQ
when the last thread uses up its budget. It is based on the notion of
"virtual time", which is actually the service offered and received
(measured in bytes or sectors in the implementation). It maintains a
global virtual time, which is the amount of service offered globally. It
also maintains two attributes for every thread: the virtual eligible time
and the virtual deadline. The former is the total service received, while
the latter is the expected "time" of selection; that is, a thread expects
to be selected by the algorithm when the global virtual time reaches its
virtual deadline.

The WF^2Q+ algorithm always selects the thread with the minimum virtual
deadline among the threads whose eligible time is no later than the global
virtual time. Intuitively, if all threads consume the same amount of
budget, they will be selected alternately and get the same share of disk
bandwidth; if one thread consumes more budget than the others, it will be
selected less frequently.
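As a minimal sketch (not the actual BFQ code; the structure and function
names here are illustrative assumptions), the selection rule above can be
expressed with a linear scan; the real scheduler keeps threads in an
augmented tree instead:

```c
#include <stddef.h>

/* Illustrative per-thread WF^2Q+ state (names are assumptions). */
struct wf2q_thread {
    unsigned long ve;   /* virtual eligible time: service already received */
    unsigned long vd;   /* virtual deadline: when it expects to be selected */
};

/* Select the thread with the minimum virtual deadline among those whose
 * eligible time is no later than the global virtual time. A linear scan
 * is shown for clarity; BFQ uses an augmented tree for O(log N) selection. */
struct wf2q_thread *
wf2q_select(struct wf2q_thread *threads, size_t n, unsigned long vtime)
{
    struct wf2q_thread *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (threads[i].ve <= vtime &&
            (best == NULL || threads[i].vd < best->vd))
            best = &threads[i];
    }
    return best;
}
```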
* Implementation

The BFQ scheduler is written on top of the ''dsched'' framework. However,
it needs more features than ''dsched'' could provide: the scheduler has to
be notified when the disk is idle or about to become idle, and only with
this notification can it dispatch further I/O requests to the driver.
Therefore, before implementing the scheduler itself, a request polling
feature was added to the ''dsched'' framework.
** Request polling in dsched

Before request polling was implemented, the ''dsched'' framework did not
have a ''dequeue()'' interface for scheduling policies running on top of
it. Instead, it provided some ''strategy()'' functions for a scheduler to
call when it "guessed" that the disk might be able to receive more I/O
requests.

The request polling feature transfers the guessing work to ''dsched'' by
maintaining a variable called ''current_tag_queue_depth'', which is the
estimated depth of the disk's NCQ or TCQ. A variable called
''max_tag_queue_depth'' is initialized to the maximum depth of the disk's
TCQ or NCQ, which can be acquired from the driver.
The request polling feature is not restricted to BFQ; it can be used by
any policy on the ''dsched'' framework. To use this feature, a policy
should:
 @ Monitor ''current_tag_queue_depth'', and push as many ''bio''s as it
   can until the depth reaches the maximum value. Monitoring can be
   achieved by:
   @ creating a monitor thread and polling the value periodically (not
     recommended), or
   @ monitoring the value when:
     @ some ''bio''s are done, or
     @ some ''bio''s are pushed to the scheduler by ''dsched'''s
       ''queue()'' interface. Actually, the policy may register a
       ''polling_func'' callback, which is called by ''dsched'' when
       a ''bio'' dispatched by ''dsched_strategy_request_polling()''
       is done.
 @ Use ''dsched_strategy_request_polling()'' to dispatch the ''bio''s.
   This ''strategy()'' call decreases ''current_tag_queue_depth''. Note
   that unlike ''dsched_strategy_async()'', a policy cannot register a
   ''biodone()'' callback to be called when the dispatched ''bio'' is
   done. Instead, if such a callback is needed, the policy should:
   @ [optional] register a biodone callback function (of type
     ''dsched_bio_done_t'') by assigning it to ''polling_func'' in the
     policy structure. Note: this function should not block (e.g. on
     locks) and should be MPSAFE; it should not be changed after the
     ''prepare()'' interface is called.
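The depth accounting behind these steps can be sketched as a user-space
toy (not the dsched implementation; here ''current_tag_queue_depth'' is
interpreted as the number of free NCQ/TCQ slots, since dispatching
decreases it and completion makes room again, and all names below are
assumptions):

```c
/* Toy model of the tag-queue-depth accounting used by request polling. */
struct depth_model {
    int free_slots;   /* models current_tag_queue_depth */
};

/* Push as many pending bios as the estimated queue depth allows; each
 * dispatch (dsched_strategy_request_polling() in dsched) consumes one
 * slot. Returns the number of bios dispatched. */
int
poll_dispatch(struct depth_model *d, int pending)
{
    int dispatched = 0;

    while (pending > 0 && d->free_slots > 0) {
        d->free_slots--;
        pending--;
        dispatched++;
    }
    return dispatched;
}

/* On completion of a polled bio, the slot is released; dsched then calls
 * the policy's polling_func so the policy can dispatch again. */
void
poll_biodone(struct depth_model *d)
{
    d->free_slots++;
}
```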
** The WF^2Q+ fair queue

The WF^2Q+ fair queueing algorithm is implemented in

To efficiently implement the operations that WF^2Q+ requires, a data
structure named "augmented binary tree" is used. With its help, WF^2Q+
can select the proper thread described above within O(log(N)) time, where
N is the number of threads in the tree. The insert and delete operations
scale as O(log(N)) as well. Detailed information about how to implement
WF^2Q+ with an augmented tree can be found in [3].

Before the implementation of BFQ, ''tree.h'', which contains the
definition of the red-black tree in \DragonFly, did not support the
augment function. Thus ''tree.h'' from FreeBSD was ported.
** The structure of the BFQ scheduler: helper thread, lwkt messages and why

In the current version, a helper thread is used to execute the following
jobs:

*** Serialized ''bfq_dequeue()''

The ''bfq_dequeue()'' function is the core of the BFQ scheduler. It is
responsible for serving a thread within a preset time slice, dispatching
the ''bio''s of that thread, and selecting another thread from the WF^2Q+
fair queue when the current thread runs out of budget. It should be called
whenever the disk is idle or about to become idle.
To avoid blocking ''ithreads'' (interrupt threads), in the current version
we use a helper thread to dispatch all ''bio''s to the lower driver; that
is to say, the ''bfq_dequeue()'' function is only called by the helper
thread.

Originally, ''bfq_dequeue()'' could be called by:
 @ ''dsched_request_polling_biodone()'', which is called by an interrupt
   thread when an I/O request is done by the hard drive.
 @ ''bfq_queue()'', after a user thread pushes its ''bio''s to the
   scheduler.
 @ ''bfq_timeout()'', after the scheduler finishes suspending.
 @ ''bfq_destroy_tdio()'', when the tdio being destroyed is waited on by
   the scheduler.

Now these callers uniformly send an lwkt message to the helper thread, and
all ''bfq_dequeue()'' calls are thus serialized.
*** Non-blocking ''bfq_timeout()''

''bfq_timeout()'' needs to acquire BFQ_LOCK, which may cause the calling
thread (the callout facility) to block on it. To avoid this situation, in
the current version a function that sends a message to the helper thread
is called when the callout alarm strikes.
*** Non-blocking ''bfq_destroy_tdio()''

Due to the high latency experienced in some test cases (blogbench), we
found that blocking on destroying a thread is not healthy. Therefore the
helper thread now receives a message to destroy a tdio and calls
''bfq_destroy_tdio()'' instead. Note that in that function no operation
should be applied to the destroyed ''thread_io'' structure, because it may
already have been recycled.
*** Possible Performance Issues

As almost all major scheduler operations are serialized (actually, only
''bfq_queue()'' and the customized biodone function are exceptions), the
performance is not as high as expected, and some benchmarks confirm this.
The helper thread seems to be the most likely source of the high latency;
this should be fixed in the next version by refactoring all the
synchronizing operations and using as little locking as possible.
** How the budget of a thread is adjusted by its behavior

Ideally, equal budgets would be assigned to all threads, and they would
use up their budgets immediately. However, this abstraction is far from
real-world conditions. First, a thread could do random I/O, which is very
time consuming. Second, it could be a CPU-intensive thread that seldom
does I/O and thus consumes its budget very slowly.

As the BFQ scheduler runs in the service domain and cares nothing about
time-domain latency issues, the actual performance (and interactivity)
could be hurt by the two types of threads above. As a result, we have to
add time-domain restrictions to all threads to ensure low latency.
First, we assign a preset time slice (200ms) to every thread, and a thread
is only served within that interval. If a thread does not use up its
budget, the scheduler reduces its budget to the amount it has consumed in
the current time slice. Note that a lower budget does not mean a lower
share of bandwidth: because of the WF^2Q+ algorithm, the thread will be
selected more frequently.
Second, if a thread with enough budget pushes no further I/O requests even
after the whole scheduler suspends to wait a while for it, its budget is
reduced as well. And if the thread consumes its budget too slowly (for
example, at the current speed it would consume less than 2/3 of its
budget), it is punished by ''charging a full budget''. As a result, the
next time it is selected will be later than expected.
Third, if a thread uses up its budget within the time slice, its budget is
increased. There are two types of increment:
 @ if the current budget is less than a threshold, it is doubled, or
 @ it gets a pre-defined linear increment.

As one can expect, through this process of budget adjustment every thread
will be assigned a proper budget that it consumes just within the time
slice.
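The three rules above can be sketched as follows; the threshold, step and
cap values are illustrative assumptions, not BFQ's actual constants, and
all sizes are in bytes:

```c
#define BUDGET_MAX        (512 * 1024)  /* global max budget (assumed) */
#define BUDGET_THRESHOLD  (64 * 1024)   /* doubling threshold (assumed) */
#define BUDGET_STEP       (32 * 1024)   /* linear increment (assumed) */

/* Adjust a thread's budget after it is descheduled: shrink it to what
 * was actually consumed when the time slice expired first, otherwise
 * grow it (doubling below the threshold, linearly above it), capped at
 * the global maximum. */
int
adjust_budget(int budget, int consumed, int timed_out)
{
    if (timed_out) {
        budget = consumed;
    } else {
        if (budget < BUDGET_THRESHOLD)
            budget *= 2;
        else
            budget += BUDGET_STEP;
        if (budget > BUDGET_MAX)
            budget = BUDGET_MAX;
    }
    return budget;
}
```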
** The AS feature

It is possible that a thread pushes one ''bio'' and then waits for it to
be done before pushing another. Although it may be doing sequential I/O,
the scheduler could misinterpret this behavior and switch to another
thread too early.

To avoid this issue, the AS (anticipatory scheduling) feature is
introduced in BFQ: the scheduler suspends for a while when the currently
served thread still has enough budget but no ''bio'' in its queue. If the
thread pushes one or more ''bio''s during the wait, its service is not
interrupted after the scheduler resumes.

However, if a thread takes too long to "think", it cannot enjoy the AS
feature. This is described in the next section.

The AS feature is currently implemented with the help of the ''callout''
facility.
** Additional features: ''ttime_avg'', ''seek_avg'' and ''peak_rate''

*** Average Thinking Time

''ttime'' is the interval between the time when the last ''bio'' of a
thread is done and the time when the thread pushes its next ''bio''.

We accumulate the think time and calculate an average value, by which the
scheduler judges whether a thread takes too long to "think".

If a thread "thinks" for too long, the AS waiting could be a waste of
time, so we turn off the AS feature for such a thread.
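A running think-time average might be kept as sketched below; the 7/8
weighting and the "too long" threshold are illustrative assumptions, not
BFQ's actual constants (times in microseconds):

```c
#define TTIME_MAX_AVG 10000L  /* threshold to disable AS (assumed) */

struct ttime_stat {
    long last_done;  /* completion time of the thread's last bio */
    long avg;        /* running average think time */
};

/* Record the completion time of the thread's last bio. */
void
ttime_done(struct ttime_stat *t, long now)
{
    t->last_done = now;
}

/* When the thread pushes its next bio at time `now`, fold the interval
 * since the last completion into a 7/8-weighted running average. */
void
ttime_update(struct ttime_stat *t, long now)
{
    long sample = now - t->last_done;

    t->avg = (t->avg * 7 + sample) / 8;
}

/* AS waiting is skipped for threads whose average think time is too long. */
int
as_allowed(const struct ttime_stat *t)
{
    return t->avg <= TTIME_MAX_AVG;
}
```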
*** Average Seek Distance

''seek_avg'' is calculated by accumulating the value ''current_seek_start
- last_seek_end''. A "seeky" thread tends to get a smaller budget, and the
scheduler will not sample the disk peak rate after serving it.
*** Disk Peak Rate Estimation

The peak speed of the hard drive is estimated from the amount of I/O done
when:
 @ a thread runs out of its budget, or
 @ a non-"seeky" thread runs out of its time slice.

The peak rate is used to adapt the max budget automatically:

''max_budget = peak_rate * time_slice_length''
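Using the units from the ''sysctl'' section below (''peak_rate'' is a
fixed-point value in 1/1024 bytes per microsecond) and the 200ms time
slice mentioned earlier, the adaptation can be sketched as follows (the
function name is an illustrative assumption):

```c
#define BFQ_TIME_SLICE_US 200000LL   /* 200ms preset time slice */

/* max_budget = peak_rate * time_slice_length, undoing the 1/1024
 * fixed-point scaling of peak_rate to get plain bytes. */
long long
auto_max_budget(long long peak_rate)
{
    return peak_rate * BFQ_TIME_SLICE_US / 1024;
}
```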
** Debug interfaces

*** ''dsched_debug''

We have defined three ''dsched_debug'' levels:
 @ ''BFQ_DEBUG_CRITICAL'': prints errors and warnings.
 @ ''BFQ_DEBUG_NORMAL'': prints important and infrequent scheduler
   decisions.
 @ ''BFQ_DEBUG_VERBOSE'': prints all scheduler decisions.

*** Kernel Tracing

We also make use of the KTR facility to print ''seek_avg'' and
''ttime_avg'' before a thread is destroyed. To enable KTR, add the
following lines to your kernel configuration file:

''options KTR''

''options KTR_DSCHED_BFQ''
*** ''sysctl'' subtree

BFQ creates a subtree under the node ''dsched'' for every device using it.
The subtree has the following nodes:
 @ ''max_budget'': [R/W] global maximum budget; if the auto max budget
   feature is turned on, this is the automatically adjusted maximum budget.
 @ ''peak_rate'': [R] estimated disk speed, unit: 1/1024 byte per
   microsecond (fixed point representation).
 @ ''peak_samples'': [R] number of valid samples used to calculate the
   peak rate. It stops changing after reaching 80.
 @ ''as_miss'': [R] number of times a thread pushed no ''bio'' after AS
   waiting.
 @ ''as_hit'': [R] number of times a thread pushed at least one ''bio''
   after AS waiting.
 @ ''as_wait_avg_all'': [R] average AS waiting time (ms).
 @ ''as_wait_avg_miss'': [R] average AS waiting time (ms) when AS is
   missed.
 @ ''as_wait_max'': [R] maximum AS waiting time (ms), measured in the
   helper thread.
 @ ''as_wait_max2'': [R] maximum AS waiting time (ms), measured in the
   ''callout'' callback.
 @ ''as_high_wait_count'': [R] number of times the scheduler did an AS
   wait longer than 50ms, measured in the helper thread.
 @ ''as_high_wait_count2'': [R] number of times the scheduler did an AS
   wait longer than 50ms, measured in the ''callout'' callback.
 @ ''avg_time_slice'': [R] average length of a time slice.
 @ ''max_time_slice'': [R] maximum length of a time slice.
 @ ''as_switch'': [R/W] switch controlling the global AS feature.
 @ ''auto_max_budget_switch'': [R/W] switch controlling the auto max
   budget adapting feature.
* Tuning

BFQ currently has two tunable parameters: the global AS switch and the max
budget.

** AS feature: on/off

It is reported that turning AS on may hurt interactivity and greatly
increase the maximum latency, probably due to the over-serialized
implementation of BFQ. However, the blogbench results show that turning AS
on also greatly increases throughput.

Suggestion: turn the AS feature on, as it has little effect on average
latency.
** Max budget: the advantages/disadvantages of a higher/lower/auto max budget

A thread can be assigned a budget no greater than the max budget.
Generally, a higher max budget means higher throughput because of fewer
operations on the WF^2Q+ augmented tree, while a lower one forces the
scheduler to spend more time on those operations.

However, real-world experiments show that too high a budget hurts
interactivity heavily. Too low a budget also causes higher latency, and if
the budget is less than 64KB (65536), which is smaller than the size of
some ''bio''s, the scheduler degrades to a round-robin scheduler, which is
not a good form for a disk scheduler.

Do not use the auto max budget; it is usually too high. A budget of 1/10
of the automatic max budget may be proper. In general, 512K (the default),
256K or 192K can be good values. It is better to determine the best max
budget by binary search over the results of some benchmarks.
* Benchmark Results

See http://leaf.dragonflybsd.org/~brillsp/bfq_bench/bfq_bench.html

* Known Bugs & Bottlenecks

 @ When switching from BFQ to another ''dsched'' policy, the system may
   deadlock. (This happens when the sysctl process and the helper thread
   are on the same CPU.)
 @ Currently, the performance is not ideal, and BFQ has not been tested on
   a large number of machines. It is not recommended to use this version
   in a production environment.
* Future Plans

 @ Rewrite the scheduler to carefully and properly synchronize the
   operations, to achieve better performance.
 @ Distinguish sync and async ''bio''s: as the async ones take less time
   to complete, their budget and time slice length should differ from
   those of the sync ''bio''s.
* References

[1] Paolo Valente and Fabio Checconi, "High Throughput Disk Scheduling
with Fair Bandwidth Distribution", IEEE Transactions on Computers,
vol. 59, no. 9.

[2] http://retis.sssup.it/~fabio/linux/bfq/patches/

[3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
First: A Flexible and Accurate Mechanism for Proportional Share Resource
Allocation".