| 1 | #TITLE: The Budget Fair Queueing Disk Scheduler for DragonFlyBSD |
| 2 | #INCLUDE: note.sty |
| 3 | #MAKETITLE |
| 4 | * Introduction |
The BFQ disk scheduler was invented by Paolo Valente. The current version
of BFQ in DragonFlyBSD is implemented according to his technical report [1].
Some additional features, inspired by the Linux version [2] but written
from scratch, have also been added to the current version.
| 10 | |
Like the CFQ (Completely Fair Queuing) disk scheduler under Linux, BFQ is a
fair queueing scheduler that aims to improve the interactivity and lower
the latency of the system. Maximizing throughput, however, is not the major
design goal of BFQ. It is therefore better to switch to BFQ if the computer
is used as a desktop, where interactivity generally matters more than raw
throughput.
| 16 | |
| 17 | * Basic Principles of the BFQ Scheduler |
| 18 | |
| 19 | ** Budget |
| 20 | |
The core concept of BFQ is the "budget" of every thread: the maximum
amount of service (measured by the size of the I/O requests) that a
thread can receive while the scheduler is serving it exclusively. Once a
thread uses up its budget, it is taken off the scheduler, assigned a new
budget, and queued (again) into the fair queue. BFQ then selects another
thread to serve exclusively.
| 27 | |
| 28 | ** The WF^2Q+ fair queueing algorithm |
| 29 | |
BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was
first used in routers to fairly dispatch network packets from various
connections. If we replace the terms "packets" and "connections" with "I/O
requests" and "threads (or processes)", we arrive at the basic idea of how
this algorithm is applied to the BFQ scheduler.
| 35 | |
The WF^2Q+ algorithm decides which thread BFQ selects and serves next
when the previous thread uses up its budget. It is based on the notion of
"virtual time", which is in fact the service offered and received
(measured in bytes or sectors in the implementation). It maintains a
global virtual time, which is the amount of service offered globally. It
also maintains two attributes for every thread: the virtual eligible time
and the virtual deadline. The former is the total service the thread has
received, while the latter is the expected "time" of selection: the
thread expects to be selected by the algorithm when the global virtual
time reaches its deadline.
| 46 | |
The WF^2Q+ algorithm always selects the thread with the minimum deadline
among the threads whose eligible time is no later than the global virtual
time. Intuitively, if all threads consume the same amount of budget, they
will be selected in turn and receive equal shares of disk bandwidth; if
one thread consumes more budget than the others, it will be selected less
frequently.
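
When a thread is queued (again) with a fresh budget, the textbook WF^2Q+
timestamp rules look as follows. This is a sketch of the standard
formulation; BFQ's actual bookkeeping (e.g. how weights are chosen) may
differ in details:

''ve_i = max(V, vd_i)''

''vd_i = ve_i + budget_i / weight_i''

Here ''V'' is the global virtual time, ''ve_i'' and ''vd_i'' are thread
''i'''s virtual eligible time and virtual deadline, and ''weight_i'' is
its share. The thread is eligible when ''ve_i <= V'', and among eligible
threads the one with the smallest ''vd_i'' is served next.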
| 52 | |
| 53 | * Implementation |
| 54 | The BFQ scheduler is written on top of the ''dsched'' framework. |
However, it needs more from ''dsched'' than the framework originally
provided: the scheduler has to be notified when the disk is idle or about
to become idle, since only with this notification can it dispatch further
I/O requests to the driver. Therefore, before the scheduler itself was
implemented, a request polling feature was added to the ''dsched''
framework.
| 60 | |
| 61 | ** Request polling in dsched |
Before request polling was implemented, the ''dsched'' framework did not
have a ''dequeue()'' interface for scheduling policies running on top of
it. Instead, it provided some ''strategy()'' functions for a scheduler to
call when it "guesses" that the disk may be able to receive more I/O
requests.
| 66 | |
The request polling feature moves this guesswork into ''dsched'' by
maintaining a variable called ''current_tag_queue_depth'', the estimated
depth of the disk's NCQ or TCQ. A variable called
''max_tag_queue_depth'' is initialized to the maximum depth of the disk's
TCQ or NCQ, which can be obtained from the driver.
| 72 | |
The request polling feature is not restricted to BFQ; any policy on the
''dsched'' framework can make use of it. To use this feature, a policy
must do the following (a sketch follows this list):
@ Monitor ''current_tag_queue_depth'', and push as many ''bio''s as it
  can until the depth reaches the maximum value. Monitoring can be
  achieved by:
  @ Creating a monitor thread and polling the value periodically (not
    recommended)
  @ Monitoring the value when:
    @ some ''bio''s are done
    @ some ''bio''s are pushed to the scheduler through ''dsched'''s
      ''queue()'' interface. In fact, the policy may register a
      ''polling_func'' callback, which is called by ''dsched'' when
      a ''bio'' dispatched by
      ''dsched_strategy_request_polling()'' is done.
@ Use ''dsched_strategy_request_polling()'' to dispatch the ''bio''s.
  This ''strategy()'' call raises ''current_tag_queue_depth''; the depth
  drops back when the dispatched ''bio'' completes. Note that unlike
  ''dsched_strategy_async()'', a policy cannot register a ''biodone()''
  callback that gets called when the dispatched ''bio'' is done.
  Instead, if such a callback is needed, the policy should:
  @ [optional] Register a biodone callback function (type
    ''dsched_bio_done_t'') by assigning it to ''polling_func'' in
    the policy structure. Note: this function must not block
    (e.g. on locks) and must be MPSAFE; it must not be changed after
    the ''prepare()'' interface has been called.
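
As a rough illustration, the dispatch loop of a polling-aware policy
could look like the sketch below. Everything prefixed ''my_'' is
hypothetical, and the argument list of
''dsched_strategy_request_polling()'' is an assumption based on the
description above, not copied from the headers:

    /* Hypothetical dispatch loop for a polling-aware dsched policy. */
    static void
    my_try_dispatch(struct my_policy_ctx *ctx)
    {
            struct bio *bio;

            /*
             * current/max_tag_queue_depth are maintained by dsched;
             * how the policy reads them is elided here.  Keep the
             * disk's tag queue full, but never overfill it.
             */
            while (my_current_depth(ctx) < my_max_depth(ctx) &&
                (bio = my_dequeue_bio(ctx)) != NULL) {
                    dsched_strategy_request_polling(ctx->dp, bio,
                        ctx->diskctx);
            }
    }

The same function would typically be called again from the registered
''polling_func'' callback, since every completed ''bio'' frees one slot
in the tag queue.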
| 99 | |
** The WF^2Q+ fair queue
The WF^2Q+ fair queueing algorithm is implemented in
''sys/kern/dsched/bfq/wf2q.c''.

To implement the operations WF^2Q+ needs efficiently, a data structure
called an "augmented binary tree" is used. With its help, WF^2Q+ can
select the proper thread, as described above, in O(log N) time, where N
is the number of threads in the tree. Insertion and deletion are
O(log N) as well. Detailed information on implementing WF^2Q+ with an
augmented tree can be found in [3].
| 110 | |
Before the implementation of BFQ, ''tree.h'', which contains the
red-black tree implementation in \DragonFly, did not support
augmentation. Thus the ''tree.h'' from FreeBSD was ported.
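
The central query is: among the threads whose eligible time is no later
than the global virtual time, which one has the smallest deadline? The
user-space model below sketches how the augmented tree answers it,
following [3]: the tree is keyed by eligible time, and every node caches
the minimum deadline of its subtree. All names are illustrative and do
not come from ''wf2q.c'':

    #include <stddef.h>
    #include <stdint.h>

    struct node {
            struct node     *left, *right;  /* ordered by ve */
            uint64_t        ve;     /* virtual eligible time (key) */
            uint64_t        vd;     /* virtual deadline */
            uint64_t        min_vd; /* min vd in this subtree */
    };

    /* Minimum deadline among nodes with ve <= V, in O(log N). */
    static uint64_t
    eligible_min_vd(const struct node *n, uint64_t V)
    {
            uint64_t best = UINT64_MAX;

            while (n != NULL) {
                    if (n->ve <= V) {
                            /* n is eligible ... */
                            if (n->vd < best)
                                    best = n->vd;
                            /*
                             * ... and so is its entire left subtree:
                             * use the cached minimum instead of
                             * descending into it.
                             */
                            if (n->left != NULL &&
                                n->left->min_vd < best)
                                    best = n->left->min_vd;
                            n = n->right;
                    } else {
                            /* n and its right subtree are ineligible */
                            n = n->left;
                    }
            }
            return (best);  /* UINT64_MAX: no eligible thread */
    }

Rotations in the red-black tree must keep ''min_vd'' consistent, which
is exactly what the augmentation support in the ported FreeBSD
''tree.h'' provides.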
| 114 | |
| 115 | ** The structure of the BFQ scheduler: helper thread, lwkt message and why |
| 116 | |
In the current version, a helper thread is used to execute the following
operations:
| 119 | |
| 120 | *** Serialized ''bfq_dequeue()'' |
The ''bfq_dequeue()'' function is the core of the BFQ scheduler. It is
responsible for serving a thread within a preset time slice, dispatching
the ''bio''s of that thread, and selecting another thread from the WF^2Q+
fair queue when the current thread runs out of budget. It should be
called whenever the disk is idle or about to become idle.
| 126 | |
To avoid blocking ''ithreads'' (interrupt threads), the current version
uses a helper thread to dispatch all ''bio''s to the lower-level driver;
that is, the ''bfq_dequeue()'' function is only called by the helper
thread.
| 130 | |
Originally, ''bfq_dequeue()'' could be called by:
@ ''dsched_request_polling_biodone()'', which is called by an interrupt
  thread when an I/O request is completed by the hard drive.
@ ''bfq_queue()'', after a user thread pushes its ''bio''s to the
  scheduler.
@ ''bfq_timeout()'', after the scheduler finishes suspending.
@ ''bfq_destroy_tdio()'', when the ''tdio'' being destroyed is the one
  the scheduler is waiting for.
| 139 | |
Now these callers uniformly send an lwkt message to the helper thread,
and all calls to ''bfq_dequeue()'' are thus serialized.
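
Schematically, the pattern looks like the sketch below. The message
structure, command codes and port handling are illustrative assumptions
(message allocation and lifetime are glossed over); only the generic
lwkt message primitives are real kernel calls:

    /* hypothetical command codes and message for the helper thread */
    enum { BFQ_MSG_DEQUEUE, BFQ_MSG_DESTROY_TDIO, BFQ_MSG_AS_TIMEOUT };

    struct bfq_msg {
            struct lwkt_msg hdr;
            int             cmd;
    };

    /* caller side: any of the four callers listed above */
    static void
    bfq_send_cmd(lwkt_port_t helper_port, struct bfq_msg *msg, int cmd)
    {
            lwkt_initmsg(&msg->hdr, &curthread->td_msgport, 0);
            msg->cmd = cmd;
            lwkt_sendmsg(helper_port, &msg->hdr);
    }

    /* helper thread: the only context that ever runs bfq_dequeue() */
    static void
    bfq_helper_thread(void *dummy)
    {
            struct bfq_msg *msg;

            for (;;) {
                    msg = lwkt_waitport(&curthread->td_msgport, 0);
                    /* switch on msg->cmd: bfq_dequeue(), ... */
                    lwkt_replymsg(&msg->hdr, 0);
            }
    }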
| 142 | |
| 143 | *** Non-blocking ''bfq_timeout()'' |
''bfq_timeout()'' needs to acquire BFQ_LOCK, which may cause the calling
thread, belonging to the callout facility, to block on it. To avoid this,
in the current version the function called when the callout alarm fires
merely sends a message to the helper thread.
| 148 | |
| 149 | *** Non-blocking ''bfq_destroy_tdio()'' |
Due to the high latency experienced in some test cases (blogbench), we
found that blocking on the destruction of a thread is not healthy.
Therefore the helper thread now receives a message to destroy a ''tdio''
and calls ''bfq_destroy_tdio()'' itself. Note that inside that function
no operation may touch the destroyed ''thread_io'' structure, because it
may already have been recycled.
| 156 | |
| 157 | *** Possible Performance Issues |
| 158 | |
As almost all major scheduler operations are serialized (in fact, only
''bfq_queue()'' and the customized biodone function are exceptions), the
performance is not as high as expected, and some benchmarks confirm this.
The helper thread appears to be the most likely source of the high
latency; this should be fixed in the next version by refactoring the
synchronization and using as little locking as possible.
| 165 | |
| 166 | |
| 167 | ** How the budget of a thread is adjusted by its behaviors |
Ideally, all threads would be assigned equal budgets, and they would
consume those budgets immediately. However, this abstraction is far from
real-world conditions. First, a thread could do random I/O, which is
very time-consuming. Second, it could be a CPU-intensive thread that
seldom does I/O and thus consumes its budget very slowly.
| 173 | |
Because the BFQ scheduler operates in the service domain, it knows
nothing about time-domain latency issues, so the actual performance (and
interactivity) can be hurt by the two types of threads above. As a
result, we have to place time-domain restrictions on all threads to
ensure low latency.
| 178 | |
First, we assign a preset time slice (200ms) to every thread, and a
thread is only served within that interval. If a thread does not use up
its budget, the scheduler reduces its budget to the amount it has
consumed in the current time slice. Note that a lower budget does not
mean a lower share of bandwidth: because of the WF^2Q+ algorithm, the
thread will simply be selected more frequently.
| 185 | |
Second, if a thread that still has enough budget pushes no further I/O
requests even after the whole scheduler suspends to wait a while for it,
its budget is reduced as well. And if the thread consumes its budget too
slowly (for example, at the current speed it would consume less than 2/3
of its budget), it is punished by being ''charged a full budget''. As a
result, the next time it is selected will be later than expected.
| 192 | |
Third, if a thread uses up its budget within the time slice, its budget
is increased. There are two kinds of increment:
@ If the current budget is less than a threshold, it is doubled;
@ otherwise it gets a pre-defined linear increment.
| 197 | |
As one would expect, through this adjustment process every thread ends
up with a budget it can consume in just about one time slice.
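
The whole feedback loop fits in a few lines. In the sketch below, the
doubling threshold and the linear step are made-up constants; only the
overall shape follows the rules above:

    #define THRESH  (256 * 1024)    /* illustrative, not BFQ's value */
    #define STEP    (64 * 1024)     /* illustrative, not BFQ's value */

    /* run when a thread leaves service */
    static int
    budget_adjust(int budget, int consumed, int max_budget,
        int slice_expired)
    {
            if (slice_expired)      /* too slow: shrink to actual usage */
                    return (consumed);
            if (budget < THRESH)    /* used it all quickly: grow it */
                    budget *= 2;
            else
                    budget += STEP;
            return (budget < max_budget ? budget : max_budget);
    }

The clamp to ''max_budget'' matches the tuning section below, where no
thread may be assigned more than the global maximum budget.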
| 200 | |
| 201 | ** The AS feature |
It is possible for a thread to push one ''bio'' and then wait for it to
complete before pushing the next one. Although such a thread may be
doing sequential I/O, the scheduler could misread this behavior and
switch to another thread too early.
| 206 | |
To avoid the above issue, the AS (anticipatory scheduling) feature is
introduced in BFQ: when the thread currently being served still has
enough budget but has no ''bio'' left in its queue, the scheduler
suspends for a while. If the thread pushes one or more ''bio''s during
this wait, its service is not interrupted after the scheduler resumes.
| 212 | |
However, if a thread takes too long to "think", it cannot enjoy the AS
feature; this is described in the next section.

Currently, the AS feature is implemented with the help of the ''callout''
facility.
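
Arming the timer is a one-liner with the ''callout'' facility; the wait
length and the function names below are illustrative:

    /*
     * Sketch: start the AS wait.  bfq_timeout() (see above) only sends
     * a message to the helper thread, so the callout never blocks.
     */
    static void
    bfq_as_wait_begin(struct callout *as_timer, void *sched_ctx, int ms)
    {
            callout_reset(as_timer, ms * hz / 1000, bfq_timeout,
                sched_ctx);
    }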
| 218 | |
| 219 | ** Additional features: ''ttime_avg'', ''seek_avg'' and ''peak_rate'' |
| 220 | |
| 221 | *** Average Thinking Time |
''ttime'' (thinking time) is the interval between the completion of a
thread's last ''bio'' and the moment the thread pushes its next ''bio''.
| 224 | |
We accumulate the thinking time and calculate an average value, by which
the scheduler judges whether a thread takes too long to "think".

If a thread "thinks" for too long, the AS waiting would be a waste of
time, so we turn off the AS feature for such a thread.
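
A simple weighted running average is enough for this purpose; the 7/8
weighting below is an assumption, not necessarily what BFQ uses:

    /* fold one thinking-time sample into the running average */
    static uint64_t
    ttime_update(uint64_t ttime_avg, uint64_t sample)
    {
            return ((ttime_avg * 7 + sample) / 8);
    }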
| 230 | |
| 231 | *** Average Seek Distance |
''seek_avg'' is calculated by accumulating the value ''current_seek_start
- last_seek_end''. A "seeky" thread tends to be given a smaller budget,
and the scheduler will not sample the disk peak rate after serving it.
| 235 | |
| 236 | *** Disk Peak Rate Estimate |
The peak speed of the hard drive is estimated from the amount of I/O
done when:
@ a thread runs out of its budget
@ a non-"seeky" thread runs out of its time slice
| 240 | |
| 241 | The peak rate is used to adapt the max budget automatically: |
| 242 | |
| 243 | ''max_budget = peak_rate * time_slice_length'' |
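
Given the fixed-point representation described in the ''sysctl'' section
below (''peak_rate'' in 1/1024 byte per microsecond), the computation
could look as follows; the scaling shift is derived from that unit and
is otherwise an assumption:

    #define BFQ_SLICE_US    (200 * 1000)    /* the 200ms time slice */

    /* peak_rate is (bytes << 10) per microsecond; undo the scale */
    static uint64_t
    auto_max_budget(uint64_t peak_rate)
    {
            return ((peak_rate * BFQ_SLICE_US) >> 10);
    }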
| 244 | |
| 245 | ** Debug interfaces |
| 246 | *** ''dsched_debug'' |
| 247 | We have defined three ''dsched_debug'' levels: |
| 248 | @ ''BFQ_DEBUG_CRITICAL'': printing errors or warnings. |
@ ''BFQ_DEBUG_NORMAL'': printing important, infrequently occurring
  scheduler decisions.
| 251 | @ ''BFQ_DEBUG_VERBOSE'': printing all scheduler decisions. |
| 252 | |
| 253 | *** Kernel Tracing |
We also make use of the KTR facility to print ''seek_avg'' and
''ttime_avg'' before a thread is destroyed. To enable KTR, add the
following lines to your kernel configuration file:
| 257 | |
| 258 | ''options KTR'' |
| 259 | |
| 260 | ''options KTR_DSCHED_BFQ'' |
| 261 | |
| 262 | *** ''sysctl'' subtree |
| 263 | BFQ creates a subtree under node ''dsched'' for every device using it. The subtree has the following nodes: |
| 264 | @ ''max_budget'': [R/W] global maximum budget; if the auto max budget feature is turned on, this is the automatically adjusted maximum budget. |
| 265 | @ ''peak_rate'': [R] Estimated disk speed, unit: 1/1024 byte per microsecond (fixed point representation) |
@ ''peak_samples'': [R] Number of valid samples used to calculate the peak rate. It stops increasing after it reaches 80.
| 267 | @ ''as_miss'': [R] Counter of times that a thread does not push any ''bio'' after AS waiting. |
| 268 | @ ''as_hit'': [R] Counter of times that a thread pushes at least one ''bio'' after AS waiting. |
| 269 | @ ''as_wait_avg_all'': [R] Average AS waiting time (ms). |
| 270 | @ ''as_wait_avg_miss'': [R] Average AS waiting time (ms), when AS is missed. |
@ ''as_wait_max'': [R] Maximum AS waiting time (ms), measured in the helper thread.
@ ''as_wait_max2'': [R] Maximum AS waiting time (ms), measured in the ''callout'' callback.
@ ''as_high_wait_count'': [R] Counter of times that the scheduler's AS waiting lasts longer than 50ms, measured in the helper thread.
@ ''as_high_wait_count'': [R] Counter of times that the scheduler's AS waiting lasts longer than 50ms, measured in the ''callout'' callback.
| 275 | @ ''avg_time_slice'': [R] Average length of time slice. |
| 276 | @ ''max_time_slice'': [R] Maximum length of time slice. |
| 277 | @ ''as_switch'': [R/W] Switch controlling the global AS feature. |
| 278 | @ ''auto_max_budget_switch'': [R/W] Switch controlling the auto max budget adapting feature. |
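
For example, to inspect the whole subtree and set the max budget by
hand, something like the lines below should work. The node path is only
a guess based on the layout described above (it depends on the device
name), so list the ''dsched'' tree first:

''sysctl dsched''

''sysctl dsched.da0.max_budget=524288''
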
| 279 | * Tuning |
BFQ currently has two tunable parameters: the global AS switch and the
max budget.
| 282 | ** AS feature: on/off |
It is reported that turning AS on may hurt interactivity and greatly
increase the maximum latency, probably due to the over-serialized
implementation of BFQ. However, the blogbench results show that turning
AS on also greatly increases throughput.

Suggestion: turn the AS feature on, for it has little effect on average
latency.
| 289 | ** max budget: the advantages/disadvantages of a higher/lower/auto max budget |
A thread can be assigned a budget no larger than the max budget.
Generally, a higher budget means higher throughput because of fewer
operations on the WF^2Q+ augmented tree, while a lower budget forces the
scheduler to spend more effort on those operations.

However, real-world experiments show that too high a budget hurts
interactivity heavily. Too low a budget also causes higher latency, and
if the budget is less than 64KB (65536 bytes), which is smaller than
some ''bio''s, the scheduler degrades into a round-robin scheduler,
which is a poor shape for a disk scheduler.
| 300 | |
Suggestions:
Do not use the auto max budget; it is usually too high. A budget of about
1/10 of the automatic max budget may be appropriate. In general, 512K
(the default), 256K or 192K can be good values. It is best to determine
the max budget by binary search over the results of some benchmarks.
| 306 | |
| 307 | * Benchmark Results |
| 308 | See http://leaf.dragonflybsd.org/~brillsp/bfq_bench/bfq_bench.html |
| 309 | * Known Bugs & Bottlenecks |
@ When switching from BFQ to another ''dsched'' policy, the system may
  deadlock. (This happens when the sysctl process and the helper thread
  are on the same CPU.)
@ Currently, the performance is not ideal, and BFQ has not been tested
  on a large number of machines. It is not recommended to use this
  version in a production environment.
| 316 | |
| 317 | * Future Plans |
| 318 | |
@ Rewrite the scheduler to synchronize its operations carefully and
  properly, in order to achieve better performance.

@ Distinguish sync and async ''bio''s: as the async ones take less time
  to complete, their budget and time-slice length should differ from
  those of the sync ''bio''s.
| 325 | |
| 326 | |
| 327 | * References |
[1] Paolo Valente and Fabio Checconi, "High Throughput Disk Scheduling with Fair Bandwidth Distribution", IEEE Transactions on Computers, vol. 59, no. 9, 2010.
| 329 | |
| 330 | [2] http://retis.sssup.it/~fabio/linux/bfq/patches/ |
| 331 | |
[3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation", Technical Report, Old Dominion University, 1995.