| 1 | #TITLE: The Budget Fair Queueing Disk Scheduler for DragonFlyBSD |
| 2 | #INCLUDE: note.sty |
| 3 | #MAKETITLE |
| 4 | * Introduction |
The BFQ disk scheduler was invented by Paolo Valente. The current version
of BFQ in DragonFlyBSD is implemented according to his technical report [1].
Some additional features, inspired by the Linux version [2] but written
from scratch, have also been added to the current version.
| 10 | |
Like the CFQ (Completely Fair Queuing) disk scheduler under Linux, BFQ is a
fair queueing scheduler that aims to improve the interactivity and lower
the latency of the system. Maximizing throughput, however, is not the major
design goal of BFQ. It is therefore better to switch to BFQ if the computer
is used as a desktop, where interactivity generally matters more than raw
throughput.
| 16 | |
| 17 | * Basic Principles of the BFQ Scheduler |
| 18 | |
| 19 | ** Budget |
| 20 | |
The core concept of BFQ is the "budget" of every thread: the maximum
amount of service (measured by the size of the I/O requests) that a
thread can receive while the scheduler is serving it exclusively. Once a
thread uses up its budget, it is taken off the scheduler, assigned a new
budget, and queued (again) into the fair queue. BFQ then selects another
thread to serve exclusively.
| 27 | |
| 28 | ** The WF^2Q+ fair queueing algorithm |
| 29 | |
BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was
first used in routers to fairly dispatch network packets from various
connections. If we replace the terms "packets" and "connections" with "I/O
requests" and "threads (or processes)", we arrive at the basic idea of how
this algorithm is applied to the BFQ scheduler.
| 35 | |
The WF^2Q+ algorithm decides which thread BFQ selects and serves next
when the previous thread uses up its budget. It is based on the notion of
"virtual time", which is in fact the service offered and received
(measured in bytes or sectors in the implementation). It maintains a
global virtual time, which is the amount of service offered globally. It
also maintains two attributes for every thread: the virtual eligible time
and the virtual deadline. The former is the total service the thread has
received, while the latter is the expected "time" of selection: the
thread expects to be selected by the algorithm when the global virtual
time reaches its deadline.
| 46 | |
The WF^2Q+ algorithm always selects the thread with the minimum deadline
among the threads whose eligible time is no later than the global virtual
time. Intuitively, if all threads consume the same amount of budget, they
will be selected in turn and receive equal shares of disk bandwidth; if
one thread consumes more budget than the others, it will be selected less
frequently.
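
When a thread is queued (again) with a fresh budget, the textbook WF^2Q+
timestamp rules look as follows. This is a sketch of the standard
formulation; BFQ's actual bookkeeping (e.g. how weights are chosen) may
differ in details:

''ve_i = max(V, vd_i)''

''vd_i = ve_i + budget_i / weight_i''

Here ''V'' is the global virtual time, ''ve_i'' and ''vd_i'' are thread
''i'''s virtual eligible time and virtual deadline, and ''weight_i'' is
its share. The thread is eligible when ''ve_i <= V'', and among eligible
threads the one with the smallest ''vd_i'' is served next.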
| 52 | |
| 53 | * Implementation |
| 54 | The BFQ scheduler is written on top of the ''dsched'' framework. |
However, it needs more from ''dsched'' than the framework originally
provided: the scheduler has to be notified when the disk is idle or about
to become idle, since only with this notification can it dispatch further
I/O requests to the driver. Therefore, before the scheduler itself was
implemented, a request polling feature was added to the ''dsched''
framework.
| 60 | |
| 61 | ** Request polling in dsched |
Before request polling was implemented, the ''dsched'' framework did not
have a ''dequeue()'' interface for scheduling policies running on top of
it. Instead, it provided some ''strategy()'' functions for a scheduler to
call when it "guesses" that the disk may be able to receive more I/O
requests.
| 66 | |
The request polling feature moves this guesswork into ''dsched'' by
maintaining a variable called ''current_tag_queue_depth'', the estimated
depth of the disk's NCQ or TCQ. A variable called
''max_tag_queue_depth'' is initialized to the maximum depth of the disk's
TCQ or NCQ, which can be obtained from the driver.
| 72 | |
The request polling feature is not restricted to BFQ; any policy on the
''dsched'' framework can make use of it. To use this feature, a policy
must do the following (a sketch follows this list):
@ Monitor ''current_tag_queue_depth'', and push as many ''bio''s as it
  can until the depth reaches the maximum value. Monitoring can be
  achieved by:
  @ Creating a monitor thread and polling the value periodically (not
    recommended)
  @ Monitoring the value when:
    @ some ''bio''s are done
    @ some ''bio''s are pushed to the scheduler through ''dsched'''s
      ''queue()'' interface. In fact, the policy may register a
      ''polling_func'' callback, which is called by ''dsched'' when
      a ''bio'' dispatched by
      ''dsched_strategy_request_polling()'' is done.
@ Use ''dsched_strategy_request_polling()'' to dispatch the ''bio''s.
  This ''strategy()'' call raises ''current_tag_queue_depth''; the depth
  drops back when the dispatched ''bio'' completes. Note that unlike
  ''dsched_strategy_async()'', a policy cannot register a ''biodone()''
  callback that gets called when the dispatched ''bio'' is done.
  Instead, if such a callback is needed, the policy should:
  @ [optional] Register a biodone callback function (type
    ''dsched_bio_done_t'') by assigning it to ''polling_func'' in
    the policy structure. Note: this function must not block
    (e.g. on locks) and must be MPSAFE; it must not be changed after
    the ''prepare()'' interface has been called.
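
As a rough illustration, the dispatch loop of a polling-aware policy
could look like the sketch below. Everything prefixed ''my_'' is
hypothetical, and the argument list of
''dsched_strategy_request_polling()'' is an assumption based on the
description above, not copied from the headers:

    /* Hypothetical dispatch loop for a polling-aware dsched policy. */
    static void
    my_try_dispatch(struct my_policy_ctx *ctx)
    {
            struct bio *bio;

            /*
             * current/max_tag_queue_depth are maintained by dsched;
             * how the policy reads them is elided here.  Keep the
             * disk's tag queue full, but never overfill it.
             */
            while (my_current_depth(ctx) < my_max_depth(ctx) &&
                (bio = my_dequeue_bio(ctx)) != NULL) {
                    dsched_strategy_request_polling(ctx->dp, bio,
                        ctx->diskctx);
            }
    }

The same function would typically be called again from the registered
''polling_func'' callback, since every completed ''bio'' frees one slot
in the tag queue.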
| 99 | |
** The WF^2Q+ fair queue
The WF^2Q+ fair queueing algorithm is implemented in
''sys/kern/dsched/bfq/wf2q.c''.

To implement the operations WF^2Q+ needs efficiently, a data structure
called an "augmented binary tree" is used. With its help, WF^2Q+ can
select the proper thread, as described above, in O(log N) time, where N
is the number of threads in the tree. Insertion and deletion are
O(log N) as well. Detailed information on implementing WF^2Q+ with an
augmented tree can be found in [3].
| 110 | |
Before the implementation of BFQ, ''tree.h'', which contains the
red-black tree implementation in \DragonFly, did not support
augmentation. Thus the ''tree.h'' from FreeBSD was ported.
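
The central query is: among the threads whose eligible time is no later
than the global virtual time, which one has the smallest deadline? The
user-space model below sketches how the augmented tree answers it,
following [3]: the tree is keyed by eligible time, and every node caches
the minimum deadline of its subtree. All names are illustrative and do
not come from ''wf2q.c'':

    #include <stddef.h>
    #include <stdint.h>

    struct node {
            struct node     *left, *right;  /* ordered by ve */
            uint64_t        ve;     /* virtual eligible time (key) */
            uint64_t        vd;     /* virtual deadline */
            uint64_t        min_vd; /* min vd in this subtree */
    };

    /* Minimum deadline among nodes with ve <= V, in O(log N). */
    static uint64_t
    eligible_min_vd(const struct node *n, uint64_t V)
    {
            uint64_t best = UINT64_MAX;

            while (n != NULL) {
                    if (n->ve <= V) {
                            /* n is eligible ... */
                            if (n->vd < best)
                                    best = n->vd;
                            /*
                             * ... and so is its entire left subtree:
                             * use the cached minimum instead of
                             * descending into it.
                             */
                            if (n->left != NULL &&
                                n->left->min_vd < best)
                                    best = n->left->min_vd;
                            n = n->right;
                    } else {
                            /* n and its right subtree are ineligible */
                            n = n->left;
                    }
            }
            return (best);  /* UINT64_MAX: no eligible thread */
    }

Rotations in the red-black tree must keep ''min_vd'' consistent, which
is exactly what the augmentation support in the ported FreeBSD
''tree.h'' provides.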
| 114 | |
| 115 | ** The structure of the BFQ scheduler: helper thread, lwkt message and why |
| 116 | |
In the current version, a helper thread is used to execute the following
operations:
| 119 | |
| 120 | *** Serialized ''bfq_dequeue()'' |
The ''bfq_dequeue()'' function is the core of the BFQ scheduler. It is
responsible for serving a thread within a preset time slice, dispatching
the ''bio''s of that thread, and selecting another thread from the WF^2Q+
fair queue when the current thread runs out of budget. It should be
called whenever the disk is idle or about to become idle.
| 126 | |
To avoid blocking ''ithreads'' (interrupt threads), the current version
uses a helper thread to dispatch all ''bio''s to the lower-level driver;
that is, the ''bfq_dequeue()'' function is only called by the helper
thread.
| 130 | |
Originally, ''bfq_dequeue()'' could be called by:
@ ''dsched_request_polling_biodone()'', which is called by an interrupt
  thread when an I/O request is completed by the hard drive.
@ ''bfq_queue()'', after a user thread pushes its ''bio''s to the
  scheduler.
@ ''bfq_timeout()'', after the scheduler finishes suspending.
@ ''bfq_destroy_tdio()'', when the ''tdio'' being destroyed is the one
  the scheduler is waiting for.
| 139 | |
Now these callers uniformly send an lwkt message to the helper thread,
and all calls to ''bfq_dequeue()'' are thus serialized.
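
Schematically, the pattern looks like the sketch below. The message
structure, command codes and port handling are illustrative assumptions
(message allocation and lifetime are glossed over); only the generic
lwkt message primitives are real kernel calls:

    /* hypothetical command codes and message for the helper thread */
    enum { BFQ_MSG_DEQUEUE, BFQ_MSG_DESTROY_TDIO, BFQ_MSG_AS_TIMEOUT };

    struct bfq_msg {
            struct lwkt_msg hdr;
            int             cmd;
    };

    /* caller side: any of the four callers listed above */
    static void
    bfq_send_cmd(lwkt_port_t helper_port, struct bfq_msg *msg, int cmd)
    {
            lwkt_initmsg(&msg->hdr, &curthread->td_msgport, 0);
            msg->cmd = cmd;
            lwkt_sendmsg(helper_port, &msg->hdr);
    }

    /* helper thread: the only context that ever runs bfq_dequeue() */
    static void
    bfq_helper_thread(void *dummy)
    {
            struct bfq_msg *msg;

            for (;;) {
                    msg = lwkt_waitport(&curthread->td_msgport, 0);
                    /* switch on msg->cmd: bfq_dequeue(), ... */
                    lwkt_replymsg(&msg->hdr, 0);
            }
    }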
| 142 | |
| 143 | *** Non-blocking ''bfq_timeout()'' |
''bfq_timeout()'' needs to acquire BFQ_LOCK, which may cause the calling
thread, belonging to the callout facility, to block on it. To avoid this,
in the current version the function called when the callout alarm fires
merely sends a message to the helper thread.
| 148 | |
| 149 | *** Non-blocking ''bfq_destroy_tdio()'' |
Due to the high latency experienced in some test cases (blogbench), we
found that blocking on the destruction of a thread is not healthy.
Therefore the helper thread now receives a message to destroy a ''tdio''
and calls ''bfq_destroy_tdio()'' itself. Note that inside that function
no operation may touch the destroyed ''thread_io'' structure, because it
may already have been recycled.
| 156 | |
| 157 | *** Possible Performance Issues |
| 158 | |
As almost all major scheduler operations are serialized (in fact, only
''bfq_queue()'' and the customized biodone function are exceptions), the
performance is not as high as expected, and some benchmarks confirm this.
The helper thread appears to be the most likely source of the high
latency; this should be fixed in the next version by refactoring the
synchronization and using as little locking as possible.
| 165 | |
| 166 | |
| 167 | ** How the budget of a thread is adjusted by its behaviors |
Ideally, all threads would be assigned equal budgets, and they would
consume those budgets immediately. However, this abstraction is far from
real-world conditions. First, a thread could do random I/O, which is
very time-consuming. Second, it could be a CPU-intensive thread that
seldom does I/O and thus consumes its budget very slowly.
| 173 | |
Because the BFQ scheduler operates in the service domain, it knows
nothing about time-domain latency issues, so the actual performance (and
interactivity) can be hurt by the two types of threads above. As a
result, we have to place time-domain restrictions on all threads to
ensure low latency.
| 178 | |
First, we assign a preset time slice (200ms) to every thread, and a
thread is only served within that interval. If a thread does not use up
its budget, the scheduler reduces its budget to the amount it has
consumed in the current time slice. Note that a lower budget does not
mean a lower share of bandwidth: because of the WF^2Q+ algorithm, the
thread will simply be selected more frequently.
| 185 | |
Second, if a thread that still has enough budget pushes no further I/O
requests even after the whole scheduler suspends to wait a while for it,
its budget is reduced as well. And if the thread consumes its budget too
slowly (for example, at the current speed it would consume less than 2/3
of its budget), it is punished by being ''charged a full budget''. As a
result, the next time it is selected will be later than expected.
| 192 | |
Third, if a thread uses up its budget within the time slice, its budget
is increased. There are two kinds of increment:
@ If the current budget is less than a threshold, it is doubled;
@ otherwise it gets a pre-defined linear increment.
| 197 | |
As one would expect, through this adjustment process every thread ends
up with a budget it can consume in just about one time slice.
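
The whole feedback loop fits in a few lines. In the sketch below, the
doubling threshold and the linear step are made-up constants; only the
overall shape follows the rules above:

    #define THRESH  (256 * 1024)    /* illustrative, not BFQ's value */
    #define STEP    (64 * 1024)     /* illustrative, not BFQ's value */

    /* run when a thread leaves service */
    static int
    budget_adjust(int budget, int consumed, int max_budget,
        int slice_expired)
    {
            if (slice_expired)      /* too slow: shrink to actual usage */
                    return (consumed);
            if (budget < THRESH)    /* used it all quickly: grow it */
                    budget *= 2;
            else
                    budget += STEP;
            return (budget < max_budget ? budget : max_budget);
    }

The clamp to ''max_budget'' matches the tuning section below, where no
thread may be assigned more than the global maximum budget.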
| 200 | |
| 201 | ** The AS feature |
It is possible for a thread to push one ''bio'' and then wait for it to
complete before pushing the next one. Although such a thread may be
doing sequential I/O, the scheduler could misread this behavior and
switch to another thread too early.
| 206 | |
To avoid the above issue, the AS (anticipatory scheduling) feature is
introduced in BFQ: when the thread currently being served still has
enough budget but has no ''bio'' left in its queue, the scheduler
suspends for a while. If the thread pushes one or more ''bio''s during
this wait, its service is not interrupted after the scheduler resumes.
| 212 | |
However, if a thread takes too long to "think", it cannot enjoy the AS
feature; this is described in the next section.

Currently, the AS feature is implemented with the help of the ''callout''
facility.
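
Arming the timer is a one-liner with the ''callout'' facility; the wait
length and the function names below are illustrative:

    /*
     * Sketch: start the AS wait.  bfq_timeout() (see above) only sends
     * a message to the helper thread, so the callout never blocks.
     */
    static void
    bfq_as_wait_begin(struct callout *as_timer, void *sched_ctx, int ms)
    {
            callout_reset(as_timer, ms * hz / 1000, bfq_timeout,
                sched_ctx);
    }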
| 218 | |
| 219 | ** Additional features: ''ttime_avg'', ''seek_avg'' and ''peak_rate'' |
| 220 | |
| 221 | *** Average Thinking Time |
''ttime'' (thinking time) is the interval between the completion of a
thread's last ''bio'' and the moment the thread pushes its next ''bio''.
| 224 | |
We accumulate the thinking time and calculate an average value, by which
the scheduler judges whether a thread takes too long to "think".

If a thread "thinks" for too long, the AS waiting would be a waste of
time, so we turn off the AS feature for such a thread.
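
A simple weighted running average is enough for this purpose; the 7/8
weighting below is an assumption, not necessarily what BFQ uses:

    /* fold one thinking-time sample into the running average */
    static uint64_t
    ttime_update(uint64_t ttime_avg, uint64_t sample)
    {
            return ((ttime_avg * 7 + sample) / 8);
    }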
| 230 | |
| 231 | *** Average Seek Distance |
''seek_avg'' is calculated by accumulating the value ''current_seek_start
- last_seek_end''. A "seeky" thread tends to be given a smaller budget,
and the scheduler will not sample the disk peak rate after serving it.
| 235 | |
| 236 | *** Disk Peak Rate Estimate |
The peak speed of the hard drive is estimated from the amount of I/O
done when:
@ a thread runs out of its budget
@ a non-"seeky" thread runs out of its time slice
| 240 | |
| 241 | The peak rate is used to adapt the max budget automatically: |
| 242 | |
| 243 | ''max_budget = peak_rate * time_slice_length'' |
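
Given the fixed-point representation described in the ''sysctl'' section
below (''peak_rate'' in 1/1024 byte per microsecond), the computation
could look as follows; the scaling shift is derived from that unit and
is otherwise an assumption:

    #define BFQ_SLICE_US    (200 * 1000)    /* the 200ms time slice */

    /* peak_rate is (bytes << 10) per microsecond; undo the scale */
    static uint64_t
    auto_max_budget(uint64_t peak_rate)
    {
            return ((peak_rate * BFQ_SLICE_US) >> 10);
    }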
| 244 | |
| 245 | ** Debug interfaces |
| 246 | *** ''dsched_debug'' |
| 247 | We have defined three ''dsched_debug'' levels: |
| 248 | @ ''BFQ_DEBUG_CRITICAL'': printing errors or warnings. |
@ ''BFQ_DEBUG_NORMAL'': printing important, infrequently occurring
  scheduler decisions.
| 251 | @ ''BFQ_DEBUG_VERBOSE'': printing all scheduler decisions. |
| 252 | |
| 253 | *** Kernel Tracing |
We also make use of the KTR facility to print ''seek_avg'' and
''ttime_avg'' before a thread is destroyed. To enable KTR, add the
following lines to your kernel configuration file:
| 257 | |
| 258 | ''options KTR'' |
| 259 | |
| 260 | ''options KTR_DSCHED_BFQ'' |
| 261 | |
| 262 | *** ''sysctl'' subtree |
| 263 | BFQ creates a subtree under node ''dsched'' for every device using it. The subtree has the following nodes: |
| 264 | @ ''max_budget'': [R/W] global maximum budget; if the auto max budget feature is turned on, this is the automatically adjusted maximum budget. |
| 265 | @ ''peak_rate'': [R] Estimated disk speed, unit: 1/1024 byte per microsecond (fixed point representation) |
@ ''peak_samples'': [R] Number of valid samples used to calculate the peak rate. It stops increasing after it reaches 80.
| 267 | @ ''as_miss'': [R] Counter of times that a thread does not push any ''bio'' after AS waiting. |
| 268 | @ ''as_hit'': [R] Counter of times that a thread pushes at least one ''bio'' after AS waiting. |
| 269 | @ ''as_wait_avg_all'': [R] Average AS waiting time (ms). |
| 270 | @ ''as_wait_avg_miss'': [R] Average AS waiting time (ms), when AS is missed. |
@ ''as_wait_max'': [R] Maximum AS waiting time (ms), measured in the helper thread.
@ ''as_wait_max2'': [R] Maximum AS waiting time (ms), measured in the ''callout'' callback.
@ ''as_high_wait_count'': [R] Counter of times that the scheduler's AS waiting lasts longer than 50ms, measured in the helper thread.
@ ''as_high_wait_count'': [R] Counter of times that the scheduler's AS waiting lasts longer than 50ms, measured in the ''callout'' callback.
| 275 | @ ''avg_time_slice'': [R] Average length of time slice. |
| 276 | @ ''max_time_slice'': [R] Maximum length of time slice. |
| 277 | @ ''as_switch'': [R/W] Switch controlling the global AS feature. |
| 278 | @ ''auto_max_budget_switch'': [R/W] Switch controlling the auto max budget adapting feature. |
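
For example, to inspect the whole subtree and set the max budget by
hand, something like the lines below should work. The node path is only
a guess based on the layout described above (it depends on the device
name), so list the ''dsched'' tree first:

''sysctl dsched''

''sysctl dsched.da0.max_budget=524288''
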
| 279 | * Tuning |
BFQ currently has two tunable parameters: the global AS switch and the
max budget.
| 282 | ** AS feature: on/off |
It is reported that turning AS on may hurt interactivity and greatly
increase the maximum latency, probably due to the over-serialized
implementation of BFQ. However, the blogbench results show that turning
AS on also greatly increases throughput.

Suggestion: turn the AS feature on, for it has little effect on average
latency.
| 289 | ** max budget: the advantages/disadvantages of a higher/lower/auto max budget |
A thread can be assigned a budget no larger than the max budget.
Generally, a higher budget means higher throughput because of fewer
operations on the WF^2Q+ augmented tree, while a lower budget forces the
scheduler to spend more effort on those operations.

However, real-world experiments show that too high a budget hurts
interactivity heavily. Too low a budget also causes higher latency, and
if the budget is less than 64KB (65536 bytes), which is smaller than
some ''bio''s, the scheduler degrades into a round-robin scheduler,
which is a poor shape for a disk scheduler.
| 300 | |
Suggestions:
Do not use the auto max budget; it is usually too high. A budget of about
1/10 of the automatic max budget may be appropriate. In general, 512K
(the default), 256K or 192K can be good values. It is best to determine
the max budget by binary search over the results of some benchmarks.
| 306 | |
| 307 | * Benchmark Results |
| 308 | See http://leaf.dragonflybsd.org/~brillsp/bfq_bench/bfq_bench.html |
| 309 | * Known Bugs & Bottlenecks |
@ When switching from BFQ to another ''dsched'' policy, the system may
  deadlock. (This happens when the sysctl process and the helper thread
  are on the same CPU.)
@ Currently, the performance is not ideal, and BFQ has not been tested
  on a large number of machines. It is not recommended to use this
  version in a production environment.
| 316 | |
| 317 | * Future Plans |
| 318 | |
@ Rewrite the scheduler to synchronize its operations carefully and
  properly, in order to achieve better performance.

@ Distinguish sync and async ''bio''s: as the async ones take less time
  to complete, their budget and time-slice length should differ from
  those of the sync ''bio''s.
| 325 | |
| 326 | |
| 327 | * References |
[1] Paolo Valente and Fabio Checconi, "High Throughput Disk Scheduling with Fair Bandwidth Distribution", IEEE Transactions on Computers, vol. 59, no. 9, 2010.
| 329 | |
| 330 | [2] http://retis.sssup.it/~fabio/linux/bfq/patches/ |
| 331 | |
[3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation", Technical Report, Old Dominion University, 1995.