Initial specification of the new DragonFly filesystem.
Currently there is no name for this; DFS would have been nice, but is
already overloaded.

## Feature Summary

* On-demand filesystem check and recovery. No need to scan the entire
  filesystem before going live after a reboot.

* Infinite snapshots (e.g. on the 30 second filesystem sync), with the
  ability to collapse snapshots for any time interval as a means of
  recovering space.

* Multi-master operation, including the ability to self-heal a
  corrupted filesystem by accessing replicated data. Multi-master
  means that each replicated store can act as a master. New
  filesystem ops can run independently on any of the replicated stores
  and will flow to the others. This is done by giving the controller
  of each store a certain amount of guaranteed free space that can be
  filled without having to notify other controllers.

* Infinite logless replication. No requirement to keep a log of changes
  for the purposes of replication, meaning that replication targets can
  be offline for 'days' without affecting performance or operation
  ('mobile' computing, but also replication over slow links for backup
  purposes and other things). Certain information is still logged,
  but only to streamline performance.

* 64 bit file space, 64 bit filesystem space. No space restrictions
  whatsoever.

* Reliably handles data storage for huge multi-hundred-terabyte
  filesystems without fear of unrecoverable corruption.

* Cluster operation - the ability to commit data to a locally
  replicated store independently of other replication nodes, with
  access governed by cache coherency protocols.

* Independent index. Data is laid out in a highly recoverable fashion,
  independent of index generation. Indexes can be regenerated from
  scratch and thus indexes can be updated asynchronously.

## Segmentation

The physical storage backing a filesystem is broken up into large
1MB-4GB segments (64MB is a typical value). Each segment is
self-identifying and contains its own header, data table, and record
table. The operating system glues together filesystems and determines
availability based on the segments it finds.

All segments on a block device are the same size (power-of-2 restricted)
and the OS can probe the disk without any other information simply by
locating a segment header, starting with the largest possible segment
size. Segment size is typically chosen such that a scan of all segment
headers takes no longer than a few seconds.

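As a sketch of how such a probe might work (all names here, including
`struct seg_header` and its fields, are illustrative assumptions rather
than the real on-disk format):

```c
#include <stdint.h>
#include <unistd.h>

#define SEG_HDR_MAGIC   0x44464C59u             /* assumed magic value */
#define SEG_MIN_SIZE    (1ULL << 20)            /* 1MB */
#define SEG_MAX_SIZE    (1ULL << 32)            /* 4GB */

struct seg_header {
        uint32_t magic;         /* identifies a segment header */
        uint32_t seg_shift;     /* log2 of the segment size */
        /* ... filesystem id, segment id, open bit, table indices ... */
};

/*
 * Probe a block device, trying the largest possible segment size
 * first.  A header found at offset 'size' whose recorded size matches
 * confirms the segment size for the whole device.
 */
static uint64_t
probe_segment_size(int fd)
{
        struct seg_header hdr;
        uint64_t size;

        for (size = SEG_MAX_SIZE; size >= SEG_MIN_SIZE; size >>= 1) {
                if (pread(fd, &hdr, sizeof(hdr), (off_t)size) != sizeof(hdr))
                        continue;
                if (hdr.magic == SEG_HDR_MAGIC &&
                    (1ULL << hdr.seg_shift) == size)
                        return (size);
        }
        return (0);             /* no segment header found */
}
```
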
* The segment header contains all the information required to associate
  the segment with a filesystem. Certain information in the segment
  header is updated with an ordered write to 'commit' other work already
  flushed out into the segment. The segment header also contains a
  bit indicating whether the segment is 'open' or not, to aid in
  on-demand recovery.

  The segment ID is stored in the segment header, allowing segments
  to be easily relocated. That is, the segment's location in the
  physical backing store is irrelevant.

* The data table consists of pure data, laid out linearly in the forward
  direction within the segment. Data blocks are variable-sized entities
  containing pure data, with no other identifying information, suitable
  for direct DMA. The segment header has a simple append index for
  the data table.

* The record table consists of fixed-sized records, each carrying a
  reference to data in the data table. The record table is built
  backwards from the end of the segment. The segment header has a
  simple pre-append index for the record table. Each record consists
  of:

  * an object identifier (e.g. inode number)
  * creation transid
  * deletion transid (updated in-place)
  * indexing key (offset or filename key)
  * data table reference (offset, bytes) (up to a 1GB contiguous ref)
  * integrity check

All data elements are completely identified by the record table. All
modifications are historical in nature... that is, no actual data is
destroyed, and one can access a 'snapshot' of the segment as of any
transaction id (i.e. timestamp) simply by ignoring those records
marked as deleted prior to the specified time or created after
the specified time.

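A minimal sketch of such a record and the snapshot rule it implies;
field names and widths are assumptions based on the list above, not
the actual layout:

```c
#include <stdint.h>

struct fs_record {
        uint64_t obj_id;        /* object identifier (e.g. inode number) */
        uint64_t create_tid;    /* creation transaction id */
        uint64_t delete_tid;    /* deletion transid, 0 = still live */
        int64_t  key;           /* indexing key (offset or filename key) */
        uint64_t data_off;      /* data table reference: offset */
        uint32_t data_bytes;    /* data table reference: length (<= 1GB) */
        uint32_t crc;           /* integrity check */
};

/*
 * A record belongs to the snapshot as of transaction id 'tid' if it
 * was created at or before 'tid' and not yet deleted as of 'tid'.
 */
static int
record_visible(const struct fs_record *rec, uint64_t tid)
{
        if (rec->create_tid > tid)
                return (0);     /* created after the snapshot */
        if (rec->delete_tid != 0 && rec->delete_tid <= tid)
                return (0);     /* deleted at or before the snapshot */
        return (1);
}
```
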
Certain updates to the record table occur in-place. In particular,
the deletion transaction id is updated in-place. However, such
updates are not considered 'committed' until the segment header itself
is updated with the latest committed transaction id, and a recovery
operation will undo any deletion transaction id greater than the
committed transaction id in the segment header, as well as free
any uncommitted objects and their related data.

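Recovery of a segment marked 'open' could then roll back anything
newer than the committed transaction id along these lines; this is a
sketch reusing the hypothetical record layout above, and
record_destroy() is an assumed helper:

```c
void record_destroy(struct fs_record *rec);     /* hypothetical helper */

static void
segment_recover(struct fs_record *recs, int nrecs, uint64_t committed_tid)
{
        int i;

        for (i = 0; i < nrecs; ++i) {
                struct fs_record *rec = &recs[i];

                /* Undo uncommitted in-place deletion updates. */
                if (rec->delete_tid > committed_tid)
                        rec->delete_tid = 0;

                /* Free uncommitted records and their related data. */
                if (rec->create_tid > committed_tid)
                        record_destroy(rec);
        }
}
```
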
The entire filesystem can be recovered from just the record and data
tables. Indexing and cross-segment spanning are described in later
sections.

## Physical Space Recovery

Physical space recovery requires actually destroying records marked
as deleted. Snapshots can be collapsed by destroying records whose
creation AND deletion ids both fall within the collapsed interval. The
oldest data is freed up by destroying records whose deletion id is less
than the terminal timestamp.

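In terms of the record fields sketched earlier, the two reclamation
rules might look like this (a sketch, assuming a delete_tid of 0 means
'never deleted'):

```c
/* Collapsible: created AND deleted within the interval [lo, hi]. */
static int
record_collapsible(const struct fs_record *rec, uint64_t lo, uint64_t hi)
{
        return (rec->delete_tid != 0 &&
            rec->create_tid >= lo && rec->delete_tid <= hi);
}

/* Expired: deleted before the terminal timestamp. */
static int
record_expired(const struct fs_record *rec, uint64_t terminal)
{
        return (rec->delete_tid != 0 && rec->delete_tid < terminal);
}
```
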
Record destruction creates holes in both the data table and the record
table. Any holes adjacent to the data table append point or the record
table prepend point are immediately recovered by adjusting the
appropriate indices in the segment header. The operating system may
cache a record of non-adjacent holes (in memory) and reuse the space,
and can also generate an in-memory index of available holes on the
fly when space is very tight (which requires scanning the record table),
but otherwise the recovery of any space not adjacent to the data table
append point requires a performance reorganization of the segment.

## Locating Data Objects, and the Master Record

Data objects are basically the amalgamation of all records with
the same 64 bit object identifier. The top N bits of this identifier
indicate the segment in which the master record for the data object
resides. All 64 bits are made available to userland.

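For instance, with 16 bit segment numbers (N = 16 is an assumption
here; the spec leaves N open), extracting the master record's segment
is a simple shift:

```c
#include <stdint.h>

#define OBJ_SEG_BITS    16      /* N, assumed to be 16 in this sketch */

static inline uint16_t
obj_master_segment(uint64_t obj_id)
{
        return ((uint16_t)(obj_id >> (64 - OBJ_SEG_BITS)));
}
```
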
The filesystem needs to be free to relocate the master record for
a data object on the fly. Relocation is a dynamic process whereby
the master record in a segment becomes a forwarding record to another
segment. Any references to a forwarding record are adjusted on the fly
in order to collapse the chain, and any intermediate forwarding records
can be physically destroyed once all references to them have been
adjusted. However, the ORIGINAL forwarding record must be retained
for all time in order to maintain the uniqueness of the originally
assigned user-visible inode number and to give us a way to locate
the master record given the inode number. We cannot change the inode
number. Overhead is minimal.

A forwarding record is created in two situations:

1. To move the master record to improve performance.
2. When the current segment does not have sufficient space to increase
   the size of the master record, should that become necessary.

A forwarding record is a degenerate case of a master record: the
forwarding information is embedded in the record itself, with no
data table reference at all. The space used is only the space required
to store a record table entry.

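A lookup that tolerates forwarding might run as below.
segment_find_record() is a hypothetical helper, and the encoding
assumption is mine: a forwarding record carries no data table
reference (data_bytes == 0) and reuses data_off as the target segment:

```c
struct fs_record *segment_find_record(uint16_t segno, uint64_t obj_id);
                                        /* hypothetical helper */

static struct fs_record *
master_record_lookup(uint64_t inode)
{
        uint16_t segno = obj_master_segment(inode);
        struct fs_record *rec;

        rec = segment_find_record(segno, inode);
        while (rec != NULL && rec->data_bytes == 0) {
                segno = (uint16_t)rec->data_off;  /* forwarding target */
                rec = segment_find_record(segno, inode);
        }
        /*
         * The caller can now rewrite its own reference to point at
         * 'segno' directly, collapsing the forwarding chain for
         * future lookups.
         */
        return (rec);
}
```
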
The master record for a data object is a special record in the record
table. It is special because it is NOT part of the historical data
store. The creation and deletion transaction ids represent the creation
or deletion of the entire data object, and the data block contains
information on where to find the other bits and pieces of the data
object (in particular, if the data object spans more than one segment).
**A corrupted master record can be completely reconstructed by scanning
the filesystem**.

The master record can be thought of as a persistent cache. It gives
the filesystem a way to quickly locate all the segments that might
contain records for the data object and reserves a limited amount of
space for the filesystem to store direct references to data residing
in the same segment as the master record.

Master record updates are fairly rare. For the most part a master
record must only be updated if a file or directory grows into a
new segment.

## Performance Reorganization

Segments can be repacked in order to clean up fragmentation and
reorganize data and records for more optimal access. Repacking is
accomplished by copying the segment's data into a wholly new
segment on the physical disk, then destroying the old segment.

Since a segment is identified by its header, the actual location of
the segment on the physical media is irrelevant.

For example, master records can wind up strewn about the record
table for a segment. Repacking would pack all of those master
records next to each other.

Similarly, a file or directory might have bits and pieces of data
strewn about a segment. Repacking would de-fragment those entities,
as well as pack together the data related to small files and
separate out the larger chunks related to larger files.

## Segment Allocation

Remember that since the actual physical location of a segment is
independent of the segment identifier (typically an 8 or 16 bit
number), the allocation of segment numbers does not have to be
linear. The filesystem will typically be generous in its allocation
of segment numbers in order to allow for spill-over growth of files
into logically adjacent segments, thus simplifying the segment
range stored in the master record that describes which segments
might contain data records for a data object. For example,
the first segment allocated by the filesystem when using a 16 bit
segment id would not be 0x0000 or 0xffff, but probably 0x8000. The
next segment allocated (when not doing a spill-over) might be 0x4000
or 0xc000, and so forth. The entire segment number space is used even
if there are fewer actual (physical) segments.

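One plausible policy (not specified by this document) that yields
exactly the 0x8000, 0x4000, 0xc000, ... pattern is to hand out segment
ids in bit-reversed counting order, which keeps the maximum spill-over
room between any two allocated numbers:

```c
#include <stdint.h>

/* n = how many segment ids have been handed out so far. */
static uint16_t
next_segment_id(uint16_t n)
{
        uint16_t r = 0;
        int i;

        for (i = 0; i < 16; ++i)        /* bit-reverse n + 1 */
                r |= (uint16_t)(((n + 1) >> i) & 1) << (15 - i);
        return (r);
}
/* next_segment_id(0) == 0x8000, (1) == 0x4000, (2) == 0xc000, ... */
```
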
## Large Cross-Segment Objects

A data object can wind up being far larger than a segment. For that
matter, even a small data object with a lot of history can wind up
being far larger than a segment.

The filesystem does its best to organize the data for such objects
to reduce the size of the master records required to describe them
and to separate historical data from current data, to maintain the
performance of the live filesystem.

## Indexing

An object's master record generally describes the segments containing
the data for the object, and may contain direct references to data
in the same segment as the master record (an optimization or performance
reorganization that is desirable for files smaller than a single
segment).

However, data objects can come to span many records due to
fragmentation, sheer file size, or the retention of a great deal of
history. The records pertaining to these objects may have to be indexed
beyond the space the filesystem is willing to reserve in the master
record in order to improve access.

To be clear, indexing is an optimization. The index is not
required to recover a segment or to recover a filesystem. The intent
is for the master record to always contain sufficient information
to reduce the number of I/Os required to resolve a file access request
to a reasonable number, even if no index is available.

Indexing can occur in three places:

* First, it can occur in the segment itself to organize the records
  in that segment. Such indexes are usually persistently cached in the
  dead space between the data table and the record table, and the
  filesystem heuristically determines how much space is needed for
  that. If the data table or the record table grows into the index,
  the filesystem can choose to relocate, regenerate, or shift the index
  data, or to disallow the growth (extending the data object into a new
  segment). These heuristics are fairly simple.

* Second, it can occur in the master record to roughly organize the
  data space... for example so the filesystem does not have to check
  all segments if a particular file offset (or offset range) and all
  of its history is known to reside in just one. The filesystem is
  typically only willing to allow so much space in the master record
  for a data object to store such an index. If this level of index
  becomes too large it is basically simplified in-place and starts
  requiring the use of the per-segment index to further resolve the
  location of data records for the object.

* Third, it can be generated and cached in memory. When dealing with
  very large multi-segment files it may be beneficial to scan
  the relatively few records in the record table for the related
  segments and simply index them in memory, rather than store an
  index on-disk.

For example, if you are using 64MB segments and have a 20GB file,
only 320 disk accesses (with a few data records read in each access)
are required to construct an index of the entire 20GB file, and the
index itself would require very little memory.

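A sketch of building such an in-memory index, assuming a hypothetical
append_object_records() that reads one segment's record table and
accumulates the records matching the object:

```c
#include <stdint.h>
#include <stdlib.h>

/* hypothetical: appends obj_id's records from segment 'segno' to *allp,
 * returning the new count */
int append_object_records(struct fs_record **allp, int n, uint16_t segno,
        uint64_t obj_id);

static int
record_key_cmp(const void *a, const void *b)
{
        const struct fs_record *ra = a, *rb = b;

        if (ra->key != rb->key)
                return (ra->key < rb->key ? -1 : 1);
        return (0);
}

/* Collect and key-sort all of one object's records, held in memory. */
static struct fs_record *
build_mem_index(uint64_t obj_id, const uint16_t *segs, int nsegs,
        int *countp)
{
        struct fs_record *all = NULL;
        int i, n = 0;

        for (i = 0; i < nsegs; ++i)
                n = append_object_records(&all, n, segs[i], obj_id);
        qsort(all, n, sizeof(*all), record_key_cmp);
        *countp = n;
        return (all);
}
```
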
Indexes are asynchronous entities. The 'official' filesystem store is
the data table and the record table. The host can cache updates in
memory and asynchronously update the performance-improving index.

Indexes residing in segments are regenerated if the segment is marked
open on initial access (due to an unclean shutdown). This allows
persistent per-segment indexes to be updated without requiring any
particular transactional disk synchronization or block ordering.

Indexes are generally implemented using a B+Tree structure.

## Replication

Data and record elements are only directly referenced by their offset
within a segment when referenced from within that same segment. This
means that replication is possible on a segment-by-segment basis.

Segment headers contain asynchronously updated record update log areas
for deletion events (because these modify the deletion timestamp in
an existing record rather than append a new record). These log areas
are of limited size and can overflow under normal operation. They are
ONLY used to streamline replication. If a replication target is not
fast enough it will have to resynchronize by scanning the records on
the source (which itself doesn't usually take very long, so it isn't
considered a big deal); otherwise it can check the log area and simply
pick up where it left off. The intention is to fully support both very
slow replication slaves and mostly off-line slaves.

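The target's catch-up decision might look like this sketch; the
log-area helpers and per-segment progress tracking are assumptions:

```c
#include <stdint.h>

struct repl_target {
        uint64_t last_tid[1 << 16];     /* per-segment sync progress */
};

/* hypothetical helpers */
int  log_area_covers(uint16_t segno, uint64_t since_tid);
void apply_log_entries(struct repl_target *t, uint16_t segno,
        uint64_t since_tid);
void rescan_record_table(struct repl_target *t, uint16_t segno);

static void
replicate_segment(struct repl_target *t, uint16_t segno)
{
        uint64_t since = t->last_tid[segno];

        if (log_area_covers(segno, since))
                apply_log_entries(t, segno, since);     /* incremental */
        else
                rescan_record_table(t, segno);          /* full resync */
}
```
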
The only things actually replicated are the record table entries and
(with the exception of master records) the data table entries. The
data table entry for a master record is never replicated, but is
(re)generated by the replication target. The replication target
is responsible for maintaining its own indexes and managing its own
physical segment space. This gives any replication target a great
deal of robustness and flexibility.

Also note that the physical location of the logical segments on the
replication target is entirely up to the replication target. Replication
is done logically, not physically.

## Segment Expansion - Virtual Segments

A replication target may wish to retain historical data that the
replication source has not, or destroy historical data that the
replication source intends to retain. This can result in the replication
target running out of room in a logical segment, and can also complicate
recovery operations if the replication source needs to self-heal a
corrupted segment. This presents a problem because all filesystem
accesses and all replication and recovery activities are segment-centric.

To deal with this situation a logical segment can be turned into a
virtual segment. A virtual segment is made up of multiple logical
segments, all with the same identifier plus additional information
in the segment distinguishing the pieces from each other. Each logical
segment is maintained independently, but API accesses check both
(or all N such segments) when doing lookups, and write to whichever
one has free space.

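A sketch of virtual segment access, assuming hypothetical per-piece
helpers (lseg_lookup, lseg_has_room, lseg_append) and an assumed
fixed bound on the number of pieces:

```c
#include <stddef.h>
#include <stdint.h>

struct logical_seg;             /* one piece of the virtual segment */

struct fs_record *lseg_lookup(struct logical_seg *ls, uint64_t obj_id,
        int64_t key);
int lseg_has_room(struct logical_seg *ls, size_t bytes);
int lseg_append(struct logical_seg *ls, const struct fs_record *rec,
        const void *data, size_t bytes);

struct vseg {
        int npieces;
        struct logical_seg *piece[8];   /* all share one segment id */
};

/* Lookups must consult every piece of the virtual segment. */
static struct fs_record *
vseg_lookup(struct vseg *vs, uint64_t obj_id, int64_t key)
{
        int i;

        for (i = 0; i < vs->npieces; ++i) {
                struct fs_record *rec;

                rec = lseg_lookup(vs->piece[i], obj_id, key);
                if (rec != NULL)
                        return (rec);
        }
        return (NULL);
}

/* Writes go to whichever piece has free space. */
static int
vseg_write(struct vseg *vs, const struct fs_record *rec,
        const void *data, size_t bytes)
{
        int i;

        for (i = 0; i < vs->npieces; ++i) {
                if (lseg_has_room(vs->piece[i], bytes))
                        return (lseg_append(vs->piece[i], rec, data, bytes));
        }
        return (-1);            /* all pieces full */
}
```
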
Virtual segments are pretty standard fare on replication slaves which
are intended to archive a great deal more history than replication
masters, but are intended to be a very rare feature of replication
masters. If a virtual segment must be created on a replication master,
the filesystem does all it can to shift data off the virtual segment
in order to be able to collapse it back into a logical segment.

Virtual segmentation is not space efficient.

## Files and Directories

As has been described, a data object (which can be a file or directory
or other filesystem entity) is identified by a data object id, which
is effectively its inode number, and a 64 bit key field. The key field
is typically a base offset for a file or a sortable key for a directory
entry. Negative numbers are used for meta-data. For example, -1 will
be used for the inode data & stat information. Other negative numbers
will be used to support other meta-data such as ACLs.

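In terms of the signed key field sketched earlier, the key space
might be carved up as below; only the -1 assignment comes from the
text above, and the ACL value is a placeholder:

```c
#include <stdint.h>

#define KEY_INODE_STAT  (-1LL)  /* inode data & stat information */
#define KEY_ACL         (-2LL)  /* placeholder for ACL meta-data */

static inline int
key_is_metadata(int64_t key)
{
        return (key < 0);       /* negative keys hold meta-data */
}
```
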
Since records support variable-length data, the data storage for a file
is effectively extent-based. Minimum granularity will be something like
32 bytes.

Files are thus fairly straightforward.

From the point of view of the filesystem, each directory entry will
be a data record. The data record will basically just contain the
inode number, type, and filename. It will be variable-length insofar
as the filename is variable length.

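A directory entry's data might therefore be laid out like this sketch
(the exact layout is an assumption):

```c
#include <stdint.h>

struct dirent_data {
        uint64_t inode;         /* target inode number */
        uint8_t  type;          /* regular, directory, symlink, ... */
        uint8_t  namelen;       /* filename length in bytes */
        char     name[];        /* variable length, not NUL-terminated */
};
```
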
Most files will be representable as a single extent (or at least can
be optimized into a single extent), and so will not depend very
heavily on indexing.

Directory lookups WILL depend on the indexing of the directory records
for efficiency; otherwise every record would have to be scanned to
look up a filename. In this regard a B+Tree is probably the best
solution.

## Inode Number Uniqueness

Inode numbers are 64 bits, where the top N bits represent the segment
in which the inode was originally allocated, identifying the master
record for the data object. Inode numbers do not change even if
the master record is relocated and the original master record is
replaced with a forwarding record. The forwarding record must be
retained in order to guarantee the uniqueness of the inode number.

This also allows inode numbers to be allocated on a per-segment basis,
with 48 bits of uniqueness (roughly 280 trillion file creates &
deletes) within each segment.

The filesystem will always allocate new inode numbers using a sequence
number maintained in the segment header. When all 48 bits are used up,
however, the filesystem must check new inode numbers against existing
numbers (including any deleted historical objects).

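An allocation sketch under these rules; the sequence/wrap state (shown
here as a separate struct standing in for segment header fields) and
the inode_in_use() check are assumptions:

```c
#include <stdint.h>

#define INO_SEQ_MASK    ((1ULL << 48) - 1)

int inode_in_use(uint16_t segno, uint64_t seq);  /* hypothetical check */

struct seg_alloc {
        uint64_t inode_seq;     /* per-segment sequence number */
        int      wrapped;       /* set once all 48 bits are used up */
};

static uint64_t
inode_alloc(uint16_t segno, struct seg_alloc *sa)
{
        uint64_t seq;

        do {
                seq = sa->inode_seq++ & INO_SEQ_MASK;
                if (sa->inode_seq > INO_SEQ_MASK) {
                        sa->inode_seq &= INO_SEQ_MASK;
                        sa->wrapped = 1;
                }
                /*
                 * After a wrap, candidates must be checked against all
                 * existing numbers, including deleted historical
                 * objects whose master records are still retained.
                 */
        } while (sa->wrapped && inode_in_use(segno, seq));
        return (((uint64_t)segno << 48) | seq);
}
```
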
Inode number uniqueness over all time can no longer be guaranteed, but
inode number uniqueness over a period of time **can** still be guaranteed
by retaining the master record for deleted files for a period of time
**even after they have been destroyed**. The system operator can specify
the retention time, with the caveat that certain benchmarks might cause
the disk to fill up or become somewhat less efficient even when
historical data is purged.

It is recommended that any exported or clustered filesystems set the
inode number retention time to at least a week. Inode number uniqueness
is crucial to cluster operation.