Commit | Line | Data |
---|---|---|
1a7cfe5a | 1 | |
93f3933a | 2 | HAMMER2 Freemap Design Notes |
1a7cfe5a MD |
3 | |
4 | Overview | |
5 | ||
6 | HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone | |
a71db85d MD |
7 | contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The |
8 | blocks in this header are reserved for various purposes. For example, | |
3148f677 MD |
9 | block #0 is reserved for a volume header in the first four zones. Most |
10 | of the remaining 64KB blocks in this header are reserved for use by the | |
11 | freemap. | |
a71db85d MD |
12 | |
13 | The freemap only uses blocks from these reserved areas. In order to | |
bca9f8e6 | 14 | ensure that any of the four volume headers can be used by the mount code |
a71db85d | 15 | (in case some are found to be corrupted), each freemap block in the |
bca9f8e6 MD |
16 | logical freemap topology will iterate through up to 8 copies whos |
17 | block numbers are taken the reserved area. | |
18 | ||
19 | - Four copies, one for each of the four volume headers which H2 sequences | |
20 | through on each flush. This ensures that a mount from any of the four | |
21 | volume headers is handed a consistent freemap topology. | |
22 | ||
23 | - One copy to ensure that recovery operations during mount do not modify | |
24 | the state of the freemap topology pointed to by older volume headers | |
25 | which are still valid. Note that the freemap for volume headers | |
26 | indexed after the mount point being recovered may lose freemap | |
27 | consistency, so if you choose an older mount point for a RW mount, | |
28 | you have to stick with it. | |
29 | ||
30 | - One copy for live operations. This allows HAMMER2 to retire the | |
3148f677 MD |
31 | related buffer asynchronously in the background (or for the OS to |
32 | retire the buffer cache buffer on its own) prior to the formal | |
33 | flush. The later formal flush then has less work to do. | |
bca9f8e6 MD |
34 | |
35 | - The two remaining copies add robustness to the specification. For | |
3148f677 MD |
36 | example, with appropriate feature code added the filesystem can |
37 | tolerate a limited number of bad blocks in the reserved area. | |
bca9f8e6 MD |
38 | |
39 | For the moment we use a simple calculation for the freemap block. In | |
40 | a later version I would like to mix the blocks up a bit so the blocks | |
41 | in each set of 8 are not situated near each other. | |
42 | ||
3148f677 | 43 | RW Mount Restrictions |
bca9f8e6 MD |
44 | |
45 | If an older volume header is explicitly selected by the mount code, any | |
46 | newer (presumably corrupt since the mount code didn't select it) volume | |
47 | headers will lose freemap consistency as the freemap code rotates into | |
48 | freemap blocks that might have been used by the topology pointed to by | |
49 | the newer (but not selected) volume headers. For a RW mount, this means | |
50 | that if an older volume header is selected, the newer ones that were | |
51 | not selected WILL be formally invalidated by the mount code and cannot | |
52 | be used in a remount attempt. | |
53 | ||
54 | During normal operation, each filesystem flush rotates to a new volume | |
55 | header. A filesystem may have up to four volume headers spread at 2GB | |
56 | intervals. Filesystems smaller than ~9GB or so will have fewer volume | |
57 | headers to rotate through. | |
93f3933a MD |
58 | |
59 | Freemap Topology | |
60 | ||
61 | The freemap topology contains 4 levels of meta-data (blockref arrays), | |
62 | one of which is embedded in the volume header (so only three real | |
a71db85d MD |
63 | meta-data levels), plus one level of leaf-data. Unlike normal files, |
64 | which use a variable-radix, the freemap topology uses a fixed radix to | |
65 | simplify the algorithm and to ensure freemap locality to the blocks | |
66 | under management. | |
93f3933a | 67 | |
bca9f8e6 MD |
68 | Freemap blocks are allocated from the reserved area in each 2GB zone. |
69 | The leafs represent data in the zone. Higher levels in the freemap | |
70 | topology will cover more area but the physical freemap meta-data blocks | |
71 | always occur prior to the area being covered. Thus a HAMMER2 filesystem | |
72 | of almost any size can be formatted and the related freemap blocks | |
73 | will always exist. | |
74 | ||
e07becf8 MD |
75 | Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented |
76 | by a hammer2_bmap_data[1024] array. Each entry represents | |
77 | 2MB worth of media storage x 1024 entries to represent 2GB. | |
a71db85d | 78 | Each entry contains a 128x2 bit bitmap representing 16KB |
e07becf8 | 79 | of storage in 2 bits (128 x 16KB = 2MB). |
93f3933a MD |
80 | |
81 | Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry) | |
82 | Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry) | |
83 | Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry) | |
84 | Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64) | |
85 | (this conveniently eats one 512-byte 'sector' of the 64KB | |
86 | volume header). | |
1a7cfe5a | 87 | |
93f3933a | 88 | Each level is assign reserved blocks in the 4MB header per 2GB zone. |
bca9f8e6 MD |
89 | Since we use block 0 for the volume header, the first freemap reserved |
90 | block in the zone begins at block 1. | |
91 | ||
92 | Freemap copy #0: | |
93 | Level 1 uses block 1 (this is the leaf block) | |
94 | Level 2 uses block 2 | |
95 | Level 3 uses block 3 | |
96 | Level 4 uses block 4 | |
97 | ||
98 | Freemap copy #1: | |
99 | Level 1 uses block 5 (this is the leaf block) | |
100 | Level 2 uses block 6 | |
101 | Level 3 uses block 7 | |
102 | Level 4 uses block 8 | |
103 | ||
104 | ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32. | |
1a7cfe5a | 105 | |
a71db85d MD |
106 | Flushing |
107 | ||
108 | The freemap does not have to be flushed by fsync/sync, but should probably | |
109 | be flushed at least once a minute by the normal filesystem sync. The | |
3148f677 | 110 | reason it does not have to be flushed with fsync is that freemap recovery |
bca9f8e6 MD |
111 | is executed on-mount and will use the last fully flushed freemap TID |
112 | stored in the volume header to do an incremental meta-data scan of the | |
113 | H2 filesystem between that TID and the last flushed TID. All blocks not | |
114 | found to have been marked allocated will be marked allocated. Simple as | |
3148f677 MD |
115 | that. Since the scan is incremental, this typically costs very little |
116 | time. | |
a71db85d MD |
117 | |
118 | Freemap Granularity | |
119 | ||
120 | The freemap granularity is 16KB (radix of 14) but the minimum | |
121 | allocation radix is 1KB (radix of 10) (and can be in multiples of | |
122 | 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct | |
123 | data, so tiny files eat exactly 1KB of media storage inclusive of the | |
124 | inode. | |
125 | ||
126 | The freemap keeps track of partial allocations in-memory but not | |
127 | on-media, so even a normal umount will cause partially allocated | |
128 | blocks to appear fully allocated until some later date when the | |
129 | bulk scan code defragments it. | |
130 | ||
bca9f8e6 | 131 | Block Selection |
a71db85d MD |
132 | |
133 | Block selection is localized to be near the inode's (or nearby data) | |
134 | blockref. The algorithmic complexity of determining locality is not | |
135 | defined here atm. | |
1a7cfe5a | 136 | |
bca9f8e6 MD |
137 | Freemap Leaf Substructure |
138 | ||
139 | * linear - Linear sub-granular allocation offset. Allows ~1KB granular | |
140 | linear allocations. | |
141 | ||
142 | * class - Allocation clustering class ((type << 8) | radix). | |
1a7cfe5a | 143 | |
bca9f8e6 MD |
144 | * avail - Available space in bytes, currently only used by layer 1 leaf. |
145 | Used as an allocation clustering aid. | |
1a7cfe5a | 146 | |
bca9f8e6 | 147 | * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks |
93f3933a MD |
148 | at 2 bits per chunk. The filesystem allocation granularity |
149 | can be smaller (currently ~1KB minimum), and the live | |
bca9f8e6 MD |
150 | filesystem caches iterations when allocating multiple chunks. |
151 | However, on remount any partial allocations out of a 64KB | |
152 | allocation block MAY cause the entire 64KB to be considered | |
153 | allocated. Fragmented space can potentially be reclaimed | |
154 | and/or relocated by the bulk block free scan. | |
1a7cfe5a | 155 | |
93f3933a | 156 | The 2-bit bitmap fields are assigned as follows: |
1a7cfe5a | 157 | |
93f3933a | 158 | 00 FREE |
bca9f8e6 MD |
159 | 01 POSSIBLY FREE (type 1) |
160 | 10 POSSIBLY FREE (type 2) | |
93f3933a | 161 | 11 ALLOCATED |
1a7cfe5a | 162 | |
bca9f8e6 MD |
163 | Freemap Metadata Substructure |
164 | (Levels 2, 3, 4, and 5) | |
1a7cfe5a | 165 | |
bca9f8e6 MD |
166 | Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal |
167 | some of the check area (a 24-byte area) for freemap-specific meta-data. | |
168 | We reserve a few fields to store information which allows the block | |
169 | allocator to do its work more efficiently. | |
1a7cfe5a | 170 | |
93f3933a MD |
171 | * bigmask - A mask of radixes available for allocation under this |
172 | blockref. Typically initialized to -1. | |
1a7cfe5a MD |
173 | |
174 | * avail - Total available space in bytes. | |
175 | ||
176 | The freemap allocator uses a cylinder-group-like abstraction using | |
177 | the localized allocation concept first implemented by UFS. In HAMMER2 | |
bca9f8e6 MD |
178 | there is no such thing as a real cylinder group, nor are there specific |
179 | reserved areas for inodes vs data, but we do the next best thing by | |
180 | roughly typing leafs (each leaf representing ~2MB) to hopefully allow | |
181 | the drive to employ its zone-cache to make both stat-only and tar-style | |
182 | bulk accesses efficient (in addition to normal file accesses). | |
1a7cfe5a | 183 | |
a71db85d | 184 | Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total), |
bca9f8e6 MD |
185 | supplying 10 bits of address space each. Level 5 is a blockmap[8] |
186 | stored in the volume header supplying 3 bits of address space. | |
187 | (level 0 supplies 10 + 21 bits of address space). | |
a71db85d MD |
188 | |
189 | The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus | |
190 | effectively fixed at multiples of ~2MB or so. | |
1a7cfe5a | 191 | |
93f3933a | 192 | Initial Conditions |
1a7cfe5a | 193 | |
3148f677 MD |
194 | newfs_hammer2 does not need to format the freemap. Instead, newfs_hammer2 |
195 | simply leaves the associated top-level indirect blocks empty and uses | |
196 | the (voldata->allocator_beg) field to allocate space linearly, then | |
197 | leaves it to the live filesystem to initialize the freemap as more space | |
198 | gets allocated. | |
93f3933a | 199 | |
a71db85d MD |
200 | The freemap does NOT use a fixed 5-level radix tree. It uses the same |
201 | blockmap algorithm used for file blocks but restricts any recursion to | |
202 | specific radix values. This means that small filesystems will have much | |
203 | smaller freemap depths. 2 layers (and not counting the volume header as | |
204 | a layer) gets us 16GB, 3 layers gets us 16TB. | |
205 | ||
206 | How blocks are allocated and freed | |
93f3933a | 207 | |
bca9f8e6 MD |
208 | The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also |
209 | contains a linear allocation offset that can keep track of sub-16KB | |
210 | allocations with certain restrictions. More random sub-16KB allocations | |
211 | are tracked in-memory, but will be lost (assumed to be a full 16KB) if | |
3148f677 MD |
212 | a crash occurs. Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10, |
213 | or 11. | |
bca9f8e6 MD |
214 | |
215 | NOTE! All operations on the freemap occur on the current live version | |
216 | of the freemap, including bulkfree operations. | |
217 | ||
218 | Blocks are allocated by transitioning the 2-bit pattern in the leaf | |
3148f677 MD |
219 | to 11. That is, (00, 01, 10) -> (11). |
220 | ||
221 | The primary mechanism used to free a block is via the asynchronous | |
222 | bulkfree scan. This scans all filesystem meta-data in two major passes | |
223 | (and potentially multiple sub-passes). | |
224 | ||
225 | Pass#1 - The first pass figures which blocks might be freeable. The | |
226 | most recently flushed meta-data topology (including all four | |
227 | volume headers and all snapshots) is scanned and an in-memory | |
228 | copy of the FreeMap is built from scratch. Multiple sub-scans | |
229 | might be required to break the larger scan up into more easily | |
230 | digested pieces based on the amount of memory available to hold | |
231 | the temporary freemap. | |
232 | ||
233 | Any allocated blocks in the live freemap are then transitioned | |
234 | from (11) to either (10) or (01) if after the scan they are | |
235 | found to not be allocated. | |
236 | ||
237 | The blocks are still assumed to be allocated at this time and | |
238 | any new allocations will transition them back to (11). | |
239 | ||
240 | Pass#2 - The second pass is required to deal with races against the | |
241 | live filesystem while the freemap scan was running. It also | |
242 | allows the freemap scans to run asynchronously from any flush, | |
243 | improving concurrency. However, at least one synchronous flush | |
244 | is required between Pass#1 and Pass#2. | |
245 | ||
246 | The second pass is a duplicate of the first pass. The meta-data | |
247 | topology is scanned and a freemap is built in-memory and then | |
248 | compared against the live freemap. Instead transitioning from | |
249 | (11)->(10)/(01) this pass transitions from (10)/(01) to (00). | |
250 | ||
251 | If a block that it thinks is free is (11), no transition occurs | |
252 | because this could be due to a race against the live filesystem. | |
253 | ||
254 | This pass will incidentally transition (10)/(01) back to (11) | |
255 | if the block was found not to be allocated, but it is perfectly | |
256 | acceptable for the block to remain in a (10)/(01) state after | |
257 | completion. | |
258 | ||
259 | NOTE! The meta-data scanning passes must also explicitly scan blocks | |
260 | associated with any open files, since these might represent | |
261 | open-but-deleted files. These blocks must not be accidently freed | |
262 | while the system is still using the file. Again, since this is | |
263 | done in two passes it does not have to be synchronized against | |
264 | frontend operations. So in total: | |
265 | ||
266 | * Topology under all four volume headers. This includes all | |
267 | PFSs and snapshots. | |
268 | ||
269 | * Topology under all open hammer2 files. | |
270 | ||
271 | The Bulk-free operation is expensive but uses a bounded amount of ram. | |
272 | The ADVANTAGE of this mechanism is that deletions in the live filesystem | |
273 | do not have to clean up the freemap and thus do not have to recurse | |
274 | the topology during the deletion. In fact, a 'rm -rf' equivalent of a | |
275 | directory topology can be handled simply by blowing away the top-level | |
276 | directory inode. This is instantanious and thus can be dangerous but | |
277 | you always have your snapshots to fall-back on. | |
278 | ||
279 | The DISADVANTAGE is that all meta-data must be scanned. Twice. This | |
280 | can be mitigated by using swapcache(8) to cache the meta-data on a SSD. | |
281 | This is also mitigated by the fact that you can do the bulkfree scan less | |
282 | often on very large filesystems which presumably have a lot of freespace | |
283 | (so the interval is not as big an issue). In a sense the operation does | |
284 | scale in that it takes longer on larger filesystems but also can be run | |
285 | less often. | |
bca9f8e6 MD |
286 | |
287 | The biggest issue is that *NO* space can be freed up by the live | |
288 | filesystem without the bulkfree process unless we optimize the case | |
289 | where data is created and deleted from within a single snapshot. | |
290 | This is made more difficult by the fact that each flush represents | |
291 | a fine-grained snapshot (up to four, representing the four volume | |
292 | headers the flush iterates through). | |
293 | ||
294 | Snapshots and Replicated Topologies | |
295 | ||
296 | The bulkfree code maintains information in-memory to the best of its | |
297 | ability for a multitude of reasons, including attempting to detect | |
298 | snapshot recursions down block chains which have already been scanned | |
299 | via some other snapshot. Without this, a large number of snapshots | |
300 | can cause a huge multiplication of disk I/O reads (but not writes) during | |
301 | the topology scan. | |
1a7cfe5a MD |
302 | |
303 | Use of Generic indirect-block API | |
304 | ||
305 | I decided to use the same indirect-block allocation model for the | |
306 | freemap that normal files use, with a few special cases added to force | |
307 | specific radix values and to 'allocate' the freemap-related blocks | |
308 | and indirect blocks via a reserved-block calculation and (obviously) | |
309 | not via a recursive call to the allocator. | |
310 | ||
93f3933a | 311 | The Freemap is defined above as a fixed 5-level scheme (level 1-5), |
1a7cfe5a | 312 | but in actual operation the radix tree can be shortcut just as it |
bca9f8e6 MD |
313 | is with normal files. However, unlike normal files, shorcuts will |
314 | be forced to use specific radix values in order to guarantee that | |
315 | reserved block numbers can be trivially calculated. As the freemap | |
316 | becomes more fleshed out the tree on-media will look more and more like | |
317 | the actual specification. | |
1a7cfe5a MD |
318 | |
319 | One advantage of doing things this way is that smaller filesystems | |
bca9f8e6 MD |
320 | won't actually use a 5-level scheme. A 16GB filesystem can use 8 |
321 | blockrefs in the volume header which point directly to layer 1 leaf | |
322 | blocks. A 16TB filesystem can be managed with only three levels | |
323 | (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in | |
324 | the volume header). And so forth. | |
1a7cfe5a MD |
325 | |
326 | At the moment we have no plans to return any of the unused 4MB zone | |
93f3933a MD |
327 | header space (per 2GB of storage) back to the filesystem for general use. |
328 | There are lots of things we may want to use the reserved areas for in | |
329 | the future. | |
bca9f8e6 MD |
330 | |
331 | Emergency Deletions | |
332 | ||
333 | All filesystem modifications including deletions must allocate blocks | |
334 | in order to update the main topology all the way to the root. H2 will | |
335 | reserve roughly 5% of the available blocks in the filesystem for | |
336 | deletions in order to allow a system operator to recover from a | |
337 | filesystem full condition. | |
338 | ||
3148f677 MD |
339 | However, due to the snapshot capability as well as the possibility of |
340 | fragmentation, it is possible for the administrator to not delete enough | |
341 | to actually be able to free up blocks. Once the reserve is used up | |
342 | the filesystem can become unwritable. | |
343 | ||
344 | When this situation occurs the only way to recover is to update blocks | |
345 | in-place. Updating blocks in-place will destroy the data on any | |
346 | related snapshots or otherwise corrupt the snapshots. Emergency recovery | |
347 | thus recommends that all related snapshots be destroyed. You can choose | |
348 | not to do this in which case your snapshots might wind up containing | |
349 | broken links and generate CRC failure messages. | |
350 | ||
351 | For the moment the spec for dealing with these situations remains | |
352 | incomplete. |