gdb - Local mods (compile)
[dragonfly.git] / sys / vfs / hammer2 / FREEMAP
CommitLineData
1a7cfe5a 1
93f3933a 2 HAMMER2 Freemap Design Notes
1a7cfe5a
MD
3
4 Overview
5
6 HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone
a71db85d
MD
7 contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The
8 blocks in this header are reserved for various purposes. For example,
3148f677
MD
9 block #0 is reserved for a volume header in the first four zones. Most
10 of the remaining 64KB blocks in this header are reserved for use by the
11 freemap.
a71db85d
MD
12
13 The freemap only uses blocks from these reserved areas. In order to
bca9f8e6 14 ensure that any of the four volume headers can be used by the mount code
a71db85d 15 (in case some are found to be corrupted), each freemap block in the
bca9f8e6
MD
16 logical freemap topology will iterate through up to 8 copies whos
17 block numbers are taken the reserved area.
18
19 - Four copies, one for each of the four volume headers which H2 sequences
20 through on each flush. This ensures that a mount from any of the four
21 volume headers is handed a consistent freemap topology.
22
23 - One copy to ensure that recovery operations during mount do not modify
24 the state of the freemap topology pointed to by older volume headers
25 which are still valid. Note that the freemap for volume headers
26 indexed after the mount point being recovered may lose freemap
27 consistency, so if you choose an older mount point for a RW mount,
28 you have to stick with it.
29
30 - One copy for live operations. This allows HAMMER2 to retire the
3148f677
MD
31 related buffer asynchronously in the background (or for the OS to
32 retire the buffer cache buffer on its own) prior to the formal
33 flush. The later formal flush then has less work to do.
bca9f8e6
MD
34
35 - The two remaining copies add robustness to the specification. For
3148f677
MD
36 example, with appropriate feature code added the filesystem can
37 tolerate a limited number of bad blocks in the reserved area.
bca9f8e6
MD
38
39 For the moment we use a simple calculation for the freemap block. In
40 a later version I would like to mix the blocks up a bit so the blocks
41 in each set of 8 are not situated near each other.
42
3148f677 43 RW Mount Restrictions
bca9f8e6
MD
44
45 If an older volume header is explicitly selected by the mount code, any
46 newer (presumably corrupt since the mount code didn't select it) volume
47 headers will lose freemap consistency as the freemap code rotates into
48 freemap blocks that might have been used by the topology pointed to by
49 the newer (but not selected) volume headers. For a RW mount, this means
50 that if an older volume header is selected, the newer ones that were
51 not selected WILL be formally invalidated by the mount code and cannot
52 be used in a remount attempt.
53
54 During normal operation, each filesystem flush rotates to a new volume
55 header. A filesystem may have up to four volume headers spread at 2GB
56 intervals. Filesystems smaller than ~9GB or so will have fewer volume
57 headers to rotate through.
93f3933a
MD
58
59 Freemap Topology
60
61 The freemap topology contains 4 levels of meta-data (blockref arrays),
62 one of which is embedded in the volume header (so only three real
a71db85d
MD
63 meta-data levels), plus one level of leaf-data. Unlike normal files,
64 which use a variable-radix, the freemap topology uses a fixed radix to
65 simplify the algorithm and to ensure freemap locality to the blocks
66 under management.
93f3933a 67
bca9f8e6
MD
68 Freemap blocks are allocated from the reserved area in each 2GB zone.
69 The leafs represent data in the zone. Higher levels in the freemap
70 topology will cover more area but the physical freemap meta-data blocks
71 always occur prior to the area being covered. Thus a HAMMER2 filesystem
72 of almost any size can be formatted and the related freemap blocks
73 will always exist.
74
e07becf8
MD
75 Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented
76 by a hammer2_bmap_data[1024] array. Each entry represents
77 2MB worth of media storage x 1024 entries to represent 2GB.
a71db85d 78 Each entry contains a 128x2 bit bitmap representing 16KB
e07becf8 79 of storage in 2 bits (128 x 16KB = 2MB).
93f3933a
MD
80
81 Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
82 Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
83 Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
84 Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
85 (this conveniently eats one 512-byte 'sector' of the 64KB
86 volume header).
1a7cfe5a 87
93f3933a 88 Each level is assign reserved blocks in the 4MB header per 2GB zone.
bca9f8e6
MD
89 Since we use block 0 for the volume header, the first freemap reserved
90 block in the zone begins at block 1.
91
92 Freemap copy #0:
93 Level 1 uses block 1 (this is the leaf block)
94 Level 2 uses block 2
95 Level 3 uses block 3
96 Level 4 uses block 4
97
98 Freemap copy #1:
99 Level 1 uses block 5 (this is the leaf block)
100 Level 2 uses block 6
101 Level 3 uses block 7
102 Level 4 uses block 8
103
104 ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
1a7cfe5a 105
a71db85d
MD
106 Flushing
107
108 The freemap does not have to be flushed by fsync/sync, but should probably
109 be flushed at least once a minute by the normal filesystem sync. The
3148f677 110 reason it does not have to be flushed with fsync is that freemap recovery
bca9f8e6
MD
111 is executed on-mount and will use the last fully flushed freemap TID
112 stored in the volume header to do an incremental meta-data scan of the
113 H2 filesystem between that TID and the last flushed TID. All blocks not
114 found to have been marked allocated will be marked allocated. Simple as
3148f677
MD
115 that. Since the scan is incremental, this typically costs very little
116 time.
a71db85d
MD
117
118 Freemap Granularity
119
120 The freemap granularity is 16KB (radix of 14) but the minimum
121 allocation radix is 1KB (radix of 10) (and can be in multiples of
122 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct
123 data, so tiny files eat exactly 1KB of media storage inclusive of the
124 inode.
125
126 The freemap keeps track of partial allocations in-memory but not
127 on-media, so even a normal umount will cause partially allocated
128 blocks to appear fully allocated until some later date when the
129 bulk scan code defragments it.
130
bca9f8e6 131 Block Selection
a71db85d
MD
132
133 Block selection is localized to be near the inode's (or nearby data)
134 blockref. The algorithmic complexity of determining locality is not
135 defined here atm.
1a7cfe5a 136
bca9f8e6
MD
137 Freemap Leaf Substructure
138
139 * linear - Linear sub-granular allocation offset. Allows ~1KB granular
140 linear allocations.
141
142 * class - Allocation clustering class ((type << 8) | radix).
1a7cfe5a 143
bca9f8e6
MD
144 * avail - Available space in bytes, currently only used by layer 1 leaf.
145 Used as an allocation clustering aid.
1a7cfe5a 146
bca9f8e6 147 * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
93f3933a
MD
148 at 2 bits per chunk. The filesystem allocation granularity
149 can be smaller (currently ~1KB minimum), and the live
bca9f8e6
MD
150 filesystem caches iterations when allocating multiple chunks.
151 However, on remount any partial allocations out of a 64KB
152 allocation block MAY cause the entire 64KB to be considered
153 allocated. Fragmented space can potentially be reclaimed
154 and/or relocated by the bulk block free scan.
1a7cfe5a 155
93f3933a 156 The 2-bit bitmap fields are assigned as follows:
1a7cfe5a 157
93f3933a 158 00 FREE
bca9f8e6
MD
159 01 POSSIBLY FREE (type 1)
160 10 POSSIBLY FREE (type 2)
93f3933a 161 11 ALLOCATED
1a7cfe5a 162
bca9f8e6
MD
163 Freemap Metadata Substructure
164 (Levels 2, 3, 4, and 5)
1a7cfe5a 165
bca9f8e6
MD
166 Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
167 some of the check area (a 24-byte area) for freemap-specific meta-data.
168 We reserve a few fields to store information which allows the block
169 allocator to do its work more efficiently.
1a7cfe5a 170
93f3933a
MD
171 * bigmask - A mask of radixes available for allocation under this
172 blockref. Typically initialized to -1.
1a7cfe5a
MD
173
174 * avail - Total available space in bytes.
175
176 The freemap allocator uses a cylinder-group-like abstraction using
177 the localized allocation concept first implemented by UFS. In HAMMER2
bca9f8e6
MD
178 there is no such thing as a real cylinder group, nor are there specific
179 reserved areas for inodes vs data, but we do the next best thing by
180 roughly typing leafs (each leaf representing ~2MB) to hopefully allow
181 the drive to employ its zone-cache to make both stat-only and tar-style
182 bulk accesses efficient (in addition to normal file accesses).
1a7cfe5a 183
a71db85d 184 Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total),
bca9f8e6
MD
185 supplying 10 bits of address space each. Level 5 is a blockmap[8]
186 stored in the volume header supplying 3 bits of address space.
187 (level 0 supplies 10 + 21 bits of address space).
a71db85d
MD
188
189 The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus
190 effectively fixed at multiples of ~2MB or so.
1a7cfe5a 191
93f3933a 192 Initial Conditions
1a7cfe5a 193
3148f677
MD
194 newfs_hammer2 does not need to format the freemap. Instead, newfs_hammer2
195 simply leaves the associated top-level indirect blocks empty and uses
196 the (voldata->allocator_beg) field to allocate space linearly, then
197 leaves it to the live filesystem to initialize the freemap as more space
198 gets allocated.
93f3933a 199
a71db85d
MD
200 The freemap does NOT use a fixed 5-level radix tree. It uses the same
201 blockmap algorithm used for file blocks but restricts any recursion to
202 specific radix values. This means that small filesystems will have much
203 smaller freemap depths. 2 layers (and not counting the volume header as
204 a layer) gets us 16GB, 3 layers gets us 16TB.
205
206 How blocks are allocated and freed
93f3933a 207
bca9f8e6
MD
208 The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
209 contains a linear allocation offset that can keep track of sub-16KB
210 allocations with certain restrictions. More random sub-16KB allocations
211 are tracked in-memory, but will be lost (assumed to be a full 16KB) if
3148f677
MD
212 a crash occurs. Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10,
213 or 11.
bca9f8e6
MD
214
215 NOTE! All operations on the freemap occur on the current live version
216 of the freemap, including bulkfree operations.
217
218 Blocks are allocated by transitioning the 2-bit pattern in the leaf
3148f677
MD
219 to 11. That is, (00, 01, 10) -> (11).
220
221 The primary mechanism used to free a block is via the asynchronous
222 bulkfree scan. This scans all filesystem meta-data in two major passes
223 (and potentially multiple sub-passes).
224
225 Pass#1 - The first pass figures which blocks might be freeable. The
226 most recently flushed meta-data topology (including all four
227 volume headers and all snapshots) is scanned and an in-memory
228 copy of the FreeMap is built from scratch. Multiple sub-scans
229 might be required to break the larger scan up into more easily
230 digested pieces based on the amount of memory available to hold
231 the temporary freemap.
232
233 Any allocated blocks in the live freemap are then transitioned
234 from (11) to either (10) or (01) if after the scan they are
235 found to not be allocated.
236
237 The blocks are still assumed to be allocated at this time and
238 any new allocations will transition them back to (11).
239
240 Pass#2 - The second pass is required to deal with races against the
241 live filesystem while the freemap scan was running. It also
242 allows the freemap scans to run asynchronously from any flush,
243 improving concurrency. However, at least one synchronous flush
244 is required between Pass#1 and Pass#2.
245
246 The second pass is a duplicate of the first pass. The meta-data
247 topology is scanned and a freemap is built in-memory and then
248 compared against the live freemap. Instead transitioning from
249 (11)->(10)/(01) this pass transitions from (10)/(01) to (00).
250
251 If a block that it thinks is free is (11), no transition occurs
252 because this could be due to a race against the live filesystem.
253
254 This pass will incidentally transition (10)/(01) back to (11)
255 if the block was found not to be allocated, but it is perfectly
256 acceptable for the block to remain in a (10)/(01) state after
257 completion.
258
259 NOTE! The meta-data scanning passes must also explicitly scan blocks
260 associated with any open files, since these might represent
261 open-but-deleted files. These blocks must not be accidently freed
262 while the system is still using the file. Again, since this is
263 done in two passes it does not have to be synchronized against
264 frontend operations. So in total:
265
266 * Topology under all four volume headers. This includes all
267 PFSs and snapshots.
268
269 * Topology under all open hammer2 files.
270
271 The Bulk-free operation is expensive but uses a bounded amount of ram.
272 The ADVANTAGE of this mechanism is that deletions in the live filesystem
273 do not have to clean up the freemap and thus do not have to recurse
274 the topology during the deletion. In fact, a 'rm -rf' equivalent of a
275 directory topology can be handled simply by blowing away the top-level
276 directory inode. This is instantanious and thus can be dangerous but
277 you always have your snapshots to fall-back on.
278
279 The DISADVANTAGE is that all meta-data must be scanned. Twice. This
280 can be mitigated by using swapcache(8) to cache the meta-data on a SSD.
281 This is also mitigated by the fact that you can do the bulkfree scan less
282 often on very large filesystems which presumably have a lot of freespace
283 (so the interval is not as big an issue). In a sense the operation does
284 scale in that it takes longer on larger filesystems but also can be run
285 less often.
bca9f8e6
MD
286
287 The biggest issue is that *NO* space can be freed up by the live
288 filesystem without the bulkfree process unless we optimize the case
289 where data is created and deleted from within a single snapshot.
290 This is made more difficult by the fact that each flush represents
291 a fine-grained snapshot (up to four, representing the four volume
292 headers the flush iterates through).
293
294 Snapshots and Replicated Topologies
295
296 The bulkfree code maintains information in-memory to the best of its
297 ability for a multitude of reasons, including attempting to detect
298 snapshot recursions down block chains which have already been scanned
299 via some other snapshot. Without this, a large number of snapshots
300 can cause a huge multiplication of disk I/O reads (but not writes) during
301 the topology scan.
1a7cfe5a
MD
302
303 Use of Generic indirect-block API
304
305 I decided to use the same indirect-block allocation model for the
306 freemap that normal files use, with a few special cases added to force
307 specific radix values and to 'allocate' the freemap-related blocks
308 and indirect blocks via a reserved-block calculation and (obviously)
309 not via a recursive call to the allocator.
310
93f3933a 311 The Freemap is defined above as a fixed 5-level scheme (level 1-5),
1a7cfe5a 312 but in actual operation the radix tree can be shortcut just as it
bca9f8e6
MD
313 is with normal files. However, unlike normal files, shorcuts will
314 be forced to use specific radix values in order to guarantee that
315 reserved block numbers can be trivially calculated. As the freemap
316 becomes more fleshed out the tree on-media will look more and more like
317 the actual specification.
1a7cfe5a
MD
318
319 One advantage of doing things this way is that smaller filesystems
bca9f8e6
MD
320 won't actually use a 5-level scheme. A 16GB filesystem can use 8
321 blockrefs in the volume header which point directly to layer 1 leaf
322 blocks. A 16TB filesystem can be managed with only three levels
323 (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in
324 the volume header). And so forth.
1a7cfe5a
MD
325
326 At the moment we have no plans to return any of the unused 4MB zone
93f3933a
MD
327 header space (per 2GB of storage) back to the filesystem for general use.
328 There are lots of things we may want to use the reserved areas for in
329 the future.
bca9f8e6
MD
330
331 Emergency Deletions
332
333 All filesystem modifications including deletions must allocate blocks
334 in order to update the main topology all the way to the root. H2 will
335 reserve roughly 5% of the available blocks in the filesystem for
336 deletions in order to allow a system operator to recover from a
337 filesystem full condition.
338
3148f677
MD
339 However, due to the snapshot capability as well as the possibility of
340 fragmentation, it is possible for the administrator to not delete enough
341 to actually be able to free up blocks. Once the reserve is used up
342 the filesystem can become unwritable.
343
344 When this situation occurs the only way to recover is to update blocks
345 in-place. Updating blocks in-place will destroy the data on any
346 related snapshots or otherwise corrupt the snapshots. Emergency recovery
347 thus recommends that all related snapshots be destroyed. You can choose
348 not to do this in which case your snapshots might wind up containing
349 broken links and generate CRC failure messages.
350
351 For the moment the spec for dealing with these situations remains
352 incomplete.