HAMMER2 Freemap Design Notes

Overview

HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
blocks in this header are reserved for various purposes.  For example,
block #0 is reserved for a volume header.  Most of the remaining
64KB blocks in this header are reserved for use by the freemap.

The freemap only uses blocks from these reserved areas.  In order to
ensure that any of the four volume headers can be used by the mount code
(in case some are found to be corrupted), each freemap block in the
logical freemap topology will iterate through up to 8 copies whose
block numbers are taken from the reserved area.

- Four copies, one for each of the four volume headers which H2 sequences
  through on each flush.  This ensures that a mount from any of the four
  volume headers is handed a consistent freemap topology.

- One copy to ensure that recovery operations during mount do not modify
  the state of the freemap topology pointed to by older volume headers
  which are still valid.  Note that the freemap for volume headers
  indexed after the mount point being recovered may lose freemap
  consistency, so if you choose an older mount point for an RW mount,
  you have to stick with it.

- One copy for live operations.  This allows HAMMER2 to retire the
  related buffer (or for the OS to retire the buffer cache buffer)
  prior to the next flush and also allows the buffers to be flushed
  asynchronously.

- The two remaining copies add robustness to the specification.  For
  example, with appropriate feature code the filesystem can tolerate
  a limited number of bad blocks in the reserved area.

For the moment we use a simple calculation for the freemap block.  In
a later version I would like to mix the blocks up a bit so the blocks
in each set of 8 are not situated near each other.

RW Mount Restrictions

If an older volume header is explicitly selected by the mount code, any
newer (presumably corrupt, since the mount code didn't select it) volume
headers will lose freemap consistency as the freemap code rotates into
freemap blocks that might have been used by the topology pointed to by
the newer (but not selected) volume headers.  For an RW mount, this means
that if an older volume header is selected, the newer ones that were
not selected WILL be formally invalidated by the mount code and cannot
be used in a remount attempt.

During normal operation, each filesystem flush rotates to a new volume
header.  A filesystem may have up to four volume headers spread at 2GB
intervals.  Filesystems smaller than ~9GB or so will have fewer volume
headers to rotate through.

Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays),
one of which is embedded in the volume header (so only three real
meta-data levels), plus one level of leaf-data.  Unlike normal files,
which use a variable radix, the freemap topology uses a fixed radix to
simplify the algorithm and to ensure freemap locality to the blocks
under management.

Freemap blocks are allocated from the reserved area in each 2GB zone.
The leafs represent data in the zone.  Higher levels in the freemap
topology will cover more area but the physical freemap meta-data blocks
always occur prior to the area being covered.  Thus a HAMMER2 filesystem
of almost any size can be formatted and the related freemap blocks
will always exist.

Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
          by a hammer2_bmap_data[1024] array.  Each entry represents
          2MB worth of media storage x 1024 entries to represent 2GB.
          Each entry contains a 128x2 bit bitmap representing 16KB
          of storage in 2 bits (128 x 16KB = 2MB).

Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
          (this conveniently eats one 512-byte 'sector' of the 64KB
          volume header).
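
As a quick sanity check, the per-level coverage follows directly from the
radix values.  The constants below are illustrative only; the names are
not the identifiers used in hammer2_disk.h:

    /*
     * Illustrative constants only.  Level 1 is the leaf level: 1024
     * entries x 2MB per entry = 2GB per 64KB leaf block.  Each meta-data
     * level above it multiplies coverage by 1024 (radix 10); level 5
     * multiplies by 8 (radix 3) for the full 2^64 address space.
     */
    #define FREEMAP_LEVEL1_RADIX    31  /* 10 + 21: 2GB per leaf block    */
    #define FREEMAP_LEVEL2_RADIX    41  /* + 10:    2TB per 64KB blockmap */
    #define FREEMAP_LEVEL3_RADIX    51  /* + 10:    2PB per 64KB blockmap */
    #define FREEMAP_LEVEL4_RADIX    61  /* + 10:    2EB per 64KB blockmap */
    #define FREEMAP_LEVEL5_RADIX    64  /* + 3:     16EB via blockref x 8 */

    _Static_assert((1ULL << FREEMAP_LEVEL1_RADIX) == 1024ULL * 2 * 1024 * 1024,
                   "a level 1 leaf block covers 2GB");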

Each level is assigned reserved blocks in the 4MB header per 2GB zone.
Since we use block 0 for the volume header, the first freemap reserved
block in the zone begins at block 1.

Freemap copy #0:
    Level 1 uses block 1 (this is the leaf block)
    Level 2 uses block 2
    Level 3 uses block 3
    Level 4 uses block 4

Freemap copy #1:
    Level 1 uses block 5 (this is the leaf block)
    Level 2 uses block 6
    Level 3 uses block 7
    Level 4 uses block 8

... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
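
The simple reserved-block calculation referred to above can be sketched
as follows.  This is an illustration of the layout described here, not
the actual hammer2 routine:

    /*
     * Sketch: compute the reserved media block used by freemap copy
     * 'copy' (0-7) for level 'level' (1-4) covering byte offset 'off'.
     * Block 0 of each 2GB zone is the volume header; copy N uses
     * blocks (N*4+1) through (N*4+4).  Hypothetical helper.
     */
    #include <stdint.h>

    #define ZONE_BYTES      (2ULL * 1024 * 1024 * 1024)   /* 2GB zone    */
    #define RESBLK_BYTES    (64 * 1024)                   /* 64KB blocks */

    static uint64_t
    freemap_reserved_block(uint64_t off, int copy, int level)
    {
        uint64_t zone_base = off & ~(ZONE_BYTES - 1);   /* start of zone */
        int blkno = copy * 4 + level;                   /* 1..32         */

        return (zone_base + (uint64_t)blkno * RESBLK_BYTES);
    }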

Flushing

The freemap does not have to be flushed by fsync/sync, but should probably
be flushed at least once a minute by the normal filesystem sync.  The
reason it does not have to be flushed every time is that freemap recovery
is executed on-mount and will use the last fully flushed freemap TID
stored in the volume header to do an incremental meta-data scan of the
H2 filesystem between that TID and the last flushed TID.  Any in-use
blocks found by that scan which are not already marked allocated in the
freemap will be marked allocated.  Simple as that.
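
The recovery idea can be sketched as a TID-filtered walk of the meta-data.
Every structure and helper below is a hypothetical stand-in; the real
logic lives in the hammer2 kernel code:

    /*
     * Sketch of incremental recovery.  Sub-trees whose highest
     * modification TID is <= the last flushed freemap TID are already
     * accounted for and can be skipped.
     */
    #include <stdint.h>

    struct bref_sketch {
        uint64_t           subtree_tid;   /* highest TID in this sub-tree */
        uint64_t           data_off;      /* media offset of the block    */
        uint32_t           bytes;         /* size of the block            */
        struct bref_sketch *children;     /* sub-blockrefs, if indirect   */
        int                nchildren;
    };

    /* hypothetical helpers provided by the freemap layer */
    int  freemap_is_allocated(uint64_t data_off, uint32_t bytes);
    void freemap_mark_allocated(uint64_t data_off, uint32_t bytes);

    static void
    freemap_recover(struct bref_sketch *bref, uint64_t freemap_tid)
    {
        int i;

        if (bref->subtree_tid <= freemap_tid)   /* unchanged since the */
            return;                             /* last freemap flush  */
        if (!freemap_is_allocated(bref->data_off, bref->bytes))
            freemap_mark_allocated(bref->data_off, bref->bytes);
        for (i = 0; i < bref->nchildren; ++i)
            freemap_recover(&bref->children[i], freemap_tid);
    }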

Freemap Granularity

The freemap granularity is 16KB (radix of 14) but the minimum
allocation radix is 1KB (radix of 10) (and can be in multiples of
1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
data, so tiny files eat exactly 1KB of media storage inclusive of the
inode.

The freemap keeps track of partial allocations in-memory but not
on-media, so even a normal umount will cause partially allocated
blocks to appear fully allocated until some later date when the
bulk scan code defragments it.

Block Selection

Block selection is localized to be near the inode's (or nearby data)
blockref.  The algorithmic complexity of determining locality is not
defined here at the moment.

Freemap Leaf Substructure

* linear - Linear sub-granular allocation offset.  Allows ~1KB granular
           linear allocations.

* class  - Allocation clustering class ((type << 8) | radix).

* avail  - Available space in bytes, currently only used by layer 1 leaf.
           Used as an allocation clustering aid.

* bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
           at 2 bits per chunk.  The filesystem allocation granularity
           can be smaller (currently ~1KB minimum), and the live
           filesystem caches iterations when allocating multiple chunks.
           However, on remount any partial allocations out of a 64KB
           allocation block MAY cause the entire 64KB to be considered
           allocated.  Fragmented space can potentially be reclaimed
           and/or relocated by the bulk block free scan.
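
Taken together, a leaf entry looks roughly like the sketch below.  The
layout is simplified; the real hammer2_bmap_data definition in
hammer2_disk.h includes additional reserved fields:

    /*
     * Simplified sketch of one freemap leaf entry (one per ~2MB of
     * media).  Field widths are illustrative only.
     */
    #include <stdint.h>

    struct bmap_data_sketch {
        int32_t     linear;     /* sub-16KB linear allocation offset     */
        uint16_t    class_;     /* (type << 8) | radix clustering class  */
        uint32_t    avail;      /* bytes available in this ~2MB segment  */
        uint32_t    bitmap[8];  /* 128 chunks x 2 bits, 16KB per chunk   */
    };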

The 2-bit bitmap fields are assigned as follows:

    00  FREE
    01  POSSIBLY FREE (type 1)
    10  POSSIBLY FREE (type 2)
    11  ALLOCATED
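
For illustration, reading and updating the 2-bit state of chunk n (0-127)
in a leaf could be done as below.  The helper names and the bit ordering
within each word are assumptions made for the sketch:

    /*
     * Sketch of 2-bit state access within a leaf bitmap of eight
     * 32-bit words (16 chunks of 16KB per word).
     */
    #include <stdint.h>

    #define CHUNK_FREE      0   /* 00 */
    #define CHUNK_MAYFREE1  1   /* 01 */
    #define CHUNK_MAYFREE2  2   /* 10 */
    #define CHUNK_ALLOCATED 3   /* 11 */

    static inline int
    chunk_get(const uint32_t bitmap[8], int n)      /* n = 0..127 */
    {
        return ((bitmap[n >> 4] >> ((n & 15) * 2)) & 3);
    }

    static inline void
    chunk_set(uint32_t bitmap[8], int n, int state)
    {
        uint32_t *wp = &bitmap[n >> 4];
        int shift = (n & 15) * 2;

        *wp = (*wp & ~(3U << shift)) | ((uint32_t)state << shift);
    }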

Freemap Metadata Substructure
(Levels 2, 3, 4, and 5)

Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
some of the check area (a 24-byte area) for freemap-specific meta-data.
We reserve a few fields to store information which allows the block
allocator to do its work more efficiently.

* bigmask - A mask of radixes available for allocation under this
            blockref.  Typically initialized to -1.

* avail   - Total available space in bytes.
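
In other words, the 24-byte check area of a freemap blockref is
reinterpreted roughly as follows.  This is a sketch; the exact on-media
layout is defined by hammer2_disk.h:

    /*
     * Sketch of the freemap-specific use of a blockref's 24-byte
     * check area at levels 2-5.  Layout and padding are illustrative.
     */
    #include <stdint.h>

    struct freemap_check_sketch {
        uint32_t    bigmask;    /* radixes allocatable under this blockref */
        uint32_t    reserved;
        uint64_t    avail;      /* total available space in bytes          */
        uint8_t     unused[8];  /* remainder of the 24-byte check area     */
    };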

The freemap allocator uses a cylinder-group-like abstraction using
the localized allocation concept first implemented by UFS.  In HAMMER2
there is no such thing as a real cylinder group, nor are there specific
reserved areas for inodes vs data, but we do the next best thing by
roughly typing leafs (each leaf representing ~2MB) to hopefully allow
the drive to employ its zone-cache to make both stat-only and tar-style
bulk accesses efficient (in addition to normal file accesses).

Levels 2, 3, and 4 each contain a blockmap[1024] array (64KB total),
supplying 10 bits of address space per level.  Level 5 is a blockmap[8]
stored in the volume header supplying 3 bits of address space.
(Level 1 supplies 10 + 21 bits of address space).

The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
effectively fixed at multiples of ~2MB or so.

Initial Conditions

The freemap is a multi-indirect block structure but there is no real
reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
leaves the associated top-level indirect blocks empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then leaves
it to the live filesystem to initialize the freemap as more space gets
allocated.

The freemap does NOT use a fixed 5-level radix tree.  It uses the same
blockmap algorithm used for file blocks but restricts any recursion to
specific radix values.  This means that small filesystems will have much
smaller freemap depths.  2 layers (and not counting the volume header as
a layer) gets us 16GB, 3 layers gets us 16TB.
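
As a rough illustration of how coverage scales with depth, counting only
the intermediate 64KB blockmap levels between the volume header's 8
blockrefs and the layer 1 leaves (a sketch, not hammer2 code):

    /*
     * Sketch: total bytes addressable when the volume header's 8
     * blockrefs sit 'levels' intermediate blockmap levels above the
     * layer 1 leaves.  0 levels -> 8 x 2GB = 16GB, 1 -> 16TB, 2 -> 16PB.
     * (3 intermediate levels would be the full 2^64 and overflows.)
     */
    #include <stdint.h>

    static uint64_t
    freemap_coverage(int levels)
    {
        uint64_t bytes = 8ULL * 2 * 1024 * 1024 * 1024; /* 8 leaf refs */

        while (levels-- > 0)
            bytes *= 1024;      /* each blockmap level: radix 10 */
        return (bytes);
    }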

How blocks are allocated and freed

The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
contains a linear allocation offset that can keep track of sub-16KB
allocations with certain restrictions.  More random sub-16KB allocations
are tracked in-memory, but will be lost (assumed to be a full 16KB) if
a crash occurs.
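
A sketch of the linear sub-16KB path is shown below.  The names are
illustrative; the real allocator also deals with clustering classes and
the in-memory cache mentioned above:

    /*
     * Sketch of the linear sub-16KB allocation path.  'linear' tracks
     * the next free byte offset within the leaf's ~2MB segment; small
     * allocations are packed into the 16KB chunk it currently points at.
     */
    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK_BYTES 16384           /* freemap bitmap granularity */

    struct leaf_sketch {
        int32_t     linear;             /* next sub-chunk offset      */
        uint32_t    bitmap[8];          /* 2 bits per 16KB chunk      */
    };

    static int
    leaf_alloc_linear(struct leaf_sketch *leaf, size_t bytes, uint32_t *off_out)
    {
        uint32_t in_chunk = (uint32_t)leaf->linear & (CHUNK_BYTES - 1);

        bytes = (bytes + 1023) & ~(size_t)1023; /* 1KB allocation granularity */

        /*
         * Only pack into a chunk that has already been claimed (11) and
         * still has room; otherwise the caller claims a fresh 16KB chunk
         * from the bitmap and resets 'linear'.
         */
        if (in_chunk == 0 || in_chunk + bytes > CHUNK_BYTES)
            return (-1);

        *off_out = (uint32_t)leaf->linear;
        leaf->linear += (int32_t)bytes;
        return (0);
    }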

NOTE!  All operations on the freemap occur on the current live version
of the freemap, including bulkfree operations.

Blocks are allocated by transitioning the 2-bit pattern in the leaf
to 11.  That is, (00, 01, 10) -> (11).  This handles races between the
live allocator and the asynchronous bulkfree code.  A live allocation
which occurs while the asynchronous bulkfree process is running will
operate race-free by transitioning the (01) and (10) states back
to (11), which prevents bulkfree from later marking those blocks
FREE (00).
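
Expressed as code, the allocation-side transition is unconditional while
the bulkfree-side transition re-checks the state first.  This sketch
reuses the hypothetical chunk_get()/chunk_set() helpers from the leaf
substructure section above:

    /*
     * Sketch of the race-free state transitions.  Any prior state
     * (00, 01, 10) becomes 11 on allocation.  Bulkfree only finalizes
     * FREE (00) if the chunk is still in a POSSIBLY FREE state, so a
     * concurrent allocation that flipped it back to 11 wins.
     */
    static void
    chunk_allocate(uint32_t bitmap[8], int n)
    {
        chunk_set(bitmap, n, CHUNK_ALLOCATED);          /* (00,01,10) -> 11 */
    }

    static void
    bulkfree_finalize(uint32_t bitmap[8], int n)
    {
        int st = chunk_get(bitmap, n);

        if (st == CHUNK_MAYFREE1 || st == CHUNK_MAYFREE2)
            chunk_set(bitmap, n, CHUNK_FREE);           /* (01,10) -> 00 */
        /* if it was re-allocated (11) in the meantime, leave it alone */
    }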

Blocks can be freed several ways, but all block freeing operations
require at least two passes before the related blocks can actually be
reused.

Method #1 - Removal in the filesystem marks the related freemap bitmap
            POSSIBLY FREE (either 01 or 10).  The asynchronous bulkfree
            process later determines that the block is actually free and
            transitions it to FREE (00), or moves it back to
            ALLOCATED (11).

            This works whether the blocks can actually be freed or not,
            so we don't care if the related blocks are part of some other
            snapshot or not.  bulkfree will figure it out.

Method #2 - Removal in the filesystem ignores the freemap.  The freemap
            blocks are left alone (typically ALLOCATED (11)).

            In this situation bulkfree must make extra passes to determine
            if blocks are possibly free, then transition the leaf bitmap
            entries to POSSIBLY FREE (01 or 10).  bulkfree cannot directly
            transition the entries to FREE (00) without another pass.

            However, this method has numerous advantages including making
            snapshot manipulation (including deletions) instantaneous
            and allowing whole subtrees and/or large-files to be rm -rf'd
            with only a single disk write to update the inode in the
            best case.

Method #3 - Brute force.  *ALL* freemap bitmap entries are marked
            POSSIBLY FREE and bulkfree then must do multiple passes
            (particularly in order to ensure that its memory use remains
            bounded) to again transition all the freemap bitmap entries
            to either FREE (00) or ALLOCATED (11).

            This method can be faster than #2 but wastes a considerable
            amount of write-bandwidth (and SSD wear if the target drive
            is an SSD).

In all cases the bulkfree code must make a final pass on the filesystem
to do the final transition of POSSIBLY FREE blocks to FREE (00) or
ALLOCATED (11).  Again, races on the FREE (00) transition are handled by
observing whether the bitmap entry was moved back to ALLOCATED (11) by
the live system while bulkfree ran asynchronously, and not transitioning
the entry to FREE (00) in that situation.

All bulkfree passes are done on meta-data.  Actual data blocks do not
need to be read unless the media is being verified.  H2 uses method #2
by default and efficiency depends on how much RAM the system has to
cache scan information.  That said, the bulkfree process is not only
incremental but it is also interruptible and restartable.  It does not
interfere with live operations other than using disk bandwidth, so
there are several ways to run it, including in the background.

The biggest issue is that *NO* space can be freed up by the live
filesystem without the bulkfree process unless we optimize the case
where data is created and deleted from within a single snapshot.
This is made more difficult by the fact that each flush represents
a fine-grained snapshot (up to four, representing the four volume
headers the flush iterates through).

Snapshots and Replicated Topologies

The bulkfree code maintains information in-memory to the best of its
ability for a multitude of reasons, including attempting to detect
snapshot recursions down block chains which have already been scanned
via some other snapshot.  Without this, a large number of snapshots
can cause a huge multiplication of disk I/O reads (but not writes) during
the topology scan.

Use of Generic indirect-block API

I decided to use the same indirect-block allocation model for the
freemap that normal files use, with a few special cases added to force
specific radix values and to 'allocate' the freemap-related blocks
and indirect blocks via a reserved-block calculation and (obviously)
not via a recursive call to the allocator.

The Freemap is defined above as a fixed 5-level scheme (levels 1-5),
but in actual operation the radix tree can be shortcut just as it
is with normal files.  However, unlike normal files, shortcuts will
be forced to use specific radix values in order to guarantee that
reserved block numbers can be trivially calculated.  As the freemap
becomes more fleshed out the tree on-media will look more and more like
the actual specification.

One advantage of doing things this way is that smaller filesystems
won't actually use a 5-level scheme.  A 16GB filesystem can use 8
blockrefs in the volume header which point directly to layer 1 leaf
blocks.  A 16TB filesystem can be managed with only three levels
(layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in
the volume header).  And so forth.

At the moment we have no plans to return any of the unused 4MB zone
header space (per 2GB of storage) back to the filesystem for general use.
There are lots of things we may want to use the reserved areas for in
the future.

Emergency Deletions

All filesystem modifications including deletions must allocate blocks
in order to update the main topology all the way to the root.  H2 will
reserve roughly 5% of the available blocks in the filesystem for
deletions in order to allow a system operator to recover from a
filesystem full condition.

Despite this, situations may come up, due to having snapshots, where
deletions eat up available blocks but fail to create freeable space.
When this situation occurs the system operator may be forced to issue
emergency in-place deletions which replace existing blocks rather than
allocate new blocks.  For the moment the spec for dealing with these
situations remains incomplete.