2 HAMMER2 Freemap Design Notes
6 HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone
7 contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The
8 blocks in this header are reserved for various purposes. For example,
9 block #0 is reserved for a volume header or volume header backup. Most
10 of the 64KB blocks in this header are reserved for use by the freemap.
12 The freemap only uses blocks from these reserved areas. In order to
13 ensure that any of the four volume headers can be used for the mount
14 (in case some are found to be corrupted), each freemap block in the
15 logical freemap topology iterates through 6 different copies. The
16 freemap is a 4-layer topology (+1 3-bit layer in the volume header),
17 so 6x4 = 24 of the 64 reserved blocks are dedicated to freemap
20 Any given modification of a freemap block that crosses a flush group
21 must cycle to the next copy of the freemap block. Having 6 copies
24 - Each of the four backup volume headers points to a consistent
25 freemap topology. This eats 4 copies.
27 - That recovery operations during mount do not modify the state of the
28 freemap topology pointed to by older volume headers that are still
29 valid. This eats 1 copy.
31 - The bulk free scan eats 1 copy to use as spool-off space if the
32 thread hits its ram limits. This copy is not part of the normal
37 However, there is one major restriction: If an older volume header is
38 selected by the mount code, any newer (presumably corrupt since the
39 mount code didn't select it) volume headers will lose freemap consistency
40 as the freemap code rotates into freemap blocks that might have been used
41 by the topology pointed to by the newer (but not selected) backup
42 volume headers. For a RW mount, this means that if an older volume
43 backup is selected, the newer ones that were not selected MUST be
44 formally invalidated and cannot be used in a remount attempt. To
45 mitigate the potential loss of data, any volume headers lost in this
46 manner can be snapshotted and the freemap recovery scan (in a RW mount)
47 can also scan the snapshots to try to ensure that the blocks are marked
48 as allocated. The system operator can then check the snapshot manually.
50 During normal operation, each filesystem flush rotates to a new backup
51 volume header (a filesystem has up to four) and retains full consistency
52 for the older volume headers. Each logical freemap block in the topology
53 rotates through the 6 possible versions (on-modify only).
57 The freemap topology contains 4 levels of meta-data (blockref arrays),
58 one of which is embedded in the volume header (so only three real
59 meta-data levels), plus one level of leaf-data. Unlike normal files,
60 which use a variable-radix, the freemap topology uses a fixed radix to
61 simplify the algorithm and to ensure freemap locality to the blocks
64 Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented
65 by a hammer2_bmap_data[1024] array. Each entry represents
66 2MB worth of media storage x 1024 entries to represent 2GB.
67 Each entry contains a 128x2 bit bitmap representing 16KB
68 of storage in 2 bits (128 x 16KB = 2MB).
70 Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
71 Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
72 Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
73 Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
74 (this conveniently eats one 512-byte 'sector' of the 64KB
77 Each level is assign reserved blocks in the 4MB header per 2GB zone.
78 Since we use block 0 for the volume header / volume header backup,
79 our level names above can simply also represent the relative block
80 number. Level 1 uses block 1 through level 4 using block 4. Level 5
81 is stored in the volume header.
85 The freemap does not have to be flushed by fsync/sync, but should probably
86 be flushed at least once a minute by the normal filesystem sync. The
87 reason it does not have to be flushed every time is that the freemap
88 recovery (using the last fully flushed freemap TID) will simply do an
89 incremental scan of the main filesystem tree between the freemap TID
90 and the main filesystem tree's TID to ensure that blocks allocated in
91 the interim are properly allocated in the freemap. Simple as that.
95 The freemap granularity is 16KB (radix of 14) but the minimum
96 allocation radix is 1KB (radix of 10) (and can be in multiples of
97 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct
98 data, so tiny files eat exactly 1KB of media storage inclusive of the
101 The freemap keeps track of partial allocations in-memory but not
102 on-media, so even a normal umount will cause partially allocated
103 blocks to appear fully allocated until some later date when the
104 bulk scan code defragments it.
108 Block selection is localized to be near the inode's (or nearby data)
109 blockref. The algorithmic complexity of determining locality is not
114 * radix - Clustering radix. All allocations for any given ~2MB zone
115 are always the same size, allowing the filesystem code to
116 cluster buffer cache I/O.
118 * bitmap - four 32 bit words representing ~2MB in 16KB allocation chunks
119 at 2 bits per chunk. The filesystem allocation granularity
120 can be smaller (currently ~1KB minimum), and the live
121 filesystem keeps caches iterations when allocating multiple
122 chunks. However, on remount any partial allocations out of
123 a 64KB allocation block causes the entire 64KB to be
124 considered allocated. Fragmented space can potentially be
125 reclaimed and/or relocated by the bulk block free scan.
127 The 2-bit bitmap fields are assigned as follows:
130 01 ARMED for free stage (future use)
131 10 ARMED for free stage (future use)
134 It should be noted that in some cases, such as snapshot
135 destruction, H2 does not bother to actually ARM the related
136 blocks (which would take a long time). Instead, the bulk
137 free-scan may have to do a more exhaustive scan.
139 Blockref Substructure
141 The blockref substructure at each level steals some space from the
142 check code area (a 24-byte area). We only need 4 bytes for the check
143 code icrc. We use some of the remaining space to store information
144 that allows the block allocator to do its work more efficiently.
146 * bigmask - A mask of radixes available for allocation under this
147 blockref. Typically initialized to -1.
149 * avail - Total available space in bytes.
151 The freemap allocator uses a cylinder-group-like abstraction using
152 the localized allocation concept first implemented by UFS. In HAMMER2
153 there is no such thing as a real cylinder group, but we do the next
154 best thing by implementing our layer 1 blockmap representing 2GB.
158 Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total),
159 supplying 10 bits of address space each. Level 5 is a blockmap[8] stored
160 in the volume header supplying 3 bits of address space. (level 0
161 supplies 10 + 21 bits of address space).
163 The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus
164 effectively fixed at multiples of ~2MB or so.
168 The freemap is a multi-indirect block structure but there is no real
169 reason to pre-format it in newfs_hammer2. Instead, newfs_hammer2 simply
170 leaves the associated top-level indirect blocks empty and uses the
171 (voldata->allocator_beg) field to allocate space linearly, then leaves
172 it to the live filesystem to initialize the freemap as more space gets
175 The freemap does NOT use a fixed 5-level radix tree. It uses the same
176 blockmap algorithm used for file blocks but restricts any recursion to
177 specific radix values. This means that small filesystems will have much
178 smaller freemap depths. 2 layers (and not counting the volume header as
179 a layer) gets us 16GB, 3 layers gets us 16TB.
181 How blocks are allocated and freed
183 H2 keeps track of sub-16KB allocations in-memory. On a crash/reboot any
184 partial allocations effectively become full 16KB block allocations until
185 the bulk freeing code comes along and fixes it. 2-bit patterns are as
189 01 ARMED (for free) (future use)
190 10 ARMED (for free) (future use)
193 Currently H2 only implements 00 and 11. When a file, topology, or
194 snapshot is deleted H2 simply leaves the blocks marked allocated but
195 records the related freezone/radix(s) in memory.
197 At some point a background bulk free-scan will run. This code must
198 scan meta-data and has a limited cache to detect duplicative sub-trees
199 (due to snapshots). It uses the freezone/radix information recorded
200 in memory to reduce the complexity of the scan, find all references to
201 the related blocks in the meta-data, and determines what can actually
202 be freed. Once this determination is made the bulk free-scan sets
203 the related freemap bits to FREE (00).
205 An exhaustive free-scan is not usually required during normal operation
206 but is typically run incrementally by cron every so often to ensure, over
207 time, that all freeable blocks are actually freed. This is most useful
208 when maintaining multiple snapshots.
210 Use of Generic indirect-block API
212 I decided to use the same indirect-block allocation model for the
213 freemap that normal files use, with a few special cases added to force
214 specific radix values and to 'allocate' the freemap-related blocks
215 and indirect blocks via a reserved-block calculation and (obviously)
216 not via a recursive call to the allocator.
218 The Freemap is defined above as a fixed 5-level scheme (level 1-5),
219 but in actual operation the radix tree can be shortcut just as it
220 is with normal files. However, shorcuts are forced into the radix
221 values of this specification and reserved blocks are calculated based
222 on the radix level and offset, so as the freemap becomes more fleshed
223 out the tree looks more and more like the specification.
225 One advantage of doing things this way is that smaller filesystems
226 won't actually use a 6-level scheme. A 16GB filesystem can use 8
227 blockrefs at layer 5 (in the volume header) that point directly to
228 layer 1. A 16TB filesystem can use 8 blockrefs at layer5 that point
229 to layer 2. And so forth.
231 At the moment we have no plans to return any of the unused 4MB zone
232 header space (per 2GB of storage) back to the filesystem for general use.
233 There are lots of things we may want to use the reserved areas for in