2 HAMMER2 Freemap Design Notes
6 HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone
7 contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The
8 blocks in this header are reserved for various purposes. For example,
9 block #0 is reserved for a volume header or volume header backup. Most
10 of the 64KB blocks in this header are reserved for use by the freemap.
12 The freemap only uses blocks from these reserved areas. In order to
13 ensure that any of the four volume headers can be used for the mount
14 (in case some are found to be corrupted), each freemap block in the
15 logical freemap topology iterates through 6 different copies. The
16 freemap is a 4-layer topology (+1 3-bit layer in the volume header),
17 so 6x4 = 24 of the 64 reserved blocks are dedicated to freemap
20 Any given modification of a freemap block that crosses a flush group
21 must cycle to the next copy of the freemap block. Having 6 copies
24 - Each of the four backup volume headers points to a consistent
25 freemap topology. This eats 4 copies.
27 - That recovery operations during mount do not modify the state of the
28 freemap topology pointed to by older volume headers that are still
29 valid. This eats 1 copy.
31 - The bulk free scan eats 1 copy to use as spool-off space if the
32 thread hits its ram limits. This copy is not part of the normal
37 However, there is one major restriction: If an older volume header is
38 selected by the mount code, any newer (presumably corrupt since the
39 mount code didn't select it) volume headers will lose freemap consistency
40 as the freemap code rotates into freemap blocks that might have been used
41 by the topology pointed to by the newer (but not selected) backup
42 volume headers. For a RW mount, this means that if an older volume
43 backup is selected, the newer ones that were not selected MUST be
44 formally invalidated and cannot be used in a remount attempt. To
45 mitigate the potential loss of data, any volume headers lost in this
46 manner can be snapshotted and the freemap recovery scan (in a RW mount)
47 can also scan the snapshots to try to ensure that the blocks are marked
48 as allocated. The system operator can then check the snapshot manually.
50 During normal operation, each filesystem flush rotates to a new backup
51 volume header (a filesystem has up to four) and retains full consistency
52 for the older volume headers. Each logical freemap block in the topology
53 rotates through the 6 possible versions (on-modify only).
57 The freemap topology contains 4 levels of meta-data (blockref arrays),
58 one of which is embedded in the volume header (so only three real
59 meta-data levels), plus one level of leaf-data. Unlike normal files,
60 which use a variable-radix, the freemap topology uses a fixed radix to
61 simplify the algorithm and to ensure freemap locality to the blocks
64 Level 1 - (radix 10 + 21) 16KB blockmap representing 2GB. There are 1024
65 entries representing ~2MB worth of media storage per entry.
66 Each entry contains a 128x2 bit bitmap representing 16KB
69 Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
70 Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
71 Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
72 Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
73 (this conveniently eats one 512-byte 'sector' of the 64KB
76 Each level is assign reserved blocks in the 4MB header per 2GB zone.
77 Since we use block 0 for the volume header / volume header backup,
78 our level names above can simply also represent the relative block
79 number. Level 1 uses block 1 through level 4 using block 4. Level 5
80 is stored in the volume header.
84 The freemap does not have to be flushed by fsync/sync, but should probably
85 be flushed at least once a minute by the normal filesystem sync. The
86 reason it does not have to be flushed every time is that the freemap
87 recovery (using the last fully flushed freemap TID) will simply do an
88 incremental scan of the main filesystem tree between the freemap TID
89 and the main filesystem tree's TID to ensure that blocks allocated in
90 the interim are properly allocated in the freemap. Simple as that.
94 The freemap granularity is 16KB (radix of 14) but the minimum
95 allocation radix is 1KB (radix of 10) (and can be in multiples of
96 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct
97 data, so tiny files eat exactly 1KB of media storage inclusive of the
100 The freemap keeps track of partial allocations in-memory but not
101 on-media, so even a normal umount will cause partially allocated
102 blocks to appear fully allocated until some later date when the
103 bulk scan code defragments it.
107 Block selection is localized to be near the inode's (or nearby data)
108 blockref. The algorithmic complexity of determining locality is not
113 * radix - Clustering radix. All allocations for any given ~2MB zone
114 are always the same size, allowing the filesystem code to
115 cluster buffer cache I/O.
117 * bitmap - four 32 bit words representing ~2MB in 16KB allocation chunks
118 at 2 bits per chunk. The filesystem allocation granularity
119 can be smaller (currently ~1KB minimum), and the live
120 filesystem keeps caches iterations when allocating multiple
121 chunks. However, on remount any partial allocations out of
122 a 64KB allocation block causes the entire 64KB to be
123 considered allocated. Fragmented space can potentially be
124 reclaimed and/or relocated by the bulk block free scan.
126 The 2-bit bitmap fields are assigned as follows:
129 01 ARMED for free stage (future use)
130 10 ARMED for free stage (future use)
133 It should be noted that in some cases, such as snapshot
134 destruction, H2 does not bother to actually ARM the related
135 blocks (which would take a long time). Instead, the bulk
136 free-scan may have to do a more exhaustive scan.
138 Blockref Substructure
140 The blockref substructure at each level steals some space from the
141 check code area (a 24-byte area). We only need 4 bytes for the check
142 code icrc. We use some of the remaining space to store information
143 that allows the block allocator to do its work more efficiently.
145 * bigmask - A mask of radixes available for allocation under this
146 blockref. Typically initialized to -1.
148 * avail - Total available space in bytes.
150 The freemap allocator uses a cylinder-group-like abstraction using
151 the localized allocation concept first implemented by UFS. In HAMMER2
152 there is no such thing as a real cylinder group, but we do the next
153 best thing by implementing our layer 1 blockmap representing 2GB.
157 Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total),
158 supplying 10 bits of address space each. Level 5 is a blockmap[8] stored
159 in the volume header supplying 3 bits of address space. (level 0
160 supplies 10 + 21 bits of address space).
162 The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus
163 effectively fixed at multiples of ~2MB or so.
167 The freemap is a multi-indirect block structure but there is no real
168 reason to pre-format it in newfs_hammer2. Instead, newfs_hammer2 simply
169 leaves the associated top-level indirect blocks empty and uses the
170 (voldata->allocator_beg) field to allocate space linearly, then leaves
171 it to the live filesystem to initialize the freemap as more space gets
174 The freemap does NOT use a fixed 5-level radix tree. It uses the same
175 blockmap algorithm used for file blocks but restricts any recursion to
176 specific radix values. This means that small filesystems will have much
177 smaller freemap depths. 2 layers (and not counting the volume header as
178 a layer) gets us 16GB, 3 layers gets us 16TB.
180 How blocks are allocated and freed
182 H2 keeps track of sub-16KB allocations in-memory. On a crash/reboot any
183 partial allocations effectively become full 16KB block allocations until
184 the bulk freeing code comes along and fixes it. 2-bit patterns are as
188 01 ARMED (for free) (future use)
189 10 ARMED (for free) (future use)
192 Currently H2 only implements 00 and 11. When a file, topology, or
193 snapshot is deleted H2 simply leaves the blocks marked allocated but
194 records the related freezone/radix(s) in memory.
196 At some point a background bulk free-scan will run. This code must
197 scan meta-data and has a limited cache to detect duplicative sub-trees
198 (due to snapshots). It uses the freezone/radix information recorded
199 in memory to reduce the complexity of the scan, find all references to
200 the related blocks in the meta-data, and determines what can actually
201 be freed. Once this determination is made the bulk free-scan sets
202 the related freemap bits to FREE (00).
204 An exhaustive free-scan is not usually required during normal operation
205 but is typically run incrementally by cron every so often to ensure, over
206 time, that all freeable blocks are actually freed. This is most useful
207 when maintaining multiple snapshots.
209 Use of Generic indirect-block API
211 I decided to use the same indirect-block allocation model for the
212 freemap that normal files use, with a few special cases added to force
213 specific radix values and to 'allocate' the freemap-related blocks
214 and indirect blocks via a reserved-block calculation and (obviously)
215 not via a recursive call to the allocator.
217 The Freemap is defined above as a fixed 5-level scheme (level 1-5),
218 but in actual operation the radix tree can be shortcut just as it
219 is with normal files. However, shorcuts are forced into the radix
220 values of this specification and reserved blocks are calculated based
221 on the radix level and offset, so as the freemap becomes more fleshed
222 out the tree looks more and more like the specification.
224 One advantage of doing things this way is that smaller filesystems
225 won't actually use a 6-level scheme. A 16GB filesystem can use 8
226 blockrefs at layer 5 (in the volume header) that point directly to
227 layer 1. A 16TB filesystem can use 8 blockrefs at layer5 that point
228 to layer 2. And so forth.
230 At the moment we have no plans to return any of the unused 4MB zone
231 header space (per 2GB of storage) back to the filesystem for general use.
232 There are lots of things we may want to use the reserved areas for in