HAMMER2 Freemap Design Notes
HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
contains a 4 MByte header (64 x 64K blocks).
Block #0 is the volume header.  The next three zones, assuming the
media is big enough, contain backup volume headers.  Flushes cycle
through these four headers and the mount code iterates all four to
locate the best candidate to mount with.  The reason for this is to
ensure that mounting works even if a crash catches a volume header in
a partial update.
Remaining blocks are used for various purposes, primarily by the
freemap.
* It is very important to remember that the Freemap only uses
blocks from these reserved areas.  Freemap blocks are NOT dynamically
allocated.
* On-mount, the synchronization TID for the main H2 filesystem is
compared against the synchronization TID of the freemap and the
H2 topology is incrementally iterated using mirror_tid to update
the freemap with any missing information.  This way the freemap flush
does not need to be synchronized with the normal H2 flush.
* The freemap is flushed in a manner similar to the normal H2
filesystem, but as mentioned above it does not have to be
synchronized.  One freemap flush could cover several H2 flushes.
A freemap flush is not necessary for e.g. a fsync() or sync() to
complete successfully.
* The minimum allocation radix is 10 (1 Kilobyte).  In HAMMER2, the
inode is 1KB and contains up to 512 bytes of direct data, so in terms
of file storage efficiency H2 can pack a small file and its inode
into a single 1KB block.
The freemap thus must handle a 1KB granularity, which comes to around
256KB per 2GB zone at 1-bit-per-block.  Since we have ~4MB available,
there is plenty of space to implement redundancy.
The freemap topology contains five levels of meta-data (blockref
arrays, levels 1-5) layered above the level 0 storage bitmap:
Level 0 - (radix 10+19+2) 256KB bitmap representing 2GB

Level 1 - (radix 10) 64KB blockmap representing 2GB.  This level
          shadows level 0 exactly.  There are 1024 blockref entries,
          each representing ~2MB worth of media storage.

Level 2 - (radix 10) 64KB blockmap representing 2TB.
Level 3 - (radix 10) 64KB blockmap representing 2PB.
Level 4 - (radix 10) 64KB blockmap representing 2EB.
Level 5 - (radix 3) blockref x 8 in volume header representing 16EB
          (2^64) (this conveniently eats one 512-byte 'sector' of the
          64KB volume header)
Each level is assigned reserved blocks in the 4MB header per 2GB zone.
Level 0 requires four blocks (#1-#4); levels 1, 2, 3, and 4 each
require one block (#5, #6, #7, #8), while level 5 is embedded in the
volume header.
In addition, the reserved blocks 1-8 are not overwritten on each
flush.  Instead, a different set of reserved blocks is used.  Four
sets, A-D, are specified: A=1-8, B=9-16, C=17-24, D=25-32.  Blocks
33-63 are unused at present and reserved for future use.
The blockref substructure at each level steals some space from the
check code area (a 24-byte area).  We only need 4 bytes for the check
code icrc.  We use some of the remaining space to store information
that allows the block allocator to do its work more efficiently.
* biggest - Biggest available linear allocation radix (powers of 2).
May be initialized larger, but the 2GB zone has a 4MB chunk taken
out of it for a header, so the maximum linear allocation is going
to be 1GB (and not an efficient 1GB at that).
* avail - Total available space in bytes.
The freemap allocator uses a cylinder-group-like abstraction using
the localized allocation concept first implemented by UFS.  In HAMMER2
there is no such thing as a real cylinder group, but we do the next
best thing by implementing our layer 1 blockmap representing 2GB.
The layer 1 blockmap is an array of 1024 blockrefs, so each blockref
covers 2MB worth of media storage.  HAMMER2's 'cylinder group' concept
thus has a minimum granularity of 2MB.
By localizing allocations to cylinder groups based on various bits of
information, HAMMER2 tries to allocate space on the disk while still
leaving some left over for localized expansion, reducing fragmentation
at the same time.  Not an easy task, especially considering the
copy-on-write nature of the filesystem.  This part of the algorithm
likely needs a lot of work, but I hope I've laid down a media format
that will not have to be changed down the line to accommodate better
allocation strategies.
The freemap is a multi-indirect block structure but there is no real
reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2
simply leaves the associated blockset empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then
leaves it to the live filesystem to initialize the freemap as more
space gets allocated.
To keep the abstraction simple, this means in the bitmap
0=unallocated, 1=allocated.  The allocation blockmap is initialized
for the zone's 4MB reserve area as new zones are opened up for
allocation.  Initialization of the freemap for the root zone at
offset 0 is further adjusted based on (voldata->allocator_beg).
This field is not used once the freemap for the root zone has been
set up by the live filesystem.
Use of Generic indirect-block API
I decided to use the same indirect-block allocation model for the
freemap that normal files use, with a few special cases added to
force specific radix values and to 'allocate' the freemap-related
blocks and indirect blocks via a reserved-block calculation and
(obviously) not via a recursive call to the allocator.
The Freemap is defined above as a fixed 6-level scheme (levels 0-5),
but in actual operation the radix tree can be shortcut just as it
is with normal files.  However, shortcuts are forced into the radix
values of this specification, and reserved blocks are calculated
based on the radix level and offset, so as the freemap becomes more
fleshed out the tree looks more and more like the specification.
One advantage of doing things this way is that smaller filesystems
won't actually use a 6-level scheme.  A 16GB filesystem can use 8
blockrefs at layer 5 (in the volume header) that point directly to
layer 1.  A 16TB filesystem can use 8 blockrefs at layer 5 that point
to layer 2.  And so forth.
At the moment we have no plans to return any of the unused 4MB zone
header space back to the filesystem for general use.  There are lots
of things we may want to use the reserved areas for in the future.