HAMMER2 Freemap Design Notes Overview HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The blocks in this header are reserved for various purposes. For example, block #0 is reserved for a volume header or volume header backup. Most of the 64KB blocks in this header are reserved for use by the freemap. The freemap only uses blocks from these reserved areas. In order to ensure that any of the four volume headers can be used for the mount (in case some are found to be corrupted), each freemap block in the logical freemap topology iterates through 6 different copies. The freemap is a 4-layer topology (+1 3-bit layer in the volume header), so 6x4 = 24 of the 64 reserved blocks are dedicated to freemap operations. Any given modification of a freemap block that crosses a flush group must cycle to the next copy of the freemap block. Having 6 copies ensures that: - Each of the four backup volume headers points to a consistent freemap topology. This eats 4 copies. - That recovery operations during mount do not modify the state of the freemap topology pointed to by older volume headers that are still valid. This eats 1 copy. - The bulk free scan eats 1 copy to use as spool-off space if the thread hits its ram limits. This copy is not part of the normal rotation. - Total is 6 copies. However, there is one major restriction: If an older volume header is selected by the mount code, any newer (presumably corrupt since the mount code didn't select it) volume headers will lose freemap consistency as the freemap code rotates into freemap blocks that might have been used by the topology pointed to by the newer (but not selected) backup volume headers. For a RW mount, this means that if an older volume backup is selected, the newer ones that were not selected MUST be formally invalidated and cannot be used in a remount attempt. To mitigate the potential loss of data, any volume headers lost in this manner can be snapshotted and the freemap recovery scan (in a RW mount) can also scan the snapshots to try to ensure that the blocks are marked as allocated. The system operator can then check the snapshot manually. During normal operation, each filesystem flush rotates to a new backup volume header (a filesystem has up to four) and retains full consistency for the older volume headers. Each logical freemap block in the topology rotates through the 6 possible versions (on-modify only). Freemap Topology The freemap topology contains 4 levels of meta-data (blockref arrays), one of which is embedded in the volume header (so only three real meta-data levels), plus one level of leaf-data. Unlike normal files, which use a variable-radix, the freemap topology uses a fixed radix to simplify the algorithm and to ensure freemap locality to the blocks under management. Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented by a hammer2_bmap_data[1024] array. Each entry represents 2MB worth of media storage x 1024 entries to represent 2GB. Each entry contains a 128x2 bit bitmap representing 16KB of storage in 2 bits (128 x 16KB = 2MB). Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry) Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry) Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry) Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64) (this conveniently eats one 512-byte 'sector' of the 64KB volume header). Each level is assign reserved blocks in the 4MB header per 2GB zone. Since we use block 0 for the volume header / volume header backup, our level names above can simply also represent the relative block number. Level 1 uses block 1 through level 4 using block 4. Level 5 is stored in the volume header. Flushing The freemap does not have to be flushed by fsync/sync, but should probably be flushed at least once a minute by the normal filesystem sync. The reason it does not have to be flushed every time is that the freemap recovery (using the last fully flushed freemap TID) will simply do an incremental scan of the main filesystem tree between the freemap TID and the main filesystem tree's TID to ensure that blocks allocated in the interim are properly allocated in the freemap. Simple as that. Freemap Granularity The freemap granularity is 16KB (radix of 14) but the minimum allocation radix is 1KB (radix of 10) (and can be in multiples of 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct data, so tiny files eat exactly 1KB of media storage inclusive of the inode. The freemap keeps track of partial allocations in-memory but not on-media, so even a normal umount will cause partially allocated blocks to appear fully allocated until some later date when the bulk scan code defragments it. Block Selection Block selection is localized to be near the inode's (or nearby data) blockref. The algorithmic complexity of determining locality is not defined here atm. Leaf Substructure * radix - Clustering radix. All allocations for any given ~2MB zone are always the same size, allowing the filesystem code to cluster buffer cache I/O. * bitmap - four 32 bit words representing ~2MB in 16KB allocation chunks at 2 bits per chunk. The filesystem allocation granularity can be smaller (currently ~1KB minimum), and the live filesystem keeps caches iterations when allocating multiple chunks. However, on remount any partial allocations out of a 64KB allocation block causes the entire 64KB to be considered allocated. Fragmented space can potentially be reclaimed and/or relocated by the bulk block free scan. The 2-bit bitmap fields are assigned as follows: 00 FREE 01 ARMED for free stage (future use) 10 ARMED for free stage (future use) 11 ALLOCATED It should be noted that in some cases, such as snapshot destruction, H2 does not bother to actually ARM the related blocks (which would take a long time). Instead, the bulk free-scan may have to do a more exhaustive scan. Blockref Substructure The blockref substructure at each level steals some space from the check code area (a 24-byte area). We only need 4 bytes for the check code icrc. We use some of the remaining space to store information that allows the block allocator to do its work more efficiently. * bigmask - A mask of radixes available for allocation under this blockref. Typically initialized to -1. * avail - Total available space in bytes. The freemap allocator uses a cylinder-group-like abstraction using the localized allocation concept first implemented by UFS. In HAMMER2 there is no such thing as a real cylinder group, but we do the next best thing by implementing our layer 1 blockmap representing 2GB. Level 2, 3, 4, 5 Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total), supplying 10 bits of address space each. Level 5 is a blockmap[8] stored in the volume header supplying 3 bits of address space. (level 0 supplies 10 + 21 bits of address space). The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus effectively fixed at multiples of ~2MB or so. Initial Conditions The freemap is a multi-indirect block structure but there is no real reason to pre-format it in newfs_hammer2. Instead, newfs_hammer2 simply leaves the associated top-level indirect blocks empty and uses the (voldata->allocator_beg) field to allocate space linearly, then leaves it to the live filesystem to initialize the freemap as more space gets allocated. The freemap does NOT use a fixed 5-level radix tree. It uses the same blockmap algorithm used for file blocks but restricts any recursion to specific radix values. This means that small filesystems will have much smaller freemap depths. 2 layers (and not counting the volume header as a layer) gets us 16GB, 3 layers gets us 16TB. How blocks are allocated and freed H2 keeps track of sub-16KB allocations in-memory. On a crash/reboot any partial allocations effectively become full 16KB block allocations until the bulk freeing code comes along and fixes it. 2-bit patterns are as follows: 00 FREE 01 ARMED (for free) (future use) 10 ARMED (for free) (future use) 11 ALLOCATED Currently H2 only implements 00 and 11. When a file, topology, or snapshot is deleted H2 simply leaves the blocks marked allocated but records the related freezone/radix(s) in memory. At some point a background bulk free-scan will run. This code must scan meta-data and has a limited cache to detect duplicative sub-trees (due to snapshots). It uses the freezone/radix information recorded in memory to reduce the complexity of the scan, find all references to the related blocks in the meta-data, and determines what can actually be freed. Once this determination is made the bulk free-scan sets the related freemap bits to FREE (00). An exhaustive free-scan is not usually required during normal operation but is typically run incrementally by cron every so often to ensure, over time, that all freeable blocks are actually freed. This is most useful when maintaining multiple snapshots. Use of Generic indirect-block API I decided to use the same indirect-block allocation model for the freemap that normal files use, with a few special cases added to force specific radix values and to 'allocate' the freemap-related blocks and indirect blocks via a reserved-block calculation and (obviously) not via a recursive call to the allocator. The Freemap is defined above as a fixed 5-level scheme (level 1-5), but in actual operation the radix tree can be shortcut just as it is with normal files. However, shorcuts are forced into the radix values of this specification and reserved blocks are calculated based on the radix level and offset, so as the freemap becomes more fleshed out the tree looks more and more like the specification. One advantage of doing things this way is that smaller filesystems won't actually use a 6-level scheme. A 16GB filesystem can use 8 blockrefs at layer 5 (in the volume header) that point directly to layer 1. A 16TB filesystem can use 8 blockrefs at layer5 that point to layer 2. And so forth. At the moment we have no plans to return any of the unused 4MB zone header space (per 2GB of storage) back to the filesystem for general use. There are lots of things we may want to use the reserved areas for in the future.