HAMMER2 Freemap Design Notes

Overview

HAMMER2 media is broken down into 2 GByte zones. Each 2 GByte zone contains
a 4 MByte header (64 x 64KB blocks = 0.2% of storage). The blocks in this
header are reserved for various purposes. For example, block #0 is reserved
for a volume header in the first four zones. Most of the remaining 64KB
blocks in this header are reserved for use by the freemap. The freemap only
uses blocks from these reserved areas.

In order to ensure that any of the four volume headers can be used by the
mount code (in case some are found to be corrupted), each freemap block in
the logical freemap topology will iterate through up to 8 copies whose block
numbers are taken from the reserved area:

- Four copies, one for each of the four volume headers which H2 sequences
  through on each flush. This ensures that a mount from any of the four
  volume headers is handed a consistent freemap topology.

- One copy to ensure that recovery operations during mount do not modify the
  state of the freemap topology pointed to by older volume headers which are
  still valid. Note that the freemap for volume headers indexed after the
  mount point being recovered may lose freemap consistency, so if you choose
  an older mount point for a RW mount, you have to stick with it.

- One copy for live operations. This allows HAMMER2 to retire the related
  buffer asynchronously in the background (or for the OS to retire the
  buffer cache buffer on its own) prior to the formal flush. The later
  formal flush then has less work to do.

- The two remaining copies add robustness to the specification. For example,
  with appropriate feature code added the filesystem can tolerate a limited
  number of bad blocks in the reserved area.

For the moment we use a simple calculation for the freemap block. In a later
version I would like to mix the blocks up a bit so the blocks in each set of
8 are not situated near each other.

RW Mount Restrictions

If an older volume header is explicitly selected by the mount code, any
newer (presumably corrupt, since the mount code didn't select them) volume
headers will lose freemap consistency as the freemap code rotates into
freemap blocks that might have been used by the topology pointed to by the
newer (but not selected) volume headers.

For a RW mount, this means that if an older volume header is selected, the
newer ones that were not selected WILL be formally invalidated by the mount
code and cannot be used in a remount attempt.

During normal operation, each filesystem flush rotates to a new volume
header. A filesystem may have up to four volume headers spread at 2GB
intervals. Filesystems smaller than ~9GB or so will have fewer volume
headers to rotate through.

Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays), one
of which is embedded in the volume header (so only three real meta-data
levels), plus one level of leaf-data. Unlike normal files, which use a
variable radix, the freemap topology uses a fixed radix to simplify the
algorithm and to ensure freemap locality to the blocks under management.

Freemap blocks are allocated from the reserved area in each 2GB zone. The
leafs represent data in the zone. Higher levels in the freemap topology
cover more area but the physical freemap meta-data blocks always occur prior
to the area being covered. Thus a HAMMER2 filesystem of almost any size can
be formatted and the related freemap blocks will always exist.

Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented by a
          hammer2_bmap_data[1024] array.
          Each entry represents 2MB worth of media storage x 1024 entries
          to represent 2GB. Each entry contains a 128x2 bit bitmap
          representing 16KB of storage in 2 bits (128 x 16KB = 2MB).

Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)

Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)

Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)

Level 5 - (radix 3) blockref x 8 in the volume header representing 16EB
          (2^64). This conveniently eats one 512-byte 'sector' of the 64KB
          volume header.

Each level is assigned reserved blocks in the 4MB header per 2GB zone. Since
we use block 0 for the volume header, the first freemap reserved block in
the zone begins at block 1.

Freemap copy #0:
    Level 1 uses block 1 (this is the leaf block)
    Level 2 uses block 2
    Level 3 uses block 3
    Level 4 uses block 4

Freemap copy #1:
    Level 1 uses block 5 (this is the leaf block)
    Level 2 uses block 6
    Level 3 uses block 7
    Level 4 uses block 8

... and so forth, up to freemap copy #7 which uses blocks 29, 30, 31,
and 32.

Flushing

The freemap does not have to be flushed by fsync/sync, but should probably
be flushed at least once a minute by the normal filesystem sync. The reason
it does not have to be flushed with fsync is that freemap recovery is
executed on-mount and will use the last fully flushed freemap TID stored in
the volume header to do an incremental meta-data scan of the H2 filesystem
between that TID and the last flushed TID. Any blocks referenced by that
scan which are not found to have been marked allocated will be marked
allocated. Simple as that. Since the scan is incremental, this typically
costs very little time.

Freemap Granularity

The freemap granularity is 16KB (radix of 14) but the minimum allocation
radix is 1KB (radix of 10) (and can be in multiples of 1KB with some
coding). 1KB inodes can hold up to 512 bytes of direct data, so tiny files
eat exactly 1KB of media storage inclusive of the inode.

The freemap keeps track of partial allocations in-memory but not on-media,
so even a normal umount will cause partially allocated blocks to appear
fully allocated until some later date when the bulk scan code defragments
them.

Block Selection

Block selection is localized to be near the inode's (or nearby data)
blockref. The algorithmic complexity of determining locality is not defined
here atm.

Freemap Leaf Substructure

* linear - Linear sub-granular allocation offset. Allows ~1KB granular
  linear allocations.

* class - Allocation clustering class ((type << 8) | radix).

* avail - Available space in bytes, currently only used by the layer 1
  leaf. Used as an allocation clustering aid.

* bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
  at 2 bits per chunk. The filesystem allocation granularity can be smaller
  (currently ~1KB minimum), and the live filesystem caches iterations when
  allocating multiple chunks. However, on remount any partial allocations
  out of a 64KB allocation block MAY cause the entire 64KB to be considered
  allocated. Fragmented space can potentially be reclaimed and/or relocated
  by the bulk block free scan.

The 2-bit bitmap fields are assigned as follows:

    00  FREE
    01  POSSIBLY FREE (type 1)
    10  POSSIBLY FREE (type 2)
    11  ALLOCATED
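The leaf substructure above can be pictured with a small C sketch. This is
illustrative only: the structure name, field widths, and helper are
assumptions drawn from the description above and do not reproduce the
actual on-media hammer2_bmap_data layout.

    /*
     * Illustrative sketch of one freemap leaf entry as described above.
     * Field names and widths follow the text; the real on-media
     * hammer2_bmap_data structure differs in detail.
     */
    #include <stdint.h>

    #define BMAP_WORDS      8           /* eight 32-bit bitmap words   */
    #define BMAP_CHUNKS     128         /* 8 * 32 / 2 bits per chunk   */
    #define BMAP_CHUNK_SIZE 16384       /* 16KB per 2-bit chunk        */

    /* 2-bit bitmap states */
    #define BMAP_FREE       0           /* 00 FREE                     */
    #define BMAP_PFREE1     1           /* 01 POSSIBLY FREE (type 1)   */
    #define BMAP_PFREE2     2           /* 10 POSSIBLY FREE (type 2)   */
    #define BMAP_ALLOCATED  3           /* 11 ALLOCATED                */

    struct bmap_entry {
            uint32_t linear;            /* linear sub-granular offset  */
            uint16_t class;             /* (type << 8) | radix         */
            uint32_t avail;             /* available bytes (leaf only) */
            uint32_t bitmap[BMAP_WORDS];/* 128 x 2-bit chunks = 2MB    */
    };

    /* Extract the 2-bit state of chunk n (0..127) within a leaf entry. */
    static inline int
    bmap_chunk_state(const struct bmap_entry *bmap, int n)
    {
            return ((bmap->bitmap[n >> 4] >> ((n & 15) * 2)) & 3);
    }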
Freemap Metadata Substructure (Levels 2, 3, 4, and 5)

Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal some
of the check area (a 24-byte area) for freemap-specific meta-data. We
reserve a few fields to store information which allows the block allocator
to do its work more efficiently:

* bigmask - A mask of radixes available for allocation under this blockref.
  Typically initialized to -1.

* avail - Total available space in bytes.

The freemap allocator uses a cylinder-group-like abstraction using the
localized allocation concept first implemented by UFS. In HAMMER2 there is
no such thing as a real cylinder group, nor are there specific reserved
areas for inodes vs data, but we do the next best thing by roughly typing
leafs (each leaf representing ~2MB) to hopefully allow the drive to employ
its zone-cache to make both stat-only and tar-style bulk accesses efficient
(in addition to normal file accesses).

Levels 2, 3, and 4 contain an array blockmap[1024] (64KB total), supplying
10 bits of address space each. Level 5 is a blockmap[8] stored in the volume
header, supplying 3 bits of address space (Level 1 supplies 10 + 21 bits of
address space). The Level 1 blockmap is HAMMER2's idea of a 'cylinder
group', thus effectively fixed at multiples of ~2MB or so.

Initial Conditions

newfs_hammer2 does not need to format the freemap. Instead, newfs_hammer2
simply leaves the associated top-level indirect blocks empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then leaves it to
the live filesystem to initialize the freemap as more space gets allocated.

The freemap does NOT use a fixed 5-level radix tree. It uses the same
blockmap algorithm used for file blocks but restricts any recursion to
specific radix values. This means that small filesystems will have much
smaller freemap depths: 2 layers (counting the blockref array embedded in
the volume header as a layer) gets us 16GB, 3 layers gets us 16TB.

How Blocks are Allocated and Freed

The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
contains a linear allocation offset that can keep track of sub-16KB
allocations with certain restrictions. More random sub-16KB allocations are
tracked in-memory, but will be lost (assumed to be a full 16KB) if a crash
occurs.

Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10, or 11.

NOTE! All operations on the freemap occur on the current live version of the
freemap, including bulkfree operations.

Blocks are allocated by transitioning the 2-bit pattern in the leaf to 11.
That is, (00, 01, 10) -> (11).

The primary mechanism used to free a block is via the asynchronous bulkfree
scan. This scans all filesystem meta-data in two major passes (and
potentially multiple sub-passes).

Pass#1 - The first pass figures out which blocks might be freeable. The most
recently flushed meta-data topology (including all four volume headers and
all snapshots) is scanned and an in-memory copy of the freemap is built from
scratch. Multiple sub-scans might be required to break the larger scan up
into more easily digested pieces based on the amount of memory available to
hold the temporary freemap. Any allocated blocks in the live freemap are
then transitioned from (11) to either (10) or (01) if, after the scan, they
are found to not be allocated. The blocks are still assumed to be allocated
at this time and any new allocations will transition them back to (11).

Pass#2 - The second pass is required to deal with races against the live
filesystem while the freemap scan was running. It also allows the freemap
scans to run asynchronously from any flush, improving concurrency. However,
at least one synchronous flush is required between Pass#1 and Pass#2. The
second pass is a duplicate of the first pass: the meta-data topology is
scanned and a freemap is built in-memory, then compared against the live
freemap. Instead of transitioning from (11) -> (10)/(01), this pass
transitions from (10)/(01) to (00). If a block that it thinks is free is
(11), no transition occurs because this could be due to a race against the
live filesystem. This pass will incidentally transition (10)/(01) back to
(11) if the block was found to be allocated after all, but it is perfectly
acceptable for the block to remain in a (10)/(01) state after completion.
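The allocation and two-pass bulkfree rules above can be summarized as simple
2-bit state transitions. The following C sketch is illustrative only: the
function names are hypothetical, and the real bulkfree code operates on
whole bitmap words and buffers rather than single chunks.

    /*
     * Illustrative 2-bit state transitions for a single 16KB chunk,
     * following the allocation and bulkfree rules described above.
     * 'live' is the chunk's current state in the live freemap and
     * 'scanned_allocated' is non-zero when the bulkfree meta-data scan
     * found the chunk to be in use. Hypothetical helpers, not the actual
     * bulkfree implementation.
     */

    /* Live allocation: any state transitions directly to ALLOCATED (11). */
    static inline int
    chunk_allocate(void)
    {
            return (3);                 /* 11 ALLOCATED */
    }

    /*
     * Bulkfree Pass#1: (11) is demoted to a POSSIBLY FREE state when the
     * scan did not find the chunk allocated; everything else is left
     * alone. (The choice between the 01 and 10 encodings is not covered
     * here.)
     */
    static inline int
    chunk_bulkfree_pass1(int live, int scanned_allocated)
    {
            if (live == 3 && !scanned_allocated)
                    return (1);         /* 01 POSSIBLY FREE (type 1) */
            return (live);
    }

    /*
     * Bulkfree Pass#2: only POSSIBLY FREE chunks are touched. A chunk
     * still found unallocated becomes FREE (00); one found allocated is
     * incidentally promoted back to (11). A chunk that raced back to
     * (11) in the live freemap is never transitioned.
     */
    static inline int
    chunk_bulkfree_pass2(int live, int scanned_allocated)
    {
            if (live == 1 || live == 2)
                    return (scanned_allocated ? 3 : 0);
            return (live);
    }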
NOTE! The meta-data scanning passes must also explicitly scan blocks
associated with any open files, since these might represent open-but-deleted
files. These blocks must not be accidentally freed while the system is still
using the file. Again, since this is done in two passes it does not have to
be synchronized against frontend operations.

So in total, the bulkfree passes must scan:

* Topology under all four volume headers. This includes all PFSs and
  snapshots.

* Topology under all open hammer2 files.

The bulk-free operation is expensive but uses a bounded amount of RAM. The
ADVANTAGE of this mechanism is that deletions in the live filesystem do not
have to clean up the freemap and thus do not have to recurse the topology
during the deletion. In fact, a 'rm -rf' equivalent of a directory topology
can be handled simply by blowing away the top-level directory inode. This is
instantaneous and thus can be dangerous, but you always have your snapshots
to fall back on.

The DISADVANTAGE is that all meta-data must be scanned. Twice. This can be
mitigated by using swapcache(8) to cache the meta-data on a SSD. It is also
mitigated by the fact that the bulkfree scan can be run less often on very
large filesystems, which presumably have a lot of free space (so the
interval is not as big an issue). In a sense the operation does scale: it
takes longer on larger filesystems but can also be run less often.

The biggest issue is that *NO* space can be freed up by the live filesystem
without the bulkfree process unless we optimize the case where data is
created and deleted from within a single snapshot. This is made more
difficult by the fact that each flush represents a fine-grained snapshot (up
to four, representing the four volume headers the flush iterates through).

Snapshots and Replicated Topologies

The bulkfree code maintains information in-memory to the best of its ability
for a multitude of reasons, including attempting to detect snapshot
recursions down block chains which have already been scanned via some other
snapshot. Without this, a large number of snapshots can cause a huge
multiplication of disk I/O reads (but not writes) during the topology scan.

Use of Generic Indirect-Block API

I decided to use the same indirect-block allocation model for the freemap
that normal files use, with a few special cases added to force specific
radix values and to 'allocate' the freemap-related blocks and indirect
blocks via a reserved-block calculation and (obviously) not via a recursive
call to the allocator.

The freemap is defined above as a fixed 5-level scheme (levels 1-5), but in
actual operation the radix tree can be shortcut just as it is with normal
files. However, unlike normal files, shortcuts will be forced to use
specific radix values in order to guarantee that reserved block numbers can
be trivially calculated. As the freemap becomes more fleshed out, the
on-media tree will look more and more like the actual specification.
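As a concrete illustration of the reserved-block calculation mentioned
above, the following C sketch computes the media offset of a reserved
freemap block from the simple linear layout listed under "Freemap Topology"
(copy #0 uses blocks 1-4, copy #7 uses blocks 29-32). The identifiers are
hypothetical; this is not the actual HAMMER2 routine, which may mix the
blocks up in later versions as noted in the Overview.

    /*
     * Illustrative reserved-block calculation assuming the simple linear
     * layout described under "Freemap Topology": block 0 of each 2GB zone
     * holds the volume header slot and freemap copy N uses 64KB blocks
     * (1 + N*4) through (4 + N*4) for levels 1 through 4. Hypothetical
     * identifiers; not the actual HAMMER2 code.
     */
    #include <stdint.h>

    #define ZONE_BYTES      (2ULL * 1024 * 1024 * 1024)     /* 2GB zone   */
    #define HDRBLOCK_BYTES  (64ULL * 1024)                   /* 64KB block */

    static inline uint64_t
    freemap_reserved_offset(uint64_t zone, int copy, int level)
    {
            /* copy is 0..7, level is 1..4 */
            uint64_t block = 1 + (uint64_t)copy * 4 + (uint64_t)(level - 1);

            return (zone * ZONE_BYTES + block * HDRBLOCK_BYTES);
    }

With this layout, copy #0 level 1 lands on block 1 and copy #7 level 4 lands
on block 32, matching the assignments listed earlier.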
One advantage of using the generic indirect-block model is that smaller
filesystems won't actually use a 5-level scheme. A 16GB filesystem can use 8
blockrefs in the volume header which point directly to layer 1 leaf blocks.
A 16TB filesystem can be managed with only three levels (layers 3, 2, and 1
only, where the 8 x layer 3 blockrefs are stored in the volume header). And
so forth.

At the moment we have no plans to return any of the unused 4MB zone header
space (per 2GB of storage) back to the filesystem for general use. There are
lots of things we may want to use the reserved areas for in the future.

Emergency Deletions

All filesystem modifications, including deletions, must allocate blocks in
order to update the main topology all the way to the root. H2 will reserve
roughly 5% of the available blocks in the filesystem for deletions in order
to allow a system operator to recover from a filesystem-full condition.
However, due to the snapshot capability as well as the possibility of
fragmentation, it is possible for the administrator to not delete enough to
actually be able to free up blocks. Once the reserve is used up the
filesystem can become unwritable.

When this situation occurs the only way to recover is to update blocks
in-place. Updating blocks in-place will destroy the data on any related
snapshots or otherwise corrupt the snapshots. Emergency recovery thus
recommends that all related snapshots be destroyed. You can choose not to do
this, in which case your snapshots might wind up containing broken links and
generating CRC failure messages.

For the moment the spec for dealing with these situations remains
incomplete.
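The ~5% deletion reserve could be enforced with a check along the following
lines. This is a hypothetical sketch only; the names, the threshold
handling, and the policy details are assumptions and are part of the
incomplete spec mentioned above.

    /*
     * Hypothetical sketch of the ~5% deletion reserve: ordinary
     * allocations fail once free space falls to the reserve, while
     * allocations made on behalf of deletions may continue to dig into
     * it. Names and policy details are assumptions.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define DELETE_RESERVE_PCT      5

    static bool
    allocation_allowed(uint64_t free_bytes, uint64_t total_bytes,
                       bool for_deletion)
    {
            uint64_t reserve = total_bytes / 100 * DELETE_RESERVE_PCT;

            if (for_deletion)
                    return (free_bytes > 0);
            return (free_bytes > reserve);
    }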