                        HAMMER2 DESIGN DOCUMENT

                             Matthew Dillon
                          dillon@backplane.com
                              08-Feb-2012

* These features have been specified in the media structures.

* Implementation work has begun.

* A working filesystem with some features implemented is expected by July
  2012.

* A fully functional filesystem with most (but not all) features is
  expected by the end of 2012.

* All elements of the filesystem have been designed except for the freemap
  (which isn't needed for initial work).  8MB per 2GB of filesystem storage
  has been reserved for the freemap.  The design of the freemap is expected
  to be fully specified by mid-year.

* This is my only project this year.  I'm not going to be doing any major
  kernel bug hunting this year.

                              Feature List

* Multiple roots (allowing snapshots to be mounted).  This is implemented
  via the super-root concept.  When mounting a HAMMER2 filesystem you
  specify a device path and a directory name in the super-root.

* HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  in the tree can be configured as a PFS, causing all elements recursively
  underneath that directory to become a part of that PFS.

* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  show up in the super-root.  It is possible to snapshot a subdirectory and
  then later snapshot a parent of that subdirectory... there really are no
  limitations here.

* Directory sub-hierarchy based quotas and space and inode usage tracking.
  Any directory sub-tree, whether at a mount point or not, tracks aggregate
  inode use and data space use.  This is stored in the directory inode all
  the way up the chain.

* Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  block-oriented and copy-on-write, each blockref tracks both direct
  modifications to the referenced data via (modify_tid) and indirect
  modifications to the referenced data or any sub-tree via (mirror_tid).
  This makes it possible to do an incremental scan of meta-data that covers
  only changes made since the mirror_tid recorded in a prior run (see the
  scan sketch under FREEMAP NOTES below).  This feature is also intended to
  be used to locate recently allocated blocks so that the freemap can be
  fixed up after a crash.

  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in that
  HAMMER2 does not keep track of 'deleted' records.  Instead, any recursion
  by the mirroring code which finds that (modify_tid) has been updated must
  also send the direct or indirect block table state it winds up recursing
  through, so the target can check similar key ranges and locate elements
  to be deleted.  This overhead can be avoided if the mirroring stream is
  mostly caught up, because very recent deletions will be cached in memory
  and can be queried, allowing shorter record deletions to be passed in the
  stream instead.

* Multiple compression algorithms will be supported, configurable on a
  subdirectory tree basis and on a file basis.  Up to 64K block compression
  will be used.  Only compression ratios near powers of 2 that are at least
  2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in this scheme because physical
  block allocations in HAMMER2 are always power-of-2.  Compression
  algorithm #0 will mean no compression and no zero-checking.  Compression
  algorithm #1 will mean zero-checking but no other compression.  Real
  compression will be supported starting with algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, will be supported as compression algorithm #1.
  This allows the writing of 0's to create holes and will be the default
  compression algorithm for HAMMER2.
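  Here is a rough sketch of what these two pieces boil down to.  This is
  illustrative only: the function names and the 64-byte minimum allocation
  size are my assumptions, not actual HAMMER2 code.

      #include <stddef.h>
      #include <stdint.h>

      /*
       * Sketch of compression algorithm #1 (zero detection).  If the
       * logical block is entirely zero no physical block is allocated
       * at all; the write becomes a hole.  Assumes an 8-byte aligned
       * and sized buffer.
       */
      int
      buffer_is_zero(const void *data, size_t bytes)
      {
          const uint64_t *scan = data;
          size_t i;

          for (i = 0; i < bytes / sizeof(*scan); ++i) {
              if (scan[i] != 0)
                  return 0;
          }
          return 1;
      }

      /*
       * Physical allocations are always power-of-2, so a compressed
       * result only saves space when it fits in a power-of-2 block no
       * larger than half the logical block size (a 2:1, 4:1, 8:1, ...
       * reduction).  The 64-byte minimum is assumed for illustration.
       */
      size_t
      compressed_alloc_size(size_t logical_bytes, size_t compressed_bytes)
      {
          size_t psize = 64;

          while (psize < compressed_bytes)
              psize <<= 1;
          if (psize <= logical_bytes / 2)
              return psize;          /* compression is worthwhile */
          return logical_bytes;      /* store uncompressed */
      }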
* Copies support for redundancy.  Each copy has its own blockref.  The
  blockrefs representing the copies must exist within the same blockset
  (set of 8 blockrefs), though I may relax this requirement in the
  implementation.

  The design is such that the filesystem should be able to function at
  full speed even if disks are pulled or inserted, as long as at least one
  good copy is present.  A background task will be needed to resynchronize
  missing copies (or remove excess copies in the case where the copies
  value is reduced on a live filesystem).

  Copies are specified using the same copyinfo[] array that is used to
  specify cluster interconnections for PFS's.

* Clusterable with MESI cache coherency and dynamic granularity.  The
  media format for HAMMER1 was less conducive to logical clustering than I
  had hoped, so I was never able to get that aspect of my personal goals
  working with HAMMER1.  HAMMER2 effectively solves the issues that
  cropped up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect
  the logical file/directory hierarchy, making cache coherency very
  difficult).

* Hardlinks will be supported.  All other standard features will be
  supported too, of course.  Hardlinks in this sort of filesystem require
  significant work.

* The media blockref structure is now large enough to support up to a
  192-bit check value, which would typically be a cryptographic hash of
  some sort.  Multiple check value algorithms will be supported, with the
  default being a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Non-verified deduplication will be supported as a configurable option on
  a file or subdirectory tree basis.  Non-verified deduplication would use
  the largest available check code (192 bits) and not bother to verify
  that the data actually matches during the dedup pass.  This is necessary
  on extremely large filesystems with a great deal of deduplicable data,
  as otherwise a large chunk of the media would have to be read to
  implement the dedup.  This feature is intended only for those files
  where occasional corruption is ok, such as in a large data store of
  farmed web content.

                             GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the
filesystem, which is very different from HAMMER1's B-Tree design.  Because
the design is copy-on-write it can be trivially snapshotted simply by
referencing an existing block, and because the media structures logically
match a standard filesystem directory/file hierarchy, snapshots and other
similar operations can be trivially performed on an entire subdirectory
tree at any level in the filesystem.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This
forms the basis for crash recovery.  All disk writes are to new blocks
except for the volume header, thus allowing all writes to run concurrently
except for the volume header update at the end.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the cache can be flushed at
any time WITHOUT having to allocate yet another new block when further
modifications are made, as long as the volume header has not yet been
flushed.
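All of this machinery revolves around the blockref.  Here is a rough
sketch that pulls together the media-level fields implied by the feature
list above: per-block compression and check-algorithm selectors, room for
a 192-bit check value, the two transaction ids, and the 8-blockref
blockset that copies must share.  Every name and field size here is my
assumption for illustration, not the actual media structure definitions.

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t hammer2_tid_t;

    struct blockref {
        uint8_t       type;        /* data, indirect, inode, ... */
        uint8_t       comp_algo;   /* 0=none, 1=zero-check, 2+=real */
        uint8_t       check_algo;  /* e.g. 32-bit iSCSI CRC (default) */
        uint8_t       keybits;
        uint32_t      reserved;
        uint64_t      key;
        hammer2_tid_t modify_tid;  /* direct modification */
        hammer2_tid_t mirror_tid;  /* any change in this sub-tree */
        uint64_t      data_off;    /* media offset, power-of-2 radix */
        uint8_t       check[24];   /* room for a 192-bit check value */
    };

    /* Copies of a block for redundancy must live in the same blockset. */
    struct blockset {
        struct blockref blockref[8];
    };

    /*
     * Non-verified dedup trusts a match of the largest (192-bit) check
     * code; verified dedup would additionally read back and compare the
     * data itself before collapsing the two references.
     */
    int
    dedup_check_match(const struct blockref *a, const struct blockref *b)
    {
        return (a->check_algo == b->check_algo &&
                memcmp(a->check, b->check, sizeof(a->check)) == 0);
    }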
Because intermediate modifications are cached and aggregated this way,
buffer cache overhead is very well bounded and the filesystem can handle
operations of any complexity, even on boxes with very small amounts of
physical memory.

I intend to implement a shortcut to make fsync()'s run fast: allow deep
updates to blockrefs to shortcut to auxiliary space in the volume header
to satisfy the fsync requirement.  The related blockref is then recorded
when the filesystem is mounted after a crash, and the update chain is
reconstituted when a matching blockref is encountered again during normal
operation of the filesystem.  Basically this means that no real work needs
to be done at mount-time even after a crash.

Directories are hashed, and another major design element is that directory
entries ARE INODES.  They are one and the same.  In addition to directory
entries being inodes, the data for very small files (512 bytes or smaller)
can be directly embedded in the inode (overloaded onto the same space that
the direct blockref array uses).  This should result in very high
performance (see the inode layout sketch below).

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolutely necessary feature in order to support the
hugely flexible snapshots that we want to have in HAMMER2.

                               HARDLINKS

Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
a spatial reference to the inode number.  We do not want to require an
index of inode numbers for any basic HAMMER2 feature if we can help it.

Hardlinks are handled by placing the inode for a multiply-hardlinked file
in the closest common parent directory.  If "a/x" and "a/y" are hardlinked,
the inode for the hardlinked file will be placed in directory "a", e.g.
"a/3239944", but it will be invisible and will reside in an out-of-band
namespace.  The directory entries "a/x" and "a/y" will be given the same
inode number but in fact just be placemarks that cause HAMMER2 to recurse
upwards through the directory tree to find the invisible inode.

Because directories are hashed and a different namespace (hash key range)
is used for hardlinked inodes, standard directory scans are able to
trivially skip this invisible namespace, and inode-specific lookups can
restrict their lookup to within this space.

The nature of snapshotting makes handling the link-count 2->1 and 1->2
cases trivial.  Basically the inode media structure is copied as needed to
break up or re-form the standard directory entry/inode.  There are no
backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
NOTES below), so it is an utterly trivial operation.

                             FREEMAP NOTES

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  The freemap, which is
still under design, just won't do that.  All the freemap does is keep
track of 100% free blocks.

This not only trivializes all the snapshot features, it also trivializes
hardlink handling and solves the problem of keeping the freemap
synchronized in the event of a crash.  Now all we have to do after a crash
is make sure that blocks allocated before the freemap was flushed are
properly marked as allocated in the allocmap.  This is a trivial exercise
using the same algorithm the mirror streaming code uses (which is very
similar to HAMMER1)... an incremental meta-data scan that covers only the
blocks that might have been allocated between the last allocation map sync
and now (see the scan sketch below).  Thus the freemap does not have to be
synchronized during a fsync().
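Tying the last few sections together, here is a sketch of an inode that is
simultaneously a directory entry, can embed the data of files 512 bytes or
smaller, and carries the flag used to mark hardlink placemark entries.
All names and sizes are my assumptions for illustration, not the actual
media structures.

    #include <stdint.h>

    #define EMBEDDED_BYTES   512   /* assumed small-file threshold */
    #define DIRECT_BLOCKREFS 8

    struct blockref_stub {         /* stands in for the media blockref */
        uint8_t bytes[64];
    };

    struct inode {
        uint64_t inum;             /* stored in the inode itself */
        uint64_t size;
        uint8_t  flags;            /* e.g. hardlink placemark bit */
        uint16_t name_len;         /* the entry name: this IS the dirent */
        uint8_t  name[256];
        union {
            /* normal case: direct blockrefs to data or indirect blocks */
            struct blockref_stub direct[DIRECT_BLOCKREFS];
            /* files <= 512 bytes: data overloads the same space */
            uint8_t data[EMBEDDED_BYTES];
        } u;
    };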
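The incremental scan used by both the mirroring stream and this post-crash
freemap fixup might look like the following.  The types and the visit
callback are hypothetical stand-ins; the point is that (mirror_tid) lets
whole unchanged sub-trees be pruned without ever being read.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t hammer2_tid_t;

    struct bref {
        hammer2_tid_t modify_tid;  /* direct change to this block */
        hammer2_tid_t mirror_tid;  /* any change in this sub-tree */
        struct bref  *children;    /* NULL for a leaf */
        size_t        nchildren;
    };

    void
    incremental_scan(const struct bref *bref, hammer2_tid_t since_tid,
                     void (*visit)(const struct bref *))
    {
        size_t i;

        if (bref->mirror_tid <= since_tid)
            return;                /* nothing below changed: prune */

        if (bref->modify_tid > since_tid)
            visit(bref);           /* emit to the mirror stream, or mark
                                    * the block allocated in the freemap */

        for (i = 0; i < bref->nchildren; ++i)
            incremental_scan(&bref->children[i], since_tid, visit);
    }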
The complexity is in figuring out what can be freed... that is, when one
can mark blocks in the freemap as being free.  HAMMER2 implements this as
a background task which essentially must scan available meta-data to
determine which blocks are not being referenced.

Part of the ongoing design work is finding ways to reduce the scope of
this meta-data scan so the entire filesystem's meta-data does not need to
be scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area where we
can continue to improve as the filesystem matures.  Not only that, but we
can completely change the freemap algorithms without creating
incompatibilities (at worst simply having to require that a R+W mount do a
full meta-data scan when upgrading or downgrading the freemap algorithm).

                              CLUSTERING

Clustering, as always, is the most difficult bit, but we have some
advantages with HAMMER2 that we did not have with HAMMER1.  First,
HAMMER2's media structures generally follow the kernel's filesystem
hierarchy.  Second, HAMMER2's writable snapshots make it possible to
implement several forms of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into the
cluster.  This can be local media, or it can directly specify a network
cluster connection (or several).  When a local media mount is used the
volume header is scanned for local copies and the best volume header is
selected from all available copies.  Multiple devices may be specified for
redundancy.

The volume header on local media also contains cluster connection
specifications keyed by super-root pfsid.  Network connections are
maintained to all targets.  ALL ELEMENTS ARE TREATED ACCORDING TO TYPE
NO MATTER WHICH ONE YOU MOUNT FROM.

The actual networked cluster may be far larger than the elements you list
in the hammer2_copy_data[] array, but your machine will only make direct
connections as specified by the array.

In the simplest case you simply network a few machines together as ring 0
masters and each client connects directly to all the masters (and/or the
clients are the masters themselves).  Thus any quorum operation is
straightforward.  These master nodes are labeled 'ring 0'.

If you have too many clients to reasonably connect directly, you set up
sub-clusters as satellites.  This is called 'ring 1'.  Ring 1 may contain
several sub-clusters.  A client then connects to all the nodes in a
particular sub-cluster (typically 3).  The quorum protocol runs as per
normal except that once the operation is resolved against the sub-cluster,
an aggregation must be resolved against the master nodes (ring 0).  The
sub-cluster does this for the client... all the client sees is the normal
quorum operation against the sub-cluster.

Since each node in the sub-cluster connects to all master nodes we get a
multiplication.  If we set a reasonable upper limit of, say, 256
connections at each master node, then ring 1 may contain 85 sub-clusters
with 3 nodes in each sub-cluster.

In the most complex case, when one wishes to support potentially millions
of clients, further fan-out is required into ring 2, ring 3, and so forth.
However, each sub-cluster in ring 2 must only connect to 1 sub-cluster in
ring 1 (otherwise the cache state will become mightily confused).  Using
reasonable metrics this will allow ring 2 to contain 85 * 85 = 7225
sub-clusters.  (The sketch below works through this fan-out arithmetic.)
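To make the fan-out arithmetic concrete, here is a throwaway calculation
using the figures assumed in the text (a 256-connection budget per master,
3-node sub-clusters, and 1000 clients per ring 2 sub-cluster):

    #include <stdio.h>

    /*
     * Each node in a ring 1 sub-cluster connects to every master, so a
     * 3-node sub-cluster consumes 3 of each master's connection budget.
     * Each ring 2 sub-cluster connects to exactly one ring 1 sub-cluster.
     */
    int
    main(void)
    {
        int budget   = 256;     /* connections per master node */
        int subnodes = 3;       /* nodes per sub-cluster */
        int clients  = 1000;    /* clients per ring 2 sub-cluster */

        int ring1 = budget / subnodes;   /* 85 sub-clusters */
        int ring2 = ring1 * ring1;       /* 85 * 85 = 7225 */

        printf("ring 1: %d sub-clusters\n", ring1);
        printf("ring 2: %d sub-clusters, ~%d clients\n",
               ring2, ring2 * clients);  /* ~7.2 million */
        return 0;
    }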
At this point you could have 1000 clients connect to each sub-cluster and
support 7.2 million clients, and if that isn't enough, going to another
ring will support 61M clients, and so forth.

Each ring imposes additional latencies for cache operations, but the key
to making this work efficiently is that the satellite clusters can
negotiate coarse-grained cache coherency locks with the next lower ring
and then fan out finer-grained locks to the next higher ring.  Since
caching can occur anywhere (including on the connecting client), it is the
cache coherency lock that ultimately dictates efficiency and allows a
client (or satellite) to access large amounts of data from local storage.

Modifying operations, particularly commits, also have higher latencies
when multiple rings are in use.  In this situation it is possible to
short-cut localized operations by having competing clients connect to
sub-clusters which are near each other topologically... having the
competing clients connect to the same sub-cluster would be optimal.

In addition, sub-clusters (typically in ring 1) can act in SOFT_MASTER
mode, which allows the sub-cluster to acknowledge a full commit within its
own quorum only and then resolve asynchronously to the masters in ring 0.

The nodes in these intermediate rings can be pure proxies with only memory
caches, use local media for persistent cache, or use local media to
completely slave the filesystem.

    ADMIN       - Media does not participate; administrative proxy only.

    CLIENT      - Media does not participate; client only.

    CACHE       - Media only acts as a persistent cache.

    COPY        - Media only acts as a local copy.

    SLAVE       - Media is a RO slave that can be mounted RW.

    SOFT_SLAVE  - A SLAVE which can become writable when the quorum is
                  not available, but is not guaranteed to be able to be
                  merged back when the quorum becomes available again.
                  Elements which cannot be merged back remain localized
                  and writable until manual or scripted intervention
                  recombines them.

    SOFT_MASTER - Similar to SOFT_SLAVE, but can form a sub-cluster and
                  run the quorum protocol within the sub-cluster to
                  serve machines that connect to the sub-cluster when
                  the master cluster is not available.  The SOFT_MASTER
                  nodes in a sub-cluster must be fully interconnected
                  with each other.

    MASTER      - A MASTER node in the quorum protocol.  The MASTER
                  nodes in a cluster must be fully interconnected with
                  each other.

(A capability sketch of these node types appears at the end of this
document.)

There are four major protocols:

    Quorum protocol

        This protocol is used between MASTER nodes to vote on operations
        and resolve deadlocks.  It is also used between SOFT_MASTER nodes
        in a sub-cluster to vote on operations, resolve deadlocks,
        determine the latest transaction id for an element, and perform
        commits.

    Cache sub-protocol

        This is the MESI sub-protocol which runs under the Quorum
        protocol.  It is used to maintain cache state for sub-trees to
        ensure that operations remain cache coherent.

        Depending on administrative rights, this protocol may or may not
        allow a leaf node in the cluster to hold a cache element
        indefinitely.  The administrative controller may preemptively
        downgrade a leaf with insufficient administrative rights without
        giving it a chance to synchronize any modified state back to the
        cluster.

    Proxy protocol

        The Quorum and Cache protocols only operate between MASTER and
        SOFT_MASTER nodes.  All other node types must use the Proxy
        protocol to perform similar actions.
        The Proxy protocol differs in that proxy requests are typically
        sent to just one adjacent node, and that node then maintains
        state and forwards the request or performs the required
        operation.  When the link is lost to the proxy, the proxy
        automatically forwards a deletion of the state to the other
        nodes, based on what it has recorded.

        If a leaf has insufficient administrative rights it may not be
        allowed to actually initiate a quorum operation, and may only be
        allowed to maintain partial MESI cache state or perhaps none at
        all (since cache state can block other machines in the cluster).
        Instead, a leaf with insufficient rights will have to make do
        with a preemptive loss of cache state, and any allowed modifying
        operations will have to be forwarded to the proxy, which keeps
        forwarding them until a node with sufficient administrative
        rights is encountered.

        To reduce issues and give the cluster more breadth, sub-clusters
        made up of SOFT_MASTERs can be formed in order to provide full
        cache coherency within a subset of machines while still tying
        them into a greater cluster that they normally would not have
        such access to.  This effectively makes it possible to create a
        two- or three-tier fan-out of groups of machines which are
        cache-coherent within the group, but perhaps not between groups,
        and to use other means to synchronize between the groups.

    Media protocol

        This is basically the physical media protocol.

There are lots of ways to implement multi-master environments using the
above core features, but the implementation is going to be fairly complex
even with HAMMER2's feature set.

Keep in mind that modifications propagate all the way to the super-root
and volume header, so in any clustered arrangement the use of (modify_tid)
and (mirror_tid) is critical in determining the synchronization state of
portion(s) of the filesystem.

Specifically, since any modification propagates to the root, the
(mirror_tid) in higher level directories is going to be in a constant
state of flux.  This state of flux DOES NOT invalidate the cache state for
those higher levels of directories.  Instead, (modify_tid) is used on a
node-by-node basis to determine cache state at any given level, while
(mirror_tid) is used to determine whether any recursively underlying state
is desynchronized (see the sketch below).

The inode structure also has two additional transaction ids used to
optimize path lookups, stat, and directory lookup/scan operations.
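Here is a sketch of how the two transaction ids separate "this node
changed" from "something underneath changed".  The names are hypothetical
stand-ins; cached_tid represents whatever transaction id a node recorded
when it cached the element.

    #include <stdint.h>

    typedef uint64_t hammer2_tid_t;

    struct node_tids {
        hammer2_tid_t modify_tid;  /* direct change to this node */
        hammer2_tid_t mirror_tid;  /* any change in this sub-tree */
    };

    /*
     * A parent's mirror_tid churns constantly as changes propagate to
     * the root, but that alone does not invalidate the cached node;
     * only a newer modify_tid does.
     */
    int
    node_cache_valid(const struct node_tids *n, hammer2_tid_t cached_tid)
    {
        return (n->modify_tid <= cached_tid);
    }

    /*
     * A newer mirror_tid merely says some recursively underlying state
     * is desynchronized and the children must be revisited.
     */
    int
    subtree_desynchronized(const struct node_tids *n,
                           hammer2_tid_t cached_tid)
    {
        return (n->mirror_tid > cached_tid);
    }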
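Finally, the node types listed under CLUSTERING reduce to a small
capability matrix.  The breakdown below is my reading of the descriptions
above, for illustration only.

    /* Cluster node types and an illustrative capability table. */
    enum node_type {
        NODE_ADMIN,        /* administrative proxy, no media */
        NODE_CLIENT,       /* client only, no media */
        NODE_CACHE,        /* media is a persistent cache */
        NODE_COPY,         /* media is a local copy */
        NODE_SLAVE,        /* RO slave, can be mounted RW */
        NODE_SOFT_SLAVE,   /* SLAVE that can go writable without quorum */
        NODE_SOFT_MASTER,  /* sub-cluster quorum member */
        NODE_MASTER        /* ring 0 quorum member */
    };

    struct node_caps {
        unsigned has_media   : 1;  /* local media participates */
        unsigned runs_quorum : 1;  /* speaks the Quorum/Cache protocols */
        unsigned needs_proxy : 1;  /* must use the Proxy protocol */
    };

    const struct node_caps caps[] = {
        [NODE_ADMIN]       = { 0, 0, 1 },
        [NODE_CLIENT]      = { 0, 0, 1 },
        [NODE_CACHE]       = { 1, 0, 1 },
        [NODE_COPY]        = { 1, 0, 1 },
        [NODE_SLAVE]       = { 1, 0, 1 },
        [NODE_SOFT_SLAVE]  = { 1, 0, 1 },
        [NODE_SOFT_MASTER] = { 1, 1, 0 },
        [NODE_MASTER]      = { 1, 1, 0 },
    };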