From 5a9a531c724aee428513e9b7c4306b2b6f80d60d Mon Sep 17 00:00:00 2001
From: Matthew Dillon
Date: Wed, 8 Feb 2012 19:08:47 -0800
Subject: [PATCH] hammer2 - More media format spec work, add DESIGN document

* Add additional fields to support up to 256 configured copyid's in the
  volume header.  Directory subtree copies are still limited to 8 but
  this will allow us to implement many different copy sets for different
  sub-trees.

* Add a flags field to the blockref for synchronization and
  desynchronization flags.

* Rename allocmap to freemap and add a freemap_version field which is
  separate from the global version field, which will allow a HAMMER2
  volume to at least be mounted read-only on older versions of HAMMER2
  if the only difference is in the freemap handling algorithm.

* Add the DESIGN document.
---
 sys/vfs/hammer2/DESIGN         | 423 +++++++++++++++++++++++++++++++++
 sys/vfs/hammer2/hammer2_disk.h |  32 ++-
 2 files changed, 443 insertions(+), 12 deletions(-)
 create mode 100644 sys/vfs/hammer2/DESIGN

diff --git a/sys/vfs/hammer2/DESIGN b/sys/vfs/hammer2/DESIGN
new file mode 100644
index 0000000000..f5bba629eb
--- /dev/null
+++ b/sys/vfs/hammer2/DESIGN
@@ -0,0 +1,423 @@
+
+			    HAMMER2 DESIGN DOCUMENT
+
+* These features have been spec'd in the media structures.
+
+* Implementation work has begun.
+
+* A working filesystem with some features implemented is expected by
+  July 2012.
+
+* A fully functional filesystem with most (but not all) features is expected
+  by the end of 2012.
+
+* All elements of the filesystem have been designed except for the freemap
+  (which isn't needed for initial work).  8MB per 2GB of filesystem
+  storage has been reserved for the freemap.  The design of the freemap
+  is expected to be completely spec'd by mid-year.
+
+* This is my only project this year.  I'm not going to be doing any major
+  kernel bug hunting this year.
+
+				Feature List
+
+* Multiple roots (allowing snapshots to be mounted).  This is implemented
+  via the super-root concept.  When mounting a HAMMER2 filesystem you specify
+  a device path and a directory name in the super-root.
+
+* HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
+  in the tree can be configured as a PFS, causing all elements recursively
+  underneath that directory to become a part of that PFS.
+
+* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
+  show up in the super-root.  It is possible to snapshot a subdirectory
+  and then later snapshot a parent of that subdirectory... really there are
+  no limitations here.
+
+* Directory sub-hierarchy based quotas and space and inode usage tracking.
+  Any directory sub-tree, whether at a mount point or not, tracks aggregate
+  inode use and data space use.  This is stored in the directory inode all
+  the way up the chain.
+
+* Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
+  block-oriented and copy-on-write, each blockref tracks both direct
+  modifications to the referenced data via (modify_tid) and indirect
+  modifications to the referenced data or any sub-tree via (mirror_tid).
+  This makes it possible to do an incremental scan of meta-data that covers
+  only changes made since the mirror_tid recorded in a prior run.
+
+  This feature is also intended to be used to locate recently allocated
+  blocks and thus be able to fix up the freemap after a crash.
+
+  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
+  that HAMMER2 does not keep track of 'deleted' records.  Instead, any
+  recursion by the mirroring code which finds that (modify_tid) has
+  been updated must also send the direct block table or indirect block
+  table state it winds up recursing through so the target can check
+  similar key ranges and locate elements to be deleted.  This can be
+  avoided if the mirroring stream is mostly caught up, in which case very
+  recent deletions will still be cached in memory and can be queried,
+  allowing short record-deletion updates to be passed in the stream instead.
+
+* Will support multiple compression algorithms configured on a subdirectory
+  tree basis and on a file basis.  Up to 64K block compression will be used.
+  Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1,
+  4:1, 8:1, etc.) will work in this scheme because physical block allocations
+  in HAMMER2 are always power-of-2.
+
+  Compression algorithm #0 will mean no compression and no zero-checking.
+  Compression algorithm #1 will mean zero-checking but no other compression.
+  Real compression will be supported starting with algorithm #2.
+
+* Zero detection on write (writing all-zeros), which requires the data
+  buffer to be scanned, will be supported as compression algorithm #1.
+  This allows the writing of 0's to create holes and will be the default
+  compression algorithm for HAMMER2.
+
+* Copies support for redundancy.  The media blockref structure would
+  have become too bloated but I found a clean way to do copies using the
+  blockset structure (which is a set of 8 fully associative blockref's).
+
+  The design is such that the filesystem should be able to function at
+  full speed even if disks are pulled or inserted, as long as at least one
+  good copy is present.  A background task will be needed to resynchronize
+  missing copies (or remove excessive copies in the case where the copies
+  value is reduced on a live filesystem).
+
+* Intended to be clusterable, with a multi-master protocol under design
+  but not expected to be fully operational until mid-2013.  The media
+  format for HAMMER1 was less conducive to logical clustering than I had
+  hoped so I was never able to get that aspect of my personal goals
+  working with HAMMER1.  HAMMER2 effectively solves the issues that cropped
+  up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
+  file/directory hierarchy, making cache coherency very difficult).
+
+* Hardlinks will be supported.  All other standard features will be
+  supported too, of course.  Hardlinks in this sort of filesystem require
+  significant work.
+
+* The media blockref structure is now large enough to support up to a 192-bit
+  check value, which would typically be a cryptographic hash of some sort.
+  Multiple check value algorithms will be supported with the default being
+  a simple 32-bit iSCSI CRC.
+
+* Fully verified deduplication will be supported and automatic (and
+  necessary in many respects).
+
+* Non-verified deduplication will be supported as a configurable option on
+  a file or subdirectory tree.  Non-verified deduplication would use the
+  largest available check code (192 bits) and not bother to verify data
+  matches during the dedup pass, which is necessary on extremely large
+  filesystems with a great deal of deduplicable data (as otherwise a large
+  chunk of the media would have to be read to implement the dedup).
+
+  This feature is intended only for those files where occasional corruption
+  is ok, such as in a large data store of farmed web content.
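+
+  As a minimal sketch only (the type and function names here are
+  illustrative, not actual HAMMER2 code), a non-verified dedup candidate
+  test reduces to comparing the stored 192-bit check codes without ever
+  reading the underlying data:
+
+	#include <stdint.h>
+	#include <string.h>
+
+	/* 192-bit check code as described above */
+	typedef struct { uint8_t data[24]; } check192_t;
+
+	/*
+	 * Two blocks are treated as duplicates when their check codes
+	 * match; the block data itself is never read or compared.
+	 */
+	static int
+	dedup_match_nonverified(const check192_t *a, const check192_t *b)
+	{
+		return (memcmp(a->data, b->data, sizeof(a->data)) == 0);
+	}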
+
+				GENERAL DESIGN
+
+HAMMER2 generally implements a copy-on-write block design for the filesystem,
+which is very different from HAMMER1's B-Tree design.  Because the design
+is copy-on-write it can be trivially snapshotted simply by referencing an
+existing block, and because the media structures logically match a standard
+filesystem directory/file hierarchy, snapshots and other similar operations
+can be trivially performed on an entire subdirectory tree at any level in
+the filesystem.
+
+The copy-on-write nature of the filesystem implies that any modification
+whatsoever will have to eventually synchronize new disk blocks all the way
+to the super-root of the filesystem and the volume header itself.  This forms
+the basis for crash recovery.  All disk writes are to new blocks except for
+the volume header, thus allowing all writes to run concurrently except for
+the volume header update at the end.
+
+Clearly this method requires intermediate modifications to the chain to be
+cached so multiple modifications can be aggregated prior to being
+synchronized.  One advantage, however, is that the cache can be flushed at
+any time WITHOUT having to allocate yet another new block when further
+modifications are made as long as the volume header has not yet been flushed.
+This means that buffer cache overhead is very well bounded and can handle
+filesystem operations of any complexity even on boxes with very small amounts
+of physical memory.
+
+I intend to implement a shortcut to make fsync()'s run fast, and that is to
+allow deep updates to blockrefs to shortcut to auxiliary space in the
+volume header to satisfy the fsync requirement.  The related blockref is
+then recorded when the filesystem is mounted after a crash and the update
+chain is reconstituted when a matching blockref is encountered again during
+normal operation of the filesystem.
+
+Basically, this means that no real work needs to be done at mount-time
+even after a crash.
+
+Directories are hashed, and another major design element is that directory
+entries ARE INODES.  They are one and the same.  In addition to directory
+entries being inodes, the data for very small files (512 bytes or smaller)
+can be directly embedded in the inode (overloaded onto the same space that
+the direct blockref array uses).  This should result in very high
+performance.
+
+Inode numbers are not spatially referenced, which complicates NFS servers
+but doesn't complicate anything else.  The inode number is stored in the
+inode itself, an absolutely necessary feature in order to support the
+hugely flexible snapshots that we want to have in HAMMER2.
+
+				  HARDLINKS
+
+Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
+a spatial reference to the inode number.  We do not want to have to maintain
+an index of inode numbers for any basic HAMMER2 feature if we can help it.
+
+Hardlinks are handled by placing the inode for a multiply-hardlinked file
+in the closest common parent directory.  If "a/x" and "a/y" are hardlinked,
+the inode for the hardlinked file will be placed in directory "a", e.g.
+"a/3239944", but it will be invisible and will be in an out-of-band namespace.
+The directory entries "a/x" and "a/y" will be given the same inode number
+but will in fact just be placemarks that cause HAMMER2 to recurse upwards
+through the directory tree to find the invisible inode number.
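+
+As a sketch only (every name below is illustrative, not the actual
+HAMMER2 code), that upward recursion amounts to a simple walk toward the
+super-root, probing each directory's out-of-band namespace for the
+hidden inode:
+
+	#include <stddef.h>
+	#include <stdint.h>
+
+	struct h2_inode {
+		struct h2_inode	*parent;	/* NULL at the super-root */
+		/* ... */
+	};
+
+	/* stand-in: look 'inum' up in dir's out-of-band hash range */
+	struct h2_inode *lookup_hidden_ns(struct h2_inode *dir,
+					  uint64_t inum);
+
+	struct h2_inode *
+	resolve_hardlink(struct h2_inode *dir, uint64_t inum)
+	{
+		struct h2_inode *ip;
+
+		while (dir != NULL) {
+			ip = lookup_hidden_ns(dir, inum);
+			if (ip != NULL)
+				return (ip);	/* hidden inode found */
+			dir = dir->parent;
+		}
+		return (NULL);	/* placemark without an inode: corrupt */
+	}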
+
+Because directories are hashed and a different namespace (hash key range)
+is used for hardlinked inodes, standard directory scans are able to trivially
+skip this invisible namespace, and inode-specific lookups can restrict their
+lookup to within this space.
+
+The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
+trivial.  Basically, the inode media structure is copied as needed to break
+up or re-form the standard directory entry/inode.  There are no backpointers
+in HAMMER2 and no reference counts on the blocks (see FREEMAP NOTES below),
+so it is an utterly trivial operation.
+
+				FREEMAP NOTES
+
+In order to implement fast snapshots (and writable snapshots for that
+matter), HAMMER2 does NOT ref-count allocations.  The freemap, which
+is still under design, just won't do that.  All the freemap does is
+keep track of 100% free blocks.
+
+This not only trivializes all the snapshot features, it also trivializes
+hardlink handling and solves the problem of keeping the freemap synchronized
+in the event of a crash.  Now all we have to do after a crash is make
+sure blocks allocated before the freemap was flushed are properly
+marked as allocated in the freemap.  This is a trivial exercise using the
+same algorithm the mirror streaming code uses (which is very similar to
+HAMMER1)... an incremental meta-data scan that covers only the blocks that
+might have been allocated between the last freemap sync and now.
+
+Thus the freemap does not have to be synchronized during an fsync().
+
+The complexity is in figuring out what can be freed... that is, when one
+can mark blocks in the freemap as being free.  HAMMER2 implements this as
+a background task which essentially must scan available meta-data to
+determine which blocks are not being referenced.
+
+Part of the ongoing design work is finding ways to reduce the scope of this
+meta-data scan so the entire filesystem's meta-data does not need to be
+scanned (though in tests with HAMMER1, even full meta-data scans have
+turned out to be fairly low cost).  In other words, it's an area that we
+can continue to improve on as the filesystem matures.  Not only that, but
+we can completely change the freemap algorithms without creating
+incompatibilities (at worst simply having to require that an R+W mount do
+a full meta-data scan when upgrading or downgrading the freemap algorithm).
+
+				  CLUSTERING
+
+Clustering, as always, is the most difficult bit, but we have some advantages
+with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
+structures generally follow the kernel's filesystem hierarchy.  Second,
+HAMMER2's writable snapshots make it possible to implement several forms
+of multi-master clustering.
+
+The general mechanics for most of the multi-master clustering implementations
+will be as follows:
+
+    (a) Use the copies mechanism to specify all elements of the cluster,
+	both local and remote (networked).
+
+    (b) The core synchronization state operates just as it does for copies,
+	simply requiring a fully-flushed ack from the remote in order to
+	mark the blocks as having been fully synchronized.
+
+	The mirror_tid may be used to locate these blocks, allowing the
+	synchronization state to be updated on the fly at a much later
+	time without requiring the state to be maintained in-memory
+	(also useful for crash recovery resynchronization purposes).
+
+    (c) Data/meta-data can be retrieved from those copies which are marked
+	as being synchronized, with priority given to the local storage
+	relative to any given physical machine.
+
+	This means that, for example, even in a master-slave orientation
+	the slave may be able to satisfy a request from a program when the
+	slave happens to be the local storage.
+
+    (d) Transaction id synchronization between all elements of the cluster,
+	typically through masking (assigning a cluster number using the low
+	3 bits of the transaction id).
+
+    (e) General access (synchronized or otherwise) may require cache
+	coherency mechanisms to run over the network.
+
+	Implementing cache coherency is a major complexity issue.
+
+    (f) General access (synchronized or otherwise) may require quorum
+	agreement, using the synchronization flags in the blockrefs
+	to determine whether agreement has been reached.
+
+	Implementing quorum voting is a major complexity issue.
+
+There are lots of ways to implement multi-master environments using the
+above core features, but the implementation is going to be fairly complex
+even with HAMMER2's feature set.
+
+Keep in mind that modifications propagate all the way to the super-root
+and volume header, so in any clustered arrangement the use of (modify_tid)
+and (mirror_tid) is critical in determining the synchronization state of
+portion(s) of the filesystem.
+
+Specifically, since any modification propagates to the root, the (mirror_tid)
+in higher-level directories is going to be in a constant state of flux.  This
+state of flux DOES NOT invalidate the cache state for these higher levels
+of directories.  Instead, the (modify_tid) is used on a node-by-node basis
+to determine cache state at any given level, and (mirror_tid) is used to
+determine whether any recursively underlying state is desynchronized.
+
+* Simple semi-synchronized multi-master environment.
+
+    In this environment all nodes are considered masters and modifications
+    can be made on any of them, and then propagate to the others
+    asynchronously via HAMMER2 mirror streams.  One difference here is
+    that the kernel can activate these userland-managed streams
+    automatically when the copies configuration is used to specify the
+    cluster.
+
+    The only type of conflict which isn't readily resolvable by comparing
+    the (modify_tid) is when file data is updated.  In this case user
+    intervention might be required but, theoretically, it should be
+    possible to automate most merges using a multi-way patch and, if not,
+    to choose one copy and create backup copies of the others so the
+    user or sysop can resolve the conflict later.
+
+* Simple fully synchronized fail-over environment.
+
+    In this environment there is one designated master and the remaining
+    nodes are slaves.  If the master fails, all remaining nodes agree on a
+    new master, possibly with the requirement that a quorum be achieved
+    (if you don't want to allow the cluster to split).
+
+    If network splits are allowed, each sub-cluster operates in this
+    mode, but recombining the clusters reverts to the first algorithm.
+    If splits are not allowed, whichever side no longer has a quorum will
+    be forced to stall.
+
+    In this environment the current designated master is responsible for
+    managing locks for modifying operations.
+    The designated master will proactively tell the other nodes to mark
+    the blocks related to the modifying operation as no longer being
+    synchronized, while any local data at the node that acquired the lock
+    (master or slave) remains marked as being synchronized.
+
+    The node that successfully gets the lock then issues the modifying
+    operation to both its local copy and to the master, marking the
+    master as being desynchronized until the master acknowledges receipt.
+
+    In this environment any node can access data from local storage if
+    the designated master copy is marked synchronized AND its (modify_tid)
+    matches the slave copy's (modify_tid).
+
+    However, if a slave disconnects from the master and then reconnects,
+    the slave will have lost the master's desynchronization stream and
+    must mark its root blockref for the master copy HAMMER2_BREF_DESYNCCHLD
+    as well as clear the SYNC1/SYNC2 bits.  Setting DESYNCCHLD forces
+    on-demand recursive reverification that the master and slave are (or
+    are not) in sync in order to reestablish on the slave the
+    synchronization state of the master.
+
+    That might be a bit confusing, but the whole point here is to allow
+    read accesses to the filesystem to be satisfied by any node in a
+    multi-master cluster, not just by the current designated master.
+
+* Fully cache coherent and synchronized multi-master environment.
+
+    In this environment a quorum is required to perform any modifying
+    action.  All nodes are masters (there is no 'designated' master)
+    and all nodes connect to all other nodes in a cross-bar.
+
+    The quorum is specified by the copies setup in the root volume
+    configuration.  A quorum of nodes in the cluster must agree on the
+    copies configuration.  If they do not, the cluster cannot proceed to
+    mount.  Any other nodes in the cluster which are not in the quorum
+    and disagree with the configuration will inherit the copies
+    configuration from the quorum.
+
+    Any modifying action will initiate a lock request locally to all nodes
+    in the cluster.  The modifying action is allowed to proceed the instant
+    a quorum of nodes respond in the affirmative (even if some have not
+    yet responded or are down).  The modifying action is considered complete
+    once the two-phase commit protocol succeeds.  The modifying action
+    typically creates and commits a temporary snapshot on at least a quorum
+    of masters as phase-1 and then ties the snapshot back into the main
+    mount as phase-2.
+
+    These locks are cache-coherency locks and may be passively maintained
+    in order to aggregate multiple operations under the same lock and thus
+    under the same transaction from the point of view of the rest of the
+    quorum.
+
+    A lock request which interferes with a passively maintained lock will
+    force the two-phase commit protocol to complete and then transfer
+    ownership to the requesting entity, thus avoiding having to deal with
+    deadlock protocols at this point in the state machine.
+
+    Since any node can initiate concurrent lock requests to many other
+    nodes, it is possible to deadlock.  When two nodes initiate conflicting
+    lock requests to the cluster, the one achieving the quorum basically
+    wins and the other is forced to retry (go back one paragraph).  In this
+    situation no deadlock will occur.
+
+    If three or more nodes initiate conflicting lock requests to the
+    cluster, a deadlock can occur whereby none of the nodes achieves a
+    quorum.  In this case every node will know which of the other nodes
+    was granted the lock(s).
+    Deadlock resolution then proceeds simultaneously on the
+    three nodes (since they have the same information), whereby the lock
+    holders on the losing end of the algorithm transfer their locks to one
+    of the other nodes.  The lock state and knowledge of the lock state are
+    updated in real time on all nodes until a quorum is achieved.
+
+* Fully cache coherent and synchronized multi-master environment with
+  passive read locking.
+
+    This is a more complex form of clustering than the previous form.
+    Take the previous form and add the ability to passively hold SHARED
+    locks in addition to the EXCLUSIVE locks the previous form is able
+    to hold.
+
+    The advantage of being able to passively hold a shared lock on a
+    sub-tree (locks can be held on single nodes or entire sub-trees) is
+    that it is then possible for all nodes to validate a node (modify_tid)
+    or an entire sub-tree (mirror_tid) with a very short network
+    transaction and then satisfy a large number of requests from local
+    storage.
+
+* Fully cache coherent and synchronized multi-master environment with
+  passive read locking and slave-only nodes.
+
+    This is the MOST complex form of clustering we intend to support.
+    In a multi-master environment requiring a quorum of masters to
+    operate, we implement all of the above and ALSO allow additional
+    nodes to be added to the cluster as slave-only nodes.
+
+    The difference between a slave-only node and setting up a manual
+    mirror-stream from the cluster to a read-only snapshot on another
+    HAMMER2 filesystem is that the slave-only node will be fully
+    cache coherent with either the cluster proper (if connected to a
+    quorum of masters), or with one or more other nodes in the cluster
+    (if not connected to a quorum of masters), EVEN if the slave itself
+    is not completely caught up.
+
+    So if the slave-only cluster node is connected to the rest of the
+    cluster over a slow connection, you basically get a combination of
+    local disk speeds for any data that is locally in sync and
+    network-limited speeds for any data that is not locally in sync.
+
+    Slave-only cluster nodes run a standard mirror-stream in the
+    background to pull in the data as quickly as possible.
+
+    This is in contrast to a manual mirror-stream to a read-only
+    snapshot (basically a simple slave), which has no ability to bypass
+    the local storage to handle out-of-date requests (in fact has no
+    ability to detect that the local storage is out-of-date anyway).
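+
+As a closing sketch (the helper names below are illustrative stand-ins,
+not the actual HAMMER2 code), the incremental scans referenced throughout
+this document reduce to a recursion gated on the two transaction ids
+carried by every blockref:
+
+	#include <stdint.h>
+
+	typedef struct h2_bref {
+		uint64_t	modify_tid;	/* direct modification */
+		uint64_t	mirror_tid;	/* recursive modification */
+		/* ... */
+	} h2_bref_t;
+
+	/* stand-ins for blockref traversal and stream output */
+	int	nchildren(h2_bref_t *bref);
+	h2_bref_t *child(h2_bref_t *bref, int i);
+	void	emit(h2_bref_t *bref);
+
+	/*
+	 * Visit only blockrefs whose referenced data or sub-tree has
+	 * changed since 'since_tid'.  mirror_tid prunes unchanged
+	 * sub-trees; modify_tid identifies directly modified elements.
+	 */
+	void
+	scan_since(h2_bref_t *bref, uint64_t since_tid)
+	{
+		int i;
+
+		if (bref->mirror_tid <= since_tid)
+			return;		/* nothing at or below changed */
+		if (bref->modify_tid > since_tid)
+			emit(bref);	/* directly modified element */
+		for (i = 0; i < nchildren(bref); ++i)
+			scan_since(child(bref, i), since_tid);
+	}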
diff --git a/sys/vfs/hammer2/hammer2_disk.h b/sys/vfs/hammer2/hammer2_disk.h
index c9ee06f29c..a737d00ab4 100644
--- a/sys/vfs/hammer2/hammer2_disk.h
+++ b/sys/vfs/hammer2/hammer2_disk.h
@@ -216,8 +216,8 @@ struct hammer2_blockref {	/* MUST BE EXACTLY 64 BYTES */
 	uint8_t		methods;	/* check method & compression method */
 	uint8_t		copyid;		/* specify which copy this is */
 	uint8_t		keybits;	/* key mask bits for recursion */
-	uint8_t		vradix;
-	uint8_t		reserved05;
+	uint8_t		vradix;		/* virtual data/meta-data size */
+	uint8_t		flags;		/* blockref flags */
 	uint8_t		reserved06;
 	uint8_t		reserved07;
 	hammer2_key_t	key;		/* key specification */
@@ -242,6 +242,11 @@ struct hammer2_blockref {	/* MUST BE EXACTLY 64 BYTES */
 
 typedef struct hammer2_blockref hammer2_blockref_t;
 
+#define HAMMER2_BREF_SYNC1		0x01	/* modification synchronized */
+#define HAMMER2_BREF_SYNC2		0x02	/* modification committed */
+#define HAMMER2_BREF_DESYNCCHLD		0x04	/* children must be desynchronized */
+#define HAMMER2_BREF_DELETED		0x80	/* indicates a deletion */
+
 #define HAMMER2_BLOCKREF_BYTES		64	/* blockref struct in bytes */
 #define HAMMER2_ENC_COMPMETHOD(n)	(n)
 #define HAMMER2_ENC_CHECKMETHOD(n)	((n) << 4)
@@ -425,7 +430,7 @@ struct hammer2_inode_data {
  * 11 bits off the 64-bit storage space, with leaf entries representing
  * 64KB blocks.  So: (12, 12, 12, 12, 16) = 64 bit storage space.
  *
- * Each 64K allocmap block breaks the 4096 entries into a 64x64 tree with
+ * Each 64K freemap block breaks the 4096 entries into a 64x64 tree with
  * big_hint1 representing the top level every 64th entry and big_hint2
  * representing the lower level in each entry.  These fields specify the
  * largest contiguous radix (1-63) available for allocation in the related
@@ -568,7 +573,7 @@ typedef struct hammer2_copy_data hammer2_copy_data_t;
 
 struct hammer2_volume_data {
 	/*
-	 * First 512-byte section
+	 * 512-byte sector #0
 	 */
 	uint64_t	magic;			/* 0000 Signature */
 	hammer2_off_t	boot_beg;		/* 0008 Boot area (future) */
@@ -580,7 +585,7 @@ struct hammer2_volume_data {
 	uint32_t	version;		/* 0030 */
 	uint32_t	flags;			/* 0034 */
 	uint8_t		copyid;			/* 0038 copyid of phys vol */
-	uint8_t		reserved0039;		/* 0039 */
+	uint8_t		freemap_version;	/* 0039 freemap algorithm */
 	uint8_t		reserved003A;		/* 003A */
 	uint8_t		reserved003B;		/* 003B */
 	uint32_t	reserved003C;		/* 003C */
@@ -626,25 +631,28 @@ struct hammer2_volume_data {
 	hammer2_crc32_t	icrc_sects[8];		/* 01E0-01FF */
 
 	/*
-	 * Second 512-byte section.
+	 * 512-byte sector #1
 	 *
 	 * The entire sector is used by a blockset.
 	 */
 	hammer2_blockset_t sroot_blockset;	/* 0200 Superroot directory */
 
 	/*
-	 * Third 512-byte section .
+	 * 512-byte sectors #2-33
+	 *
+	 * Up to 256 copyinfo specifications can be configured.  Note that
+	 * any given subdirectory tree can only use 8 of the 256.  Having up
+	 * to 256 configurable in the volume header allows many different
+	 * copy sets to be implemented for different sub-trees.
 	 *
-	 * This entire section contains copyinfo specifications, typically
-	 * device serno specifications such as 'serno/.s1d'.  Each
-	 * element eats 64 bytes x 8 elements is 512 bytes.
+	 * A specification takes 64 bytes.  Each specification typically
+	 * configures a device path such as 'serno/.s1d'.
 	 */
-	struct hammer2_copy_data copyinfo[8];
+	struct hammer2_copy_data copyinfo[256];	/* 0400-43FF copyinfo config */
 
 	/*
 	 * Remaining sections are reserved for future use.
 	 */
-	char		reserved0400[0xFBFC];	/* 0400-FFFC reserved */
+	char		reserved0400[0xBBFC];	/* 4400-FFFB reserved */
 
 	/*
 	 * icrc on entire volume header
-- 
2.41.0