HAMMER2 Freemap Design Notes

Overview

HAMMER2 media is broken down into 2 GByte zones. Each 2 GByte zone contains
a 4 MByte header (64 x 64KB blocks = 0.2% of storage). The blocks in this
header are reserved for various purposes. For example, block #0 is reserved
for a volume header in the first four zones. Most of the remaining 64KB
blocks in this header are reserved for use by the freemap. The freemap only
uses blocks from these reserved areas.

In order to ensure that any of the four volume headers can be used by the
mount code (in case some are found to be corrupted), each freemap block in
the logical freemap topology will iterate through up to 8 copies whose block
numbers are taken from the reserved area:

- Four copies, one for each of the four volume headers which H2 sequences
  through on each flush. This ensures that a mount from any of the four
  volume headers is handed a consistent freemap topology.

- One copy to ensure that recovery operations during mount do not modify the
  state of the freemap topology pointed to by older volume headers which are
  still valid. Note that the freemap for volume headers indexed after the
  mount point being recovered may lose freemap consistency, so if you choose
  an older mount point for a RW mount, you have to stick with it.

- One copy for live operations. This allows HAMMER2 to retire the related
  buffer asynchronously in the background (or for the OS to retire the
  buffer cache buffer on its own) prior to the formal flush. The later
  formal flush then has less work to do.

- The two remaining copies add robustness to the specification. For example,
  with appropriate feature code added the filesystem can tolerate a limited
  number of bad blocks in the reserved area.

For the moment we use a simple calculation for the freemap block. In a later
version I would like to mix the blocks up a bit so the blocks in each set of
8 are not situated near each other.

RW Mount Restrictions

If an older volume header is explicitly selected by the mount code, any
newer (presumably corrupt, since the mount code didn't select them) volume
headers will lose freemap consistency as the freemap code rotates into
freemap blocks that might have been used by the topology pointed to by the
newer (but not selected) volume headers.

For a RW mount, this means that if an older volume header is selected, the
newer ones that were not selected WILL be formally invalidated by the mount
code and cannot be used in a remount attempt.

During normal operation, each filesystem flush rotates to a new volume
header. A filesystem may have up to four volume headers spread at 2GB
intervals. Filesystems smaller than ~9GB or so will have fewer volume
headers to rotate through.

Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays), one
of which is embedded in the volume header (so only three real meta-data
levels), plus one level of leaf-data. Unlike normal files, which use a
variable radix, the freemap topology uses a fixed radix to simplify the
algorithm and to ensure freemap locality to the blocks under management.

Freemap blocks are allocated from the reserved area in each 2GB zone. The
leafs represent data in the zone. Higher levels in the freemap topology
cover more area but the physical freemap meta-data blocks always occur prior
to the area being covered. Thus a HAMMER2 filesystem of almost any size can
be formatted and the related freemap blocks will always exist.

Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented by a
          hammer2_bmap_data[1024] array.
          Each entry represents 2MB worth of media storage x 1024 entries
          to represent 2GB. Each entry contains a 128x2 bit bitmap
          representing 16KB of storage in 2 bits (128 x 16KB = 2MB).

Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)

Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)

Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)

Level 5 - (radix 3) blockref x 8 in the volume header representing 16EB
          (2^64). This conveniently eats one 512-byte 'sector' of the 64KB
          volume header.

Each level is assigned reserved blocks in the 4MB header per 2GB zone. Since
we use block 0 for the volume header, the first freemap reserved block in
the zone begins at block 1.

Freemap copy #0:
    Level 1 uses block 1 (this is the leaf block)
    Level 2 uses block 2
    Level 3 uses block 3
    Level 4 uses block 4

Freemap copy #1:
    Level 1 uses block 5 (this is the leaf block)
    Level 2 uses block 6
    Level 3 uses block 7
    Level 4 uses block 8

... and so forth, up to freemap copy #7 which uses blocks 29, 30, 31,
and 32.

Flushing

The freemap does not have to be flushed by fsync/sync, but should probably
be flushed at least once a minute by the normal filesystem sync. The reason
it does not have to be flushed with fsync is that freemap recovery is
executed on-mount and will use the last fully flushed freemap TID stored in
the volume header to do an incremental meta-data scan of the H2 filesystem
between that TID and the last flushed TID. Any blocks referenced by that
scan which are not found to have been marked allocated will be marked
allocated. Simple as that. Since the scan is incremental, this typically
costs very little time.

Freemap Granularity

The freemap granularity is 16KB (radix of 14) but the minimum allocation
radix is 1KB (radix of 10) (and can be in multiples of 1KB with some
coding). 1KB inodes can hold up to 512 bytes of direct data, so tiny files
eat exactly 1KB of media storage inclusive of the inode.

The freemap keeps track of partial allocations in-memory but not on-media,
so even a normal umount will cause partially allocated blocks to appear
fully allocated until some later date when the bulk scan code defragments
them.

Block Selection

Block selection is localized to be near the inode's (or nearby data)
blockref. The algorithmic complexity of determining locality is not defined
here atm.

Freemap Leaf Substructure

* linear - Linear sub-granular allocation offset. Allows ~1KB granular
  linear allocations.

* class - Allocation clustering class ((type << 8) | radix).

* avail - Available space in bytes, currently only used by the layer 1
  leaf. Used as an allocation clustering aid.

* bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
  at 2 bits per chunk. The filesystem allocation granularity can be smaller
  (currently ~1KB minimum), and the live filesystem caches iterations when
  allocating multiple chunks. However, on remount any partial allocations
  out of a 64KB allocation block MAY cause the entire 64KB to be considered
  allocated. Fragmented space can potentially be reclaimed and/or relocated
  by the bulk block free scan.

The 2-bit bitmap fields are assigned as follows:

    00  FREE
    01  POSSIBLY FREE (type 1)
    10  POSSIBLY FREE (type 2)
    11  ALLOCATED
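The leaf substructure above can be pictured with a small C sketch. This is
illustrative only: the structure name, field widths, and helper are
assumptions drawn from the description above and do not reproduce the
actual on-media hammer2_bmap_data layout.

    /*
     * Illustrative sketch of one freemap leaf entry as described above.
     * Field names and widths follow the text; the real on-media
     * hammer2_bmap_data structure differs in detail.
     */
    #include <stdint.h>

    #define BMAP_WORDS      8           /* eight 32-bit bitmap words   */
    #define BMAP_CHUNKS     128         /* 8 * 32 / 2 bits per chunk   */
    #define BMAP_CHUNK_SIZE 16384       /* 16KB per 2-bit chunk        */

    /* 2-bit bitmap states */
    #define BMAP_FREE       0           /* 00 FREE                     */
    #define BMAP_PFREE1     1           /* 01 POSSIBLY FREE (type 1)   */
    #define BMAP_PFREE2     2           /* 10 POSSIBLY FREE (type 2)   */
    #define BMAP_ALLOCATED  3           /* 11 ALLOCATED                */

    struct bmap_entry {
            uint32_t linear;            /* linear sub-granular offset  */
            uint16_t class;             /* (type << 8) | radix         */
            uint32_t avail;             /* available bytes (leaf only) */
            uint32_t bitmap[BMAP_WORDS];/* 128 x 2-bit chunks = 2MB    */
    };

    /* Extract the 2-bit state of chunk n (0..127) within a leaf entry. */
    static inline int
    bmap_chunk_state(const struct bmap_entry *bmap, int n)
    {
            return ((bmap->bitmap[n >> 4] >> ((n & 15) * 2)) & 3);
    }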
Freemap Metadata Substructure (Levels 2, 3, 4, and 5)

Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal some
of the check area (a 24-byte area) for freemap-specific meta-data. We
reserve a few fields to store information which allows the block allocator
to do its work more efficiently:

* bigmask - A mask of radixes available for allocation under this blockref.
  Typically initialized to -1.

* avail - Total available space in bytes.

The freemap allocator uses a cylinder-group-like abstraction using the
localized allocation concept first implemented by UFS. In HAMMER2 there is
no such thing as a real cylinder group, nor are there specific reserved
areas for inodes vs data, but we do the next best thing by roughly typing
leafs (each leaf representing ~2MB) to hopefully allow the drive to employ
its zone-cache to make both stat-only and tar-style bulk accesses efficient
(in addition to normal file accesses).

Levels 2, 3, and 4 contain an array blockmap[1024] (64KB total), supplying
10 bits of address space each. Level 5 is a blockmap[8] stored in the volume
header, supplying 3 bits of address space (Level 1 supplies 10 + 21 bits of
address space). The Level 1 blockmap is HAMMER2's idea of a 'cylinder
group', thus effectively fixed at multiples of ~2MB or so.

Initial Conditions

newfs_hammer2 does not need to format the freemap. Instead, newfs_hammer2
simply leaves the associated top-level indirect blocks empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then leaves it to
the live filesystem to initialize the freemap as more space gets allocated.

The freemap does NOT use a fixed 5-level radix tree. It uses the same
blockmap algorithm used for file blocks but restricts any recursion to
specific radix values. This means that small filesystems will have much
smaller freemap depths: 2 layers (counting the blockref array embedded in
the volume header as a layer) gets us 16GB, 3 layers gets us 16TB.

How Blocks are Allocated and Freed

The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
contains a linear allocation offset that can keep track of sub-16KB
allocations with certain restrictions. More random sub-16KB allocations are
tracked in-memory, but will be lost (assumed to be a full 16KB) if a crash
occurs.

Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10, or 11.

NOTE! All operations on the freemap occur on the current live version of the
freemap, including bulkfree operations.

Blocks are allocated by transitioning the 2-bit pattern in the leaf to 11.
That is, (00, 01, 10) -> (11).

The primary mechanism used to free a block is via the asynchronous bulkfree
scan. This scans all filesystem meta-data in two major passes (and
potentially multiple sub-passes).

Pass#1 - The first pass figures out which blocks might be freeable. The most
recently flushed meta-data topology (including all four volume headers and
all snapshots) is scanned and an in-memory copy of the freemap is built from
scratch. Multiple sub-scans might be required to break the larger scan up
into more easily digested pieces based on the amount of memory available to
hold the temporary freemap. Any allocated blocks in the live freemap are
then transitioned from (11) to either (10) or (01) if, after the scan, they
are found to not be allocated. The blocks are still assumed to be allocated
at this time and any new allocations will transition them back to (11).

Pass#2 - The second pass is required to deal with races against the live
filesystem while the freemap scan was running. It also allows the freemap
scans to run asynchronously from any flush, improving concurrency. However,
at least one synchronous flush is required between Pass#1 and Pass#2. The
second pass is a duplicate of the first pass: the meta-data topology is
scanned and a freemap is built in-memory, then compared against the live
freemap. Instead of transitioning from (11) -> (10)/(01), this pass
transitions from (10)/(01) to (00). If a block that it thinks is free is
(11), no transition occurs because this could be due to a race against the
live filesystem. This pass will incidentally transition (10)/(01) back to
(11) if the block was found to be allocated after all, but it is perfectly
acceptable for the block to remain in a (10)/(01) state after completion.
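The allocation and two-pass bulkfree rules above can be summarized as simple
2-bit state transitions. The following C sketch is illustrative only: the
function names are hypothetical, and the real bulkfree code operates on
whole bitmap words and buffers rather than single chunks.

    /*
     * Illustrative 2-bit state transitions for a single 16KB chunk,
     * following the allocation and bulkfree rules described above.
     * 'live' is the chunk's current state in the live freemap and
     * 'scanned_allocated' is non-zero when the bulkfree meta-data scan
     * found the chunk to be in use. Hypothetical helpers, not the actual
     * bulkfree implementation.
     */

    /* Live allocation: any state transitions directly to ALLOCATED (11). */
    static inline int
    chunk_allocate(void)
    {
            return (3);                 /* 11 ALLOCATED */
    }

    /*
     * Bulkfree Pass#1: (11) is demoted to a POSSIBLY FREE state when the
     * scan did not find the chunk allocated; everything else is left
     * alone. (The choice between the 01 and 10 encodings is not covered
     * here.)
     */
    static inline int
    chunk_bulkfree_pass1(int live, int scanned_allocated)
    {
            if (live == 3 && !scanned_allocated)
                    return (1);         /* 01 POSSIBLY FREE (type 1) */
            return (live);
    }

    /*
     * Bulkfree Pass#2: only POSSIBLY FREE chunks are touched. A chunk
     * still found unallocated becomes FREE (00); one found allocated is
     * incidentally promoted back to (11). A chunk that raced back to
     * (11) in the live freemap is never transitioned.
     */
    static inline int
    chunk_bulkfree_pass2(int live, int scanned_allocated)
    {
            if (live == 1 || live == 2)
                    return (scanned_allocated ? 3 : 0);
            return (live);
    }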
NOTE! The meta-data scanning passes must also explicitly scan blocks
associated with any open files, since these might represent open-but-deleted
files. These blocks must not be accidentally freed while the system is still
using the file. Again, since this is done in two passes it does not have to
be synchronized against frontend operations.

So in total, the bulkfree passes must scan:

* Topology under all four volume headers. This includes all PFSs and
  snapshots.

* Topology under all open hammer2 files.

The bulk-free operation is expensive but uses a bounded amount of RAM. The
ADVANTAGE of this mechanism is that deletions in the live filesystem do not
have to clean up the freemap and thus do not have to recurse the topology
during the deletion. In fact, a 'rm -rf' equivalent of a directory topology
can be handled simply by blowing away the top-level directory inode. This is
instantaneous and thus can be dangerous, but you always have your snapshots
to fall back on.

The DISADVANTAGE is that all meta-data must be scanned. Twice. This can be
mitigated by using swapcache(8) to cache the meta-data on a SSD. It is also
mitigated by the fact that the bulkfree scan can be run less often on very
large filesystems, which presumably have a lot of free space (so the
interval is not as big an issue). In a sense the operation does scale: it
takes longer on larger filesystems but can also be run less often.

The biggest issue is that *NO* space can be freed up by the live filesystem
without the bulkfree process unless we optimize the case where data is
created and deleted from within a single snapshot. This is made more
difficult by the fact that each flush represents a fine-grained snapshot (up
to four, representing the four volume headers the flush iterates through).

Snapshots and Replicated Topologies

The bulkfree code maintains information in-memory to the best of its ability
for a multitude of reasons, including attempting to detect snapshot
recursions down block chains which have already been scanned via some other
snapshot. Without this, a large number of snapshots can cause a huge
multiplication of disk I/O reads (but not writes) during the topology scan.

Use of Generic Indirect-Block API

I decided to use the same indirect-block allocation model for the freemap
that normal files use, with a few special cases added to force specific
radix values and to 'allocate' the freemap-related blocks and indirect
blocks via a reserved-block calculation and (obviously) not via a recursive
call to the allocator.

The freemap is defined above as a fixed 5-level scheme (levels 1-5), but in
actual operation the radix tree can be shortcut just as it is with normal
files. However, unlike normal files, shortcuts will be forced to use
specific radix values in order to guarantee that reserved block numbers can
be trivially calculated. As the freemap becomes more fleshed out, the
on-media tree will look more and more like the actual specification.
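As a concrete illustration of the reserved-block calculation mentioned
above, the following C sketch computes the media offset of a reserved
freemap block from the simple linear layout listed under "Freemap Topology"
(copy #0 uses blocks 1-4, copy #7 uses blocks 29-32). The identifiers are
hypothetical; this is not the actual HAMMER2 routine, which may mix the
blocks up in later versions as noted in the Overview.

    /*
     * Illustrative reserved-block calculation assuming the simple linear
     * layout described under "Freemap Topology": block 0 of each 2GB zone
     * holds the volume header slot and freemap copy N uses 64KB blocks
     * (1 + N*4) through (4 + N*4) for levels 1 through 4. Hypothetical
     * identifiers; not the actual HAMMER2 code.
     */
    #include <stdint.h>

    #define ZONE_BYTES      (2ULL * 1024 * 1024 * 1024)     /* 2GB zone   */
    #define HDRBLOCK_BYTES  (64ULL * 1024)                   /* 64KB block */

    static inline uint64_t
    freemap_reserved_offset(uint64_t zone, int copy, int level)
    {
            /* copy is 0..7, level is 1..4 */
            uint64_t block = 1 + (uint64_t)copy * 4 + (uint64_t)(level - 1);

            return (zone * ZONE_BYTES + block * HDRBLOCK_BYTES);
    }

With this layout, copy #0 level 1 lands on block 1 and copy #7 level 4 lands
on block 32, matching the assignments listed earlier.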
One advantage of using the generic indirect-block model is that smaller
filesystems won't actually use a 5-level scheme. A 16GB filesystem can use 8
blockrefs in the volume header which point directly to layer 1 leaf blocks.
A 16TB filesystem can be managed with only three levels (layers 3, 2, and 1
only, where the 8 x layer 3 blockrefs are stored in the volume header). And
so forth.

At the moment we have no plans to return any of the unused 4MB zone header
space (per 2GB of storage) back to the filesystem for general use. There are
lots of things we may want to use the reserved areas for in the future.

Emergency Deletions

All filesystem modifications, including deletions, must allocate blocks in
order to update the main topology all the way to the root. H2 will reserve
roughly 5% of the available blocks in the filesystem for deletions in order
to allow a system operator to recover from a filesystem-full condition.
However, due to the snapshot capability as well as the possibility of
fragmentation, it is possible for the administrator to not delete enough to
actually be able to free up blocks. Once the reserve is used up the
filesystem can become unwritable.

When this situation occurs the only way to recover is to update blocks
in-place. Updating blocks in-place will destroy the data on any related
snapshots or otherwise corrupt the snapshots. Emergency recovery thus
recommends that all related snapshots be destroyed. You can choose not to do
this, in which case your snapshots might wind up containing broken links and
generating CRC failure messages.

For the moment the spec for dealing with these situations remains
incomplete.
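The ~5% deletion reserve could be enforced with a check along the following
lines. This is a hypothetical sketch only; the names, the threshold
handling, and the policy details are assumptions and are part of the
incomplete spec mentioned above.

    /*
     * Hypothetical sketch of the ~5% deletion reserve: ordinary
     * allocations fail once free space falls to the reserve, while
     * allocations made on behalf of deletions may continue to dig into
     * it. Names and policy details are assumptions.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define DELETE_RESERVE_PCT      5

    static bool
    allocation_allowed(uint64_t free_bytes, uint64_t total_bytes,
                       bool for_deletion)
    {
            uint64_t reserve = total_bytes / 100 * DELETE_RESERVE_PCT;

            if (for_deletion)
                    return (free_bytes > 0);
            return (free_bytes > reserve);
    }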