hammer2 - freemap part 1 - initial block allocator and media support
authorMatthew Dillon <dillon@apollo.backplane.com>
Fri, 17 May 2013 21:48:59 +0000 (14:48 -0700)
committerMatthew Dillon <dillon@apollo.backplane.com>
Fri, 17 May 2013 21:48:59 +0000 (14:48 -0700)
* Freemap document (FREEMAP in this directory)

* temporarily turn off clustering until the freemap gets that capability
  (mixed buffer sizes can be adjacent atm).

* Remove the freemap_blockref[1] from the volume header and replace it
  with a blockset array (8 blockrefs).

* Implement dynamic creation of freemap nodes and leafs on an as-needed
  basis using the normal indirect block creation code.  Most of the standard
  file handling code is reused for the freemap support.

* Major cleanup of hammer2_chain.c, the duplication code, the indirect
  block creation and handling, and the chain->flag handling.

13 files changed:
sys/vfs/hammer2/DESIGN
sys/vfs/hammer2/FREEMAP [new file with mode: 0644]
sys/vfs/hammer2/TODO
sys/vfs/hammer2/hammer2.h
sys/vfs/hammer2/hammer2_chain.c
sys/vfs/hammer2/hammer2_disk.h
sys/vfs/hammer2/hammer2_flush.c
sys/vfs/hammer2/hammer2_freemap.c
sys/vfs/hammer2/hammer2_inode.c
sys/vfs/hammer2/hammer2_ioctl.c
sys/vfs/hammer2/hammer2_subr.c
sys/vfs/hammer2/hammer2_vfsops.c
sys/vfs/hammer2/hammer2_vnops.c

index 6cb1cc4..47ff7bd 100644 (file)
@@ -3,44 +3,60 @@
 
                                Matthew Dillon
                                 08-Feb-2012
+                                14-May-2013
                             dillon@backplane.com
 
 * These features have been speced in the media structures.
 
 * Implementation work has begun.
 
-* A working filesystem with some features implemented is expected by July 2012.
+* Filesytem core is now operational, cluster messaging links are primitive
+  but work (and are fully encrypted).  Work continues on the block allocator
+  and work has not yet begun on copies, block-encryption, block-compression,
+  mirroring, or quorum/cluster ops.
 
-* A fully functional filesystem with most (but not all) features is expected
-  by the end of 2012.
+* Obviously a fully functional filesystem is not yet ready but once the
+  freemap and the backend garbage collector is implemented the HAMMER2
+  filesystem will be usable.  Missing many features, but usable.
 
-* All elements of the filesystem have been designed except for the freemap
-  (which isn't needed for initial work).  8MB per 2GB of filesystem
-  storage has been reserved for the freemap.  The design of the freemap
-  is expected to be completely speced by mid-year.
+* Design of all media elements is complete.
 
-* This is my only project this year.  I'm not going to be doing any major
-  kernel bug hunting this year.
-
-                               Feature List
+                                   Feature List
 
 * Multiple roots (allowing snapshots to be mounted).  This is implemented
   via the super-root concept.  When mounting a HAMMER2 filesystem you specify
-  a device path and a directory name in the super-root.
-
-* HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
-  in the tree can be configured as a PFS, causing all elements recursively
-  underneath that directory to become a part of that PFS.
-
-* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
-  show up in the super-root.  It is possible to snapshot a subdirectory
-  and then later snapshot a parent of that subdirectory... really there are
-  no limitations here.
-
-* Directory sub-hierarchy based quotas and space and inode usage tracking.
-  Any directory sub-tree, whether at a mount point or not, tracks aggregate
-  inode use and data space use.  This is stored in the directory inode all
-  the way up the chain.
+  a device path and a directory name in the super-root.  (HAMMER1 had only
+  one root).
+
+* Roots are really no different from snapshots (HAMMER1 distinguished between
+  its root mount and its PFS's.  HAMMER2 does not).
+
+* Snapshots are writable (in HAMMER1 snapshots were read-only).
+
+* Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
+  both explicit and fine-grained/automatic.  HAMMER2 does not implement
+  automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
+  can create fine-grained snapshots if you desire.
+
+* HAMMER2 flushes formalized a synchronization point for the flush, wait
+  for all running modifying operations to complete to memory (not to disk)
+  while temporarily stalling new modifying operation initiations.  The
+  flush is then able to proceed concurrent with unstalling and allowing
+  new modifying operations to run.
+
+* The flush is fully meta-data-synchronized in HAMMER2.  In HAMMER1 it was
+  possible for flushes to bisect inode creation vs directory entry creation
+  and to create problems with directory renames.  HAMMER2 has no issues with
+  any of these.  Dealing with data synchronization is another matter but
+  it should be possible to address explcit write()'s properly.  mmap()'d
+  R+W data... not so easy.
+
+* Directory sub-hierarchy-based quotas for space and inode usage tracking.
+  Any directory can be used.
+
+* Low memory footprint.  Except for the volume header, the buffer cache
+  is completely asynchronous and dirty buffers can be retired by the OS
+  directly to backing store with no further interactions with the filesystem.
 
 * Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
   block-oriented and copy-on-write each blockref tracks both direct
diff --git a/sys/vfs/hammer2/FREEMAP b/sys/vfs/hammer2/FREEMAP
new file mode 100644 (file)
index 0000000..a558815
--- /dev/null
@@ -0,0 +1,141 @@
+
+                     HAMMER2 Freemap Design Notes
+
+                               Overview
+
+    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
+    contains a 4 MByte header (64 x 64K blocks).
+
+    Block #0 is the volume header.  Block #0 is the next three zones,
+    assuming the media is big enough, contain backup volume headers.
+    Flushes cycle through these four headers and the mount code iterates
+    all four to locate the best candidate to mount with.  The reason for
+    this is to ensure that mounting works even if a crash catches a
+    volume header in a partial update.
+
+    Remaining blocks are used for various purposes, primarily by the
+    freemap.
+
+    * It is very important to remember that the Freemap only uses
+      blocks from these reserved areas.  Freemap blocks are NOT dynamically
+      allocated.
+
+    * On-mount, the synchronization TID for the main H2 filesystem is
+      compared against the synchronization TID of the freemap and the
+      H2 topology is incrementally iterated using mirror_tid to update
+      the freemap with any missing information.  This way the freemap flush
+      does not need to be synchronized with the normal H2 flush.
+
+    * The freemap is flushed in a manner similar to the normal H2 filesystem,
+      but as mentioned above it does not have to be synchronized.  One freemap
+      flush could cover several H2 flushes.  A freemap flush is not necessary
+      for e.g. a fsync() or sync() to complete successfully.
+
+    * The minimum allocation radix is 10 (1 Kilobyte).  In Hammer2, the
+      inode is 1KB and contains up to 512 bytes of direct data, so in terms
+      of file storage efficiency H2 is can pack small files and their inodes
+      into a single 1KB block.
+
+      The freemap thus must handle a 1KB granularity, which comes to around
+      256KB per 2GB zone at 1-bit-per-block.  Since we have ~4MB available,
+      there is plenty of space to implement redundancy.
+
+                           Freemap Topology
+
+    The freemap topology contains 5 levels of meta-data (blockref arrays).
+
+    Level 0 - (radix 10+19+2) 256KB bitmap representing 2GB
+
+    Level 1 - (radix 10) 64KB blockmap representing 2GB.  This level
+             shadows level 0 exactly.   There are 1024 blockref entries
+             each representing ~2MB worth of media storage.
+
+    Level 2 - (radix 10) 64KB blockmap representing 2TB.
+    Level 3 - (radix 10) 64KB blockmap representing 2PB.
+    Level 4 - (radix 10) 64KB blockmap representing 2EB.
+    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
+             (this conveniently eats one 512-byte 'sector' of the 64KB
+             volume header).
+
+    Each level is assign reserved blocks in the 4MB header per 2GB zone.
+    Level 0 requires four blocks (#1-#4), level 1, 2, 3, and 4 each require
+    one block (#5, #6, #7, #8), while level 5 is embedded in the volume
+    header.
+
+    In addition, the reserved blocks 1-8 are not overwritten on each flush.
+    Instead, a different set of reserved blocks is used.  Four sets, A-D,
+    are specified.  A=1-8, B=9-16, C=17-24, D=25-32.  Blocks 33-63 are unused
+    at present and reserved for future use.
+
+                           Blockref Substructure
+
+    The blockref substructure at each level steals some space from the
+    check code area (a 24-byte area).  We only need 4 bytes for the check
+    code icrc.  We use some of the remaining space to store information
+    that allows the block allocator to do its work more efficiently.
+
+    * biggest - Biggest available linear allocation radix (powers of 2).
+               May be initialized larger but the 2GB zone has a 4MB chunk
+               taken out of it for a header so the maximum linear allocation
+               is going to be 1GB (and not an efficient 1GB at that), which
+               would be radix 30.
+
+    * avail   - Total available space in bytes.
+
+    The freemap allocator uses a cylinder-group-like abstraction using
+    the localized allocation concept first implemented by UFS.  In HAMMER2
+    there is no such thing as a real cylinder group, but we do the next
+    best thing by implementing our layer 1 blockmap representing 2GB.
+
+    The layer 1 blockmap is an array of 1024 blockrefs, so each blockref
+    covers 2MB worth of media storage.  HAMMER2's 'cylinder group' concept
+    thus has a minimum granularity of 2MB.  A typical setting might be e.g.
+    10MB.
+
+    By localizing allocations to cylinder groups based on various bits of
+    information, HAMMER2 tries to allocate space on the disk and still leave
+    some left over for localized expansion and to reduce fragmentation at
+    the same time.  Not an easy task, especially considering the copy-on-write
+    nature of the filesystem.  This part of the algorithm likely needs a lot
+    of work but I hope I've laid down a media format that will not have to be
+    changed down the line to accomodate better allocation strategies.
+
+                           Initial Conditions
+
+    The freemap is a multi-indirect block structure but there is no real
+    reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
+    leaves the associated blockset empty and uses the (voldata->allocator_beg)
+    field to allocate space linearly, then leaves it to the live filesystem
+    to initialize the freemap as more space gets allocated.
+
+    To keep the abstraction simple, this means in the bitmap 0=unallocated,
+    1=allocated.  The allocation blockmap is initialized for the zone's 4MB
+    reserve area as new zones are opened up for allocation.  Initialization
+    of the freemap for the root zone at offset 0 is further adjusted based
+    on (voldata->allocator_beg).  This field is not used once the freemap
+    for the root zone has been setup by the live filesystem.
+
+                       Use of Generic indirect-block API
+
+    I decided to use the same indirect-block allocation model for the
+    freemap that normal files use, with a few special cases added to force
+    specific radix values and to 'allocate' the freemap-related blocks
+    and indirect blocks via a reserved-block calculation and (obviously)
+    not via a recursive call to the allocator.
+
+    The Freemap is defined above as a fixed 6-level scheme (level 0-5),
+    but in actual operation the radix tree can be shortcut just as it
+    is with normal files.  However, shorcuts are forced into the radix
+    values of this specification and reserved blocks are calculated based on
+    the radix level and offset, so as the freemap becomes more fleshed
+    out the tree looks more and more like the specification.
+
+    One advantage of doing things this way is that smaller filesystems
+    won't actually use a 6-level scheme.  A 16GB filesystem can use 8
+    blockrefs at layer 5 (in the volume header) that point directly to
+    layer 1.  A 16TB filesystem can use 8 blockrefs at layer5 that point
+    to layer 2.  And so forth.
+
+    At the moment we have no plans to return any of the unused 4MB zone
+    header space back to the filesystem for general use.  There are lots
+    of things we may want to use the reserved areas for in the future.
index 5d4e3eb..17331eb 100644 (file)
@@ -1,10 +1,20 @@
 
+* freemap / clustering.  Set block size on 2MB boundary so the cluster code
+  can be used for reading.
+
+* need API layer for shared buffers (unfortunately).
+
 * add magic number to inode header, add parent inode number too, to
   help with brute-force recovery.
 
 * modifications past our flush point do not adjust vchain.
   need to make vchain dynamic so we can (see flush_scan2).??
 
+* MINIOSIZE/RADIX set to 1KB for now to avoid buffer cache deadlocks
+  on multiple locked inodes.  Fix so we can use LBUFSIZE!  Or,
+  alternatively, allow a smaller I/O size based on the sector size
+  (not optimal though).
+
 * When making a snapshot, do not allow the snapshot to be mounted until
   the in-memory chain has been freed in order to break the shared core.
 
index 52e6815..0a3c872 100644 (file)
@@ -166,6 +166,19 @@ typedef struct hammer2_chain hammer2_chain_t;
 int hammer2_chain_cmp(hammer2_chain_t *chain1, hammer2_chain_t *chain2);
 RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 
+/*
+ * Special notes on flags:
+ *
+ * INITIAL - This flag allows a chain to be created and for storage to
+ *          be allocated without having to immediately instantiate the
+ *          related buffer.  The data is assumed to be all-zeros.  It
+ *          is primarily used for indirect blocks.
+ *
+ * MOVED   - A modified chain becomes MOVED after it flushes.  A chain
+ *          can also become MOVED if it is moved within the topology
+ *          (even if not modified).
+ *
+ */
 #define HAMMER2_CHAIN_MODIFIED         0x00000001      /* dirty chain data */
 #define HAMMER2_CHAIN_ALLOCATED                0x00000002      /* kmalloc'd chain */
 #define HAMMER2_CHAIN_DIRTYBP          0x00000004      /* dirty on unlock */
@@ -185,10 +198,17 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 
 /*
  * Flags passed to hammer2_chain_lookup() and hammer2_chain_next()
+ *
+ * NOTE: MATCHIND allows an indirect block / freemap node to be returned
+ *      when the passed key range matches the radix.  Remember that key_end
+ *      is inclusive (e.g. {0x000,0xFFF}, not {0x000,0x1000}).
  */
 #define HAMMER2_LOOKUP_NOLOCK          0x00000001      /* ref only */
 #define HAMMER2_LOOKUP_NODATA          0x00000002      /* data left NULL */
 #define HAMMER2_LOOKUP_SHARED          0x00000100
+#define HAMMER2_LOOKUP_MATCHIND                0x00000200
+#define HAMMER2_LOOKUP_FREEMAP         0x00000400      /* freemap base */
+#define HAMMER2_LOOKUP_ALWAYS          0x00000800      /* resolve data */
 
 /*
  * Flags passed to hammer2_chain_modify() and hammer2_chain_resize()
@@ -199,6 +219,7 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_MODIFY_OPTDATA         0x00000002      /* data can be NULL */
 #define HAMMER2_MODIFY_NO_MODIFY_TID   0x00000004
 #define HAMMER2_MODIFY_ASSERTNOCOPY    0x00000008
+#define HAMMER2_MODIFY_NOREALLOC       0x00000010
 
 /*
  * Flags passed to hammer2_chain_lock()
@@ -208,8 +229,8 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_RESOLVE_ALWAYS         3
 #define HAMMER2_RESOLVE_MASK           0x0F
 
-#define HAMMER2_RESOLVE_SHARED         0x10
-#define HAMMER2_RESOLVE_NOREF          0x20
+#define HAMMER2_RESOLVE_SHARED         0x10    /* request shared lock */
+#define HAMMER2_RESOLVE_NOREF          0x20    /* already ref'd on lock */
 
 /*
  * Flags passed to hammer2_chain_delete_duplicate()
@@ -225,6 +246,11 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_FREECACHE_UNUSED3      3
 #define HAMMER2_FREECACHE_TYPES                4
 
+/*
+ * hammer2_freemap_alloc() block preference
+ */
+#define HAMMER2_OFF_NOPREF             ((hammer2_off_t)-1)
+
 /*
  * BMAP read-ahead maximum parameters
  */
@@ -336,9 +362,11 @@ struct hammer2_trans {
        TAILQ_ENTRY(hammer2_trans) entry;
        struct hammer2_mount    *hmp;
        hammer2_tid_t           sync_tid;
-       thread_t                td;
+       thread_t                td;             /* pointer */
        int                     flags;
        int                     blocked;
+       struct hammer2_inode    *tmp_ip;        /* heuristics only */
+       hammer2_off_t           tmp_bpref;      /* heuristics only */
        uint8_t                 inodes_created;
        uint8_t                 dummy[7];
 };
@@ -377,13 +405,16 @@ struct hammer2_mount {
        int             nipstacks;
        int             maxipstacks;
        hammer2_chain_t vchain;         /* anchor chain */
+       hammer2_chain_t fchain;         /* freemap chain special */
        hammer2_chain_t *schain;        /* super-root */
        hammer2_inode_t *sroot;         /* super-root inode */
        struct lock     alloclk;        /* lockmgr lock */
        struct lock     voldatalk;      /* lockmgr lock */
        struct hammer2_trans_queue transq; /* all in-progress transactions */
        hammer2_trans_t *curflush;      /* current flush in progress */
-       hammer2_tid_t   flush_tid;      /* currently synchronizing flush pt */
+       hammer2_tid_t   topo_flush_tid; /* currently synchronizing flush pt */
+       hammer2_tid_t   free_flush_tid; /* currently synchronizing flush pt */
+       hammer2_off_t   heur_last_alloc;
        int             flushcnt;       /* #of flush trans on the list */
 
        int             volhdrno;       /* last volhdrno written */
@@ -510,7 +541,7 @@ u_int32_t hammer2_to_unix_xid(uuid_t *uuid);
 void hammer2_guid_to_uuid(uuid_t *uuid, u_int32_t guid);
 
 hammer2_key_t hammer2_dirhash(const unsigned char *name, size_t len);
-int hammer2_allocsize(size_t bytes);
+int hammer2_getradix(size_t bytes);
 
 int hammer2_calc_logical(hammer2_inode_t *ip, hammer2_off_t uoff,
                         hammer2_key_t *lbasep, hammer2_key_t *leofp);
@@ -615,8 +646,8 @@ void hammer2_chain_setsubmod(hammer2_trans_t *trans, hammer2_chain_t *chain);
 /*
  * hammer2_trans.c
  */
-void hammer2_trans_init(hammer2_mount_t *hmp, hammer2_trans_t *trans,
-                               int flags);
+void hammer2_trans_init(hammer2_trans_t *trans, hammer2_mount_t *hmp,
+                       hammer2_inode_t *ip, int flags);
 void hammer2_trans_done(hammer2_trans_t *trans);
 
 /*
@@ -642,8 +673,8 @@ void hammer2_dump_chain(hammer2_chain_t *chain, int tab, int *countp);
 /*
  * hammer2_freemap.c
  */
-hammer2_off_t hammer2_freemap_alloc(hammer2_mount_t *hmp,
-                               int type, size_t bytes);
+int hammer2_freemap_alloc(hammer2_trans_t *trans,
+                               hammer2_blockref_t *bref, size_t bytes);
 void hammer2_freemap_free(hammer2_mount_t *hmp, hammer2_off_t data_off,
                                int type);
 
index 1b419f9..9cf5900 100644 (file)
@@ -71,7 +71,7 @@ static int hammer2_indirect_optimize; /* XXX SYSCTL */
 
 static hammer2_chain_t *hammer2_chain_create_indirect(
                hammer2_trans_t *trans, hammer2_chain_t *parent,
-               hammer2_key_t key, int keybits, int *errorp);
+               hammer2_key_t key, int keybits, int for_type, int *errorp);
 
 /*
  * We use a red-black tree to guarantee safe lookups under shared locks.
@@ -147,13 +147,13 @@ hammer2_chain_alloc(hammer2_mount_t *hmp, hammer2_trans_t *trans,
        switch(bref->type) {
        case HAMMER2_BREF_TYPE_INODE:
        case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
        case HAMMER2_BREF_TYPE_DATA:
        case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                chain = kmalloc(sizeof(*chain), hmp->mchain, M_WAITOK | M_ZERO);
                break;
        case HAMMER2_BREF_TYPE_VOLUME:
+       case HAMMER2_BREF_TYPE_FREEMAP:
                chain = NULL;
                panic("hammer2_chain_alloc volume type illegal for op");
        default:
@@ -406,6 +406,7 @@ hammer2_chain_lastdrop(hammer2_chain_t *chain)
 
        switch(chain->bref.type) {
        case HAMMER2_BREF_TYPE_VOLUME:
+       case HAMMER2_BREF_TYPE_FREEMAP:
                chain->data = NULL;
                break;
        case HAMMER2_BREF_TYPE_INODE:
@@ -530,6 +531,10 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
                        return(0);
                if (chain->bref.type == HAMMER2_BREF_TYPE_DATA)
                        return(0);
+#if 0
+               if (chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE)
+                       return(0);
+#endif
                if (chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_LEAF)
                        return(0);
                /* fall through */
@@ -555,7 +560,7 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
         * API must still be used to do that).
         *
         * The device buffer is variable-sized in powers of 2 down
-        * to HAMMER2_MINALLOCSIZE (typically 1K).  A 64K physical storage
+        * to HAMMER2_MIN_ALLOC (typically 1K).  A 64K physical storage
         * chunk always contains buffers of the same size. (XXX)
         *
         * The minimum physical IO size may be larger than the variable
@@ -597,10 +602,12 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
 
        /*
         * Zero the data area if the chain is in the INITIAL-create state.
-        * Mark the buffer for bdwrite().
+        * Mark the buffer for bdwrite().  This clears the INITIAL state
+        * but does not mark the chain modified.
         */
        bdata = (char *)chain->bp->b_data + boff;
        if (chain->flags & HAMMER2_CHAIN_INITIAL) {
+               atomic_clear_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
                bzero(bdata, chain->bytes);
                atomic_set_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
        }
@@ -616,6 +623,7 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
         */
        switch (bref->type) {
        case HAMMER2_BREF_TYPE_VOLUME:
+       case HAMMER2_BREF_TYPE_FREEMAP:
                /*
                 * Copy data from bp to embedded buffer
                 */
@@ -644,7 +652,6 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
        case HAMMER2_BREF_TYPE_DATA:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
        case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
        default:
@@ -758,7 +765,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
                case HAMMER2_BREF_TYPE_INDIRECT:
                        counterp = &hammer2_ioa_indr_write;
                        break;
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                        counterp = &hammer2_ioa_fmap_write;
@@ -779,7 +785,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
                case HAMMER2_BREF_TYPE_INDIRECT:
                        counterp = &hammer2_iod_indr_write;
                        break;
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                        counterp = &hammer2_iod_fmap_write;
@@ -920,8 +925,7 @@ hammer2_chain_resize(hammer2_trans_t *trans, hammer2_inode_t *ip,
         * Relocate the block, even if making it smaller (because different
         * block sizes may be in different regions).
         */
-       chain->bref.data_off = hammer2_freemap_alloc(hmp, chain->bref.type,
-                                                    nbytes);
+       hammer2_freemap_alloc(trans, &chain->bref, nbytes);
        chain->bytes = nbytes;
        /*ip->delta_dcount += (ssize_t)(nbytes - obytes);*/ /* XXX atomic */
 
@@ -987,29 +991,62 @@ hammer2_chain_modify(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        hammer2_mount_t *hmp = trans->hmp;
        hammer2_chain_t *chain;
        hammer2_off_t pbase;
+       hammer2_tid_t flush_tid;
        struct buf *nbp;
        int error;
+       int wasinitial;
        size_t bbytes;
        size_t boff;
        void *bdata;
 
        /*
-        * modify_tid is only update for primary modifications, not for
-        * propagated brefs.  mirror_tid will be updated regardless during
-        * the flush, no need to set it here.
+        * Data must be resolved if already assigned unless explicitly
+        * flagged otherwise.
         */
        chain = *chainp;
+       if (chain->data == NULL && (flags & HAMMER2_MODIFY_OPTDATA) == 0 &&
+           (chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX)) {
+               hammer2_chain_lock(chain, HAMMER2_RESOLVE_ALWAYS);
+               hammer2_chain_unlock(chain);
+       }
+
+       /*
+        * data is not optional for freemap chains (we must always be sure
+        * to copy the data on COW storage allocations).
+        */
+       if (chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE ||
+           chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_LEAF) {
+               KKASSERT((chain->flags & HAMMER2_CHAIN_INITIAL) ||
+                        (flags & HAMMER2_MODIFY_OPTDATA) == 0);
+       }
 
        /*
         * If the chain is already marked MODIFIED we can usually just
         * return.  However, if a modified chain is modified again in
-        * a synchronization-point-crossing manner we have to
-        * delete/duplicate the chain so as not to interfere with the
-        * atomicy of the flush.
+        * a synchronization-point-crossing manner we have to issue a
+        * delete/duplicate on the chain to avoid flush interference.
         */
        if (chain->flags & HAMMER2_CHAIN_MODIFIED) {
-               if (chain->modify_tid <= hmp->flush_tid &&
-                   trans->sync_tid > hmp->flush_tid) {
+               /*
+                * Which flush_tid do we need to check?  If the chain is
+                * related to the freemap we have to use the freemap flush
+                * tid (free_flush_tid), otherwise we use the normal filesystem
+                * flush tid (topo_flush_tid).  The two flush domains are
+                * almost completely independent of each other.
+                */
+               if (chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE ||
+                   chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_LEAF) {
+                       flush_tid = hmp->topo_flush_tid; /* XXX */
+                       goto skipxx;    /* XXX */
+               } else {
+                       flush_tid = hmp->topo_flush_tid;
+               }
+
+               /*
+                * Main tests
+                */
+               if (chain->modify_tid <= flush_tid &&
+                   trans->sync_tid > flush_tid) {
                        /*
                         * Modifications cross synchronization point,
                         * requires delete-duplicate.
@@ -1018,33 +1055,28 @@ hammer2_chain_modify(hammer2_trans_t *trans, hammer2_chain_t **chainp,
                        hammer2_chain_delete_duplicate(trans, chainp, 0);
                        chain = *chainp;
                        /* fall through using duplicate */
-               } else {
-                       /*
-                        * It is possible that a prior lock/modify sequence
-                        * retired the buffer.  During this lock/modify
-                        * sequence MODIFIED may still be set but the buffer
-                        * could wind up clean.  Since the caller is going
-                        * to modify the buffer further we have to be sure
-                        * that DIRTYBP is set so our chain code knows to
-                        * bwrite/bdwrite the bp.
-                        */
-                       if ((flags & HAMMER2_MODIFY_OPTDATA) == 0 &&
-                           chain->bp == NULL) {
-                               goto skip1;
-                       }
-                       atomic_set_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
-
-                       /*
-                        * Must still adjust these fields in the
-                        * already-modified path.
-                        */
-                       if ((flags & HAMMER2_MODIFY_NO_MODIFY_TID) == 0)
-                               chain->bref.modify_tid = trans->sync_tid;
-                       chain->modify_tid = trans->sync_tid;
-                       return;
                }
+skipxx: /* XXX */
+               /*
+                * Quick return path, set DIRTYBP to ensure that
+                * the later retirement of bp will write it out.
+                *
+                * quick return path also needs the modify_tid
+                * logic.
+                */
+               if (chain->bp)
+                       atomic_set_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
+               if ((flags & HAMMER2_MODIFY_NO_MODIFY_TID) == 0)
+                       chain->bref.modify_tid = trans->sync_tid;
+               chain->modify_tid = trans->sync_tid;
+               return;
        }
 
+       /*
+        * modify_tid is only update for primary modifications, not for
+        * propagated brefs.  mirror_tid will be updated regardless during
+        * the flush, no need to set it here.
+        */
        if ((flags & HAMMER2_MODIFY_NO_MODIFY_TID) == 0)
                chain->bref.modify_tid = trans->sync_tid;
 
@@ -1064,50 +1096,43 @@ hammer2_chain_modify(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        chain->modify_tid = trans->sync_tid;
 
        /*
-        * We must allocate the copy-on-write block.
-        *
-        * If the data is embedded no other action is required.
-        *
-        * If the data is not embedded we acquire and clear the
-        * new block.  If chain->data is not NULL we then do the
-        * copy-on-write.  chain->data will then be repointed to the new
-        * buffer and the old buffer will be released.
+        * The modification or re-modification requires an allocation and
+        * possible COW.
         *
-        * For newly created elements with no prior allocation we go
-        * through the copy-on-write steps except without the copying part.
-        */
-       if (chain != &hmp->vchain) {
-               if ((hammer2_debug & 0x0001) &&
-                   (chain->bref.data_off & HAMMER2_OFF_MASK)) {
-                       kprintf("Replace %d\n", chain->bytes);
-               }
-               chain->bref.data_off =
-                       hammer2_freemap_alloc(hmp, chain->bref.type,
-                                             chain->bytes);
+        * We normally always allocate new storage here.  If storage exists
+        * and MODIFY_NOREALLOC is passed in, we do not allocate new storage.
+        */
+       if (chain != &hmp->vchain &&
+           chain != &hmp->fchain &&
+           ((chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX) == 0 ||
+            (flags & HAMMER2_MODIFY_NOREALLOC) == 0)
+       ) {
+               hammer2_freemap_alloc(trans, &chain->bref, chain->bytes);
                /* XXX failed allocation */
        }
 
        /*
-        * If data instantiation is optional and the chain has no current
-        * data association (typical for DATA and newly-created INDIRECT
-        * elements), don't instantiate the buffer now.
+        * Do not COW if OPTDATA is set.  INITIAL flag remains unchanged.
+        * (OPTDATA does not prevent [re]allocation of storage, only the
+        * related copy-on-write op).
         */
-       if ((flags & HAMMER2_MODIFY_OPTDATA) && chain->bp == NULL)
+       if (flags & HAMMER2_MODIFY_OPTDATA)
                goto skip2;
 
-skip1:
        /*
-        * Setting the DIRTYBP flag will cause the buffer to be dirtied or
-        * written-out on unlock.  This bit is independent of the MODIFIED
-        * bit because the chain may still need meta-data adjustments done
-        * by virtue of MODIFIED for its parent, and the buffer can be
-        * flushed out (possibly multiple times) by the OS before that.
-        *
         * Clearing the INITIAL flag (for indirect blocks) indicates that
-        * a zero-fill buffer has been instantiated.
+        * we've processed the uninitialized storage allocation.
+        *
+        * If this flag is already clear we are likely in a copy-on-write
+        * situation but we have to be sure NOT to bzero the storage if
+        * no data is present.
         */
-       atomic_set_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
-       atomic_clear_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
+       if (chain->flags & HAMMER2_CHAIN_INITIAL) {
+               atomic_clear_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
+               wasinitial = 1;
+       } else {
+               wasinitial = 0;
+       }
 
        /*
         * We currently should never instantiate a device buffer for a
@@ -1116,10 +1141,11 @@ skip1:
        KKASSERT(chain->bref.type != HAMMER2_BREF_TYPE_DATA);
 
        /*
-        * Execute COW operation
+        * Instantiate data buffer and possibly execute COW operation
         */
        switch(chain->bref.type) {
        case HAMMER2_BREF_TYPE_VOLUME:
+       case HAMMER2_BREF_TYPE_FREEMAP:
        case HAMMER2_BREF_TYPE_INODE:
                /*
                 * The data is embedded, no copy-on-write operation is
@@ -1129,13 +1155,13 @@ skip1:
                break;
        case HAMMER2_BREF_TYPE_DATA:
        case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
        case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                /*
                 * Perform the copy-on-write operation
                 */
-               KKASSERT(chain != &hmp->vchain);        /* safety */
+               KKASSERT(chain != &hmp->vchain && chain != &hmp->fchain);
+
                /*
                 * The device buffer may be larger than the allocation size.
                 */
@@ -1145,10 +1171,14 @@ skip1:
                boff = chain->bref.data_off & HAMMER2_OFF_MASK & (bbytes - 1);
 
                /*
+                * Buffer aliasing is possible, check for the case.
+                *
                 * The getblk() optimization can only be used if the
                 * physical block size matches the request.
                 */
-               if (chain->bytes == bbytes) {
+               if (chain->bp && chain->bp->b_loffset == pbase) {
+                       nbp = chain->bp;
+               } else if (chain->bytes == bbytes) {
                        nbp = getblk(hmp->devvp, pbase, bbytes, 0, 0);
                        error = 0;
                } else {
@@ -1159,20 +1189,38 @@ skip1:
 
                /*
                 * Copy or zero-fill on write depending on whether
-                * chain->data exists or not.
+                * chain->data exists or not.  Retire the existing bp
+                * based on the DIRTYBP flag.  Set the DIRTYBP flag to
+                * indicate that retirement of nbp should use bdwrite().
                 */
                if (chain->data) {
-                       bcopy(chain->data, bdata, chain->bytes);
                        KKASSERT(chain->bp != NULL);
-               } else {
+                       if (chain->data != bdata) {
+                               bcopy(chain->data, bdata, chain->bytes);
+                       }
+               } else if (wasinitial) {
                        bzero(bdata, chain->bytes);
+               } else {
+                       /*
+                        * We have a problem.  We were asked to COW but
+                        * we don't have any data to COW with!
+                        */
+                       panic("hammer2_chain_modify: having a COW %p\n",
+                             chain);
                }
-               if (chain->bp) {
-                       chain->bp->b_flags |= B_RELBUF;
-                       brelse(chain->bp);
+               if (chain->bp != nbp) {
+                       if (chain->bp) {
+                               if (chain->flags & HAMMER2_CHAIN_DIRTYBP) {
+                                       bdwrite(chain->bp);
+                               } else {
+                                       chain->bp->b_flags |= B_RELBUF;
+                                       brelse(chain->bp);
+                               }
+                       }
+                       chain->bp = nbp;
                }
-               chain->bp = nbp;
                chain->data = bdata;
+               atomic_set_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
                break;
        default:
                panic("hammer2_chain_modify: illegal non-embedded type %d",
@@ -1349,7 +1397,6 @@ retry:
                bref = &parent->data->ipdata.u.blockset.blockref[index];
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                KKASSERT(parent->data != NULL);
                KKASSERT(index >= 0 &&
@@ -1360,6 +1407,10 @@ retry:
                KKASSERT(index >= 0 && index < HAMMER2_SET_COUNT);
                bref = &hmp->voldata.sroot_blockset.blockref[index];
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               KKASSERT(index >= 0 && index < HAMMER2_SET_COUNT);
+               bref = &hmp->voldata.freemap_blockset.blockref[index];
+               break;
        default:
                bref = NULL;
                panic("hammer2_chain_get: unrecognized blockref type: %d",
@@ -1466,7 +1517,9 @@ hammer2_chain_getparent(hammer2_chain_t **parentp, int how)
  *
  * WARNING!  THIS DOES NOT RETURN KEYS IN LOGICAL KEY ORDER!  ANY KEY
  *          WITHIN THE RANGE CAN BE RETURNED.  HOWEVER, AN ITERATION
- *          WHICH PICKS UP WHERE WE LEFT OFF WILL CONTINUE THE SCAN.
+ *          WHICH PICKS UP WHERE WE LEFT OFF WILL CONTINUE THE SCAN
+ *          AND ALL IN-RANGE KEYS WILL EVENTUALLY BE RETURNED (NOT
+ *          NECESSARILY IN ORDER).
  *
  * (*parentp) must be exclusively locked and referenced and can be an inode
  * or an existing indirect block within the inode.
@@ -1485,6 +1538,12 @@ hammer2_chain_getparent(hammer2_chain_t **parentp, int how)
  * This function will also recurse up the chain if the key is not within the
  * current parent's range.  (*parentp) can never be set to NULL.  An iteration
  * can simply allow (*parentp) to float inside the loop.
+ *
+ * NOTE!  chain->data is not always resolved.  By default it will not be
+ *       resolved for BREF_TYPE_DATA, FREEMAP_NODE, or FREEMAP_LEAF.  Use
+ *       HAMMER2_LOOKUP_ALWAYS to force resolution (but be careful w/
+ *       BREF_TYPE_DATA as the device buffer can alias the logical file
+ *       buffer).
  */
 hammer2_chain_t *
 hammer2_chain_lookup(hammer2_chain_t **parentp,
@@ -1504,6 +1563,9 @@ hammer2_chain_lookup(hammer2_chain_t **parentp,
        int how_always = HAMMER2_RESOLVE_ALWAYS;
        int how_maybe = HAMMER2_RESOLVE_MAYBE;
 
+       if (flags & HAMMER2_LOOKUP_ALWAYS)
+               how_maybe = how_always;
+
        if (flags & (HAMMER2_LOOKUP_SHARED | HAMMER2_LOOKUP_NOLOCK)) {
                how_maybe |= HAMMER2_RESOLVE_SHARED;
                how_always |= HAMMER2_RESOLVE_SHARED;
@@ -1550,9 +1612,21 @@ again:
                base = &parent->data->ipdata.u.blockset.blockref[0];
                count = HAMMER2_SET_COUNT;
                break;
-       case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
+       case HAMMER2_BREF_TYPE_INDIRECT:
+               /*
+                * Handle MATCHIND on the parent
+                */
+               if (flags & HAMMER2_LOOKUP_MATCHIND) {
+                       scan_beg = parent->bref.key;
+                       scan_end = scan_beg +
+                              ((hammer2_key_t)1 << parent->bref.keybits) - 1;
+                       if (key_beg == scan_beg && key_end == scan_end) {
+                               chain = parent;
+                               hammer2_chain_lock(chain, how_maybe);
+                               goto done;
+                       }
+               }
                /*
                 * Optimize indirect blocks in the INITIAL state to avoid
                 * I/O.
@@ -1570,6 +1644,10 @@ again:
                base = &hmp->voldata.sroot_blockset.blockref[0];
                count = HAMMER2_SET_COUNT;
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               base = &hmp->voldata.freemap_blockset.blockref[0];
+               count = HAMMER2_SET_COUNT;
+               break;
        default:
                panic("hammer2_chain_lookup: unrecognized blockref type: %d",
                      parent->bref.type);
@@ -1587,6 +1665,9 @@ again:
         *       might represent different flush synchronization points).
         */
        bref = NULL;
+       scan_beg = 0;   /* avoid compiler warning */
+       scan_end = 0;   /* avoid compiler warning */
+
        for (i = 0; i < count; ++i) {
                tmp = hammer2_chain_find(parent, i);
                if (tmp) {
@@ -1636,24 +1717,31 @@ again:
         * so we can access its data.  It might need a fixup if the caller
         * passed incompatible flags.  Be careful not to cause a deadlock
         * as a data-load requires an exclusive lock.
+        *
+        * If HAMMER2_LOOKUP_MATCHIND is set and the indirect block's key
+        * range is within the requested key range we return the indirect
+        * block and do NOT loop.  This is usually only used to acquire
+        * freemap nodes.
         */
        if (chain->bref.type == HAMMER2_BREF_TYPE_INDIRECT ||
            chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE) {
                hammer2_chain_unlock(parent);
                *parentp = parent = chain;
                if (flags & HAMMER2_LOOKUP_NOLOCK) {
-                       hammer2_chain_lock(chain, how_maybe |
-                                                 HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_lock(chain,
+                                          how_maybe |
+                                          HAMMER2_RESOLVE_NOREF);
                } else if ((flags & HAMMER2_LOOKUP_NODATA) &&
                           chain->data == NULL) {
                        hammer2_chain_ref(chain);
                        hammer2_chain_unlock(chain);
-                       hammer2_chain_lock(chain, how_maybe |
-                                                 HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_lock(chain,
+                                          how_maybe |
+                                          HAMMER2_RESOLVE_NOREF);
                }
                goto again;
        }
-
+done:
        /*
         * All done, return the chain
         */
@@ -1670,6 +1758,8 @@ again:
  *
  * parent must be locked on entry and remains locked throughout.  chain's
  * lock status must match flags.  Chain is always at least referenced.
+ *
+ * WARNING!  The MATCHIND flag does not apply to this function.
  */
 hammer2_chain_t *
 hammer2_chain_next(hammer2_chain_t **parentp, hammer2_chain_t *chain,
@@ -1749,7 +1839,6 @@ again2:
                count = HAMMER2_SET_COUNT;
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                if (parent->flags & HAMMER2_CHAIN_INITIAL) {
                        base = NULL;
@@ -1763,6 +1852,10 @@ again2:
                base = &hmp->voldata.sroot_blockset.blockref[0];
                count = HAMMER2_SET_COUNT;
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               base = &hmp->voldata.freemap_blockset.blockref[0];
+               count = HAMMER2_SET_COUNT;
+               break;
        default:
                panic("hammer2_chain_next: unrecognized blockref type: %d",
                      parent->bref.type);
@@ -1784,6 +1877,9 @@ again2:
         *       might represent different flush synchronization points).
         */
        bref = NULL;
+       scan_beg = 0;   /* avoid compiler warning */
+       scan_end = 0;   /* avoid compiler warning */
+
        while (i < count) {
                tmp = hammer2_chain_find(parent, i);
                if (tmp) {
@@ -1831,24 +1927,34 @@ again2:
         * so we can access its data.  It might need a fixup if the caller
         * passed incompatible flags.  Be careful not to cause a deadlock
         * as a data-load requires an exclusive lock.
+        *
+        * If HAMMER2_LOOKUP_MATCHIND is set and the indirect block's key
+        * range is within the requested key range we return the indirect
+        * block and do NOT loop.  This is usually only used to acquire
+        * freemap nodes.
         */
        if (chain->bref.type == HAMMER2_BREF_TYPE_INDIRECT ||
            chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE) {
-               hammer2_chain_unlock(parent);
-               *parentp = parent = chain;
-               chain = NULL;
-               if (flags & HAMMER2_LOOKUP_NOLOCK) {
-                       hammer2_chain_lock(parent, how_maybe |
-                                                  HAMMER2_RESOLVE_NOREF);
-               } else if ((flags & HAMMER2_LOOKUP_NODATA) &&
-                          parent->data == NULL) {
-                       hammer2_chain_ref(parent);
+               if ((flags & HAMMER2_LOOKUP_MATCHIND) == 0 ||
+                   key_beg > scan_beg || key_end < scan_end) {
                        hammer2_chain_unlock(parent);
-                       hammer2_chain_lock(parent, how_maybe |
+                       *parentp = parent = chain;
+                       chain = NULL;
+                       if (flags & HAMMER2_LOOKUP_NOLOCK) {
+                               hammer2_chain_lock(parent,
+                                                  how_maybe |
+                                                  HAMMER2_RESOLVE_NOREF);
+                       } else if ((flags & HAMMER2_LOOKUP_NODATA) &&
+                                  parent->data == NULL) {
+                               hammer2_chain_ref(parent);
+                               hammer2_chain_unlock(parent);
+                               hammer2_chain_lock(parent,
+                                                  how_maybe |
                                                   HAMMER2_RESOLVE_NOREF);
+                       }
+                       i = 0;
+                       goto again2;
                }
-               i = 0;
-               goto again2;
        }
 
        /*
@@ -1878,7 +1984,8 @@ again2:
  * Caller must pass-in an exclusively locked parent the new chain is to
  * be inserted under, and optionally pass-in a disconnected, exclusively
  * locked chain to insert (else we create a new chain).  The function will
- * adjust (*parentp) as necessary and return the existing or new chain.
+ * adjust (*parentp) as necessary, create or connect the chain, and
+ * return an exclusively locked chain in *chainp.
  */
 int
 hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
@@ -1905,17 +2012,20 @@ hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
        if (chain == NULL) {
                /*
                 * First allocate media space and construct the dummy bref,
-                * then allocate the in-memory chain structure.
+                * then allocate the in-memory chain structure.  Set the
+                * INITIAL flag for fresh chains.
                 */
                bzero(&dummy, sizeof(dummy));
                dummy.type = type;
                dummy.key = key;
                dummy.keybits = keybits;
-               dummy.data_off = hammer2_allocsize(bytes);
+               dummy.data_off = hammer2_getradix(bytes);
                dummy.methods = parent->bref.methods;
                chain = hammer2_chain_alloc(hmp, trans, &dummy);
                hammer2_chain_core_alloc(chain, NULL);
 
+               atomic_set_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
+
                /*
                 * Lock the chain manually, chain_lock will load the chain
                 * which we do NOT want to do.  (note: chain->refs is set
@@ -1938,6 +2048,7 @@ hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
 
                switch(type) {
                case HAMMER2_BREF_TYPE_VOLUME:
+               case HAMMER2_BREF_TYPE_FREEMAP:
                        panic("hammer2_chain_create: called with volume type");
                        break;
                case HAMMER2_BREF_TYPE_INODE:
@@ -1949,7 +2060,6 @@ hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
                        panic("hammer2_chain_create: cannot be used to"
                              "create indirect block");
                        break;
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                        panic("hammer2_chain_create: cannot be used to"
                              "create freemap root or node");
@@ -1963,7 +2073,9 @@ hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
                }
        } else {
                /*
-                * Potentially update the chain's key/keybits.
+                * Potentially update the existing chain's key/keybits.
+                *
+                * Do NOT mess with the current state of the INITIAL flag.
                 */
                chain->bref.key = key;
                chain->bref.keybits = keybits;
@@ -1985,7 +2097,6 @@ again:
                count = HAMMER2_SET_COUNT;
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                if (parent->flags & HAMMER2_CHAIN_INITIAL) {
                        base = NULL;
@@ -2000,6 +2111,11 @@ again:
                base = &hmp->voldata.sroot_blockset.blockref[0];
                count = HAMMER2_SET_COUNT;
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               KKASSERT(parent->data != NULL);
+               base = &hmp->voldata.freemap_blockset.blockref[0];
+               count = HAMMER2_SET_COUNT;
+               break;
        default:
                panic("hammer2_chain_create: unrecognized blockref type: %d",
                      parent->bref.type);
@@ -2045,7 +2161,7 @@ again:
 
                nparent = hammer2_chain_create_indirect(trans, parent,
                                                        key, keybits,
-                                                       &error);
+                                                       type, &error);
                if (nparent == NULL) {
                        if (allocated)
                                hammer2_chain_drop(chain);
@@ -2077,47 +2193,40 @@ again:
        atomic_set_int(&chain->flags, HAMMER2_CHAIN_ONRBTREE);
        spin_unlock(&above->cst.spin);
 
-       /*
-        * (allocated) indicates that this is a newly-created chain element
-        * rather than a renamed chain element.
-        *
-        * In this situation we want to place the chain element in
-        * the MODIFIED state.  The caller expects it to NOT be in the
-        * INITIAL state.
-        *
-        * The data area will be set up as follows:
-        *
-        *      VOLUME          not allowed here.
-        *
-        *      INODE           embedded data are will be set-up.
-        *
-        *      INDIRECT        not allowed here.
-        *
-        *      DATA            no data area will be set-up (caller is expected
-        *                      to have logical buffers, we don't want to alias
-        *                      the data onto device buffers!).
-        */
        if (allocated) {
+               /*
+                * Mark the newly created chain modified.
+                *
+                * Device buffers are not instantiated for DATA elements
+                * as these are handled by logical buffers.
+                *
+                * Indirect and freemap node indirect blocks are handled
+                * by hammer2_chain_create_indirect() and not by this
+                * function.
+                *
+                * Data for all other bref types is expected to be
+                * instantiated (INODE, LEAF).
+                */
                switch(chain->bref.type) {
                case HAMMER2_BREF_TYPE_DATA:
-               case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                        hammer2_chain_modify(trans, &chain,
                                             HAMMER2_MODIFY_OPTDATA |
                                             HAMMER2_MODIFY_ASSERTNOCOPY);
                        break;
-               case HAMMER2_BREF_TYPE_INDIRECT:
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
-               case HAMMER2_BREF_TYPE_FREEMAP_NODE:
-                       /* not supported in this function */
-                       panic("hammer2_chain_create: bad type");
-                       atomic_set_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
+               case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
+               case HAMMER2_BREF_TYPE_INODE:
                        hammer2_chain_modify(trans, &chain,
-                                            HAMMER2_MODIFY_OPTDATA |
                                             HAMMER2_MODIFY_ASSERTNOCOPY);
                        break;
                default:
-                       hammer2_chain_modify(trans, &chain,
-                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+                       /*
+                        * Remaining types are not supported by this function.
+                        * In particular, INDIRECT and LEAF_NODE types are
+                        * handled by create_indirect().
+                        */
+                       panic("hammer2_chain_create: bad type: %d",
+                             chain->bref.type);
+                       /* NOT REACHED */
                        break;
                }
        } else {
@@ -2158,6 +2267,9 @@ done:
  *      new chain will have the same transaction id and thus the operation
  *      appears atomic w/regards to media flushes.
  */
+static void hammer2_chain_dup_inodefixup(hammer2_chain_t *ochain,
+                                        hammer2_chain_t *nchain);
+
 void
 hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
                        hammer2_chain_t **chainp, hammer2_blockref_t *bref)
@@ -2168,135 +2280,35 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
        hammer2_chain_t *nchain;
        hammer2_chain_t *scan;
        hammer2_chain_core_t *above;
-       hammer2_chain_core_t *core;
        size_t bytes;
        int count;
+       int oflags;
+       void *odata;
 
        /*
         * First create a duplicate of the chain structure, associating
         * it with the same core, making it the same size, pointing it
-        * to the same bref (the same media block), and copying any inline
-        * data.
+        * to the same bref (the same media block).
         */
        ochain = *chainp;
        if (bref == NULL)
                bref = &ochain->bref;
        nchain = hammer2_chain_alloc(hmp, trans, bref);
        hammer2_chain_core_alloc(nchain, ochain->core);
-       core = ochain->core;
-
        bytes = (hammer2_off_t)1 <<
                (int)(bref->data_off & HAMMER2_OFF_MASK_RADIX);
        nchain->bytes = bytes;
+       nchain->modify_tid = ochain->modify_tid;
 
-       /*
-        * Be sure to copy the INITIAL flag as well or we could end up
-        * loading garbage from the bref.
-        */
-       if (ochain->flags & HAMMER2_CHAIN_INITIAL)
-               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_INITIAL);
-       if (ochain->flags & HAMMER2_CHAIN_DIRTYBP)
-               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_DIRTYBP);
-
-       /*
-        * If the old chain is modified the new one must be too,
-        * but we only want to allocate a new bref.
-        */
-       if (ochain->flags & HAMMER2_CHAIN_MODIFIED) {
-               /*
-                * When duplicating chains the MODIFIED state is inherited.
-                * A new bref typically must be allocated.  However, file
-                * data chains may already have the data offset assigned
-                * to a logical buffer cache buffer so we absolutely cannot
-                * allocate a new bref here for TYPE_DATA.
-                *
-                * Basically the flusher core only dumps media topology
-                * and meta-data, not file data.  The VOP_FSYNC code deals
-                * with the file data.  XXX need back-pointer to inode.
-                */
-               if (nchain->bref.type == HAMMER2_BREF_TYPE_DATA) {
-                       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_MODIFIED);
-                       hammer2_chain_ref(nchain);
-               } else {
-                       hammer2_chain_modify(trans, &nchain,
-                                            HAMMER2_MODIFY_OPTDATA |
-                                            HAMMER2_MODIFY_ASSERTNOCOPY);
-               }
-       } else if (nchain->flags & HAMMER2_CHAIN_INITIAL) {
-               /*
-                * When duplicating chains in the INITITAL state we need
-                * to ensure that the chain is marked modified so a
-                * block is properly assigned to it, otherwise the MOVED
-                * bit won't do the right thing.
-                */
-               KKASSERT (nchain->bref.type != HAMMER2_BREF_TYPE_DATA);
-               hammer2_chain_modify(trans, &nchain,
-                                    HAMMER2_MODIFY_OPTDATA |
-                                    HAMMER2_MODIFY_ASSERTNOCOPY);
-       }
-       if (parent || (ochain->flags & HAMMER2_CHAIN_MOVED)) {
-               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_MOVED);
-               hammer2_chain_ref(nchain);
-       }
-       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_SUBMODIFIED);
-
-       switch(nchain->bref.type) {
-       case HAMMER2_BREF_TYPE_VOLUME:
-               panic("hammer2_chain_duplicate: cannot be called w/volhdr");
-               break;
-       case HAMMER2_BREF_TYPE_INODE:
-               KKASSERT(bytes == HAMMER2_INODE_BYTES);
-               if (ochain->data) {
-                       nchain->data = kmalloc(sizeof(nchain->data->ipdata),
-                                             hmp->minode, M_WAITOK | M_ZERO);
-                       nchain->data->ipdata = ochain->data->ipdata;
-               }
-               break;
-       case HAMMER2_BREF_TYPE_INDIRECT:
-               if ((nchain->flags & HAMMER2_CHAIN_MODIFIED) &&
-                   nchain->data) {
-                       bcopy(ochain->data, nchain->data,
-                             nchain->bytes);
-               }
-               break;
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
-       case HAMMER2_BREF_TYPE_FREEMAP_NODE:
-               panic("hammer2_chain_duplicate: cannot be used to"
-                     "create a freemap root or node");
-               break;
-       case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
-       case HAMMER2_BREF_TYPE_DATA:
-       default:
-               if ((nchain->flags & HAMMER2_CHAIN_MODIFIED) &&
-                   nchain->data) {
-                       bcopy(ochain->data, nchain->data,
-                             nchain->bytes);
-               }
-               /* leave chain->data NULL */
-               KKASSERT(nchain->data == NULL);
-               break;
-       }
-
-       /*
-        * Unmodified duplicated blocks may have the same bref, we
-        * must be careful to avoid buffer cache deadlocks so we
-        * unlock the old chain before resolving the new one.
-        *
-        * Insert nchain at the end of the duplication list.
-        */
        hammer2_chain_lock(nchain, HAMMER2_RESOLVE_NEVER);
-       /* extra ref still present from original allocation */
-
-       hammer2_chain_unlock(ochain);
-       *chainp = nchain;
-       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_MAYBE |
-                                  HAMMER2_RESOLVE_NOREF); /* eat excess ref */
-       hammer2_chain_unlock(nchain);
+       hammer2_chain_dup_inodefixup(ochain, nchain);
 
        /*
         * If parent is not NULL, insert into the parent at the requested
         * index.  The newly duplicated chain must be marked MOVED and
         * SUBMODIFIED set in its parent(s).
+        *
+        * Having both chains locked is extremely important for atomicy.
         */
        if (parent) {
                /*
@@ -2314,7 +2326,6 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
                        count = HAMMER2_SET_COUNT;
                        break;
                case HAMMER2_BREF_TYPE_INDIRECT:
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                        if (parent->flags & HAMMER2_CHAIN_INITIAL) {
                                base = NULL;
@@ -2329,6 +2340,11 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
                        base = &hmp->voldata.sroot_blockset.blockref[0];
                        count = HAMMER2_SET_COUNT;
                        break;
+               case HAMMER2_BREF_TYPE_FREEMAP:
+                       KKASSERT(parent->data != NULL);
+                       base = &hmp->voldata.freemap_blockset.blockref[0];
+                       count = HAMMER2_SET_COUNT;
+                       break;
                default:
                        panic("hammer2_chain_create: unrecognized "
                              "blockref type: %d",
@@ -2361,8 +2377,75 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
                }
                hammer2_chain_setsubmod(trans, nchain);
        }
+
+       /*
+        * We have to unlock ochain to flush any dirty data, asserting the
+        * case (data == NULL) to catch any extra locks that might have been
+        * present, then transfer state to nchain.
+        */
+       oflags = ochain->flags;
+       odata = ochain->data;
+       hammer2_chain_unlock(ochain);
+       KKASSERT(ochain->bref.type == HAMMER2_BREF_TYPE_INODE ||
+                ochain->data == NULL);
+
+       if (oflags & HAMMER2_CHAIN_INITIAL)
+               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_INITIAL);
+
+       /*
+        * WARNING!  We should never resolve DATA to device buffers
+        *           (XXX allow it if the caller did?), and since
+        *           we currently do not have the logical buffer cache
+        *           buffer in-hand to fix its cached physical offset
+        *           we also force the modify code to not COW it. XXX
+        */
+       if (oflags & HAMMER2_CHAIN_MODIFIED) {
+               if (nchain->bref.type == HAMMER2_BREF_TYPE_DATA) {
+                       hammer2_chain_modify(trans, &nchain,
+                                            HAMMER2_MODIFY_OPTDATA |
+                                            HAMMER2_MODIFY_NOREALLOC |
+                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+               } else if (oflags & HAMMER2_CHAIN_INITIAL) {
+                       hammer2_chain_modify(trans, &nchain,
+                                            HAMMER2_MODIFY_OPTDATA |
+                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+               } else {
+                       hammer2_chain_modify(trans, &nchain,
+                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+               }
+               hammer2_chain_drop(nchain);
+       } else {
+               if (nchain->bref.type == HAMMER2_BREF_TYPE_DATA) {
+                       hammer2_chain_drop(nchain);
+               } else if (oflags & HAMMER2_CHAIN_INITIAL) {
+                       hammer2_chain_drop(nchain);
+               } else {
+                       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_ALWAYS |
+                                                  HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_unlock(nchain);
+               }
+       }
+       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_SUBMODIFIED);
+       *chainp = nchain;
 }
 
+#if 0
+               /*
+                * When the chain is in the INITIAL state we must still
+                * ensure that a block has been assigned so MOVED processing
+                * works as expected.
+                */
+               KKASSERT (nchain->bref.type != HAMMER2_BREF_TYPE_DATA);
+               hammer2_chain_modify(trans, &nchain,
+                                    HAMMER2_MODIFY_OPTDATA |
+                                    HAMMER2_MODIFY_ASSERTNOCOPY);
+
+
+       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_MAYBE |
+                                  HAMMER2_RESOLVE_NOREF); /* eat excess ref */
+       hammer2_chain_unlock(nchain);
+#endif
+
 /*
  * Special in-place delete-duplicate sequence which does not require a
  * locked parent.  (*chainp) is marked DELETED and atomically replaced
@@ -2378,12 +2461,11 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        hammer2_chain_t *nchain;
        hammer2_chain_core_t *above;
        size_t bytes;
+       int oflags;
+       void *odata;
 
        /*
-        * First create a duplicate of the chain structure, associating
-        * it with the same core, making it the same size, pointing it
-        * to the same bref (the same media block), and copying any inline
-        * data.
+        * First create a duplicate of the chain structure
         */
        ochain = *chainp;
        nchain = hammer2_chain_alloc(hmp, trans, &ochain->bref);    /* 1 ref */
@@ -2396,52 +2478,79 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        bytes = (hammer2_off_t)1 <<
                (int)(ochain->bref.data_off & HAMMER2_OFF_MASK_RADIX);
        nchain->bytes = bytes;
+       nchain->modify_tid = ochain->modify_tid;
 
        /*
-        * Be sure to copy the INITIAL flag as well or we could end up
-        * loading garbage from the bref.
+        * Lock nchain and insert into ochain's core hierarchy, marking
+        * ochain DELETED at the same time.  Having both chains locked
+        * is extremely important for atomicy.
         */
-       if (ochain->flags & HAMMER2_CHAIN_INITIAL)
+       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_NEVER);
+       hammer2_chain_dup_inodefixup(ochain, nchain);
+       /* extra ref still present from original allocation */
+
+       nchain->index = ochain->index;
+
+       spin_lock(&above->cst.spin);
+       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_ONRBTREE);
+       ochain->delete_tid = trans->sync_tid;
+       nchain->above = above;
+       atomic_set_int(&ochain->flags, HAMMER2_CHAIN_DELETED);
+       if ((ochain->flags & HAMMER2_CHAIN_MOVED) == 0) {
+               hammer2_chain_ref(ochain);
+               atomic_set_int(&ochain->flags, HAMMER2_CHAIN_MOVED);
+       }
+       if (RB_INSERT(hammer2_chain_tree, &above->rbtree, nchain)) {
+               panic("hammer2_chain_delete_duplicate: collision");
+       }
+       spin_unlock(&above->cst.spin);
+
+       /*
+        * We have to unlock ochain to flush any dirty data, asserting the
+        * case (data == NULL) to catch any extra locks that might have been
+        * present, then transfer state to nchain.
+        */
+       oflags = ochain->flags;
+       odata = ochain->data;
+       hammer2_chain_unlock(ochain);   /* replacing ochain */
+       KKASSERT(ochain->bref.type == HAMMER2_BREF_TYPE_INODE ||
+                ochain->data == NULL);
+
+       if (oflags & HAMMER2_CHAIN_INITIAL)
                atomic_set_int(&nchain->flags, HAMMER2_CHAIN_INITIAL);
-       if (ochain->flags & HAMMER2_CHAIN_DIRTYBP)
-               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_DIRTYBP);
 
        /*
-        * If the old chain is modified the new one must be too,
-        * but we only want to allocate a new bref.
+        * WARNING!  We should never resolve DATA to device buffers
+        *           (XXX allow it if the caller did?), and since
+        *           we currently do not have the logical buffer cache
+        *           buffer in-hand to fix its cached physical offset
+        *           we also force the modify code to not COW it. XXX
         */
-       if (ochain->flags & HAMMER2_CHAIN_MODIFIED) {
-               /*
-                * When duplicating chains the MODIFIED state is inherited.
-                * A new bref typically must be allocated.  However, file
-                * data chains may already have the data offset assigned
-                * to a logical buffer cache buffer so we absolutely cannot
-                * allocate a new bref here for TYPE_DATA.
-                *
-                * Basically the flusher core only dumps media topology
-                * and meta-data, not file data.  The VOP_FSYNC code deals
-                * with the file data.  XXX need back-pointer to inode.
-                */
+       if (oflags & HAMMER2_CHAIN_MODIFIED) {
                if (nchain->bref.type == HAMMER2_BREF_TYPE_DATA) {
-                       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_MODIFIED);
-                       hammer2_chain_ref(nchain);
-                       nchain->modify_tid = trans->sync_tid;
-               } else {
                        hammer2_chain_modify(trans, &nchain,
                                             HAMMER2_MODIFY_OPTDATA |
+                                            HAMMER2_MODIFY_NOREALLOC |
                                             HAMMER2_MODIFY_ASSERTNOCOPY);
-               }
-       } else if (nchain->flags & HAMMER2_CHAIN_INITIAL) {
-               /*
-                * When duplicating chains in the INITITAL state we need
-                * to ensure that the chain is marked modified so a
-                * block is properly assigned to it, otherwise the MOVED
-                * bit won't do the right thing.
-                */
-               KKASSERT (nchain->bref.type != HAMMER2_BREF_TYPE_DATA);
-               hammer2_chain_modify(trans, &nchain,
-                                    HAMMER2_MODIFY_OPTDATA |
-                                    HAMMER2_MODIFY_ASSERTNOCOPY);
+               } else if (oflags & HAMMER2_CHAIN_INITIAL) {
+                       hammer2_chain_modify(trans, &nchain,
+                                            HAMMER2_MODIFY_OPTDATA |
+                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+               } else {
+                       hammer2_chain_modify(trans, &nchain,
+                                            HAMMER2_MODIFY_ASSERTNOCOPY);
+               }
+               hammer2_chain_drop(nchain);
+       } else {
+               if (nchain->bref.type == HAMMER2_BREF_TYPE_DATA) {
+                       hammer2_chain_drop(nchain);
+               } else if (oflags & HAMMER2_CHAIN_INITIAL) {
+                       hammer2_chain_drop(nchain);
+               } else {
+                       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_ALWAYS |
+                                                  HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_unlock(nchain);
+               }
        }
 
        /*
@@ -2453,88 +2562,32 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
                hammer2_chain_ref(nchain);
        }
        atomic_set_int(&nchain->flags, HAMMER2_CHAIN_SUBMODIFIED);
-
-       /*
-        * Copy media contents as needed.
-        */
-       switch(nchain->bref.type) {
-       case HAMMER2_BREF_TYPE_VOLUME:
-               panic("hammer2_chain_duplicate: cannot be called w/volhdr");
-               break;
-       case HAMMER2_BREF_TYPE_INODE:
-               KKASSERT(bytes == HAMMER2_INODE_BYTES);
-               if (ochain->data) {
-                       nchain->data = kmalloc(sizeof(nchain->data->ipdata),
-                                             hmp->minode, M_WAITOK | M_ZERO);
-                       nchain->data->ipdata = ochain->data->ipdata;
-               }
-               break;
-       case HAMMER2_BREF_TYPE_INDIRECT:
-               if ((nchain->flags & HAMMER2_CHAIN_MODIFIED) &&
-                   nchain->data) {
-                       bcopy(ochain->data, nchain->data,
-                             nchain->bytes);
-               }
-               break;
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
-       case HAMMER2_BREF_TYPE_FREEMAP_NODE:
-               panic("hammer2_chain_duplicate: cannot be used to"
-                     "create a freemap root or node");
-               break;
-       case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
-       case HAMMER2_BREF_TYPE_DATA:
-       default:
-               if ((nchain->flags & HAMMER2_CHAIN_MODIFIED) &&
-                   nchain->data) {
-                       bcopy(ochain->data, nchain->data,
-                             nchain->bytes);
-               }
-               /* leave chain->data NULL */
-               KKASSERT(nchain->data == NULL);
-               break;
-       }
-
-       /*
-        * Both chains must be locked for us to be able to set the
-        * duplink.  The caller may expect valid data.
-        *
-        * Unmodified duplicated blocks may have the same bref, we
-        * must be careful to avoid buffer cache deadlocks so we
-        * unlock the old chain before resolving the new one.
-        *
-        * Insert nchain at the end of the duplication list.
-        */
-       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_NEVER);
-       /* extra ref still present from original allocation */
-
-       nchain->index = ochain->index;
-
-       spin_lock(&above->cst.spin);
-       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_ONRBTREE);
-       ochain->delete_tid = trans->sync_tid;
-       nchain->above = above;
-       atomic_set_int(&ochain->flags, HAMMER2_CHAIN_DELETED);
-       if ((ochain->flags & HAMMER2_CHAIN_MOVED) == 0) {
-               hammer2_chain_ref(ochain);
-               atomic_set_int(&ochain->flags, HAMMER2_CHAIN_MOVED);
-       }
-       if (RB_INSERT(hammer2_chain_tree, &above->rbtree, nchain)) {
-               panic("hammer2_chain_delete_duplicate: collision");
-       }
-       spin_unlock(&above->cst.spin);
-
-       /*
-        * Cleanup.  Also note that nchain must be re-resolved to ensure
-        * that it's data is resolved because we locked it RESOLVE_NEVER
-        * up above.
-        */
-       *chainp = nchain;               /* inherits locked */
-       hammer2_chain_unlock(ochain);   /* replacing ochain */
-       hammer2_chain_lock(nchain, HAMMER2_RESOLVE_MAYBE |
-                                  HAMMER2_RESOLVE_NOREF); /* excess ref */
-       hammer2_chain_unlock(nchain);
-
        hammer2_chain_setsubmod(trans, nchain);
+       *chainp = nchain;
+}
+
+/*
+ * Helper function to fixup inodes.  The caller procedure stack may hold
+ * multiple locks on ochain if it represents an inode, preventing our
+ * unlock from retiring its state to the buffer cache.
+ *
+ * In this situation any attempt to access the buffer cache could result
+ * either in stale data or a deadlock.  Work around the problem by copying
+ * the embedded data directly.
+ */
+static
+void
+hammer2_chain_dup_inodefixup(hammer2_chain_t *ochain, hammer2_chain_t *nchain)
+{
+       if (ochain->bref.type != HAMMER2_BREF_TYPE_INODE)
+               return;
+       if (ochain->data == NULL)
+               return;
+       KKASSERT(nchain->bref.type == HAMMER2_BREF_TYPE_INODE);
+       KKASSERT(nchain->data == NULL);
+       nchain->data = kmalloc(sizeof(nchain->data->ipdata),
+                              ochain->hmp->minode, M_WAITOK | M_ZERO);
+       nchain->data->ipdata = ochain->data->ipdata;
 }
 
 /*
@@ -2690,11 +2743,17 @@ hammer2_chain_snapshot(hammer2_trans_t *trans, hammer2_inode_t *ip,
  *
  * Must be called with an exclusively locked parent.
  */
+static int hammer2_chain_indkey_freemap(hammer2_chain_t *parent,
+                               hammer2_key_t *keyp, int keybits,
+                               hammer2_blockref_t *base, int count);
+static int hammer2_chain_indkey_normal(hammer2_chain_t *parent,
+                               hammer2_key_t *keyp, int keybits,
+                               hammer2_blockref_t *base, int count);
 static
 hammer2_chain_t *
 hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                              hammer2_key_t create_key, int create_bits,
-                             int *errorp)
+                             int for_type, int *errorp)
 {
        hammer2_mount_t *hmp = trans->hmp;
        hammer2_chain_core_t *above;
@@ -2707,8 +2766,6 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        hammer2_chain_t dummy;
        hammer2_key_t key = create_key;
        int keybits = create_bits;
-       int locount = 0;
-       int hicount = 0;
        int count;
        int nbytes;
        int i;
@@ -2731,13 +2788,15 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                        count = HAMMER2_SET_COUNT;
                        break;
                case HAMMER2_BREF_TYPE_INDIRECT:
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                        count = parent->bytes / sizeof(hammer2_blockref_t);
                        break;
                case HAMMER2_BREF_TYPE_VOLUME:
                        count = HAMMER2_SET_COUNT;
                        break;
+               case HAMMER2_BREF_TYPE_FREEMAP:
+                       count = HAMMER2_SET_COUNT;
+                       break;
                default:
                        panic("hammer2_chain_create_indirect: "
                              "unrecognized blockref type: %d",
@@ -2752,7 +2811,6 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                        count = HAMMER2_SET_COUNT;
                        break;
                case HAMMER2_BREF_TYPE_INDIRECT:
-               case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                        base = &parent->data->npdata.blockref[0];
                        count = parent->bytes / sizeof(hammer2_blockref_t);
@@ -2761,6 +2819,10 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                        base = &hmp->voldata.sroot_blockset.blockref[0];
                        count = HAMMER2_SET_COUNT;
                        break;
+               case HAMMER2_BREF_TYPE_FREEMAP:
+                       base = &hmp->voldata.freemap_blockset.blockref[0];
+                       count = HAMMER2_SET_COUNT;
+                       break;
                default:
                        panic("hammer2_chain_create_indirect: "
                              "unrecognized blockref type: %d",
@@ -2771,112 +2833,35 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        }
 
        /*
-        * Scan for an unallocated bref, also skipping any slots occupied
-        * by in-memory chain elements which may not yet have been updated
-        * in the parent's bref array.
-        *
-        * Deleted elements are ignored.
+        * dummy used in later chain allocation (no longer used for lookups).
         */
        bzero(&dummy, sizeof(dummy));
        dummy.delete_tid = HAMMER2_MAX_TID;
 
-       spin_lock(&above->cst.spin);
-       for (i = 0; i < count; ++i) {
-               int nkeybits;
-
-               child = hammer2_chain_find_locked(parent, i);
-               if (child) {
-                       if (child->flags & HAMMER2_CHAIN_DELETED)
-                               continue;
-                       bref = &child->bref;
-               } else if (base && base[i].type) {
-                       bref = &base[i];
-               } else {
-                       continue;
-               }
-
-               /*
-                * Expand our calculated key range (key, keybits) to fit
-                * the scanned key.  nkeybits represents the full range
-                * that we will later cut in half (two halves @ nkeybits - 1).
-                */
-               nkeybits = keybits;
-               if (nkeybits < bref->keybits) {
-                       if (bref->keybits > 64) {
-                               kprintf("bad bref index %d chain %p bref %p\n", i, chain, bref);
-                               Debugger("fubar");
-                       }
-                       nkeybits = bref->keybits;
-               }
-               while (nkeybits < 64 &&
-                      (~(((hammer2_key_t)1 << nkeybits) - 1) &
-                       (key ^ bref->key)) != 0) {
-                       ++nkeybits;
-               }
-
-               /*
-                * If the new key range is larger we have to determine
-                * which side of the new key range the existing keys fall
-                * under by checking the high bit, then collapsing the
-                * locount into the hicount or vise-versa.
-                */
-               if (keybits != nkeybits) {
-                       if (((hammer2_key_t)1 << (nkeybits - 1)) & key) {
-                               hicount += locount;
-                               locount = 0;
-                       } else {
-                               locount += hicount;
-                               hicount = 0;
-                       }
-                       keybits = nkeybits;
-               }
-
-               /*
-                * The newly scanned key will be in the lower half or the
-                * higher half of the (new) key range.
-                */
-               if (((hammer2_key_t)1 << (nkeybits - 1)) & bref->key)
-                       ++hicount;
-               else
-                       ++locount;
-       }
-       spin_unlock(&above->cst.spin);
-       bref = NULL;    /* now invalid (safety) */
-
        /*
-        * Adjust keybits to represent half of the full range calculated
-        * above (radix 63 max)
-        */
-       --keybits;
+        * When creating an indirect block for a freemap node or leaf
+        * the key/keybits must be fitted to static radix levels because
+        * particular radix levels use particular reserved blocks in the
+        * related zone.
+        *
+        * This routine calculates the key/radix of the indirect block
+        * we need to create, and whether it is on the high-side or the
+        * low-side.
+        */
+       if (for_type == HAMMER2_BREF_TYPE_FREEMAP_NODE ||
+           for_type == HAMMER2_BREF_TYPE_FREEMAP_LEAF) {
+               keybits = hammer2_chain_indkey_freemap(parent, &key, keybits,
+                                                      base, count);
+       } else {
+               keybits = hammer2_chain_indkey_normal(parent, &key, keybits,
+                                                     base, count);
+       }
 
        /*
-        * Select whichever half contains the most elements.  Theoretically
-        * we can select either side as long as it contains at least one
-        * element (in order to ensure that a free slot is present to hold
-        * the indirect block).
+        * Normalize the key for the radix being represented, keeping the
+        * high bits and throwing away the low bits.
         */
        key &= ~(((hammer2_key_t)1 << keybits) - 1);
-       if (hammer2_indirect_optimize) {
-               /*
-                * Insert node for least number of keys, this will arrange
-                * the first few blocks of a large file or the first few
-                * inodes in a directory with fewer indirect blocks when
-                * created linearly.
-                */
-               if (hicount < locount && hicount != 0)
-                       key |= (hammer2_key_t)1 << keybits;
-               else
-                       key &= ~(hammer2_key_t)1 << keybits;
-       } else {
-               /*
-                * Insert node for most number of keys, best for heavily
-                * fragmented files.
-                */
-               if (hicount > locount)
-                       key |= (hammer2_key_t)1 << keybits;
-               else
-                       key &= ~(hammer2_key_t)1 << keybits;
-       }
 
        /*
         * How big should our new indirect block be?  It has to be at least
@@ -2892,18 +2877,15 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        /*
         * Ok, create our new indirect block
         */
-       switch(parent->bref.type) {
-       case HAMMER2_BREF_TYPE_FREEMAP_ROOT:
-       case HAMMER2_BREF_TYPE_FREEMAP_NODE:
+       if (for_type == HAMMER2_BREF_TYPE_FREEMAP_NODE ||
+           for_type == HAMMER2_BREF_TYPE_FREEMAP_LEAF) {
                dummy.bref.type = HAMMER2_BREF_TYPE_FREEMAP_NODE;
-               break;
-       default:
+       } else {
                dummy.bref.type = HAMMER2_BREF_TYPE_INDIRECT;
-               break;
        }
        dummy.bref.key = key;
        dummy.bref.keybits = keybits;
-       dummy.bref.data_off = hammer2_allocsize(nbytes);
+       dummy.bref.data_off = hammer2_getradix(nbytes);
        dummy.bref.methods = parent->bref.methods;
 
        ichain = hammer2_chain_alloc(hmp, trans, &dummy.bref);
@@ -2917,6 +2899,9 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
         * We have to mark it modified to allocate its block, but use
         * OPTDATA to allow it to remain in the INITIAL state.  Otherwise
         * it won't be acted upon by the flush code.
+        *
+        * XXX leave the node unmodified, depend on the SUBMODIFIED
+        * flush to assign and modify parent blocks.
         */
        hammer2_chain_modify(trans, &ichain, HAMMER2_MODIFY_OPTDATA);
 
@@ -2924,6 +2909,9 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
         * Iterate the original parent and move the matching brefs into
         * the new indirect block.
         *
+        * At the same time locate an empty slot (or what will become an
+        * empty slot) and assign the new indirect block to that slot.
+        *
         * XXX handle flushes.
         */
        spin_lock(&above->cst.spin);
@@ -2952,9 +2940,8 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                }
 
                /*
-                * Skip keys not in the chosen half (low or high), only bit
-                * (keybits - 1) needs to be compared but for safety we
-                * will compare all msb bits plus that bit again.
+                * Skip keys that are not within the key/radix of the new
+                * indirect block.  They stay in the parent.
                 */
                if ((~(((hammer2_key_t)1 << keybits) - 1) &
                    (key ^ bref->key)) != 0) {
@@ -2988,7 +2975,6 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                chain = hammer2_chain_get(parent, i, HAMMER2_LOOKUP_NODATA);
                hammer2_chain_delete(trans, chain);
                hammer2_chain_duplicate(trans, ichain, i, &chain, NULL);
-
                hammer2_chain_unlock(chain);
                KKASSERT(parent->refs > 0);
                chain = NULL;
@@ -3039,13 +3025,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        /*
         * Figure out what to return.
         */
-       if (create_bits > keybits) {
-               /*
-                * Key being created is way outside the key range,
-                * return the original parent.
-                */
-               hammer2_chain_unlock(ichain);
-       } else if (~(((hammer2_key_t)1 << keybits) - 1) &
+       if (~(((hammer2_key_t)1 << keybits) - 1) &
                   (create_key ^ key)) {
                /*
                 * Key being created is outside the key range,
@@ -3063,6 +3043,219 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        return(parent);
 }
 
+/*
+ * Calculate the keybits and highside/lowside of the freemap node the
+ * caller is creating.
+ *
+ * This routine will specify the next higher-level freemap key/radix
+ * representing the lowest-ordered set.  By doing so, eventually all
+ * low-ordered sets will be moved one level down.
+ *
+ * We have to be careful here because the freemap reserves a limited
+ * number of blocks for a limited number of levels.  So we can't just
+ * push indiscriminately.
+ */
+int
+hammer2_chain_indkey_freemap(hammer2_chain_t *parent, hammer2_key_t *keyp,
+                            int keybits, hammer2_blockref_t *base, int count)
+{
+       hammer2_chain_core_t *above;
+       hammer2_chain_t *child;
+       hammer2_blockref_t *bref;
+       hammer2_key_t key;
+       int locount;
+       int hicount;
+       int i;
+
+       key = *keyp;
+       above = parent->core;
+       locount = 0;
+       hicount = 0;
+       keybits = 64;
+
+       /*
+        * Calculate the range of keys in the array being careful to skip
+        * slots which are overridden with a deletion.
+        */
+       spin_lock(&above->cst.spin);
+       for (i = 0; i < count; ++i) {
+               child = hammer2_chain_find_locked(parent, i);
+               if (child) {
+                       if (child->flags & HAMMER2_CHAIN_DELETED)
+                               continue;
+                       bref = &child->bref;
+               } else if (base && base[i].type) {
+                       bref = &base[i];
+               } else {
+                       continue;
+               }
+
+               if (keybits > bref->keybits) {
+                       key = bref->key;
+                       keybits = bref->keybits;
+               } else if (keybits == bref->keybits && bref->key < key) {
+                       key = bref->key;
+               }
+       }
+       spin_unlock(&above->cst.spin);
+
+       /*
+        * Return the keybits for a higher-level FREEMAP_NODE covering
+        * this node.
+        */
+       switch(keybits) {
+       case HAMMER2_FREEMAP_LEVEL0_RADIX:
+               keybits = HAMMER2_FREEMAP_LEVEL1_RADIX;
+               break;
+       case HAMMER2_FREEMAP_LEVEL1_RADIX:
+               keybits = HAMMER2_FREEMAP_LEVEL2_RADIX;
+               break;
+       case HAMMER2_FREEMAP_LEVEL2_RADIX:
+               keybits = HAMMER2_FREEMAP_LEVEL3_RADIX;
+               break;
+       case HAMMER2_FREEMAP_LEVEL3_RADIX:
+               keybits = HAMMER2_FREEMAP_LEVEL4_RADIX;
+               break;
+       case HAMMER2_FREEMAP_LEVEL4_RADIX:
+               panic("hammer2_chain_indkey_freemap: level too high");
+               break;
+       default:
+               panic("hammer2_chain_indkey_freemap: bad radix");
+               break;
+       }
+       *keyp = key;
+
+       return (keybits);
+}
+
+/*
+ * Calculate the keybits and highside/lowside of the indirect block the
+ * caller is creating.
+ */
+static int
+hammer2_chain_indkey_normal(hammer2_chain_t *parent, hammer2_key_t *keyp,
+                           int keybits, hammer2_blockref_t *base, int count)
+{
+       hammer2_chain_core_t *above;
+       hammer2_chain_t *child;
+       hammer2_blockref_t *bref;
+       hammer2_key_t key;
+       int nkeybits;
+       int locount;
+       int hicount;
+       int i;
+
+       key = *keyp;
+       above = parent->core;
+       locount = 0;
+       hicount = 0;
+
+       /*
+        * Calculate the range of keys in the array being careful to skip
+        * slots which are overridden with a deletion.  Once the scan
+        * completes we will cut the key range in half and shift half the
+        * range into the new indirect block.
+        */
+       spin_lock(&above->cst.spin);
+       for (i = 0; i < count; ++i) {
+               child = hammer2_chain_find_locked(parent, i);
+               if (child) {
+                       if (child->flags & HAMMER2_CHAIN_DELETED)
+                               continue;
+                       bref = &child->bref;
+               } else if (base && base[i].type) {
+                       bref = &base[i];
+               } else {
+                       continue;
+               }
+
+               /*
+                * Expand our calculated key range (key, keybits) to fit
+                * the scanned key.  nkeybits represents the full range
+                * that we will later cut in half (two halves @ nkeybits - 1).
+                */
+               nkeybits = keybits;
+               if (nkeybits < bref->keybits) {
+                       if (bref->keybits > 64) {
+                               kprintf("bad bref index %d chain %p bref %p\n",
+                                       i, child, bref);
+                               Debugger("fubar");
+                       }
+                       nkeybits = bref->keybits;
+               }
+               while (nkeybits < 64 &&
+                      (~(((hammer2_key_t)1 << nkeybits) - 1) &
+                       (key ^ bref->key)) != 0) {
+                       ++nkeybits;
+               }
+
+               /*
+                * If the new key range is larger we have to determine
+                * which side of the new key range the existing keys fall
+                * under by checking the high bit, then collapsing the
+                * locount into the hicount or vise-versa.
+                */
+               if (keybits != nkeybits) {
+                       if (((hammer2_key_t)1 << (nkeybits - 1)) & key) {
+                               hicount += locount;
+                               locount = 0;
+                       } else {
+                               locount += hicount;
+                               hicount = 0;
+                       }
+                       keybits = nkeybits;
+               }
+
+               /*
+                * The newly scanned key will be in the lower half or the
+                * higher half of the (new) key range.
+                */
+               if (((hammer2_key_t)1 << (nkeybits - 1)) & bref->key)
+                       ++hicount;
+               else
+                       ++locount;
+       }
+       spin_unlock(&above->cst.spin);
+       bref = NULL;    /* now invalid (safety) */
+
+       /*
+        * Adjust keybits to represent half of the full range calculated
+        * above (radix 63 max)
+        */
+       --keybits;
+
+       /*
+        * Select whichever half contains the most elements.  Theoretically
+        * we can select either side as long as it contains at least one
+        * element (in order to ensure that a free slot is present to hold
+        * the indirect block).
+        */
+       if (hammer2_indirect_optimize) {
+               /*
+                * Insert node for least number of keys, this will arrange
+                * the first few blocks of a large file or the first few
+                * inodes in a directory with fewer indirect blocks when
+                * created linearly.
+                */
+               if (hicount < locount && hicount != 0)
+                       key |= (hammer2_key_t)1 << keybits;
+               else
+                       key &= ~(hammer2_key_t)1 << keybits;
+       } else {
+               /*
+                * Insert node for most number of keys, best for heavily
+                * fragmented files.
+                */
+               if (hicount > locount)
+                       key |= (hammer2_key_t)1 << keybits;
+               else
+                       key &= ~(hammer2_key_t)1 << keybits;
+       }
+       *keyp = key;
+
+       return (keybits);
+}
+
 /*
  * Sets CHAIN_DELETED and CHAIN_MOVED in the chain being deleted and
  * set chain->delete_tid.
index c79f81d..ea127fa 100644 (file)
@@ -82,6 +82,7 @@
  */
 #define HAMMER2_MIN_ALLOC      1024    /* minimum allocation size */
 #define HAMMER2_MIN_RADIX      10      /* minimum allocation size 2^N */
+#define HAMMER2_MAX_ALLOC      65536   /* maximum allocation size */
 #define HAMMER2_MAX_RADIX      16      /* maximum allocation size 2^N */
 #define HAMMER2_KEY_RADIX      64      /* number of bits in key */
 
 #define HAMMER2_LBUFRADIX      14      /* logical buf (1<<14) bytes */
 #define HAMMER2_LBUFSIZE       16384
 
+/*
+ * XXX FIXME multiple locked inodes deadlock on the buffer cache
+ */
 #if 0
-#define HAMMER2_MINIORADIX     16      /* minimum phsical IO size */
-#define HAMMER2_MINIOSIZE      65536
+#define HAMMER2_MINIORADIX     HAMMER2_LBUFRADIX
+#define HAMMER2_MINIOSIZE      HAMMER2_LBUFSIZE
+#else
+#define HAMMER2_MINIORADIX     10
+#define HAMMER2_MINIOSIZE      1024
 #endif
-#define HAMMER2_MINIORADIX     HAMMER2_MINALLOCRADIX
-#define HAMMER2_MINIOSIZE      HAMMER2_MINALLOCSIZE
 
-#define HAMMER2_MINALLOCRADIX  10      /* minimum block allocation size */
-#define HAMMER2_MINALLOCSIZE   1024
 #define HAMMER2_IND_BYTES_MIN  4096    /* first indirect layer only */
 #define HAMMER2_IND_BYTES_MAX  HAMMER2_PBUFSIZE
 #define HAMMER2_IND_COUNT_MIN  (HAMMER2_IND_BYTES_MIN / \
  * Indirect blocks are typically either 4KB (64 blockrefs / ~4MB represented),
  * or 64KB (1024 blockrefs / ~64MB represented).
  */
-#define HAMMER2_SET_COUNT      8       /* direct entries */
-#define HAMMER2_SET_RADIX      3
-#define HAMMER2_EMBEDDED_BYTES 512
-#define HAMMER2_EMBEDDED_RADIX 9
+#define HAMMER2_SET_COUNT              8       /* direct entries */
+#define HAMMER2_SET_RADIX              3
+#define HAMMER2_EMBEDDED_BYTES         512     /* inode blockset/dd size */
+#define HAMMER2_EMBEDDED_RADIX         9
 
 #define HAMMER2_PBUFMASK       (HAMMER2_PBUFSIZE - 1)
 #define HAMMER2_LBUFMASK       (HAMMER2_LBUFSIZE - 1)
  *     +-----------------------+
  *      |      Volume Hdr      | block 0       volume header & alternates
  *     +-----------------------+               (first four zones only)
- *      | (A) FreeBlk layer0    | block 1      free block table
- *      | (A) FreeBlk layer1    |
- *      | (A) FreeBlk layer2    |
- *      | (A) FreeBlk layer3    |
- *      | (A) FreeBlk layer4[8] | (note: 8x64K -> 128x4K)
+ *     |   FreeBlk Section A   | block 1-8
  *     +-----------------------+
- *      | (B) FreeBlk layer0    | block 13     free block table
- *      | (B) FreeBlk layer1    |
- *      | (B) FreeBlk layer2    |
- *      | (B) FreeBlk layer3    |
- *      | (B) FreeBlk layer4[8] |
+ *     |   FreeBlk Section B   | block 9-16
  *     +-----------------------+
- *      | (C) FreeBlk layer0    | block 25     free block table
- *      | (C) FreeBlk layer1    |
- *      | (C) FreeBlk layer2    |
- *      | (C) FreeBlk layer3    |
- *      | (C) FreeBlk layer4[8] |
+ *     |   FreeBlk Section C   | block 17-24
  *     +-----------------------+
- *      | (D) FreeBlk layer0    | block 37     free block table
- *      | (D) FreeBlk layer1    |
- *      | (D) FreeBlk layer2    |
- *      | (D) FreeBlk layer3    |
- *      | (D) FreeBlk layer4[8] |
+ *     |   FreeBlk Section D   | block 25-32
  *     +-----------------------+
- *      |                      | block 49...63
+ *      |                      | block 33...63
  *      |      reserved        |
  *      |                      |
  *     +-----------------------+
  *
  * The first few 2GB zones contain volume headers and volume header backups.
- * After that the volume header block# is reserved.  The first 2GB zone
- * contains all four FreeBlk layers, for example, but the layer1 FreeBlk
- * is only needed once every 1TB.  The free block topology rotates between
- * several groups {A,B,C,D} in order to ensure that the free block table
- * is clean upon reboot after a crash or disk failure.
+ * After that the volume header block# is reserved.
  *
+ * The freemap utilizes blocks #1-32 for now, see the FREEMAP document.
  * The Free block table has a resolution of 1KB
+ *
+ * WARNING!  ZONE_SEG and VOLUME_ALIGN must be a multiple of 1<<LEVEL0_RADIX
+ *          (i.e. a multiple of 2MB).  VOLUME_ALIGN must be >= ZONE_SEG.
  */
 #define HAMMER2_VOLUME_ALIGN           (8 * 1024 * 1024)
 #define HAMMER2_VOLUME_ALIGN64         ((hammer2_off_t)HAMMER2_VOLUME_ALIGN)
  * tree, and other information in the future.
  *
  * All specified blocks are not necessarily used in all 2GB zones.  However,
- * dead areas are reserved and MUST NOT BE USED for other purposes.
+ * dead areas are reserved for future use and MUST NOT BE USED for other
+ * purposes.
  *
  * The freemap is arranged into four groups.  Modifications rotate through
  * the groups on a block by block basis (so all the blocks are not necessarily
- * synchronized to the same group).  Only three groups are actually necessary
- * (stable, flushing, modifying).
+ * synchronized to the same group).  Because the freemap is flushed
+ * independent of the main filesystem, the freemap only really needs two
+ * groups to operate efficiently.
+ *
+ *
  *
- * 64KB freemap indirect blocks are represented by layers 0, 1, 2, and 3.
- * 4KB freemap leaf blocks each represent 16MB of storage so 128 x 4KB are
- * needed per zone, which equates to 8 x 64KB layer4 blocks per zone.
  */
 #define HAMMER2_ZONE_VOLHDR            0       /* volume header or backup */
 #define HAMMER2_ZONE_FREEMAP_A         1       /* freemap layer group A */
-#define HAMMER2_ZONE_FREEMAP_B         13      /* freemap layer group B */
-#define HAMMER2_ZONE_FREEMAP_C         25      /* freemap layer group C */
-#define HAMMER2_ZONE_FREEMAP_D         37      /* freemap layer group D */
-
-#define HAMMER2_ZONEFM_LAYER0          0       /* relative to FREEMAP_x */
-#define HAMMER2_ZONEFM_LAYER1          1
-#define HAMMER2_ZONEFM_LAYER2          2
-#define HAMMER2_ZONEFM_LAYER3          3
-#define HAMMER2_ZONEFM_LAYER4          4       /* 4-11 (8 64KB blocks) */
+#define HAMMER2_ZONE_FREEMAP_B         9       /* freemap layer group B */
+#define HAMMER2_ZONE_FREEMAP_C         17      /* freemap layer group C */
+#define HAMMER2_ZONE_FREEMAP_D         25      /* freemap layer group D */
+
+                                               /* relative to FREEMAP_x */
+#define HAMMER2_ZONEFM_LEVEL0          0       /* 256KB bitmap (4 blks) */
+#define HAMMER2_ZONEFM_LEVEL1          4       /* 2GB indmap */
+#define HAMMER2_ZONEFM_LEVEL2          5       /* 2TB indmap */
+#define HAMMER2_ZONEFM_LEVEL3          6       /* 2PB indmap */
+#define HAMMER2_ZONEFM_LEVEL4          7       /* 2EB indmap */
+/* LEVEL5 is a set of 8 blockrefs in the volume header 16EB */
 
 #define HAMMER2_ZONE_BLOCK49           49      /* future */
 #define HAMMER2_ZONE_BLOCK50           50      /* future */
 #define HAMMER2_ZONE_BLOCK62           62      /* future */
 #define HAMMER2_ZONE_BLOCK63           63      /* future */
 
+/*
+ * Freemap radii.  Please note that LEVEL 1 blockref array entries
+ * point to 256-byte sections of the bitmap representing 2MB of storage.
+ * Even though the chain structures represent only 256 bytes, they are
+ * mapped using larger 16K or 64K buffer cache buffers.
+ */
+#define HAMMER2_FREEMAP_LEVEL5_RADIX   64      /* 16EB */
+#define HAMMER2_FREEMAP_LEVEL4_RADIX   61      /* 2EB */
+#define HAMMER2_FREEMAP_LEVEL3_RADIX   51      /* 2PB */
+#define HAMMER2_FREEMAP_LEVEL2_RADIX   41      /* 2TB */
+#define HAMMER2_FREEMAP_LEVEL1_RADIX   31      /* 2GB (256KB of bitmap) */
+#define HAMMER2_FREEMAP_LEVEL0_RADIX   21      /* 2MB (256 bytes of bitmap) */
+
+#define HAMMER2_FREEMAP_LEVELN_PSIZE   65536   /* physical bytes */
+#define HAMMER2_FREEMAP_LEVEL0_PSIZE   256     /* physical bytes */
+
+
 /*
  * Two linear areas can be reserved after the initial 2MB segment in the base
  * zone (the one starting at offset 0).  These areas are NOT managed by the
@@ -399,8 +406,7 @@ struct hammer2_blockref {           /* MUST BE EXACTLY 64 BYTES */
                 *
                 * biggest - largest possible allocation 2^N within sub-tree.
                 *           typically initialized to 64 in freemap_blockref
-                *           and to 54 in each blockref[] entry in the
-                *           FREEMAP_ROOT indirect block.
+                *           and reduced as-needed when a request fails.
                 *
                 *           An allocation > 2^N is guaranteed to fail.  An
                 *           allocation <= 2^N MAY fail, and if it does the
@@ -435,9 +441,10 @@ typedef struct hammer2_blockref hammer2_blockref_t;
 #define HAMMER2_BREF_TYPE_INODE                1
 #define HAMMER2_BREF_TYPE_INDIRECT     2
 #define HAMMER2_BREF_TYPE_DATA         3
-#define HAMMER2_BREF_TYPE_FREEMAP_ROOT 4
+#define HAMMER2_BREF_TYPE_UNUSED04     4
 #define HAMMER2_BREF_TYPE_FREEMAP_NODE 5
 #define HAMMER2_BREF_TYPE_FREEMAP_LEAF 6
+#define HAMMER2_BREF_TYPE_FREEMAP      254     /* pseudo-type */
 #define HAMMER2_BREF_TYPE_VOLUME       255     /* pseudo-type */
 
 #define HAMMER2_ENC_CHECK(n)           ((n) << 4)
@@ -683,89 +690,8 @@ typedef struct hammer2_inode_data hammer2_inode_data_t;
 /*
  *                             Allocation Table
  *
- * In HAMMER2 the allocation table hangs off of the volume header and
- * utilizes somewhat customized hammer2_blockref based indirect blocks
- * until hitting the leaf bitmap.  BREF_TYPE_FREEMAP_ROOT and
- * BREF_TYPE_FREEMAP_NODE represent the indirect blocks but are formatted
- * the same as BREF_TYPE_INDIRECT except for the (biggest) and (avail)
- * fields which use some of the check union space.  Thus a special CHECK
- * id (CHECK_FREEMAP instead of CHECK_ISCSI32) is also specified for these
- * babies.
- *
- * newfs_hammer2 builds the FREEMAP_ROOT block and assigns a radix of
- * 34, 44, 54, or 64 depending on whether the freemap is to be fitted
- * to the storage or is to maximized for (possibly) sparse storage.
- * Other keybits specifications for FREEMAP_ROOT are illegal.  Even fitted
- * storage is required to specify at least a keybits value of 34.
- *
- *     Total possible representation is 2^64 (16 Exabytes).
- *     10: 1024 entries / 64KB                 16EB (16PB per entry) layer0
- *     10: 1024 entries / 64KB                 16PB (16TB per entry) layer1
- *     10: 1024 entries / 64KB                 16TB (16GB per entry) layer2
- *     10: 1024 entries / 64KB                 16GB (16MB per entry) layer3
- *     24: 16384 x 1KB allocgran / 4KB         16MB                  layer4
- *
- * To make the radix come out to exactly 64 the leaf bitmaps are arranged
- * into 4KB buffers, with each buffer representing a freemap for 16MB worth
- * of storage using a 1KB allocation granularity.  The leaf bitmaps are
- * structures and not just a plain bitmap, hence the extra space needed to
- * represent 16384 x 1KB blocks.
- *
- * The reserved area at the beginning of each 2GB zone is marked as being
- * allocated on-the-fly and does not have to be pre-set in the freemap,
- * which is just as well as that would require newfs_hammer2 to do a lot
- * of writing otherwise.
- *
- * Indirect blocks are usually created with a semi-dynamic radix but in the
- * case of freemap-related indirect blocks, the blocks use a static radix
- * tree with associations to specific reserved blocks.
- */
-
-/*
- * 4KB -> hammer2_freemap_elm[256]
- *
- *     bitmap     - 64 bits x 1KB representing 64KB.  A '1' bit represents
- *                  an allocated block.
- *
- *     generation - Incremented upon any allocation.  Can't increment more
- *                  than +64 per background freeing pass due to there being
- *                  only 64 bits.
- *
- *     biggest0   - biggest hint (radix) for freemap_elm.  Represents up to
- *                  64KB (radix 16).
- *
- *     biggest1   - biggest hint (radix) for aligned groups of 16 elements,
- *                  stored in elm[0], elm[16], etc.  Represents up to 1MB.
- *                  (radix 20)
- *
- *     biggest2   - biggest hint (radix) for aligned groups of 256 elements
- *                  (i.e. the whole array, only used by elm[0]).
- *                  Represents up to 16MB (radix 24).
- *
- * The hinting is used as part of the allocation mechanism to reduce scan
- * time, which is particularly important as a filesystem approaches full.
- * Fill ratios are handled at the indirect block level (in the blockrefs) and
- * not here.
  */
-struct hammer2_freemap_elm {
-       uint64_t        bitmap;
-       uint8_t         generation;
-       uint8_t         biggest0;
-       uint8_t         biggest1;
-       uint8_t         biggest2;
-       uint32_t        reserved0C;
-};
 
-typedef struct hammer2_freemap_elm hammer2_freemap_elm_t;
-
-#define HAMMER2_FREEMAP_LEAF_BYTES     4096
-#define HAMMER2_FREEMAP_LEAF_ENTRIES   (HAMMER2_FREEMAP_LEAF_BYTES /   \
-                                        sizeof(hammer2_freemap_elm_t))
-#define HAMMER2_FREEMAP_LEAF_RADIX     24
-#define HAMMER2_FREEMAP_NODE_RADIX     10
-#define HAMMER2_FREEMAP_ELM_RADIX       5      /* 2^5 == 32 bits */
-
-#define HAMMER2_BIGF_KILLED            0x80
 
 /*
  * Flags (8 bits) - blockref, for freemap only
@@ -874,7 +800,7 @@ struct hammer2_volume_data {
        hammer2_off_t   allocator_beg;          /* 0070 Initial allocations */
        hammer2_tid_t   mirror_tid;             /* 0078 best committed tid */
        hammer2_tid_t   alloc_tid;              /* 0080 Alloctable modify tid */
-       hammer2_blockref_t freemap_blockref;    /* 0088-00C7 */
+       hammer2_blockref_t reserved0088;        /* 0088-00C7 */
 
        /*
         * Copyids are allocated dynamically from the copyexists bitmap.
@@ -890,10 +816,14 @@ struct hammer2_volume_data {
         * 32 bit CRC array at the end of the first 512 byte sector.
         *
         * icrc_sects[7] - First 512-4 bytes of volume header (including all
-        *                 the other icrc's except the last one).
+        *                 the other icrc's except this one).
         *
-        * icrc_sects[6] - Second 512-4 bytes of volume header, which is
+        * icrc_sects[6] - Sector 1 (512 bytes) of volume header, which is
         *                 the blockset for the root.
+        *
+        * icrc_sects[5] - Sector 2
+        * icrc_sects[4] - Sector 3
+        * icrc_sects[3] - Sector 4 (the freemap blockset)
         */
        hammer2_crc32_t icrc_sects[8];          /* 01E0-01FF */
 
@@ -909,7 +839,7 @@ struct hammer2_volume_data {
         */
        char    sector2[512];                   /* 0400-05FF reserved */
        char    sector3[512];                   /* 0600-07FF reserved */
-       char    sector4[512];                   /* 0800-09FF reserved */
+       hammer2_blockset_t freemap_blockset;    /* 0800-09FF freemap  */
        char    sector5[512];                   /* 0A00-0BFF reserved */
        char    sector6[512];                   /* 0C00-0DFF reserved */
        char    sector7[512];                   /* 0E00-0FFF reserved */
@@ -979,6 +909,8 @@ union hammer2_media_data {
         hammer2_inode_data_t    ipdata;
        hammer2_indblock_data_t npdata;
        char                    buf[HAMMER2_PBUFSIZE];
+       uint64_t                bitmap[HAMMER2_FREEMAP_LEVEL0_PSIZE /
+                                      sizeof(uint64_t)];
 };
 
 typedef union hammer2_media_data hammer2_media_data_t;
index 4741596..0cc97af 100644 (file)
@@ -91,7 +91,8 @@ static int hammer2_chain_flush_scan2(hammer2_chain_t *child, void *data);
  *         collisions (which key off of delete_tid).
  */
 void
-hammer2_trans_init(hammer2_mount_t *hmp, hammer2_trans_t *trans, int flags)
+hammer2_trans_init(hammer2_trans_t *trans, hammer2_mount_t *hmp,
+                  hammer2_inode_t *ip, int flags)
 {
        hammer2_trans_t *scan;
 
@@ -102,6 +103,8 @@ hammer2_trans_init(hammer2_mount_t *hmp, hammer2_trans_t *trans, int flags)
        trans->sync_tid = hmp->voldata.alloc_tid++;
        trans->flags = flags;
        trans->td = curthread;
+       trans->tmp_ip = ip;
+       trans->tmp_bpref = 0;
        TAILQ_INSERT_TAIL(&hmp->transq, trans, entry);
 
        if (flags & HAMMER2_TRANS_ISFLUSH) {
@@ -113,7 +116,7 @@ hammer2_trans_init(hammer2_mount_t *hmp, hammer2_trans_t *trans, int flags)
                ++hmp->flushcnt;
                if (hmp->curflush == NULL) {
                        hmp->curflush = trans;
-                       hmp->flush_tid = trans->sync_tid;
+                       hmp->topo_flush_tid = trans->sync_tid;
                }
                while (TAILQ_FIRST(&hmp->transq) != trans) {
                        lksleep(&trans->sync_tid, &hmp->voldatalk,
@@ -182,7 +185,7 @@ hammer2_trans_done(hammer2_trans_t *trans)
                        }
                        KKASSERT(scan);
                        hmp->curflush = scan;
-                       hmp->flush_tid = scan->sync_tid;
+                       hmp->topo_flush_tid = scan->sync_tid;
                } else {
                        /*
                         * Theoretically we don't have to clear flush_tid
@@ -191,7 +194,7 @@ hammer2_trans_done(hammer2_trans_t *trans)
                         * now zero-it.
                         */
                        hmp->curflush = NULL;
-                       hmp->flush_tid = 0;
+                       hmp->topo_flush_tid = 0;
                }
        } else {
                /*
@@ -509,8 +512,10 @@ hammer2_chain_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain)
         */
        if (chain->delete_tid <= info->sync_tid) {
                if (chain->flags & HAMMER2_CHAIN_MODIFIED) {
-                       if (chain->bp)
-                               chain->bp->b_flags |= B_INVAL|B_RELBUF;
+                       if (chain->bp) {
+                               if (chain->bytes == chain->bp->b_bufsize)
+                                       chain->bp->b_flags |= B_INVAL|B_RELBUF;
+                       }
                        atomic_clear_int(&chain->flags, HAMMER2_CHAIN_MODIFIED);
                        hammer2_chain_drop(chain);
                }
@@ -524,8 +529,10 @@ hammer2_chain_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain)
                 * Throw-away the MODIFIED flag
                 */
                if (chain->flags & HAMMER2_CHAIN_MODIFIED) {
-                       if (chain->bp)
-                               chain->bp->b_flags |= B_INVAL|B_RELBUF;
+                       if (chain->bp) {
+                               if (chain->bytes == chain->bp->b_bufsize)
+                                       chain->bp->b_flags |= B_INVAL|B_RELBUF;
+                       }
                        atomic_clear_int(&chain->flags, HAMMER2_CHAIN_MODIFIED);
                        hammer2_chain_drop(chain);
                }
@@ -571,9 +578,12 @@ hammer2_chain_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain)
        atomic_clear_int(&chain->flags, HAMMER2_CHAIN_MODIFIED);
        if (chain == &hmp->vchain)
                kprintf("(FLUSHED VOLUME HEADER)\n");
+       if (chain == &hmp->fchain)
+               kprintf("(FLUSHED FREEMAP HEADER)\n");
 
        if ((chain->flags & HAMMER2_CHAIN_MOVED) ||
-           chain == &hmp->vchain) {
+           chain == &hmp->vchain ||
+           chain == &hmp->fchain) {
                /*
                 * Drop the ref from the MODIFIED bit we cleared.
                 */
@@ -598,7 +608,22 @@ hammer2_chain_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain)
         * processing.
         */
        switch(chain->bref.type) {
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               hammer2_modify_volume(hmp);
+               break;
        case HAMMER2_BREF_TYPE_VOLUME:
+               /*
+                * We should flush the free block table before we calculate
+                * CRCs and copy voldata -> volsync.
+                */
+               hammer2_chain_lock(&hmp->fchain, HAMMER2_RESOLVE_ALWAYS);
+               if (hmp->fchain.flags & (HAMMER2_CHAIN_MODIFIED |
+                                        HAMMER2_CHAIN_SUBMODIFIED)) {
+                       /* this will modify vchain as a side effect */
+                       hammer2_chain_flush(info->trans, &hmp->fchain);
+               }
+               hammer2_chain_unlock(&hmp->fchain);
+
                /*
                 * The volume header is flushed manually by the syncer, not
                 * here.  All we do is adjust the crc's.
@@ -674,9 +699,17 @@ hammer2_chain_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain)
                chain->data = NULL;
                hammer2_chain_unlock(chain);
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP_NODE:
+       case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
+               /*
+                * Device-backed.  Buffer will be flushed by the sync
+                * code XXX.
+                */
+               break;
        default:
                /*
                 * Embedded elements have to be flushed out.
+                * (Basically just BREF_TYPE_INODE).
                 */
                KKASSERT(chain->data != NULL);
                KKASSERT(chain->bp == NULL);
@@ -995,6 +1028,7 @@ hammer2_chain_flush_scan2(hammer2_chain_t *child, void *data)
                }
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
+       case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                if (parent->data) {
                        base = &parent->data->npdata.blockref[0];
                } else {
@@ -1007,6 +1041,10 @@ hammer2_chain_flush_scan2(hammer2_chain_t *child, void *data)
                base = &hmp->voldata.sroot_blockset.blockref[0];
                count = HAMMER2_SET_COUNT;
                break;
+       case HAMMER2_BREF_TYPE_FREEMAP:
+               base = &parent->data->npdata.blockref[0];
+               count = HAMMER2_SET_COUNT;
+               break;
        default:
                base = NULL;
                count = 0;
@@ -1045,7 +1083,8 @@ hammer2_chain_flush_scan2(hammer2_chain_t *child, void *data)
        if (info->mirror_tid < child->bref.mirror_tid) {
                info->mirror_tid = child->bref.mirror_tid;
        }
-       if (parent->bref.type == HAMMER2_BREF_TYPE_VOLUME &&
+       if ((parent->bref.type == HAMMER2_BREF_TYPE_VOLUME ||
+            parent->bref.type == HAMMER2_BREF_TYPE_FREEMAP) &&
            hmp->voldata.mirror_tid < child->bref.mirror_tid) {
                hmp->voldata.mirror_tid = child->bref.mirror_tid;
        }
index 1cb22df..f790c8a 100644 (file)
 
 #include "hammer2.h"
 
+#define USE_SIMPLE_ALLOC       0
+
+#if !USE_SIMPLE_ALLOC
+
+static int hammer2_freemap_try_alloc(hammer2_trans_t *trans,
+                       hammer2_chain_t **parentp, hammer2_blockref_t *bref,
+                       int radix, hammer2_off_t bpref, hammer2_off_t *bnext);
+static int hammer2_freemap_iterate(hammer2_trans_t *trans,
+                       hammer2_chain_t **parentp, hammer2_chain_t **chainp,
+                       hammer2_off_t bpref, hammer2_off_t *bnextp);
+
+#endif
+
+/*
+ * Calculate the device offset for the specified FREEMAP_NODE or FREEMAP_LEAF
+ * bref.  Return a combined media offset and physical size radix.  Freemap
+ * chains use fixed storage offsets in the 4MB reserved area at the
+ * beginning of each 2GB zone
+ *
+ * Rotate between four possibilities.  Theoretically this means we have three
+ * good freemaps in case of a crash which we can use as a base for the fixup
+ * scan at mount-time.
+ */
+#define H2FMBASE(key, radix)   ((key) & ~(((hammer2_off_t)1 << (radix)) - 1))
+#define H2FMSHIFT(radix)       ((hammer2_off_t)1 << (radix))
+
+static
+int
+hammer2_freemap_reserve(hammer2_mount_t *hmp, hammer2_blockref_t *bref,
+                       int radix)
+{
+       hammer2_off_t off;
+       size_t bytes;
+
+       /*
+        * Physical allocation size -> radix.  Typically either 256 for
+        * a level 0 freemap leaf or 65536 for a level N freemap node.
+        *
+        * NOTE: A 256 byte bitmap represents 256 x 8 x 1024 = 2MB of storage.
+        *       Do not use hammer2_allocsize() here as it has a min cap.
+        */
+       bytes = 1 << radix;
+
+       /*
+        * Adjust by HAMMER2_ZONE_FREEMAP_{A,B,C,D} using the existing
+        * offset as a basis.
+        */
+       if ((bref->data_off & ~HAMMER2_OFF_MASK_RADIX) == 0) {
+               off = HAMMER2_ZONE_FREEMAP_A;
+       } else {
+               off = HAMMER2_ZONE_FREEMAP_A;
+#if 0
+               off = bref->data_off & ~HAMMER2_OFF_MASK_RADIX &
+                     (((hammer2_off_t)1 << HAMMER2_FREEMAP_LEVEL1_RADIX) - 1);
+               off = off / HAMMER2_PBUFSIZE;
+               KKASSERT(off >= HAMMER2_ZONE_FREEMAP_A);
+               KKASSERT(off < HAMMER2_ZONE_FREEMAP_D + 8);
+
+               if (off >= HAMMER2_ZONE_FREEMAP_D)
+                       off = HAMMER2_ZONE_FREEMAP_A;
+               else if (off >= HAMMER2_ZONE_FREEMAP_C)
+                       off = HAMMER2_ZONE_FREEMAP_D;
+               else if (off >= HAMMER2_ZONE_FREEMAP_B)
+                       off = HAMMER2_ZONE_FREEMAP_C;
+               else
+                       off = HAMMER2_ZONE_FREEMAP_B;
+#endif
+       }
+       off = off * HAMMER2_PBUFSIZE;
+
+       /*
+        * Calculate the block offset of the reserved block.  This will
+        * point into the 4MB reserved area at the base of the appropriate
+        * 2GB zone, once added to the FREEMAP_x selection above.
+        */
+       switch(bref->keybits) {
+       /* case HAMMER2_FREEMAP_LEVEL5_RADIX: not applicable */
+       case HAMMER2_FREEMAP_LEVEL4_RADIX:      /* 2EB */
+               KKASSERT(bref->type == HAMMER2_BREF_TYPE_FREEMAP_NODE);
+               KKASSERT(bytes == HAMMER2_FREEMAP_LEVELN_PSIZE);
+               off += H2FMBASE(bref->key, HAMMER2_FREEMAP_LEVEL4_RADIX) +
+                      HAMMER2_ZONEFM_LEVEL4 * HAMMER2_PBUFSIZE;
+               break;
+       case HAMMER2_FREEMAP_LEVEL3_RADIX:      /* 2PB */
+               KKASSERT(bref->type == HAMMER2_BREF_TYPE_FREEMAP_NODE);
+               KKASSERT(bytes == HAMMER2_FREEMAP_LEVELN_PSIZE);
+               off += H2FMBASE(bref->key, HAMMER2_FREEMAP_LEVEL3_RADIX) +
+                      HAMMER2_ZONEFM_LEVEL3 * HAMMER2_PBUFSIZE;
+               break;
+       case HAMMER2_FREEMAP_LEVEL2_RADIX:      /* 2TB */
+               KKASSERT(bref->type == HAMMER2_BREF_TYPE_FREEMAP_NODE);
+               KKASSERT(bytes == HAMMER2_FREEMAP_LEVELN_PSIZE);
+               off += H2FMBASE(bref->key, HAMMER2_FREEMAP_LEVEL2_RADIX) +
+                      HAMMER2_ZONEFM_LEVEL2 * HAMMER2_PBUFSIZE;
+               break;
+       case HAMMER2_FREEMAP_LEVEL1_RADIX:      /* 2GB */
+               KKASSERT(bref->type == HAMMER2_BREF_TYPE_FREEMAP_NODE);
+               KKASSERT(bytes == HAMMER2_FREEMAP_LEVELN_PSIZE);
+               off += H2FMBASE(bref->key, HAMMER2_FREEMAP_LEVEL1_RADIX) +
+                      HAMMER2_ZONEFM_LEVEL1 * HAMMER2_PBUFSIZE;
+               break;
+       case HAMMER2_FREEMAP_LEVEL0_RADIX:      /* 2MB (256 byte bitmap) */
+               /*
+                * Terminal bitmap, start with 2GB base, then offset by
+                * 256 bytes of device offset per 2MB of logical space
+                * (8 bits per byte, 1024 byte allocation chunking).
+                */
+               KKASSERT(bref->type == HAMMER2_BREF_TYPE_FREEMAP_LEAF);
+               KKASSERT(bytes == HAMMER2_FREEMAP_LEVEL0_PSIZE);
+               off += H2FMBASE(bref->key, HAMMER2_FREEMAP_LEVEL1_RADIX) +
+                      HAMMER2_ZONEFM_LEVEL0 * HAMMER2_PBUFSIZE;
+
+               off += ((bref->key >> HAMMER2_FREEMAP_LEVEL0_RADIX) &
+                       ((1 << (HAMMER2_FREEMAP_LEVEL1_RADIX -
+                              HAMMER2_FREEMAP_LEVEL0_RADIX)) - 1)) *
+                       HAMMER2_FREEMAP_LEVEL0_PSIZE;
+               break;
+       default:
+               panic("freemap: bad radix(2) %p %d\n", bref, bref->keybits);
+               /* NOT REACHED */
+               break;
+       }
+       bref->data_off = off | radix;
+       return (0);
+}
+
 /*
  * Allocate media space, returning a combined data offset and radix.
+ * THIS ROUTINE IS USE FOR DEBUGGING ONLY.
  *
- * XXX when diving a new full block create a clean empty buffer and bqrelse()
- *     it, so small data structures do not have to issue read-IO when they
- *     do the read-modify-write on the backing store.
+ * This version of the routine is ONLY usable for testing and debug
+ * purposes and only if the filesystem never instantiated an actual
+ * freemap.  It uses the initial allocation iterator that newfs_hammer2
+ * used to build the filesystem to allocate new space and is not capable
+ * of dealing with freed space.
  */
-hammer2_off_t
-hammer2_freemap_alloc(hammer2_mount_t *hmp, int type, size_t bytes)
+#if USE_SIMPLE_ALLOC
+
+static
+int
+hammer2_freemap_simple_alloc(hammer2_mount_t *hmp, hammer2_blockref_t *bref,
+                            int radix)
 {
        hammer2_off_t data_off;
        hammer2_off_t data_next;
        hammer2_freecache_t *fc;
        /*struct buf *bp;*/
-       int radix;
+       size_t bytes;
        int fctype;
 
-       switch(type) {
+       bytes = (size_t)(1 << radix);
+       KKASSERT(bytes >= HAMMER2_MIN_ALLOC &&
+                bytes <= HAMMER2_MAX_ALLOC);
+
+       /*
+        * Must not be used if the filesystem is using a real freemap.
+        */
+       KKASSERT(hmp->voldata.freemap_blockset.blockref[0].data_off == 0);
+
+       switch(bref->type) {
        case HAMMER2_BREF_TYPE_INODE:
                fctype = HAMMER2_FREECACHE_INODE;
                break;
@@ -77,12 +219,6 @@ hammer2_freemap_alloc(hammer2_mount_t *hmp, int type, size_t bytes)
                break;
        }
 
-       /*
-        * Figure out the base 2 radix of the allocation (rounded up)
-        */
-       radix = hammer2_allocsize(bytes);
-       bytes = 1 << radix;
-
        if (radix <= HAMMER2_MAX_RADIX)
                fc = &hmp->freecache[fctype][radix];
        else
@@ -132,23 +268,374 @@ hammer2_freemap_alloc(hammer2_mount_t *hmp, int type, size_t bytes)
        }
        lockmgr(&hmp->alloclk, LK_RELEASE);
 
+       bref->data_off = data_off | radix;
+       return (0);
+}
+
+#endif
+
+/*
+ * Normal freemap allocator
+ *
+ * Use available hints to allocate space using the freemap.  Create missing
+ * freemap infrastructure on-the-fly as needed (including marking initial
+ * allocations using the iterator as allocated, instantiating new 2GB zones,
+ * and dealing with the end-of-media edge case).
+ *
+ * ip and bpref are only used as a heuristic to determine locality of
+ * reference.  bref->key may also be used heuristically.
+ */
+int
+hammer2_freemap_alloc(hammer2_trans_t *trans,
+                     hammer2_blockref_t *bref, size_t bytes)
+{
+       hammer2_mount_t *hmp = trans->hmp;
+       hammer2_chain_t *parent;
+       hammer2_off_t bpref;
+       hammer2_off_t bnext;
+       int radix;
+       int error;
+
+       /*
+        * Validate the allocation size.  It must be a power of 2.
+        */
+       radix = hammer2_getradix(bytes);
+       KKASSERT((size_t)1 << radix == bytes);
+
+       /*
+        * Freemap elements are assigned from the reserve area.
+        * Note that FREEMAP_LEVEL0_PSIZE is 256 bytes which is
+        * allowed for this case.
+        */
+       if (bref->type == HAMMER2_BREF_TYPE_FREEMAP_NODE ||
+           bref->type == HAMMER2_BREF_TYPE_FREEMAP_LEAF) {
+               return(hammer2_freemap_reserve(hmp, bref, radix));
+       }
+#if USE_SIMPLE_ALLOC
+       return (hammer2_freemap_simple_alloc(hmp, bref, radix));
+#else
+
+       /*
+        * Calculate actual allocation in bytes, and radix.  This ensures
+        * a minimum 1KB allocation.
+        */
+       KKASSERT(bytes >= HAMMER2_MIN_ALLOC &&
+                bytes <= HAMMER2_MAX_ALLOC);
+
 #if 0
        /*
-        * Allocations on-media are always in multiples of 64K but
-        * partial-block allocations can be tracked in-memory.
-        *
-        * We can reduce the need for read-modify-write IOs by
-        * telling the kernel that the contents of a new 64K block is
-        * initially good (before we use any of it).
-        *
-        * Worst case is the kernel evicts the buffer and causes HAMMER2's
-        * bread later on to actually issue a read I/O.
+        * Calculate starting point
+        */
+       if (bref->data_off & ~HAMMER2_OFF_MASK_RADIX)
+               bpref = bref->data_off & ~HAMMER2_OFF_MASK_RADIX;
+       else if (trans->tmp_bpref)
+               bpref = trans->tmp_bpref;
+       else if (trans->tmp_ip)
+               bpref = trans->tmp_ip->chain->bref.data_off;
+       else
+#endif
+               bpref = hmp->heur_last_alloc;   /* SMP race ok, heuristic */
+
+       /*
+        * Make sure bpref is in-bounds.  It's ok if bpref covers a zone's
+        * reserved area, the try code will iterate past it.
+        */
+       if (bpref > hmp->voldata.volu_size)
+               bpref = hmp->voldata.volu_size - 1;
+
+       /*
+        * Iterate the freemap looking for free space before and after.
+        */
+       parent = &hmp->fchain;
+       hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS);
+       error = EAGAIN;
+       bnext = bpref;
+
+       while (error == EAGAIN) {
+               error = hammer2_freemap_try_alloc(trans, &parent, bref,
+                                                 radix, bpref, &bnext);
+       }
+       hmp->heur_last_alloc = bnext;   /* XXX */
+       hammer2_chain_unlock(parent);
+
+       return (error);
+#endif
+}
+
+#if !USE_SIMPLE_ALLOC
+
+/*
+ * Attempt to allocate (1 << radix) bytes from the freemap at bnext.
+ * Return 0 on success with the bref appropriately updated, non-zero
+ * on failure.  Updates bnextp and returns EAGAIN to continue the
+ * iteration.
+ *
+ * This function will create missing freemap infrastructure as well as
+ * properly initialize reserved areas as already having been allocated.
+ */
+static __inline
+int
+countbits(uint64_t *data)
+{
+       int i;
+       int r = 0;
+       uint64_t v;
+
+       for (i = 0; i < 32; ++i) {
+               v = data[i];
+               while (v) {
+                       if (v & 1)
+                               ++r;
+                       v >>= 1;
+               }
+       }
+       return(r);
+}
+
+static int
+hammer2_freemap_try_alloc(hammer2_trans_t *trans, hammer2_chain_t **parentp,
+                         hammer2_blockref_t *bref, int radix,
+                         hammer2_off_t bpref, hammer2_off_t *bnextp)
+{
+       hammer2_mount_t *hmp = trans->hmp;
+       hammer2_off_t l0mask;
+       hammer2_off_t l0size;
+       hammer2_chain_t *chain;
+       hammer2_off_t key;
+       hammer2_off_t tmp;
+       size_t bytes;
+       uint64_t mask;
+       uint64_t tmp_mask;
+       uint64_t *data;
+       int error = 0;
+       int bits;
+       int index;
+       int count;
+       int subindex;
+
+       /*
+        * Calculate the number of bytes being allocated, the number
+        * of contiguous bits of bitmap being allocated, and the bitmap
+        * mask.
         *
-        * XXX Maybe do this in SEGSIZE increments? Needs a lot of work.
-        *     Also watch out for buffer size mismatches.
+        * WARNING! cpu hardware may mask bits == 64 -> 0 and blow up the
+        *          mask calculation.
+        */
+       bytes = (size_t)1 << radix;
+       bits = 1 << (radix - HAMMER2_MIN_RADIX);
+       mask = (bits == 64) ? (uint64_t)-1 : (((uint64_t)1 << bits) - 1);
+
+       /*
+        * Lookup the level0 freemap chain, creating and initializing one
+        * if necessary.  Intermediate levels will be created automatically
+        * when necessary by hammer2_chain_create().
+        */
+       key = H2FMBASE(*bnextp, HAMMER2_FREEMAP_LEVEL0_RADIX);
+       l0mask = H2FMSHIFT(HAMMER2_FREEMAP_LEVEL0_RADIX) - 1;
+       l0size = H2FMSHIFT(HAMMER2_FREEMAP_LEVEL0_RADIX);
+
+       chain = hammer2_chain_lookup(parentp, key, key + l0mask,
+                                    HAMMER2_LOOKUP_FREEMAP |
+                                    HAMMER2_LOOKUP_ALWAYS |
+                                    HAMMER2_LOOKUP_MATCHIND/*XXX*/);
+       if (chain == NULL) {
+               /*
+                * Create the missing leaf, be sure to initialize
+                * the auxillary freemap tracking information in
+                * the bref.check.freemap structure.
+                */
+#if 0
+               kprintf("freemap create L0 @ %016jx bpref %016jx\n",
+                       key, bpref);
+#endif
+               error = hammer2_chain_create(trans, parentp, &chain,
+                                    key, HAMMER2_FREEMAP_LEVEL0_RADIX,
+                                    HAMMER2_BREF_TYPE_FREEMAP_LEAF,
+                                    HAMMER2_FREEMAP_LEVEL0_PSIZE);
+               if (error == 0) {
+                       hammer2_chain_modify(trans, &chain, 0);
+                       bzero(chain->data->bitmap, HAMMER2_FREEMAP_LEVEL0_PSIZE);
+                       chain->bref.check.freemap.biggest =
+                                       HAMMER2_FREEMAP_LEVEL0_RADIX;
+                       chain->bref.check.freemap.avail = l0size;
+
+                       /*
+                        * Preset bitmap for existing static allocations.
+                        * 64-bit-align so we don't have to worry about
+                        * endian for the memset().
+                        */
+                       tmp = (hmp->voldata.allocator_beg +
+                              HAMMER2_MAX_ALLOC - 1) &
+                             ~(hammer2_off_t)(HAMMER2_MAX_ALLOC - 1);
+                       if (key < tmp) {
+                               if (key + l0size <= tmp) {
+                                       memset(chain->data->bitmap, -1,
+                                               l0size / HAMMER2_MIN_ALLOC / 8);
+                                       chain->bref.check.freemap.avail = 0;
+                               } else {
+                                       count = (tmp - key) / HAMMER2_MIN_ALLOC;
+                                       kprintf("Init L0 BASE %d\n", count);
+                                       memset(chain->data->bitmap, -1,
+                                              count / 8);
+                                       chain->bref.check.freemap.avail -=
+                                               count * HAMMER2_MIN_ALLOC;
+                               }
+                       }
+
+                       /*
+                        * Preset bitmap for reserved area.  Calculate
+                        * 2GB base.
+                        */
+                       tmp = H2FMBASE(key, HAMMER2_FREEMAP_LEVEL1_RADIX);
+                       if (key - tmp < HAMMER2_ZONE_SEG) {
+                               memset(chain->data->bitmap, -1,
+                                      l0size / HAMMER2_MIN_ALLOC / 8);
+                               chain->bref.check.freemap.avail = 0;
+                       }
+
+                       /*
+                        * Preset bitmap for end of media
+                        */
+                       if (key >= trans->hmp->voldata.volu_size) {
+                               memset(chain->data->bitmap, -1,
+                                      l0size / HAMMER2_MIN_ALLOC / 8);
+                               chain->bref.check.freemap.avail = 0;
+                       }
+               }
+       } else if (chain->bref.check.freemap.biggest < radix) {
+               /*
+                * Already flagged as not having enough space
+                */
+               error = ENOSPC;
+       } else {
+               /*
+                * Modify existing chain to setup for adjustment.
+                */
+               hammer2_chain_modify(trans, &chain, 0);
+       }
+       if (error)
+               goto skip;
+
+       /*
+        * Calculate mask and count.  Each bit represents 1KB and (currently)
+        * the maximum allocation is 65536 bytes.  Allocations are always
+        * natively aligned.
+        */
+       count = HAMMER2_FREEMAP_LEVEL0_PSIZE / sizeof(uint64_t); /* 32 */
+       data = &chain->data->bitmap[0];
+
+       tmp_mask = 0; /* avoid compiler warnings */
+
+       /*
+        * Allocate data and meta-data from the beginning and inodes
+        * from the end.
+        */
+       if (bref->type != HAMMER2_BREF_TYPE_INODE) {
+               for (index = 0; index < count; ++index) {
+                       if (data[index] == (uint64_t)-1) /* all allocated */
+                               continue;
+                       tmp_mask = mask;                 /* iterate */
+                       for (subindex = 0; subindex < 64; subindex += bits) {
+                               if ((data[index] & tmp_mask) == 0)
+                                       break;
+                               tmp_mask <<= bits;
+                       }
+                       if (subindex != 64) {
+                               key += HAMMER2_MIN_ALLOC * 64 * index;
+                               key += HAMMER2_MIN_ALLOC * subindex;
+                               break;
+                       }
+               }
+               if (index == count)
+                       error = ENOSPC;
+       } else {
+               for (index = count - 1; index >= 0; --index) {
+                       if (data[index] == (uint64_t)-1) /* all allocated */
+                               continue;
+                       tmp_mask = mask << (64 - bits);
+                       for (subindex = 64 - bits;
+                            subindex >= 0;
+                            subindex -= bits) {
+                               if ((data[index] & tmp_mask) == 0)
+                                       break;
+                               tmp_mask >>= bits;
+                       }
+                       if (subindex != -bits) {
+                               key += HAMMER2_MIN_ALLOC * 64 * index;
+                               key += HAMMER2_MIN_ALLOC * subindex;
+                               break;
+                       }
+               }
+               if (index == -1)
+                       error = ENOSPC;
+       }
+
+skip:
+       if (error == 0) {
+               /*
+                * Assert validity.  Must be beyond the static allocator used
+                * by newfs_hammer2 (and thus also beyond the aux area),
+                * not go past the volume size, and must not be in the
+                * reserved segment area for a zone.
+                */
+               KKASSERT(key >= hmp->voldata.allocator_beg &&
+                        key + bytes <= hmp->voldata.volu_size);
+               KKASSERT((key & HAMMER2_ZONE_MASK64) >= HAMMER2_ZONE_SEG);
+
+               /*
+                * Modify the chain and set the bitmap appropriately.
+                */
+               hammer2_chain_modify(trans, &chain, 0);
+               data = &chain->data->bitmap[0];
+               KKASSERT((data[index] & tmp_mask) == 0);
+               /*
+               kprintf("set %016jx data %016jx %016jx\n",
+                       key, data[index], tmp_mask);
+               */
+               data[index] |= tmp_mask;
+
+               /*
+                * We return the allocated space in bref->data_off.
+                */
+               *bnextp = key;
+               bref->data_off = key | radix;
+
+#if 0
+               kprintf("alloc cp=%p %016jx %016jx using %016jx chain->data %d\n",
+                       chain,
+                       bref->key, bref->data_off, chain->bref.data_off,
+                       countbits(data));
+#endif
+       } else if (error == ENOSPC) {
+               /*
+                * Return EAGAIN with next iteration in *bnextp, or
+                * return ENOSPC if the allocation map has been exhausted.
+                */
+               if (chain->bref.check.freemap.biggest > radix)
+                       chain->bref.check.freemap.biggest = radix;
+               error = hammer2_freemap_iterate(trans, parentp, &chain,
+                                               bpref, bnextp);
+       }
+
+       /*
+        * Cleanup
+        */
+       if (chain)
+               hammer2_chain_unlock(chain);
+       return (error);
+}
+
+#if 0
+       /*
+        * When making meta-data allocations smaller than LBUFSIZE we will
+        * use a LBUFSIZE'd buffer.  The first chunk allocated from such a
+        * buffer instantiates a device buffer and marks it clean to avoid
+        * unnecessary read-before-write ops.  XXX buffer cache buffer
+        * sharing.  XXX mixed data/meta-data issues.
         */
        if (bytes < HAMMER2_MINIOSIZE &&
-           (data_off & (HAMMER2_MINIOSIZE - 1)) == 0) {
+           (data_off & (HAMMER2_MINIOSIZE - 1)) == 0 &&
+           (bitmap shows this is the initial allocation)) {
                bp = getblk(hmp->devvp, data_off, HAMMER2_MINIOSIZE, 0, 0);
                bp->b_flags |= B_CACHE;
                bp->b_resid = 0;
@@ -156,13 +643,21 @@ hammer2_freemap_alloc(hammer2_mount_t *hmp, int type, size_t bytes)
        }
 #endif
 
-       if (hammer2_debug & 0x0001) {
-               kprintf("hammer2: allocate %d %016jx: %zd\n",
-                       type, (intmax_t)data_off, bytes);
-       }
-       return (data_off | radix);
+static int
+hammer2_freemap_iterate(hammer2_trans_t *trans, hammer2_chain_t **parentp,
+                       hammer2_chain_t **chainp,
+                       hammer2_off_t bpref, hammer2_off_t *bnextp)
+{
+       *bnextp += H2FMSHIFT(HAMMER2_FREEMAP_LEVEL0_RADIX);
+       if (*bnextp >= trans->hmp->voldata.volu_size)
+               return (ENOSPC);
+       return(EAGAIN);
 }
 
+#endif
+
+#if 0
+
 void
 hammer2_freemap_free(hammer2_mount_t *hmp, hammer2_off_t data_off, int type)
 {
@@ -197,6 +692,8 @@ hammer2_freemap_free(hammer2_mount_t *hmp, hammer2_off_t data_off, int type)
        }
 }
 
+#endif
+
 #if 0
 /*
  * Allocate media space, returning a combined data offset and radix.
index 1f022f4..ec92083 100644 (file)
@@ -1216,7 +1216,8 @@ done:
 }
 
 /*
- * Calculate the allocation size for the file fragment straddling EOF
+ * Calculate the allocation size for the file fragment straddling EOF,
+ * returning the radix.
  */
 int
 hammer2_inode_calc_alloc(hammer2_key_t filesize)
@@ -1226,7 +1227,7 @@ hammer2_inode_calc_alloc(hammer2_key_t filesize)
 
        if (frag == 0)
                return(0);
-       for (radix = HAMMER2_MINALLOCRADIX; frag > (1 << radix); ++radix)
+       for (radix = HAMMER2_MIN_RADIX; frag > (1 << radix); ++radix)
                ;
        return (radix);
 }
index 7900214..e77ea94 100644 (file)
@@ -481,7 +481,7 @@ hammer2_ioctl_pfs_create(hammer2_inode_t *ip, void *data)
                return(EINVAL);
        pfs->name[sizeof(pfs->name) - 1] = 0;   /* ensure 0-termination */
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, ip, 0);
        nip = hammer2_inode_create(&trans, hmp->sroot, NULL, NULL,
                                     pfs->name, strlen(pfs->name),
                                     &nchain, &error);
@@ -508,7 +508,7 @@ hammer2_ioctl_pfs_delete(hammer2_inode_t *ip, void *data)
        hammer2_trans_t trans;
        int error;
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, ip, 0);
        error = hammer2_unlink_file(&trans, hmp->sroot,
                                    pfs->name, strlen(pfs->name),
                                    2, NULL);
@@ -531,7 +531,7 @@ hammer2_ioctl_pfs_snapshot(hammer2_inode_t *ip, void *data)
        if (pfs->name[sizeof(pfs->name)-1] != 0)
                return(EINVAL);
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, ip, 0);
        parent = hammer2_inode_lock_ex(ip);
        error = hammer2_chain_snapshot(&trans, ip, pfs);
        hammer2_inode_unlock_ex(ip, parent);
index 944891d..f7d73e8 100644 (file)
@@ -284,6 +284,7 @@ hammer2_dirhash(const unsigned char *name, size_t len)
        return (key);
 }
 
+#if 0
 /*
  * Return the power-of-2 radix greater or equal to
  * the specified number of bytes.
@@ -312,6 +313,30 @@ hammer2_allocsize(size_t bytes)
        return (radix);
 }
 
+#endif
+
+/*
+ * Convert bytes to radix with no limitations
+ */
+int
+hammer2_getradix(size_t bytes)
+{
+       int radix;
+
+       if (bytes == HAMMER2_PBUFSIZE)
+               radix = HAMMER2_PBUFRADIX;
+       else if (bytes >= 16384)
+               radix = 14;
+       else if (bytes >= 1024)
+               radix = 10;
+       else
+               radix = 0;
+
+       while (((size_t)1 << radix) < bytes)
+               ++radix;
+       return (radix);
+}
+
 /*
  * ip must be locked sh/ex
  */
@@ -326,9 +351,9 @@ hammer2_calc_logical(hammer2_inode_t *ip, hammer2_off_t uoff,
        *leofp = ipdata->size & ~HAMMER2_PBUFMASK64;
        KKASSERT(*lbasep <= *leofp);
        if (*lbasep == *leofp /*&& *leofp < 1024 * 1024*/) {
-               radix = hammer2_allocsize((size_t)(ipdata->size - *leofp));
-               if (radix < HAMMER2_MINALLOCRADIX)
-                       radix = HAMMER2_MINALLOCRADIX;
+               radix = hammer2_getradix((size_t)(ipdata->size - *leofp));
+               if (radix < HAMMER2_MIN_RADIX)
+                       radix = HAMMER2_MIN_RADIX;
                *leofp += 1U << radix;
                return (1U << radix);
        } else {
index 62e4693..118527c 100644 (file)
@@ -61,7 +61,7 @@ static struct hammer2_mntlist hammer2_mntlist;
 static struct lock hammer2_mntlk;
 
 int hammer2_debug;
-int hammer2_cluster_enable = 1;
+int hammer2_cluster_enable = 0;        /* XXX temporary until layout ironed out */
 int hammer2_hardlink_enable = 1;
 long hammer2_iod_file_read;
 long hammer2_iod_meta_read;
@@ -391,7 +391,7 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                TAILQ_INIT(&hmp->transq);
 
                /*
-                * vchain setup. vchain.data is special cased to NULL.
+                * vchain setup. vchain.data is embedded.
                 * vchain.refs is initialized and will never drop to 0.
                 */
                hmp->vchain.hmp = hmp;
@@ -403,6 +403,25 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                hammer2_chain_core_alloc(&hmp->vchain, NULL);
                /* hmp->vchain.u.xxx is left NULL */
 
+               /*
+                * fchain setup.  fchain.data is embedded.
+                * fchain.refs is initialized and will never drop to 0.
+                *
+                * The data is not used but needs to be initialized to
+                * pass assertion muster.  We use this chain primarily
+                * as a placeholder for the freemap's top-level RBTREE
+                * so it does not interfere with the volume's topology
+                * RBTREE.
+                */
+               hmp->fchain.hmp = hmp;
+               hmp->fchain.refs = 1;
+               hmp->fchain.data = (void *)&hmp->voldata.freemap_blockset;
+               hmp->fchain.bref.type = HAMMER2_BREF_TYPE_FREEMAP;
+               hmp->fchain.bref.data_off = 0 | HAMMER2_PBUFRADIX;
+               hmp->fchain.delete_tid = HAMMER2_MAX_TID;
+               hammer2_chain_core_alloc(&hmp->fchain, NULL);
+               /* hmp->fchain.u.xxx is left NULL */
+
                /*
                 * Install the volume header
                 */
@@ -669,10 +688,17 @@ hammer2_vfs_unmount(struct mount *mp, int mntflags)
                        devvp = NULL;
                }
 
+               /*
+                * Final drop of embedded freemap root chain to clean up
+                * fchain.core (fchain structure is not flagged ALLOCATED
+                * so it is cleaned out and then left to rot).
+                */
+               hammer2_chain_drop(&hmp->fchain);
+
                /*
                 * Final drop of embedded volume root chain to clean up
                 * vchain.core (vchain structure is not flagged ALLOCATED
-                * so it is cleaned out and then left).
+                * so it is cleaned out and then left to rot).
                 */
                dumpcnt = 50;
                hammer2_dump_chain(&hmp->vchain, 0, &dumpcnt);
@@ -826,7 +852,7 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
        if (waitfor & MNT_LAZY)
                flags |= VMSC_ONEPASS;
 
-       hammer2_trans_init(hmp, &info.trans, HAMMER2_TRANS_ISFLUSH);
+       hammer2_trans_init(&info.trans, hmp, NULL, HAMMER2_TRANS_ISFLUSH);
 
        info.error = 0;
        info.waitfor = MNT_NOWAIT;
@@ -847,16 +873,27 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
                /* XXX */
        }
 #endif
+       hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
+       if (hmp->vchain.flags & (HAMMER2_CHAIN_MODIFIED |
+                                 HAMMER2_CHAIN_SUBMODIFIED)) {
+               hammer2_chain_flush(&info.trans, &hmp->vchain);
+       }
+       hammer2_chain_unlock(&hmp->vchain);
+
+#if 0
        /*
         * Rollup flush.  The fsyncs above basically just flushed
         * data blocks.  The flush below gets all the meta-data.
         */
-       hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
-       if (hmp->vchain.flags & (HAMMER2_CHAIN_MODIFIED |
+       hammer2_chain_lock(&hmp->fchain, HAMMER2_RESOLVE_ALWAYS);
+       if (hmp->fchain.flags & (HAMMER2_CHAIN_MODIFIED |
                                 HAMMER2_CHAIN_SUBMODIFIED)) {
-               hammer2_chain_flush(&info.trans, &hmp->vchain);
+               /* this will modify vchain as a side effect */
+               hammer2_chain_flush(&info.trans, &hmp->fchain);
        }
-       hammer2_chain_unlock(&hmp->vchain);
+       hammer2_chain_unlock(&hmp->fchain);
+#endif
+
 
        error = 0;
 
index d570311..e427f76 100644 (file)
@@ -185,7 +185,7 @@ hammer2_vop_reclaim(struct vop_reclaim_args *ap)
        if (chain->flags & (HAMMER2_CHAIN_MODIFIED |
                            HAMMER2_CHAIN_DELETED |
                            HAMMER2_CHAIN_SUBMODIFIED)) {
-               hammer2_trans_init(ip->hmp, &trans, HAMMER2_TRANS_ISFLUSH);
+               hammer2_trans_init(&trans, hmp, ip, HAMMER2_TRANS_ISFLUSH);
                hammer2_chain_flush(&trans, chain);
                hammer2_trans_done(&trans);
        }
@@ -218,7 +218,7 @@ hammer2_vop_fsync(struct vop_fsync_args *ap)
        ip = VTOI(vp);
        hmp = ip->hmp;
 
-       hammer2_trans_init(hmp, &trans, HAMMER2_TRANS_ISFLUSH);
+       hammer2_trans_init(&trans, hmp, ip, HAMMER2_TRANS_ISFLUSH);
        chain = hammer2_inode_lock_ex(ip);
 
        vfsync(vp, ap->a_waitfor, 1, NULL, NULL);
@@ -336,7 +336,7 @@ hammer2_vop_setattr(struct vop_setattr_args *ap)
        if (hmp->ronly)
                return(EROFS);
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, ip, 0);
        chain = hammer2_inode_lock_ex(ip);
        ipdata = &chain->data->ipdata;
        error = 0;
@@ -740,7 +740,7 @@ hammer2_vop_write(struct vop_write_args *ap)
         * ip must be marked modified, particularly because the write
         * might wind up being copied into the embedded data area.
         */
-       hammer2_trans_init(ip->hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, ip, 0);
        parent = hammer2_inode_lock_ex(ip);
        error = hammer2_write_file(&trans, ip, &parent,
                                   uio, ap->a_ioflag, seqcount);
@@ -1225,7 +1225,7 @@ hammer2_truncate_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
                        case HAMMER2_BREF_TYPE_DATA:
                                hammer2_chain_resize(trans, ip, bp,
                                             parent, &chain,
-                                            hammer2_allocsize(nblksize),
+                                            hammer2_getradix(nblksize),
                                             HAMMER2_MODIFY_OPTDATA);
                                allocbuf(bp, nblksize);
                                bzero(bp->b_data + loff, nblksize - loff);
@@ -1283,7 +1283,7 @@ hammer2_truncate_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
                        case HAMMER2_BREF_TYPE_DATA:
                                chain = hammer2_chain_resize(trans, ip, bp,
                                             parent, chain,
-                                            hammer2_allocsize(nblksize),
+                                            hammer2_getradix(nblksize),
                                             0);
                                hammer2_chain_modify(hmp, &chain, 0);
                                bzero(chain->data->buf + loff, nblksize - loff);
@@ -1442,7 +1442,7 @@ hammer2_extend_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
 retry:
                error = 0;
                parent = hammer2_chain_lookup_init(ip->chain, 0);
-               nradix = hammer2_allocsize(nblksize);
+               nradix = hammer2_getradix(nblksize);
 
                chain = hammer2_chain_lookup(&parent,
                                             obase, obase,
@@ -1560,7 +1560,7 @@ hammer2_vop_nresolve(struct vop_nresolve_args *ap)
                kprintf("hammer2: need to unconsolidate hardlink for %s\n",
                        chain->data->ipdata.filename);
                /* XXX retain shared lock on dip? (currently not held) */
-               hammer2_trans_init(dip->hmp, &trans, 0);
+               hammer2_trans_init(&trans, hmp, dip, 0);
                hammer2_hardlink_deconsolidate(&trans, dip, &chain, &ochain);
                hammer2_trans_done(&trans);
        }
@@ -1659,7 +1659,7 @@ hammer2_vop_nmkdir(struct vop_nmkdir_args *ap)
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, dip, 0);
        nip = hammer2_inode_create(&trans, dip, ap->a_vap, ap->a_cred,
                                   name, name_len, &chain, &error);
        if (error) {
@@ -1839,7 +1839,6 @@ hammer2_vop_nlink(struct vop_nlink_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
-       hammer2_trans_init(hmp, &trans, 0);
 
        /*
         * ip represents the file being hardlinked.  The file could be a
@@ -1853,6 +1852,8 @@ hammer2_vop_nlink(struct vop_nlink_args *ap)
         * returned chain is locked.
         */
        ip = VTOI(ap->a_vp);
+       hammer2_trans_init(&trans, hmp, ip, 0);
+
        chain = hammer2_inode_lock_ex(ip);
        error = hammer2_hardlink_consolidate(&trans, ip, &chain, dip, 1);
        if (error)
@@ -1909,7 +1910,7 @@ hammer2_vop_ncreate(struct vop_ncreate_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, dip, 0);
 
        nip = hammer2_inode_create(&trans, dip, ap->a_vap, ap->a_cred,
                                   name, name_len, &nchain, &error);
@@ -1954,7 +1955,7 @@ hammer2_vop_nsymlink(struct vop_nsymlink_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, dip, 0);
 
        ap->a_vap->va_type = VLNK;      /* enforce type */
 
@@ -2040,7 +2041,7 @@ hammer2_vop_nremove(struct vop_nremove_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, dip, 0);
        error = hammer2_unlink_file(&trans, dip, name, name_len, 0, NULL);
        hammer2_trans_done(&trans);
        if (error == 0) {
@@ -2073,7 +2074,7 @@ hammer2_vop_nrmdir(struct vop_nrmdir_args *ap)
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, dip, 0);
        error = hammer2_unlink_file(&trans, dip, name, name_len, 1, NULL);
        hammer2_trans_done(&trans);
        if (error == 0) {
@@ -2124,7 +2125,7 @@ hammer2_vop_nrename(struct vop_nrename_args *ap)
        tname = tncp->nc_name;
        tname_len = tncp->nc_nlen;
 
-       hammer2_trans_init(hmp, &trans, 0);
+       hammer2_trans_init(&trans, hmp, tdip, 0);
 
        /*
         * ip is the inode being renamed.  If this is a hardlink then