hammer2 - Retool flushing and use of mirror_tid, more cluster work.
author Matthew Dillon <dillon@apollo.backplane.com>
Wed, 8 Apr 2015 05:53:06 +0000 (22:53 -0700)
committer Matthew Dillon <dillon@apollo.backplane.com>
Wed, 8 Apr 2015 06:25:27 +0000 (23:25 -0700)
Now that I'm starting to deal with different PFSs on the same physical
volume, some mirror_tid-related issues need to be fixed.

* mirror_tid is now per-media, not per-PFS.  This fixes a number of issues,
  particularly with the on-mount recovery scan.  Mount, recovery, and flush
  code no longer has to worry about PFSs when it comes to adjusting
  mirror_tid.

* modify_tid will become per-physical PFS (not quite done in this commit).

* Change where mirror_tid gets set.  Set bref.mirror_tid in
  hammer2_chain_modify() instead of in the flush code.  This takes care
  of most of the flush cases.

* Better separation of the freemap_tid in the volume header.

* fchain (freemap) and vchain (main topology) syncing now works properly.
  The fchain can be flushed independently of vchain, and the recovery code
  can handle any number of vchain flushes occurring without a fchain flush.

  For the moment, both are flushed on sync, fchain first, vchain second.
  This can leave fchain's mirror_tid slightly behind vchain requiring
  recovery on mount if a crash were to occur.

  We now properly update just the fchain mirror_tid on a followup sync if
  no main topology modifications have occurred, allowing the fchain
  mirror_tid to catch up to the vchain mirror_tid.

  We now properly sync the fchain on unmount so no recovery is required.

* Revamp the recovery code to properly use the fchain-to-vchain mirror_tid
  range in the recovery scan.  This makes the recovery pass run a whole lot
  faster, particularly when coupled with the above fixes.

  On mount, report whether recovery is needed and, if so, the mirror_tid
  range.
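
  As a rough illustration of the range check described above, here is a
  minimal sketch.  It is not the code from this commit: the toy_* names are
  hypothetical, and both the exact range bounds (freemap_tid + 1 through
  mirror_tid) and the ordering assertion are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_tid_t;

/* Hypothetical stand-in for the two volume header fields of interest. */
struct toy_volhdr {
	toy_tid_t mirror_tid;	/* tid of the last main topology (vchain) flush */
	toy_tid_t freemap_tid;	/* tid of the last freemap (fchain) flush */
};

/*
 * Decide whether a recovery scan is needed on mount.
 *
 * Returns 1 and fills in [*beg, *end] when the freemap lags the main
 * topology, 0 when the two are already in sync (e.g. after a clean
 * unmount).  The bounds used here are an assumption for illustration.
 */
static int
toy_recovery_range(const struct toy_volhdr *vh, toy_tid_t *beg, toy_tid_t *end)
{
	/* Assumed: the freemap is never ahead of the main topology. */
	assert(vh->freemap_tid <= vh->mirror_tid);

	if (vh->freemap_tid == vh->mirror_tid)
		return 0;
	*beg = vh->freemap_tid + 1;
	*end = vh->mirror_tid;
	return 1;
}
```

  A caller would only scan blockrefs whose mirror_tid falls inside the
  returned range, which is what keeps the pass incremental.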

* Update DESIGN.

* Add CITEM_FEMOD indicating which chains in the hammer2_cluster structure
  can actually be modified and modsync'd by a hammer2_cluster_modify() call.

* CITEM_INVALID now also checks bref.modify_tid as intended, when checking
  whether nodes are synchronized or not (it used to use mirror_tid but with
  the revamping modify_tid takes over this functionality).

* Remove the auto-ref/auto-drop from hammer2_chain_lock(),
  hammer2_chain_unlock(), hammer2_cluster_lock(),
  and hammer2_cluster_unlock().  Separate ref and drop calls are needed if
  the ref-count is not taken care of already.

  This makes the *chain* and *cluster* API basically behave the same way,
  reducing confusion.

  Clean up related #defines and code infrastructure that is no longer needed
  to handle RESOLVE_NOLOCK.
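
  The calling convention that results from this change can be sketched with
  a toy model.  The toy_* names below are hypothetical stand-ins, not the
  real hammer2_chain API; only the ordering (ref, lock, unlock, drop) and
  the "must already be referenced" assertion are taken from the commit.

```c
#include <assert.h>

/* Toy stand-in for a chain; only the two counters matter here. */
struct toy_chain {
	int refs;	/* keeps the structure alive */
	int lockcnt;	/* recursive lock count */
};

static void
toy_chain_ref(struct toy_chain *c)
{
	c->refs++;
}

static void
toy_chain_drop(struct toy_chain *c)
{
	assert(c->refs > 0);
	c->refs--;
}

/* Locking no longer takes a ref; the caller must already hold one. */
static void
toy_chain_lock(struct toy_chain *c)
{
	assert(c->refs > 0);
	c->lockcnt++;
}

/* Unlocking no longer drops the ref; the caller drops it separately. */
static void
toy_chain_unlock(struct toy_chain *c)
{
	assert(c->lockcnt > 0);
	c->lockcnt--;
}

/*
 * Caller-side pattern after this change.  Returns the ref count
 * observed while the chain was locked.
 */
static int
toy_use_chain(struct toy_chain *c)
{
	int seen;

	toy_chain_ref(c);
	toy_chain_lock(c);
	seen = c->refs;
	toy_chain_unlock(c);
	toy_chain_drop(c);
	return seen;
}
```

  Keeping the two counters independent means a lock/unlock pair never
  changes the lifetime of the structure, which is the behavior the chain
  and cluster APIs now share.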

* Fix a bug when LOOKUP_NOLOCK is used.  Do not assume LOOKUP_SHARED when
  LOOKUP_NOLOCK is used; use only the LOOKUP_SHARED flag to determine the
  locking type.  Otherwise relocking the parent (which has to be locked)
  for the degenerate DIRECTDATA case breaks the parent and causes a
  deadlock or assertion.
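
  The flag handling after the fix can be sketched as follows.  The TOY_*
  constants are hypothetical values chosen for illustration and do not
  match the real HAMMER2_LOOKUP_* / HAMMER2_RESOLVE_* definitions; only the
  rule itself (SHARED alone selects the lock type) comes from the commit.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define TOY_LOOKUP_NOLOCK	0x01
#define TOY_LOOKUP_SHARED	0x02
#define TOY_RESOLVE_MAYBE	0x02
#define TOY_RESOLVE_SHARED	0x10

/*
 * Compute the lock mode for a lookup.  NOLOCK only affects whether the
 * result is returned locked; it must not influence the lock type used
 * on the parent.
 */
static int
toy_lock_how(int flags)
{
	int how = TOY_RESOLVE_MAYBE;

	/* Reject flags this sketch does not model. */
	assert((flags & ~(TOY_LOOKUP_NOLOCK | TOY_LOOKUP_SHARED)) == 0);

	if (flags & TOY_LOOKUP_SHARED)
		how |= TOY_RESOLVE_SHARED;
	return how;
}
```

  With this rule a NOLOCK lookup on an exclusively locked parent keeps
  requesting an exclusive lock, so relocking the parent in the DIRECTDATA
  case stays consistent with the lock the caller already holds.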

* Preliminary adjustments to the slave synchronization code.  In particular,
  add support to hammer2_chain_modify() to suppress modify_tid updates so
  the slave synchronization code can update the field manually.  It must
  NOT update the modify_tid in parent chains in the topology until the
  sub-tree under them is synchronized.
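
  A simplified model of the tid update that hammer2_chain_modify() now
  performs is sketched below.  The structures and the toy_* / TOY_* names
  are hypothetical reductions for illustration; the logic mirrors the hunk
  added to hammer2_chain.c in this commit, with the volume header's
  mirror_tid passed in as a plain argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TRANS_KEEPMODIFY	0x0010	/* do not change bref.modify_tid */

struct toy_pfs {
	uint64_t modify_tid;	/* per-PFS modify transaction id */
};

struct toy_bref {
	uint64_t mirror_tid;	/* media-level transaction id */
	uint64_t modify_tid;	/* cluster-level transaction id */
};

struct toy_mchain {
	struct toy_pfs *pmp;	/* may be NULL in this sketch */
	struct toy_bref bref;
};

/*
 * Stamp a modified chain.  mirror_tid is always advanced past the last
 * flushed value; modify_tid is left alone when the transaction asks
 * for that (as the slave synchronization code does).
 */
static void
toy_update_tids(struct toy_mchain *chain, uint64_t vol_mirror_tid,
		int trans_flags)
{
	assert(chain != NULL);

	chain->bref.mirror_tid = vol_mirror_tid + 1;
	if (chain->pmp && (trans_flags & TOY_TRANS_KEEPMODIFY) == 0)
		chain->bref.modify_tid = chain->pmp->modify_tid + 1;
}
```

  A synchronization thread would pass TOY_TRANS_KEEPMODIFY while copying a
  sub-tree and set modify_tid by hand once everything beneath the chain
  matches its source.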

14 files changed:
sys/vfs/hammer2/DESIGN
sys/vfs/hammer2/TODO
sys/vfs/hammer2/hammer2.h
sys/vfs/hammer2/hammer2_bulkscan.c
sys/vfs/hammer2/hammer2_chain.c
sys/vfs/hammer2/hammer2_cluster.c
sys/vfs/hammer2/hammer2_disk.h
sys/vfs/hammer2/hammer2_flush.c
sys/vfs/hammer2/hammer2_freemap.c
sys/vfs/hammer2/hammer2_inode.c
sys/vfs/hammer2/hammer2_ioctl.c
sys/vfs/hammer2/hammer2_syncthr.c
sys/vfs/hammer2/hammer2_vfsops.c
sys/vfs/hammer2/hammer2_vnops.c

index f3a5f02..83ee8a1 100644 (file)
 
                                    Feature List
 
+* Block topology (both the main topology and the freemap) use a copy-on-write
+  design.  Media-level block frees are delayed and flushes rotate between
+  4 volume headers (maxes out at 4 if the filesystem is > ~8GB).  Flushes
+  will allocate new blocks up to the root in order to propagate block table
+  changes and transaction ids.
+
+* Incremental update scans are trivial by design.
+
 * Multiple roots, with many features.  This is implemented via the super-root
   concept.  When mounting a HAMMER2 filesystem you specify a device path and
   a directory name in the super-root.  (HAMMER1 had only one root).
@@ -185,6 +193,14 @@ filesystem directory/file hierarchy snapshots and other similar operations
 can be trivially performed on an entire subdirectory tree at any level in
 the filesystem.
 
+The copy-on-write design implements a block table in a radix-tree format,
+with a small 8x fan-out in the volume header and inode and a large 256x or
+1024x fan-out for indirect blocks.  The table is built bottom-up.
+Intermediate radii are only created when necessary so small files will use
+much shallower radix block trees.  The inode itself can accomodate files
+up 512KB (65536x8).  Directories also use a radix block table and directory
+inodes can accomodate up to 8 entries before pushing an indirect radix block.
+
 The copy-on-write nature of the filesystem implies that any modification
 whatsoever will have to eventually synchronize new disk blocks all the way
 to the super-root of the filesystem and the volume header itself.  This forms
@@ -193,7 +209,7 @@ completed high-level transaction boundary.  All disk writes are to new blocks
 except for the volume header (which cycles through 4 copies), thus allowing
 all writes to run asynchronously and concurrently prior to and during a flush,
 and then just doing a final synchronization and volume header update at the
-end.
+end.  Many of HAMMER2s features are enabled by this core design feature.
 
 Clearly this method requires intermediate modifications to the chain to be
 cached so multiple modifications can be aggregated prior to being
@@ -216,6 +232,82 @@ blockref is then recorded when the filesystem is mounted after a crash and
 the update chain is reconstituted when a matching blockref is encountered
 again during normal operation of the filesystem.
 
+                           MIRROR_TID, MODIFY_TID
+
+In HAMMER2, the core block reference is 64-byte structure called a blockref.
+The blockref contains various bits of information including the 64-bit radix
+key (typically a directory hash if a directory entry, inode number if a
+hidden hardlink target, or file offset if a file block), 64-bit data offset
+with the physical block size radix encoded in it (physical block size can be
+different from logical block size due to compression), two 64-bit transaction
+ids, type information, and 192 bits worth of check data for the block being
+reference which can be a simple CRC or stronger HASH.
+
+Both mirror_tid and modify_tid propagate upward from the change point all the
+way to the root, but serve different purposes and work in slightly different
+ways.
+
+mirror_tid - This is a media-centric (as in physical disk partition)
+            transaction id which tracks media-level updates.
+
+            Whenever any block in the media topology is modified, its
+            mirror_tid is updated with the flush id and will propagate
+            upward during the flush all the way to the volume header.
+
+            mirror_tid is monotonic.
+
+modify_tid - This is a cluster-centric (as in across all the nodes used
+            to build a cluster) transaction id which tracks filesystem-level
+            updates.
+
+            modify_tid is updated when the front-end of the filesystem makes
+            a change to an inode or data block.  It will also propagate
+            upward, stopping at the root of the PFS (the mount point for
+            the cluster).
+
+The major difference between mirror_tid and modify_tid is that for any given
+element in the topology residing on different nodes.  e.g. file "x" on node 1
+and file "x" on node 2, if the files are synchronized with each other they
+will have the same modify_tid on a block-by-block basis, and a single check
+of the inode's modify_tid is sufficient to determine that the files are fully
+synchronized and identical.  These same inodes and representitive blocks will
+have very different mirror_tids because the nodes will reside on different
+physical media.
+
+I noted above that modify_tids also propagate upward, but not in all cases.
+A node which is undergoing SYNCHRONIZATION only updates the modify_tid of
+a block when it has determined that the block and its entire sub-block
+hierarchy has been synchronized to that point.
+
+The synchronization code updates an out-of-sync node bottom-up and will
+definitely set modify_tid as it goes, but media flushes can occur at any
+time and these flushes will use mirror_tid for flush and freemap management.
+The mirror_tid for each flush propagates upward to the volume header on each
+flush.
+
+* The synchronization code is able to determine that a sub-tree is
+  synchronized simply by observing the modify_tid at the root of the sub-tree,
+  on a directory-by-directory basis.
+
+* The synchronization code is able to do an incremental update of an
+  out-of-sync node simply by skipping elements with matching modify_tids.
+
+* The synchronization code can be interrupted and restarted at any time,
+  and is able to pick up where it left off with very little overhead.
+
+* The synchronization code does not inhibit media flushes.  Media flushes
+  can occur (and must occur) while synchronization is ongoing.
+
+There are several other stored transaction ids in HAMMER2.  There is a
+separate freemap_tid in the volume header that is used to allow freemap
+flushes to be deferred, and inodes have an attr_tid and a dirent_tid which
+tracks attribute changes and (for directories) create/rename/delete changes.
+The inode TIDs are used as an aid for the cache coherency subsystem.
+
+Remember that since this is a copy-on-write filesystem, we can propagate
+a considerable amount of information up the tree to the volume header
+without adding to the I/O we already have to do.
+
                            DIRECTORIES AND INODES
 
 Directories are hashed, and another major design element is that directory
@@ -242,10 +334,12 @@ On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
 volume header from the 4 copies available (if the filesystem is big enough,
 e.g. > ~10GB or so, there will be 4 copies of the volume header).
 
-HAMMER2 will then run an incremental scan of the topology for transaction ids
-between the last freemap flush and the current topology in order to update
-the freemap.  Because this scan is incremental the worst-case time to run
-the scan is about the time it takes to run one flush.
+HAMMER2 will then run an incremental scan of the topology for mirror_tid
+transaction ids between the last freemap flush and the current topology in
+order to update the freemap.  Because this scan is incremental the
+worst-case time to run the scan is the time it takes to scan the meta-data
+for all changes made between the last freemap flush and the last topology
+flush.
 
 The filesystem is then ready for use.
 
@@ -482,24 +576,6 @@ There are four major protocols:
 
        This is basically the physical media protocol.
 
-There are lots of ways to implement multi-master environments using the
-above core features but the implementation is going to be fairly complex
-even with HAMMER2's feature set.
-
-Keep in mind that modifications propagate all the way to the super-root
-and volume header, so in any clustered arrangement the use of (modify_tid)
-and (mirror_tid) is critical in determining the synchronization state of
-portion(s) of the filesystem.
-
-Specifically, since any modification propagates to the root the (mirror_tid)
-in higher level directories is going to be in a constant state of flux.  This
-state of flux DOES NOT invalidate the cache state for these higher levels
-of directories.  Instead, the (modify_tid) is used on a node-by-node basis
-to determine cache state at any given level, and (mirror_tid) is used to
-determine whether any recursively underlying state is desynchronized.
-The inode structure also has two additional transaction ids used to optimize
-path lookups, stat, and directory lookup/scan operations.
-
                       MASTER & SLAVE SYNCHRONIZATION
 
 With HAMMER2 I really want to be hard-nosed about the consistency of the
index 796fc62..b818c43 100644 (file)
@@ -1,4 +1,6 @@
 
+* syncthr leaves inode locks for entire sync, which is wrong.
+
 * recovery scan vs unmount.  At the moment an unmount does its flushes,
   and if successful the freemap will be fully up-to-date, but the mount
   code doesn't know that and the last flush batch will probably match
@@ -6,6 +8,9 @@
   recovery pass at mount time can be extensive.  Add a CLEAN flag to the
   volume header to optimize out the unnecessary recovery pass.
 
+* More complex transaction sequencing and flush merging.  Right now it is
+  all serialized against flushes.
+
 * adding new pfs - freeze and force remaster
 
 * removing a pfs - freeze and force remaster
index 0d32a4d..f452547 100644 (file)
@@ -380,7 +380,7 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_CHAIN_ONFLUSH          0x00000200      /* on a flush list */
 #define HAMMER2_CHAIN_FICTITIOUS       0x00000400      /* unsuitable for I/O */
 #define HAMMER2_CHAIN_VOLUMESYNC       0x00000800      /* needs volume sync */
-#define HAMMER2_CHAIN_KEEP_MIRROR_TID  0x00001000      /* retain mirror_tid */
+#define HAMMER2_CHAIN_UNUSED00001000   0x00001000
 #define HAMMER2_CHAIN_UNUSED00002000   0x00002000
 #define HAMMER2_CHAIN_ONRBTREE         0x00004000      /* on parent RB tree */
 #define HAMMER2_CHAIN_UNUSED00008000   0x00008000
@@ -457,7 +457,7 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_RESOLVE_MASK           0x0F
 
 #define HAMMER2_RESOLVE_SHARED         0x10    /* request shared lock */
-#define HAMMER2_RESOLVE_NOREF          0x20    /* already ref'd on lock */
+#define HAMMER2_RESOLVE_UNUSED20       0x20
 #define HAMMER2_RESOLVE_RDONLY         0x40    /* higher level op flag */
 
 /*
@@ -544,8 +544,13 @@ typedef struct hammer2_cluster_item hammer2_cluster_item_t;
 /*
  * INVALID     - Invalid for focus, i.e. not part of synchronized set.
  *               Once set, this bit is sticky across operations.
+ *
+ * FEMOD       - Indicates that front-end modifying operations can
+ *               mess with this entry and MODSYNC will copy also
+ *               effect it.
  */
 #define HAMMER2_CITEM_INVALID  0x00000001
+#define HAMMER2_CITEM_FEMOD    0x00000002
 
 struct hammer2_cluster {
        int                     refs;           /* track for deallocation */
@@ -698,49 +703,16 @@ typedef struct hammer2_inode_unlink hammer2_inode_unlink_t;
  * A hammer2 transaction and flush sequencing structure.
  *
  * This global structure is tied into hammer2_dev and is used
- * to sequence modifying operations and flushes.
- *
- * (a) Any modifying operations with sync_tid >= flush_tid will stall until
- *     all modifying operating with sync_tid < flush_tid complete.
- *
- *     The flush related to flush_tid stalls until all modifying operations
- *     with sync_tid < flush_tid complete.
- *
- * (b) Once unstalled, modifying operations with sync_tid > flush_tid are
- *     allowed to run.  All modifications cause modify/duplicate operations
- *     to occur on the related chains.  Note that most INDIRECT blocks will
- *     be unaffected because the modifications just overload the RBTREE
- *     structurally instead of actually modifying the indirect blocks.
- *
- * (c) The actual flush unstalls and RUNS CONCURRENTLY with (b), but only
- *     utilizes the chain structures with sync_tid <= flush_tid.  The
- *     flush will modify related indirect blocks and inodes in-place
- *     (rather than duplicate) since the adjustments are compatible with
- *     (b)'s RBTREE overloading
- *
- *     SPECIAL NOTE:  Inode modifications have to also propagate along any
- *                   modify/duplicate chains.  File writes detect the flush
- *                   and force out the conflicting buffer cache buffer(s)
- *                   before reusing them.
- *
- * (d) Snapshots can be made instantly but must be flushed and disconnected
- *     from their duplicative source before they can be mounted.  This is
- *     because while H2's on-media structure supports forks, its in-memory
- *     structure only supports very simple forking for background flushing
- *     purposes.
- *
- * TODO: Flush merging.  When fsync() is called on multiple discrete files
- *      concurrently there is no reason to stall the second fsync.
- *      The final flush that reaches to root can cover both fsync()s.
- *
- *     The chains typically terminate as they fly onto the disk.  The flush
- *     ultimately reaches the volume header.
+ * to sequence modifying operations and flushes.  These operations
+ * run on whole cluster PFSs, not individual nodes (at this level),
+ * so we do not record mirror_tid here.
  */
 struct hammer2_trans {
        TAILQ_ENTRY(hammer2_trans) entry;
        struct hammer2_pfs      *pmp;
-       hammer2_xid_t           sync_xid;
+       hammer2_xid_t           sync_xid;       /* transaction sequencer */
        hammer2_tid_t           inode_tid;      /* inode number assignment */
+       hammer2_tid_t           modify_tid;     /* modify transaction id */
        thread_t                td;             /* pointer */
        int                     flags;
        int                     blocked;
@@ -754,7 +726,7 @@ typedef struct hammer2_trans hammer2_trans_t;
 #define HAMMER2_TRANS_CONCURRENT       0x0002  /* concurrent w/flush */
 #define HAMMER2_TRANS_BUFCACHE         0x0004  /* from bioq strategy write */
 #define HAMMER2_TRANS_NEWINODE         0x0008  /* caller allocating inode */
-#define HAMMER2_TRANS_UNUSED0010       0x0010
+#define HAMMER2_TRANS_KEEPMODIFY       0x0010  /* do not change bref.modify */
 #define HAMMER2_TRANS_PREFLUSH         0x0020  /* preflush state */
 
 #define HAMMER2_FREEMAP_HEUR_NRADIX    4       /* pwr 2 PBUFRADIX-MINIORADIX */
@@ -843,6 +815,7 @@ struct hammer2_dev {
        struct lock     vollk;          /* lockmgr lock */
        hammer2_off_t   heur_freemap[HAMMER2_FREEMAP_HEUR];
        int             volhdrno;       /* last volhdrno written */
+       char            devrepname[64]; /* for kprintf */
        hammer2_volume_data_t voldata;
        hammer2_volume_data_t volsync;  /* synchronized voldata */
 };
@@ -918,9 +891,8 @@ struct hammer2_pfs {
        struct malloc_type      *mmsg;
        struct spinlock         inum_spin;      /* inumber lookup */
        struct hammer2_inode_tree inum_tree;    /* (not applicable to spmp) */
-       hammer2_tid_t           alloc_tid;
-       hammer2_tid_t           flush_tid;
-       hammer2_tid_t           inode_tid;
+       hammer2_tid_t           modify_tid;     /* modify transaction id */
+       hammer2_tid_t           inode_tid;      /* inode allocator */
        uint8_t                 pfs_nmasters;   /* total masters */
        uint8_t                 pfs_mode;       /* operating mode PFSMODE */
        uint8_t                 unused01;
@@ -1231,7 +1203,6 @@ void hammer2_base_insert(hammer2_trans_t *trans, hammer2_chain_t *chain,
  */
 void hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfs_t *pmp,
                                int flags);
-void hammer2_trans_spmp(hammer2_trans_t *trans, hammer2_pfs_t *pmp);
 void hammer2_trans_done(hammer2_trans_t *trans);
 
 /*
@@ -1284,7 +1255,7 @@ void hammer2_bioq_sync(hammer2_pfs_t *pmp);
 int hammer2_vfs_sync(struct mount *mp, int waitflags);
 hammer2_pfs_t *hammer2_pfsalloc(hammer2_cluster_t *cluster,
                                const hammer2_inode_data_t *ripdata,
-                               hammer2_tid_t alloc_tid);
+                               hammer2_tid_t modify_tid);
 
 void hammer2_lwinprog_ref(hammer2_pfs_t *pmp);
 void hammer2_lwinprog_drop(hammer2_pfs_t *pmp);
index 38630d9..179bfcc 100644 (file)
@@ -91,8 +91,7 @@ hammer2_bulk_scan(hammer2_trans_t *trans, hammer2_chain_t *parent,
                 * lock the parent, the lock eats the ref.
                 */
                hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS |
-                                          HAMMER2_RESOLVE_SHARED |
-                                          HAMMER2_RESOLVE_NOREF);
+                                          HAMMER2_RESOLVE_SHARED);
 
                /*
                 * Generally loop on the contents if we have not been flagged
@@ -108,6 +107,7 @@ hammer2_bulk_scan(hammer2_trans_t *trans, hammer2_chain_t *parent,
 
                        if (doabort & HAMMER2_BULK_ABORT) {
                                hammer2_chain_unlock(chain);
+                               hammer2_chain_drop(chain);
                                chain = NULL;
                                break;
                        }
@@ -144,6 +144,7 @@ hammer2_bulk_scan(hammer2_trans_t *trans, hammer2_chain_t *parent,
                 * save structure if we didn't recycle it above.
                 */
                hammer2_chain_unlock(parent);
+               hammer2_chain_drop(parent);
                if (save)
                        kfree(save, M_HAMMER2);
        }
@@ -469,6 +470,7 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
        bmap = cbinfo->bmap;
 
        live_parent = &cbinfo->hmp->fchain;
+       hammer2_chain_ref(live_parent);
        hammer2_chain_lock(live_parent, HAMMER2_RESOLVE_ALWAYS);
        live_chain = NULL;
 
@@ -487,8 +489,10 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
                 */
                key = (data_off & ~HAMMER2_FREEMAP_LEVEL1_MASK);
                if (live_chain == NULL || live_chain->bref.key != key) {
-                       if (live_chain)
+                       if (live_chain) {
                                hammer2_chain_unlock(live_chain);
+                               hammer2_chain_drop(live_chain);
+                       }
                        live_chain = hammer2_chain_lookup(
                                            &live_parent,
                                            &key_dummy,
@@ -516,6 +520,7 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
                                hammer2_error_str(live_chain->error),
                                (intmax_t)data_off);
                        hammer2_chain_unlock(live_chain);
+                       hammer2_chain_drop(live_chain);
                        live_chain = NULL;
                        goto next;
                }
@@ -543,10 +548,14 @@ next:
                data_off += HAMMER2_FREEMAP_LEVEL0_SIZE;
                ++bmap;
        }
-       if (live_chain)
+       if (live_chain) {
                hammer2_chain_unlock(live_chain);
-       if (live_parent)
+               hammer2_chain_drop(live_chain);
+       }
+       if (live_parent) {
                hammer2_chain_unlock(live_parent);
+               hammer2_chain_drop(live_parent);
+       }
 }
 
 static
index a350127..2c0ca68 100644 (file)
@@ -505,7 +505,7 @@ hammer2_chain_drop_data(hammer2_chain_t *chain, int lastdrop)
 }
 
 /*
- * Ref and lock a chain element, acquiring its data with I/O if necessary,
+ * Lock a referenced chain element, acquiring its data with I/O if necessary,
  * and specify how you would like the data to be resolved.
  *
  * If an I/O or other fatal error occurs, chain->error will be set to non-zero.
@@ -558,8 +558,7 @@ hammer2_chain_lock(hammer2_chain_t *chain, int how)
        /*
         * Ref and lock the element.  Recursive locks are allowed.
         */
-       if ((how & HAMMER2_RESOLVE_NOREF) == 0)
-               hammer2_chain_ref(chain);
+       KKASSERT(chain->refs > 0);
        atomic_add_int(&chain->lockcnt, 1);
 
        hmp = chain->hmp;
@@ -742,7 +741,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
                        if (atomic_cmpset_int(&chain->lockcnt,
                                              lockcnt, lockcnt - 1)) {
                                hammer2_mtx_unlock(&chain->core.lock);
-                               hammer2_chain_drop(chain);
                                return;
                        }
                } else {
@@ -767,7 +765,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
        ostate = hammer2_mtx_upgrade(&chain->core.lock);
        if (chain->lockcnt) {
                hammer2_mtx_unlock(&chain->core.lock);
-               hammer2_chain_drop(chain);
                return;
        }
 
@@ -781,7 +778,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
                if ((chain->flags & HAMMER2_CHAIN_MODIFIED) == 0)
                        hammer2_chain_drop_data(chain, 0);
                hammer2_mtx_unlock(&chain->core.lock);
-               hammer2_chain_drop(chain);
                return;
        }
 
@@ -856,7 +852,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
                hammer2_io_bqrelse(&chain->dio);
        }
        hammer2_mtx_unlock(&chain->core.lock);
-       hammer2_chain_drop(chain);
 }
 
 /*
@@ -1049,6 +1044,20 @@ hammer2_chain_modify(hammer2_trans_t *trans, hammer2_chain_t *chain, int flags)
                }
        }
 
+       /*
+        * Update mirror_tid and modify_tid.
+        *
+        * NOTE: modify_tid updates can be suppressed with a flag.  This is
+        *       used by the slave synchronization code to delay updating
+        *       modify_tid in higher-level objects until lower-level objects
+        *       have been synchronized.
+        *
+        * NOTE: chain->pmp could be the device spmp.
+        */
+       chain->bref.mirror_tid = hmp->voldata.mirror_tid + 1;
+       if (chain->pmp && (trans->flags & HAMMER2_TRANS_KEEPMODIFY) == 0)
+               chain->bref.modify_tid = chain->pmp->modify_tid + 1;
+
        /*
         * Set BMAPUPD to tell the flush code that an existing blockmap entry
         * requires updating as well as to tell the delete code that the
@@ -1423,6 +1432,7 @@ hammer2_chain_get(hammer2_chain_t *parent, int generation,
 hammer2_chain_t *
 hammer2_chain_lookup_init(hammer2_chain_t *parent, int flags)
 {
+       hammer2_chain_ref(parent);
        if (flags & HAMMER2_LOOKUP_SHARED) {
                hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS |
                                           HAMMER2_RESOLVE_SHARED);
@@ -1435,8 +1445,10 @@ hammer2_chain_lookup_init(hammer2_chain_t *parent, int flags)
 void
 hammer2_chain_lookup_done(hammer2_chain_t *parent)
 {
-       if (parent)
+       if (parent) {
                hammer2_chain_unlock(parent);
+               hammer2_chain_drop(parent);
+       }
 }
 
 static
@@ -1457,10 +1469,11 @@ hammer2_chain_getparent(hammer2_chain_t **parentp, int how)
        hammer2_spin_unex(&oparent->core.spin);
        if (oparent) {
                hammer2_chain_unlock(oparent);
+               hammer2_chain_drop(oparent);
                oparent = NULL;
        }
 
-       hammer2_chain_lock(nparent, how | HAMMER2_RESOLVE_NOREF);
+       hammer2_chain_lock(nparent, how);
        *parentp = nparent;
 
        return (nparent);
@@ -1481,7 +1494,14 @@ hammer2_chain_getparent(hammer2_chain_t **parentp, int how)
  * will be unlocked and dereferenced (no change if they are both the same).
  *
  * The matching chain will be returned exclusively locked.  If NOLOCK is
- * requested the chain will be returned only referenced.
+ * requested the chain will be returned only referenced.  Note that the
+ * parent chain must always be locked shared or exclusive, matching the
+ * HAMMER2_LOOKUP_SHARED flag.  We can conceivably lock it SHARED temporarily
+ * when NOLOCK is specified but that complicates matters if *parentp must
+ * inherit the chain.
+ *
+ * NOLOCK also implies NODATA, since an unlocked chain usually has a NULL
+ * data pointer or can otherwise be in flux.
  *
  * NULL is returned if no match was found, but (*parentp) will still
  * potentially be adjusted.
@@ -1533,7 +1553,7 @@ hammer2_chain_lookup(hammer2_chain_t **parentp, hammer2_key_t *key_nextp,
        } else {
                how = HAMMER2_RESOLVE_MAYBE;
        }
-       if (flags & (HAMMER2_LOOKUP_SHARED | HAMMER2_LOOKUP_NOLOCK)) {
+       if (flags & HAMMER2_LOOKUP_SHARED) {
                how_maybe |= HAMMER2_RESOLVE_SHARED;
                how_always |= HAMMER2_RESOLVE_SHARED;
                how |= HAMMER2_RESOLVE_SHARED;
@@ -1577,9 +1597,8 @@ again:
                 * This is only applicable to regular files and softlinks.
                 */
                if (parent->data->ipdata.op_flags & HAMMER2_OPFLAG_DIRECTDATA) {
-                       if (flags & HAMMER2_LOOKUP_NOLOCK)
-                               hammer2_chain_ref(parent);
-                       else
+                       hammer2_chain_ref(parent);
+                       if ((flags & HAMMER2_LOOKUP_NOLOCK) == 0)
                                hammer2_chain_lock(parent, how_always);
                        *key_nextp = key_end + 1;
                        return (parent);
@@ -1598,6 +1617,7 @@ again:
                               ((hammer2_key_t)1 << parent->bref.keybits) - 1;
                        if (key_beg == scan_beg && key_end == scan_end) {
                                chain = parent;
+                               hammer2_chain_ref(chain);
                                hammer2_chain_lock(chain, how_maybe);
                                *key_nextp = scan_end + 1;
                                goto done;
@@ -1709,9 +1729,9 @@ again:
         */
        if (chain->bref.type == HAMMER2_BREF_TYPE_INDIRECT ||
            chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE) {
-               hammer2_chain_lock(chain, how_maybe | HAMMER2_RESOLVE_NOREF);
+               hammer2_chain_lock(chain, how_maybe);
        } else {
-               hammer2_chain_lock(chain, how | HAMMER2_RESOLVE_NOREF);
+               hammer2_chain_lock(chain, how);
        }
 
        /*
@@ -1729,6 +1749,7 @@ again:
         */
        if (chain->flags & HAMMER2_CHAIN_DELETED) {
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
                key_beg = *key_nextp;
                if (key_beg == 0 || key_beg > key_end)
                        return(NULL);
@@ -1754,6 +1775,7 @@ again:
        if (chain->bref.type == HAMMER2_BREF_TYPE_INDIRECT ||
            chain->bref.type == HAMMER2_BREF_TYPE_FREEMAP_NODE) {
                hammer2_chain_unlock(parent);
+               hammer2_chain_drop(parent);
                *parentp = parent = chain;
                goto again;
        }
@@ -1767,10 +1789,8 @@ done:
         * need to be resolved.
         */
        if (chain) {
-               if (flags & HAMMER2_LOOKUP_NOLOCK) {
-                       hammer2_chain_ref(chain);
+               if (flags & HAMMER2_LOOKUP_NOLOCK)
                        hammer2_chain_unlock(chain);
-               }
        }
 
        return (chain);
@@ -1808,7 +1828,7 @@ hammer2_chain_next(hammer2_chain_t **parentp, hammer2_chain_t *chain,
         * Calculate locking flags for upward recursion.
         */
        how_maybe = HAMMER2_RESOLVE_MAYBE;
-       if (flags & (HAMMER2_LOOKUP_SHARED | HAMMER2_LOOKUP_NOLOCK))
+       if (flags & HAMMER2_LOOKUP_SHARED)
                how_maybe |= HAMMER2_RESOLVE_SHARED;
 
        parent = *parentp;
@@ -1819,10 +1839,9 @@ hammer2_chain_next(hammer2_chain_t **parentp, hammer2_chain_t *chain,
        if (chain) {
                key_beg = chain->bref.key +
                          ((hammer2_key_t)1 << chain->bref.keybits);
-               if (flags & HAMMER2_LOOKUP_NOLOCK)
-                       hammer2_chain_drop(chain);
-               else
+               if ((flags & HAMMER2_LOOKUP_NOLOCK) == 0)
                        hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
 
                /*
                 * chain invalid past this point, but we can still do a
@@ -1892,7 +1911,7 @@ hammer2_chain_scan(hammer2_chain_t *parent, hammer2_chain_t *chain,
        hmp = parent->hmp;
 
        /*
-        * Scan flags borrowed from lookup
+        * Scan flags borrowed from lookup.
         */
        if (flags & HAMMER2_LOOKUP_ALWAYS) {
                how_maybe = how_always;
@@ -1902,7 +1921,7 @@ hammer2_chain_scan(hammer2_chain_t *parent, hammer2_chain_t *chain,
        } else {
                how = HAMMER2_RESOLVE_MAYBE;
        }
-       if (flags & (HAMMER2_LOOKUP_SHARED | HAMMER2_LOOKUP_NOLOCK)) {
+       if (flags & HAMMER2_LOOKUP_SHARED) {
                how_maybe |= HAMMER2_RESOLVE_SHARED;
                how_always |= HAMMER2_RESOLVE_SHARED;
                how |= HAMMER2_RESOLVE_SHARED;
@@ -1916,6 +1935,7 @@ hammer2_chain_scan(hammer2_chain_t *parent, hammer2_chain_t *chain,
                key = chain->bref.key +
                      ((hammer2_key_t)1 << chain->bref.keybits);
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
                chain = NULL;
                if (key == 0)
                        goto done;
@@ -2026,7 +2046,7 @@ again:
         * chain is referenced but not locked.  We must lock the chain
         * to obtain definitive DUPLICATED/DELETED state
         */
-       hammer2_chain_lock(chain, how | HAMMER2_RESOLVE_NOREF);
+       hammer2_chain_lock(chain, how);
 
        /*
         * Skip deleted chains (XXX cache 'i' end-of-block-array? XXX)
@@ -2046,6 +2066,7 @@ again:
         */
        if (chain->flags & HAMMER2_CHAIN_DELETED) {
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
                chain = NULL;
 
                key = next_key;
@@ -2287,6 +2308,7 @@ again:
                }
                if (parent != nparent) {
                        hammer2_chain_unlock(parent);
+                       hammer2_chain_drop(parent);
                        parent = *parentp = nparent;
                }
                goto again;
@@ -2822,7 +2844,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        ichain = hammer2_chain_alloc(hmp, parent->pmp, trans, &dummy.bref);
        atomic_set_int(&ichain->flags, HAMMER2_CHAIN_INITIAL);
        hammer2_chain_lock(ichain, HAMMER2_RESOLVE_MAYBE);
-       hammer2_chain_drop(ichain);     /* excess ref from alloc */
+       /* ichain has one ref at this point */
 
        /*
         * We have to mark it modified to allocate its block, but use
@@ -2890,8 +2912,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                         */
                        hammer2_chain_ref(chain);
                        hammer2_spin_unex(&parent->core.spin);
-                       hammer2_chain_lock(chain, HAMMER2_RESOLVE_NEVER |
-                                                 HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_lock(chain, HAMMER2_RESOLVE_NEVER);
                } else {
                        /*
                         * Get chain for blockref element.  _get returns NULL
@@ -2912,8 +2933,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                                hammer2_spin_ex(&parent->core.spin);
                                continue;
                        }
-                       hammer2_chain_lock(chain, HAMMER2_RESOLVE_NEVER |
-                                                 HAMMER2_RESOLVE_NOREF);
+                       hammer2_chain_lock(chain, HAMMER2_RESOLVE_NEVER);
                }
 
                /*
@@ -2931,6 +2951,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                 */
                if (chain->flags & HAMMER2_CHAIN_DELETED) {
                        hammer2_chain_unlock(chain);
+                       hammer2_chain_drop(chain);
                        goto next_key;
                }
 
@@ -2947,6 +2968,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
                hammer2_chain_rename(trans, NULL, &ichain, chain,
                                     HAMMER2_INSERT_NOSTATS);
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
                KKASSERT(parent->refs > 0);
                chain = NULL;
 next_key:
@@ -2996,6 +3018,7 @@ next_key_spinlocked:
                 * return the original parent.
                 */
                hammer2_chain_unlock(ichain);
+               hammer2_chain_drop(ichain);
        } else {
                /*
                 * Otherwise it's in the range, return the new parent.
@@ -3316,6 +3339,14 @@ hammer2_chain_delete(hammer2_trans_t *trans, hammer2_chain_t *parent,
                _hammer2_chain_delete_helper(trans, parent, chain, flags);
        }
 
+       /*
+        * NOTE: Special case call to hammer2_flush().  We are not in a FLUSH
+        *       transaction, so we can't pass a mirror_tid for the volume.
+        *       But since we are destroying the chain we can just pass 0
+        *       and use the flush call to clean out the subtopology.
+        *
+        *       XXX not the best way to destroy the sub-topology.
+        */
        if (flags & HAMMER2_DELETE_PERMANENT) {
                atomic_set_int(&chain->flags, HAMMER2_CHAIN_DESTROY);
                hammer2_flush(trans, chain);
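
Note on the hunks above: they appear to separate reference counting from locking, so a lock no longer takes its own ref and every unlock is paired with an explicit drop. A minimal sketch of that pairing in C follows. `toy_chain` and the `toy_*` helpers are hypothetical stand-ins, not the hammer2 API, and the model ignores shared/exclusive modes and data resolution.

```c
#include <assert.h>

/* Hypothetical stand-in for a chain; only the two counters matter here. */
struct toy_chain {
	int refs;	/* reference count, keeps the structure alive */
	int lockcnt;	/* lock hold count, independent of refs */
};

static void toy_ref(struct toy_chain *c)  { c->refs++; }
static void toy_drop(struct toy_chain *c) { assert(c->refs > 0); c->refs--; }

/* Locking requires an existing ref but does not take one of its own. */
static void toy_lock(struct toy_chain *c)
{
	assert(c->refs > 0);
	c->lockcnt++;
}

/* Unlocking releases the lock only; the ref is dropped separately. */
static void toy_unlock(struct toy_chain *c)
{
	assert(c->lockcnt > 0);
	c->lockcnt--;
}

/*
 * Visit a chain the way the updated callers do: ref, lock, work,
 * unlock, drop.  Returns the ref count left after the visit.
 */
static int toy_visit(struct toy_chain *c)
{
	toy_ref(c);
	toy_lock(c);
	/* ... examine or modify the chain here ... */
	toy_unlock(c);
	toy_drop(c);
	return c->refs;
}
```

Under this model the caller's own ref survives the visit, which is consistent with the "ichain has one ref at this point" comment in hammer2_chain_create_indirect().
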
index fd41002..8c2382b 100644
@@ -127,6 +127,8 @@ hammer2_cluster_need_resize(hammer2_cluster_t *cluster, int bytes)
 
        KKASSERT(cluster->flags & HAMMER2_CLUSTER_LOCKED);
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0)
+                       continue;
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -171,7 +173,8 @@ hammer2_cluster_modified(hammer2_cluster_t *cluster)
  * Returns the bref of the cluster's focus, sans any data-offset information
  * (since offset information is per-node and wouldn't be useful).
  *
- * Callers use this function to access mirror_tid, key, and keybits.
+ * Callers use this function to access modify_tid, mirror_tid, type,
+ * key, and keybits.
  *
  * If the cluster is errored, returns an empty bref.
  * The cluster must be locked.
@@ -206,6 +209,8 @@ hammer2_cluster_isunlinked(hammer2_cluster_t *cluster)
 
        flags = 0;
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0)
+                       continue;
                chain = cluster->array[i].chain;
                if (chain)
                        flags |= chain->flags;
@@ -216,6 +221,8 @@ hammer2_cluster_isunlinked(hammer2_cluster_t *cluster)
 /*
  * Set a bitmask of flags in all chains related to a cluster.
  * The cluster should probably be locked.
+ *
+ * XXX Only operate on FEMOD elements?
  */
 void
 hammer2_cluster_set_chainflags(hammer2_cluster_t *cluster, uint32_t flags)
@@ -233,6 +240,8 @@ hammer2_cluster_set_chainflags(hammer2_cluster_t *cluster, uint32_t flags)
 /*
  * Clear a bitmask of flags in all chains related to a cluster.
  * The cluster should probably be locked.
+ *
+ * XXX Only operate on FEMOD elements?
  */
 void
 hammer2_cluster_clr_chainflags(hammer2_cluster_t *cluster, uint32_t flags)
@@ -263,6 +272,8 @@ hammer2_cluster_setflush(hammer2_trans_t *trans, hammer2_cluster_t *cluster)
        int i;
 
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0)
+                       continue;
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -276,7 +287,7 @@ hammer2_cluster_setflush(hammer2_trans_t *trans, hammer2_cluster_t *cluster)
  * Set the check mode for the cluster.
  * Errored elements of the cluster are ignored.
  *
- * The cluster must be locked.
+ * The cluster must be locked and modified.
  */
 void
 hammer2_cluster_setmethod_check(hammer2_trans_t *trans,
@@ -288,6 +299,10 @@ hammer2_cluster_setmethod_check(hammer2_trans_t *trans,
 
        KKASSERT(cluster->flags & HAMMER2_CLUSTER_LOCKED);
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
+                       continue;
+               }
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -316,6 +331,7 @@ hammer2_cluster_from_chain(hammer2_chain_t *chain)
 
        cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
        cluster->array[0].chain = chain;
+       cluster->array[0].flags = HAMMER2_CITEM_FEMOD;
        cluster->nchains = 1;
        cluster->focus = chain;
        cluster->focus_index = 0;
@@ -397,6 +413,13 @@ hammer2_cluster_wait(hammer2_cluster_t *cluster)
  * cluster does not adjust this flag since exact matches only matter for leafs
  * (parents can depend on minor differences in topology).
  *
+ * HAMMER2_CITEM_FEMOD flags which elements can be modified by normal
+ * operations.  Typically this is only set on a quorum of MASTERs or
+ * on a SOFT_MASTER.  It is also set, as a degenerate case, on SUPROOT.  If a
+ * SOFT_MASTER is present, this bit is *not* set on a quorum of MASTERs.  The
+ * synchronization code ignores this bit, but all hammer2_cluster_*() calls
+ * that create/modify/delete elements use it.
+ *
  * The chains making up the cluster may be narrowed down based on quorum
  * acceptability, and if RESOLVE_RDONLY is specified the chains can be
  * narrowed down to a single chain as long as the entire subtopology is known
@@ -418,18 +441,16 @@ hammer2_cluster_lock(hammer2_cluster_t *cluster, int how)
        int i;
 
        /* cannot be on inode-embedded cluster template, must be on copy */
+       KKASSERT(cluster->refs > 0);
        KKASSERT((cluster->flags & HAMMER2_CLUSTER_INODE) == 0);
        if (cluster->flags & HAMMER2_CLUSTER_LOCKED) {
-               kprintf("hammer2_cluster_lock: cluster %p already locked!\n",
+               panic("hammer2_cluster_lock: cluster %p already locked!\n",
                        cluster);
        } else {
                KKASSERT(cluster->focus == NULL);
        }
        atomic_set_int(&cluster->flags, HAMMER2_CLUSTER_LOCKED);
 
-       if ((how & HAMMER2_RESOLVE_NOREF) == 0)
-               atomic_add_int(&cluster->refs, 1);
-
        /*
         * Lock chains and resolve state.
         */
@@ -456,6 +477,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
        int nmasters;
        int nslaves;
        int nquorum;
+       int smpresent;
        int i;
 
        cluster->error = 0;
@@ -474,6 +496,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
        pmp = cluster->pmp;
        KKASSERT(pmp != NULL || cluster->nchains == 0);
        nquorum = pmp ? pmp->pfs_nmasters / 2 + 1 : 0;
+       smpresent = 0;
 
        /*
         * Pass 1
@@ -510,11 +533,11 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                                 * Invalid as in unsynchronized, cannot be
                                 * used to calculate the quorum.
                                 */
-                       } else if (quorum_tid < chain->bref.mirror_tid ||
+                       } else if (quorum_tid < chain->bref.modify_tid ||
                                   nmasters == 0) {
                                nmasters = 1;
-                               quorum_tid = chain->bref.mirror_tid;
-                       } else if (quorum_tid == chain->bref.mirror_tid) {
+                               quorum_tid = chain->bref.modify_tid;
+                       } else if (quorum_tid == chain->bref.modify_tid) {
                                ++nmasters;
                        }
                        break;
@@ -524,6 +547,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                case HAMMER2_PFSTYPE_SOFT_MASTER:
                        nflags |= HAMMER2_CLUSTER_WRSOFT;
                        nflags |= HAMMER2_CLUSTER_RDSOFT;
+                       smpresent = 1;
                        break;
                case HAMMER2_PFSTYPE_SOFT_SLAVE:
                        nflags |= HAMMER2_CLUSTER_RDSOFT;
@@ -531,7 +555,8 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                case HAMMER2_PFSTYPE_SUPROOT:
                        /*
                         * Degenerate cluster representing the super-root
-                        * topology on a single device.
+                        * topology on a single device.  Fake stuff so
+                        * cluster ops work as expected.
                         */
                        nflags |= HAMMER2_CLUSTER_WRHARD;
                        nflags |= HAMMER2_CLUSTER_RDHARD;
@@ -548,6 +573,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
         * Pass 2
         */
        for (i = 0; i < cluster->nchains; ++i) {
+               cluster->array[i].flags &= ~HAMMER2_CITEM_FEMOD;
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -563,8 +589,8 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                case HAMMER2_PFSTYPE_MASTER:
                        /*
                         * We must have enough up-to-date masters to reach
-                        * a quorum and the master mirror_tid must match
-                        * the quorum's mirror_tid.
+                        * a quorum and the master modify_tid must match
+                        * the quorum's modify_tid.
                         *
                         * Do not select an errored or out-of-sync master.
                         */
@@ -572,9 +598,13 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                                nflags |= HAMMER2_CLUSTER_UNHARD;
                        } else if (nmasters >= nquorum &&
                                   chain->error == 0 &&
-                                  quorum_tid == chain->bref.mirror_tid) {
+                                  quorum_tid == chain->bref.modify_tid) {
                                nflags |= HAMMER2_CLUSTER_WRHARD;
                                nflags |= HAMMER2_CLUSTER_RDHARD;
+                               if (!smpresent) {
+                                       cluster->array[i].flags |=
+                                                       HAMMER2_CITEM_FEMOD;
+                               }
                                if (cluster->focus == NULL ||
                                    focus_pfs_type == HAMMER2_PFSTYPE_SLAVE) {
                                        focus_pfs_type = HAMMER2_PFSTYPE_MASTER;
@@ -589,8 +619,8 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                case HAMMER2_PFSTYPE_SLAVE:
                        /*
                         * We must have enough up-to-date masters to reach
-                        * a quorum and the slave mirror_tid must match the
-                        * quorum's mirror_tid.
+                        * a quorum and the slave modify_tid must match the
+                        * quorum's modify_tid.
                         *
                         * Do not select an errored slave.
                         */
@@ -598,7 +628,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                                nflags |= HAMMER2_CLUSTER_UNHARD;
                        } else if (nmasters >= nquorum &&
                                   chain->error == 0 &&
-                                  quorum_tid == chain->bref.mirror_tid) {
+                                  quorum_tid == chain->bref.modify_tid) {
                                ++nslaves;
                                nflags |= HAMMER2_CLUSTER_RDHARD;
                                if (cluster->focus == NULL) {
@@ -621,6 +651,7 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                        cluster->focus = chain;
                        cluster->error = chain->error;
                        focus_pfs_type = HAMMER2_PFSTYPE_SOFT_MASTER;
+                       cluster->array[i].flags |= HAMMER2_CITEM_FEMOD;
                        break;
                case HAMMER2_PFSTYPE_SOFT_SLAVE:
                        /*
@@ -635,15 +666,26 @@ hammer2_cluster_resolve(hammer2_cluster_t *cluster)
                                focus_pfs_type = HAMMER2_PFSTYPE_SOFT_SLAVE;
                        }
                        break;
+               case HAMMER2_PFSTYPE_SUPROOT:
+                       /*
+                        * spmp (degenerate case)
+                        */
+                       KKASSERT(i == 0);
+                       cluster->focus_index = i;
+                       cluster->focus = chain;
+                       cluster->error = chain->error;
+                       focus_pfs_type = HAMMER2_PFSTYPE_SUPROOT;
+                       cluster->array[i].flags |= HAMMER2_CITEM_FEMOD;
+                       break;
                default:
                        break;
                }
        }
 
        if (ttlslaves == 0)
-               nflags |= HAMMER2_CLUSTER_NOHARD;
-       if (ttlmasters == 0)
                nflags |= HAMMER2_CLUSTER_NOSOFT;
+       if (ttlmasters == 0)
+               nflags |= HAMMER2_CLUSTER_NOHARD;
 
        /*
         * Set SSYNCED or MSYNCED for slaves and masters respectively if
@@ -713,25 +755,17 @@ hammer2_cluster_unlock(hammer2_cluster_t *cluster)
                kprintf("hammer2_cluster_unlock: cluster %p not locked\n",
                        cluster);
        }
-       /* KKASSERT(cluster->flags & HAMMER2_CLUSTER_LOCKED); */
+       KKASSERT(cluster->flags & HAMMER2_CLUSTER_LOCKED);
        KKASSERT(cluster->refs > 0);
        atomic_clear_int(&cluster->flags, HAMMER2_CLUSTER_LOCKED);
 
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
-               if (chain) {
+               if (chain)
                        hammer2_chain_unlock(chain);
-                       if (cluster->refs == 1)
-                               cluster->array[i].chain = NULL; /* safety */
-               }
        }
        cluster->focus_index = 0;
        cluster->focus = NULL;
-
-       if (atomic_fetchadd_int(&cluster->refs, -1) == 1) {
-               kfree(cluster, M_HAMMER2);
-               /* cluster = NULL; safety */
-       }
 }
 
 /*
@@ -750,6 +784,10 @@ hammer2_cluster_resize(hammer2_trans_t *trans, hammer2_inode_t *ip,
        KKASSERT(cparent->nchains == cluster->nchains);
 
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
+                       continue;
+               }
                chain = cluster->array[i].chain;
                if (chain) {
                        KKASSERT(cparent->array[i].chain);
@@ -799,6 +837,10 @@ hammer2_cluster_modify(hammer2_trans_t *trans, hammer2_cluster_t *cluster,
 
        resolve_again = 0;
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
+                       continue;
+               }
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -838,6 +880,8 @@ hammer2_cluster_modsync(hammer2_cluster_t *cluster)
        KKASSERT(focus->flags & HAMMER2_CHAIN_MODIFIED);
 
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0)
+                       continue;
                scan = cluster->array[i].chain;
                if (scan == NULL || scan == focus)
                        continue;
@@ -875,26 +919,14 @@ hammer2_cluster_modsync(hammer2_cluster_t *cluster)
 }
 
 /*
- * Lookup initialization/completion API
+ * Lookup initialization/completion API.  Returns a locked cluster with 1 ref.
  */
 hammer2_cluster_t *
 hammer2_cluster_lookup_init(hammer2_cluster_t *cparent, int flags)
 {
        hammer2_cluster_t *cluster;
-       int i;
 
-       cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
-       cluster->pmp = cparent->pmp;                    /* can be NULL */
-       cluster->flags = 0;     /* cluster not locked (yet) */
-       /* cluster->focus = NULL; already null */
-
-       for (i = 0; i < cparent->nchains; ++i)
-               cluster->array[i].chain = cparent->array[i].chain;
-       cluster->nchains = cparent->nchains;
-
-       /*
-        * Independently lock (this will also give cluster 1 ref)
-        */
+       cluster = hammer2_cluster_copy(cparent);
        if (flags & HAMMER2_LOOKUP_SHARED) {
                hammer2_cluster_lock(cluster, HAMMER2_RESOLVE_ALWAYS |
                                              HAMMER2_RESOLVE_SHARED);
@@ -907,8 +939,10 @@ hammer2_cluster_lookup_init(hammer2_cluster_t *cparent, int flags)
 void
 hammer2_cluster_lookup_done(hammer2_cluster_t *cparent)
 {
-       if (cparent)
+       if (cparent) {
                hammer2_cluster_unlock(cparent);
+               hammer2_cluster_drop(cparent);
+       }
 }
 
 /*
@@ -1018,6 +1052,7 @@ hammer2_cluster_lookup(hammer2_cluster_t *cparent, hammer2_key_t *key_nextp,
                if (chain->bref.type != focus->bref.type ||
                    chain->bref.key != focus->bref.key ||
                    chain->bref.keybits != focus->bref.keybits ||
+                   chain->bref.modify_tid != focus->bref.modify_tid ||
                    chain->bytes != focus->bytes ||
                    ddflag != cluster->ddflag) {
                        cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
@@ -1083,10 +1118,10 @@ hammer2_cluster_next(hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
                if (cparent->array[i].chain == NULL ||
                    (cparent->array[i].flags & HAMMER2_CITEM_INVALID) ||
                    (cluster->array[i].flags & HAMMER2_CITEM_INVALID)) {
-                       if (flags & HAMMER2_LOOKUP_NOLOCK)
-                               hammer2_chain_drop(ochain);
-                       else
+                       if ((flags & HAMMER2_LOOKUP_NOLOCK) == 0)
                                hammer2_chain_unlock(ochain);
+                       hammer2_chain_drop(ochain);
+                       cluster->array[i].chain = NULL;
                        ++null_count;
                        continue;
                }
@@ -1154,6 +1189,7 @@ hammer2_cluster_next(hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
                if (nchain->bref.type != focus->bref.type ||
                    nchain->bref.key != focus->bref.key ||
                    nchain->bref.keybits != focus->bref.keybits ||
+                   nchain->bref.modify_tid != focus->bref.modify_tid ||
                    nchain->bytes != focus->bytes ||
                    ddflag != cluster->ddflag) {
                        cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
@@ -1211,7 +1247,8 @@ hammer2_cluster_next_single_chain(hammer2_cluster_t *cparent,
        /* ochain now invalid */
 
        /*
-        * Install nchain.  Note that nchain can be NULL.
+        * Install nchain.  Note that nchain can be NULL, and can also
+        * be in an unlocked state depending on flags.
         */
        cluster->array[i].chain = nchain;
        cluster->array[i].flags &= ~HAMMER2_CITEM_INVALID;
@@ -1232,6 +1269,7 @@ hammer2_cluster_next_single_chain(hammer2_cluster_t *cparent,
        if (nchain->bref.type != focus->bref.type ||
            nchain->bref.key != focus->bref.key ||
            nchain->bref.keybits != focus->bref.keybits ||
+           nchain->bref.modify_tid != focus->bref.modify_tid ||
            nchain->bytes != focus->bytes ||
            ddflag != cluster->ddflag) {
                cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
@@ -1279,9 +1317,20 @@ hammer2_cluster_create(hammer2_trans_t *trans, hammer2_cluster_t *cparent,
         *       create new chains.
         */
        for (i = 0; i < cparent->nchains; ++i) {
-               if (*clusterp && cluster->array[i].chain == NULL) {
+               if ((cparent->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
                        continue;
                }
+               if (*clusterp) {
+                       if ((cluster->array[i].flags &
+                            HAMMER2_CITEM_FEMOD) == 0) {
+                               cluster->array[i].flags |=
+                                               HAMMER2_CITEM_INVALID;
+                               continue;
+                       }
+                       if (cluster->array[i].chain == NULL)
+                               continue;
+               }
                error = hammer2_chain_create(trans, &cparent->array[i].chain,
                                             &cluster->array[i].chain, pmp,
                                             key, keybits,
@@ -1325,6 +1374,10 @@ hammer2_cluster_rename(hammer2_trans_t *trans, hammer2_blockref_t *bref,
        cparent->focus_index = 0;
 
        for (i = 0; i < cluster->nchains; ++i) {
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
+                       continue;
+               }
                chain = cluster->array[i].chain;
                if (chain) {
                        if (bref) {
@@ -1361,8 +1414,11 @@ hammer2_cluster_delete(hammer2_trans_t *trans, hammer2_cluster_t *cparent,
        }
 
        for (i = 0; i < cluster->nchains; ++i) {
-               parent = (i < cparent->nchains) ?
-                        cparent->array[i].chain : NULL;
+               if ((cluster->array[i].flags & HAMMER2_CITEM_FEMOD) == 0) {
+                       cluster->array[i].flags |= HAMMER2_CITEM_INVALID;
+                       continue;
+               }
+               parent = cparent->array[i].chain;
                chain = cluster->array[i].chain;
                if (chain == NULL)
                        continue;
@@ -1458,6 +1514,12 @@ hammer2_cluster_snapshot(hammer2_trans_t *trans, hammer2_cluster_t *ocluster,
                kern_uuidgen(&wipdata->pfs_clid, 1);
 
                for (i = 0; i < ncluster->nchains; ++i) {
+                       if ((ncluster->array[i].flags &
+                            HAMMER2_CITEM_FEMOD) == 0) {
+                               ncluster->array[i].flags |=
+                                       HAMMER2_CITEM_INVALID;
+                               continue;
+                       }
                        nchain = ncluster->array[i].chain;
                        if (nchain)
                                nchain->bref.flags |= HAMMER2_BREF_FLAG_PFSROOT;
@@ -1515,10 +1577,10 @@ hammer2_cluster_parent(hammer2_cluster_t *cluster)
                        hammer2_chain_unlock(chain);
                        hammer2_chain_lock(rchain, HAMMER2_RESOLVE_ALWAYS);
                        hammer2_chain_lock(chain, HAMMER2_RESOLVE_ALWAYS);
-                       hammer2_chain_drop(rchain);
                        if (chain->parent == rchain)
                                break;
                        hammer2_chain_unlock(rchain);
+                       hammer2_chain_drop(rchain);
                }
                if (cluster->focus == chain) {
                        cparent->focus_index = i;
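
Note on the hammer2_cluster_resolve() hunks above: the quorum is now computed from modify_tid, and CITEM_FEMOD is set on the quorum MASTERs only when no SOFT_MASTER is present. The two passes can be sketched roughly as below. `toy_item`, `toy_pfstype` and `toy_resolve()` are hypothetical names for illustration; the sketch omits error handling, SLAVE/SUPROOT handling, focus selection and the cluster flags.

```c
#include <assert.h>
#include <stdint.h>

enum toy_pfstype { TOY_MASTER, TOY_SLAVE, TOY_SOFT_MASTER };

/* Hypothetical stand-in for one element of cluster->array[]. */
struct toy_item {
	enum toy_pfstype type;
	uint64_t modify_tid;
	int femod;		/* models HAMMER2_CITEM_FEMOD */
};

/*
 * Decide which elements front-end modifications may touch.
 * Returns the number of elements with femod set.
 */
static int
toy_resolve(struct toy_item *array, int nitems, int pfs_nmasters)
{
	uint64_t quorum_tid = 0;
	int nquorum = pfs_nmasters / 2 + 1;
	int nmasters = 0;
	int smpresent = 0;
	int nfemod = 0;
	int i;

	assert(nitems >= 0);

	/* Pass 1: find the highest master modify_tid and count matches. */
	for (i = 0; i < nitems; ++i) {
		if (array[i].type == TOY_SOFT_MASTER) {
			smpresent = 1;
		} else if (array[i].type == TOY_MASTER) {
			if (nmasters == 0 ||
			    quorum_tid < array[i].modify_tid) {
				nmasters = 1;
				quorum_tid = array[i].modify_tid;
			} else if (quorum_tid == array[i].modify_tid) {
				++nmasters;
			}
		}
	}

	/* Pass 2: clear femod everywhere, then set it where allowed. */
	for (i = 0; i < nitems; ++i) {
		array[i].femod = 0;
		if (array[i].type == TOY_SOFT_MASTER) {
			array[i].femod = 1;
		} else if (array[i].type == TOY_MASTER &&
			   nmasters >= nquorum &&
			   quorum_tid == array[i].modify_tid &&
			   !smpresent) {
			array[i].femod = 1;
		}
		nfemod += array[i].femod;
	}
	return nfemod;
}
```

In this model an out-of-date master never receives femod, which corresponds to the modifying loops above skipping such elements and marking them CITEM_INVALID.
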
index 0edd230..8633783 100644
@@ -553,8 +553,8 @@ struct hammer2_blockref {           /* MUST BE EXACTLY 64 BYTES */
        uint8_t         reserved06;
        uint8_t         reserved07;
        hammer2_key_t   key;            /* key specification */
-       hammer2_tid_t   mirror_tid;     /* propagate for mirror scan */
-       hammer2_tid_t   modify_tid;     /* modifications sans propagation */
+       hammer2_tid_t   mirror_tid;     /* media flush topology & freemap */
+       hammer2_tid_t   modify_tid;     /* cluster level change / flush */
        hammer2_off_t   data_off;       /* low 6 bits is phys size (radix)*/
        union {                         /* check info */
                char    buf[24];
@@ -1085,11 +1085,12 @@ struct hammer2_volume_data {
        hammer2_off_t   allocator_beg;          /* 0070 Initial allocations */
 
        /*
-        * mirror_tid reflects the highest committed super-root change
-        * freemap_tid reflects the highest committed freemap change
+        * mirror_tid reflects the highest committed change for this
+        * block device, whether the change is to the super-root, to a
+        * PFS, or to anything else on the device.
         *
-        * NOTE: mirror_tid does not track (and should not track) changes
-        *       made to or under PFS roots.
+        * freemap_tid reflects the highest committed freemap change for
+        * this block device.
         */
        hammer2_tid_t   mirror_tid;             /* 0078 committed tid (vol) */
        hammer2_tid_t   reserved0080;           /* 0080 */
index 44b1e6c..f9cf05b 100644
  * Deceptively simple but actually fairly difficult to implement properly is
  * how I would describe it.
  *
- * The biggest issue is that each PFS may belong to a cluster so its media
- * modify_tid and mirror_tid fields are in a completely different domain
- * than the topology related to the super-root.
- *
  * Flushing generally occurs bottom-up but requires a top-down scan to
  * locate chains with MODIFIED and/or UPDATE bits set.  The ONFLUSH flag
  * tells how to recurse downward to find these chains.
@@ -76,6 +72,8 @@ struct hammer2_flush_info {
        int             cache_index;
        struct h2_flush_list flushq;
        hammer2_xid_t   sync_xid;       /* memory synchronization point */
+       hammer2_tid_t   mirror_tid;     /* avoid digging through hmp */
+       hammer2_tid_t   modify_tid;
        hammer2_chain_t *debug;
 };
 
@@ -166,11 +164,10 @@ hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfs_t *pmp, int flags)
                 * unique TID for proper block table update accounting.
                 */
                ++tman->flushcnt;
-               ++pmp->alloc_tid;
-               pmp->flush_tid = pmp->alloc_tid;
+               ++pmp->modify_tid;
                tman->flush_xid = hammer2_trans_newxid(pmp);
                trans->sync_xid = tman->flush_xid;
-               ++pmp->alloc_tid;
+               trans->modify_tid = pmp->modify_tid;
                TAILQ_INSERT_TAIL(&tman->transq, trans, entry);
                if (TAILQ_FIRST(&tman->transq) != trans) {
                        trans->blocked = 1;
@@ -207,6 +204,7 @@ hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfs_t *pmp, int flags)
                trans->flags |= HAMMER2_TRANS_PREFLUSH;
                TAILQ_INSERT_AFTER(&tman->transq, head, trans, entry);
                trans->sync_xid = head->sync_xid;
+               trans->modify_tid = head->modify_tid;
                trans->flags |= HAMMER2_TRANS_CONCURRENT;
                /* not allowed to block */
        } else {
@@ -265,21 +263,6 @@ hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfs_t *pmp, int flags)
        lockmgr(&tman->translk, LK_RELEASE);
 }
 
-/*
- * This may only be called while in a flush transaction.  It's a bit of a
- * hack but after flushing a PFS we need to flush each volume root as part
- * of the same transaction.
- */
-void
-hammer2_trans_spmp(hammer2_trans_t *trans, hammer2_pfs_t *spmp)
-{
-       ++spmp->alloc_tid;
-       spmp->flush_tid = spmp->alloc_tid;
-       ++spmp->alloc_tid;
-       trans->pmp = spmp;
-}
-
-
 void
 hammer2_trans_done(hammer2_trans_t *trans)
 {
@@ -341,8 +324,8 @@ hammer2_trans_done(hammer2_trans_t *trans)
 
 /*
  * Flush the chain and all modified sub-chains through the specified
- * synchronization point, propagating parent chain modifications and
- * mirror_tid updates back up as needed.
+ * synchronization point, propagating parent chain modifications, modify_tid,
+ * and mirror_tid updates back up as needed.
  *
  * Caller must have interlocked against any non-flush-related modifying
  * operations in progress whose XXX values are less than or equal
@@ -359,8 +342,6 @@ hammer2_trans_done(hammer2_trans_t *trans)
  * UPDATE flag indicates that its parent's block table (which is not yet
  * part of the flush) should be updated.  The chain may be replaced by
  * the call if it was modified.
- *
- * NOTE: mirror_tid is not updated upward along the tree for SLAVE PFSs.
  */
 void
 hammer2_flush(hammer2_trans_t *trans, hammer2_chain_t *chain)
@@ -418,9 +399,9 @@ hammer2_flush(hammer2_trans_t *trans, hammer2_chain_t *chain)
                        if (hammer2_debug & 0x0040)
                                kprintf("deferred flush %p\n", scan);
                        hammer2_chain_lock(scan, HAMMER2_RESOLVE_MAYBE);
-                       hammer2_chain_drop(scan);       /* ref from deferral */
                        hammer2_flush(trans, scan);
                        hammer2_chain_unlock(scan);
+                       hammer2_chain_drop(scan);       /* ref from deferral */
                }
 
                /*
@@ -478,9 +459,9 @@ hammer2_flush(hammer2_trans_t *trans, hammer2_chain_t *chain)
  *
  *                     WARNING ON BREF MODIFY_TID/MIRROR_TID
  *
- * blockref.modify_tid and blockref.mirror_tid are consistent only within a
- * PFS.  This is why we cannot cache sync_tid in the transaction structure.
- * Instead we access it from the pmp.
+ * blockref.modify_tid is consistent only within a PFS, and will not be
+ * consistent during synchronization.  mirror_tid is consistent across the
+ * block device regardless of the PFS.
  */
 static void
 hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
@@ -488,7 +469,6 @@ hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
 {
        hammer2_chain_t *parent;
        hammer2_dev_t *hmp;
-       hammer2_pfs_t *pmp;
        int diddeferral;
 
        /*
@@ -505,18 +485,9 @@ hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
        }
 
        hmp = chain->hmp;
-       pmp = chain->pmp;               /* can be NULL */
        diddeferral = info->diddeferral;
        parent = info->parent;          /* can be NULL */
 
-#if 0
-       /*
-        * XXX mirror_tid allowed to be forward-indexed during synchronization.
-        * mirror_tid should not be forward-indexed
-        */
-       KKASSERT(pmp == NULL || chain->bref.mirror_tid <= pmp->flush_tid);
-#endif
-
        /*
         * Downward search recursion
         */
@@ -588,21 +559,21 @@ again:
 
        if (chain->flags & HAMMER2_CHAIN_MODIFIED) {
                /*
-                * Dispose of the modified bit.  UPDATE should already be
-                * set.
+                * Dispose of the modified bit.
+                *
+                * UPDATE should already be set.
+                * bref.mirror_tid should already be set.
                 */
                KKASSERT((chain->flags & HAMMER2_CHAIN_UPDATE) ||
                         chain == &hmp->vchain);
                atomic_clear_int(&chain->flags, HAMMER2_CHAIN_MODIFIED);
 
                /*
-                * Update mirror_tid unless told otherwise.
+                * Manage threads waiting for excessive dirty memory to
+                * be retired.
                 */
-               if (pmp) {
-                       hammer2_pfs_memory_wakeup(pmp);
-                       if ((chain->flags & HAMMER2_CHAIN_KEEP_MIRROR_TID) == 0)
-                               chain->bref.mirror_tid = pmp->flush_tid;
-               }
+               if (chain->pmp)
+                       hammer2_pfs_memory_wakeup(chain->pmp);
 
                if ((chain->flags & HAMMER2_CHAIN_UPDATE) ||
                    chain == &hmp->vchain ||
@@ -632,9 +603,6 @@ again:
                 *       flush and must be set dirty if we are going to make
                 *       further modifications to the buffer.  Chains with
                 *       embedded data don't need this.
-                *
-                * Update bref.mirror_tid clear MODIFIED, and set UPDATE for
-                * special blockref types.
                 */
                if (hammer2_debug & 0x1000) {
                        kprintf("Flush %p.%d %016jx/%d sync_xid=%08x "
@@ -657,31 +625,56 @@ again:
                switch(chain->bref.type) {
                case HAMMER2_BREF_TYPE_FREEMAP:
                        /*
+                        * Update the volume header's freemap_tid to the
+                        * freemap's flushing mirror_tid.
+                        *
                         * (note: embedded data, do not call setdirty)
                         */
                        KKASSERT(hmp->vchain.flags & HAMMER2_CHAIN_MODIFIED);
-                       hmp->voldata.freemap_tid = hmp->fchain.bref.mirror_tid;
+                       KKASSERT(chain == &hmp->fchain);
+                       hmp->voldata.freemap_tid = chain->bref.mirror_tid;
+                       kprintf("sync freemap mirror_tid %08jx\n",
+                               (intmax_t)chain->bref.mirror_tid);
+
+                       /*
+                        * The freemap can be flushed independently of the
+                        * main topology, but for the case where it is
+                        * flushed in the same transaction, and flushed
+                        * before vchain (a case we want to allow for
+                        * performance reasons), make sure modifications
+                        * made during the flush under vchain use a new
+                        * transaction id.
+                        *
+                        * Otherwise the mount recovery code will get confused.
+                        */
+                       ++hmp->voldata.mirror_tid;
                        break;
                case HAMMER2_BREF_TYPE_VOLUME:
                        /*
-                        * The free block table is flushed by hammer2_vfs_sync()
-                        * before it flushes vchain.  We must still hold fchain
-                        * locked while copying voldata to volsync, however.
+                        * The free block table is flushed by
+                        * hammer2_vfs_sync() before it flushes vchain.
+                        * We must still hold fchain locked while copying
+                        * voldata to volsync, however.
                         *
                         * (note: embedded data, do not call setdirty)
                         */
                        hammer2_voldata_lock(hmp);
                        hammer2_chain_lock(&hmp->fchain,
                                           HAMMER2_RESOLVE_ALWAYS);
+                       kprintf("sync volume  mirror_tid %08jx\n",
+                               (intmax_t)chain->bref.mirror_tid);
+
                        /*
-                        * There is no parent to our root vchain and fchain to
-                        * synchronize the bref to, their updated mirror_tid's
-                        * must be synchronized to the volume header.
+                        * Update the volume header's mirror_tid to the
+                        * main topology's flushing mirror_tid.  It is
+                        * possible that voldata.mirror_tid is already
+                        * beyond bref.mirror_tid due to the bump we made
+                        * above in BREF_TYPE_FREEMAP.
                         */
-                       hmp->voldata.mirror_tid = chain->bref.mirror_tid;
-                       hmp->voldata.freemap_tid = hmp->fchain.bref.mirror_tid;
-                       kprintf("mirror_tid %08jx\n",
-                               (intmax_t)chain->bref.mirror_tid);
+                       if (hmp->voldata.mirror_tid < chain->bref.mirror_tid) {
+                               hmp->voldata.mirror_tid =
+                                       chain->bref.mirror_tid;
+                       }
 
                        /*
                         * The volume header is flushed manually by the
@@ -706,6 +699,10 @@ again:
                                        (char *)&hmp->voldata +
                                         HAMMER2_VOLUME_ICRCVH_OFF,
                                        HAMMER2_VOLUME_ICRCVH_SIZE);
+
+                       kprintf("syncvolhdr %016jx %016jx\n",
+                               hmp->voldata.mirror_tid,
+                               hmp->vchain.bref.mirror_tid);
                        hmp->volsync = hmp->voldata;
                        atomic_set_int(&chain->flags, HAMMER2_CHAIN_VOLUMESYNC);
                        hammer2_chain_unlock(&hmp->fchain);
@@ -748,8 +745,10 @@ again:
 
                                hammer2_io_setdirty(chain->dio);
                                ipdata = &chain->data->ipdata;
-                               if (pmp)
-                                       ipdata->pfs_inum = pmp->inode_tid;
+                               if (chain->pmp) {
+                                       ipdata->pfs_inum =
+                                               chain->pmp->inode_tid;
+                               }
                        } else {
                                /* can't be mounted as a PFS */
                        }
@@ -932,15 +931,7 @@ again:
         * Final cleanup after flush
         */
 done:
-       KKASSERT(chain->refs > 1);
-#if 0
-       /*
-        * XXX mirror_tid allowed to be forward-indexed during synchronization.
-        * mirror_tid should not be forward-indexed
-        */
-       KKASSERT(pmp == NULL ||
-                chain->bref.mirror_tid <= chain->pmp->flush_tid);
-#endif
+       KKASSERT(chain->refs > 0);
        if (hammer2_debug & 0x200) {
                if (info->debug == chain)
                        info->debug = NULL;
@@ -990,25 +981,21 @@ hammer2_flush_recurse(hammer2_chain_t *child, void *data)
        hammer2_chain_lock(child, HAMMER2_RESOLVE_MAYBE);
 
        /*
-        * Never recurse across a mounted PFS boundary.
-        *
-        * Recurse and collect deferral data.
+        * Recurse and collect deferral data.  We're in the media flush,
+        * which can cross PFS boundaries.
         */
-       if ((child->flags & HAMMER2_CHAIN_PFSBOUNDARY) == 0 ||
-           child->pmp == NULL) {
-               if (child->flags & HAMMER2_CHAIN_FLUSH_MASK) {
-                       ++info->depth;
-                       hammer2_flush_core(info, child, 0); /* XXX deleting */
-                       --info->depth;
-               } else if (hammer2_debug & 0x200) {
-                       if (info->debug == NULL)
-                               info->debug = child;
-                       ++info->depth;
-                       hammer2_flush_core(info, child, 0); /* XXX deleting */
-                       --info->depth;
-                       if (info->debug == child)
-                               info->debug = NULL;
-               }
+       if (child->flags & HAMMER2_CHAIN_FLUSH_MASK) {
+               ++info->depth;
+               hammer2_flush_core(info, child, 0); /* XXX deleting */
+               --info->depth;
+       } else if (hammer2_debug & 0x200) {
+               if (info->debug == NULL)
+                       info->debug = child;
+               ++info->depth;
+               hammer2_flush_core(info, child, 0); /* XXX deleting */
+               --info->depth;
+               if (info->debug == child)
+                       info->debug = NULL;
        }
 
        /*
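The FREEMAP and VOLUME cases in the hunks above change how the volume header's TIDs are updated: the freemap flush records its own mirror_tid and bumps voldata.mirror_tid, and the volume flush only moves voldata.mirror_tid forward. A minimal standalone sketch of that logic follows. The sketch_* names and the simplified struct are hypothetical stand-ins, not the hammer2 definitions, and the recovery predicate is an assumption drawn from the commit message's description of the fchain-to-vchain mirror_tid range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for hammer2_tid_t. */
typedef uint64_t sketch_tid_t;

/* Hypothetical, reduced view of the volume header fields involved. */
struct sketch_voldata {
	sketch_tid_t mirror_tid;	/* main topology (vchain) sync point */
	sketch_tid_t freemap_tid;	/* freemap (fchain) sync point */
};

/*
 * Freemap flush: record the freemap's flushing mirror_tid, then bump
 * voldata.mirror_tid so that modifications made under vchain later in
 * the same transaction use a new transaction id.
 */
static void
sketch_flush_freemap(struct sketch_voldata *vd, sketch_tid_t fchain_mirror_tid)
{
	vd->freemap_tid = fchain_mirror_tid;
	++vd->mirror_tid;
}

/*
 * Volume flush: voldata.mirror_tid may already be beyond the chain's
 * mirror_tid because of the bump above, so only move it forward.
 */
static void
sketch_flush_volume(struct sketch_voldata *vd, sketch_tid_t vchain_mirror_tid)
{
	if (vd->mirror_tid < vchain_mirror_tid)
		vd->mirror_tid = vchain_mirror_tid;
	assert(vd->mirror_tid >= vchain_mirror_tid);
}

/*
 * Assumed recovery test: a freemap that lags the main topology means
 * the mount-time scan has a non-empty TID range to examine.
 */
static int
sketch_needs_recovery(const struct sketch_voldata *vd)
{
	return (vd->freemap_tid < vd->mirror_tid);
}
```

With both flushes in one sync (fchain first, vchain second), this model leaves freemap_tid slightly behind mirror_tid, which matches the commit message's note that a follow-up sync lets the fchain catch up.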
index e89de3f..3213e65 100644 (file)
@@ -271,6 +271,7 @@ hammer2_freemap_alloc(hammer2_trans_t *trans, hammer2_chain_t *chain,
         * Iterate the freemap looking for free space before and after.
         */
        parent = &hmp->fchain;
+       hammer2_chain_ref(parent);
        hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS);
        error = EAGAIN;
        iter.bnext = iter.bpref;
@@ -282,6 +283,7 @@ hammer2_freemap_alloc(hammer2_trans_t *trans, hammer2_chain_t *chain,
        }
        hmp->heur_freemap[hindex] = iter.bnext;
        hammer2_chain_unlock(parent);
+       hammer2_chain_drop(parent);
 
        if (trans->flags & (HAMMER2_TRANS_ISFLUSH | HAMMER2_TRANS_PREFLUSH))
                --trans->sync_xid;
@@ -499,8 +501,10 @@ hammer2_freemap_try_alloc(hammer2_trans_t *trans, hammer2_chain_t **parentp,
        /*
         * Cleanup
         */
-       if (chain)
+       if (chain) {
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
+       }
        return (error);
 }
 
@@ -852,6 +856,7 @@ hammer2_freemap_adjust(hammer2_trans_t *trans, hammer2_dev_t *hmp,
        l1mask = l1size - 1;
 
        parent = &hmp->fchain;
+       hammer2_chain_ref(parent);
        hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS);
 
        chain = hammer2_chain_lookup(&parent, &key_dummy, key, key + l1mask,
@@ -872,6 +877,7 @@ hammer2_freemap_adjust(hammer2_trans_t *trans, hammer2_dev_t *hmp,
                        (intmax_t)bref->data_off,
                        hammer2_error_str(chain->error));
                hammer2_chain_unlock(chain);
+               hammer2_chain_drop(chain);
                chain = NULL;
                goto done;
        }
@@ -1058,8 +1064,10 @@ again:
                chain->bref.check.freemap.bigmask |= 1 << radix;
 
        hammer2_chain_unlock(chain);
+       hammer2_chain_drop(chain);
 done:
        hammer2_chain_unlock(parent);
+       hammer2_chain_drop(parent);
 }
 
 /*
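The freemap hunks above add an explicit hammer2_chain_ref() before each hammer2_chain_lock() and a hammer2_chain_drop() after each hammer2_chain_unlock(), and the deferred-flush loop now drops its ref only after unlocking. The toy model below illustrates that pairing discipline. It assumes, as the diff suggests but does not state outright, that locking no longer acquires a reference of its own; the sketch_* names and the two-counter struct are hypothetical and carry none of the real chain state.

```c
#include <assert.h>

/* Hypothetical reduced chain: just a ref count and a lock hold count. */
struct sketch_chain {
	int refs;
	int locks;
};

static void
sketch_ref(struct sketch_chain *c)
{
	++c->refs;
}

/* In this model the caller must already hold a ref before locking. */
static void
sketch_lock(struct sketch_chain *c)
{
	assert(c->refs > 0);
	++c->locks;
}

static void
sketch_unlock(struct sketch_chain *c)
{
	assert(c->locks > 0);
	--c->locks;
}

static void
sketch_drop(struct sketch_chain *c)
{
	assert(c->refs > 0);
	--c->refs;
}

/*
 * The pattern used throughout the diff: ref, lock, work, unlock, drop.
 * The drop comes last so the chain stays referenced while it is locked.
 */
static void
sketch_use_chain(struct sketch_chain *c)
{
	sketch_ref(c);
	sketch_lock(c);
	/* ... operate on the locked chain ... */
	sketch_unlock(c);
	sketch_drop(c);
}
```

Keeping the two counters separate makes an unbalanced path visible: an early return that unlocks without dropping would leave refs one higher than on entry.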
index 434e6e3..07cf47f 100644 (file)
@@ -91,9 +91,6 @@ hammer2_inode_cmp(hammer2_inode_t *ip1, hammer2_inode_t *ip2)
  * NOTE: In-memory inodes always point to hardlink targets (the actual file),
  *      and never point to a hardlink pointer.
  *
- * NOTE: Caller must not passed HAMMER2_RESOLVE_NOREF because we use it
- *      internally and refs confusion will ensue.
- *
  * NOTE: If caller passes HAMMER2_RESOLVE_RDONLY the exclusive locking code
  *      will feel free to reduce the chain set in the cluster as an
  *      optimization.  It will still be validated against the quorum if
@@ -106,8 +103,6 @@ hammer2_inode_lock(hammer2_inode_t *ip, int how)
 {
        hammer2_cluster_t *cluster;
 
-       KKASSERT((how & HAMMER2_RESOLVE_NOREF) == 0);
-
        hammer2_inode_ref(ip);
 
        /* 
@@ -132,7 +127,7 @@ hammer2_inode_lock(hammer2_inode_t *ip, int how)
         * working copy if the hint does not work out, so beware.
         */
        cluster = hammer2_cluster_copy(&ip->cluster);
-       hammer2_cluster_lock(cluster, how | HAMMER2_RESOLVE_NOREF);
+       hammer2_cluster_lock(cluster, how);
 
        /*
         * cluster->focus will be set if resolving RESOLVE_ALWAYS, but
@@ -144,9 +139,11 @@ hammer2_inode_lock(hammer2_inode_t *ip, int how)
 
        /*
         * Returned cluster must resolve hardlink pointers.
+        * XXX remove me.
         */
        if ((how & HAMMER2_RESOLVE_MASK) == HAMMER2_RESOLVE_ALWAYS &&
-           cluster->error == 0) {
+           cluster->error == 0 &&
+           cluster->focus) {
                const hammer2_inode_data_t *ripdata;
 
                ripdata = &hammer2_cluster_rdata(cluster)->ipdata;
@@ -158,8 +155,10 @@ hammer2_inode_lock(hammer2_inode_t *ip, int how)
 void
 hammer2_inode_unlock(hammer2_inode_t *ip, hammer2_cluster_t *cluster)
 {
-       if (cluster)
+       if (cluster) {
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
+       }
        hammer2_mtx_unlock(&ip->lock);
        hammer2_inode_drop(ip);
 }
@@ -653,6 +652,7 @@ retry:
                if ((lhc & HAMMER2_DIRHASH_LOMASK) == HAMMER2_DIRHASH_LOMASK)
                        error = ENOSPC;
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
                cluster = NULL;
                ++lhc;
        }
@@ -827,6 +827,7 @@ hammer2_hardlink_shiftup(hammer2_trans_t *trans, hammer2_cluster_t *cluster,
                        xcluster->focus, dip, dcluster->focus,
                        dip->cluster.focus);
                hammer2_cluster_unlock(xcluster);
+               hammer2_cluster_drop(xcluster);
                xcluster = NULL;
                *errorp = ENOSPC;
 #if 0
@@ -933,6 +934,7 @@ hammer2_inode_connect(hammer2_trans_t *trans,
                                error = ENOSPC;
                        }
                        hammer2_cluster_unlock(ncluster);
+                       hammer2_cluster_drop(ncluster);
                        ncluster = NULL;
                        ++lhc;
                }
@@ -1032,6 +1034,7 @@ hammer2_inode_connect(hammer2_trans_t *trans,
                wipdata->op_flags = HAMMER2_OPFLAG_DIRECTDATA;
                hammer2_cluster_modsync(ncluster);
                hammer2_cluster_unlock(ncluster);
+               hammer2_cluster_drop(ncluster);
                ncluster = ocluster;
                ocluster = NULL;
        } else {
@@ -1058,8 +1061,10 @@ hammer2_inode_connect(hammer2_trans_t *trans,
         * case where ocluster is left unchanged the code above sets
         * ncluster to ocluster and ocluster to NULL, resulting in a NOP here.
         */
-       if (ocluster)
+       if (ocluster) {
                hammer2_cluster_unlock(ocluster);
+               hammer2_cluster_drop(ocluster);
+       }
        *clusterp = ncluster;
 
        return (0);
@@ -1260,6 +1265,7 @@ again:
                        hcluster = cluster;
                        cluster = NULL; /* safety */
                        hammer2_cluster_unlock(cparent);
+                       hammer2_cluster_drop(cparent);
                        cparent = NULL; /* safety */
                        ripdata = NULL; /* safety (associated w/cparent) */
                        error = hammer2_hardlink_find(dip, &hparent, &hcluster);
@@ -1300,6 +1306,7 @@ again:
                                                  HAMMER2_LOOKUP_NODATA);
                if (dcluster) {
                        hammer2_cluster_unlock(dcluster);
+                       hammer2_cluster_drop(dcluster);
                        hammer2_cluster_lookup_done(dparent);
                        error = ENOTEMPTY;
                        goto done;
@@ -1318,7 +1325,9 @@ again:
                hammer2_cluster_delete(trans, cparent, cluster,
                                       HAMMER2_DELETE_PERMANENT);
                hammer2_cluster_unlock(cparent);
+               hammer2_cluster_drop(cparent);
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
                cparent = hparent;
                cluster = hcluster;
                hparent = NULL;
@@ -1397,14 +1406,22 @@ again:
        }
        error = 0;
 done:
-       if (cparent)
+       if (cparent) {
                hammer2_cluster_unlock(cparent);
-       if (cluster)
+               hammer2_cluster_drop(cparent);
+       }
+       if (cluster) {
                hammer2_cluster_unlock(cluster);
-       if (hparent)
+               hammer2_cluster_drop(cluster);
+       }
+       if (hparent) {
                hammer2_cluster_unlock(hparent);
-       if (hcluster)
+               hammer2_cluster_drop(hparent);
+       }
+       if (hcluster) {
                hammer2_cluster_unlock(hcluster);
+               hammer2_cluster_drop(hcluster);
+       }
        if (hlinkp)
                *hlinkp = hlink;
 
@@ -1585,6 +1602,7 @@ hammer2_hardlink_consolidate(hammer2_trans_t *trans,
 
        if (hammer2_hardlink_enable == 0) {     /* disallow hardlinks */
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
                *clusterp = NULL;
                return (ENOTSUP);
        }
@@ -1686,6 +1704,7 @@ hammer2_hardlink_consolidate(hammer2_trans_t *trans,
                /* XXX transaction ids */
                hammer2_cluster_modsync(ncluster);
                hammer2_cluster_unlock(ncluster);
+               hammer2_cluster_drop(ncluster);
        }
        ripdata = wipdata;
 
@@ -1710,8 +1729,10 @@ done:
         *
         * Return the shifted cluster in *clusterp.
         */
-       if (cparent)
+       if (cparent) {
                hammer2_cluster_unlock(cparent);
+               hammer2_cluster_drop(cparent);
+       }
        *clusterp = cluster;
 
        return (error);
@@ -1778,6 +1799,7 @@ hammer2_hardlink_find(hammer2_inode_t *dip,
        lhc = ipdata->inum;
        ipdata = NULL;                  /* safety */
        hammer2_cluster_unlock(cluster);
+       hammer2_cluster_drop(cluster);
        *clusterp = NULL;               /* safety */
 
        rcluster = NULL;
@@ -1924,6 +1946,7 @@ hammer2_inode_fsync(hammer2_trans_t *trans, hammer2_inode_t *ip,
                        switch (hammer2_cluster_type(cluster)) {
                        case HAMMER2_BREF_TYPE_INODE:
                                hammer2_cluster_unlock(cluster);
+                               hammer2_cluster_drop(cluster);
                                cluster = NULL;
                                break;
                        case HAMMER2_BREF_TYPE_DATA:
index 5478350..cac8ea6 100644 (file)
@@ -441,6 +441,7 @@ hammer2_ioctl_pfs_get(hammer2_inode_t *ip, void *data)
                        ripdata = &hammer2_cluster_rdata(cluster)->ipdata;
                        pfs->name_next = ripdata->name_key;
                        hammer2_cluster_unlock(cluster);
+                       hammer2_cluster_drop(cluster);
                } else {
                        pfs->name_next = (hammer2_key_t)-1;
                }
@@ -510,6 +511,7 @@ hammer2_ioctl_pfs_lookup(hammer2_inode_t *ip, void *data)
                ripdata = NULL;
 
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
        } else {
                error = ENOENT;
        }
@@ -568,7 +570,7 @@ hammer2_ioctl_pfs_create(hammer2_inode_t *ip, void *data)
                hammer2_cluster_bref(ncluster, &bref);
 #if 1
                kprintf("ADD LOCAL PFS (IOCTL): %s\n", nipdata->filename);
-               hammer2_pfsalloc(ncluster, nipdata, bref.mirror_tid);
+               hammer2_pfsalloc(ncluster, nipdata, bref.modify_tid);
                /* XXX rescan */
 #endif
                hammer2_inode_unlock(nip, ncluster);
index 95e2b4b..be37b9f 100644 (file)
@@ -42,7 +42,7 @@ static void hammer2_update_pfs_status(hammer2_syncthr_t *thr,
                        hammer2_cluster_t *cparent);
 static int hammer2_sync_insert(hammer2_syncthr_t *thr,
                        hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
-                       int i, int *errors);
+                       hammer2_tid_t modify_tid, int i, int *errors);
 static int hammer2_sync_destroy(hammer2_syncthr_t *thr,
                        hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
                        int i, int *errors);
@@ -173,15 +173,17 @@ hammer2_syncthr_primary(void *arg)
                /*
                 * Synchronization scan.
                 */
-               hammer2_trans_init(&thr->trans, pmp, 0);
+               hammer2_trans_init(&thr->trans, pmp, HAMMER2_TRANS_KEEPMODIFY);
                cparent = hammer2_inode_lock(pmp->iroot,
                                             HAMMER2_RESOLVE_ALWAYS);
                hammer2_update_pfs_status(thr, cparent);
+               hammer2_inode_unlock(pmp->iroot, NULL);
                bzero(errors, sizeof(errors));
                error = hammer2_sync_slaves(thr, cparent, errors);
                if (error)
                        kprintf("hammer2_sync_slaves: error %d\n", error);
-               hammer2_inode_unlock(pmp->iroot, cparent);
+               hammer2_cluster_unlock(cparent);
+               hammer2_cluster_drop(cparent);
                hammer2_trans_done(&thr->trans);
 
                /*
@@ -251,6 +253,7 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
        hammer2_pfs_t *pmp;
        hammer2_cluster_t *cluster;
        hammer2_cluster_t *scluster;
+       hammer2_chain_t *focus;
        hammer2_chain_t *chain;
        hammer2_key_t key_next;
        int error;
@@ -258,6 +261,7 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
        int i;
        int n;
        int noslaves;
+       int dorecursion;
 
        pmp = thr->pmp;
 
@@ -265,16 +269,10 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
         * Nothing to do if all slaves are synchronized.
         * Nothing to do if cluster not authoritatively readable.
         */
-       if (pmp->flags & HAMMER2_CLUSTER_SSYNCED) {
-               kprintf("pfs %p: all slaves are synchronized\n", pmp);
+       if (pmp->flags & HAMMER2_CLUSTER_SSYNCED)
                return(0);
-       }
-       if ((pmp->flags & HAMMER2_CLUSTER_RDHARD) == 0) {
-               kprintf("pfs %p: slave sync waiting, cluster not available\n",
-                       pmp);
+       if ((pmp->flags & HAMMER2_CLUSTER_RDHARD) == 0)
                return(HAMMER2_ERROR_INCOMPLETE);
-       }
-       kprintf("pfs %p: run synchronization\n", pmp);
 
        error = 0;
 
@@ -299,7 +297,6 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
        /*
         * Ignore degenerate DIRECTDATA case for file inode
         */
-       kprintf("X1 %p %p\n", cluster, cparent);
        if (cluster == cparent) {
                hammer2_cluster_drop(cluster);
                cluster = NULL;
@@ -325,20 +322,22 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
                         * needs to be copied in.
                         */
                        if (chain && chain->error) {
-                               kprintf("chain error index %d: %d\n", i, chain->error);
+                               kprintf("chain error index %d: %d\n",
+                                       i, chain->error);
                                errors[i] = chain->error;
                                error = chain->error;
                                cluster->array[i].flags |=
                                                HAMMER2_CITEM_INVALID;
                                continue;
                        }
-                       kprintf("chain index %d: %p\n", i, chain);
 
                        noslaves = 0;
 
                        /*
                         * Skip if the slave already has the record (everything
-                        * matches including the mirror_tid).
+                        * matches including the modify_tid).  Note that the
+                        * mirror_tid does not have to match because
+                        * mirror_tid is a per-block-device entity.
                         *
                         * XXX also skip if parent is an indirect block and
                         *     is up-to-date.
@@ -348,21 +347,45 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
                                continue;
                        }
 
+                       focus = cluster->focus;
+                       if (focus->bref.type == HAMMER2_BREF_TYPE_INODE)
+                               dorecursion = 1;
+                       else
+                               dorecursion = 0;
+
                        /*
                         * Otherwise adjust the slave.
                         */
                        if (chain)
-                               n = hammer2_chain_cmp(cluster->focus, chain);
+                               n = hammer2_chain_cmp(focus, chain);
                        else
                                n = -1; /* end-of-scan on slave */
 
                        if (n < 0) {
                                /*
-                                * slave chain missing, create
+                                * slave chain missing, create missing chain.
+                                *
+                                * If we are going to recurse we have to set
+                                * the initial modify_tid to 0 until the
+                                * sub-tree is completely synchronized.
+                                * Setting (n = 0) in this situation forces
+                                * the replacement call to run on the way
+                                * back up after the sub-tree has
+                                * synchronized.
                                 */
-                               nerror = hammer2_sync_insert(thr,
-                                                            cparent, cluster,
-                                                            i, errors);
+                               if (dorecursion) {
+                                       nerror = hammer2_sync_insert(
+                                                       thr, cparent, cluster,
+                                                       0,
+                                                       i, errors);
+                                       if (nerror == 0)
+                                               n = 0;
+                               } else {
+                                       nerror = hammer2_sync_insert(
+                                                       thr, cparent, cluster,
+                                                       focus->bref.modify_tid,
+                                                       i, errors);
+                               }
                        } else if (n > 0) {
                                /*
                                 * excess slave chain, destroy
@@ -387,36 +410,48 @@ hammer2_sync_slaves(hammer2_syncthr_t *thr, hammer2_cluster_t *cparent,
                                continue;
                        } else {
                                /*
-                                * key match but other things did not, replace.
+                                * Replacement is deferred until after any
+                                * recursion.
                                 */
-                               nerror = hammer2_sync_replace(thr,
-                                                             cparent, cluster,
-                                                             i, errors);
+                               nerror = 0;
                        }
-                       if (nerror)
-                               error = nerror;
 
                        /*
                         * Recurse on inode.  Avoid unnecessarily blocking
                         * operations by temporarily unlocking the parent.
                         */
-                       if (cluster->focus->bref.type ==
-                           HAMMER2_BREF_TYPE_INODE) {
+                       if (dorecursion) {
                                hammer2_cluster_unlock(cparent);
                                scluster = hammer2_cluster_copy(cluster);
                                hammer2_cluster_lock(scluster,
-                                                    HAMMER2_RESOLVE_ALWAYS |
-                                                    HAMMER2_RESOLVE_NOREF);
+                                                    HAMMER2_RESOLVE_ALWAYS);
                                nerror = hammer2_sync_slaves(thr, scluster,
                                                             errors);
-                               if (nerror)
-                                       error = nerror;
                                hammer2_cluster_unlock(scluster);
-                               /* XXX mirror_tid on scluster */
-                               /* flush needs to not update mirror_tid */
+                               hammer2_cluster_drop(scluster);
+                               /* XXX modify_tid on scluster */
+                               /* flush needs to not update modify_tid */
                                hammer2_cluster_lock(cparent,
                                                     HAMMER2_RESOLVE_ALWAYS);
                        }
+
+                       /*
+                        * Key match but other things did not, replace.  Do
+                        * this after the recursion rather than before.
+                        *
+                        * Do not update parents if an error occurred during
+                        * child processing.  In particular, updating the
+                        * modify_tid when something in the sub-tree is broken
+                        * would cause other parts of the cluster to believe
+                        * that we are up-to-date when we aren't.
+                        */
+                       if (nerror == 0 && n == 0) {
+                               nerror = hammer2_sync_replace(thr,
+                                                             cparent, cluster,
+                                                             i, errors);
+                       }
+                       if (nerror)
+                               error = nerror;
                }
                if (noslaves) {
                        kprintf("exhausted slaves\n");
@@ -443,16 +478,34 @@ static
 int
 hammer2_sync_insert(hammer2_syncthr_t *thr,
                    hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
-                   int i, int *errors)
+                   hammer2_tid_t modify_tid, int i, int *errors)
 {
        hammer2_chain_t *focus;
        hammer2_chain_t *chain;
+       hammer2_key_t dummy;
 
        focus = cluster->focus;
+#if HAMMER2_SYNCTHR_DEBUG
        kprintf("insert record slave %d %d.%016jx\n",
                i, focus->bref.type, focus->bref.key);
+#endif
 
-       focus = cluster->focus;
+       /*
+        * We have to do a lookup to position ourselves at the correct
+        * parent when inserting a record into a new slave because the
+        * cluster iteration for this slave might not be pointing to the
+        * right place.  Our expectation is that the record will not be
+        * found.
+        */
+       chain = hammer2_chain_lookup(&cparent->array[i].chain, &dummy,
+                                    focus->bref.key, focus->bref.key,
+                                    &cparent->array[i].cache_index,
+                                    0);
+       KKASSERT(chain == NULL);
+
+       /*
+        * Create the missing chain.
+        */
        chain = NULL;
        hammer2_chain_create(&thr->trans, &cparent->array[i].chain,
                             &chain, thr->pmp,
@@ -467,8 +520,8 @@ hammer2_sync_insert(hammer2_syncthr_t *thr,
        chain->bref.methods = focus->bref.methods;
        /* keybits already set */
        chain->bref.vradix = focus->bref.vradix;
-       chain->bref.mirror_tid = focus->bref.mirror_tid;
-       chain->bref.modify_tid = focus->bref.modify_tid;
+       /* mirror_tid set by flush */
+       chain->bref.modify_tid = modify_tid;
        chain->bref.flags = focus->bref.flags;
        /* key already present */
        /* check code will be recalculated */
@@ -494,9 +547,7 @@ hammer2_sync_insert(hammer2_syncthr_t *thr,
        }
 
        hammer2_chain_unlock(focus);
-
-       hammer2_chain_ref(chain);               /* replace lock with ref */
-       hammer2_chain_unlock(chain);
+       hammer2_chain_unlock(chain);            /* unlock, leave ref */
        cluster->array[i].chain = chain;        /* validate cluster */
        cluster->array[i].flags &= ~HAMMER2_CITEM_INVALID;
 
@@ -515,8 +566,10 @@ hammer2_sync_destroy(hammer2_syncthr_t *thr,
        hammer2_chain_t *chain;
 
        chain = cluster->array[i].chain;
+#if HAMMER2_SYNCTHR_DEBUG
        kprintf("destroy record slave %d %d.%016jx\n",
                i, chain->bref.type, chain->bref.key);
+#endif
 
        hammer2_chain_lock(chain, HAMMER2_RESOLVE_NEVER);
        hammer2_chain_delete(&thr->trans, cparent->array[i].chain, chain, 0);
@@ -540,8 +593,10 @@ hammer2_sync_replace(hammer2_syncthr_t *thr,
 
        focus = cluster->focus;
        chain = cluster->array[i].chain;
+#if HAMMER2_SYNCTHR_DEBUG
        kprintf("replace record slave %d %d.%016jx\n",
                i, focus->bref.type, focus->bref.key);
+#endif
        if (cluster->focus_index < i)
                hammer2_chain_lock(focus, HAMMER2_RESOLVE_ALWAYS);
        hammer2_chain_lock(chain, HAMMER2_RESOLVE_ALWAYS);
@@ -559,7 +614,7 @@ hammer2_sync_replace(hammer2_syncthr_t *thr,
        chain->bref.methods = focus->bref.methods;
        chain->bref.keybits = focus->bref.keybits;
        chain->bref.vradix = focus->bref.vradix;
-       chain->bref.mirror_tid = focus->bref.mirror_tid;
+       /* mirror_tid updated by flush */
        chain->bref.modify_tid = focus->bref.modify_tid;
        chain->bref.flags = focus->bref.flags;
        /* key already present */
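
A minimal sketch of the reordering made in the hammer2_sync_slaves() hunks above: recursion into the child runs first, and the replace is attempted only when that recursion reported no error and the keys matched (n == 0). The names demo_result and demo_sync_entry are hypothetical and exist only for illustration; locking, cluster handling, and the insert/destroy branches are left out.

```c
#include <stddef.h>

/* Hypothetical result record, not part of hammer2. */
struct demo_result {
	int error;	/* error propagated to the caller */
	int replaced;	/* 1 if the replace step would have run */
};

/*
 * n           - key comparison result (0 means the keys matched)
 * dorecursion - non-zero if the element must be recursed into
 * child_error - error the recursion would report (0 for success)
 */
static struct demo_result
demo_sync_entry(int n, int dorecursion, int child_error)
{
	struct demo_result r = { 0, 0 };

	/* Step 1: recurse first, remembering any error from the sub-tree. */
	if (dorecursion)
		r.error = child_error;

	/*
	 * Step 2: replace only after a clean recursion, so the parent's
	 * modify_tid is never advanced over a broken sub-tree.
	 */
	if (r.error == 0 && n == 0)
		r.replaced = 1;

	return r;
}
```

With this ordering a failed child (child_error != 0) leaves the parent unreplaced, which is the property the new comment in the diff describes.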
index 4e45220..b152a9b 100644
@@ -320,12 +320,17 @@ hammer2_vfs_uninit(struct vfsconf *vfsp __unused)
  * Core PFS allocator.  Used to allocate the pmp structure for PFS cluster
  * mounts and the spmp structure for media (hmp) structures.
  *
+ * pmp->modify_tid tracks new modify_tid transaction ids for front-end
+ * transactions.  Note that synchronization does not use this field
+ * (typically, front-end operations and synchronization cannot run on
+ * the same PFS node at the same time).
+ *
  * XXX check locking
  */
 hammer2_pfs_t *
 hammer2_pfsalloc(hammer2_cluster_t *cluster,
                 const hammer2_inode_data_t *ripdata,
-                hammer2_tid_t alloc_tid)
+                hammer2_tid_t modify_tid)
 {
        hammer2_chain_t *rchain;
        hammer2_pfs_t *pmp;
@@ -357,9 +362,10 @@ hammer2_pfsalloc(hammer2_cluster_t *cluster,
                TAILQ_INIT(&pmp->unlinkq);
                spin_init(&pmp->list_spin, "hm2pfsalloc_list");
 
-               /* our first media transaction id */
-               pmp->alloc_tid = alloc_tid + 1;
-               pmp->flush_tid = pmp->alloc_tid;
+               /*
+                * Save last media transaction id for flusher.
+                */
+               pmp->modify_tid = modify_tid;
                if (ripdata) {
                        pmp->inode_tid = ripdata->pfs_inum + 1;
                        pmp->pfs_clid = ripdata->pfs_clid;
@@ -367,6 +373,16 @@ hammer2_pfsalloc(hammer2_cluster_t *cluster,
                hammer2_mtx_init(&pmp->wthread_mtx, "h2wthr");
                bioq_init(&pmp->wthread_bioq);
                TAILQ_INSERT_TAIL(&hammer2_pfslist, pmp, mntentry);
+
+               /*
+                * The synchronization thread may start too early, make
+                * sure it stays frozen until we are ready to let it go.
+                * XXX
+                */
+               /*
+               pmp->primary_thr.flags = HAMMER2_SYNCTHR_FROZEN |
+                                        HAMMER2_SYNCTHR_REMASTER;
+               */
        }
 
        /*
@@ -770,6 +786,7 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                        return error;
                }
                hmp = kmalloc(sizeof(*hmp), M_HAMMER2, M_WAITOK | M_ZERO);
+               ksnprintf(hmp->devrepname, sizeof(hmp->devrepname), "%s", dev);
                hmp->ronly = ronly;
                hmp->devvp = devvp;
                kmalloc_create(&hmp->mchain, "HAMMER2-chains");
@@ -836,13 +853,16 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                 * Really important to get these right or flush will get
                 * confused.
                 */
-               hmp->spmp = hammer2_pfsalloc(NULL, NULL,
-                                            hmp->voldata.mirror_tid);
+               hmp->spmp = hammer2_pfsalloc(NULL, NULL, 0);
                kprintf("alloc spmp %p tid %016jx\n",
                        hmp->spmp, hmp->voldata.mirror_tid);
                spmp = hmp->spmp;
                spmp->inode_tid = 1;
 
+               /*
+                * Dummy-up vchain and fchain's modify_tid.  mirror_tid
+                * is inherited from the volume header.
+                */
                xid = 0;
                hmp->vchain.bref.mirror_tid = hmp->voldata.mirror_tid;
                hmp->vchain.bref.modify_tid = hmp->vchain.bref.mirror_tid;
@@ -874,12 +894,14 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                        kprintf("hammer2_mount: error %s reading super-root\n",
                                hammer2_error_str(schain->error));
                        hammer2_chain_unlock(schain);
+                       hammer2_chain_drop(schain);
                        schain = NULL;
                        hammer2_unmount_helper(mp, NULL, hmp);
                        lockmgr(&hammer2_mntlk, LK_RELEASE);
                        hammer2_vfs_unmount(mp, MNT_FORCE);
                        return EINVAL;
                }
+               spmp->modify_tid = schain->bref.modify_tid;
 
                /*
                 * Sanity-check schain's pmp and finish initialization.
@@ -976,8 +998,9 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
         */
        ripdata = &hammer2_cluster_rdata(cluster)->ipdata;
        hammer2_cluster_bref(cluster, &bref);
-       pmp = hammer2_pfsalloc(NULL, ripdata, bref.mirror_tid);
+       pmp = hammer2_pfsalloc(NULL, ripdata, bref.modify_tid);
        hammer2_cluster_unlock(cluster);
+       hammer2_cluster_drop(cluster);
 
        if (pmp->mp) {
                kprintf("hammer2_mount: PFS already mounted!\n");
@@ -1093,7 +1116,7 @@ hammer2_update_pmps(hammer2_dev_t *hmp)
                hammer2_cluster_bref(cluster, &bref);
                kprintf("ADD LOCAL PFS: %s\n", ripdata->filename);
 
-               pmp = hammer2_pfsalloc(cluster, ripdata, bref.mirror_tid);
+               pmp = hammer2_pfsalloc(cluster, ripdata, bref.modify_tid);
                cluster = hammer2_cluster_next(cparent, cluster,
                                               &key_next,
                                               key_next,
@@ -1350,8 +1373,10 @@ hammer2_write_file_core(struct buf *bp, hammer2_trans_t *trans,
                hammer2_write_bp(cluster, bp, ioflag, pblksize, errorp,
                                 ripdata->check_algo);
                /* ripdata can become invalid */
-               if (cluster)
+               if (cluster) {
                        hammer2_cluster_unlock(cluster);
+                       hammer2_cluster_drop(cluster);
+               }
                break;
        case HAMMER2_COMP_AUTOZERO:
                /*
@@ -1611,8 +1636,10 @@ hammer2_compress_and_write(struct buf *bp, hammer2_trans_t *trans,
                }
        }
 done:
-       if (cluster)
+       if (cluster) {
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
+       }
        if (comp_buffer)
                objcache_put(cache_buffer_write, comp_buffer);
 }
@@ -1640,8 +1667,10 @@ hammer2_zero_check_and_write(struct buf *bp, hammer2_trans_t *trans,
                hammer2_write_bp(cluster, bp, ioflag, pblksize, errorp,
                                 check_algo);
                /* ripdata can become invalid */
-               if (cluster)
+               if (cluster) {
                        hammer2_cluster_unlock(cluster);
+                       hammer2_cluster_drop(cluster);
+               }
        }
 }
 
@@ -1693,6 +1722,7 @@ zero_write(struct buf *bp, hammer2_trans_t *trans,
                                               HAMMER2_DELETE_PERMANENT);
                }
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
        }
        hammer2_cluster_lookup_done(cparent);
 }
@@ -2189,12 +2219,23 @@ hammer2_recovery(hammer2_dev_t *hmp)
        struct hammer2_recovery_elm *elm;
        hammer2_chain_t *parent;
        hammer2_tid_t sync_tid;
+       hammer2_tid_t mirror_tid;
        int error;
        int cumulative_error = 0;
 
        hammer2_trans_init(&trans, hmp->spmp, 0);
 
-       sync_tid = 0;
+       sync_tid = hmp->voldata.freemap_tid;
+       mirror_tid = hmp->voldata.mirror_tid;
+
+       kprintf("hammer2 mount \"%s\": ", hmp->devrepname);
+       if (sync_tid >= mirror_tid) {
+               kprintf(" no recovery needed\n");
+       } else {
+               kprintf(" freemap recovery %016jx-%016jx\n",
+                       sync_tid + 1, mirror_tid);
+       }
+
        TAILQ_INIT(&info.list);
        info.depth = 0;
        parent = hammer2_chain_lookup_init(&hmp->vchain, 0);
@@ -2208,11 +2249,12 @@ hammer2_recovery(hammer2_dev_t *hmp)
                sync_tid = elm->sync_tid;
                kfree(elm, M_HAMMER2);
 
-               hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS |
-                                          HAMMER2_RESOLVE_NOREF);
+               hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS);
                error = hammer2_recovery_scan(&trans, hmp, parent,
-                                             &info, sync_tid);
+                                             &info,
+                                             hmp->voldata.freemap_tid);
                hammer2_chain_unlock(parent);
+               hammer2_chain_drop(parent);     /* drop elm->chain ref */
                if (error)
                        cumulative_error = error;
        }
@@ -2232,7 +2274,6 @@ hammer2_recovery_scan(hammer2_trans_t *trans, hammer2_dev_t *hmp,
        hammer2_chain_t *chain;
        int cache_index;
        int cumulative_error = 0;
-       int pfs_boundary = 0;
        int error;
 
        /*
@@ -2262,18 +2303,6 @@ hammer2_recovery_scan(hammer2_trans_t *trans, hammer2_dev_t *hmp,
                        hammer2_chain_unlock(parent);
                        return 0;
                }
-               if ((ripdata->op_flags & HAMMER2_OPFLAG_PFSROOT) &&
-                   info->depth != 0) {
-                       pfs_boundary = 1;
-                       sync_tid = parent->bref.mirror_tid - 1;
-                       kprintf("recovery scan PFS synctid %016jx \"%s\"\n",
-                               sync_tid, ripdata->filename);
-               }
-#if 0
-               if ((ripdata->op_flags & HAMMER2_OPFLAG_PFSROOT) == 0) {
-                       kprintf("%*.*s\"%s\"\n", info->depth, info->depth, "", ripdata->filename);
-               }
-#endif
                hammer2_chain_unlock(parent);
                break;
        case HAMMER2_BREF_TYPE_INDIRECT:
@@ -2298,7 +2327,7 @@ hammer2_recovery_scan(hammer2_trans_t *trans, hammer2_dev_t *hmp,
         * Defer operation if depth limit reached or if we are crossing a
         * PFS boundary.
         */
-       if (info->depth >= HAMMER2_RECOVERY_MAXDEPTH || pfs_boundary) {
+       if (info->depth >= HAMMER2_RECOVERY_MAXDEPTH) {
                struct hammer2_recovery_elm *elm;
 
                elm = kmalloc(sizeof(*elm), M_HAMMER2, M_ZERO | M_WAITOK);
@@ -2322,7 +2351,7 @@ hammer2_recovery_scan(hammer2_trans_t *trans, hammer2_dev_t *hmp,
                                   HAMMER2_LOOKUP_NODATA);
        while (chain) {
                atomic_set_int(&chain->flags, HAMMER2_CHAIN_RELEASE);
-               if (chain->bref.mirror_tid >= sync_tid) {
+               if (chain->bref.mirror_tid > sync_tid) {
                        ++info->depth;
                        error = hammer2_recovery_scan(trans, hmp, chain,
                                                      info, sync_tid);
@@ -2429,8 +2458,9 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
 
        total_error = 0;
 
+#if 0
        /*
-        * Flush all storage elements making up the cluster
+        * Flush all nodes making up the cluster
         *
         * We must also flush any deleted siblings because the super-root
         * flush won't do it for us.  They all must be staged or the
@@ -2442,11 +2472,15 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
        for (i = 0; iroot && i < iroot->cluster.nchains; ++i) {
                chain = iroot->cluster.array[i].chain;
                if (chain) {
+                       hmp = chain->hmp;
+                       hammer2_chain_ref(chain);    /* prevent destruction */
                        hammer2_chain_lock(chain, HAMMER2_RESOLVE_ALWAYS);
                        hammer2_flush(&info.trans, chain);
                        hammer2_chain_unlock(chain);
+                       hammer2_chain_drop(chain);
                }
        }
+#endif
 #if 0
        hammer2_trans_done(&info.trans);
 #endif
@@ -2478,7 +2512,9 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
                }
                if (j >= 0)
                        continue;
+#if 0
                hammer2_trans_spmp(&info.trans, hmp->spmp);
+#endif
 
                /*
                 * Force an update of the XID from the PFS root to the
@@ -2501,7 +2537,9 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
                 * ahead of the topology.  We depend on the bulk free scan
                 * code to deal with any loose ends.
                 */
+               hammer2_chain_ref(&hmp->vchain);
                hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
+               hammer2_chain_ref(&hmp->fchain);
                hammer2_chain_lock(&hmp->fchain, HAMMER2_RESOLVE_ALWAYS);
                if (hmp->fchain.flags & HAMMER2_CHAIN_FLUSH_MASK) {
                        /*
@@ -2515,6 +2553,8 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
                }
                hammer2_chain_unlock(&hmp->fchain);
                hammer2_chain_unlock(&hmp->vchain);
+               hammer2_chain_drop(&hmp->fchain);
+               /* vchain dropped down below */
 
                hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
                if (hmp->vchain.flags & HAMMER2_CHAIN_FLUSH_MASK) {
@@ -2526,6 +2566,7 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
                        force_fchain = 0;
                }
                hammer2_chain_unlock(&hmp->vchain);
+               hammer2_chain_drop(&hmp->vchain);
 
 #if 0
                hammer2_chain_lock(&hmp->fchain, HAMMER2_RESOLVE_ALWAYS);
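
A minimal sketch of the recovery-range logic introduced in the hammer2_recovery() and hammer2_recovery_scan() hunks above, assuming plain 64-bit transaction ids. The type demo_tid_t and the functions demo_recovery_range and demo_needs_scan are hypothetical stand-ins, not hammer2 APIs; only the comparisons mirror the diff.

```c
#include <stdint.h>

/* Hypothetical stand-in for hammer2_tid_t. */
typedef uint64_t demo_tid_t;

/*
 * Decide whether freemap recovery is needed on mount.  Returns 0 when the
 * freemap is already caught up to the topology, otherwise returns 1 and
 * stores the inclusive range freemap_tid + 1 .. mirror_tid.
 */
static int
demo_recovery_range(demo_tid_t freemap_tid, demo_tid_t mirror_tid,
		    demo_tid_t *beg, demo_tid_t *end)
{
	if (freemap_tid >= mirror_tid)
		return 0;		/* no recovery needed */
	*beg = freemap_tid + 1;
	*end = mirror_tid;
	return 1;
}

/*
 * Scan predicate: a chain is descended into only if it was flushed
 * strictly after the last freemap sync (the diff changes >= to >).
 */
static int
demo_needs_scan(demo_tid_t chain_mirror_tid, demo_tid_t sync_tid)
{
	return chain_mirror_tid > sync_tid;
}
```

The strict comparison matters because a chain whose mirror_tid equals the freemap's tid was already covered by the last freemap flush, so rescanning it would be redundant work.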
index 33c0e66..9e20d99 100644
@@ -797,8 +797,10 @@ hammer2_vop_readdir(struct vop_readdir_args *ap)
                if (cookie_index == ncookies)
                        break;
        }
-       if (cluster)
+       if (cluster) {
                hammer2_cluster_unlock(cluster);
+               hammer2_cluster_drop(cluster);
+       }
 done:
        hammer2_inode_unlock(ip, cparent);
        if (ap->a_eofflag)
@@ -1317,6 +1319,7 @@ hammer2_vop_nresolve(struct vop_nresolve_args *ap)
                        hammer2_inode_ref(ip);
                        hammer2_inode_unlock(ip, NULL);
                        hammer2_cluster_unlock(cluster);
+                       hammer2_cluster_drop(cluster);
                        cluster = hammer2_inode_lock(ip,
                                                     HAMMER2_RESOLVE_ALWAYS);
                        ripdata = &hammer2_cluster_rdata(cluster)->ipdata;
@@ -2244,6 +2247,7 @@ hammer2_strategy_read_callback(hammer2_iocb_t *iocb)
                                hammer2_io_complete(iocb);
                                biodone(bio);
                                hammer2_cluster_unlock(cluster);
+                               hammer2_cluster_drop(cluster);
                        } else {
                                hammer2_io_complete(iocb); /* XXX */
                                chain = cluster->array[i].chain;
@@ -2327,6 +2331,7 @@ hammer2_strategy_read_callback(hammer2_iocb_t *iocb)
        if (dio)                                /* physical dio & buffer */
                hammer2_io_bqrelse(&dio);
        hammer2_cluster_unlock(cluster);        /* cluster management */
+       hammer2_cluster_drop(cluster);          /* cluster management */
        biodone(bio);                           /* logical buffer */
 }
 
@@ -2444,6 +2449,7 @@ hammer2_run_unlinkq(hammer2_trans_t *trans, hammer2_pfs_t *pmp)
                hammer2_cluster_delete(trans, cparent, cluster,
                                       HAMMER2_DELETE_PERMANENT);
                hammer2_cluster_unlock(cparent);
+               hammer2_cluster_drop(cparent);
                hammer2_inode_unlock(ip, cluster);      /* inode lock */
                hammer2_inode_drop(ip);                 /* ipul ref */