sys/vfs/hammer2/TODO

   1
   2 * bulkfree pass needs to do a vchain flush from the root to avoid
   3   accidentally freeing live in-process chains.
   4
   5 * Need backend synchronization / serialization when the frontend detaches
   6   an XOP.  modify_tid tests won't be enough; the backend may wind up executing
   7   the XOP out of order after the detach.
   8
   9 * xop_start - only start synchronized elements
  10
  11 * See if we can remove hammer2_inode_repoint()
  12
  13 * FIXME - logical buffer associated with write-in-progress on backend
  14   disappears once the cluster validates, even if more backend nodes
  15   are in progress.
  16
  17 * FIXME - backend ops need per-node transactions using spmp to protect
  18   against flush.
  19
  20 * FIXME - modifying backend ops are not currently validating the cluster.
  21   That probably needs to be done by the frontend in hammer2_xop_start().
  22
  23 * modify_tid handling probably broken w/ the XOP code for the moment.
  24
  25 * embedded transactions in XOPs - interlock early completion
  26
  27 * remove current incarnation of EAGAIN
  28
  29 * mtx locks should not track td_locks count?  They can be acquired by one
  30   thread and released by another.  Need API function for exclusive locks.
  31
  32 * Convert xops and hammer2_update_spans() from cluster back into chain calls
  33
  34 * syncthr leaves inode locks for entire sync, which is wrong.
  35
  36 * recovery scan vs unmount.  At the moment an unmount does its flushes,
  37   and if successful the freemap will be fully up-to-date, but the mount
  38   code doesn't know that and the last flush batch will probably match
  39   the PFS root mirror_tid.  If it was a large cpdup the (unnecessary)
  40   recovery pass at mount time can be extensive.  Add a CLEAN flag to the
  41   volume header to optimize out the unnecessary recovery pass.
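  The CLEAN flag idea above could look roughly like the following.  This
  is only a sketch: VOLHDR_FLAG_CLEAN, struct volhdr, and both helper
  functions are hypothetical names invented for illustration and are not
  the real hammer2 volume header layout or mount code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; NOT part of the real hammer2 volume header. */
#define VOLHDR_FLAG_CLEAN	0x00000001u

/* Simplified stand-in for the on-media volume header. */
struct volhdr {
	uint32_t	flags;
	uint64_t	mirror_tid;
};

/*
 * Unmount path: set CLEAN only when every flush succeeded, so the
 * freemap is known to be fully up to date on media.
 */
static void
volhdr_unmount(struct volhdr *vh, bool flush_ok)
{
	assert(vh != NULL);
	if (flush_ok)
		vh->flags |= VOLHDR_FLAG_CLEAN;
	else
		vh->flags &= ~VOLHDR_FLAG_CLEAN;
}

/*
 * Mount path: report whether the recovery pass must run, then clear
 * CLEAN so that a crash while mounted forces recovery next time.
 * A real implementation would have to write the cleared flag to
 * media before allowing any modification.
 */
static bool
volhdr_mount_needs_recovery(struct volhdr *vh)
{
	bool clean;

	assert(vh != NULL);
	clean = (vh->flags & VOLHDR_FLAG_CLEAN) != 0;
	vh->flags &= ~VOLHDR_FLAG_CLEAN;
	return !clean;
}
```

  The flag is cleared at mount rather than at first modification to keep
  the sketch simple; either choice works as long as the cleared state
  reaches the media before the freemap can diverge.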
  42
  43 * More complex transaction sequencing and flush merging.  Right now it is
  44   all serialized against flushes.
  45
  46 * adding new pfs - freeze and force remaster
  47
  48 * removing a pfs - freeze and force remaster
  49
  50 * bulkfree - sync between passes and enforce serialization of operation
  51
  52 * bulkfree - signal check, allow interrupt
  53
  54 * bulkfree - sub-passes when kernel memory block isn't large enough
  55
  56 * bulkfree - limit kernel memory allocation for bmap space
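  One way to combine the two memory items above is to cap the bmap
  allocation and split the scan into sub-passes that each cover only as
  much media as the capped bmap can track.  The sketch below assumes a
  fixed, hypothetical ratio (cover_per_byte) of media bytes tracked per
  byte of bmap memory; struct subpass and bulkfree_plan() are invented
  names, not existing hammer2 functions.

```c
#include <assert.h>
#include <stdint.h>

/* One sub-pass covers the media byte range [beg, end). */
struct subpass {
	uint64_t	beg;
	uint64_t	end;
};

/*
 * Split a volume of vol_size bytes into sub-passes so that no pass
 * needs more than mem_limit bytes of bmap memory.  cover_per_byte is
 * a hypothetical ratio: media bytes tracked per byte of bmap memory.
 * Returns the number of sub-passes written to out[] (at most max_out;
 * the plan is truncated if max_out is too small).  Overflow of
 * mem_limit * cover_per_byte is not handled in this sketch.
 */
static unsigned
bulkfree_plan(uint64_t vol_size, uint64_t mem_limit,
	      uint64_t cover_per_byte, struct subpass *out, unsigned max_out)
{
	uint64_t span;
	uint64_t beg;
	unsigned n = 0;

	assert(mem_limit > 0 && cover_per_byte > 0);
	span = mem_limit * cover_per_byte;

	for (beg = 0; beg < vol_size && n < max_out; beg += span) {
		out[n].beg = beg;
		out[n].end = (vol_size - beg > span) ? beg + span : vol_size;
		++n;
	}
	return n;
}
```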
  57
  58 * bulkfree - must include any detached vnodes in scan so open unlinked files
  59              are not ripped out from under the system.
  60
  61 * bulkfree - must include all volume headers in scan so they can be used
  62              for recovery or automatic snapshot retrieval.
  63
  64 * bulkfree - snapshot duplicate sub-tree cache and tests needed to reduce
  65              unnecessary re-scans.
  66
  67 * Currently the check code (bref.methods / crc, sha, etc) is being checked
  68   every single blasted time a chain is locked, even if the underlying buffer
  69   was previously checked for that chain.  This needs an optimization to
  70   (significantly) improve performance.
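  A possible shape for the optimization above is to remember, per chain,
  that the underlying buffer already passed its check and to skip the
  check on later locks until the buffer is replaced.  In this sketch
  CHAIN_CHECKED, struct chain, and the helpers are hypothetical, and
  toy_check() is a simple FNV-1a placeholder, not one of the real
  hammer2 check methods.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHAIN_CHECKED	0x0001u	/* hypothetical "already verified" flag */

/* Simplified stand-in for a chain and its data buffer. */
struct chain {
	uint32_t	 flags;
	const uint8_t	*data;
	size_t		 bytes;
	uint32_t	 check;		/* expected value (from the bref) */
	unsigned	 verify_count;	/* how many full checks were run */
};

/* Placeholder check function (FNV-1a), standing in for crc/sha. */
static uint32_t
toy_check(const uint8_t *p, size_t n)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < n; ++i) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Called on lock: run the full check only if not already verified. */
static bool
chain_lock_verify(struct chain *c)
{
	assert(c != NULL);
	if (c->flags & CHAIN_CHECKED)
		return true;
	c->verify_count++;
	if (toy_check(c->data, c->bytes) != c->check)
		return false;
	c->flags |= CHAIN_CHECKED;
	return true;
}

/* Must be called whenever the underlying buffer is re-read or replaced. */
static void
chain_buffer_replaced(struct chain *c)
{
	assert(c != NULL);
	c->flags &= ~CHAIN_CHECKED;
}
```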
  71
  72 * flush synchronization boundary crossing check and current flush chain
  73   interlock needed.
  74
  75 * snapshot creation must allocate and separately pass a new pmp for the pfs
  76   degenerate 'cluster' representing the snapshot.  This theoretically will
  77   also allow a snapshot to be generated inside a cluster of more than one
  78   node.
  79
  80 * snapshot copy currently also copies uuids and can confuse cluster code
  81
  82 * hidden dir or other dirs/files/modifications made to PFS before
  83   additional cluster entries added.
  84
  85 * transaction on cluster - multiple trans structures, subtrans
  86
  87 * inode always contains target cluster/chain, not hardlink
  88
  89 * chain refs in cluster, cluster refs
  90
  91 * check inode shared lock ... can end up in endless loop if following
  92   hardlink because ip->chain is not updated in the exclusive lock cycle
  93   when following hardlink.
  94
  95 cpdup /build/boomdata/jails/bleeding-edge/usr/share/man/man4 /mnt/x3
  96
  97
  98         * The block freeing code.  At the very least a bulk scan is needed
  99           to implement freeing blocks.
 100
 101         * Crash stability.  Right now the allocation table on-media is not
 102           properly synchronized with the flush.  This needs to be adjusted
 103           such that H2 can do an incremental scan on mount to fix up
 104           allocations as part of its crash recovery mechanism.
 105
 106         * We actually have to start checking and acting upon the CRCs being
 107           generated.
 108
 109         * Remaining known hardlink issues need to be addressed.
 110
 111         * Core 'copies' mechanism needs to be implemented to support multiple
 112           copies on the same media.
 113
 114         * Core clustering mechanism needs to be implemented to support
 115           mirroring and basic multi-master operation from a single host
 116           (multi-host requires additional network protocols and won't
 117           be as easy).
 118
 119 * make sure we aren't using a shared lock during RB_SCAN's?
 120
 121 * overwrite in write_file case w/compression - if device block size changes
 122   the block has to be deleted and reallocated.  See hammer2_assign_physical()
 123   in vnops.
 124
 125 * freemap / clustering.  Set block size on 2MB boundary so the cluster code
 126   can be used for reading.
 127
 128 * need API layer for shared buffers (unfortunately).
 129
 130 * add magic number to inode header, add parent inode number too, to
 131   help with brute-force recovery.
 132
 133 * modifications past our flush point do not adjust vchain.
 134   need to make vchain dynamic so we can (see flush_scan2).??
 135
 136 * MINIOSIZE/RADIX set to 1KB for now to avoid buffer cache deadlocks
 137   on multiple locked inodes.  Fix so we can use LBUFSIZE!  Or,
 138   alternatively, allow a smaller I/O size based on the sector size
 139   (not optimal though).
 140
 141 * When making a snapshot, do not allow the snapshot to be mounted until
 142   the in-memory chain has been freed in order to break the shared core.
 143
 144 * Snapshotting a sub-directory does not snapshot any
 145   parent-directory-spanning hardlinks.
 146
 147 * Snapshot / flush-synchronization point.  remodified data that crosses
 148   the synchronization boundary is not currently reallocated.  see
 149   hammer2_chain_modify(), explicit check (requires logical buffer cache
 150   buffer handling).
 151
 152 * on fresh mount with multiple hardlinks present separate lookups will
 153   result in separate vnodes pointing to separate inodes pointing to a
 154   common chain (the hardlink target).
 155
 156   When the hardlink target consolidates upward only one vp/ip will be
 157   adjusted.  We need code to fixup the other chains (probably put in
 158   inode_lock_*()) which will be pointing to an older deleted hardlink
 159   target.
 160
 161 * Filesystem must ensure that modify_tid is not too large relative to
 162   the iterator in the volume header, on load, or flush sequencing will
 163   not work properly.  We should be able to just override it, but we
 164   should complain if it happens.
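  The override-and-complain behavior above might be sketched as follows.
  The function name and the exact rule ("too large" taken to mean
  strictly greater than the volume header iterator) are assumptions made
  for illustration; the real threshold may differ.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Validate a modify_tid read from media against the volume header's
 * transaction id iterator.  ASSUMPTION: a modify_tid greater than the
 * iterator counts as "too large".  On violation, complain and return
 * the iterator value instead.  *overridden reports whether that
 * happened.
 */
static uint64_t
load_modify_tid(uint64_t modify_tid, uint64_t vol_tid_iterator,
		bool *overridden)
{
	assert(overridden != NULL);
	*overridden = (modify_tid > vol_tid_iterator);
	if (*overridden) {
		fprintf(stderr,
			"modify_tid %" PRIu64 " exceeds volume iterator %"
			PRIu64 ", overriding\n",
			modify_tid, vol_tid_iterator);
		return vol_tid_iterator;
	}
	return modify_tid;
}
```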
 165
 166 * Kernel-side needs to clean up transaction queues and make appropriate
 167   callbacks.
 168
 169 * Userland side needs to do the same for any initiated transactions.
 170
 171 * Nesting problems in the flusher.
 172
 173 * Inefficient vfsync due to thousands of file buffers, one per-vnode.
 174   (need to aggregate using a device buffer?)
 175
 176 * Use bp->b_dep to interlock the buffer with the chain structure so the
 177   strategy code can calculate the crc and assert that the chain is marked
 178   modified (not yet flushed).
 179
 180 * Deleted inode not reachable via tree for volume flush but still reachable
 181   via fsync/inactive/reclaim.  Its tree can be destroyed at that point.
 182
 183 * The direct write code needs to invalidate any underlying physical buffers.
 184   Direct write needs to be implemented.
 185
 186 * Make sure a resized block (hammer2_chain_resize()) calculates a new
 187   hash code in the parent bref.
 188
 189 * The freemap allocator needs to getblk/clrbuf/bdwrite any partial
 190   block allocations (less than 64KB) that allocate out of a new 64K
 191   block, to avoid causing a read-before-write I/O.
 192
 193 * Check flush race upward recursion setting SUBMODIFIED vs downward
 194   recursion checking SUBMODIFIED then locking (must clear before the
 195   recursion and might need additional synchronization)
 196
 197 * There is definitely a flush race in the hardlink implementation between
 198   the forwarding entries and the actual (hidden) hardlink inode.
 199
 200   This will require us to associate a small hard-link-adjust structure
 201   with the chain whenever we create or delete hardlinks, on top of
 202   adjusting the hardlink inode itself.  Any actual flush to the media
 203   has to synchronize the correct nlinks value based on whether related
 204   created or deleted hardlinks were also flushed.
 205
 206 * When a directory entry is created and also if an indirect block is
 207   created and entries moved into it, the directory seek position can
 208   potentially become incorrect during a scan.
 209
 210 * When a directory entry is deleted a directory seek position depending
 211   on that key can cause readdir to skip entries.
 212
 213 * TWO PHASE COMMIT - store two data offsets in the chain, and
 214   hammer2_chain_delete() needs to leave the chain intact if MODIFIED2 is
 215   set on its buffer until the flusher gets to it?
 216
 217
 218                                 OPTIMIZATIONS
 219
 220 * If a file is unlinked but its descriptor is left open and used, we
 221   should allow data blocks on-media to be reused since there is no
 222   topology left to point at them.