sys/dev/disk/dm/dmirror_notes.txt

   1     Now that Alex has the basic lvm stuff in we need to add soft-raid-1
   2     to it.  I have some ideas on how it could be implemented.
   3
   4     This is not set in stone at all, this is just me rattling off my
   5     RAID-1 implementation ideas.  It isn't quite as complex as it sounds,
   6     really!  I swear it isn't!  But if we could implement something like
   7     this we would have the best soft-raid-1 implementation around.
   8
   9     Here are the basic problems which need to be solved:
  10
  11         * Allow partial downtimes for pieces of the mirror such that
  12           when the mirror becomes whole again the entire drive does not
  13           have to be copied.  Instead only the segments of the drive that
  14           are out of sync would be resynchronized.
  15
  16           We want to avoid having to completely resynchronize the entire
  17           contents of a potentially multi-terabyte drive if one is
  18           taken offline temporarily and then brought back online.
  19
  20         * Allow mixed I/O errors on both drives making up the mirror
  21           without taking the entire mirror offline.
  22
  23         * Allow I/O read or write errors on one drive to degrade only
  24           the related segment and not the whole drive.
  25
  26         * Allow most writes to be asynchronous to the two drives making
  27           up the mirror up to the synchronization point.  Avoid unnecessary
  28           writes to the segment array on-media even through a synchronization
  29           point.
  30
  31         * Detect out-of-sync mirrors that are out of sync due to a system
  32           crash occurring prior to a synchronization point (i.e. when the
  33           drives themselves are just fine).  When this case occurs either
  34           copy is valid and one must be selected, but then the selected
  35           copy must be resynchronized to the other drive in the mirror
  36           to prevent the read data from 'changing' randomly from the point
  37           of view of whoever is reading it.
  38
  39     And my idea on implementation:
  40
  41         * Implement a segment descriptor array for each drive in the
  42           mirror, breaking the drive down into large pieces.  For
  43           example, 128MB per segment.  The segment array would be stored
  44           on both disks making up the mirror.  In addition, each disk will
  45           store the segment state for BOTH disks.
  46
  47           Thus a 1TBx2 mirror would have 8192x4 segment descriptors (4
  48           descriptors for each logical segment).  The segment descriptor
  49           array would ideally be small enough to cache in-memory.  Being
  50           able to cache it in-memory simplifies lookups.
  51
  52           A segment descriptor would be, oh I don't know... probably
  53           16 bytes.  Leave room for expansion :-)
  54
  55           Why does each disk need to store a segment descriptor for both
  56           disks?  So we can 'remember' the state of the dead disk on the
  57           live disk in order to resolve mismatches later on when the
  58           dead disk comes back to life.
  59
  60         * The state of the segment descriptor must be consulted when reading
  61           or writing.  Some states are in-memory-only states while others
  62           can exist on-media or in-memory.  The states are represented by
  63           a set of bit flags:
  64
  65           MEDIA_UNSTABLE        0: The content is stable on-media and
  66                                    fully synchronized.
  67
  68                                 1: The content is unstable on-media
  69                                    (writes have been made and have not
  70                                     been completely synchronized to both
  71                                     drives).
  72
  73           MEDIA_READ_DEGRADED   0: No I/O read error occurred on this segment
  74                                 1: I/O read error(s) occurred on this segment
  75
  76           MEDIA_WRITE_DEGRADED  0: No I/O write error occurred on this segment
  77                                 1: I/O write error(s) occurred on this segment
  78
  79           MEDIA_MASTER          0: Normal operation
  80
  81                                 1: Mastership operation for this segment
  82                                    on this drive, which is set when the
  83                                    other drive in the mirror has failed
  84                                    and writes are made to the drive that
  85                                    is still operational.
  86
  87           UNINITIALIZED         0: The segment contains normal data.
  88
  89                                 1: The entire segment is empty and should
  90                                    read all zeros regardless of the actual
  91                                    content on the media.
  92
  93                                    (Use for newly initialized mirrors as
  94                                    a way to avoid formatting the whole
  95                                    drive or SSD?).
  96
  97           OLD_UNSTABLE          Copy of original MEDIA_UNSTABLE bit initially
  98                                 read from the media.  This bit is only
  99                                 recopied after the related segment has been
 100                                 fully synchronized.
 101
 102           OLD_MASTER            Copy of original MEDIA_MASTER bit initially
 103                                 read from the media.  This bit is only
 104                                 recopied after the related segment has been
 105                                 fully synchronized.
 106
 107           We probably need room for a serial number or timestamp in the
 108           segment descriptor as well in order to resolve certain situations.
 109
 110         * Since updating a segment descriptor on-media is expensive
 111           (requiring at least one disk synchronization command and of
 112           course a nasty seek), segment descriptors on-media are updated
 113           synchronously only when going from a STABLE to an UNSTABLE state,
 114           meaning the segment is undergoing active writing.
 115
 116           Changing a segment descriptor from unstable to stable can be
 117           delayed indefinitely (synchronized on a long timer, like
 118           30 or 60 seconds).  All that happens if a crash occurs in the
 119           mean time is a little extra copying of segments occurs on
 120           reboot.  Theoretically anyway.
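
A rough Python model of the descriptor ideas above, just to check the
arithmetic and the lazy-update rule.  Everything named here is made up for
illustration: the bit values, descriptor_array_size(), load_from_media() and
the Descriptor class are not from any real code, and treating the OLD_* bits
as in-memory-only is an assumption.

```python
# Hypothetical bit assignments; the notes do not fix any values.
MEDIA_UNSTABLE       = 0x01
MEDIA_READ_DEGRADED  = 0x02
MEDIA_WRITE_DEGRADED = 0x04
MEDIA_MASTER         = 0x08
UNINITIALIZED        = 0x10
OLD_UNSTABLE         = 0x20   # assumed in-memory only
OLD_MASTER           = 0x40   # assumed in-memory only

SEGMENT_SIZE = 128 * 1024 * 1024   # example segment size from the notes
DESCRIPTOR_SIZE = 16               # tentative descriptor size from the notes


def descriptor_array_size(disk_bytes, disks=2):
    """Return (logical segments, descriptors, total descriptor bytes)."""
    segments = -(-disk_bytes // SEGMENT_SIZE)      # round up
    descriptors = segments * disks * disks         # each disk stores both states
    return segments, descriptors, descriptors * DESCRIPTOR_SIZE


def load_from_media(bits):
    """Snapshot the on-media UNSTABLE/MASTER bits into the OLD_* bits."""
    bits &= ~(OLD_UNSTABLE | OLD_MASTER)
    if bits & MEDIA_UNSTABLE:
        bits |= OLD_UNSTABLE
    if bits & MEDIA_MASTER:
        bits |= OLD_MASTER
    return bits


class Descriptor:
    def __init__(self, media_bits=0):
        self.media = media_bits                 # bits as last written to media
        self.mem = load_from_media(media_bits)  # working in-memory copy
        self.media_writes = 0                   # count of expensive updates

    def begin_write(self):
        # STABLE -> UNSTABLE must reach the media before data is written.
        self.mem |= MEDIA_UNSTABLE
        if not self.media & MEDIA_UNSTABLE:
            self.flush()

    def mark_synchronized(self):
        # UNSTABLE -> STABLE happens in memory only; media catches up later.
        self.mem &= ~MEDIA_UNSTABLE

    def timer_tick(self):
        # Called from the long (30 or 60 second) timer.
        if (self.mem ^ self.media) & MEDIA_UNSTABLE:
            self.flush()

    def flush(self):
        self.media = self.mem & ~(OLD_UNSTABLE | OLD_MASTER)
        self.media_writes += 1


print(descriptor_array_size(1 << 40))   # (8192, 32768, 524288)
```

So the 1TBx2 example comes to 512KB of descriptors in total (256KB stored on
each disk), and a burst of writes to one segment costs one synchronous
descriptor update on the way in and at most one lazy update on the way out.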
 121
 122     Ok, now what actions need to be taken to satisfy a read or write?
 123     The actions taken will be based on the segment state for the segment
 124     involved in the I/O.  Any I/O which crosses a segment boundary would
 125     be split into two or more I/Os and treated separately.
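
Splitting an I/O at segment boundaries might look something like this.  It
is only a sketch: split_io() is a hypothetical helper and the 128MB segment
size is just the example figure used earlier.

```python
SEGMENT_SIZE = 128 * 1024 * 1024   # example segment size from the notes


def split_io(offset, length):
    """Return [(segment, offset, length), ...], no piece crossing a segment."""
    pieces = []
    end = offset + length
    while offset < end:
        seg = offset // SEGMENT_SIZE
        seg_end = (seg + 1) * SEGMENT_SIZE
        n = min(end, seg_end) - offset     # stop at the segment boundary
        pieces.append((seg, offset, n))
        offset += n
    return pieces


# An 8KB I/O straddling the first segment boundary becomes two 4KB I/Os.
print(split_io(SEGMENT_SIZE - 4096, 8192))
```

Each piece can then be handled against exactly one segment's state.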
 126
 127     Remember there are four descriptors for each segment, two on each drive:
 128
 129         DISK1 STATE stored on disk1
 130         DISK2 STATE stored on disk1
 131
 132         DISK1 STATE stored on disk2
 133         DISK2 STATE stored on disk2
 134
 135     In order to simplify matters any inconsistencies between e.g. the DISK2
 136     state as stored on disk1 and the DISK2 state as stored on disk2 would
 137     be resolved immediately prior to initiation of the actual I/O.  Otherwise
 138     the combination of four states is just too complex.
 139
 140     So if both drives are operational this resolution must take place.  If
 141     only one drive is operational then the state stored in the segment
 142     descriptors on that one operational drive is consulted to obtain the
 143     state of both drives.
 144
 145     This is the hard part.  Let's take the mismatched cases first.  That is,
 146     when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
 147     stored on DISK2 (or vice versa... disk1 state stored on each drive):
 148
 149         * If one of the two conflicting states has the UNSTABLE or MASTER
 150           bits set then set the same bits in the other.
 151
 152           Basically just OR some of the bits together and store to
 153           both copies.  But not all of the bits.
 154
 155         * If doing a write operation and the segment is marked UNINITIALIZED
 156           the entire segment must be zero-filled and the bit cleared prior
 157           to the write operation. ????  (needs more thought, maybe even a
 158           sub-bitmap. See later on in this email).
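
The "OR some of the bits" rule for mismatched copies could be sketched as
below.  The bit values and resolve_copies() are invented for illustration,
and leaving every other bit untouched is an assumption, since the notes only
say "not all of the bits".

```python
# Hypothetical bit assignments; the notes do not fix any values.
MEDIA_UNSTABLE      = 0x01
MEDIA_READ_DEGRADED = 0x02
MEDIA_MASTER        = 0x08

# Only these bits are propagated between the two stored copies.
PROPAGATED = MEDIA_UNSTABLE | MEDIA_MASTER


def resolve_copies(copy_on_disk1, copy_on_disk2):
    """Reconcile the two stored copies of ONE drive's state for a segment.

    Returns the pair of values to store back to disk1 and disk2.
    """
    shared = (copy_on_disk1 | copy_on_disk2) & PROPAGATED
    return copy_on_disk1 | shared, copy_on_disk2 | shared


# disk1 thinks the segment is unstable, disk2 thinks it is a master copy:
print(resolve_copies(MEDIA_UNSTABLE, MEDIA_MASTER))
```

After this step both copies agree on UNSTABLE and MASTER, which is what the
read and write rules depend on.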
 159
 160     Ok, now we have done that we can just consider two states, one for
 161     DISK1 and one for DISK2, coupled with the I/O operation:
 162
 163     WHEN READING:
 164
 165         * If MASTER is NOT set on either drive the read may be
 166           sent to either drive.
 167
 168         * If MASTER is set on one of the drives the read must be sent
 169           only to that drive.
 170
 171         * If MASTER is set on both drives then we are screwed.  This case
 172           can occur if one of the mirror drives goes down and a bunch of
 173           writes are made to the other, then the system is rebooted and the
 174           original mirror drive comes up but the other drive goes down.
 175
 176           So this condition detects a conflict.  We must return an I/O
 177           error for the READ, presumably.  The only way to resolve this
 178           is for a manual intervention to explicitly select one or the
 179           other drive as the master.
 180
 181         * If READ_DEGRADED is set on one drive the read can be directed to
 182           the other.  If READ_DEGRADED is set on both drives then either
 183           drive can be selected.  If the read fails on any given drive
 184           it is of course redispatched to the other drive regardless.
 185
 186           When READ_DEGRADED is set on one drive and only one drive is up
 187           we still issue the read to that drive, obviously, since we have
 188           no other choice.
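
The read rules above, written out as a small decision function.  This is a
sketch under assumptions: the bit values and read_candidates() are invented,
the drives are numbered 0 and 1, and raising an error on a MASTER conflict
follows the "presumably" in the text.

```python
# Hypothetical bit assignments; the notes do not fix any values.
MEDIA_READ_DEGRADED = 0x02
MEDIA_MASTER        = 0x08


def read_candidates(state1, state2, up1=True, up2=True):
    """Return the drives (0 or 1) to try for a read, in order of preference."""
    m1 = bool(state1 & MEDIA_MASTER)
    m2 = bool(state2 & MEDIA_MASTER)
    if m1 and m2:
        # Conflict: only manual selection of a master can resolve this.
        raise OSError("both drives marked MASTER for this segment")
    if m1:
        order = [0]
    elif m2:
        order = [1]
    else:
        order = [0, 1]
        # Prefer the drive that has not seen read errors on this segment.
        if state1 & MEDIA_READ_DEGRADED and not state2 & MEDIA_READ_DEGRADED:
            order = [1, 0]
    up = (up1, up2)
    return [d for d in order if up[d]]


print(read_candidates(MEDIA_READ_DEGRADED, 0))   # [1, 0]
```

The second entry in the returned list is the drive a failed read would be
redispatched to.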
 189
 190     WHEN WRITING:
 191
 192         * If MASTER is NOT set on either drive the write is directed to
 193           both drives.
 194
 195         * If only one drive has MASTER set the WRITE goes only to that drive.
 196
 197         * If both drives are marked MASTER the write is directed to both
 198           drives.  This is a conflict situation on read but writing will
 199           still work just fine.  The MASTER bit is left alone.
 200
 201         * If an I/O error occurs on one of the drives the WRITE_DEGRADED
 202           bit is set for that drive and the other drive (where the write
 203           succeeded) is marked as MASTER.
 204
 205           However, we can only do this if neither drive is already a MASTER.
 206
 207           If a drive is already marked MASTER we cannot mark the other drive
 208           as MASTER.  The failed write will cause an I/O error to be
 209           returned.
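
And the write rules, again only as a sketch.  write_targets() and
on_write_error() are hypothetical names, the bit values are invented, and
recording WRITE_DEGRADED even when a MASTER already exists is an assumption
(the text can be read either way).

```python
# Hypothetical bit assignments; the notes do not fix any values.
MEDIA_WRITE_DEGRADED = 0x04
MEDIA_MASTER         = 0x08


def write_targets(state1, state2):
    """Return the drives (0 or 1) a write is sent to."""
    m1 = bool(state1 & MEDIA_MASTER)
    m2 = bool(state2 & MEDIA_MASTER)
    if m1 == m2:
        return [0, 1]          # neither is MASTER, or both are (conflict)
    return [0] if m1 else [1]  # only the MASTER gets the write


def on_write_error(states, failed):
    """Update states (a 2-element list) after a write error on drive `failed`.

    Returns True if the write can still be reported as successful,
    False if an I/O error must be returned to the caller.
    """
    other = 1 - failed
    already_master = bool((states[0] | states[1]) & MEDIA_MASTER)
    states[failed] |= MEDIA_WRITE_DEGRADED
    if already_master:
        return False           # cannot promote the other drive
    states[other] |= MEDIA_MASTER
    return True


states = [0, 0]
print(on_write_error(states, failed=1), states)
```

Once drive 0 has been promoted this way, write_targets() sends later writes
for the segment only to drive 0 until resynchronization clears the bit.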
 210
 211     RESYNCHRONIZATION:
 212
 213         * A kernel thread is created to manage mirror synchronization.
 214
 215         * Synchronization of out-of-sync mirror segments can occur
 216           asynchronously, but must interlock against I/O operations
 217           that might conflict.
 218
 219           The segment array on the drive(s) is used to determine what
 220           segments need to be resynchronized.
 221
 222         * Synchronization occurs when the segment for one drive is
 223           marked MASTER and the segment for the other drive is not.
 224
 225         * In a conflict situation (where both drives are marked MASTER
 226           for any given segment) a manual intervention is required to
 227           specify (e.g. through an ioctl) which of the two drives is
 228           the master.  This overrides the MASTER bits for all segments
 229           and allows synchronization to occur for all conflicting
 230           segments (or possibly all segments, period, in the case where
 231           a new mirror drive is being deployed).
 232
 233     Segment array on-media and header.
 234
 235         * The mirroring code must reserve some of the sectors on the
 236           drives to hold a header and the segment array, making the
 237           resulting logical mirror a bit smaller than it otherwise would
 238           be.
 239
 240         * The header must contain a unique serial number (the uuid code
 241           can be used to generate it).
 242
 243         * When manual intervention is required to specify a master a new
 244           unique serial number must be generated for that master to
 245           prevent 'old' mirror drives that were removed from the system
 246           from being improperly recognized as being part of the new mirror
 247           when they aren't any more.
 248
 249         * Automatic detection of the mirror status is possible by using
 250           the serial number in the header.
 251
 252         * If the serial numbers for the header(s) for the two drives
 253           making up the mirror do not match (when both drives are up and
 254           both header read I/Os succeeded), manual intervention is required.
 255
 256         * Auto-detection of mirror segments ala Geom... using on-disk headers,
 257           is discouraged.  I think it is too dangerous and would much rather
 258           the detection be based on drive serial number rather than serial
 259           numbers stored on-media in headers.
 260
 261           However, I guess this is a function of LVM?  So I might not have
 262           any control over it.
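
The header serial-number rules can be modelled in a few lines.  This is a
sketch only: mirror_status(), select_master() and the status strings are
invented for illustration, and Python's uuid module stands in for "the uuid
code".

```python
import uuid


def new_serial():
    """Generate a fresh unique serial number for a mirror header."""
    return uuid.uuid4()


def mirror_status(serial1, serial2):
    """Classify a mirror from its two header serials.

    A serial of None means the drive is down or its header read failed.
    """
    if serial1 is None and serial2 is None:
        return "offline"
    if serial1 is None or serial2 is None:
        return "degraded"
    if serial1 == serial2:
        return "ok"
    return "manual-intervention"


def select_master(headers, master):
    """Manual intervention: give the chosen master a new serial number."""
    headers[master] = new_serial()


headers = [new_serial()] * 2          # both drives start with the same serial
print(mirror_status(*headers))        # ok
select_master(headers, 0)
print(mirror_status(*headers))        # manual-intervention
```

After select_master() the old partner no longer matches, so it cannot be
mistaken for part of the new mirror until it is explicitly re-added.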
 263
 264     The UNINITIALIZED FLAG
 265
 266         When formatting a new mirror or when a drive is torn out and a new
 267         drive is added the drive(s) in question must be formatted.  To
 268         avoid actually writing to all sectors of the drive, which would
 269         take too long on multi-terabyte drives and create unnecessary
 270         writes on things like SSDs, we instead use an UNINITIALIZED flag
 271         state in the descriptor.
 272
 273         If set any read I/O to the related segment is simply zero-filled.
 274
 275         When writing we have to zero-fill the segment (write zeros to the
 276         whole 128MB segment) and then clear the UNINITIALIZED flag before
 277         allowing the write I/O to proceed.
 278
 279         We might want to use some of the bits in the descriptor as a
 280         sub-bitmap.  e.g. if we reserve 4 bytes in the 16-byte descriptor
 281         to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
 282         segment down into 4MB pieces and only zero-fill/write portions
 283         of the 128MB segment instead of having to do the whole segment.
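
The sub-bitmap arithmetic works out neatly: 4 bytes is 32 bits, and 32 x 4MB
is exactly 128MB.  A sketch of how a write might consult it follows;
prepare_write() and read_is_zero() are hypothetical names, and "bit set
means the chunk is still uninitialized" is an assumed encoding.

```python
SEGMENT_SIZE = 128 * 1024 * 1024
CHUNK_SIZE = 4 * 1024 * 1024
CHUNKS = SEGMENT_SIZE // CHUNK_SIZE      # 32, fits a 4-byte bitmap


def prepare_write(bitmap, seg_offset, length):
    """Return (new_bitmap, chunks_to_zero_fill) for a write in one segment.

    seg_offset is the byte offset within the segment.  Chunks that are
    still uninitialized and touched by the write must be zero-filled first.
    """
    first = seg_offset // CHUNK_SIZE
    last = (seg_offset + length - 1) // CHUNK_SIZE
    to_zero = [c for c in range(first, last + 1) if bitmap & (1 << c)]
    for c in to_zero:
        bitmap &= ~(1 << c)              # chunk now holds real data
    return bitmap, to_zero


def read_is_zero(bitmap, seg_offset):
    """True if a read at seg_offset must return zeros, ignoring the media."""
    return bool(bitmap & (1 << (seg_offset // CHUNK_SIZE)))


bitmap = (1 << CHUNKS) - 1               # freshly created: all uninitialized
bitmap, zeroed = prepare_write(bitmap, CHUNK_SIZE - 512, 1024)
print(hex(bitmap), zeroed)
```

A 1KB write straddling the first chunk boundary zero-fills 8MB instead of
the whole 128MB segment.  A chunk that the write covers completely would not
need zero-filling at all, but the sketch does not bother with that case.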
 284
 285         I don't know how well this idea would work in real life.  Another
 286         option is to just return random data for the uninitialized portions
 287         of a new mirror but that kinda breaks the whole abstraction and
 288         could blow up certain types of filesystems, like ZFS, which
 289         assume any read data is stable on-media.
 290
 291
 292                                                 -Matt
 293
 294