From 1fed726e9f053cebfc43e32cd5ad938e84cfee9d Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Fri, 4 Mar 2011 12:47:31 +1100
Subject: [PATCH] README update

Signed-off-by: NeilBrown <neilb@suse.de>
---
 README | 106 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 87 insertions(+), 19 deletions(-)

diff --git a/README b/README
index a5e15f4..6c3d454 100644
--- a/README
+++ b/README
@@ -5127,39 +5127,39 @@ DONE 15az/ Sanity check all values in cluster head during roll-forward
       i.e. in roll_valid.  If the head isn't complete, we can still
       use this to commit some previous checkpoints.
 
-15ba/ roll forward should not BUG on bad data like inodefile in
+DONE 15ba/ roll forward should not BUG on bad data like inodefile in
     non-primary filesystem.
 
-15bb/ Do I need to sync something before copying an update over part
+DONE 15bb/ Do I need to sync something before copying an update over part
     of an inode, then reloading the inode.
 
-15bc/ Handle DescHole in roll forward.
+DONE 15bc/ Handle DescHole in roll forward.
 
-15bd/ Call lafs_add_block_address from writeback rather than iolock
+DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
     in roll forward, just for consistency.
 
-15be/ Confirm various files loaded at mount time (segusage, orphan ...)
+DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
     are actually the correct type.
 
-15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
+DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
    a lookup - or at least we can test for that.
    lafs_seg_apply_all has similar problems and needs a good solution.
 
-15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
+DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
     if parent splits.  See what to do about that.
 
-15bh/ after roll-forward, check that free_blocks hasn't gone negative.
+DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
   or handle if it has.
 
 DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
   to twostage.
 
-15bj/ Make sure .last link in segtracker is kepts uptodate, particularly in
+DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
    segdelete.
 
-15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
+DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
 
-15bl/ better checks for 'valid state block address' in valid_devblock
+DONE 15bl/ better checks for 'valid state block address' in valid_devblock
     include that segment_count is credible
     also in valid_stateblock
 
@@ -5173,12 +5173,13 @@ DONE 15bp/ review all superblocks - maybe use more anon??
 
 15bq/ check readonly status in lafs_get_sb
 
-15br/ sync_fs should probably wait for something if 'wait'.
+DONE 15br/ sync_fs should probably wait for something if 'wait'.
 
-15bs/ set f_fsid properly in lafs_statfs
+DONE 15bs/ set f_fsid properly in lafs_statfs
 
- - use new write_begin / write_end
-    - review how we ensure that credit remain with block.
+DONE  - use new write_begin / write_end
+
+15bt/    - review how we ensure that credit remains with block.
 
 15ca/ When pin inode data block, pin it as well as index block I think
     It is still kept of the leaf list until the index block is done with
@@ -5205,7 +5206,10 @@ DONE 15bp/ review all superblocks - maybe use more anon??
 15cc/ free any stray B_ASync block found in destroy_inode
 
 15cd/ Some code assumes a cluster header does not exceed 1 page.
-     Is this safe?  Is in true? Is it enforced?
+     Is this safe?  Is it true? Is it enforced?
+     roll-forward now handles large cluster_head.
+     Need cleaner to handle it, and need to possibly write large
+     cluster head when making new clusters.
 
 15ce/ classify BUGs as
         - internal logic errors
@@ -5300,7 +5304,7 @@ DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
      than just iolock, for consistency...
 DONE 36f/ What to do if table becomes full when add_block_address in
      roll_block ??
-36g/ Write roll_mini for directories.
+DONE 36g/ Write roll_mini for directories.
 DONE 36h/ In roll_one, use the cluster counting code to find block number and
      make sure we don't exceed the segment.
 DONE 36i/ add more general error checking to lafs_mount - 
@@ -5362,8 +5366,8 @@ DONE 52/ NFS export
     Need owner/group/perm for device file, but not for symlink.
     Can we create unique inode numbers?
     hard links for dev-files would be problematic.
-    What do we gain?  Maybe something for sort symlinks.
-    40 seems a ood length to et 70% of symlinks.
+    What do we gain?  Maybe something for short symlinks.
+    40 seems a good length to get 70% of symlinks.
 
 59/ Fix NeedFlush handling so we don't drop-then-retake
     a mutex as that isn't sensible.
@@ -5396,6 +5400,8 @@ DONE 52/ NFS export
    If a cross-directory rename happens care is needed:  either flush updates
    first or ensure that a flush does happen before the cross-directory
    update is flushed.
+   Note that if the target of a rename is a directory, it must also be fully
+   flushed before the rename can proceed.
 
 26June2010
  Investigating 5a
@@ -6973,3 +6979,65 @@ WritePhase - what is that all about?
    It is OK to delay the write-out of these until an fsync, and not bother
    if a checkpoint happens.
    So add that to th TODO list - item 66.
+
+28feb2010
+  - roll forward directory updates ... I wonder if I got it right :-)(untested).
+
+
+  I don't seem to have easy-access notes about the various meanings of
+  'width' and 'stride'
+
+  width:  The number of independent devices across which the (virtual) device
+    is placed.  The normal goal is to write 'width' blocks on every single write.
+    On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
+    and it will keep all devices equally busy with writes.
+    The 'width' blocks probably aren't consecutive.
+
+    There are two different layouts - one with width*stride <= segment_size
+    and one with width*stride > segment_size.
+
+  width*stride <= segment_size
+     This is a traditional striped layout like RAID0/4/5/6.
+     The 'stride' is the chunk size, so 'width*stride' is the stripe size,
+     and segment_size must be a multiple of this.
+     In this case all addresses in a single segment are contiguous.  We don't
+     necessarily write them in order if we want to write less than one stripe.
+     segment_offset will normally be a multiple of width*stride though this isn't
+     enforced as one could have a partition with a non-aligned start.
+
+  width*stride > segment_size
+     This implies a concatenated layout.  If parity-redundancy is in use then the
+     blocks which combine to form a stripe are 'stride' blocks apart.
+     The benefit of this layout is that an extra drive can be added by simply
+     zeroing it and joining it to the array - no re-stripe needed.
+     This will make all stripes slightly larger so at first the space will not
+     be available.  As cleaning happens the space will gradually become
+     available.  This still requires restriping, but unlike a normal
+     RAID5 restripe, the space becomes available in small amounts immediately.
+     When there is no demand for more space, the re-striping (cleaning) can happen
+     at a very low priority with no cost.
+
+     In this case the blocks in a segment are not contiguous.
+      Only 'segment_size/width' blocks are, then there is a large gap (in
+      virtual address space) to the next chunk.
+
+     The segment_offset is an amount of space which is free at the start of
+     each device.  0..segment_offset and stride..stride+segment_offset etc
+     do not contain data and can be used for metadata.
+
+  When width > 1 it makes sense to replicate each state block across
+     every device - as we want to write the whole stripe anyway.
+  For now we only write and read the first two copies at the beginning, and
+  the last two at the end...
+
+  Question:  what do we want to do about metadata on flash devices?  We really
+   don't want a small number of locations to store the metadata, but a large
+   number that we search through - possibly a binary search.
+   These could be all at start/end or scattered throughout the device.
+   The latter would make it impossible to find efficiently - there is no way to
+   create useful linkage without writing something else at start or end.
+   As many devices optimise for random writes where the FAT table would be,
+   it makes sense to just put the metadata there and not at the end.
+   We should allow one 'page' for each metadatum, which probably means
+   32K.
+   So we should allow all state blocks to be near the start.
-- 
2.39.5