README update

author NeilBrown <neilb@suse.de>

Wed, 4 May 2011 07:07:12 +0000 (17:07 +1000)

committer NeilBrown <neilb@suse.de>

Wed, 4 May 2011 07:07:12 +0000 (17:07 +1000)
diff --git a/README b/README
index 252e9b723978dcc492532bf6657c409ecc90f7df..1be7c53279754e70e37889eee32827d108ad2e50 100644
--- a/README
+++ b/README
@@ -5186,20 +5186,20 @@ DONE  - use new write_begin / write_end
      I think.
  
  15cb/ Layout issues:
-     - subset filesys still needs a parent pointer
-     - cluster head needs mtime/ctime to log these.
+     DONE - subset filesys still needs a parent pointer
+     DONE - cluster head needs mtime/ctime to log these.
       - need better tracking of which devices are in this array??
              Need to be able to have read-only devices that are shared
               among arrays.
-     - need multiple parallel write-clusters to allow parallel writes.
+     DONE - need multiple parallel write-clusters to allow parallel writes.
       - record tuning in state block:
             - max_segs
-     - use crc or something, not toy checksum (e.g. cluster - state already has)
+     DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
       - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
       - policies of whether old or new data is allowed on each device
       - policies of how much duplication of metadata is required
       DONE - inode map - not host-endian
-     - segments > 16bit:
+     DONE - segments > 16bit:
          segusage file - what about youth?
          cluster_head Clength
  
@@ -5233,6 +5233,25 @@ DONE 15cf/ lafs_iget_fs need to sometimes to in-kernel mounts for subset filesys
  15dc/ resolve table_size - it should be stored in the segusage file and validated
        based on device geometry.
  
+15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
+      cluster on same device to verify previous.
+
+15eb/ When multiple devices and lots to do and plenty of free space,
+       allow multiple segments, one per device, to be open at once,
+       and possibly be writing multiple clusters at once using
+       VerifyDevNext2
+
+15ec/ Implement i_version tracking.  This should be a 64bit number
+       that appears to change every time the file changes.  We only
+       need a new number when someone looks at the value with
+       getattr.
+       We could simply use mtime with the sub-millisecond part being
+       a counter of times that getattr sees a change in the same
+       millisecond.
+       However as mtime can go backwards we might get i_version going
+       backwards, which is awkward.  I wonder if I care.
+       Otherwise, leave for an inode extension later.
+
  16/ Update locking.doc
  
  17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
@@ -6503,7 +6522,7 @@ WritePhase - what is that all about?
      The writeout can be much later, but logging the mtime is fairly
      boring ... we could log mtime in the group head, which might be cheap
      enough.  How much precision is needed, and against what base?
-    probsably mtime of last checkpoint from superblock.  That should
+    probably mtime of last checkpoint from superblock.  That should
      be not more than 2048 seconds ago, so 16 bits gets is 30msec...
  
  14Aug2010
@@ -7362,11 +7381,156 @@ WritePhase - what is that all about?
  
  21mar2011
    My short-term todo list is:
-  - get 'lafs' to the stage where I can create an fs requiring roll-forward
-  - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
-  - Make lots of 'layout' changes - see 15cb
+DONE  - get 'lafs' to the stage where I can create an fs requiring roll-forward
+DONE  - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
+DONE  - Make lots of 'layout' changes - see 15cb
  
  02may2011
    - 'run' goes to completion, but segusage isn't updated in the final cluster
         and the number left over from before looks wrong.
-  - 'ls -l' on a subset file gets confused.
+DONE  - 'ls -l' on a subset file gets confused.
+  - fs created by 'lafs' has wrong Blocks and Inodes counts
+  - we lose a ref to a segsum and sometimes put it too often.
+REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+
+
+03may2011
+  Once I have these bugs sorted out I want to make some format changes.
+
+   DONE - fs_metadata needs a 'parent' link
+        rename needs to be careful about what is updated!
+        so does roll_mini
+        lafs_get_parent needs some thought.
+
+   DONE - roll-forward should get exact mtime stamps, and ctime.
+     So each data block must have an exact timestamp
+     of when the change actually happened.   Or the group_head
+     has a timestamp for the most recent update to the file
+     As we use nanosecond timestamps (pointless though they are)
+     we need 30 bits for the nanoseconds and at least 11 for the seconds.
+     So 48 bits (6 bytes) is plenty.
+     So include a 64bit timestamp in the cluster_head and 48bit
+     number to subtract in the group_head
+     But saving 2 bytes per file isn't really worth it, and we may
+     well lose it in padding.  So just store a 64bit timestamp in
+     the group_head.
+
+   DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
+
+   DONE - state block flags for inconsistencies found
+       If any inconsistency found, fsck is advised.
+       For some it may be imperative.
+       Things that can be wrong include:
+       - generic read error
+       - segusage negative
+       - index block incoherent
+       - dir block incoherent
+       - link count negative
+       - cluster header incoherent
+       -
+       64 bits should be adequate and simple for this.
+       Any unknown bit requires a full fsck.
+
+   DONE - 32bit segment size
+        With 16bit at 4K blocks we are limited to 256Meg segments.
+       64Meg with 1k blocks.  This takes about 1 second to write on
+       a modern drive.  On an array it will take even less time.
+       24bits gives 16 to 64 gigabytes which is plenty.
+       However 24bits is awkward to access.  A 1K block holds 341 1/3.
+       A 4K block holds 1365 1/3.
+       But this wastes less space than 256 or 1024 and so causes less IO.
+       But then we probably want to size segments to be very big.
+       A few thousand segments should be OK, which is tens of blocks.
+       I don't think the savings with 24bits are worth it, and I do
+       think very big segments could be useful, so let's go with 32bit segments.
+
+       Youth is currently tuned to 16bits.  Let's leave it there and
+       maybe waste some space.
+
+
+   - parallel new-data write clusters.
+       I think it is sufficient to include a second 'next_addr' in the
+       cluster_head - or maybe two.  alt_next_addr[2].
+       When a thread wants to start a new stream of clusters it allocates
+       the segments then attaches to the next outgoing write cluster.
+       Once that is written everything in the new cluster is safe.
+       On a checkpoint every stream writes at least one checkpoint cluster
+       and these are linked together through alt_next_addr.
+       The 'next' cluster for each must be the checkpoint cluster and must
+       carry linkage, but unlike with first-link there is no need to wait.
+       The data is already safe as long as the state block isn't updated
+       until every cluster_end block is written.
+       So really, one is enough.  I had thought 2 would enable quick fan-out
+       but there is no real need for that.
+
+       As 0 is a valid write-cluster address we use 'this_address' to signify
+       that there is no alt-next.
+
+       It is possible that a block of a file could be written to two
+       different streams at different points in time between two checkpoints.
+       We need to ensure that roll-forward gets these in the right order.
+       'seq' can be the same in two different streams so we cannot use that.
+       timestamp could possibly be used, but as times can go backwards it
+       is not ideal.
+
+       NEW IDEA.  Just use one stream of clusters.  However it can
+       bounce from one device to another easily.  So two different
+       threads can be building up two different write clusters at the
+       same time as long as they synchronise at some point to pass
+       addresses around.  They also need some other Verify mode as
+       VerifyNext or VerifyNext2 will destroy any parallelism.
+       As the point of this is to write to multiple devices in
+       parallel, maybe VerifyDevNext{,2} meaning the next header on
+       the same device serves to verify this.
+
+   - policies.
+       This includes
+               maximum number of segments written between checkpoints
+               whether data can be cleaned to a particular device
+               whether a device can receive new data
+               whether metadata duplication is needed
+               whether an RO device from a different array is allowed.
+       Some of these are per-device policies.  Some are per-array.
+
+       The 'RO Device' thing is special.  I think I want an alt_uuid.
+       It works like this:  You assemble the RO array when you
+       mount a new filesystem identifying the old as a component.
+       So that 'state' block on the new devices must identify the alt_uuid
+       and state seq number.
+
+       Do we want to record more info about which devices are in the
+       array?  Currently we just record how many.  If we find enough
+       with the right UUID/seq, they must be it.. what else would we
+       want?
+
+       For all the other policy statements it is probably simplest to
+       allow a set of simple strings. e.g. "noclean", "nonew",
+       "dup=2" "maxseg=5"
+       devblock currently uses 146 bytes, so room for 878
+       stateblock uses 112 plus some for snapshots, so much the same.
+       We currently don't use 'version' and have no concrete plans.
+       The vague idea is to allow lafs to *know* that it cannot mount
+       the array, so any incompatible feature gets set.
+       We could keep those in the policy sets.  From that perspective
+       there are 3 types of things.
+        - if you don't understand, don't worry
+        - if you don't understand, don't try to write
+        - if you don't understand, you cannot even read.
+
+       That last is really best avoided.  We have version info
+       elsewhere in the tree so that a new index style will simply
+       make that block unreadable.
+       So I think make the dev and state blocks a simple incrementing
+       version number which apply to that block, and have "don't
+       worry" and "don't write" policies distinguished by first
+       letter.
+       Capital is "If you don't understand, don't write"
+       Lower is "if you don't understand, don't worry".
+
+       These are space separated strings
+
+   - etc.
+
+   - what about i_version?  Include in timestamp?
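
Illustration (not part of the commit): the policy-string convention in the
03may2011 notes, space-separated strings where a capital first letter means
"if you don't understand, don't write" and a lower-case first letter means
"if you don't understand, don't worry", could be sketched roughly as below.
This is Python rather than LaFS code; the function name classify_policies,
the 'known' set and the example policy "Newfeature" are assumptions made up
for the sketch, and the "cannot even read" case is left to the block version
number as the notes suggest.

```python
def classify_policies(policy_field, known):
    """Return "rw" or "ro" for a space-separated policy string.

    policy_field: e.g. 'noclean nonew dup=2 maxseg=5' (examples from the notes).
    known: set of policy names (the part before any '=') that this
           hypothetical implementation understands.
    """
    mode = "rw"
    for policy in policy_field.split():
        name = policy.split("=", 1)[0]
        if name in known:
            continue
        # Unknown policy: a capital first letter means "don't write",
        # a lower-case first letter means "don't worry".
        if name[:1].isupper():
            mode = "ro"
    return mode
```

With known = {"noclean", "dup"}, the string "noclean dup=2 somethingnew"
still allows writing, while "noclean Newfeature" would force a read-only
mount.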
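
Illustration (not part of the commit): the size arithmetic in the
"32bit segment size" and timestamp notes can be checked with a few lines of
Python.  The helper names below are invented for this check; only the
numbers come from the README text.

```python
from fractions import Fraction

def max_segment_bytes(size_bits, block_size):
    """Largest segment, in bytes, when its block count fits in size_bits."""
    return (1 << size_bits) * block_size

def entries_per_block(block_size, entry_bits):
    """How many entry_bits-wide entries fit in one block (may be fractional)."""
    return Fraction(block_size * 8, entry_bits)

MIB = 1 << 20
GIB = 1 << 30

# Bits needed for a nanosecond field (0..999,999,999) and for a seconds
# offset of up to 2048 seconds, as discussed for the group_head timestamp.
NSEC_BITS = (10**9 - 1).bit_length()
SEC_BITS = (2048 - 1).bit_length()
```

max_segment_bytes(16, 4096) is 256 MiB and max_segment_bytes(16, 1024) is
64 MiB; 24 bits gives 16 GiB to 64 GiB; a 1K block holds 1024/3 = 341 1/3
24-bit entries; and NSEC_BITS + SEC_BITS is 41, which fits in the 48 bits
the notes consider before settling on a plain 64bit timestamp.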