+DONE - 'ls -l' on a subset file gets confused.
+ - fs created by 'lafs' has wrong Blocks and Inodes counts
+ - we lose a ref to a segsum and sometimes put it too often.
+REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+
+
+03may2011
+ Once I have these bugs sorted out I want to make some format changes.
+
+ DONE - fs_metadata need a 'parent' link
+ rename needs to be careful about what is updated!
+ so does roll_mini
+ lafs_get_parent needs some thought.
+
+ DONE - roll-forward should get exact mtime stamps, and ctime.
+ So each data block must have an exact timestamp
+ of when the change actually happened. Or the group_head
+ has a timestamp for the most recent update to the file.
+ As we use nanosecond timestamps (pointless though they are)
+ we need 30 bits for the nanoseconds and at least 11 for the seconds.
+ So 48 bits (6 bytes) is plenty.
+ So include a 64bit timestamp in the cluster_head and 48bit
+ number to subtract in the group_head
+ But saving 2 bytes per file isn't really worth it, and we may
+ well lose it in padding. So just store a 64bit timestamp in
+ the group_head.
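+
+ A minimal sketch of the bit counting above. The layout (seconds in the
+ high bits, nanoseconds in the low 30 bits) is an assumption made for
+ illustration, not something these notes settle, and the names ts_pack,
+ ts_sec and ts_nsec are hypothetical.
+

```c
#include <stdint.h>

/* Assumed layout, for illustration only: nanoseconds occupy the low
 * 30 bits (10^9 < 2^30), seconds sit above them. */
#define TS_NSEC_BITS 30
#define TS_NSEC_MASK ((UINT64_C(1) << TS_NSEC_BITS) - 1)

/* 30 bits of nanoseconds plus at least 11 bits of seconds is 41 bits,
 * which is why a 48bit offset would have been plenty. */
#define TS_MIN_DELTA_BITS (TS_NSEC_BITS + 11)

/* Pack seconds and nanoseconds into one 64bit value. */
static inline uint64_t ts_pack(uint64_t sec, uint32_t nsec)
{
	return (sec << TS_NSEC_BITS) | (nsec & TS_NSEC_MASK);
}

/* Recover the seconds part. */
static inline uint64_t ts_sec(uint64_t ts)
{
	return ts >> TS_NSEC_BITS;
}

/* Recover the nanoseconds part. */
static inline uint32_t ts_nsec(uint64_t ts)
{
	return (uint32_t)(ts & TS_NSEC_MASK);
}
```

+ With a full 64bit value in the group_head there is no subtraction to
+ do at roll-forward time; the value is simply unpacked.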
+
+ DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
+
+ DONE - state block flags for inconsistencies found
+ If any inconsistency found, fsck is advised.
+ For some it may be imperative.
+ Things that can be wrong include:
+ - generic read error
+ - segusage negative
+ - index block incoherent
+ - dir block incoherent
+ - link count negative
+ - cluster header incoherent
+ -
+ 64 bits should be adequate and simple for this.
+ Any unknown bit requires a full fsck.
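+
+ A sketch of how such a 64 bit flag word might be tested. The bit
+ assignments and all names here are hypothetical; only the list of
+ conditions and the "unknown bit means full fsck" rule come from the
+ notes above.
+

```c
#include <stdint.h>

/* Hypothetical bit assignments, one per inconsistency listed above. */
#define INCON_READ_ERROR	(UINT64_C(1) << 0)	/* generic read error */
#define INCON_SEGUSAGE_NEG	(UINT64_C(1) << 1)	/* segusage negative */
#define INCON_INDEX_BLOCK	(UINT64_C(1) << 2)	/* index block incoherent */
#define INCON_DIR_BLOCK		(UINT64_C(1) << 3)	/* dir block incoherent */
#define INCON_LINK_NEG		(UINT64_C(1) << 4)	/* link count negative */
#define INCON_CLUSTER_HEAD	(UINT64_C(1) << 5)	/* cluster header incoherent */

#define INCON_KNOWN_MASK	(INCON_READ_ERROR | INCON_SEGUSAGE_NEG | \
				 INCON_INDEX_BLOCK | INCON_DIR_BLOCK | \
				 INCON_LINK_NEG | INCON_CLUSTER_HEAD)

/* Any flag at all means fsck is advised. */
static inline int incon_fsck_advised(uint64_t flags)
{
	return flags != 0;
}

/* A bit this code does not know about requires a full fsck. */
static inline int incon_needs_full_fsck(uint64_t flags)
{
	return (flags & ~INCON_KNOWN_MASK) != 0;
}
```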
+
+ DONE - 32bit segment size
+ With 16bit at 4K blocks we are limited to 256Meg segments.
+ 64Meg with 1k blocks. This takes about 1 second to write on
+ a modern drive. On an array it will take even less time.
+ 24bits gives 16 to 64 gigabytes which is plenty.
+ However 24bits is awkward to access. A 1K block holds 341 1/3.
+ A 4K block holds 1365 1/3.
+ But this wastes less space than 256 or 1024 and so causes less IO.
+ But then we probably want to size segments to be very big.
+ A few thousand segments should be OK, which is tens of blocks.
+ I don't think the savings with 24bits are worth it, and I do
+ think v.big segments could be useful, so let's go with 32bit segments.
+
+ Youth is currently tuned to 16bits. Let's leave it there and
+ maybe waste some space.
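+
+ The size arithmetic above can be checked with a short sketch. The
+ helper names are hypothetical; the numbers are the ones quoted in the
+ notes (16, 24 and 32 bit counts, 1K and 4K blocks).
+

```c
#include <stdint.h>

/* Largest segment, in bytes, when the per-segment block count is
 * 'bits' wide and blocks are 'block_size' bytes. */
static inline uint64_t max_segment_bytes(unsigned bits, uint64_t block_size)
{
	return (UINT64_C(1) << bits) * block_size;
}

/* Whole counters of 'entry_bytes' bytes that fit in one block.
 * A 3 byte (24bit) counter leaves a 1/3 entry remainder. */
static inline uint64_t entries_per_block(uint64_t block_size,
					 unsigned entry_bytes)
{
	return block_size / entry_bytes;
}
```

+ For example max_segment_bytes(16, 4096) is 256Meg, and
+ entries_per_block(1024, 3) is 341 with one byte left over.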
+
+
+ - parallel new-data write clusters.
+ I think it is sufficient to include a second 'next_addr' in the
+ cluster_head - or maybe two. alt_next_addr[2].
+ When a thread wants to start a new stream of clusters it allocates
+ the segments then attaches to the next outgoing write cluster.
+ Once that is written everything in the new cluster is safe.
+ On a checkpoint every stream writes at least one checkpoint cluster
+ and these are linked together through alt_next_addr.
+ The 'next' cluster for each must be the checkpoint cluster and must
+ carry linkage, but unlike with first-link there is no need to wait.
+ The data is already safe as long as the state block isn't updated
+ until every cluster_end block is written.
+ So really, one is enough. I had thought 2 would enable quick fan-out
+ but there is no real need for that.
+
+ As 0 is a valid write-cluster address we use 'this_address' to signify
+ that there is no alt-next.
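+
+ A sketch of that sentinel convention. The struct is a hypothetical,
+ trimmed-down stand-in for the real cluster_head, and the field and
+ helper names are assumptions; only the rule "alt-next equal to
+ this_address means none" comes from the note above.
+

```c
#include <stdint.h>

/* Hypothetical cut-down cluster head: just the linkage fields. */
struct cluster_head_sketch {
	uint64_t this_addr;	/* address of this write cluster */
	uint64_t next_addr;	/* next cluster in this stream */
	uint64_t alt_next_addr;	/* start of another stream, if any */
};

/* Mark the header as having no alt-next.  0 cannot be used for this
 * because 0 is a valid write-cluster address. */
static inline void cluster_clear_alt_next(struct cluster_head_sketch *ch)
{
	ch->alt_next_addr = ch->this_addr;
}

/* True if this header links to a second stream of clusters. */
static inline int cluster_has_alt_next(const struct cluster_head_sketch *ch)
{
	return ch->alt_next_addr != ch->this_addr;
}
```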
+
+ It is possible that a block of a file could be written to two
+ different streams at different points in time between two checkpoints.
+ We need to ensure that roll-forward gets these in the right order.
+ 'seq' can be the same in two different streams so we cannot use that.
+ timestamp could possibly be used, but as times can go backwards it
+ is not ideal.
+
+ NEW IDEA. Just use one stream of clusters. However it can
+ bounce from one device to another easily. So two different
+ threads can be building up two different write clusters at the
+ same time as long as they synchronise at some point to pass
+ addresses around. They also need some other Verify mode as
+ VerifyNext or VerifyNext2 will destroy any parallelism.
+ As the point of this is to write to multiple devices in
+ parallel, maybe VerifyDevNext{,2} meaning the next header on
+ the same device serves to verify this.
+
+ - policies.
+ This includes
+ maximum number of segments written between checkpoints
+ whether data can be cleaned to a particular device
+ whether a device can receive new data
+ whether metadata duplication is needed
+ whether an RO device from a different array is allowed.
+ Some of these are per-device policies. Some are per-array.
+
+ The 'RO Device' thing is special. I think I want an alt_uuid.
+ It works like this: You assemble the RO array when you
+ mount a new filesystem identifying the old as a component.
+ So that 'state' block on the new devices must identify the alt_uuid
+ and state seq number.
+
+ Do we want to record more info about which devices are in the
+ array? Currently we just record how many. If we find enough
+ with the right UUID/seq, they must be it. What else would we
+ want?
+
+ For all the other policy statements it is probably simplest to
+ allow a set of simple strings. e.g. "noclean", "nonew",
+ "dup=2" "maxseg=5"
+ devblock currently uses 146 bytes, so room for 878
+ stateblock uses 112 plus some for snapshots, so much the same.
+ We currently don't use 'version' and have no concrete plans.
+ The vague idea is to allow lafs to *know* that it cannot mount
+ the array, so any incompatible feature gets set.
+ We could keep those in the policy sets. From that perspective
+ there are 3 types of things.
+ - if you don't understand, don't worry
+ - if you don't understand, don't try to write
+ - if you don't understand, you cannot even read.
+
+ That last is really best avoided. We have version info
+ elsewhere in the tree so that a new index style will simply
+ make that block unreadable.
+ So I think give the dev and state blocks a simple incrementing
+ version number which applies to that block, and have "don't
+ worry" and "don't write" policies distinguished by first
+ letter.
+ Capital is "If you don't understand, don't write"
+ Lower is "if you don't understand, don't worry".
+
+ These are space-separated strings.
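+
+ A sketch of how the first-letter rule might be applied when scanning
+ the policy strings. The function names, the 'known' table and the
+ choice to match on the name before any '=' are assumptions; the rule
+ itself (unknown capital means don't write, unknown lower case means
+ don't worry) is from the notes above.
+

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the policy name tok[0..len) is in the known list. */
static int policy_is_known(const char *tok, size_t len,
			   const char *const *known, size_t nknown)
{
	for (size_t i = 0; i < nknown; i++)
		if (strlen(known[i]) == len && memcmp(known[i], tok, len) == 0)
			return 1;
	return 0;
}

/*
 * Scan a space separated policy string.  Return 1 if writing is
 * allowed, 0 if an unknown policy starting with a capital was found.
 * Unknown lower-case policies are ignored ("don't worry").
 */
static int policies_allow_write(const char *policies,
				const char *const *known, size_t nknown)
{
	const char *p = policies;

	for (;;) {
		p += strspn(p, " ");		/* skip separators */
		if (*p == '\0')
			return 1;
		size_t len = strcspn(p, " ");	/* whole token, e.g. "dup=2" */
		const char *eq = memchr(p, '=', len);
		size_t namelen = eq ? (size_t)(eq - p) : len;

		if (!policy_is_known(p, namelen, known, nknown) &&
		    isupper((unsigned char)*p))
			return 0;
		p += len;
	}
}
```

+ So with "noclean", "nonew", "dup" and "maxseg" understood, the string
+ "noclean dup=2 maxseg=5" allows writing, while the same string with an
+ unrecognised capitalised word appended does not.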
+
+ - etc.
+
+ - what about i_version? Include in timestamp?