From: NeilBrown
Date: Mon, 9 Aug 2010 02:41:42 +0000 (+1000)
Subject: README update
X-Git-Url: http://git.neil.brown.name/?a=commitdiff_plain;h=a2406969209b8fa48c622a38bc32f2f68ae7478b;p=LaFS.git

README update
---

diff --git a/README b/README
index 14da9c6..2c2f0ad 100644
--- a/README
+++ b/README
@@ -5284,6 +5284,13 @@ DONE
 52/ NFS export
 56/ Review roll-forward completely.
 
+57/ learn about FS_HAS_SUBTYPE and document it.
+
+58/ Consider embedding symlinks and device files in the directory.
+    Need owner/group/perm for a device file, but not for a symlink.
+    Can we create unique inode numbers?
+    hard links for dev-files would be problematic.
+
 26June2010
 Investigating 5a
 
@@ -5936,3 +5943,365 @@ WritePhase - what is that all about?
 writeback, do I need to wait, or can I just mark it dirty in
 commit_write?  If no checksum and no duplication applies, this
 should be fine.
+
+16July2010
+ BUT e.g. dir operations are in particular phases.  If the dirblock
+ is pinned to the old phase, we need to flush it, then wait for the io
+ to complete.  So we need lafs_phase_wait as well as iolock_written.
+ This is already done by pin_dblock.
+ I wonder if we need a way to accelerate pinned blocks that are being
+ waited for - probably not, they should be done early.
+
+ So we probably want to iolock after phase_wait in pin_dblock.
+ Though dir.c pins early.
+ I need to review all of this and get it right.
+
+ So:
+  - we aren't allowed to block much holding checkpoint_lock as
+    checkpoint_start waits for that.  However phase_wait will only
+    block if a new checkpoint has started already, so there is no
+    chance of phase_wait ever blocking checkpoint_start.
+    So it is safe to call phase_wait under checkpoint_lock.
+    phase_wait will wait until the block is written, added back to
+    the lru clean, then found and flipped... I wonder if that is
+    good - it keeps the parent from being a leaf, and so written,
+    until the child write has completed.
+ We want to phase-flip a block as soon as it is allocated by cluster_flush.
+
+ With directory blocks, i_mutex stops other changes, so an early
+ iolock_written will leave the block clean and phase won't be an issue.
+
+ With inode-map blocks we:
+   set B_Pinned to ensure no-one writes except for phase change
+     do that after lock_written so it starts safe.
+   once we have checkpoint_lock, wait for phase if needed.
+   then lock_written again, which should be instant but ensures
+     that the block is locked while we change it...
+
+ I think I want
+  - refile to call phase flip if the index is not dirty and is in the
+    wrong phase and has no pinned children in that phase.
+  - Only clear PinPending if we have i_mutex or refcnt == 0
+  - before transaction:
+      lock_written / set PinPending / unlock
+    then inside cluster_lock
+      lock_written / pin / change / dirty / unlock
+    it will only wait for writeout if the phase changed,
+    so we don't need phase_wait,
+    but we want pre-pin then pin_dblock.
+ Transactions are:
+   dir create/delete/update - DONE
+   inode allocate/deallocate - on inode map  DONE
+   setattr  DONE
+   orphan set/change/discard
+
+ Orphans are a little different, as when we compact the
+ file, the orphan file block 'owned' by the orphan block
+ can change.  As long as we keep them all PinPending it
+ should be fine though.
+ I think that every block in the orphan file will always be
+ PinPending ???
+
+ OK - done most of that.
+ Early phase_flip is awkward.  We need an iolock to phase_flip,
+ and we don't have one.  The phase_flip could cause incorporation,
+ which cannot happen until the write completes.  So I guess
+ we leave it as it is.
+
+ FIXME what about inode data blocks - cluster_allocate is removing
+ PinPending after making them dirty from the index block..
+
+ If all free inode numbers are B_Claimed, I don't think we allocate
+ a new block... yes we do, as 'restarted' is local to the caller.
+
+ Also, each device has a number of flags:
+  - new metadata can go here
+  - new data can go here
+  - clean data can go here
+  - clean metadata can go here
+  - non-logged segments allowed
+  - priority clean - any segment can be cleaned
+  - dev is shared and read-only - no state-block updates
+
+ The state block needs a uuid for an ro-filesystem that this is
+ layered on.
+
+ Is metadata an issue?
+ We might want it on a faster device, but ditto for directories
+ and for some data.  So probably skip that.
+
+ Have separate segment tables for:
+  - can have new data
+  - can have clean data but not new. (this is often empty)
+
+ Clean data can go to new-not-clean if nothing else.
+ New data can go to clean-not-new ?? if not sync??
+ Maybe call them 'prefer clean' and 'prefer new'.
+
+ I think we want:
+  'no sync new' - don't write new data, unless it is in big chunks and
+       can wait for the checkpoint to be 'synced'
+  'no write' - never write anything - this is readonly.
+       used for removing a device from the fs.
+
+ A 'no sync new' device can have single-block segments.
+ This doesn't allow compression, but avoids any need to clean.
+ In this case we don't store youth and the segusage is 32 bits per segment.
+ That means - for 1K block size - 0.5% of the device used for segusage.
+ That feels high.  For 4K, 1/1024, so a gigabyte per terabyte.
+ Then limited to 29 snapshots plus the base fs, and 2 bits to record
+ bad blocks.
+
+ Other segusage for 29 snaps is 1/million of space used.
+ So we 'waste' 0.1% of the device for no secondary cleaning.
+ Can still do defrag though.
+
+ clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
+ as does creating a snapshot.
+
+18jul2010
+ If lafs were cluster enabled we would want multiple checkpoint clusters,
+ one for each node.  When a node crashes, some node would need to find it
+ and roll forward.  For single-node failure, it is enough to broadcast the
+ cluster address to all the others.
+ For whole-cluster failure, we need to either list all of them
+ in the superblock or link from the main write cluster.
+
+ When writing to multiple devices we may want multiple write clusters
+ active for new data.  These all need to be findable from the checkpoint
+ cluster, so linking sounds good.
+ Having a single 'fork' link in the cluster head might work, but does not
+ scale to a large cluster.  It doesn't need to be committed to others, nor
+ does checkpoint end, so that should be ok.
+ Could have a special group_head to list other clusters for roll forward.
+ If we put fsnum first, a large value - 0xffffffff - could easily mean
+ something else.
+
+ Or every cluster head could point to an alternate stream, and if we want
+ many quickly, each simply points to another, so we create a chain across
+ all writers.
+
+ Another issue...
+ When we 'sync' we don't wait for blocks until after the checkpoint is
+ started, and we know that will be driven through to CheckpointEnd, which
+ will commit and release everything.
+ However 'fsync' doesn't have the same guarantee.  The sync_page call will
+ ensure the data has been written, but we don't know it is safe until the
+ next header is written.  So we need to push out the next cluster promptly.
+
+ So if sync_page is called on a page in writeback, then we mark the cluster
+ as synchronous.  When a sync cluster completes, the next (or even next+1)
+ clusters are flushed out promptly.  Hopefully they won't be empty on a
+ reasonably busy system, but it is OK if they are.
+
+ If a block is in writeback for the cleaner, then as the cluster is
+ VerifyNone, as soon as the write completes the block will be released.
+
+ So: to clarify sync_page:
+  This can be called when the page is in writeback or locked.
+  If locked, there is nothing we can do except maybe unplug the read queue.
+  If the page is in writeback and the block is dirty, then it is probably
+  in a cluster queue and we should flush the cluster and the next.
+  If the page is in writeback and the block is not dirty, but is in
+  writeback, just flush one cluster.
+  But we don't want these cluster flushes to start while the previous one
+  is still outstanding, else we stop new requests from being added.
+  So as soon as the cluster can be flushed we flush, but no sooner.
+  I guess we use FlushNeeded and make that be less hasty.
+
+19July2010
+
+ superblocks....
+ We currently have a superblock for each device.
+ I cannot see a good reason for that.
+ We can just bd_claim for 'this' filesystem.
+ Rather we should have a number of anon superblocks,
+ one for each fileset, then one for each snapshot.
+ Do we use different fs types?  Probably yes:
+   lafs - main filesystem made from devices
+   lafs_subset - subordinate fileset, given a path to the fileset object.
+       can have a 'create' option when given an empty directory.
+   lafs_snap - snapshot - given a path to the filesys and a textname.
+
+ Cannot create a snap of a subset, only of the whole filesystem.
+ Is it OK to mount either a snap of a subset or a subset of a snap?
+ It probably is, so we need to use the same filesystem type for both.
+ Maybe lafs_sub or sublafs.  Needs a path to a directory.
+ Can be given 'snap=foo'.
+ No: a given filesystem may not exist in a snapshot.  You need to
+ mount the snapshot first, then the subset of the snapshot.
+ So we have three types as above.  All subsets use 'lafs_subset',
+ whether they are subsets of main or of a snapshot.
+
+ Should we be able to create a snapshot or subset without mounting it?
+ It doesn't really seem necessary but might be elegant..
+
+ remount doesn't seem the right way to edit a filesystem, as it forces
+ some cache flushing.
+ What do we want to edit?
+  - add device, remove device
+  - add/remove snapshot by name
+  - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
+    an empty dir to a subset.
+  - change cleaner settings??
+ Could have remount as an option.  If that's a problem, find another option.
+
+ While cleaning (which is always) we potentially need all superblocks
+ available, as we might need to load blocks in those filesystems to
+ relocate them.
+ Unfortunately each super needs to be in a global list, so there is a cost
+ in having them appear and disappear.  I guess that is not a big deal.  They
+ are refcounted and will disappear cleanly when the count hits zero.
+
+ So:
+  DONE - change all prime_sb->blocksize refs to fs->blocksize
+  DONE - create an anon sb for the main filesystem
+  DONE - discard the device sbs, just bd_claim the devices and add to a list
+       - use lafs_subset for creating/mounting subsets.
+
+ Changed s_fs_info to point to the TypeInodeFile for the super, but
+ for root/snapshot that doesn't exist early enough to differentiate the
+ super in sget.
+ So we make an inode before the super exists and attach it after.
+ Need to do all that get_new_inode does:
+   inode_stat.nr_inodes++  - just don't generic_forget the inode
+   add to inode_in_use - seems pointless - just set i_list to something
+   add to sb->s_inodes - if we don't, it won't flush - maybe that is good?
+   add to hash - don't want that
+   i_state == lock|new - only really needed if hashed.
+ But there is lots of initialisation in alloc_inode that we cannot access!!
+
+ The problem is that we need s_fs_info to uniquely identify the fs with
+ something that can be set under the spinlock, so allocating an inode is
+ out.  And we also need to get to the filesystem metadata, which is in the
+ inode.
+ I guess we allocate a little something that stores an identifier and later
+ the inode:
+   for lafs we use the uuid
+   for a subset we use just the inode
+   for a snapshot we use the fs and number
+
+25July2010
+ superblocks:
+  - sget gives us an active super_block.  We need to attach it to a vfsmnt
+    using simple_set_mnt, or call deactivate_locked_super.
+  - sget's set should call set_anon_super
+  - kill_sb (called by deactivate_super) should then call kill_anon_super
+
+ If we have a vfsmnt, we have an active reference, so we can atomic_inc
+ s_active safely.  So use this to allow snapshots and subsets to hold a
+ ref on the prime_sb and thence on the 'fs'.
+
+26July2010
+  - DONE need to set MS_ACTIVE somewhere!!
+  - FIXME if an inode is being dropped when an iget comes in, it gets
+    confused and the inode appears to be deleted.
+
+ We cannot really break the dblock <-> inode link until after
+ write_inode_now, but there is no call-back before generic_detach_inode
+ is complete.  The last is write_inode, which is only called if
+ I_DIRTY_something.
+ Maybe when writeback completes on an inode dblock, we should check if
+ the inode is I_WILL_FREE and if so, break the link...
+
+ Or maybe when we find my_inode set we can check the block, and if it isn't
+ dirty or being deleted, break the link directly...  That makes more sense.
+
+ So... what is the deal with freeing inodes???
+   ->iblock is like a hashtable reference.  It is not refcounted.
+     It gets set under private_lock.
+     iblock is freed by memory pressure or by lafs_release_index from
+     destroy_inode.
+     When the refcount of iblock is non-zero, the ->dblock ref is counted,
+     else it is not.
+   dblock is set to NULL if I_Destroyed, or when the dblock is discarded
+     (under lafs_hash_lock),
+     and set to 'b' in lafs_iget and lafs_inode_dblock.
+
+ We can drop the dblock link as soon as iblock has no reference.
+
+ Probably get clear_inode to break the link if possible, which it should
+ be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
+ Or maybe do it in clear_inode itself.
+
+ FIXME when we drop dblock we must clear iblock! as getiref of iblock
+ assumes dblock is not NULL.
+
+28July2010
+ So: ->dblock and ->my_inode need to be clarified.
+
+ Neither is a counted reference - the idea is that either can be freed and
+ will destroy the pointer at that time, so if the pointer is there, the
+ object must be ... but we need locking for that.
+ ->dblock is reasonably protected by private_lock, though if ->iblock
+ exists we hold a ref on ->dblock, so we can access it more safely.
+
+ Need to check that getiref_locked knows ->dblock exists when called on
+ iblock, and lafs_inode_fillblock.
+   yes, both safe!
+
+ But ->my_inode needs locking too, so the inode can safely disappear
+ without having to wait for the data block to go.  After all, data blocks
+ come in sets, and one shouldn't keep the others with inodes.
+ So something light-weight like rcu might work.
+ We use call_rcu to free the inode and rcu_read_lock to access ->my_inode.
+
+ Yes, that will work.  Occasionally we will want an igrab too, but not
+ often.
+ Should look into rcu for the index hash table and ->iblock as well.
+ Currently ->iblock is only cleared when the block is freed .. I guess
+ that is fine...
+
+31Jul2010
+ rcu protection of ->my_inode
+  A/ orphan inodes - are they protected?
+  B/ orphan blocks - are the inodes of those protected?  Probably...
+
+ inodes are 'orphan' for two reasons:
+  1/ a truncate is in progress
+  2/ there are no remaining links, so the inode should be truncated/deleted
+     on restart.
+
+ The second precludes us from holding a refcount on any orphan inode,
+ else it would never get deleted.
+ So we must assert that an inode with I_Deleting or I_Trunc has an implied
+ reference and so delete must be delayed... not quite.
+ If we set I_Trunc but not I_Deleting, then we igrab the inode until
+ I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
+ be set, as that is set when the last ref is dropped.
+
+01Aug2010
+ FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
+ .. and in lafs_orphan_release
+ Well...
+ only iolock_written can be a problem, and our rules require that
+ only phase-change writeout can set writeback.  So the cleaner can never
+ wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
+ really need a lock, just 'wait_writeback'.
+
+08Aug2010
+ So the cleaner is in run_orphans, dir_handle_orphan, pin_dblock,
+ iolock_written.
+ It is in writeback, waiting on 74/BIGNUM from file.c:329.  So writepage
+ tried to write a block in a directory .. but it is PinPending, so that
+ must have been set after writepage got it...
+ lafs_dir_handle_orphan gets an async lock, then sets PinPending.
+ If writepage is before that, it will have the lock and dir_handle will
+ try later.
+ If writepage is after, it will block on the lock, or see PinPending and
+ release the lock.
+ So someone else must be clearing PinPending!
+  - checkpoint clears and re-sets under the lock, so that is safe
+  - dir.c clears under i_mutex
+      dir_handle_orphans always holds i_mutex ... or does it.
+  - refile drops it when the last non-lru reference goes.
+  - inode_map_new_abort clears it for the inode
+ No, none of those - just a bad test on the result of
+ iolock_written_async ;-(
+
+ Now have an interesting deadlock.
+ rm in lafs_delete_inode in inode_map_free is waiting for the block to
+ flush, which requires the cleaner.
+ The cleaner thread in inode_handle_orphan is calling erase_dblock
+ on the same inode, which blocks while inode_map_free has it locked....
+ no, not the same block - it is just waiting for writeout, which requires
+ the cleaner.
+ lafs_erase_dblock from inode_map_free must be async!
+ pin_dblock in lafs_orphan_release must be too.... no - only the setting of
+ PinPending needs to be async or outside the cleaner, which it is.
+
+ Ok, got that fixed.  All seems happy again, time for a commit.
+