From 5befac9ad5b5fc6280c14c8a344490413ea4e50c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 14 Sep 2010 11:50:34 +1000
Subject: [PATCH] README update and some comment fixes.

Still working through the TODO list.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 README  | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 index.c |   7 +-
 2 files changed, 183 insertions(+), 20 deletions(-)

diff --git a/README b/README
index f95490e..26b1947 100644
--- a/README
+++ b/README
@@ -56,11 +56,6 @@ FIXME
   index pages never get put on an LRU - how is this supposed to work?
 
 
-
-
-
-
-
 --------------------------
 Thoughts:
   Inodes live in an address-space, much like a file.  To load the
@@ -5034,35 +5029,35 @@ DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
 DONE 15q/ check checksums when reading cluster_header for cleaner
        This is already done!
 
-15r/ consider further optimisation in cleaner to avoid lookups.
+DONE 15r/ consider further optimisation in cleaner to avoid lookups.
 
-15s/ memory barrier for i_size check in cleaner???
+DONE 15s/ memory barrier for i_size check in cleaner???
 
-15t/ review usable-space calculations in clean.
+DONE 15t/ review usable-space calculations in clean.
 
-15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
+DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
 
-15v/ tidy up all code that fiddles bits and credits - maybe make some
+DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
      common helpers.
 
-15w/ review cluster updates and make sure space used is accounted properly.
+DONE 15w/ review cluster updates and make sure space used is accounted properly.
 
-15x/ Consider caching result of a failed dir lookup in case we immediately
+DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
      try to create it.  Would this actually save anything significant?
 
-15y/ Don't make dir blocks into orphans if it cannot be needed?
+DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?
 
-15z/ make sure symlink creation is safe - do I need to log the body??
+DONE 15z/ make sure symlink creation is safe - do I need to log the body??
 
-15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
+DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
 
-15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
+DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
      after lock is taken on block?
 
-15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
+DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
       writeout.
 
-15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
+DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
         if more is needed.
 
 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
@@ -5194,12 +5189,28 @@ DONE 15q/ check checksums when reading cluster_header for cleaner
      - flags for inconsistencies found, at layout/fileset/file levels(?)
      - policies of whether old or new data is allowed on each device
      - policies of how much duplication of metadata is required
+     - inode map - not host-endian
 
 15cc/ free any stray B_ASync block found in destroy_inode
 
 15cd/ Some code assume a cluster header does not exceed 1 page.
      Is this safe?  Is in true? Is it enforced?
 
+15ce/ classify BUGs as
+        - internal logic errors
+        - IO errors
+        - unusual conditions I want a warning of
+        - data corruption errors
+
+15cf/ lafs_iget_fs need to sometimes to in-kernel mounts for subset filesystems
+     This is needed for the cleaner - the cleaner needs to hold a ref somehow.
+
+15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
+      stuff??
+
+15ch/ Review values of youth and checkpoint_youth and think about off-by-one
+     issues.
+
 16/ Update locking.doc
 
 17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
@@ -5328,6 +5339,23 @@ DONE 52/ NFS export
 61/ Try to make the inode struct smaller - maybe move some of the
     fs metadata into a separately-allocated struct.
 
+62/ System/trusted extended attributes:
+         fileset max size
+         directory hash/seed
+         
+63/ user extended attributes.
+
+64/ wonder if index blocks can be flushed out by memory pressure somehow.
+   e.g. if a data block is written by reclaim, flag the index block.
+   When a flagged index block has no children, it is incorporated and written.
+    ??
+
+65/ review why lafs_allocated_block needs the new_parent label.  Should not
+   lafs_incorporate leave all parents dirty? Maybe it is just the need for
+   B_Realloc - so maybe lafs_incorporate should leave the new block either
+   realloc or dirty rather than lafs_allocated_block doing it.?
+   See also 15ad below.
+
 26June2010
  Investigating 5a
 
@@ -6461,3 +6489,135 @@ WritePhase - what is that all about?
    So it should all be finished when we do a checkpoint.
 
    So if checkpoint, or release_page, finds an async block, drop it.
+
+ 15r - further optimisations in cleaner to avoid lookups.
+   We have fsnum,inum,blocknum and cluster seq number and trunc num.
+
+   I want to introduce more async though.  Currently it only loads
+   one inode at a time.
+   To do more, I need to mark inodes as 'done' when they are and always
+   restart from the start of the cluster (only do one cluster at a time
+   for now).
+   So if we get all the way though a cluster with no 'EAGAIN' we finish
+   with the cluster.
+
+ 15y - when could a directory block become an orphan?
+    - when deleting that last entry - we don't know if it can be fully
+      deleted until we look in next block
+    - when deleting an entry follows a chain back to the first block
+    - when deleting the last entry in the block.
+
+    So it could be an orphan if the entry found:
+        - is at end of block
+	- is first entry
+	- is only entry
+     or first entry is already deleted.
+
+15Aug2010
+  looking at flushing etc when run out of space.
+  We often force a checkpoint when it won't do any good as
+  nothing has been cleaned.
+  In fact we write lots of dead checkpoints to 0/0 until it is full,
+  then move on, clean 0/0 and suddenly have space.
+  We shouldn't do that.  sync should be what pushes us forwards.
+  Maybe that is fixed..
+
+  InoIdx blocks still cause confusion.  Should they ever have credits?
+  or do only the data block have those?  Certainly they cannot have
+  SegRef.
+  And there is confusion in my mind whether data blocks can be pinned 
+  while the InoIdx block is - need to clarify that.
+
+
+13Sep2010 - now, where was I...
+ - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
+ - test run: block.c:660 - no credits available while dirtying an InoIdx block during
+   orphan handling.  lafs_reserver_block (under checkpoint lock) should have set credit.
+   Only I just changed reserve_block to do that dblock instead - I wonder why.
+   OK, I think I cleaned that up...
+
+
+ - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
+   so pin_dblock did too.  So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
+   returned error.
+   Problem is that we don't push a checkpoint when cleaner runs out of things to do.
+   But we don't want to go back to pushing a checkpoint too often.
+   Maybe the problem is that we only force the checkpoint when we have enough space to do
+   new allocations, but we need to force it earlier if nothing new can be cleaned.
+
+   Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
+   we need to wake 'checkpoint_wait' then.
+   But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
+   ...anyway, that is sorted now at commit  95b6b05e460
+
+
+  So: InoIdx blocks.
+    - These never get SegRef as that is meaningless - done.
+    - These can have credits.  It possibly isn't necessary bit it makes things
+      easier.  They are 'written' by transfering the credits to the data block, or discarding them.
+    - I think dblock and iblock can both be pinned
+      The problem this caused was that the dblock might get processed as a leaf before iblock.
+      We now have lafs_is_leaf which causes dblock not be a leaf even if it is pinned, if the iblock
+      is pinned to the same phase.
+      lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
+      it unpins an iblock
+      So lafs_pin_dblock doesn't need to pin the inode instead.
+   OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
+
+ 15u - I don't need to get a segref there, but I need to have one from the original dirty block,
+       so fix that up - commit Mon Sep 13 15:28:08 2010 +100
+
+ 15v - What do we have?
+       lafs_dirty_dblock:  set Dirty, clear Credit clear NCredit
+                           set Uninc, clear Icredit clear NICredit
+       lafs_dirty_iblock:  set dirty, clear credit
+                           test uninc, clear ICredit, set Unincredit - not essential
+       mark_cleaning:      test realloc, / alloc / set realloc
+                           test dirty / clear realloc/ set credit
+                           set uninc clear icredit
+       cleaner_flush:      set dirty, clear realloc, clear credit
+                           test dirty, clear realloc set credit
+       flush_data_to_inode:
+       lafs_cluster_allocate - there is some odd code ther!!
+       flip_phase
+       lafs_allocated_block
+
+       all rather different really.
+       Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock
+
+ 15w/ Space used by cluster updates??
+       It is all fine - just some confusion of function names.
+
+ 15z/ logging symlink creation.
+      Do I need to log the content? I needs to be safe on a dir sync, and you cannot sync the
+      symlink itself.  So I guess we queue the block for writeout so it will go with the
+      dir update.
+      Yes, that works: Mon Sep 13 17:33:54 2010 +100
+
+ 15ab/ already did that in commit f90959e6f492b6
+
+
+ 15ac/ How can we trigger write-out of dirty index block which have no pin-count, thus allowing them to
+    be freed after the write completes?  A checkpoint could do it, but that would write out index block
+    that cannot be freed too.  A checkpoint would only be good after lots of data pages had been written.
+    We could just wait and let other processes kick in..
+
+    I don't think we need to do anything.  lafs_shrinker doesn't really know how tight memory
+    is, and periodic checkpoint will free up any memory that we are pinning.
+
+    .... but something is needed.  We need some trigger to write dirty index blocks
+    Maybe:
+	- a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
+        DONE THAT.
+
+    Not sure this is a complete solution.  I might want to incorp/flush index block when they
+    have no dirty children, but I'm not sure about that.
+
+14sep2010
+  15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
+     failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
+    This could split the parent which means the block could have been re-parented - it would have been in the
+    child list and so found and fixed.
+    lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
+    Inf this case, realloc isn't an issue, only dirty.  lafs_incorporate must have made it dirty and
+    it won't get written while it has these in-phase children, so all is happy.
diff --git a/index.c b/index.c
index 65d36c8..18adc7d 100644
--- a/index.c
+++ b/index.c
@@ -644,8 +644,11 @@ void lafs_phase_flip(struct fs *fs, struct indexblock *ib)
 		b2->chain = NULL;
 		clear_bit(B_Uninc, &b2->flags);
 		while (lafs_add_block_address(fs, b2) == 0)
-			/* FIXME do I need more like I do in
-			 * lafs_allocated_block */
+			/* We had to incorporate b2->parent which might
+			 * have split the parent but as that will have made
+			 * it dirty, there is nothing extra that we need to
+			 * do here
+			 */
 			;
 		putref(b2, MKREF(uninc));
 	}
-- 
2.39.5