From 5a928587a59779ad28509c856f4a25e6f4c034d7 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Fri, 9 Jul 2010 16:29:39 +1000
Subject: [PATCH] README update and spelling fixes.

Signed-off-by: NeilBrown
---
 README   | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 orphan.c |   4 +-
 2 files changed, 374 insertions(+), 2 deletions(-)

diff --git a/README b/README
index 692711a..3806134 100644
--- a/README
+++ b/README
@@ -4816,6 +4816,37 @@ DONE 5d/ At unmount, 16/1 is still pinned.
     Would we be spinning on -EAGAIN ??
     4 empty segments are present.

+ 6a/ index.c:1947 - lafs_add_block_address of index block where parent
+     has depth of 1.
+looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
+/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
+
+ 6b/ check_seg_cnt seems to be spinning on the 3rd section
+     the clean list has no end!
+     we were in seg scan
+CLEANABLE: 0/0 y=0 u=0 cpy=32773
+CLEANABLE: 0/1 y=0 u=0 cpy=32773
+CLEANABLE: 0/2 y=0 u=0 cpy=32773
+CLEANABLE: 0/3 y=32773 u=6 cpy=32773
+CLEANABLE: 0/4 y=32772 u=124 cpy=32773
+CLEANABLE: 0/5 y=32771 u=273 cpy=32773
+CLEANABLE: 0/6 y=32770 u=0 cpy=32773
+
+of
+0 0
+1
+2
+3 6
+4 124
+5 273
+6 0
+7 496
+8 0
+
+
+ 6c/ at shut down, some simple orphans remain
+     missing wakeup ???
+
 DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
     truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
     Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
@@ -4868,6 +4899,12 @@ DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
     Understand why CleanSpace can be tried and failed 1000 times
     before there is any change.
+ 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+     a wakeup.
+     write lafs_iolock_written_async.
+
+ 7l/ make sure i_blocks is correct.
+
 DONE 8/ looping in do_checkpoint
     root is still in Phase1 because 0/2 is in Phase 1
     [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid writepageflush(1)
@@ -4933,6 +4970,9 @@ DONE 15b/ Report directory size less confusingly
 DONE 15d/ What does I_Dirty mean - and implement it.
 15e/ setattr should queue an update for the inode metadata.
+     and clean up lafs_write_inode at the same time (it shouldn't do an update).
+     and confirm when s_dirt should be set.  It causes fsync to run a
+     checkpoint.
 15f/ include timestamp in cluster_head to set mtime/ctime properly on
     roll-forward?

 ## Items from 6 jul 2007.
@@ -4944,6 +4984,7 @@ DONE 15d/ What does I_Dirty mean - and implement it.
     in erase_dblock, but that won't complete until cleaner gets to run,
     but this is the cleaner blocked on orphans.
+15i/ separate thread management from 'cleaner' name.

 16/ Update locking.doc
@@ -5038,6 +5079,10 @@ DONE 15d/ What does I_Dirty mean - and implement it.
 50/ Support O_DIRECT

+51/ Check support for multiple devices
+    - add a device to a live array
+    - remove a device from a live array
+
 26June2010
   Investigating 5a
@@ -5209,3 +5254,330 @@ DONE 15d/ What does I_Dirty mean - and implement it.
     So we could easily not have a my_inode - e.g. just cleaning the data block.
     ->my_inode cannot disappear while we hold the block, so a test is safe.
+
+
+ ----------------------------------------------
+ Space reservation and file-system-full conditions.
+
+ Space is needed for everything we write.
+ Some things we can reject if the fs is too full.
+ Some things we can delay when space is tight.
+ Some things we need to write in order to free up space.
+ Others absolutely must be written, so we need to always have
+ a reserve.
+
+ The things that must be written are
+  - cluster header - which we never allocate
+  - some seg-usage and youth blocks - and quota blocks
+ These continually have credit attached - it is a bug if there
+ are not enough.  (We hit this bug)
+
+ Things that we need to write to free up space are
+ any block - data or index - that the cleaner finds.
+
+ Things that we can delay, but not fail, are any change to a block that
+ has already been written or allocated.
+
+ When space is needed it can come from one of three places.
+  - the remainder of the current main segment
+  - the remainder of the current cleaner segment
+  - a new segment.
+
+ Only Realloc blocks can go to the cleaner segment, so the
+ 'must write' blocks cannot go there, so unused main must have enough
+ space for all those.
+ Realloc blocks can go anywhere - we don't need a cleaner segment if things
+ are too tight.
+
+ When we run out of space there are several things we can do to get more:
+  - incorporate index blocks.  This tends to free up uninc-credits which
+    are normally over-allocated for safety.
+  - cluster_allocate/cluster_flush so more blocks get allocated and so
+    more can be incorporated.  See above.  This is probably most helpful
+    for data blocks.
+  - clean several segments into whole cleaner segments or into the main segment.
+ Much of this happens by triggering a snapshot; however we should only do that
+ when we have full cleaner-segments (or zero cleaner segments).
+
+ When cleaning we don't want to over-clean, i.e. we don't want to commit
+ any blocks from a second segment if that will stop us from committing blocks
+ from the first segment.  Otherwise we might use one cleaning segment up by
+ making 4 half-clean.  This doesn't help.
+
+
+ So: we reserve multiple segments for the cleaner, possibly zero.
+
+ We clean up to that many segments at a time, though if that many is zero,
+ we clean one segment at a time.
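The batching rule above - clean up to the reserved number of segments, at least one, and never let a second source segment squeeze out the first - can be sketched in a few lines. This is a hedged illustration: the names (`clean_batch`, `may_clean_second`, `dest_free`) are invented for this sketch and are not the real LAFS interfaces.

```c
#include <assert.h>

/* How many segments to clean in one pass: up to the number reserved
 * for the cleaner, but at least one (cleaning into the main segment
 * when the reserve is zero).  Illustrative only. */
static int clean_batch(int reserved_cleaner_segs)
{
	return reserved_cleaner_segs > 0 ? reserved_cleaner_segs : 1;
}

/* Don't over-clean: only start committing blocks from a second source
 * segment if its live blocks fit in the destination alongside the
 * first segment's, so cleaning the second can never stop us from
 * committing blocks of the first. */
static int may_clean_second(long dest_free, long first_live, long second_live)
{
	return first_live + second_live <= dest_free;
}
```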
+ lafs_cluster_allocate only succeeds if there was room in an allocated segment.
+ If allocating a new segment fails, the cluster_allocate must fail.  This
+ will push extra cleaning into the main segment where allocations must not
+ fail.
+
+ The last 3(?) [adjusted for number of snapshots] segments can only be allocated
+ to the main segment, and this space can only be used for cleaning.
+ Once the "free_space - allocated_space" drops below one segment, we
+ force a checkpoint.  This should free up at least one segment.
+
+ We need some point at which we stop cleaning because the chance of finding
+ something to clean is too low.  At that point all 'new' requests definitely
+ become failures.  They might do so earlier too.
+ Possibly at some point we start discounting youth from new usage scores so
+ that the list becomes sorted by usage.
+
+
+ Need:
+   cut-off point for free_seg where we don't allow cleaner to use segments
+     3? 4?
+
+   event when we start using fixed '0x8000' youth for new segment scores.
+     Maybe when we clean a segment with usage gap below 16 or 1/128
+   event when we stop doing that.
+     Maybe when free_segs crosses some number - 8?
+
+   point when alloc failure for NewSpace becomes ENOSPC
+     same as above?
+
+   point when we don't bother cleaning
+     no cleaner segments can be allocated, and checkpoint did not increase
+     number of clean segments (used as many as freed).
+     Clear this state when something is deleted.
+
+
+ Allocations come out of free_blocks, which does not include those
+ segments that have been promised to the cleaner.
+ CleanSpace and AccountSpace cannot fail.
+ We *know* not to ask for too many - cleaner knows when to stop.
+ ReleaseSpace fails (to be retried) if available is below a threshold,
+ providing the cleaner hasn't been stopped.
+ NewSpace fails if below a somewhat higher threshold.  If we haven't entered
+ emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
+
+
+ Possibly limit some 'cleaner' segments to data only??
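A minimal sketch of the request-class policy just described. The class names (NewSpace, ReleaseSpace, CleanSpace, AccountSpace) and the error behaviour come from the notes; the `fs_space` struct, the threshold values, and the function name are assumptions made for illustration, not the real LAFS code.

```c
#include <assert.h>
#include <errno.h>

enum alloc_class { NewSpace, ReleaseSpace, CleanSpace, AccountSpace };

struct fs_space {
	long free_blocks;	/* excludes segments promised to the cleaner */
	long seg_size;		/* blocks per segment (illustrative) */
	int  emergency_clean;	/* emergency cleaning mode active */
};

/* 0 = granted; -ENOSPC / -EAGAIN = refused, per the rules above. */
static int space_request(struct fs_space *fs, enum alloc_class c, long blocks)
{
	long release_min = fs->seg_size;	/* guessed threshold */
	long new_min = 2 * fs->seg_size;	/* "somewhat higher" threshold */

	switch (c) {
	case CleanSpace:
	case AccountSpace:
		/* cannot fail: callers *know* not to ask for too many */
		fs->free_blocks -= blocks;
		return 0;
	case ReleaseSpace:
		if (fs->free_blocks - blocks < release_min)
			return -EAGAIN;		/* to be retried */
		fs->free_blocks -= blocks;
		return 0;
	case NewSpace:
		if (fs->free_blocks - blocks < new_min)
			/* -ENOSPC unless emergency cleaning may still help */
			return fs->emergency_clean ? -EAGAIN : -ENOSPC;
		fs->free_blocks -= blocks;
		return 0;
	}
	return -ENOSPC;
}
```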
+
+
+ So: work items.
+  - change CleanSpace to never fail, but cluster_allocate new_segment
+    can fail for a cleaner segment.  This is propagated through lafs_cluster_alloc
+  - cleaner pre-allocates cleaner segments (for new_segment to use)
+    and only cleans that many segments at a time.
+  - introduce emergency cleaning mode which causes ENOSPC to be returned
+    and ignores 'youth' on score.
+  - pause cleaner when we are so short of space that there is no point
+    trying until something is deleted.
+
+30june2010
+ notes on current issue with checkpoint misbehaving and running out of
+ segments.
+
+ 1/ don't want to cluster-flush too early.  Ideally wait until segment is
+    full, but we currently hold writeback on everything so we cannot delay
+    indefinitely.
+ 2/ row goes negative!!  let's see...
+
+   seg_remainder doesn't change the set, but just returns
+     the remaining rows times the width
+
+   seg_step moves nxt_* to *, stepping to the next ... row?
+     saves current as st_*
+
+   seg_setsize - allocates space in the segment for 'size' blocks plus
+     a bit to round off to a whole number of table/rows
+     nxt_table nxt_row
+
+   seg_setpos initialises the seg to a location and makes it empty,
+     st_ and nxt_ are the same
+
+   seg_next reports address of next block, and moves forward.
+
+   seg_addr simply reports address of next block
+
+ So the sequence should be:
+
+   seg_setpos to initialise
+   seg_remainder as much as you want
+   seg_setsize when we start a cluster
+   seg_next up to seg_remainder times
+   seg_step to go to next cluster (when not seg_setpos).
+     or maybe just before seg_setpos
+
+ Need cluster_reset to be called after new_segment, or after we
+ flush a cluster but don't need a new_segment.
+
+ I think I'm cleaning too early ... I am even cleaning
+ the current main segment!!!!
+
+ OK, I got rid of the worst bugs.  Now it just keeps cleaning
+ the same blocks in the current segment over and over.
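The seg_* call sequence described above can be modelled as a userspace toy, which makes the intended state transitions easier to check. This is a hedged sketch: the field names loosely follow the notes' st_*/nxt_* naming, but the real LAFS segpos tracks tables, rows, and widths, so everything here beyond the call sequence is an invention.

```c
#include <assert.h>

/* Toy model of the seg_* sequence: seg_setpos, seg_remainder,
 * seg_setsize, seg_next, seg_addr, seg_step.  Linear addresses only;
 * the real structure is table/row based. */
struct segpos {
	long start, size;	/* segment location and length in blocks */
	long cur;		/* next block to hand out */
	long st;		/* saved start of the current cluster */
	long nxt;		/* where the next cluster will start */
};

/* initialise to a location and make it empty: st_ and nxt_ the same */
static void seg_setpos(struct segpos *s, long start, long size)
{
	s->start = start;
	s->size = size;
	s->cur = s->st = s->nxt = start;
}

/* doesn't change the state, just reports what is left */
static long seg_remainder(const struct segpos *s)
{
	return s->start + s->size - s->cur;
}

/* allocate space in the segment for 'size' blocks (no rounding here) */
static void seg_setsize(struct segpos *s, long size)
{
	s->nxt = s->cur + size;
}

/* report address of next block, and move forward */
static long seg_next(struct segpos *s)
{
	return s->cur++;
}

/* simply report address of next block */
static long seg_addr(const struct segpos *s)
{
	return s->cur;
}

/* save current as st_*, move nxt_* to current: step to next cluster */
static void seg_step(struct segpos *s)
{
	s->st = s->cur;
	s->cur = s->nxt;
}
```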
+ 2 problems I see
+  1/ it cleans a segment that it should not touch
+     We need to avoid cleaner segment increasing the
+     checkpoint youth number.
+  2/ it has 6 free segments and doesn't use them
+
+    clean_reserved is 3 segments, < 4, so free_block <= allocated + watermark
+    watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
+    3 in reserve and so nothing much to clean!
+
+ The heuristic for returning ENOSPC is not working.  Need something more
+ directly related to what is happening.
+ Maybe if cleaning doesn't actually increase free space.
+
+ !Need to leave segments in the table until we have finished
+  writing to them, so they cannot be cleanable. - DONE
+
+ WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
+ doesn't count that.  Bad?
+
+ When nearly full we keep checkpointing even though it cannot help.
+ Need clearer rules on when there is any point pushing forward.
+ Need to know when to fail requests.
+
+02 july 2010
+
+ I am wasting lots of space creating snapshots that don't serve any
+ purpose.
+ The reasons for creating a snapshot are:
+  - turn clean segments into free segments
+  - reduce size of required roll-forward
+  - possibly flush all inode updates for 'sync'.
+
+ We currently force one when
+    newblocks > max_newblocks
+ max is 1000, newblocks is never reset!
+ probably make that a number of segments.
+ lafs_checkpoint_start is called
+    when cleaner blocks, and space is available
+    at shutdown
+    on write_super if s_dirt
+       __fsync_super before ->sync_fs
+       freeze_bdev
+       fsync_super
+       fsync_bdev
+       do_remount_sb
+    generic_shutdown_super before put_super if s_dirt
+    sync_supers if s_dirt
+       do_sync
+    file_sync !!! if s_dirt
+
+ I think I should move checkpoint_start to
+ ->sync_fs
+
+
+ After testing
+  - blocks remaining after truncate - one index and 1-4 data
+  - truncate finds blocks being cleaned
+      FIXED - move setting of I_Trunc
+  - orphans aren't being cleaned up sometimes.
+      Hacked by forcing the thread to run.
+  - parent of index block has depth==1
+      Don't reduce depth while there are dirty children.
+      Probably don't want uninc either?
+
+  - some sort of deadlock?  lafs_cluster_update_commit_both
+      has got the wc lock and wants to flush
+      writepage also is flushed.
+      Not sure what the blockage is.
+      I think the writepage is the one in cluster_flush, and it
+      is blocking
+
+  - Async is keeping 16/0 pinned during shutdown
+
+03July2010
+
+ Testing overnight with 250 runs produced:
+  - blocked for more than 120 seconds
+      Cleaner tries to get an inode that is being deleted
+      and blocks, so inode_map_free is blocked waiting for
+      checkpoint to finish - deadlock.
+      Need to create a ->drop_inode which provides interlock with
+      cleaner/iget
+
+      But this is hard to get right.
+      generic_forget_inode needs to write_inode_now and flush all changes
+      out and then truncate the pages off so the inode will be
+      empty and can be freed.  But flushing needs the cleaner thread
+      which can block on the inode lookup.
+      Ahh.... I can abuse iget5_locked.
+      If 'test' sees I_WILL_FREE or similar, it fails and sets a flag.
+      If the flag was set, then 'set' fails.
+
+
+  - block.c:504  DONE (I think).
+      unlink/delete_commit dirties a block without credits
+      It could have been just cleaned..
+      It looks like it was in Writeback for the cleaner when
+      unlink pinned and allocated it....
+      or maybe it was on a cluster (due to writepage) when
+      it was pinned.  Then cluster_flush cleared dirty ... but
+      it should still have a Credit.
+      Maybe I should iolock the block ??
+
+      On reflection it wasn't cleaning, just tiny clusters
+      of recent changes which were originally written as tiny
+      checkpoints.  Maybe lots of directory updates triggered the clusters.
+      I guess writepage is being called to sync the directory???
+      Or maybe the checkpoint was pushed by s_dirt being set.
+
+      So use PinPending and iolock to protect dir blocks from writepage.
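The iget5_locked trick from the 120-second-hang entry above might work roughly as below. This is a hedged userspace mock, not kernel code: I_WILL_FREE and the test/set callback shape are real Linux concepts (the real callbacks take a `void *` opaque pointer and run under the inode hash lock), but the table, the `saw_freeing` flag, and every function name here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the "abuse iget5_locked" idea: 'test' refuses an inode that
 * is on its way out and records that fact, and 'set' then refuses to
 * instantiate a replacement, so the lookup fails instead of
 * deadlocking against the freeing path. */
#define I_WILL_FREE 0x01	/* stand-in for the kernel flag */

struct mock_inode { unsigned long ino; unsigned state; };
struct lookup_data { unsigned long ino; int saw_freeing; };

static int test_cb(struct mock_inode *inode, struct lookup_data *d)
{
	if (inode->ino != d->ino)
		return 0;
	if (inode->state & I_WILL_FREE) {
		d->saw_freeing = 1;	/* remember: do not re-create it */
		return 0;
	}
	return 1;
}

static int set_cb(struct mock_inode *inode, struct lookup_data *d)
{
	if (d->saw_freeing)
		return -1;		/* 'set' fails: caller backs off */
	inode->ino = d->ino;
	return 0;
}

/* simplified iget5-style lookup over a fixed table (ino 0 == free slot) */
static struct mock_inode *mock_iget5(struct mock_inode *tbl, int n,
				     unsigned long ino)
{
	struct lookup_data d = { ino, 0 };
	int i;

	for (i = 0; i < n; i++)
		if (test_cb(&tbl[i], &d))
			return &tbl[i];
	for (i = 0; i < n; i++)
		if (tbl[i].ino == 0 && set_cb(&tbl[i], &d) == 0)
			return &tbl[i];
	return NULL;		/* being freed, or table full */
}
```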
+
+  - dir.c:1266  DONE
+      dir handle orphan finds a block (74/0) which is not
+      valid
+      This can happen if orphan_release failed to reserve a block.
+      We need to retry the release.
+  - inode.c:615
+      index block and some data blocks still accounted to deleted file.
+
+      No theory on this yet.  Always one index block and a small number
+      of data blocks.  Maybe the index block looked dirty, but was then
+      incorporated with something that was missed from the children list...
+      Or maybe I_Trunc is cleared a bit early...
+      Or trunc_next advanced too far?? or too soon
+      ??
+
+  - segments.c:640  DONE
+      prealloc in the cleaner finds all 2315 free blocks allocated.
+      no clean reserved.
+      Need to be able to fail CleanSpace requests when cleaner_reserve
+      is all gone??
+
+      or just slow down the cleaner to one segment per checkpoint when
+      we are tight..  Hope that works.
+  - super.c:699
+      async flag on 16/0 keeping block pinned
+      Maybe clear Async flag during checkpoint.  Cleaner won't need it
+      No, just ensure we clear Async on all successful async calls.
+
+      orphan file 8/0 has orphan reference keeping parent pinned
+      [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
+      Orphan handling is failing to get a reservation to write out the
+      orphan file block?  Not convincing as there should be lots of space
+      at unmount, and 'orphan sleeping' has become empty.
+
+  - Show State
+      orphan inode blocked by leaf index stuck in writeback:
+      [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
+      [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
+
+      This is in the write-cluster waiting to be flushed
+
+
+9July2010
+ Review B_Async.
+ If a thread wants something asynchronously, it
+  - sets B_Async
+  - checks if it can have what it wants.
+
+      if not, fail
+
+      if so, clear B_Async and succeed
+
+ If a thread releases something that might be requested Async,
+ it doesn't clear Async, but wakes up *the*thread*.
+
+ This applies to
+   IOLock    - iolock_block
+   Writeback - writeback_done, iolock_written
+   Valid     - erase_dblock, wait_block
+   inode I_* - iget / drop_inode
+
+ orphan handler, cleaner, segscan - all in the cleaner thread.
diff --git a/orphan.c b/orphan.c
index d429955..69464d0 100644
--- a/orphan.c
+++ b/orphan.c
@@ -564,8 +564,8 @@ void lafs_add_orphan(struct fs *fs, struct datablock *db)
 void lafs_orphan_forget(struct fs *fs, struct datablock *db)
 {
-	/* This is still and orphan, but we don't want to handle
-	 * it just now. When we we, lafs_add_orphan will be called */
+	/* This is still an orphan, but we don't want to handle
+	 * it just now. When we do, lafs_add_orphan will be called */
 	LAFS_BUG(!test_bit(B_Orphan, &db->b.flags), &db->b);
 	spin_lock(&fs->lock);
 	if (!list_empty(&db->orphans)) {
-- 
2.39.5