Would we be spinning on -EAGAIN?? 4 empty segments are present.
+ 6a/ index.c:1947 - lafs_add_block_address of index block where parent
+ has depth of 1.
+looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
+/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
+
+ 6b/ check_seg_cnt seems to be spinning on the 3rd section
+ the clean list has no end!
+ we were in seg scan
+CLEANABLE: 0/0 y=0 u=0 cpy=32773
+CLEANABLE: 0/1 y=0 u=0 cpy=32773
+CLEANABLE: 0/2 y=0 u=0 cpy=32773
+CLEANABLE: 0/3 y=32773 u=6 cpy=32773
+CLEANABLE: 0/4 y=32772 u=124 cpy=32773
+CLEANABLE: 0/5 y=32771 u=273 cpy=32773
+CLEANABLE: 0/6 y=32770 u=0 cpy=32773
+
+of
+0 0
+1
+2
+3 6
+4 124
+5 273
+6 0
+7 496
+8 0
+
+
+ 6c/ at shut down, some simple orphans remain
+ missing wakeup ???
+
DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
Understand why CleanSpace can be tried and fail 1000
times before there is any change.
+ 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+ a wakeup.
+ write lafs_iolock_written_async.
+
+ 7l/ make sure i_blocks is correct.
+
DONE 8/ looping in do_checkpoint
root is still in Phase1 because 0/2 is in Phase1
[cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
DONE 15d/ What does I_Dirty mean - and implement it.
15e/ setattr should queue an update for the inode metadata.
+ and clean up lafs_write_inode at the same time (it shouldn't do an update).
+ and confirm when s_dirt should be set. It causes fsync to run a
+ checkpoint.
15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
## Items from 6 jul 2007.
in erase_dblock, but that won't complete until cleaner gets to run,
but this is the cleaner blocked on orphans.
+15i/ separate thread management from 'cleaner' name.
16/ Update locking.doc
50/ Support O_DIRECT
+51/ Check support for multiple devices
+ - add a device to a live array
+ - remove a device from a live array
+
26June2010
Investigating 5a
So we could easily not have a my_inode - e.g. just cleaning the data block.
->my_inode cannot disappear while we hold the block, so a test is safe.
+
+
+ ----------------------------------------------
+ Space reservation and file-system-full conditions.
+
+ Space is needed for everything we write.
+ Some things we can reject if the fs is too full
+ Some things we can delay when space is tight
+ Some things we need to write in order to free up space.
+ Others absolutely must be written so we need to always have
+ a reserve.
+
+ The things that must be written are
+ - cluster header - which we never allocate
+ - some seg-usage and youth blocks - and quota blocks
+ These continually have credit attached - it is a bug if there
+ are not enough. (We hit this bug.)
+
+ Things that we need to write to free up space are
+ any block - data or index - that the cleaner finds.
+
+ Things that we can delay, but not fail, are any change to a block that
+ has already been written or allocated.
+
+ When space is needed it can come from one of three places.
+ - the remainder of the current main segment
+ - the remainder of the current cleaner segment
+ - a new segment.
+
+ Only Realloc blocks can go to the cleaner segment, so the
+ 'must write' blocks cannot go there, so unused + main must have enough
+ space for all those.
+ Realloc blocks can go anywhere - we don't need a cleaner segment if things
+ are too tight.
+
+ When we run out of space there are several things we can do to get more:
+ - incorporate index blocks. This tends to free up uninc-credits which
+ are normally over-allocated for safety.
+ - cluster_allocate/cluster_flush so more blocks get allocated and so
+ more can be incorporated. See above. This is probably most helpful
+ for data blocks.
+ - clean several segments into whole cleaner segments or into the main segment.
+ Much of this happens by triggering a snapshot, however we should only do that
+ when we have full cleaner-segments (or zero cleaner segments).
+
+ When cleaning we don't want to over-clean. i.e. we don't want to commit
+ any blocks from a second segment if that will stop us from committing blocks
+ from the first segment. Otherwise we might use one cleaning segment up by
+ making 4 half-clean. This doesn't help.
+
+
+ So: we reserve multiple segments for the cleaner, possibly zero.
+
+ We clean up to that many segments at a time, though if that many is zero,
+ we clean one segment at a time.
+ lafs_cluster_allocate only succeeds if there was room in an allocated segment.
+ If allocating a new segment fails, the cluster_allocate must fail. This
+ will push extra cleaning into the main segment where allocations must not
+ fail.
+
+ The last 3(?) [adjusted for number of snapshots] segments can only be allocated
+ to the main segment, and this space can only be used for cleaning.
+ Once the "free_space - allocated_space" drops below one segment, we
+ force a checkpoint. This should free up at least one segment.
+
+ We need some point at which we stop cleaning because the chance of finding
+ something to clean is too low. At that point all 'new' requests definitely
+ become failures. They might do earlier too.
+ Possibly at some point we start discounting youth from new usage scores so
+ that the list becomes sorted by usage.
+
+
+ Need:
+ cut-off point for free_seg where we don't allow cleaner to use segments
+ 3? 4?
+
+ event when we start using fixed '0x8000' youth for new segment scores.
+ Maybe when we clean a segment with usage gap below 16 or 1/128
+ event when we stop doing that.
+ Maybe when free_segs cross some number - 8?
+
+ point when alloc failure for NewSpace becomes ENOSPC
+ same as above?
+
+ point when we don't bother cleaning
+ no cleaner segments can be allocated, and checkpoint did not increase
+ number of clean segments (used as many as freed).
+ Clear this state when something is deleted.
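The scoring change sketched above (pin youth at the fixed 0x8000 so the cleanable list degenerates to being sorted by usage) could look like the toy function below. Only the 0x8000 constant comes from these notes; the function name and the additive youth+usage combination are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative cleanability score.  Normally youth and usage both
 * contribute; once space is tight, youth is pinned at the fixed 0x8000
 * from the notes, so ordering is decided by usage alone.  The additive
 * combination is a guess, not the real lafs formula. */
static unsigned seg_score(unsigned youth, unsigned usage, bool fixed_youth)
{
    if (fixed_youth)
        youth = 0x8000;
    return youth + usage;
}
```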
+
+
+ Allocations come out of free_blocks which does not include those
+ segments that have been promised to the cleaner.
+ CleanSpace and AccountSpace cannot fail.
+ We *know* not to ask for too many - cleaner knows when to stop.
+ ReleaseSpace fails (to be retried) if available space is below a threshold,
+ provided the cleaner hasn't been stopped.
+ NewSpace fails if below a somewhat higher threshold. If we have entered
+ emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
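A minimal user-space sketch of the request policy described above. All names here (`space_request`, `fs_space`, the threshold fields) are hypothetical, not real lafs identifiers; following the later work item ("emergency cleaning mode which causes ENOSPC to be returned"), emergency mode is taken to be the state in which NewSpace fails hard.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical space-request kinds, mirroring the names in the notes. */
enum space_kind { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

struct fs_space {
    long free_blocks;        /* excludes segments promised to the cleaner */
    long release_threshold;  /* ReleaseSpace retries below this */
    long new_threshold;      /* NewSpace fails below this (higher) */
    bool emergency_cleaning; /* set once cleaning can no longer help */
};

/* Returns 0 on success, -EAGAIN for "retry later", -ENOSPC for hard failure. */
static int space_request(struct fs_space *fs, enum space_kind kind, long blocks)
{
    switch (kind) {
    case CleanSpace:
    case AccountSpace:
        /* Cannot fail: the cleaner knows not to ask for too much. */
        fs->free_blocks -= blocks;
        return 0;
    case ReleaseSpace:
        if (fs->free_blocks - blocks < fs->release_threshold)
            return -EAGAIN;          /* retried, unless cleaner stopped */
        fs->free_blocks -= blocks;
        return 0;
    case NewSpace:
        if (fs->free_blocks - blocks < fs->new_threshold)
            /* assumption: emergency mode means cleaning cannot help */
            return fs->emergency_cleaning ? -ENOSPC : -EAGAIN;
        fs->free_blocks -= blocks;
        return 0;
    }
    return -EINVAL;
}
```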
+
+
+ Possibly limit some 'cleaner' segments to data only??
+
+
+ So: work items.
+ - change CleanSpace to never fail, but cluster_allocate new_segment
+ can fail for a cleaner segment. This is propagated through lafs_cluster_alloc
+ - cleaner pre-allocates cleaner segments (for new_segment to use)
+ and only cleans that many segments at a time.
+ - introduce emergency cleaning mode which causes ENOSPC to be returned
+ and ignores 'youth' on score.
+ - pause cleaner when we are so short of space that there is no point
+ trying until something is deleted.
+
+30june2010
+ notes on current issue with checkpoint misbehaving and running out of
+ segments.
+
+ 1/ don't want to cluster-flush too early. Ideally wait until segment is
+ full, but we currently hold writeback on everything so we cannot delay
+ indefinitely.
+ 2/ row goes negative!! let's see...
+
+ seg_remainder doesn't change the set, but just returns
+ the remaining rows times the width
+
+ seg_step moves nxt_* to *, stepping to the next ... row?
+ saves the current position as 'st_*'
+
+ seg_setsize - allocate space in the segment for 'size' blocks plus
+ a bit to round off to a whole number of table/rows
+ nxt_table nxt_row
+
+ seg_setpos initialises the seg to a location and makes it empty,
+ st_ and nxt_ are the same
+
+ seg_next reports address of next block, and moves forward.
+
+ seg_addr simply reports address of next block
+
+ So the sequence should be:
+
+ seg_setpos to initialise
+ seg_remainder as much as you want
+ seg_setsize when we start a cluster
+ seg_next up to seg_remainder times
+ seg_step to go to next cluster (when not seg_setpos).
+ or maybe just before seg_setpos
+
+ Need cluster_reset to be called after new_segment, or after we
+ flush a cluster but don't need a new_segment.
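The seg_* calls just described can be pinned down with a toy user-space model. The geometry here (a single table of rows × width, linear addresses) and every field name are simplifications for illustration, not the real lafs structures, which spread rows across devices.

```c
#include <assert.h>

/* Toy model of the seg_* cursor: a segment is a grid of 'rows' rows,
 * each 'width' blocks wide.  All names and fields are illustrative. */
struct seg_pos {
    int width, rows;   /* geometry of the segment */
    int row, col;      /* current position (next block to hand out) */
    int st_row;        /* saved start of the current cluster */
    int nxt_row;       /* first row after the space set aside by setsize */
};

/* seg_setpos: point the cursor at a location and make the cluster empty
 * (st_ and nxt_ the same). */
static void seg_setpos(struct seg_pos *s, int row)
{
    s->row = s->st_row = s->nxt_row = row;
    s->col = 0;
}

/* seg_remainder: blocks left in the segment; does not move the cursor. */
static int seg_remainder(const struct seg_pos *s)
{
    return (s->rows - s->row) * s->width - s->col;
}

/* seg_setsize: set aside 'size' blocks, rounded up to whole rows. */
static void seg_setsize(struct seg_pos *s, int size)
{
    int end = s->row * s->width + s->col + size;
    s->nxt_row = (end + s->width - 1) / s->width;
}

/* seg_addr: linear address of the next block, without moving. */
static int seg_addr(const struct seg_pos *s)
{
    return s->row * s->width + s->col;
}

/* seg_next: hand out the next block and step forward. */
static int seg_next(struct seg_pos *s)
{
    int a = seg_addr(s);
    if (++s->col == s->width) { s->col = 0; s->row++; }
    return a;
}

/* seg_step: finish the current cluster; the next cluster starts where
 * setsize rounded to, and that becomes the new saved start. */
static void seg_step(struct seg_pos *s)
{
    s->st_row = s->row = s->nxt_row;
    s->col = 0;
}
```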
+
+ I think I'm cleaning too early ... I am even cleaning
+ the current main segment!!!!
+
+ OK, I got rid of the worst bugs. Now it just keeps cleaning
+ the same blocks in the current segment over and over.
+ 2 problems I see
+ 1/ it cleans a segment that it should not touch
+ We need to avoid cleaner segment increasing the
+ checkpoint youth number.
+ 2/ it has 6 free segments and doesn't use them
+
+ clean_reserved is 3 segments, < 4, so free_blocks <= allocated + watermark
+ watermark is 4 segs, so free < 4. So we have 3 allocated to cleaner,
+ 3 in reserve and so nothing much to clean!
+
+ The heuristic for returning ENOSPC is not working. Need something more
+ directly related to what is happening.
+ Maybe if cleaning doesn't actually increase free space.
+
+ !Need to leave segments in the table until we have finished
+ writing to them, so they cannot be cleanable. - DONE
+
+ WAIT - problem. If cleaner segment is part-used, the alloc_cleaner_segs
+ doesn't count that. Bad?
+
+ When nearly full we keep checkpointing even though it cannot help.
+ Need clearer rules on when there is any point pushing forward.
+ Need to know when to fail requests.
+
+02 july 2010
+
+ I am wasting lots of space creating snapshots that don't serve any
+ purpose.
+ The reasons for creating a snapshot are:
+ - turn clean segments into free segments
+ - reduce size of required roll-forward
+ - possibly flush all inode updates for 'sync'.
+
+ We currently force one when
+ newblocks > max_newblocks
+ max is 1000, newblocks is never reset!
+ probably make that a number of segments.
+ lafs_checkpoint_start is called
+ when cleaner blocks, and space is available
+ at shutdown
+ on write_super if s_dirt
+ __fsync_super before ->sync_fs
+ freeze_bdev
+ fsync_super
+ fsync_bdev
+ do_remount_sb
+ generic_shutdown_super before put_super if s_dirt
+ sync_supers if s_dirt
+ do_sync
+ file_sync !!! if s_dirt
+
+ I think I should move checkpoint_start to
+ ->sync_fs
+
+
+ After testing
+ - blocks remaining after truncate - one index and 1-4 data
+ - truncate finds blocks being cleaned
+ FIXED - move setting of I_Trunc
+ - orphans aren't being cleaned up sometimes.
+ Hacked by forcing the thread to run.
+ - parent of index block has depth==1
+ Don't reduce depth while dirty children.
+ Probably don't want uninc either?
+
+ - some sort of deadlock? lafs_cluster_update_commit_both
+ has got the wc lock and wants to flush
+ writepage also is flushed.
+ Not sure what the blockage is.
+ I think the writepage is the one in cluster_flush, and it
+ is blocking
+
+ - Async is keeping 16/0 pinned during shutdown
+03July2010
+
+ Testing overnight with 250 runs produced:
+ - blocked for more than 120 seconds
+ Cleaner tries to get an inode that is being deleted
+ and blocks, so inode_map_free is blocked waiting for
+ checkpoint to finish - deadlock.
+ Need to create a ->drop_inode which provides interlock with
+ cleaner/iget
+
+ But this is hard to get right.
+ generic_forget_inode needs to write_inode_now and flush all changes
+ out and then truncate the pages off so the inode will be
+ empty and can be freed. But flushing needs the cleaner thread
+ which can block on the inode lookup.
+ Ahh.... I can abuse iget5_locked.
+ If test sees I_WILL_FREE or similar, it fails and sets a flag.
+ If the flag was set, then 'set' fails.
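The test/set interlock sketched above can be modelled in user space. `iget5_locked(sb, hashval, test, set, data)` is the real VFS helper being "abused"; everything else below (the struct, names, and the single flat chain standing in for the inode hash) is a toy model of the flag dance, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define I_WILL_FREE 0x1

struct toy_inode { int ino; int state; };

struct lookup_key {
    int ino;
    int saw_dying;   /* set by 'test' when it finds a dying inode */
};

/* 'test' callback: match the inode number, but refuse to take a
 * reference to a dying inode - record that we saw it instead. */
static int toy_test(struct toy_inode *inode, void *data)
{
    struct lookup_key *key = data;
    if (inode->ino != key->ino)
        return 0;
    if (inode->state & I_WILL_FREE) {
        key->saw_dying = 1;
        return 0;
    }
    return 1;
}

/* 'set' callback: refuse to create a new inode over a dying one. */
static int toy_set(struct toy_inode *inode, void *data)
{
    struct lookup_key *key = data;
    if (key->saw_dying)
        return -1;
    inode->ino = key->ino;
    return 0;
}

/* Minimal iget5_locked stand-in over a fixed chain of inodes. */
static struct toy_inode *toy_iget(struct toy_inode *chain, int n,
                                  struct toy_inode *fresh, void *data)
{
    for (int i = 0; i < n; i++)
        if (toy_test(&chain[i], data))
            return &chain[i];
    if (toy_set(fresh, data) < 0)
        return NULL;          /* caller retries once the free completes */
    return fresh;
}
```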
+
+
+ - block.c:504 DONE (I think).
+ unlink/delete_commit dirties a block without credits
+ It could have been just cleaned..
+ It looks like it was in Writeback for the cleaner when
+ unlink pinned and allocated it....
+ or maybe it was on a cluster (due to writepage) when
+ it was pinned. Then cluster_flush cleared dirty ... but
+ it should still have a Credit.
+ Maybe I should iolock the block ??
+
+ On reflection it wasn't cleaning, just tiny clusters
+ of recent changes which were originally written as tiny
+ checkpoints. Maybe lots of directory updates triggered the clusters.
+ I guess writepage is being called to sync the directory???
+ Or maybe the checkpoint was pushed by s_dirt being set.
+
+ So use PinPending and iolock to protect dir blocks from writepage.
+
+ - dir.c:1266 DONE
+ dir handle orphan finds a block (74/0) which is not
+ valid
+ This can happen if orphan_release failed to reserve a block.
+ We need to retry the release.
+ - inode.c:615
+ index block and some data blocks still accounted to deleted file.
+
+ No theory on this yet. Always one index block and a small number
+ of data blocks. Maybe the index block looked dirty, but was then
+ incorporated with something that was missed from the children list...
+ Or maybe I_Trunc is cleared a bit early...
+ Or trunc_next advanced too far?? or too soon
+ ??
+
+ - segments.c:640 DONE
+ prealloc in the cleaner finds all 2315 free blocks allocated.
+ no clean reserved.
+ Need to be able to fail CleanSpace requests when cleaner_reserve
+ is all gone.??
+
+ or just slow down the cleaner to one segment per checkpoint when
+ we are tight.. Hope that works.
+ - super.c:699
+ async flag on 16/0 keeping block pinned
+ Maybe clear Async flag during checkpoint. Cleaner won't need it
+ No, just ensure to clear Async on all successful async calls.
+
+ orphan file 8/0 has orphan reference keeping parent pinned
+ [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
+ Orphan handling is failing to get a reservation to write out the
+ orphan file block? Not convincing as there should be lots of space
+ at unmount, and 'orphan sleeping' has become empty.
+
+ - Show State
+ orphan inode blocked by leaf index stuck in writeback:
+ [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
+ [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
+
+ This is in the write-cluster waiting to be flushed
+
+
+9July2010
+ Review B_Async.
+ If a thread wants async something, it
+ - sets B_Async
+ - checks if it can have what it wants.
+ + if not, fail
+ + if so, clear B_Async and succeed
+
+ If a thread releases something that might be requested Async,
+ it doesn't clear Async, but wakes up *the* thread.
+
+ This applies to
+ IOLock - iolock_block
+ Writeback - writeback_done, iolock_written
+ Valid - erase_dblock, wait_block
+ inode I_* - iget / drop_inode
+
+ orphan handler, cleaner, segscan - all in the cleaner thread.
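The B_Async protocol above can be sketched with pthreads. The flag names mirror the notes, but the mutex/condvar plumbing, function names, and single-waiter assumption (one async thread, so a plain signal stands in for "wake *the* thread") are illustrative only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define B_ASYNC  0x1u
#define B_IOLOCK 0x2u

struct toy_block {
    unsigned flags;
    pthread_mutex_t lock;
    pthread_cond_t async_wake;   /* stands in for waking *the* thread */
};

/* Async attempt to take the IO lock: set B_Async, check; on success
 * clear B_Async ourselves, on failure leave it set and return false. */
static bool iolock_async(struct toy_block *b)
{
    pthread_mutex_lock(&b->lock);
    b->flags |= B_ASYNC;
    if (b->flags & B_IOLOCK) {
        pthread_mutex_unlock(&b->lock);
        return false;             /* caller sleeps until async_wake */
    }
    b->flags &= ~B_ASYNC;
    b->flags |= B_IOLOCK;
    pthread_mutex_unlock(&b->lock);
    return true;
}

/* Release: do not touch B_Async, just wake the async thread if it asked. */
static void iounlock(struct toy_block *b)
{
    pthread_mutex_lock(&b->lock);
    b->flags &= ~B_IOLOCK;
    if (b->flags & B_ASYNC)
        pthread_cond_signal(&b->async_wake);
    pthread_mutex_unlock(&b->lock);
}
```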