From 5a928587a59779ad28509c856f4a25e6f4c034d7 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Fri, 9 Jul 2010 16:29:39 +1000
Subject: [PATCH] README update and spelling fixes.

Signed-off-by: NeilBrown
---
 README   | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 orphan.c |   4 +-
 2 files changed, 374 insertions(+), 2 deletions(-)

diff --git a/README b/README
index 692711a..3806134 100644
--- a/README
+++ b/README
@@ -4816,6 +4816,37 @@ DONE 5d/ At unmount, 16/1 is still pinned.
     Would we be spinning on -EAGAIN ??
     4 empty segments are present.

+ 6a/ index.c:1947 - lafs_add_block_address of index block where parent
+     has depth of 1.
+looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
+/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
+
+ 6b/ check_seg_cnt seems to be spinning on the 3rd section
+     the clean list has no end!
+     we were in seg scan
+CLEANABLE: 0/0 y=0 u=0 cpy=32773
+CLEANABLE: 0/1 y=0 u=0 cpy=32773
+CLEANABLE: 0/2 y=0 u=0 cpy=32773
+CLEANABLE: 0/3 y=32773 u=6 cpy=32773
+CLEANABLE: 0/4 y=32772 u=124 cpy=32773
+CLEANABLE: 0/5 y=32771 u=273 cpy=32773
+CLEANABLE: 0/6 y=32770 u=0 cpy=32773
+
+of
+0 0
+1
+2
+3 6
+4 124
+5 273
+6 0
+7 496
+8 0
+
+
+ 6c/ at shut down, some simple orphans remain
+     missing wakeup ???
+
 DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
     truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
     Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
@@ -4868,6 +4899,12 @@ DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
     Understand why CleanSpace can be tried and failed 1000 times
     before there is any change.
+ 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+     a wakeup.
+     write lafs_iolock_written_async.
+
+ 7l/ make sure i_blocks is correct.
+
 DONE 8/ looping in do_checkpoint
     root is still in Phase1 because 0/2 is in Phase 1
     [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid writepageflush(1)
@@ -4933,6 +4970,9 @@ DONE 15b/ Report directory size less confusingly
 DONE 15d/ What does I_Dirty mean - and implement it.
 15e/ setattr should queue an update for the inode metadata.
+     and clean up lafs_write_inode at the same time (it shouldn't do an update).
+     and confirm when s_dirt should be set.  It causes fsync to run a
+     checkpoint.
 15f/ include timestamp in cluster_head to set mtime/ctime properly on
     roll-forward?

 ## Items from 6 jul 2007.
@@ -4944,6 +4984,7 @@ DONE 15d/ What does I_Dirty mean - and implement it.
     in erase_dblock, but that won't complete until cleaner gets to run,
     but this is the cleaner blocked on orphans.
+15i/ separate thread management from 'cleaner' name.

 16/ Update locking.doc
@@ -5038,6 +5079,10 @@ DONE 15d/ What does I_Dirty mean - and implement it.
 50/ Support O_DIRECT

+51/ Check support for multiple devices
+    - add a device to a live array
+    - remove a device from a live array
+
 26June2010
   Investigating 5a
@@ -5209,3 +5254,330 @@ DONE 15d/ What does I_Dirty mean - and implement it.
     So we could easily not have a my_inode - e.g. just cleaning the data block.
     ->my_inode cannot disappear while we hold the block, so a test is safe.
+
+
+ ----------------------------------------------
+ Space reservation and file-system-full conditions.
+
+ Space is needed for everything we write.
+ Some things we can reject if the fs is too full.
+ Some things we can delay when space is tight.
+ Some things we need to write in order to free up space.
+ Others absolutely must be written, so we need to always have
+ a reserve.
+
+ The things that must be written are
+  - cluster header - which we never allocate
+  - some seg-usage and youth blocks - and quota blocks
+ These continually have credit attached - it is a bug if there
+ are not enough.  (We hit this bug)
+
+ Things that we need to write to free up space are
+ any block - data or index - that the cleaner finds.
+
+ Things that we can delay, but not fail, are any change to a block that
+ has already been written or allocated.
+
+ When space is needed it can come from one of three places.
+  - the remainder of the current main segment
+  - the remainder of the current cleaner segment
+  - a new segment.
+
+ Only Realloc blocks can go to the cleaner segment, so the
+ 'must write' blocks cannot go there, so unused main must have enough
+ space for all those.
+ Realloc blocks can go anywhere - we don't need a cleaner segment if things
+ are too tight.
+
+ When we run out of space there are several things we can do to get more:
+  - incorporate index blocks.  This tends to free up uninc-credits which
+    are normally over-allocated for safety.
+  - cluster_allocate/cluster_flush so more blocks get allocated and so
+    more can be incorporated.  See above.  This is probably most helpful
+    for data blocks.
+  - clean several segments into whole cleaner segments or into the main segment.
+ Much of this happens by triggering a snapshot; however we should only do that
+ when we have full cleaner-segments (or zero cleaner segments).
+
+ When cleaning we don't want to over-clean, i.e. we don't want to commit
+ any blocks from a second segment if that will stop us from committing blocks
+ from the first segment.  Otherwise we might use one cleaning segment up by
+ making 4 half-clean.  This doesn't help.
+
+
+ So: we reserve multiple segments for the cleaner, possibly zero.
+
+ We clean up to that many segments at a time, though if that many is zero,
+ we clean one segment at a time.
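The batching rule above - clean up to the reserved number of segments, at least one, and never let a second source segment squeeze out the first - can be sketched in a few lines. This is a hedged illustration: the names (`clean_batch`, `may_clean_second`, `dest_free`) are invented for this sketch and are not the real LAFS interfaces.

```c
#include <assert.h>

/* How many segments to clean in one pass: up to the number reserved
 * for the cleaner, but at least one (cleaning into the main segment
 * when the reserve is zero).  Illustrative only. */
static int clean_batch(int reserved_cleaner_segs)
{
	return reserved_cleaner_segs > 0 ? reserved_cleaner_segs : 1;
}

/* Don't over-clean: only start committing blocks from a second source
 * segment if its live blocks fit in the destination alongside the
 * first segment's, so cleaning the second can never stop us from
 * committing blocks of the first. */
static int may_clean_second(long dest_free, long first_live, long second_live)
{
	return first_live + second_live <= dest_free;
}
```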
+ lafs_cluster_allocate only succeeds if there was room in an allocated segment.
+ If allocating a new segment fails, the cluster_allocate must fail.  This
+ will push extra cleaning into the main segment where allocations must not
+ fail.
+
+ The last 3(?) [adjusted for number of snapshots] segments can only be allocated
+ to the main segment, and this space can only be used for cleaning.
+ Once the "free_space - allocated_space" drops below one segment, we
+ force a checkpoint.  This should free up at least one segment.
+
+ We need some point at which we stop cleaning because the chance of finding
+ something to clean is too low.  At that point all 'new' requests definitely
+ become failures.  They might do so earlier too.
+ Possibly at some point we start discounting youth from new usage scores so
+ that the list becomes sorted by usage.
+
+
+ Need:
+   cut-off point for free_seg where we don't allow cleaner to use segments
+     3? 4?
+
+   event when we start using fixed '0x8000' youth for new segment scores.
+     Maybe when we clean a segment with usage gap below 16 or 1/128
+   event when we stop doing that.
+     Maybe when free_segs crosses some number - 8?
+
+   point when alloc failure for NewSpace becomes ENOSPC
+     same as above?
+
+   point when we don't bother cleaning
+     no cleaner segments can be allocated, and checkpoint did not increase
+     number of clean segments (used as many as freed).
+     Clear this state when something is deleted.
+
+
+ Allocations come out of free_blocks, which does not include those
+ segments that have been promised to the cleaner.
+ CleanSpace and AccountSpace cannot fail.
+ We *know* not to ask for too many - cleaner knows when to stop.
+ ReleaseSpace fails (to be retried) if available is below a threshold,
+ providing the cleaner hasn't been stopped.
+ NewSpace fails if below a somewhat higher threshold.  If we haven't entered
+ emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
+
+
+ Possibly limit some 'cleaner' segments to data only??
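A minimal sketch of the request-class policy just described. The class names (NewSpace, ReleaseSpace, CleanSpace, AccountSpace) and the error behaviour come from the notes; the `fs_space` struct, the threshold values, and the function name are assumptions made for illustration, not the real LAFS code.

```c
#include <assert.h>
#include <errno.h>

enum alloc_class { NewSpace, ReleaseSpace, CleanSpace, AccountSpace };

struct fs_space {
	long free_blocks;	/* excludes segments promised to the cleaner */
	long seg_size;		/* blocks per segment (illustrative) */
	int  emergency_clean;	/* emergency cleaning mode active */
};

/* 0 = granted; -ENOSPC / -EAGAIN = refused, per the rules above. */
static int space_request(struct fs_space *fs, enum alloc_class c, long blocks)
{
	long release_min = fs->seg_size;	/* guessed threshold */
	long new_min = 2 * fs->seg_size;	/* "somewhat higher" threshold */

	switch (c) {
	case CleanSpace:
	case AccountSpace:
		/* cannot fail: callers *know* not to ask for too many */
		fs->free_blocks -= blocks;
		return 0;
	case ReleaseSpace:
		if (fs->free_blocks - blocks < release_min)
			return -EAGAIN;		/* to be retried */
		fs->free_blocks -= blocks;
		return 0;
	case NewSpace:
		if (fs->free_blocks - blocks < new_min)
			/* -ENOSPC unless emergency cleaning may still help */
			return fs->emergency_clean ? -EAGAIN : -ENOSPC;
		fs->free_blocks -= blocks;
		return 0;
	}
	return -ENOSPC;
}
```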
+
+
+ So: work items.
+  - change CleanSpace to never fail, but cluster_allocate new_segment
+    can fail for a cleaner segment.  This is propagated through lafs_cluster_alloc
+  - cleaner pre-allocates cleaner segments (for new_segment to use)
+    and only cleans that many segments at a time.
+  - introduce emergency cleaning mode which causes ENOSPC to be returned
+    and ignores 'youth' on score.
+  - pause cleaner when we are so short of space that there is no point
+    trying until something is deleted.
+
+30june2010
+ notes on current issue with checkpoint misbehaving and running out of
+ segments.
+
+ 1/ don't want to cluster-flush too early.  Ideally wait until segment is
+    full, but we currently hold writeback on everything so we cannot delay
+    indefinitely.
+ 2/ row goes negative!!  let's see...
+
+   seg_remainder doesn't change the set, but just returns
+     the remaining rows times the width
+
+   seg_step moves nxt_* to *, stepping to the next ... row?
+     saves current as st_*
+
+   seg_setsize - allocates space in the segment for 'size' blocks plus
+     a bit to round off to a whole number of table/rows
+     nxt_table nxt_row
+
+   seg_setpos initialises the seg to a location and makes it empty,
+     st_ and nxt_ are the same
+
+   seg_next reports address of next block, and moves forward.
+
+   seg_addr simply reports address of next block
+
+ So the sequence should be:
+
+   seg_setpos to initialise
+   seg_remainder as much as you want
+   seg_setsize when we start a cluster
+   seg_next up to seg_remainder times
+   seg_step to go to next cluster (when not seg_setpos).
+     or maybe just before seg_setpos
+
+ Need cluster_reset to be called after new_segment, or after we
+ flush a cluster but don't need a new_segment.
+
+ I think I'm cleaning too early ... I am even cleaning
+ the current main segment!!!!
+
+ OK, I got rid of the worst bugs.  Now it just keeps cleaning
+ the same blocks in the current segment over and over.
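The seg_* call sequence described above can be modelled as a userspace toy, which makes the intended state transitions easier to check. This is a hedged sketch: the field names loosely follow the notes' st_*/nxt_* naming, but the real LAFS segpos tracks tables, rows, and widths, so everything here beyond the call sequence is an invention.

```c
#include <assert.h>

/* Toy model of the seg_* sequence: seg_setpos, seg_remainder,
 * seg_setsize, seg_next, seg_addr, seg_step.  Linear addresses only;
 * the real structure is table/row based. */
struct segpos {
	long start, size;	/* segment location and length in blocks */
	long cur;		/* next block to hand out */
	long st;		/* saved start of the current cluster */
	long nxt;		/* where the next cluster will start */
};

/* initialise to a location and make it empty: st_ and nxt_ the same */
static void seg_setpos(struct segpos *s, long start, long size)
{
	s->start = start;
	s->size = size;
	s->cur = s->st = s->nxt = start;
}

/* doesn't change the state, just reports what is left */
static long seg_remainder(const struct segpos *s)
{
	return s->start + s->size - s->cur;
}

/* allocate space in the segment for 'size' blocks (no rounding here) */
static void seg_setsize(struct segpos *s, long size)
{
	s->nxt = s->cur + size;
}

/* report address of next block, and move forward */
static long seg_next(struct segpos *s)
{
	return s->cur++;
}

/* simply report address of next block */
static long seg_addr(const struct segpos *s)
{
	return s->cur;
}

/* save current as st_*, move nxt_* to current: step to next cluster */
static void seg_step(struct segpos *s)
{
	s->st = s->cur;
	s->cur = s->nxt;
}
```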
+ 2 problems I see
+  1/ it cleans a segment that it should not touch
+     We need to avoid cleaner segment increasing the
+     checkpoint youth number.
+  2/ it has 6 free segments and doesn't use them
+
+    clean_reserved is 3 segments, < 4, so free_block <= allocated + watermark
+    watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
+    3 in reserve and so nothing much to clean!
+
+ The heuristic for returning ENOSPC is not working.  Need something more
+ directly related to what is happening.
+ Maybe if cleaning doesn't actually increase free space.
+
+ !Need to leave segments in the table until we have finished
+  writing to them, so they cannot be cleanable. - DONE
+
+ WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
+ doesn't count that.  Bad?
+
+ When nearly full we keep checkpointing even though it cannot help.
+ Need clearer rules on when there is any point pushing forward.
+ Need to know when to fail requests.
+
+02 july 2010
+
+ I am wasting lots of space creating snapshots that don't serve any
+ purpose.
+ The reasons for creating a snapshot are:
+  - turn clean segments into free segments
+  - reduce size of required roll-forward
+  - possibly flush all inode updates for 'sync'.
+
+ We currently force one when
+    newblocks > max_newblocks
+ max is 1000, newblocks is never reset!
+ probably make that a number of segments.
+ lafs_checkpoint_start is called
+    when cleaner blocks, and space is available
+    at shutdown
+    on write_super if s_dirt
+       __fsync_super before ->sync_fs
+       freeze_bdev
+       fsync_super
+       fsync_bdev
+       do_remount_sb
+    generic_shutdown_super before put_super if s_dirt
+    sync_supers if s_dirt
+       do_sync
+    file_sync !!! if s_dirt
+
+ I think I should move checkpoint_start to
+ ->sync_fs
+
+
+ After testing
+  - blocks remaining after truncate - one index and 1-4 data
+  - truncate finds blocks being cleaned
+      FIXED - move setting of I_Trunc
+  - orphans aren't being cleaned up sometimes.
+      Hacked by forcing the thread to run.
+  - parent of index block has depth==1
+      Don't reduce depth while there are dirty children.
+      Probably don't want uninc either?
+
+  - some sort of deadlock?  lafs_cluster_update_commit_both
+      has got the wc lock and wants to flush
+      writepage also is flushed.
+      Not sure what the blockage is.
+      I think the writepage is the one in cluster_flush, and it
+      is blocking
+
+  - Async is keeping 16/0 pinned during shutdown
+
+03July2010
+
+ Testing overnight with 250 runs produced:
+  - blocked for more than 120 seconds
+      Cleaner tries to get an inode that is being deleted
+      and blocks, so inode_map_free is blocked waiting for
+      checkpoint to finish - deadlock.
+      Need to create a ->drop_inode which provides interlock with
+      cleaner/iget
+
+      But this is hard to get right.
+      generic_forget_inode needs to write_inode_now and flush all changes
+      out and then truncate the pages off so the inode will be
+      empty and can be freed.  But flushing needs the cleaner thread
+      which can block on the inode lookup.
+      Ahh.... I can abuse iget5_locked.
+      If 'test' sees I_WILL_FREE or similar, it fails and sets a flag.
+      If the flag was set, then 'set' fails.
+
+
+  - block.c:504  DONE (I think).
+      unlink/delete_commit dirties a block without credits
+      It could have been just cleaned..
+      It looks like it was in Writeback for the cleaner when
+      unlink pinned and allocated it....
+      or maybe it was on a cluster (due to writepage) when
+      it was pinned.  Then cluster_flush cleared dirty ... but
+      it should still have a Credit.
+      Maybe I should iolock the block ??
+
+      On reflection it wasn't cleaning, just tiny clusters
+      of recent changes which were originally written as tiny
+      checkpoints.  Maybe lots of directory updates triggered the clusters.
+      I guess writepage is being called to sync the directory???
+      Or maybe the checkpoint was pushed by s_dirt being set.
+
+      So use PinPending and iolock to protect dir blocks from writepage.
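The iget5_locked trick from the 120-second-hang entry above might work roughly as below. This is a hedged userspace mock, not kernel code: I_WILL_FREE and the test/set callback shape are real Linux concepts (the real callbacks take a `void *` opaque pointer and run under the inode hash lock), but the table, the `saw_freeing` flag, and every function name here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the "abuse iget5_locked" idea: 'test' refuses an inode that
 * is on its way out and records that fact, and 'set' then refuses to
 * instantiate a replacement, so the lookup fails instead of
 * deadlocking against the freeing path. */
#define I_WILL_FREE 0x01	/* stand-in for the kernel flag */

struct mock_inode { unsigned long ino; unsigned state; };
struct lookup_data { unsigned long ino; int saw_freeing; };

static int test_cb(struct mock_inode *inode, struct lookup_data *d)
{
	if (inode->ino != d->ino)
		return 0;
	if (inode->state & I_WILL_FREE) {
		d->saw_freeing = 1;	/* remember: do not re-create it */
		return 0;
	}
	return 1;
}

static int set_cb(struct mock_inode *inode, struct lookup_data *d)
{
	if (d->saw_freeing)
		return -1;		/* 'set' fails: caller backs off */
	inode->ino = d->ino;
	return 0;
}

/* simplified iget5-style lookup over a fixed table (ino 0 == free slot) */
static struct mock_inode *mock_iget5(struct mock_inode *tbl, int n,
				     unsigned long ino)
{
	struct lookup_data d = { ino, 0 };
	int i;

	for (i = 0; i < n; i++)
		if (test_cb(&tbl[i], &d))
			return &tbl[i];
	for (i = 0; i < n; i++)
		if (tbl[i].ino == 0 && set_cb(&tbl[i], &d) == 0)
			return &tbl[i];
	return NULL;		/* being freed, or table full */
}
```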
+
+  - dir.c:1266  DONE
+      dir handle orphan finds a block (74/0) which is not
+      valid
+      This can happen if orphan_release failed to reserve a block.
+      We need to retry the release.
+  - inode.c:615
+      index block and some data blocks still accounted to deleted file.
+
+      No theory on this yet.  Always one index block and a small number
+      of data blocks.  Maybe the index block looked dirty, but was then
+      incorporated with something that was missed from the children list...
+      Or maybe I_Trunc is cleared a bit early...
+      Or trunc_next advanced too far?? or too soon
+      ??
+
+  - segments.c:640  DONE
+      prealloc in the cleaner finds all 2315 free blocks allocated.
+      no clean reserved.
+      Need to be able to fail CleanSpace requests when cleaner_reserve
+      is all gone??
+
+      or just slow down the cleaner to one segment per checkpoint when
+      we are tight..  Hope that works.
+  - super.c:699
+      async flag on 16/0 keeping block pinned
+      Maybe clear Async flag during checkpoint.  Cleaner won't need it
+      No, just ensure we clear Async on all successful async calls.
+
+      orphan file 8/0 has orphan reference keeping parent pinned
+      [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
+      Orphan handling is failing to get a reservation to write out the
+      orphan file block?  Not convincing as there should be lots of space
+      at unmount, and 'orphan sleeping' has become empty.
+
+  - Show State
+      orphan inode blocked by leaf index stuck in writeback:
+      [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
+      [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
+
+      This is in the write-cluster waiting to be flushed
+
+
+9July2010
+ Review B_Async.
+ If a thread wants something asynchronously, it
+  - sets B_Async
+  - checks if it can have what it wants.
+
+      if not, fail
+
+      if so, clear B_Async and succeed
+
+ If a thread releases something that might be requested Async,
+ it doesn't clear Async, but wakes up *the*thread*.
+
+ This applies to
+   IOLock    - iolock_block
+   Writeback - writeback_done, iolock_written
+   Valid     - erase_dblock, wait_block
+   inode I_* - iget / drop_inode
+
+ orphan handler, cleaner, segscan - all in the cleaner thread.
diff --git a/orphan.c b/orphan.c
index d429955..69464d0 100644
--- a/orphan.c
+++ b/orphan.c
@@ -564,8 +564,8 @@ void lafs_add_orphan(struct fs *fs, struct datablock *db)
 void lafs_orphan_forget(struct fs *fs, struct datablock *db)
 {
-	/* This is still and orphan, but we don't want to handle
-	 * it just now. When we we, lafs_add_orphan will be called */
+	/* This is still an orphan, but we don't want to handle
+	 * it just now. When we do, lafs_add_orphan will be called */
 	LAFS_BUG(!test_bit(B_Orphan, &db->b.flags), &db->b);
 	spin_lock(&fs->lock);
 	if (!list_empty(&db->orphans)) {
-- 
2.39.5