When a block becomes empty, we need to remove it from the
indexing tree, causing the next block to take over the address
space if possible, else maybe the previous block.
FORMAT CHANGE - include start address explicitly in indirect blocks.
By storing the address of the first block in an indirect block,
that block can be relocated in the index tree with no adverse effects.
This makes it easier to handle blocks that have become empty.
So an indirect block now contains:
u16 0x01 'indirect' indicator
u32 addr virtual address of first indexed block
u48 paddr physical address of first indexblock block
u48 paddr2 next block ... etc
Don't lose path to index blocks that haven't been incorporated
Before a new index block is incorporated, we can find it only by
finding a sibling and walking the sibling list.
If that primary sibling gets released, the new block becomes
unfindable.
So hold a reference to such primaries until incorporation,
and ensure that there is always a primary in the same parent
by incorporating more promptly after a split.
- in a few places we didn't called dirty_inode when we should have
done.
- after inode_{inc,dec}_link_count we don't need to dirty the inode
because it has already been done.
- handle I_Dirty properly rather than unpinning the block
- allocate NCredits for inodes so that I_Dirty can succeed.
If the Credit has moved to Dirty, we don't need to add
another one during prealloc.
Doing so means that prealloc can fail on a block which is
already adequately allocated.
The InoIdx block probably shouldn't have a phys address,
but we must be certainly it doesn't after a grow caused it to
not be an InoIdx block any more, else accounting will get
confused when we allocate it to another block.
I'm still not sure about this handling of empty index blocks,
but leaving it invalid certainly isn't correct. I have to
initialise it and make it valid.
Avoid inadvertently flush blocks that are in middle of a transaction.
If a directory block has PinPending, then a transaction is in progress
and flushing the block (e.g. pdflush) would be a bad idea, as we could
lose the credit before the final update.
If the orphan handler is going to dirty a block, it must
first pin it (and check for errors), and must afterwards remove
the PinPending.
One of the places which caused change did this. The other didn't.
lafs_dir_handle_orphan no longer requires drop_orphan,
and can drop the reference to the orphan block.
So we need to hold our own so we can test if it is still
and orphan after handling (in which case we must wait).
It seems that this list can be empty fairly often without
any problems caused - I was afraid of infinite loops as it appears
there is nothing to do.
Maybe it is just the waiting that is needed.
When erasing a pinned dblock, it might not be on the leaf list.
The fact that we have it locked can mean that it wasn't put there
yet.
So check leaf list membership explicitly.
Note that it cannot be in write-back, as we should have
waited for any writeback to finish, and won't have started any.
And inode can be an orphan for an extended period of time and
should not be on the pending_orphans list all that time.
So we need to remove it if nothing needs doing for now, and
add it back on when it requires attention.
Requiring the 'handle' routines to leave the orphan on the list and
then removing it if B_Orphan isn't set is not appropriate as we will
want to remove block while they still have B_Orphan set.
So do the list management differently to still ensure a single pass
through without needing the current block to hold our place.
Flush orphans before failing rmdir due to not being empty.
Before we can be sure if a directory is empty, we need to
flush and orphans that might be holding space in the directory
in use. Only then can a check on the file size be meaningful.
NeilBrown [Thu, 27 Aug 2009 01:46:02 +0000 (11:46 +1000)]
Further tidy up for make_orphan
Move the calls to orphan_abort and orphan_commit into make_orphan
to follow the pattern used elsewhere.
This requires moving the i_mutex lock out into make_orphan,
but otherwise seems to be a good improvement.
NeilBrown [Thu, 27 Aug 2009 01:46:00 +0000 (11:46 +1000)]
Use orphan_abort in lafs_orphan_release
lafs_orphan_release currently open-codes all of orphan_abort,
except the i_mutex locking is a bit different. So change locking
rules for orphan_abort, then use that directly in lafs_orphan_release.
NeilBrown [Thu, 27 Aug 2009 01:45:56 +0000 (11:45 +1000)]
Orphan: simplify orphan creation.
We don't need to synchronise orphan creation with other parts of
direct ops etc. We just make the block into an orphan if we
think that might be needed. If this gets committed in a previous
checkpoint, there is no problem.
This simplifies things a lot and lets us get rid of some races.
NeilBrown [Tue, 25 Aug 2009 07:09:54 +0000 (17:09 +1000)]
Accelerate walk_indirect.
If an indirect block is being asked to incorporate an address that
is very must after the address of the block, it has to count all
the way up to that number, which is a waste of time.
So if there is an opportunity to jump forward, take it.
NeilBrown [Tue, 25 Aug 2009 07:09:52 +0000 (17:09 +1000)]
Move a dprintk in delete_inode
We very often enter delete_inode (when cleaning) on an inode that
doesn't really exist. So only print a message if this is a real
inode needing to be deleted.
NeilBrown [Tue, 25 Aug 2009 07:09:51 +0000 (17:09 +1000)]
Introduce lafs_index_empty.
We currently use lafs_leaf_next to test if an index block is empty,
but that doesn't work on internal nodes of course.
So create lafs_index_empty for that purpose.
NeilBrown [Tue, 25 Aug 2009 07:09:51 +0000 (17:09 +1000)]
cleaner: recognise end of segment properly.
If we find an incorrect cluster header, then abort cleaning
for that segment. This should not normally happen as we
shouldn't clean a segment until it has been completely written.
But we need to protect against it anyway.
NeilBrown [Mon, 24 Aug 2009 03:30:30 +0000 (13:30 +1000)]
Fix iolock semantics.
There are some races with IOlock and page unlock due to bad
assumption.
So make things more explicit with a flag to say if we own
the lock, or the 'Writeback' flag.
This allows us to remove the extra flag to iounlock_block
which is really nice because it always confused me.
A significant but subtle part of the locking involved the
fact that one a page has been read, it will not be read
again, so when readpage calls lafs_iocheck_block it is not
possible that it will unlock incorrectly.
It could conceivably unwriteback incorrectly, but next patch
should fix that.
NeilBrown [Mon, 24 Aug 2009 03:30:27 +0000 (13:30 +1000)]
IOLock all accesses to index blocks.
We really don't need them to change while being read.
This creates a possibly unpleasant conflict with writeout,
so maybe we need a IOWriteback flag for writeout...
NeilBrown [Mon, 24 Aug 2009 03:30:24 +0000 (13:30 +1000)]
Revise dirty/valid rules for Index blocks.
An Index block is 'Valid' if it contains any index.
So an InoIdx block for a depth=0 inode is never Valid
(as there is data rather than indexes there).
When it comes time to cluster_allocate an Index block,
if it is not Valid, we simple allocated_block it to 0.
An index block is Dirty if there are any children that need incorporation
as well was when incorporate has happened but has not yet been
written.
So an Index block can be Dirty but not Valid (unlike data blocks).
NeilBrown [Sun, 16 Aug 2009 05:20:30 +0000 (15:20 +1000)]
Cleaner: make sure we finish the job.
If there might be blocks in the ->cleaning list even after
we have read all of the cluster headers. So make sure we continue
cleaning until all of those are dealt with.
NeilBrown [Sat, 15 Aug 2009 07:09:22 +0000 (17:09 +1000)]
Simplify writeout rules for inode data block.
Previously we did not write an inode data block until the
InoIdx was ready.
This is not good if we need to sync an inode well before checkpoint
runs.
So just write an inde data block when we find it, but ensure not to
send an inode data block during checkpoint until the InoIdx block is
ready.
Note: it is now clear why an inode has two sets of credits, one on the
data block and one on the index block. The first set may be needed to
sync the inode metadata. The second may be need to update the
indexing information - they are copied across to the data block for
this purpose.
NeilBrown [Sat, 15 Aug 2009 07:09:20 +0000 (17:09 +1000)]
Simplify iolocking in get_flushable
The difference between data and index block is not really supportable,
and we cannot just avoid waiting for some blocks.
But we cannot always for a full iowait as block that have been
allocated to a cluster do not complete until the cluster is written
and we don't want to wait for a cluster to be written, especially as
we there thread that is supposed to do that.
So create an intermediate iowait which wait for iolock to be dropped
or the block to be placed on a list. Once it is on a list we can be
sure not to lose it.
So we wait while incorporation or truncation happens, but not while
writeout happens.
NeilBrown [Sat, 15 Aug 2009 07:03:19 +0000 (17:03 +1000)]
Avoid races when processing blocks for checkpoint.
When doing a checkpoint we need to be sure that every
flushable block is accounted for. So we need to be able
to wait for anything outstanding.
Currently a block can be removed from the leaflist by writepage
before it is placed on the cluster list. During this window
the checkpoint thread cannot see it and so might progress without
waiting for and so will think it has completed prematurely.
So delay the removal from the leaflist until after we have the
writecluster lock. This assures that every leaf block will be
either on the leaf list or on the cluster list when checkpoint is
looking for it.
When checkpoint calls cluster_done, this will release any blocks
from the cluster and, if needed, put them back on the leaf list where
they can be found again.
NeilBrown [Sat, 15 Aug 2009 05:52:56 +0000 (15:52 +1000)]
Remove dirtying of InoIdx in place of inode data block.
I don't remember why this was here, but until very recently
the code was wrong so it didn't do the "right" thing anyway.
And it doesn't seem to make sense.
When we dirty a dblock, we really want it to be dirty.
We might then write it and roll-forward will be able to pick
it all up except the index information which is always
calculated from addresses that are actually found.
NeilBrown [Sat, 15 Aug 2009 05:52:49 +0000 (15:52 +1000)]
Don't set Valid when setting Dirty.
While a block must be Valid to be Dirty, it is best to
only set Valid when actually putting data in the block,
and then check that it is Valid when marking it Dirty.
That can catch more bugs.