NeilBrown [Thu, 27 Aug 2009 01:46:00 +0000 (11:46 +1000)]
Use orphan_abort in lafs_orphan_release
lafs_orphan_release currently open-codes all of orphan_abort,
except the i_mutex locking is a bit different. So change locking
rules for orphan_abort, then use that directly in lafs_orphan_release.
NeilBrown [Thu, 27 Aug 2009 01:45:56 +0000 (11:45 +1000)]
Orphan: simplify orphan creation.
We don't need to synchronise orphan creation with other parts of
direct ops etc. We just make the block into an orphan if we
think that might be needed. If this gets committed in a previous
checkpoint, there is no problem.
This simplifies things a lot and lets us get rid of some races.
NeilBrown [Tue, 25 Aug 2009 07:09:54 +0000 (17:09 +1000)]
Accelerate walk_indirect.
If an indirect block is being asked to incorporate an address that
is very much after the address of the block, it has to count all
the way up to that number, which is a waste of time.
So if there is an opportunity to jump forward, take it.
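A toy model of the idea, not the LaFS code: the names (toy_walk, table_len, pending) and the layout are assumptions made for illustration. It walks file addresses from the start of a block and returns the number of loop iterations, so the effect of jumping straight to the next pending address once the existing table is exhausted can be seen. The pending addresses are assumed to be sorted, unique, and not before the start address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: an indirect block covers file addresses
 * [start, start + table_len) and is asked to incorporate the sorted
 * addresses in pending[].  Returns the number of loop iterations.
 */
static size_t toy_walk(uint32_t start, uint32_t table_len,
                       const uint32_t *pending, size_t npending, bool jump)
{
    uint32_t addr = start;
    uint32_t table_end = start + table_len;
    size_t p = 0, steps = 0;

    assert(npending == 0 || pending[0] >= start);

    while (addr < table_end || p < npending) {
        /* Past the existing table only pending addresses matter,
         * so skip directly to the next one instead of counting. */
        if (jump && addr >= table_end && pending[p] > addr)
            addr = pending[p];
        steps++;
        if (p < npending && pending[p] == addr)
            p++;    /* "incorporate" this address */
        addr++;
    }
    return steps;
}
```

With start 100, a table of 4 entries and pending addresses 102 and 1000000, the counting walk takes 999901 iterations and the jumping walk takes 5.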
NeilBrown [Tue, 25 Aug 2009 07:09:52 +0000 (17:09 +1000)]
Move a dprintk in delete_inode
We very often enter delete_inode (when cleaning) on an inode that
doesn't really exist. So only print a message if this is a real
inode needing to be deleted.
NeilBrown [Tue, 25 Aug 2009 07:09:51 +0000 (17:09 +1000)]
Introduce lafs_index_empty.
We currently use lafs_leaf_next to test if an index block is empty,
but that doesn't work on internal nodes of course.
So create lafs_index_empty for that purpose.
NeilBrown [Tue, 25 Aug 2009 07:09:51 +0000 (17:09 +1000)]
cleaner: recognise end of segment properly.
If we find an incorrect cluster header, then abort cleaning
for that segment. This should not normally happen as we
shouldn't clean a segment until it has been completely written.
But we need to protect against it anyway.
NeilBrown [Mon, 24 Aug 2009 03:30:30 +0000 (13:30 +1000)]
Fix iolock semantics.
There are some races with IOlock and page unlock due to a bad
assumption.
So make things more explicit with a flag to say if we own
the lock, or the 'Writeback' flag.
This allows us to remove the extra flag to iounlock_block
which is really nice because it always confused me.
A significant but subtle part of the locking involved the
fact that once a page has been read, it will not be read
again, so when readpage calls lafs_iocheck_block it is not
possible that it will unlock incorrectly.
It could conceivably unwriteback incorrectly, but next patch
should fix that.
NeilBrown [Mon, 24 Aug 2009 03:30:27 +0000 (13:30 +1000)]
IOLock all accesses to index blocks.
We really don't need them to change while being read.
This creates a possibly unpleasant conflict with writeout,
so maybe we need an IOWriteback flag for writeout...
NeilBrown [Mon, 24 Aug 2009 03:30:24 +0000 (13:30 +1000)]
Revise dirty/valid rules for Index blocks.
An Index block is 'Valid' if it contains any index.
So an InoIdx block for a depth=0 inode is never Valid
(as there is data rather than indexes there).
When it comes time to cluster_allocate an Index block,
if it is not Valid, we simply allocated_block it to 0.
An index block is Dirty if there are any children that need incorporation
as well as when incorporate has happened but has not yet been
written.
So an Index block can be Dirty but not Valid (unlike data blocks).
NeilBrown [Sun, 16 Aug 2009 05:20:30 +0000 (15:20 +1000)]
Cleaner: make sure we finish the job.
There might be blocks in the ->cleaning list even after
we have read all of the cluster headers. So make sure we continue
cleaning until all of those are dealt with.
NeilBrown [Sat, 15 Aug 2009 07:09:22 +0000 (17:09 +1000)]
Simplify writeout rules for inode data block.
Previously we did not write an inode data block until the
InoIdx was ready.
This is not good if we need to sync an inode well before checkpoint
runs.
So just write an inode data block when we find it, but ensure not to
send an inode data block during checkpoint until the InoIdx block is
ready.
Note: it is now clear why an inode has two sets of credits, one on the
data block and one on the index block. The first set may be needed to
sync the inode metadata. The second may be needed to update the
indexing information - they are copied across to the data block for
this purpose.
NeilBrown [Sat, 15 Aug 2009 07:09:20 +0000 (17:09 +1000)]
Simplify iolocking in get_flushable
The difference between data and index block is not really supportable,
and we cannot just avoid waiting for some blocks.
But we cannot always do a full iowait as blocks that have been
allocated to a cluster do not complete until the cluster is written
and we don't want to wait for a cluster to be written, especially as
we are the thread that is supposed to do that.
So create an intermediate iowait which waits for iolock to be dropped
or the block to be placed on a list. Once it is on a list we can be
sure not to lose it.
So we wait while incorporation or truncation happens, but not while
writeout happens.
NeilBrown [Sat, 15 Aug 2009 07:03:19 +0000 (17:03 +1000)]
Avoid races when processing blocks for checkpoint.
When doing a checkpoint we need to be sure that every
flushable block is accounted for. So we need to be able
to wait for anything outstanding.
Currently a block can be removed from the leaflist by writepage
before it is placed on the cluster list. During this window
the checkpoint thread cannot see it and so might progress without
waiting for it, and so will think it has completed prematurely.
So delay the removal from the leaflist until after we have the
writecluster lock. This assures that every leaf block will be
either on the leaf list or on the cluster list when checkpoint is
looking for it.
When checkpoint calls cluster_done, this will release any blocks
from the cluster and, if needed, put them back on the leaf list where
they can be found again.
NeilBrown [Sat, 15 Aug 2009 05:52:56 +0000 (15:52 +1000)]
Remove dirtying of InoIdx in place of inode data block.
I don't remember why this was here, but until very recently
the code was wrong so it didn't do the "right" thing anyway.
And it doesn't seem to make sense.
When we dirty a dblock, we really want it to be dirty.
We might then write it and roll-forward will be able to pick
it all up except the index information which is always
calculated from addresses that are actually found.
NeilBrown [Sat, 15 Aug 2009 05:52:49 +0000 (15:52 +1000)]
Don't set Valid when setting Dirty.
While a block must be Valid to be Dirty, it is best to
only set Valid when actually putting data in the block,
and then check that it is Valid when marking it Dirty.
That can catch more bugs.
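A minimal sketch of that rule, with hypothetical names (toy_block, toy_fill, toy_mark_dirty, BLK_VALID, BLK_DIRTY) that are not taken from the LaFS source. Valid is set only where data is placed in the block, and marking the block Dirty checks Valid rather than setting it. Kernel code would more likely BUG_ON at the check; here it just returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the real LaFS flags are named differently. */
enum { BLK_VALID = 1u << 0, BLK_DIRTY = 1u << 1 };

struct toy_block {
    unsigned flags;
    char data[16];
};

/* Valid is set only at the point where data actually goes in. */
static void toy_fill(struct toy_block *b, const char *src, size_t len)
{
    assert(len <= sizeof(b->data));
    memcpy(b->data, src, len);
    b->flags |= BLK_VALID;
}

/* Marking Dirty checks Valid instead of quietly setting it, so a
 * caller that dirties a block with no data in it gets noticed. */
static bool toy_mark_dirty(struct toy_block *b)
{
    if (!(b->flags & BLK_VALID))
        return false;
    b->flags |= BLK_DIRTY;
    return true;
}
```

Had toy_mark_dirty set BLK_VALID itself, dirtying an empty block would succeed silently, which is the kind of bug the check is meant to expose.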
NeilBrown [Sat, 15 Aug 2009 05:49:46 +0000 (15:49 +1000)]
inode: Fix problems at inode creation.
- we need to pin the new inode dblock to ensure it gets written
- we don't need to dirty it in inode_map_new_commit as it is already
dirtied in lafs_inode_init.
NeilBrown [Sun, 9 Aug 2009 05:50:54 +0000 (15:50 +1000)]
Don't insist on having UnincCredits for all Index blocks.
When an Index block has room for new addresses, it does not need to
have an UnincCredit because we know it will not split before more
credits are available.
NeilBrown [Sat, 8 Aug 2009 04:23:40 +0000 (14:23 +1000)]
Make sure inodes don't get forgotten during cleaning.
During cleaning and other times when inodes might have dirty
index blocks we don't want the inode to be pushed out due to
apparently not being in use.
However there is a difficulty in holding a reference on the inode as
that causes the truncate following a final unlink to be delayed.
So to compromise, whenever the InoIdx block for an inode is pinned,
hold a reference to the inode as long as the link count is non-zero.
Once the link count becomes zero, we drop the extra ref and if this
leads to the inode being deleted, the current delaying of this
deletion (while the dblock is referenced) will keep the inode around
just long enough.
NeilBrown [Wed, 5 Aug 2009 06:38:08 +0000 (16:38 +1000)]
cleaner: hold ref on inode while preparing to clean blocks.
b->inode does not own a reference to the inode so we need to have some
other way to make sure it never goes invalid.
Normal filesystem references are safe as the VFS will truncate pages
before freeing the inode. But cleaner accesses don't benefit from
that.
Once a block is Pinned (e.g. when B_Realloc) it owns a reference up to
the InoIdx and so the dblock is also referenced. This ensures that
the inode won't go away even on destroy_inode.
However we hold a block for a short period before it is Pinned, so we
must hold a reference on the inode for that period as well.
NeilBrown [Wed, 5 Aug 2009 05:13:24 +0000 (15:13 +1000)]
Make sure cleaner doesn't start up after the FinalCheckpoint
We need to leave the cleaner thread running for checkpoint processing,
but don't want any real cleaning happening after the FinalCheckpoint.
So check and don't start anything new.
NeilBrown [Mon, 3 Aug 2009 01:28:37 +0000 (11:28 +1000)]
Don't use B_Credit to set B_Realloc
We need to keep B_Credit to possibly set B_Dirty (if a B_Realloc
block gets dirtied while being written to the 'clean' cluster we
still need to write it to a normal cluster).
So find B_Realloc from elsewhere, use lafs_space_alloc if needed,
or as a last resort set B_Dirty.
NeilBrown [Mon, 3 Aug 2009 00:36:59 +0000 (10:36 +1000)]
Don't clear B_Realloc when setting B_Dirty
If the block has not yet been allocated to a cluster, then
B_Realloc will be cleared just before cluster allocation, and the
allocate won't happen.
If the block has already been allocated to a cluster for cleaning,
then we need the credit implied by B_Realloc so we shouldn't clear it.
The block will later be written to a normal cluster on the basis of
B_Dirty, so try to forget this cleaning, and in particular don't call
lafs_allocated_block as that will be a waste.....
Instead call lafs_pin_block to ensure that the block gets written out
in this phase, which is as good as cleaning.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
cluster_flush credit handling.
Combine calls to space_use and space_return as they do the same thing.
And move them up to immediately after the credits have been calculated
to avoid possibility of the credit-counter finding an incorrect count
that is only transient.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
Improve prepare_checkpoint locking.
prepare_checkpoint currently takes wc[0].lock.
This is presumably to avoid races with ->checkpointing updates.
However it causes a deadlock if any code that holds the checkpoint
lock needs to flush a cluster or make other updates to a cluster,
which can happen.
So use fs->lock to protect ->checkpointing instead.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
flush_data_to_inode fixes.
When flushing data to the inode, mark the dblock dirty
rather than the iblock. We aren't allowed to mark the iblock
dirty unless it is pinned, which it might not be....
It must be preallocated as the data is dirty, but the iblock doesn't
get pinned until the data is actually written.
As the inode data block is not pinned, it might not be written in
a particular phase, but that isn't a problem as long as it gets
written some time.
Add a BUG_ON if we ever try to dirty a non-Pinned index block again.
They are a problem because non-pinned InoIdx can lose their dblock
and then they fall out of the tree.
Also fix the space_return - we have already returned for the Dirty
flag.
NeilBrown [Sun, 2 Aug 2009 09:29:51 +0000 (19:29 +1000)]
lafs_shrinker fixes.
1/ only clear inode->iblock if the block we are about to
free is the InoIdx block (i.e. is inode->iblock).
2/ Assume that unreffed iblocks are not pinned, they should
never be.
NeilBrown [Sun, 2 Aug 2009 09:29:49 +0000 (19:29 +1000)]
Fix freeing of inodes and ->dblock
The inode and the dblock each have a reference to the other. Neither
is counted because:
We don't want the inode to refcnt the dblock as that wastes space.
We don't want the dblock to refcnt the inode as that stops it from
being freed.
So when either is freed, it must remove the reference from the other.
To ease locking, when the inode is freed it converts the reference,
if present, to a counted reference (using the same rule as
lafs_inode_dblock), then flags the inode for destruction and drops
the reference.
When the last reference to a dblock is dropped, it removes
both references and then calls destroy_inode again.
Note that the dblock only exists while the inode exists - as soon
as the inode is destroyed, any dblock that might be around will
quickly get destroyed too, and the inode destruction is delayed until
this point.
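The sequence above can be sketched as a small user-space model. This is an illustration under assumptions, not the LaFS implementation: the struct and function names are hypothetical, the 'freed' fields stand in for actually releasing memory, locking is omitted, and the second destroy call is assumed to happen only when the inode has been flagged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dblock;

struct toy_inode {
    struct toy_dblock *dblock;  /* uncounted pointer to the dblock */
    bool dying;                 /* flagged for destruction */
    bool freed;                 /* stands in for the real free */
};

struct toy_dblock {
    struct toy_inode *my_inode; /* uncounted pointer to the inode */
    int refcnt;
    bool freed;
};

static void toy_destroy_inode(struct toy_inode *ino);

/* Drop one counted reference.  On the last put, remove both uncounted
 * pointers and, if the inode is waiting to die, destroy it again. */
static void toy_put_dblock(struct toy_dblock *db)
{
    struct toy_inode *ino;

    assert(db->refcnt > 0);
    if (--db->refcnt > 0)
        return;
    ino = db->my_inode;
    db->my_inode = NULL;
    db->freed = true;
    if (ino) {
        ino->dblock = NULL;
        if (ino->dying)
            toy_destroy_inode(ino);
    }
}

/* With a dblock attached: take a counted reference, flag the inode
 * and drop the reference.  Without one: really free the inode. */
static void toy_destroy_inode(struct toy_inode *ino)
{
    struct toy_dblock *db = ino->dblock;

    if (!db) {
        ino->freed = true;
        return;
    }
    db->refcnt++;
    ino->dying = true;
    toy_put_dblock(db);
}
```

If nothing else holds the dblock, a single toy_destroy_inode call frees both objects. If some other holder has a counted reference, the inode stays flagged until that holder calls toy_put_dblock.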
NeilBrown [Sun, 2 Aug 2009 09:29:49 +0000 (19:29 +1000)]
getref_locked fixes.
When lafs_dirty_inode chooses to dirty the dblock, because there
is no iblock, we need to getref_locked that dblock because there
will be no implied reference.
And when doing that, don't dereference ->my_inode unless we are sure
it is valid - i.e. that this is a block in an inode file.
NeilBrown [Sun, 2 Aug 2009 09:29:48 +0000 (19:29 +1000)]
allocated_block: split out part of code for use in phase_flip.
When we flip phase, delayed incorporation needs to include addresses
into the block. But some of the lafs_allocated_block work has
already been done. So split out the rest into a smaller function
for phase_flip to call.
NeilBrown [Sun, 2 Aug 2009 09:29:27 +0000 (19:29 +1000)]
umount fixups - flush in the right place.
We really need to sync the filesystem before the
final checkpoint and other cleanups in lafs_release.
So move them to lafs_put_super so they get done after
the sync in generic_shutdown_super, but before the superblock
is completely destroyed.
NeilBrown [Sun, 2 Aug 2009 09:28:53 +0000 (19:28 +1000)]
cluster_allocate: remove from leafs list if needed.
When lafs_writepage called cluster_allocate, the block could
be on a leafs list. In that case we need to cleanly remove it
from the list as lru is about to be used for a different purpose,
and it doesn't need to be on the list for a while.
NeilBrown [Sun, 2 Aug 2009 09:28:52 +0000 (19:28 +1000)]
dir orphans: always put orphans on the dirorphan list.
Once we call orphan_pin, we must put the block on the
dirorphans lists so that it can be checked and possibly
unorphaned. Otherwise the B_Orphan flag which was set will
never be cleared.
NeilBrown [Sun, 2 Aug 2009 09:28:52 +0000 (19:28 +1000)]
Checkpoint: allocate InoIdx block if data block is dirty.
Sometimes the Inode data block can be dirty without the InoIdx
block being dirty. We still need to write it out, so we move the
'pinning' across from the index to data blocks by calling
allocate here.