NeilBrown [Sat, 15 Aug 2009 07:09:22 +0000 (17:09 +1000)]
Simplify writeout rules for inode data block.
Previously we did not write an inode data block until the
InoIdx was ready.
This is not good if we need to sync an inode well before checkpoint
runs.
So just write an inode data block when we find it, but make sure not to
send an inode data block during checkpoint until the InoIdx block is
ready.
Note: it is now clear why an inode has two sets of credits, one on the
data block and one on the index block. The first set may be needed to
sync the inode metadata. The second may be needed to update the
indexing information - they are copied across to the data block for
this purpose.
NeilBrown [Sat, 15 Aug 2009 07:09:20 +0000 (17:09 +1000)]
Simplify iolocking in get_flushable
The difference between data and index block is not really supportable,
and we cannot just avoid waiting for some blocks.
But we cannot always do a full iowait, as blocks that have been
allocated to a cluster do not complete until the cluster is written
and we don't want to wait for a cluster to be written, especially as
we are the thread that is supposed to do that.
So create an intermediate iowait which waits for iolock to be dropped
or the block to be placed on a list. Once it is on a list we can be
sure not to lose it.
So we wait while incorporation or truncation happens, but not while
writeout happens.
NeilBrown [Sat, 15 Aug 2009 07:03:19 +0000 (17:03 +1000)]
Avoid races when processing blocks for checkpoint.
When doing a checkpoint we need to be sure that every
flushable block is accounted for. So we need to be able
to wait for anything outstanding.
Currently a block can be removed from the leaflist by writepage
before it is placed on the cluster list. During this window
the checkpoint thread cannot see it and so might progress without
waiting for it, and so will think it has completed prematurely.
So delay the removal from the leaflist until after we have the
writecluster lock. This assures that every leaf block will be
either on the leaf list or on the cluster list when checkpoint is
looking for it.
When checkpoint calls cluster_done, this will release any blocks
from the cluster and, if needed, put them back on the leaf list where
they can be found again.
NeilBrown [Sat, 15 Aug 2009 05:52:56 +0000 (15:52 +1000)]
Remove dirtying of InoIdx in place of inode data block.
I don't remember why this was here, but until very recently
the code was wrong so it didn't do the "right" thing anyway.
And it doesn't seem to make sense.
When we dirty a dblock, we really want it to be dirty.
We might then write it and roll-forward will be able to pick
it all up except the index information which is always
calculated from addresses that are actually found.
NeilBrown [Sat, 15 Aug 2009 05:52:49 +0000 (15:52 +1000)]
Don't set Valid when setting Dirty.
While a block must be Valid to be Dirty, it is best to
only set Valid when actually putting data in the block,
and then check that it is Valid when marking it Dirty.
That can catch more bugs.
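
The rule above can be sketched in a few lines of stand-alone C. This is
only an illustration, not the LaFS code: the sk_ names and flag bits are
hypothetical, and where the kernel would probably use a BUG_ON the sketch
just returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the Valid and Dirty block flags. */
enum { SK_VALID = 1u << 0, SK_DIRTY = 1u << 1 };

struct sk_block {
	unsigned flags;
	char data[16];
};

/* Only the code that actually puts data in the block sets Valid. */
void sk_fill(struct sk_block *b, char c)
{
	assert(b);
	for (unsigned i = 0; i < sizeof b->data; i++)
		b->data[i] = c;
	b->flags |= SK_VALID;
}

/* Marking a block dirty checks Valid instead of setting it.
 * Returns false for a block that was never filled, so a caller
 * that forgot to fill the block is caught rather than hidden. */
bool sk_mark_dirty(struct sk_block *b)
{
	assert(b);
	if (!(b->flags & SK_VALID))
		return false;
	b->flags |= SK_DIRTY;
	return true;
}
```

If sk_mark_dirty set Valid itself, a caller that dirtied an unfilled
block would go unnoticed and garbage could be written out.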
NeilBrown [Sat, 15 Aug 2009 05:49:46 +0000 (15:49 +1000)]
inode: Fix problems at inode creation.
- we need to pin the new inode dblock to ensure it gets written
- we don't need to dirty it in inode_map_new_commit as it is already
dirtied in lafs_inode_init.
NeilBrown [Sun, 9 Aug 2009 05:50:54 +0000 (15:50 +1000)]
Don't insist on having UnincCredits for all Index blocks.
When an Index block has room for new addresses, it does not need to
have an UnincCredit because we know it will not split before more
credits are available.
NeilBrown [Sat, 8 Aug 2009 04:23:40 +0000 (14:23 +1000)]
Make sure inodes don't get forgotten during cleaning.
During cleaning and other times when inodes might have dirty
index blocks we don't want the inode to be pushed out due to
apparently not being in use.
However there is a difficulty in holding a reference on the inode as
that causes the truncate following a final unlink to be delayed.
So to compromise, whenever the InoIdx block for an inode is pinned,
hold a reference to the inode as long as the link count is non-zero.
Once the link count becomes zero, we drop the extra ref and if this
leads to the inode being deleted, the current delaying of this
deletion (while the dblock is referenced) will keep the inode around
just long enough.
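
A minimal model of the compromise described above, assuming a plain
integer refcount. The sk_ names are hypothetical and none of the real
locking or VFS interaction is shown. The extra reference is held exactly
when the InoIdx block is pinned and the link count is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_inode {
	int refs;        /* counted references on the inode */
	int nlink;       /* link count */
	bool idx_pinned; /* the InoIdx block is pinned */
	bool extra_ref;  /* the pin currently holds a reference */
};

/* Re-establish the rule after any change to idx_pinned or nlink. */
void sk_sync_extra_ref(struct sk_inode *ino)
{
	bool want = ino->idx_pinned && ino->nlink > 0;

	if (want && !ino->extra_ref) {
		ino->refs++;
		ino->extra_ref = true;
	} else if (!want && ino->extra_ref) {
		assert(ino->refs > 0);
		ino->refs--;
		ino->extra_ref = false;
	}
}

void sk_pin_idx(struct sk_inode *ino)
{
	ino->idx_pinned = true;
	sk_sync_extra_ref(ino);
}

void sk_unpin_idx(struct sk_inode *ino)
{
	ino->idx_pinned = false;
	sk_sync_extra_ref(ino);
}

void sk_unlink(struct sk_inode *ino)
{
	assert(ino->nlink > 0);
	ino->nlink--;
	sk_sync_extra_ref(ino); /* drops the extra ref once nlink hits 0 */
}
```

With this rule a final unlink releases the pin's reference at once, so
the truncate that follows the unlink is not held up by the pin.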
NeilBrown [Wed, 5 Aug 2009 06:38:08 +0000 (16:38 +1000)]
cleaner: hold ref on inode while preparing to clean blocks.
b->inode does not own a reference to the inode so we need to have some
other way to make sure it never goes invalid.
Normal filesystem references are safe as the VFS will truncate pages
before freeing the inode. But cleaner accesses don't benefit from
that.
Once a block is Pinned (e.g. when B_Realloc) it owns a reference up to
the InoIdx and so the dblock is also referenced. This ensures that
the inode won't go away even on destroy_inode.
However we hold a block for a short period before it is Pinned, so we
must hold a reference on the inode for that period as well.
NeilBrown [Wed, 5 Aug 2009 05:13:24 +0000 (15:13 +1000)]
Make sure cleaner doesn't start up after the FinalCheckpoint
We need to leave the cleaner thread running for checkpoint processing,
but don't want any real cleaning happening after the FinalCheckpoint.
So check and don't start anything new.
NeilBrown [Mon, 3 Aug 2009 01:28:37 +0000 (11:28 +1000)]
Don't use B_Credit to set B_Realloc
We need to keep B_Credit to possibly set B_Dirty (if a B_Realloc
block gets dirtied while being written to the 'clean' cluster we
still need to write it to a normal cluster).
So find B_Realloc from elsewhere: use lafs_space_alloc if needed,
or as a last resort set B_Dirty.
NeilBrown [Mon, 3 Aug 2009 00:36:59 +0000 (10:36 +1000)]
Don't clear B_Realloc when setting B_Dirty
If the block has not yet been allocated to a cluster, then
B_Realloc will be cleared just before cluster allocation, and the
allocate won't happen.
If the block has already been allocated to a cluster for cleaning,
then we need the credit implied by B_Realloc so we shouldn't clear it.
The block will later be written to a normal cluster on the basis of
B_Dirty, so try to forget this cleaning, and in particular don't call
lafs_allocated_block as that will be a waste.....
Instead call lafs_pin_block to ensure that the block gets written out
in this phase, which is as good as cleaning.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
cluster_flush credit handling.
Combine calls to space_use and space_return as they do the same thing.
And move them up to immediately after the credits have been calculated
to avoid possibility of the credit-counter finding an incorrect count
that is only transient.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
Improve prepare_checkpoint locking.
prepare_checkpoint currently takes wc[0].lock.
This is presumably to avoid races with ->checkpointing updates.
However it causes a deadlock if any code that holds the checkpoint
lock needs to flush a cluster or make other updates to a cluster,
which can happen.
So use fs->lock to protect ->checkpointing instead.
NeilBrown [Sun, 2 Aug 2009 09:29:52 +0000 (19:29 +1000)]
flush_data_to_inode fixes.
When flushing data to the inode, mark the dblock dirty
rather than the iblock. We aren't allowed to mark the iblock
dirty unless it is pinned, which it might not be....
It must be preallocated as the data is dirty, but the iblock doesn't
get pinned until the data is actually written.
As the inode data block is not pinned, it might not be written in
a particular phase, but that isn't a problem as long as it gets
written some time.
Add a BUG_ON if we ever try to dirty a non-Pinned index block again.
They are a problem because non-pinned InoIdx can lose their dblock
and then they fall out of the tree.
Also fix the space_return - we have already returned for the Dirty
flag.
NeilBrown [Sun, 2 Aug 2009 09:29:51 +0000 (19:29 +1000)]
lafs_shrinker fixes.
1/ only clear inode->iblock if the block we are about to
free is the InoIdx block (i.e. is inode->iblock).
2/ Assume that unreffed iblocks are not pinned, as they should
never be.
NeilBrown [Sun, 2 Aug 2009 09:29:49 +0000 (19:29 +1000)]
Fix freeing of inodes and ->dblock
The inode and the dblock each have a reference to the other. Neither
is counted because:
We don't want the inode to refcnt the dblock as that wastes space.
We don't want the dblock to refcnt the inode as that stops it from
being freed.
So when either is freed, it must remove the reference from the other.
To ease locking, when the inode is freed it converts the reference,
if present, to a counted reference (using the same rule as
lafs_inode_dblock), then flags the inode for destruction and drops
the reference.
When the last reference to a dblock is dropped, it removes
both references and then calls destroy_inode again.
Note that the dblock only exists while the inode exists - as soon
as the inode is destroyed, any dblock that might be around will
quickly get destroyed too, and the inode destruction is delayed until
this point.
NeilBrown [Sun, 2 Aug 2009 09:29:49 +0000 (19:29 +1000)]
getref_locked fixes.
When lafs_dirty_inode chooses to dirty the dblock, because there
is no iblock, we need to getref_locked that dblock because there
will be no implied reference.
And when doing that, don't dereference ->my_inode unless we are sure
it is valid - i.e. that this is a block in an inode file.
NeilBrown [Sun, 2 Aug 2009 09:29:48 +0000 (19:29 +1000)]
allocated_block: split out part of code for use in phase_flip.
When we flip phase, delayed incorporation needs to include addresses
into the block. But some of the lafs_allocated_block work has
already been done. So split out the rest into a smaller function
for phase_flip to call.
NeilBrown [Sun, 2 Aug 2009 09:29:27 +0000 (19:29 +1000)]
umount fixups - flush in the right place.
We really need to sync the filesystem before the
final checkpoint and other cleanups in lafs_release.
So move them to lafs_put_super so they get done after
the sync in generic_shutdown_super, but before the superblock
is completely destroyed.
NeilBrown [Sun, 2 Aug 2009 09:28:53 +0000 (19:28 +1000)]
cluster_allocate: remove from leafs list if needed.
When lafs_writepage called cluster_allocate, the block could
be on a leafs list. In that case we need to cleanly remove it
from the list as lru is about to be used for a different purpose,
and it doesn't need to be on the list for a while.
NeilBrown [Sun, 2 Aug 2009 09:28:52 +0000 (19:28 +1000)]
dir orphans: always put orphans on the dirorphan list.
Once we call orphan_pin, we must put the block on the
dirorphans lists so that it can be checked and possibly
unorphaned. Otherwise the B_Orphan flag which was set will
never be cleared.
NeilBrown [Sun, 2 Aug 2009 09:28:52 +0000 (19:28 +1000)]
Checkpoint: allocate InoIdx block if data block is dirty.
Sometimes the Inode data block can be dirty without the InoIdx
block being dirty. We still need to write it out, so we move the
'pinning' across from the index to data blocks by calling
allocate here.
NeilBrown [Sun, 2 Aug 2009 09:28:51 +0000 (19:28 +1000)]
Fix sort/splitting in add_cleanable.
When the cleanable list reaches a certain length, we sort it and
discard the half which are least ready to be cleaned.
That code was broken and is now fixed.
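
As an illustration of the sort-and-discard step, here is a stand-alone
sketch. It is not the add_cleanable code: the sk_ names, the fixed-size
array and the "score" field (lower means more ready to clean) are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sk_seg {
	unsigned segnum;
	unsigned score; /* hypothetical: lower = more ready to be cleaned */
};

/* qsort comparator: ascending score, so the most ready come first. */
static int sk_cmp_score(const void *a, const void *b)
{
	const struct sk_seg *x = a, *y = b;

	return (x->score > y->score) - (x->score < y->score);
}

/* Add one segment to the list and return the new length.
 * When the list is full, sort it and keep only the more-ready half
 * before appending. */
size_t sk_add_cleanable(struct sk_seg *list, size_t len, size_t cap,
			struct sk_seg seg)
{
	assert(cap >= 2 && len <= cap);
	if (len == cap) {
		qsort(list, len, sizeof list[0], sk_cmp_score);
		len = (len + 1) / 2; /* discard the least-ready half */
	}
	list[len++] = seg;
	return len;
}
```

Rounding the kept half up means at least one old entry always survives,
and the list only needs sorting when it overflows.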
When segments are referenced by the free/clean segment table,
it implies a reference to the segusage blocks.
So when we remove things from the table, drop the reference.
During roll-forward we mustn't load segusage blocks until we have
processed the checkpoint. During that time segusage updates must be
delayed.
So make sure phase and qphase have suitable setting, and delay the
load where appropriate.
This allows us to remove 'in_checkpoint' as it can be determined from
qphase.
When truncating a file, we need to update the segusage counts
for all relevant segments. This can require reading segusage blocks
while walking a leaf index. So we cannot use kmap_atomic in
lafs_walk_leaf_index. So use kmap instead.
When an inode block is to be marked dirty, make the
InoIdx block dirty instead. When it gets written, the dirtiness
then gets moved to the data block and it gets written.
Also, don't unpin the InoIdx block while the data block
is dirty. This probably shouldn't happen, but adding the test
is safest.
Every dirty or pinned block needs a reference to the segment usage
count for the segment holding that block.
This must be a writable reference so that when the block is written
the segusage can be decremented.
A writable reference requires a pinned block so the segusage block
must hold segref as well. This can recurse a couple of levels.
So introduce lafs_seg_ref_block which walks through the recursion
and gets some seg references at the bottom of the recursion.
Then have lafs_reserve_block call seg_ref_block repeatedly until
everything required has a SegRef.
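
The "call repeatedly until done" shape can be sketched as below. This is
a toy model under stated assumptions, not the real lafs_seg_ref_block:
each block points at the block holding its segment usage count (NULL at
the bottom of the recursion), the chain is assumed to be acyclic, and
taking a reference is modelled as setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_block {
	struct sk_block *usage; /* block holding our segusage count, or
				 * NULL at the bottom of the recursion */
	bool segref;            /* this block holds its SegRef */
};

/* Walk down the recursion to the deepest block that still lacks a
 * SegRef and give that one block its reference.  Each call makes
 * exactly one step of progress. */
void sk_seg_ref_block(struct sk_block *b)
{
	assert(b);
	while (b->usage && !b->usage->segref)
		b = b->usage;
	b->segref = true;
}

/* Keep calling until the block we care about has its SegRef.
 * Returns the number of calls that were needed. */
int sk_reserve_block(struct sk_block *b)
{
	int calls = 0;

	while (!b->segref) {
		sk_seg_ref_block(b);
		calls++;
	}
	return calls;
}
```

Doing one reference per call keeps each step simple: a step that might
block or fail in the real code can be retried without unwinding the
others.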
lafs_invalidate_page: don't erase blocks in special files.
Special internal files - particularly the inode file - do not
keep i_size up-to-date (it really isn't needed) so erasing
blocks beyond i_size is inappropriate.
All internal files do their own block erasure as needed without
depending on the setting of i_size, so simply ignore the
size on these files.
NeilBrown [Thu, 19 Mar 2009 23:01:36 +0000 (10:01 +1100)]
Make sure all global functions have distinctive names.
All non-static functions should have names starting lafs_ or
_lafs_. A few had slipped in that didn't, so resolve them by either
making them static or changing their name.
NeilBrown [Thu, 19 Mar 2009 06:16:42 +0000 (17:16 +1100)]
Annotate getref depending on whether '0' is safe.
Any place where we might be getting the first reference on
a block, use get?ref_locked, and put appropriate checks in
getref and getref_locked to make sure they are used that way.
This required rearranging lafs_refile so that we didn't drop
the last ref if we were about to add a ref on _leafs.
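
The kind of check being described might look like the sketch below. It
is an assumption-laden illustration: the sk_ names are hypothetical, the
lock is a plain flag rather than a spinlock, and assert() stands in for
whatever checks the real getref and getref_locked use.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_block {
	int refcnt;
};

struct sk_lock {
	bool held; /* stand-in for "the protecting lock is held" */
};

/* Lockless variant: the caller must already own a reference, so a
 * count of 0 here means the block could be freed under us. */
void sk_getref(struct sk_block *b)
{
	assert(b->refcnt > 0);
	b->refcnt++;
}

/* Locked variant: allowed to take the very first reference, but only
 * while the lock that stops the block being freed is held. */
void sk_getref_locked(struct sk_block *b, const struct sk_lock *lk)
{
	assert(lk->held);
	b->refcnt++;
}
```

Splitting the two cases makes every call site state whether a count of
'0' is safe there, and the checks catch the ones that get it wrong.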
NeilBrown [Thu, 19 Mar 2009 04:12:37 +0000 (15:12 +1100)]
Remove list_del from ihash_lookup
We don't want these list_dels.
One removes from the global free list, which happens elsewhere based
on B_OnFree.
The other removes from the per-inode free list, which happens
elsewhere based on ->parent.
NeilBrown [Thu, 19 Mar 2009 01:48:31 +0000 (12:48 +1100)]
Don't erase inode dblock during incorporation.
When incorporation notices that the indexes are all gone
and the inode is ready to die, it should *not* erase the
dblock. That gets things out of order. The dblock gets
erased by the orphan handler which will still be overseeing
the operation.
NeilBrown [Thu, 19 Mar 2009 01:48:20 +0000 (12:48 +1100)]
Further Truncate fixes.
In particular, don't use lafs_find_next to find the next place to
truncate. This can instantiate data blocks past EOF which is confusing.
Just use lafs_leaf_find and only ever instantiate index blocks.
NeilBrown [Thu, 19 Mar 2009 01:47:37 +0000 (12:47 +1100)]
Remove need to hold i_mutex when flushing data into inode
Holding i_mutex while flushing data into the inode is a problem because
it can trigger deadlocks if something holds i_mutex and waits for
checkpoint to finish - checkpoint might do the flush.
So check the i_size after clearing B_Dirty. If it changes after this,
we will just do another write-out and fix things up.
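
The ordering can be shown with a single-threaded sketch. It assumes that
any later change of size also re-dirties the data, which is what makes a
second write-out happen; the sk_ names and the fixed buffer sizes are
hypothetical and no real locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { SK_INLINE_MAX = 32 };

struct sk_inode {
	size_t i_size;
	bool data_dirty;                 /* stands in for B_Dirty */
	char page[SK_INLINE_MAX];        /* file data */
	char inline_copy[SK_INLINE_MAX]; /* copy held in the inode */
	size_t inline_len;
};

/* A write changes the size and (assumed here) dirties the data. */
void sk_write(struct sk_inode *ino, const char *src, size_t n)
{
	assert(n <= SK_INLINE_MAX);
	memcpy(ino->page, src, n);
	ino->i_size = n;
	ino->data_dirty = true;
}

/* Flush without holding i_mutex: clear the dirty flag first and only
 * then sample i_size.  A size change after that point sets the flag
 * again, so a later flush repairs the copy.  Returns true if a flush
 * was done. */
bool sk_flush_data_to_inode(struct sk_inode *ino)
{
	size_t n;

	if (!ino->data_dirty)
		return false;
	ino->data_dirty = false;
	n = ino->i_size;
	memcpy(ino->inline_copy, ino->page, n);
	ino->inline_len = n;
	return true;
}
```

Sampling i_size before clearing the flag would leave a window in which
a size change is neither copied nor flagged for another write-out.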
NeilBrown [Thu, 19 Mar 2009 01:46:18 +0000 (12:46 +1100)]
Various fixes for truncation (inode orphan handling).
- Don't update ->trunc_next until we do the truncation
(it is possible to retry if reserve_block fails)
- Search down the index tree properly (ib2, not ib!)
- set 'adopt' for lafs_leaf_find - very important.
- general code tidy up.