NeilBrown [Mon, 9 Aug 2010 10:24:20 +0000 (20:24 +1000)]
Unpin data blocks from previous phase before allowing them to be dirty.
While checkpointing will unpin PinPending blocks, it might not
manage to do it before the block gets Dirtied again.
So before we Pin the block - which is a required precursor to dirtying
it - unpin the block.
NeilBrown [Sun, 1 Aug 2010 03:38:06 +0000 (13:38 +1000)]
Use lafs_iget_fs rather than multiple get_blocks in orphan lookup.
When compacting the orphan table, and so changing the orphan
slot for a block, use lafs_iget_fs to help find the orphan block.
This avoids allocating blocks if the inodes exist (which they
should).
NeilBrown [Sun, 1 Aug 2010 02:43:04 +0000 (12:43 +1000)]
Hold ref on inode during orphan handling.
Orphan handling will shortly drop references to the inode controlling
the orphan block. As run_orphans needs to drop the mutex at the end
it needs to hold another reference too.
If I_Deleting is set, then the db effectively owns a reference,
so no further igrab is needed, nor will it work.
NeilBrown [Sun, 1 Aug 2010 00:57:01 +0000 (10:57 +1000)]
wait more effectively for truncate to progress.
The times that we wait for truncate to progress, we hold
i_mutex, so truncate cannot progress.
So if there is a need to wait, we need to call the orphan
handler directly.
Break linkage between inode and dblock at earliest opportunity.
clear_inode is the first chance to break this linkage,
so do it there.
It is still possible for lafs_iget to get a new inode before
clear_inode has completed, so we need to do the same
test/clear in lafs_iget if b->my_inode is found to be non-NULL.
cleaner: don't iput while still holding a ref to a block.
As the block->inode ref isn't counted, this isn't really safe.
The inode could disappear and the block might not get killed
when the address-space is truncated.
Using the new s_sb_info structure, we add a snapshot number
so we can uniquely identify a snapshot from the superblock and
the 'sget' can be used to find an existing or new superblock.
If it is new, set it up properly as before.
No need to fiddle with 'primary_sb' - we have a ref into it from the
path lookup so it cannot go away, and it shouldn't really matter if it
does.
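The find-existing-or-create-new behaviour described above can be illustrated in user space. This is only a sketch: `struct fake_sb`, `struct sb_key`, `sb_table` and `sget_like` are hypothetical names, not the kernel's sget or the LaFS structures. It assumes a superblock is identified by the pair (filesystem, snapshot number).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical key: which filesystem, and which snapshot of it. */
struct sb_key {
    const void *fs;   /* stands in for the shared 'fs' structure */
    int snapshot;     /* snapshot number, assumed unique within an fs */
};

/* Hypothetical stand-in for a superblock. */
struct fake_sb {
    struct sb_key key;
    int in_use;
    int is_new;       /* set when the caller still has to set it up */
};

#define MAX_SB 8
static struct fake_sb sb_table[MAX_SB];

/* The 'test' step: does this superblock already represent the key? */
static int key_matches(const struct fake_sb *sb, const struct sb_key *k)
{
    return sb->in_use && sb->key.fs == k->fs &&
           sb->key.snapshot == k->snapshot;
}

/* Return the existing superblock for the key, or claim a new slot. */
static struct fake_sb *sget_like(const struct sb_key *k)
{
    struct fake_sb *free_slot = NULL;
    size_t i;

    for (i = 0; i < MAX_SB; i++) {
        if (key_matches(&sb_table[i], k)) {
            sb_table[i].is_new = 0;
            return &sb_table[i];
        }
        if (!sb_table[i].in_use && !free_slot)
            free_slot = &sb_table[i];
    }
    if (!free_slot)
        return NULL;          /* table full */
    free_slot->key = *k;      /* the 'set' step */
    free_slot->in_use = 1;
    free_slot->is_new = 1;
    return free_slot;
}
```

A caller would complete the setup only when `is_new` is set, and otherwise reuse what it was given.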
Neil Brown [Mon, 19 Jul 2010 08:36:47 +0000 (18:36 +1000)]
Change s_fs_info to point to root inode and fs
We create a new data structure containing the 'fs' and the root inode
of a filesystem, and store this in the superblock.
This allows easy access to that root in iget, which previously was
impossible in general.
We want PinPending set whenever a transaction might be in progress to
ensure that write_page doesn't flush the block early, or that the
cleaner doesn't clean the block in the middle.
We also want the block to be completely written if a write has already
been scheduled.
So:
- set PinPending - after getting an IOlock and ensure the block is
not in writeback. This is set before the checkpoint lock is
taken.
- Once we have checkpoint lock and call pin_dblock, wait for
writeout to complete again. This can only be in writeout
if the block is being written to the previous phase, and it
is safe to wait for that inside the checkpoint lock.
1/ write_begin needs to drop the page lock on failure,
and generally clean up properly.
2/ sync_page does not need to 'get_block' as a pointer is
readily available - so just use that with appropriate locking.
Neil Brown [Mon, 12 Jul 2010 10:56:10 +0000 (20:56 +1000)]
Handle fsync of inodes correctly
get rid of lafs_write_inode as it doesn't do the right thing.
Instead, create updates for inode changes only when
fsync is called on an inode. The only other time we
flush out an inode is 'sync()' which does a checkpoint
so achieves the same without updates in clusters.
Fix (Again) handling of new segment for final cluster
There were other things that were being missed when
allocating the final cluster. So change code to take the
same path and make exceptions only where exceptions are clearly
needed.
Only increment pending_nxt when we finish a cluster
cluster_reset is called when we reset a cluster, but also
when we reposition to the start of a new segment - which should be
the same cluster.
pending_nxt should only be changed in the first of those cases.
If there is no space in any 'cleaner segment' to clean to, then
only clean if there are no 'clean' (but not 'free') segments.
As soon as we make a clean segment, we should stop cleaning and
allow a checkpoint to make the clean segment free so maybe more
progress can be made.
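The rule above reduces to a small predicate. A minimal sketch, assuming the cleaner can see whether any cleaner segment still has room and how many segments are clean but not yet free; `cleaner_should_run` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical predicate, not the LaFS code.
 * cleaner_seg_has_space: some cleaner segment can still accept blocks.
 * clean_not_free_segs:   segments already cleaned but awaiting a
 *                        checkpoint before they become free.
 */
static bool cleaner_should_run(bool cleaner_seg_has_space,
                               unsigned clean_not_free_segs)
{
    if (cleaner_seg_has_space)
        return true;
    /* No dedicated space: clean only until one clean segment exists,
     * then stop so a checkpoint can turn it into a free segment. */
    return clean_not_free_segs == 0;
}
```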
Protect directory updates from being hit by writepage.
We don't want writepage flushing a directory block while
we are updating it, or credits can be lost.
So set PinPending and leave it set the whole time. This requires
a change in the handling of PinPending in checkpoint.
We get an iolock on the block to set pinpending to make sure
writepage sees it. This helps make sure we don't change the page
while it is being written.
iget can block if the inode is being initialised or freed
by a different thread. It is not acceptable for the cleaner
to block in these cases as the other thread may need to trigger
a checkpoint.
So use special match/set functions to ensure we never
block, and use B_Async to check if we need to wake the
cleaner when done with an inode.
We don't really want to do anything in lafs_write_super
as we write the superblock when needed anyway.
However lafs_sync_fs needs to do what lafs_write_super was doing,
at least sometimes.
lafs_sync_fs will now force a checkpoint exactly when s_dirt is
set. So revise those settings a little - I think we only want this if
there are dirty inodes to flush, but that needs to be thought about
more when I fix write_inode.
And data blocks in realloc will have been destroyed in
erase_dblock, but there could legitimately be Realloc index blocks
still, so allow them to be handled.
Don't set I_Trunc until pages are invalidated and trunc_next is set.
The block could already be subject to orphan handling, as unlink
sets that up before truncation happens.
So make sure not to set I_Trunc until we are really ready for the
orphan-inode truncation handling to happen.
Without this, truncation can race with the cleaner and weird things
can happen.
As we process blocks from the segment in order, it is best
to keep them in order for later processing.
They will be sorted again when being added to a cluster,
but the more we help here, the better.
newblocks is the count of new blocks written to the filesystem in this
checkpoint (roughly the amount of work that roll-forward would have to
do). We use it to trigger new checkpoints.
So we need to reset it after each checkpoint.
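A user-space sketch of the counter described above. `struct ckpt_state`, `threshold` and both function names are hypothetical, and the trigger condition (a simple fixed threshold) is an assumption; the point is only that the count returns to zero whenever a checkpoint completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state, not the LaFS 'fs' structure. */
struct ckpt_state {
    unsigned long newblocks;    /* blocks written since last checkpoint */
    unsigned long threshold;    /* assumed fixed trigger level */
    unsigned long checkpoints;  /* how many checkpoints have run */
};

static void do_checkpoint(struct ckpt_state *s)
{
    s->checkpoints++;
    s->newblocks = 0;   /* the reset this commit adds */
}

/* Record one new block; returns true if that triggered a checkpoint. */
static bool note_block_written(struct ckpt_state *s)
{
    assert(s->threshold > 0);
    s->newblocks++;
    if (s->newblocks < s->threshold)
        return false;
    do_checkpoint(s);
    return true;
}
```

Without the reset, every write after the first trigger would request another checkpoint.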
Make sure the segment being written is never cleaned.
Cleaning the current segment would be a bad idea as its
usage count isn't really representative of anything useful.
So leave it in the table flags as 'active' to avoid it
becoming cleanable, and remove it when the segment is finished
with.
More issues with wc->seg being explicitly unset at certain times.
We need to clear wc->seg at the end of a cleaning segment when we
choose not to add another, and we need to cope correctly when such
a segment is found.
When we unified the two loops in lafs_cluster_allocate,
we broke handling for a nearly-full cluster. We need to
require wc->remaining is at least 1 before we even
consider a cluster_insert.
NeilBrown [Tue, 29 Jun 2010 11:45:52 +0000 (21:45 +1000)]
Remove call to lafs_io_wake in lafs_cluster_allocate
We only need io_wake when we unlock or clear writeback, and neither of
those happens here, so discard the call, put in a 'break' instead and
turn the (hard to read) do loop into a for loop.
NeilBrown [Tue, 29 Jun 2010 11:34:24 +0000 (21:34 +1000)]
combine two loops in lafs_cluster_allocate
There are two loops which try to cluster_flush to get enough space.
One calls new_segment, the other assumes cluster_flush will do
that, which it might or might not.
Combine these into a single loop, and move the handling of
error from new_segment closer to the call.
NeilBrown [Mon, 28 Jun 2010 22:04:45 +0000 (08:04 +1000)]
Handle write clusters which point to themselves.
This can happen at the end of a cleaner segment, and in
general it is best to be cautious. So if the next pointer
isn't further along in this segment, don't follow it.
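The check described above might look like the following in isolation. It is a sketch with hypothetical names and plain block addresses; the real cluster headers and address encoding are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether to follow a cluster's 'next'
 * pointer. cur and next are block addresses; the segment covers
 * [seg_start, seg_start + seg_len).
 */
static bool should_follow_next(uint64_t cur, uint64_t next,
                               uint64_t seg_start, uint64_t seg_len)
{
    uint64_t seg_end = seg_start + seg_len;

    assert(cur >= seg_start && cur < seg_end);
    if (next <= cur)
        return false;   /* points at itself or backwards */
    if (next >= seg_end)
        return false;   /* not in this segment */
    return true;
}
```

Rejecting `next <= cur` covers the self-pointing case and also guarantees that a walk along the chain terminates.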
NeilBrown [Mon, 28 Jun 2010 10:27:44 +0000 (20:27 +1000)]
Add EmergencyClean mode
In this mode we are nearly full.
Cleaning just goes for the segment with the most space
even if it is quite new.
Allocation failures return ENOSPC rather than EAGAIN
We clean even if it doesn't look like it will help much.
The heuristic for switching in and out is rather odd...
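Two of the behaviours listed above can be sketched in user space. All names here are hypothetical, and "most space" is assumed to mean the largest count of reclaimable blocks; the mode-switching heuristic is not shown because the log does not describe it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-segment summary. */
struct seg_info {
    unsigned free_blocks;   /* reclaimable blocks in the segment */
    unsigned age;           /* deliberately ignored in emergency mode */
};

/* Allocation failure is final in emergency mode, retryable otherwise. */
static int alloc_failure_errno(bool emergency_clean)
{
    return emergency_clean ? -ENOSPC : -EAGAIN;
}

/* Pick the segment with the most reclaimable space, whatever its age.
 * Returns its index, or -1 if there are no segments. */
static int pick_emergency_victim(const struct seg_info *segs, int n)
{
    int best = -1;
    int i;

    for (i = 0; i < n; i++)
        if (best < 0 || segs[i].free_blocks > segs[best].free_blocks)
            best = i;
    return best;
}
```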
NeilBrown [Mon, 28 Jun 2010 09:40:32 +0000 (19:40 +1000)]
Reserve space for cleaner segments.
Now that we can reserve space specifically for cleaner
segments, do so and limit the number of cleaned segments
to the available number of cleaner segments.
NeilBrown [Mon, 28 Jun 2010 08:10:51 +0000 (18:10 +1000)]
Revise space allocation for cleaning.
We prefer to allocate whole segments for cleaning, but can only
do that if there is enough space.
If we cannot allocate whole segments, then just cleaning to the
main segment is perfectly acceptable.
So allow a 'clean_reserved' number which is a number of blocks that
have been reserved for cleaning - normally some number of segments.
The cleaner writes whole segments while this number is big enough,
then gives up so the remainder will go to the main segment and not
create partial clean segments.
CleanSpace now never fails. The next patch will cause the cleaner
to be more reserved in how much it asks for.
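A sketch of the whole-segments-only accounting described above, assuming the reservation is counted in blocks and a segment has a fixed size. `struct clean_budget` and `take_cleaner_segment` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting state. */
struct clean_budget {
    uint64_t clean_reserved;  /* blocks reserved for cleaning */
    uint64_t seg_blocks;      /* blocks per segment, assumed fixed */
};

/*
 * Try to claim one whole segment from the reservation. On failure the
 * caller writes the remainder to the main segment rather than starting
 * a partial clean segment.
 */
static bool take_cleaner_segment(struct clean_budget *b)
{
    assert(b->seg_blocks > 0);
    if (b->clean_reserved < b->seg_blocks)
        return false;
    b->clean_reserved -= b->seg_blocks;
    return true;
}
```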
NeilBrown [Mon, 28 Jun 2010 05:29:44 +0000 (15:29 +1000)]
Report directory size without holes.
Holes in a directory are an implementation detail that does not need
to be exposed in i_size, and doing so is confusing and could leak info
about the hash used.
So when more than one block is used, report size as block size times
number of blocks.
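The size rule above, as a standalone sketch. `dir_reported_size` is a hypothetical helper, and leaving the size unchanged when only one block is used is an assumption, since the message only covers the multi-block case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the LaFS function.
 * raw_size: i_size as stored, which may reflect hash-placed holes.
 * nblocks:  number of blocks actually allocated to the directory.
 * blkbits:  log2 of the block size.
 */
static uint64_t dir_reported_size(uint64_t raw_size, uint64_t nblocks,
                                  unsigned blkbits)
{
    assert(blkbits < 64);
    if (nblocks > 1)
        return nblocks << blkbits;   /* block size times block count */
    return raw_size;   /* assumption: single-block case is unchanged */
}
```

With 4096-byte blocks, a directory holding three blocks would report 12288 bytes however far apart the hash places them.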