NeilBrown [Mon, 28 Jun 2010 02:33:18 +0000 (12:33 +1000)]
Fix incorrect de-ref of ->my_inode
my_inode may not be set on a block on writeout - e.g. if it was
just cleaned.
As my_inode does not disappear once set (while a ref is held
on the block) it is safe to simply test it.
NeilBrown [Mon, 28 Jun 2010 02:19:39 +0000 (12:19 +1000)]
Fix races between truncate and cleaner.
Not only do we need to recheck the size after putting
the block on the clean list, we also need to check
for inodes that have been cleared. (type == 0).
NeilBrown [Sat, 26 Jun 2010 03:26:37 +0000 (13:26 +1000)]
Pin SegmentMap blocks when they might need to be dirtied.
This is more in keeping with other practices, and now that SegmentMap
blocks are handled carefully by the cleaner and checkpoint, it is
safe to do this.
They stay pinned until they are no longer referenced. This may keep
some Credits unavailable but that is not a big cost.
When we pin these we don't hold or need a phase lock, so don't
require it.
Now that we always pin segments when first used (free_get) we don't
need to prealloc in lafs_seg_move.
NeilBrown [Sat, 26 Jun 2010 01:38:08 +0000 (11:38 +1000)]
Revise which blocks need N* credits.
I think it is just those that might be phase-flipped.
Let's hope that is right.
I think the others were there due to cleaning issues which
have now been resolved.
NeilBrown [Sat, 26 Jun 2010 01:26:46 +0000 (11:26 +1000)]
Be Careful about cleaning PinPending blocks.
PinPending blocks must never be written to the cleaner segment
as they might still get dirtied and need to be written in this phase,
but the cleaner will have taken their uninc credit.
So if we need to clean a PinPending block, just mark it dirty and
wait for it to be unpinned or written normally.
NeilBrown [Sat, 26 Jun 2010 01:16:10 +0000 (11:16 +1000)]
Change flushing of space-accounting blocks.
Space-accounting blocks need to be flushed very late in the
checkpoint.
We were special casing these, but in an awkward way.
Change it so that these blocks are pinned, but that a checkpoint
doesn't handle them straight away but rather performs a phase_flip
and then queues them for later handling.
This means that we get more consistent behaviour of pinned data blocks
and writepage doesn't need to special-case the flushing of segment
usage blocks.
NeilBrown [Fri, 25 Jun 2010 10:34:52 +0000 (20:34 +1000)]
Add debug tracing to unlink.
Have had a strange case of unlink failing to find the target
file. So add lots of tracing in the hope that it will happen
again.
It might be sensitive to the hash chosen.
NeilBrown [Fri, 25 Jun 2010 09:17:29 +0000 (19:17 +1000)]
Add cluster list tracking to print_tree
In print tree, where we try to print which 'lru' list a block
is on, also check the write-cluster lists, both in preparation,
and waiting for Writeout.
NeilBrown [Fri, 25 Jun 2010 09:15:32 +0000 (19:15 +1000)]
Add has_ref to help debugging.
It is sometimes helpful to BUG_ON whether a block has a certain
ref or not.
So add "has_ref" which returns -1 if we don't know (debugging
disabled) or 0/1 depending on whether ref is held.
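The tri-state return value can be sketched as follows. This is only an illustration, not the LaFS code: the struct, the per-reference bitmask, the `REF_DEBUG_EXAMPLE` switch and every `_example` name are assumptions made for this sketch.

```c
#include <assert.h>

#define REF_DEBUG_EXAMPLE 1	/* hypothetical switch; 0 disables ref tracking */

/* Hypothetical reference kinds, one bit each in ref_mask. */
enum { REF_PRIMARY_EXAMPLE = 0, REF_CLEANER_EXAMPLE = 1 };

struct block_example {
	unsigned int ref_mask;	/* only meaningful when tracking is enabled */
};

/* Returns 1 if the ref is held, 0 if it is not, -1 if we cannot know. */
static int has_ref_example(const struct block_example *b, int ref)
{
	assert(ref >= 0 && ref < 32);
#if REF_DEBUG_EXAMPLE
	return (b->ref_mask >> ref) & 1u;
#else
	(void)b;
	return -1;
#endif
}
```

With this shape, an assertion written as "has_ref(...) == 0 is a bug" fires only when the ref is known to be missing, and stays quiet when debugging is disabled.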
NeilBrown [Wed, 23 Jun 2010 07:00:23 +0000 (17:00 +1000)]
Allow writers to block while the cleaner makes a little progress.
We record how much progress is required, and allow writers to wait
for that much progress to happen at which point a checkpoint
happens.
Also add 'free_segs' similar to 'free_blocks' (and counting blocks)
which counts the number of blocks in free segs, not including
any current segs. This forces allocators to wait sooner and
may be more appropriate.
NeilBrown [Fri, 25 Jun 2010 06:30:18 +0000 (16:30 +1000)]
Improve flushing of 'cleaner' clusters.
There is no need for the cleaner to ever wait for blocks which
have been written. Once the write has been requested, the block
will not be Realloc any more, and so will not get back onto the
clean_leaf list anyway.
So just flush out the cluster when everything is done.
Earlier flushes will happen when a segment gets full.
Just to be safe, also flush the cleaner cluster before a checkpoint.
If there is anything awaiting flushing, it will be invisible
to the checkpoint process, but will hold other blocks pinned
so the checkpoint will not be able to proceed.
NeilBrown [Wed, 23 Jun 2010 02:29:57 +0000 (12:29 +1000)]
Don't allow memory flush to write out segusage blocks.
We always write these after a checkpoint, and there is little to be
gained by writing them earlier, and doing so causes their
dirty status to be lost, which is bad.
They should be treated much like PinPending blocks, but they
are not PinPending as they are written later in the checkpoint.
NeilBrown [Wed, 23 Jun 2010 02:23:50 +0000 (12:23 +1000)]
Fix inode_orphan_handler issues.
1/ a stray ';' caused a while loop not to work.
2/ If the for loop finds a block with a 'primary' reference,
just incorporating it won't help. We need to find the last
block so we know it has no primary reference, so it will
get unpinned by the lafs_cluster_allocate call, and so
will remove the primary reference.
NeilBrown [Tue, 22 Jun 2010 05:32:05 +0000 (15:32 +1000)]
Honour EmptyIndex during index lookup.
If we find an EmptyIndex that isn't first in the parent, we must
choose an earlier block.
We must check EmptyIndex after getting the lock.
If we have to drop a lock to do IO, and the unlocked block has a
->parent pointer, then we need to retry from the top.
'next' needs special care as it could point to an
EmptyIndex block, so it is possible for leafs earlier in the tree to
have higher fileaddr (unlikely but possible).
NeilBrown [Tue, 22 Jun 2010 02:35:32 +0000 (12:35 +1000)]
Add EmptyIndex flag.
This signals that an index block is known to be empty and
should normally be ignored.
It may never be set on an InoIdx block.
Normally it stays set once set. However for the first index block
in a parent, it can be cleared again if any children appear.
NeilBrown [Tue, 22 Jun 2010 02:32:07 +0000 (12:32 +1000)]
Filter empty block from uninc chain before incorporation.
It is possible (though unusual) for an uninc chain to have
two blocks with the same fileaddr, one that is empty and being ignored,
and one that is newly split off and needs to be incorporated.
We need to detect this possibility after sorting and discard the
empty block so it doesn't confuse further incorporation.
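A minimal sketch of this kind of post-sort filtering, using a plain array instead of the real uninc chain. The struct, its fields and the function name are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entry on a sorted uninc chain. */
struct uninc_example {
	uint32_t fileaddr;
	int empty;	/* non-zero: block is empty and being ignored */
};

/*
 * v[] is already sorted by fileaddr.  Drop every empty entry that has a
 * non-empty neighbour with the same fileaddr, compacting in place.
 * Returns the new number of entries.
 */
static size_t drop_empty_duplicates_example(struct uninc_example *v, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		int dup_prev = i > 0 && !v[i - 1].empty &&
			       v[i - 1].fileaddr == v[i].fileaddr;
		int dup_next = i + 1 < n && !v[i + 1].empty &&
			       v[i + 1].fileaddr == v[i].fileaddr;

		if (v[i].empty && (dup_prev || dup_next))
			continue;	/* discard the ignored empty block */
		v[out++] = v[i];
	}
	return out;
}
```

An empty entry with no same-address partner is kept, so the filter only removes the ambiguity and never loses an address outright.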
NeilBrown [Tue, 22 Jun 2010 02:13:45 +0000 (12:13 +1000)]
Delay hashing of index blocks until they are incorporated.
We don't need an index block in the hash table until its
address is in the parent, as until then we will never try a lookup.
And it is good to delay it as it is possible for there to be two
blocks with the same address, one that is empty and thus ignored
mostly, and one that has since split off an earlier child.
While this is unlikely, we don't want that split-off block to
appear in the hash table until both have been incorporated.
NeilBrown [Mon, 21 Jun 2010 04:55:26 +0000 (14:55 +1000)]
allocate_block fixes.
Try to set up slightly new rules....
1/ Adding an address as unincorporated requires just a spinlock,
the inode private_lock
2/ The block with an address being added is not iolock, but is
Writebehind
3/ incorporation happens under iolock, removing the list of
pending addresses also happens here. so under iolock
addresses can be added to list but not removed (unless I'm doing
the removing).
NeilBrown [Mon, 21 Jun 2010 03:40:16 +0000 (13:40 +1000)]
Revise rule for inode data blocks as leafs.
We cannot process an inode data block as a leaf before processing
the InoIdx block.
Previously we would unpin an inode data block if the InoIdx block
should take priority. But that is problematic.
Instead we simply take the inode data block off the leaf list.
This means we have to put it back on when the InoIdx gets unpinned
or phase-flipped.
At the same time, tidy up determination of 'is a leaf' as this is used
both when adding something to a leaf list, and when taking something
off.
NeilBrown [Mon, 21 Jun 2010 01:32:12 +0000 (11:32 +1000)]
Change lafs_phase_flip to take an indexblock
As lafs_phase_flip is only ever passed an indexblock, make that
explicit in the signature, and remove any tests for B_Index as
they will always be true.
NeilBrown [Mon, 21 Jun 2010 01:05:15 +0000 (11:05 +1000)]
Better tracking of whether orphan handling is running.
At unmount we need to wait for all orphan handling to
complete.
Just checking the list of orphan blocks is not enough
as it is empty while the handling is actually happening.
So add a state flag to help out.
NeilBrown [Mon, 21 Jun 2010 00:19:25 +0000 (10:19 +1000)]
Wait for segment-scan to finish before unmount.
It might be best to come up with a way to abort the scan,
but we don't want it to be running when we unmount, so wait
for it to complete and ensure it doesn't restart.
NeilBrown [Sun, 20 Jun 2010 23:41:51 +0000 (09:41 +1000)]
SegRef fixes.
We mustn't hold a SegRef for blocks which aren't going to be accounted
in any segment usage counts.
This means we should never hold SegRef on the Root block, and if we
decide not to account certain blocks during unmount - as roll-forward
will account them - we should drop SegRef promptly.
NeilBrown [Sun, 20 Jun 2010 23:33:18 +0000 (09:33 +1000)]
Add casts to shifts which should change type width.
Sometimes when we left-shift a value it is possible that the
new value will require more bits to represent.
In those cases we first need to cast the value to the appropriately
sized type.
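The pitfall can be illustrated with a small sketch. The helper names and the block-number-to-byte-offset scenario are hypothetical, not taken from the LaFS source, and the sketch assumes a platform where unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Without a cast the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened to the 64-bit return type.
 */
static uint64_t block_to_bytes_wrong_example(uint32_t block, unsigned int shift)
{
	assert(shift < 32);
	return block << shift;
}

/* Widen first, then shift: the extra bits now have somewhere to go. */
static uint64_t block_to_bytes_example(uint32_t block, unsigned int shift)
{
	assert(shift < 32);
	return (uint64_t)block << shift;
}
```

For block 0x00400000 and a shift of 12, the cast version yields 0x400000000 while the uncast version wraps to 0.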
NeilBrown [Fri, 18 Jun 2010 12:32:29 +0000 (22:32 +1000)]
cleaner: when erasing a datablock, cancel any pending cleaning.
This requires a bit of locking, but ensures that after erase_dblock,
the block is no longer in use, so truncate orphan handling doesn't
find children that it doesn't expect.
NeilBrown [Fri, 18 Jun 2010 11:58:34 +0000 (21:58 +1000)]
unmount: clean up waiting for things.
The unmount thread should not run any orphans. That should be
left to the cleaner thread.
It might be useful to wait for the cleaner to finish up.
An alternative might be to release all pending-for-clean blocks.
NeilBrown [Fri, 18 Jun 2010 11:47:27 +0000 (21:47 +1000)]
Incorp: improve setting of address after split.
When we split a block, the address of the second half should be the
first address that won't fit in the original block.
Currently it is the first address that didn't fit. If we end up
adding blocks in reverse address order, this could cause each new
block to require a split.
NeilBrown [Fri, 18 Jun 2010 11:29:20 +0000 (21:29 +1000)]
modify: avoid breaking a PrimaryRef chain
When we insert a new block into a PrimaryRef chain, we need to take
the new refcnt on the new block (which is now primary for the
following block) rather than the block from which we split (on
which a primary_ref is already held).
NeilBrown [Fri, 18 Jun 2010 11:15:25 +0000 (21:15 +1000)]
segments: fix array sizes.
Heights range from 0 up, so the array must be sized
one larger than the maximum height. So define
SEG_NUM_HEIGHTS instead of SEG_MAX_HEIGHT, and make it one more.
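A sketch of the sizing rule. The maximum height of 8, the struct and the helper are assumptions for illustration only; the commit names SEG_NUM_HEIGHTS but does not give its value here.

```c
#include <assert.h>

#define SEG_MAX_HEIGHT_EXAMPLE 8	/* hypothetical largest valid height */
/* Heights run 0..SEG_MAX_HEIGHT_EXAMPLE inclusive, so one extra slot. */
#define SEG_NUM_HEIGHTS (SEG_MAX_HEIGHT_EXAMPLE + 1)

struct seg_counts_example {
	int per_height[SEG_NUM_HEIGHTS];
};

/* Count one more segment at the given height; returns the new count. */
static int seg_count_add_example(struct seg_counts_example *c, int height)
{
	assert(height >= 0 && height < SEG_NUM_HEIGHTS);
	return ++c->per_height[height];
}
```

Had the array been sized by the maximum height itself, indexing with the largest valid height would write one element past the end.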
NeilBrown [Sun, 13 Jun 2010 10:11:56 +0000 (20:11 +1000)]
Handle extents getting added before an indirect block.
Now that indirect blocks are 'mobile' and can have a
start address, it is possible that an extent in uninc_table
starts before the indirect addresses.
In this case the logic that assured us there was room for all
the addresses falls down.
So detect that case earlier and force the indirect block
to either start at the start of the uninc block, or become
an extent block. Possibly it will split as part of this.
NeilBrown [Sun, 13 Jun 2010 09:24:17 +0000 (19:24 +1000)]
Fix handling of extents during incorporation.
Modifying the indexblock in walk_extent is bad.
We could be using walk_extent just to examine the block,
so making changes causes corruption.
Also, there were two other places where we weren't making changes,
but later assumed we had.
So don't make any changes, and don't assume any changes
have been made.
NeilBrown [Sun, 13 Jun 2010 08:19:30 +0000 (18:19 +1000)]
walk_extent bug fixes.
If walk_extent aborts, it needs to update the current extent from
the index block as it might....
No, that is bad - we might be checking, so an update is not wanted!
NeilBrown [Sun, 13 Jun 2010 07:51:00 +0000 (17:51 +1000)]
Better handling of upward recursive empty index deletion.
When an empty index block makes the parent empty, we recurse
up.
When we do that, we need to again perform the check for there
being children, and for this being an InoIdx block.
So jump further back, and tidy up the exit path to ensure we still
unlock properly.
NeilBrown [Sun, 13 Jun 2010 07:40:18 +0000 (17:40 +1000)]
Remove last trace of setting depth to zero for empty index blocks.
This turned out to be a bad idea, but bits were still hanging around.
All cleaned up now.
Only the InoIdx block can be depth==0, and only when data is stored
in there rather than indexes.
As it is not hashed, changing its depth is safe. Changing the depth
of any other index block is not safe, and is now not done.
As part of this, clean up usage of lafs_clear_index.
NeilBrown [Sun, 13 Jun 2010 07:16:30 +0000 (17:16 +1000)]
unhash iblock when it becomes empty.
Our handling of empty iblocks was flawed - when they become empty they
stayed in the hash table and could be found again - bad.
So unhash them.
To keep things tidy, always use hlist_del_init to remove things from
the hash table, and assert that they are present before removing, and
that they aren't before adding.
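The "assert present before removing, assert absent before adding" discipline can be sketched in user space. The types and functions below are a simplified imitation written for this example, not the kernel's hlist implementation; they only mirror the property that a node removed with an "init" style delete is left in a state that can be recognised as unhashed.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for a hash-chain node and head. */
struct hnode_example {
	struct hnode_example *next;
	struct hnode_example **pprev;	/* NULL means "not on any chain" */
};

struct hhead_example {
	struct hnode_example *first;
};

static int hnode_unhashed_example(const struct hnode_example *n)
{
	return n->pprev == NULL;
}

static void hnode_add_head_example(struct hnode_example *n,
				   struct hhead_example *h)
{
	assert(hnode_unhashed_example(n));	/* must not be hashed already */
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void hnode_del_init_example(struct hnode_example *n)
{
	assert(!hnode_unhashed_example(n));	/* must currently be hashed */
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;	/* re-initialise so the state is testable */
}
```

Because removal resets the node, a double removal or a double insertion trips an assertion instead of silently corrupting the chain.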