NeilBrown [Fri, 18 Jun 2010 11:47:27 +0000 (21:47 +1000)]
Incorp: improve setting of address after split.
When we split a block, the address of the second half should be the
first address that won't fit in the original block.
Currently it is the first address that didn't fit. If we end up
adding blocks in reverse address order, this could cause each new
block to require a split.
NeilBrown [Fri, 18 Jun 2010 11:29:20 +0000 (21:29 +1000)]
modify: avoid breaking an PrimaryRef chain
When we insert a new block into a PrimaryRef chain, we need to take
the new refcnt on the new block (which is now primary for the
following block) rather than than the block from which we split (on
which a primary_ref is already held).
NeilBrown [Fri, 18 Jun 2010 11:15:25 +0000 (21:15 +1000)]
segments: fix array sizes.
Heights range from 0 up, so the array must be sized
one larger than that maximum height. So defined
SEG_NUM_HEIGHTS instead of SEG_MAX_HEIGHT, and make it one more.
NeilBrown [Sun, 13 Jun 2010 10:11:56 +0000 (20:11 +1000)]
Handle extents getting added before an indirect block.
Now that indirect blocks are 'mobile' and can have a
start address, it is possible that an extent in uninc_table
starts before the indirect addresses.
In this case the logic that assured us there was room for all
the addresses fall down.
So detect that case earlier and force the indirect block
to either start at the start of the uninc bloc, or become
an extent block. Possibly it will split as part of this.
NeilBrown [Sun, 13 Jun 2010 09:24:17 +0000 (19:24 +1000)]
Fix handling of extents during incorporation.
Modifying the indexblock in walk_extent is bad.
We could be using walk_extent just to examine the block,
so making changes causes corruption.
Also, there were two other places where we weren't making changes,
but later assumed we had.
So don't make any changes, and don't assume any changes
have been made.
NeilBrown [Sun, 13 Jun 2010 08:19:30 +0000 (18:19 +1000)]
walk_extent bug fixes.
If walk_extent aborts, it needs to update the current extent from
the index block as it might....
No, that is bad - we might be checking, so an update is not wanted!
NeilBrown [Sun, 13 Jun 2010 07:51:00 +0000 (17:51 +1000)]
Better handling of upward recursive empty index deletion.
When an empty index blocks makes the parent empty, we recurse
up.
When we do that, we need to again perform the check for there
being children, and for this being an InoIdx block.
So jump further back, and tidy up the exit path to ensure we still
unlock properly.
NeilBrown [Sun, 13 Jun 2010 07:40:18 +0000 (17:40 +1000)]
Remove last trace of setting depth to zero for empty index blocks.
This turned out to be a bad idea, but bits were still hanging around.
All cleaned up now.
Only the InoIdx block can be depth==0, and only when data is stored
in there rather than indexes.
As it is not hashed, changing its depth is safe. Changing the depth
of any other index block is not safe, and is now not done.
As part of this, clean up usage of lafs_clear_index.
NeilBrown [Sun, 13 Jun 2010 07:16:30 +0000 (17:16 +1000)]
unhash iblock when it becomes empty.
Our handling of empty iblocks was flawed - when they become empty they
stayed in the hash table and could be found again - bad.
So unhash them.
To keep things tidy, always use hlist_del_init to removing things from
the hash table, and assert that they are present before removing, and
that they aren't before adding.
NeilBrown [Thu, 10 Jun 2010 05:57:07 +0000 (15:57 +1000)]
Ensure indexblock is clean when truncate finished.
When truncate finishes and we are about to delete the inode,
we need to have the index block clean and unpinned so it can disappear
cleanly.
lafs_cluster_allocate nicely does this for us.
NeilBrown [Thu, 10 Jun 2010 05:52:45 +0000 (15:52 +1000)]
Avoid deadlock between orphan-truncate and cluster-write
If, during truncate, we find an index block in writeback,
we need to give up on the orphan and return back up
to the cleaner thread so that the writeback can be
properly handled.
Otherwise the cleaner thread is blocked waiting for the cleaner thread
to make progress.
NeilBrown [Wed, 9 Jun 2010 03:23:14 +0000 (13:23 +1000)]
Fix handling of empty index blocks during incorporation.
If an index block becomes empty we incorporate the address space into
the next index block, and so need to update the address of the leading
children and fix the indexes in the parent properly.
NeilBrown [Wed, 9 Jun 2010 02:14:07 +0000 (12:14 +1000)]
Fix problem with mobility of internal index blocks.
As index blocks can change address if they are or become the first
block in the parent, we must not put too much weight on the first
address stored in an internal index block - it must always be treated
as the same as the address of that internal index block.
So pass around the current block's address where needed, and use it in
preference to the first address recorded in the block.
NeilBrown [Tue, 8 Jun 2010 08:13:13 +0000 (18:13 +1000)]
add_extent fixes.
add_extent tries to detect consecutive extents being added
and merges them. But it got it wrong.
So change last* to actually refer to the "next" and get it right.
NeilBrown [Tue, 8 Jun 2010 07:49:04 +0000 (17:49 +1000)]
Set depth to 0 when InoIdx becomes empty.
The code set depth to zero always, which is clearly broken.
We don't incrementally reduce the height of an indexing tree as it
shrinks. Rather the high remains at the max until it is completely
empty, then it suddenly becomes zero.
NeilBrown [Tue, 8 Jun 2010 07:32:57 +0000 (17:32 +1000)]
Clear B_OnFree when removing from freelist.
This isn't strictly necessary as if refile want to use the
LRU for something else, it will clear this bit.
However clearing it both places makes for more consistency and less
confusing data structures.
NeilBrown [Mon, 7 Jun 2010 03:16:45 +0000 (13:16 +1000)]
Fix a BUG_ON in lafs_erase_dblock
As we no longer clear the Valid flag on index blocks when
they become empty, we can no longer trigger a bug if it
is still set.
However depth should definitely be 0, so test for that.
NeilBrown [Mon, 7 Jun 2010 01:30:48 +0000 (11:30 +1000)]
Make sure find_block loads data from inode.
lafs_load_block expects not to need to read block 0
of a depth==0 inode as lafs_find_block is expected to
load that.
However there is a case where find_block (called from lafs_find_block)
skips the load. So remove that case.
When a block becomes empty, we need to remove it from the
indexing tree, causing the next block to take over the address
space if possible, else maybe the previous block.
FORMAT CHANGE - include start address explicitly in indirect blocks.
By storing the address of the first block in an indirect block,
that block can be relocated in the index tree with no adverse effects.
This makes it easier to handle blocks that have become empty.
So an indirect block now contains:
u16 0x01 'indirect' indicator
u32 addr virtual address of first indexed block
u48 paddr physical address of first indexblock block
u48 paddr2 next block ... etc
Don't lose path to index blocks that haven't been incorporated
Before a new index block is incorporated, we can find it only by
finding a sibling and walking the sibling list.
If that primary sibling gets released, the new block becomes
unfindable.
So hold a reference to such primaries until incorporation,
and ensure that there is always a primary in the same parent
by incorporating more promptly after a split.
- in a few places we didn't called dirty_inode when we should have
done.
- after inode_{inc,dec}_link_count we don't need to dirty the inode
because it has already been done.
- handle I_Dirty properly rather than unpinning the block
- allocate NCredits for inodes so that I_Dirty can succeed.
If the Credit has moved to Dirty, we don't need to add
another one during prealloc.
Doing so means that prealloc can fail on a block which is
already adequately allocated.
The InoIdx block probably shouldn't have a phys address,
but we must be certainly it doesn't after a grow caused it to
not be an InoIdx block any more, else accounting will get
confused when we allocate it to another block.
I'm still not sure about this handling of empty index blocks,
but leaving it invalid certainly isn't correct. I have to
initialise it and make it valid.
Avoid inadvertently flush blocks that are in middle of a transaction.
If a directory block has PinPending, then a transaction is in progress
and flushing the block (e.g. pdflush) would be a bad idea, as we could
lose the credit before the final update.
If the orphan handler is going to dirty a block, it must
first pin it (and check for errors), and must afterwards remove
the PinPending.
One of the places which caused change did this. The other didn't.
lafs_dir_handle_orphan no longer requires drop_orphan,
and can drop the reference to the orphan block.
So we need to hold our own so we can test if it is still
and orphan after handling (in which case we must wait).
It seems that this list can be empty fairly often without
any problems caused - I was afraid of infinite loops as it appears
there is nothing to do.
Maybe it is just the waiting that is needed.
When erasing a pinned dblock, it might not be on the leaf list.
The fact that we have it locked can mean that it wasn't put there
yet.
So check leaf list membership explicitly.
Note that it cannot be in write-back, as we should have
waited for any writeback to finish, and won't have started any.
And inode can be an orphan for an extended period of time and
should not be on the pending_orphans list all that time.
So we need to remove it if nothing needs doing for now, and
add it back on when it requires attention.
Requiring the 'handle' routines to leave the orphan on the list and
then removing it if B_Orphan isn't set is not appropriate as we will
want to remove block while they still have B_Orphan set.
So do the list management differently to still ensure a single pass
through without needing the current block to hold our place.
Flush orphans before failing rmdir due to not being empty.
Before we can be sure if a directory is empty, we need to
flush and orphans that might be holding space in the directory
in use. Only then can a check on the file size be meaningful.
NeilBrown [Thu, 27 Aug 2009 01:46:02 +0000 (11:46 +1000)]
Further tidy up for make_orphan
Move the calls to orphan_abort and orphan_commit into make_orphan
to follow the pattern used elsewhere.
This requires moving the i_mutex lock out into make_orphan,
but otherwise seems to be a good improvement.
NeilBrown [Thu, 27 Aug 2009 01:46:00 +0000 (11:46 +1000)]
Use orphan_abort in lafs_orphan_release
lafs_orphan_release currently open-codes all of orphan_abort,
except the i_mutex locking is a bit different. So change locking
rules for orphan_abort, then use that directly in lafs_orphan_release.
NeilBrown [Thu, 27 Aug 2009 01:45:56 +0000 (11:45 +1000)]
Orphan: simplify orphan creation.
We don't need to synchronise orphan creation with other parts of
direct ops etc. We just make the block into an orphan if we
think that might be needed. If this gets committed in a previous
checkpoint, there is no problem.
This simplifies things a lot and lets us get rid of some races.