These are no longer relevant.
Also update README with lots of FIXME notes
and fix some white-space issues.
Signed-off-by: NeilBrown <neilb@suse.de>
- Data 16/0 is dirty but not pinned after final checkpoint - FIXED
-watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
+watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
Unclear on dirtying index blocks.
DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
- 7j/ resolve space allocation issues.
+NOLONGERRELEVENT 7j/ resolve space allocation issues.
Understand why CleanSpace can be tried and failed 1000
times before there is any change.
- 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+DONE 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
a wakeup.
write lafs_iolock_written_async.
- 7l/ make sure i_blocks is correct.
+DONE 7l/ make sure i_blocks is correct.
+ set on 'import_inode'
+ decreased when lafs_summary_update assigned block to '0'
+ changed when lafs_summary_allocate changes e.g. quota.
+
+ lafs_summary_update is called when a block is assigned to a location,
+ or to zero. It is real usage.
+ lafs_summary_allocate is called when we set Prealloc on phys==0 or
+ clear Prealloc on phys==0
+ So allocate must be followed exactly.
+ update is already counted for setting !=0, so only dec on ==0.
+ So all is good.
+ What about quota? - hidden in quota_allocate / qcommit
DONE 8/ looping in do_checkpoint
root is still i Phase1 because 0/2 is in Phase 1
it isn't locked properly. Don't clear PinPending if someone else might
have set it.
-15a/ Find all FIXMEs and add them here.
+DONE15a/ Find all FIXMEs and add them here.
+
DONE 15b/ Report directory size less confusingly
15i/ separate thread management from 'cleaner' name.
+15j/ review rules in getref_locked - and document them
+
+15k/ newblocks should probably be a count of segments. Review that.
+
+15l/ make sure checkpoint_youth is decayed properly. Review youth decay.
+
+15m/ consider combining .orphans and .cleaning lists. If something is an
+ orphan, we probably don't want to clean it just now(?).
+
+15n/ consider if lafs_pin_dblock should check for iolock. Maybe
+ iolock or PinPending (which must be set under iolock).
+
+15o/ Can there be async blocks when checkpoint starts? Could they
+ pin blocks in old phase? Do I need to check for them?
+
+15p/ Review and remove the 'if cleaner is active then don't checkpoint just
+ yet' think - or somehow avoid the yuckiness.
+
+15q/ check checksums when reading cluster_header
+
+15r/ consider further optimisation in cleaner to avoid lookups.
+
+15s/ memory barrier for i_size check in cleaner???
+
+15t/ review usable-space calculations in clean.
+
+15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
+
+15v/ tidy up all code that fiddles bits and credits - maybe make some
+ common helpers.
+
+15w/ review cluster updates and make sure space used is accounted properly.
+
+15x/ Consider caching result of a failed dir lookup in case we immediately
+ try to create it. Would this actually save anything significant?
+15y/ Don't make dir blocks into orphans if it cannot be needed?
+
+15z/ make sure symlink creation is safe - do I need to log the body??
+
+15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
+
+15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
+ after lock is taken on block?
+
+15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
+ writeout.
+
+15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
+ if more is needed.
+
+15ae/ refile wonders about a race with cluster_allocate which gets IOLock
+ before removing from lru.
+
+15af/ Review all locking in lafs_refile
+
+15ag/ Don't allocate data part of InoIdx block.
+
+15ah/ Is there a problem with lafs_allocated_block putting an
+ about-to-be-truncated block on an uninc list?
+
+15ai/ When allocating a new segment during checkpoint, delay the
+ youth-block update until after the checkpoint
+
+15aj/ When roll-forward finds a new segment, make sure youth number is
+ updated.
+
+15ak/ Load orphan file during roll-forward and make every block an
+ orphan.
+
+15al/ set filesystem update_time somewhere.
+
+15am/ filesystem 'name' needs to be handle uniformly.
+
+15an/ and we be sure 'b' will be non-null in delete_inode?
+
+15ao/ determine what locking is needed to walk the children list
+ in lafs_inode_handle_orphans. Probably the address_space private lock.
+
+15ap/ Make sure write_inode has been cleaned up. See if this apply to
+ rollforward of a symlink (see FIXME)
+
+15aq/ change inode map to be little-endian, not host-endian
+
+15ar/ understand what to do about errors in lafs_truncate
+
+15as/ handle errors from lafs_write_super ???
+
+15at/ More wait_queues to wait for different blocks.
+ An array which we hash in to ??
+
+15au/ How should iocheck_block set the page error?
+ and block_loaded
+
+15av/ ditto for write errors?
+
+15aw/ when lafs_incorporate makes a new block where the
+ only is Realloc, the new should be Realloc too.
+
+15ax/ Think about what happens when we relocate a block
+ in the orphan list (lafs_orphan_release), particularly
+ if the block isn't actually loaded.
+
+15ay/ Wonder if there is any way for orphan_run to get a wakeup
+ when an inode or dir mutex is released.
+
+15az/ Sanity check all values in cluster head during roll-forward
+ i.e. in roll_valid. If the head isn't complete, we can still
+ use this to commit some previous checkpoints.
+
+15ba/ roll forward should not BUG on bad data like inodefile in
+ non-primary filesystem.
+
+15bb/ Do I need to sync something before copying an update over part
+ of an inode, then reloading the inode.
+
+15bc/ Handle DescHole in roll forward.
+
+15bd/ Call lafs_add_block_address from writeback rather than iolock
+ in roll forward, just for consistency.
+
+15be/ Confirm various files loaded at mount time (segusage, orphan ...)
+ are actually the correct type.
+
+15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
+ a lookup - or at least we can test for that.
+ lafs_seg_apply_all has similar problems and needs a good solution.
+
+15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
+ if parent splits. See what to do about that.
+
+15bh/ after roll-forward, check that free_blocks hasn't gone negative.
+ or handle if it has.
+
+15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
+ to twostage.
+
+15bj/ Make sure .last link in segtracker is kepts uptodate, particularly in
+ segdelete.
+
+15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
+
+15bl/ better checks for 'valid state block address' in valid_devblock
+ include that segment_count is credible
+ also in valid_stateblock
+
+15bm/ make sure everything gets free properly on error during mount / lafs_load
+
+15bn/ How does refcountsing of 'struct fs' work with multiple filesets?
+
+15bo/ use put_super to drop last refer to superblocks
+
+15bp/ review all superblock - maybe use more anon??
+
+15bq/ check readonly status in lafs_get_sb
+
+15br/ sync_fs should probably wait for something if 'wait'.
+
+15bs/ set f_fsid properly in lafs_statfs
+
+ - use new write_begin / write_end
+ - review how we ensure that credit remain with block.
+
+
16/ Update locking.doc
17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
25/ use kmem_cache for
datablock
- indexblock
+ indexblock - probably a mempool because we cannot allow failure when
+ splitting an index block.
skippoint (mempool?)
+ segsum - mempool??
others?
26/ Review seg addressing code for 2-D geometries.
- add a device to an live array
- remove a device from a live array
+52/ Review roll-forward completely.
+
26June2010
Investigating 5a
inode I_* - iget / drop_inode
orphan handler, cleaner, segscan - all in the cleaner thread.
+
+ 107 runs,
+ 2 hit 'Show State' with a blocked orphan inode.
+ Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
+ Both NoPhysAddr
+
+ Several runs blocked in cluster_flush or waiting for writeback.
+
+ - first case: looks like cluster flush should run but doesn't.
+ cluster_flush runs:
+ checkpoint, cleaner, cluster_allocate when full, update,
+ writepage, sync_page
+ So we have no timeout or other flush.
+ I guess if we are waiting for writeback, we need to trigger a
+ cluster_flush.
+
+ - other case - cluster_flush was called but is waiting for pending count
+ to go down.
+ Looks like cluster_reset shouldn't be changing pending_next
+
+ New hang. Orphans not being processed:
+ inode, because InoIdx is on leaf and checkpoint isn't pushing
+ it along.
+ dir block 0 is Dirty leaf
+
+ Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
+
+10July2010
+ Over night it looks *very* good.
+ Have one infinite loop with 31770 repeates of
+ ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
+ Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+
+ So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
+ lafs_add_orphan too short.
+ tracing shows after truncate_inode_pages.
+ must be blocked in inode_map_free - maybe use AccountSpace??
+ But why isn't the the truncate progressing?
+ Probably same reason: No ReleaseSpace available.
+ Maybe we aren't cleaning because there is a free segment, and
+ we aren't checkpointing because there aren't enough yet...
+
+ Probably the cleaner has halted while CleanerBlocks - fix that.
+
+ - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
+ Need a checkpoint to release the orphan?
+ ditto for 0/331 - 331/0
+ XX/0 is InoID
+
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice
+day...
+This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI
+,UninCredit,PhysValid leaf(1) intable(6) release(1)
+ [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,Phys
+Valid leaf(1) intable(6) release(1) Leaf0(0)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
+
+Forgetting 0 0
+724 != 7 (st->free.cnt afte segdelete, close_segment, close_all)
+------------[ cut here ]------------
+WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
+
+we called segdelete on something that was on the freelist.
+This happens when the final cluster starts a new segment.
+Need to improve the fix though.
+
+
+ lafs_inode_handle_orphan can make progress without leaving
+ anything async. Maybe we need a return status:
+ -EAGAIN - try after async
+ -ENOMEM - try some time soon - hope memory will be better
+ 0 we called orphan_release
+ anything else loops.
+
+
+ - we allocate a segment in last checkpoint we don't
+ take references properly.
+
+ - orphan handle spinning on:
+
+ ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+ 26402 calls.
+ stuck in delete_inode?? ?
+
+
+ never-ending cleaning? Maybe just computer slow ??
b[i].b.chain = NULL;
b[i].my_inode = NULL;
-
- /* FIXME what else needs to be initialised? */
}
spin_lock(&ino->i_data.private_lock);
} else
blk = getref(&b->b, MKREF(pindb));
-
if (LAFSI(b->b.inode)->type != TypeSegmentMap) {
/* FIXME test is a hack until we get a better
* phase wait. TypeSegmentMap does it's own
if (!test_and_clear_bit(B_NICredit, &b->b.flags))
LAFS_BUG(1, &b->b); /* ICredit should be set before we dirty
* a block. */
-
- // FIXME Do I need to do something with PinPending??
}
void
void
lafs_dirty_iblock(struct indexblock *b)
{
- /* FIXME is this all I have to do here?
- * Do I need to put it on a list, or lock or something?
-
- * Note, only need to set that phase if locked.
+ /* Note, only need to set the phase if locked.
* Then no-one may change it while in phase transition.
* FIXME maybe check we aren't dirtying a dirty block
* in the previous phase.
static void do_checkpoint(void *data)
{
struct fs *fs = data;
- int oldphase = !fs->phase; /*FIXME could there be a race getting this?*/
+ int oldphase = !fs->phase;
struct block *b;
int loops = 0;
#ifdef DUMP
if (!test_bit(CleanerDisabled, &fs->fsstate))
fs->scan.done = 0;
fs->cleaner.cleaning = 0;
- /* FIXME should I wake someone up? */
- /* FIXME might I now be racing with unmount and module unload??? */
fs->checkpoint_youth = youth;
fs->newblocks = 0;
static void cluster_flush(struct fs *fs, int cnum);
-/* FIXME I don't do prealloc yet */
-
static void skip_discard(struct skippoint *sp)
{
while (sp) {
newpoint = kmalloc(sizeof(struct skippoint), GFP_NOFS);
if (!newpoint)
- return rv; /* FIXME how do we trigger early flush? */
+ return rv;
newpoint->b = target;
for (level = 0; level < height-1; level++) {
newpoint->next[level] = pos.next[level];
row = offset % dv->stride;
table = col / dv->width;
col = col % dv->width;
- if (col) {
- /* FIXME BADBADBAD */
- printk("BAD-BAD seg_Setpos with non-row-start\n");
- BUG();
- }
+ BUG_ON(col);
+
seg->st_table = table;
seg->st_row = row;
lafs_inode_fillblock(myi);
}
- /* FIXME these are no longer needed */
- if (!test_bit(B_Valid, &b->flags))
- printk("Not Valid in allocate %s\n", strblk(b));
- LAFS_BUG(!test_bit(B_Valid, &b->flags), b);
-
mutex_lock(&wc->lock);
/* The following check must be inside the mutex_lock
if (err) {
/* No cleaner segments - this will have to go
* out with a checkpoint.
- * Block is still 'dirty' - just clear
+ * Block is still 'dirty' - just clear
* writeback flag.
*/
lafs_writeback_done(b);
* We first iterate over the block list building the clusterhead.
* Then we finalise and write the clusterhead and iterate again
* writing out each block.
- * We keep a blockcount in the wc, and decrement it on write-completion.
- * Once the blockcount hits zero the cluster is written... but not safe.
- * FIXME this comment needs work
+ * We keep a biassed blockcount in the wc, and decrement it on write-completion.
+ * Once the blockcount hits the bias the cluster is written. Then when
+ * the next (or next-plus-one) cluster head is written we remove the bias
+ * and the data is deemed to be safe.
*
* Update blocks will already have been included in the write cluster.
*
- *
* We reinitialise the write cluster once we are finished.
*/
dir_create_commit(&doh, fs, dir, de->d_name.name, de->d_name.len,
ino->i_ino, DT_REG);
lafs_checkpoint_unlock(fs);
- d_instantiate(de, ino); /* FIXME do I need to iput?? */
+ d_instantiate(de, ino);
clear_bit(B_PinPending, &db->b.flags);
putdref(db, MKREF(inode_new));
return 0;
err = err ?: lafs_pin_dblock(inodb, ReleaseSpace);
if (err == -EAGAIN) {
lafs_checkpoint_unlock_wait(fs);
- goto retry; /* FIXME should I not unlock ?? */
+ goto retry;
}
if (err < 0)
goto abort_unlock;
err = err ?: lafs_pin_dblock(inodb, ReleaseSpace);
if (err == -EAGAIN) {
lafs_checkpoint_unlock_wait(fs);
- goto retry; /* FIXME should I not unlock ?? */
+ goto retry;
}
if (err < 0)
goto abort_unlock;
struct inode *ino;
struct address_space *mapping;
struct fs *fs;
- int bits;
mapping = page->mapping;
if (!mapping)
return;
ino = mapping->host;
fs = fs_from_inode(ino);
- bits = PAGE_SHIFT - ino->i_blkbits;
-#if 0
- for (i = 0; i < (1<<bits); i++) {
- struct datablock *b = lafs_get_block(ino, i, page,
- GFP_KERNEL, MKREF(xx));
- if (!b)
- continue;
- // FIXME this test will flush a little too often.
- // We need some better way to know if a block is in
- // the current cluster queue...
- if (test_bit(B_Writeback, &b->b.flags) &&
- test_bit(B_Dirty, &b->b.flags) &&
- !lafs_cluster_empty(fs, 0)
- ) {
- putdref(b);
- lafs_cluster_flush(fs, 0);
- break;
- }
- putdref(b);
- }
-#else
lafs_cluster_flush(fs, 0);
-#endif
}
const struct file_operations lafs_file_file_operations = {
if (test_bit(B_InoIdx, &p->flags)) {
struct datablock *db = LAFSI(p->inode)->dblock;
- /* FIXME unpin the db if it is in the same phase?? */
spin_unlock(&ino->i_data.private_lock);
lafs_inode_checkpin(p->inode);
ino = db->b.inode;
* on the next checkpoint (at which point there will be
* no credit left
* So if any dirty child is found, set phase. We don't
- * need PinPending as it is already Dirty. All other
+ * need PinPending as it is already Dirty. All other
* requirement for pinning will already be met.
*/
struct block *b;
/* lock hashlock - but we have that */
list_del_init(&b->lru);
/* unlock */
- /* FIXME was that ref counted?? */
}
fs = fs_from_inode(b->inode);
spin_lock(&fs->lock);
list_add(&b->lru, &fs->clean_leafs);
else
list_add(&b->lru, &fs->phase_leafs[ph]);
- /* FIXME this lock is here to sort-of protect
- * ->dblock, But how to we know it isn't
- * already NULL ??
- */
+
if (test_bit(B_Index, &b->flags))
getiref_locked(iblk(b), MKREF(leaf));
else
* Return a counted reference to the index block.
* NOTE: we must hold a reference to ->dblock when we call
* lafs_make_iblock.
- *
- * FIXME don't allocate the 'data' part of the indexblock
- * until there is some evidence that it might be needed.
- * e.g. we are planning to write.
- * Otherwise small files will have lots of wastage
*/
struct indexblock * __must_check
lafs_make_iblock(struct inode *ino, int adopt, int async, REFARG)
if (err == 0)
block_adopt(&ib->b, lai->dblock->b.parent);
}
- /* FIXME: I used to pin the new iblock here, and unpin
- * the dblock if it was pinned. I can no longer justify
- * this. Am I wrong?
- */
-
if (err == 0)
return ib;
putiref(ib, REF);
dprintk("Allocated %s -> %llu\n", strblk(blk), phys);
- /* FIXME if phys is 0, I shouldn't need uninc credit. Should
- * I not demand it? See comment in lafs_erase_dblock.
+ /* If phys is 0, I don't need uninc credit.
* Also, Index block that are not near full do not need UnincCredits
- * so don't warn about them
*/
LAFS_BUG(test_bit(B_InoIdx, &blk->flags), blk);
- if (!test_bit(B_UnincCredit, &blk->flags) && phys)
- printk("no uninc credit %s\n", strblk(blk));
if (!test_bit(B_Dirty, &blk->flags) &&
!test_bit(B_Realloc, &blk->flags) &&
phys != 0)
* thus made available to the parent
*/
- lafs_refile(&p->b, 0); /* FIXME when do I refile 'blk' */
+ lafs_refile(&p->b, 0);
return 0;
}
#include <linux/random.h>
#include <linux/delay.h>
-
/* Supporting an async 'iget' - as required by the cleaner -
* is slightly non-trivial.
* iget*_locked will normally wait for any inode with one
err = lafs_read_block(db);
if (err == 0)
return db;
- /* FIXME there is room for a livelock here. As soon as the read
- * completes we will lose our last reference to the block, and
- * it could disappear before we try again...
- */
+
putdref(db, REF);
return ERR_PTR(err);
}
inode_map_free(fs, LAFSI(ino)->filesys, ino->i_ino);
// FIXME what to do on error (-ENOMEM )
-// lafs_orphan_release(fs, b); // FIXME should start orphan handling..
- // FIXME else what?
dprintk("PUNCH hole for %d\n", (int)b->b.fileaddr);
putdref(b, MKREF(delete_inode));
}
if (fs->checkpointing) {
/* This cannot happen with current code,
- * but leave it in case we ever have
+ * but leave it in case we ever have
* orphan handling parallel with checkpoints
*/
err = -EBUSY; /* Try again after the checkpoint */
* or a child themselves.
* When cluster_allocate eventually gets called, it will update
* the datablock from the inode.
- * If an update has to wait for the next phase, lock_dblock
+ * If an update has to wait for the next phase, lock_dblock
* (e.g. in setattr) will do that.
*/
set_bit(I_Dirty, &LAFSI(ino)->iflags);
u64 virt, int dev, struct wc *wc, int head)
{
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- sector_t sect = 0;/* FIXME this initialisation is
- only needed with -Werror */
+ sector_t uninitialized_var(sect);
int which = wc->pending_next;
virttophys(fs, virt, &dev, §);
* and didn't find enough space.
* So demote 'ib' to a regular Index block and make 'new' a new
* InoIdx block (inode->iblock);
+ * We have IOLock on ib to ensure exclusivity.
*/
- // FIXME what locking?
+
int blksize = ib->b.inode->i_sb->s_blocksize;
int offset = LAFSI(ib->b.inode)->metadata_size;
char *ibuf;
int c = from->credits / 2;
to->credits += c;
from->credits -= c;
+ BUG();
}
}
* as EmptyIndex so index lookups know to avoid it.
* Dirty children also delay empty-handling. This
* is important at the InoIdx level as we cannot drop
- * level to 1 while there are still children that
+ * level to 1 while there are still children that
* might want to be incorporated.
*/
if (ib->depth > 1 && ! lafs_index_empty(ib))
return 0; /* "old" blocks aren't interesting */
if (type == DescIndex)
return 0; /* index blocks aren't interesting either */
- if (type == 0)
+ if (type == DescHole)
return 0; /* FIXME should I punch a hole here? */
dprintk("Roll Block %d/%d/%d/%lu/%llu\n",
int
lafs_mount(struct fs *fs)
{
- /* FIXME */
struct datablock *b;
struct inode *root;
struct inode *rootdir;
}
}
-
static void seg_apply(struct fs *fs, struct segsum *ss)
{
int err;
/* Select and return a free segment.
* The youth value must have been zero, and we set it to the
* current youth number.
- * FIXME we should get a reference on the segment counter as well.
*/
struct datablock *db = NULL;
u16 *youthp;
putdref(ss->youthblk, MKREF(intable));
ss_put(ss, fs);
ss_put(ss, fs);
-
-
}
static u16 segunused(struct segtracker *st)
fs->segtrack->clean.last =
fs->segtrack->clean.first;
fs->segtrack->clean.cnt++;
- }
+ }
if (ss->score != SCORE_ACTIVE) {
/* Must be on the clean list now */
ss->score = 0;
kfree(dv->devblk);
if (dv->sb)
kill_block_super(dv->sb);
- /* FIXME should I kfree dv->sb or something here? */
+ /* FIXME should I kfree dv->sb or something here?
+ * maybe used put_super(dv->sb)*/
dv->sb = NULL;
}
* the cleaner from getting the inode. So after
* the complete the drop we might need to wake the cleaner.
*/
-
+
spin_lock(&inode->i_data.private_lock);
if (LAFSI(inode)->dblock)
db = getdref_locked(LAFSI(inode)->dblock, MKREF(drop));