From: NeilBrown
Date: Mon, 9 Aug 2010 02:41:42 +0000 (+1000)
Subject: README update
X-Git-Url: http://git.neil.brown.name/?a=commitdiff_plain;h=a2406969209b8fa48c622a38bc32f2f68ae7478b;p=LaFS.git

README update
---

diff --git a/README b/README
index 14da9c6..2c2f0ad 100644
--- a/README
+++ b/README
@@ -5284,6 +5284,13 @@ DONE
 52/ NFS export
 56/ Review roll-forward completely.
 
+57/ learn about FS_HAS_SUBTYPE and document it.
+
+58/ Consider embedding symlinks and device files in the directory.
+    Need owner/group/perm for a device file, but not for a symlink.
+    Can we create unique inode numbers?
+    hard links for dev-files would be problematic.
+
 26June2010
 Investigating 5a
 
@@ -5936,3 +5943,365 @@ WritePhase - what is that all about?
 writeback, do I need to wait, or can I just mark it dirty in
 commit_write?  If no checksum and no duplication applies, this
 should be fine.
+
+16July2010
+ BUT e.g. dir operations are in particular phases.  If the dirblock
+ is pinned to the old phase, we need to flush it, then wait for the io
+ to complete.  So we need lafs_phase_wait as well as iolock_written.
+ This is already done by pin_dblock.
+ I wonder if we need a way to accelerate pinned blocks that are being
+ waited for - probably not, they should be done early.
+
+ So we probably want to iolock after phase_wait in pin_dblock.
+ Though dir.c pins early.
+ I need to review all of this and get it right.
+
+ So:
+  - we aren't allowed to block much holding checkpoint_lock as
+    checkpoint_start waits for that.  However phase_wait will only
+    block if a new checkpoint has started already, so there is no
+    chance of phase_wait ever blocking checkpoint_start.
+    So it is safe to call phase_wait under checkpoint_lock.
+    phase_wait will wait until the block is written, added back to
+    the lru clean, then found and flipped... I wonder if that is
+    good - it keeps the parent from being a leaf, and so written,
+    until the child write has completed.
+ We want to phase-flip a block as soon as it is allocated by cluster_flush.
+
+ With directory blocks, i_mutex stops other changes, so an early
+ iolock_written will leave the block clean and phase won't be an issue.
+
+ With inode-map blocks we:
+   set B_Pinned to ensure no-one writes except for phase change
+     do that after lock_written so it starts safe.
+   once we have checkpoint_lock, wait for phase if needed.
+   then lock_written again, which should be instant but ensures
+     that the block is locked while we change it...
+
+ I think I want
+  - refile to call phase flip if the index is not dirty and is in the
+    wrong phase and has no pinned children in that phase.
+  - Only clear PinPending if we have i_mutex or refcnt == 0
+  - before transaction:
+      lock_written / set PinPending / unlock
+    then inside cluster_lock
+      lock_written / pin / change / dirty / unlock
+    it will only wait for writeout if the phase changed,
+    so we don't need phase_wait,
+    but we want pre-pin then pin_dblock.
+ Transactions are:
+   dir create/delete/update - DONE
+   inode allocate/deallocate - on inode map  DONE
+   setattr  DONE
+   orphan set/change/discard
+
+ Orphans are a little different, as when we compact the
+ file, the orphan file block 'owned' by the orphan block
+ can change.  As long as we keep them all PinPending it
+ should be fine though.
+ I think that every block in the orphan file will always be
+ PinPending ???
+
+ OK - done most of that.
+ Early phase_flip is awkward.  We need an iolock to phase_flip,
+ and we don't have one.  The phase_flip could cause incorporation,
+ which cannot happen until the write completes.  So I guess
+ we leave it as it is.
+
+ FIXME what about inode data blocks - cluster_allocate is removing
+ PinPending after making them dirty from the index block..
+
+ If all free inode numbers are B_Claimed, I don't think we allocate
+ a new block... yes we do, as 'restarted' is local to the caller.
+
+ Also, each device has a number of flags:
+  - new metadata can go here
+  - new data can go here
+  - clean data can go here
+  - clean metadata can go here
+  - non-logged segments allowed
+  - priority clean - any segment can be cleaned
+  - dev is shared and read-only - no state-block updates
+
+ The state block needs a uuid for an ro-filesystem that this is
+ layered on.
+
+ Is metadata an issue?
+ We might want it on a faster device, but ditto for directories
+ and for some data.  So probably skip that.
+
+ Have separate segment tables for:
+  - can have new data
+  - can have clean data but not new. (this is often empty)
+
+ Clean data can go to new-not-clean if nothing else.
+ New data can go to clean-not-new ?? if not sync??
+ Maybe call them 'prefer clean' and 'prefer new'.
+
+ I think we want:
+  'no sync new' - don't write new data, unless it is in big chunks and
+       can wait for the checkpoint to be 'synced'
+  'no write' - never write anything - this is readonly.
+       used for removing a device from the fs.
+
+ A 'no sync new' device can have single-block segments.
+ This doesn't allow compression, but avoids any need to clean.
+ In this case we don't store youth and the segusage is 32 bits per segment.
+ That means - for 1K block size - 0.5% of the device used for segusage.
+ That feels high.  For 4K, 1/1024, so a gigabyte per terabyte.
+ Then limited to 29 snapshots plus the base fs, and 2 bits to record
+ bad blocks.
+
+ Other segusage for 29 snaps is 1/million of space used.
+ So we 'waste' 0.1% of the device for no secondary cleaning.
+ Can still do defrag though.
+
+ clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
+ as does creating a snapshot.
+
+18jul2010
+ If lafs were cluster enabled we would want multiple checkpoint clusters,
+ one for each node.  When a node crashes, some node would need to find it
+ and roll forward.  For single-node failure, it is enough to broadcast the
+ cluster address to all the others.
+ For whole-cluster failure, we need to either list all of them
+ in the superblock or link from the main write cluster.
+
+ When writing to multiple devices we may want multiple write clusters
+ active for new data.  These all need to be findable from the checkpoint
+ cluster, so linking sounds good.
+ Having a single 'fork' link in the cluster head might work, but does not
+ scale to a large cluster.  It doesn't need to be committed to others, nor
+ does checkpoint end, so that should be ok.
+ Could have a special group_head to list other clusters for roll forward.
+ If we put fsnum first, a large value - 0xffffffff - could easily mean
+ something else.
+
+ Or every cluster head could point to an alternate stream, and if we want
+ many quickly, each simply points to another, so we create a chain across
+ all writers.
+
+ Another issue...
+ When we 'sync' we don't wait for blocks until after the checkpoint is
+ started, and we know that will be driven through to CheckpointEnd, which
+ will commit and release everything.
+ However 'fsync' doesn't have the same guarantee.  The sync_page call will
+ ensure the data has been written, but we don't know it is safe until the
+ next header is written.  So we need to push out the next cluster promptly.
+
+ So if sync_page is called on a page in writeback, then we mark the cluster
+ as synchronous.  When a sync cluster completes, the next (or even next+1)
+ clusters are flushed out promptly.  Hopefully they won't be empty on a
+ reasonably busy system, but it is OK if they are.
+
+ If a block is in writeback for the cleaner, then as the cluster is
+ VerifyNone, as soon as the write completes the block will be released.
+
+ So: to clarify sync_page:
+  This can be called when the page is in writeback or locked.
+  If locked, there is nothing we can do except maybe unplug the read queue.
+  If the page is in writeback and the block is dirty, then it is probably
+  in a cluster queue and we should flush the cluster and the next.
+  If the page is in writeback and the block is not dirty, but is in
+  writeback, just flush one cluster.
+  But we don't want these cluster flushes to start while the previous one
+  is still outstanding, else we stop new requests from being added.
+  So as soon as the cluster can be flushed we flush, but no sooner.
+  I guess we use FlushNeeded and make that be less hasty.
+
+19July2010
+
+ superblocks....
+ We currently have a superblock for each device.
+ I cannot see a good reason for that.
+ We can just bd_claim for 'this' filesystem.
+ Rather we should have a number of anon superblocks,
+ one for each fileset, then one for each snapshot.
+ Do we use different fs types?  Probably yes:
+   lafs - main filesystem made from devices
+   lafs_subset - subordinate fileset, given a path to the fileset object.
+       can have a 'create' option when given an empty directory.
+   lafs_snap - snapshot - given a path to the filesys and a textname.
+
+ Cannot create a snap of a subset, only of the whole filesystem.
+ Is it OK to mount either a snap of a subset or a subset of a snap?
+ It probably is, so we need to use the same filesystem type for both.
+ Maybe lafs_sub or sublafs.  Needs a path to a directory.
+ Can be given 'snap=foo'.
+ No: a given filesystem may not exist in a snapshot.  You need to
+ mount the snapshot first, then the subset of the snapshot.
+ So we have three types as above.  All subsets use 'lafs_subset',
+ whether they are subsets of main or of a snapshot.
+
+ Should we be able to create a snapshot or subset without mounting it?
+ It doesn't really seem necessary but might be elegant..
+
+ remount doesn't seem the right way to edit a filesystem, as it forces
+ some cache flushing.
+ What do we want to edit?
+  - add device, remove device
+  - add/remove snapshot by name
+  - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
+    an empty dir to a subset.
+  - change cleaner settings??
+ Could have remount as an option.  If that's a problem, find another option.
+
+ While cleaning (which is always) we potentially need all superblocks
+ available, as we might need to load blocks in those filesystems to
+ relocate them.
+ Unfortunately each super needs to be in a global list, so there is a cost
+ in having them appear and disappear.  I guess that is not a big deal.  They
+ are refcounted and will disappear cleanly when the count hits zero.
+
+ So:
+  DONE - change all prime_sb->blocksize refs to fs->blocksize
+  DONE - create an anon sb for the main filesystem
+  DONE - discard the device sbs, just bd_claim the devices and add to a list
+       - use lafs_subset for creating/mounting subsets.
+
+ Changed s_fs_info to point to the TypeInodeFile for the super, but
+ for root/snapshot that doesn't exist early enough to differentiate the
+ super in sget.
+ So we make an inode before the super exists and attach it after.
+ Need to do all that get_new_inode does:
+   inode_stat.nr_inodes++  - just don't generic_forget the inode
+   add to inode_in_use - seems pointless - just set i_list to something
+   add to sb->s_inodes - if we don't, it won't flush - maybe that is good?
+   add to hash - don't want that
+   i_state == lock|new - only really needed if hashed.
+ But there is lots of initialisation in alloc_inode that we cannot access!!
+
+ The problem is that we need s_fs_info to uniquely identify the fs with
+ something that can be set under the spinlock, so allocating an inode is
+ out.  And we also need to get to the filesystem metadata, which is in the
+ inode.
+ I guess we allocate a little something that stores an identifier and later
+ the inode:
+   for lafs we use the uuid
+   for a subset we use just the inode
+   for a snapshot we use the fs and number
+
+25July2010
+ superblocks:
+  - sget gives us an active super_block.  We need to attach it to a vfsmnt
+    using simple_set_mnt, or call deactivate_locked_super.
+  - sget's set should call set_anon_super
+  - kill_sb (called by deactivate_super) should then call kill_anon_super
+
+ If we have a vfsmnt, we have an active reference, so we can atomic_inc
+ s_active safely.  So use this to allow snapshots and subsets to hold a
+ ref on the prime_sb and thence on the 'fs'.
+
+26July2010
+  - DONE need to set MS_ACTIVE somewhere!!
+  - FIXME if an inode is being dropped when an iget comes in, it gets
+    confused and the inode appears to be deleted.
+
+ We cannot really break the dblock <-> inode link until after
+ write_inode_now, but there is no call-back before generic_detach_inode
+ is complete.  The last is write_inode, which is only called if
+ I_DIRTY_something.
+ Maybe when writeback completes on an inode dblock, we should check if
+ the inode is I_WILL_FREE and if so, break the link...
+
+ Or maybe when we find my_inode set we can check the block, and if it isn't
+ dirty or being deleted, break the link directly...  That makes more sense.
+
+ So... what is the deal with freeing inodes???
+   ->iblock is like a hashtable reference.  It is not refcounted.
+     It gets set under private_lock.
+     iblock is freed by memory pressure or by lafs_release_index from
+     destroy_inode.
+     When the refcount of iblock is non-zero, the ->dblock ref is counted,
+     else it is not.
+   dblock is set to NULL if I_Destroyed, or when the dblock is discarded
+     (under lafs_hash_lock),
+     and set to 'b' in lafs_iget and lafs_inode_dblock.
+
+ We can drop the dblock link as soon as iblock has no reference.
+
+ Probably get clear_inode to break the link if possible, which it should
+ be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
+ Or maybe do it in clear_inode itself.
+
+ FIXME when we drop dblock we must clear iblock! as getiref of iblock
+ assumes dblock is not NULL.
+
+28July2010
+ So: ->dblock and ->my_inode need to be clarified.
+
+ Neither is a counted reference - the idea is that either can be freed and
+ will destroy the pointer at that time, so if the pointer is there, the
+ object must be ... but we need locking for that.
+ ->dblock is reasonably protected by private_lock, though if ->iblock
+ exists we hold a ref on ->dblock, so we can access it more safely.
+
+ Need to check that getiref_locked knows ->dblock exists when called on
+ iblock, and lafs_inode_fillblock.
+   yes, both safe!
+
+ But ->my_inode needs locking too, so the inode can safely disappear
+ without having to wait for the data block to go.  After all, data blocks
+ come in sets, and one shouldn't keep the others with inodes.
+ So something light-weight like rcu might work.
+ We use call_rcu to free the inode and rcu_read_lock to access ->my_inode.
+
+ Yes, that will work.  Occasionally we will want an igrab too, but not
+ often.
+ Should look into rcu for the index hash table and ->iblock as well.
+ Currently ->iblock is only cleared when the block is freed .. I guess
+ that is fine...
+
+31Jul2010
+ rcu protection of ->my_inode
+  A/ orphan inodes - are they protected?
+  B/ orphan blocks - are the inodes of those protected?  Probably...
+
+ inodes are 'orphan' for two reasons:
+  1/ a truncate is in progress
+  2/ there are no remaining links, so the inode should be truncated/deleted
+     on restart.
+
+ The second precludes us from holding a refcount on any orphan inode,
+ else it would never get deleted.
+ So we must assert that an inode with I_Deleting or I_Trunc has an implied
+ reference and so delete must be delayed... not quite.
+ If we set I_Trunc but not I_Deleting, then we igrab the inode until
+ I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
+ be set, as that is set when the last ref is dropped.
+
+01Aug2010
+ FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
+ .. and in lafs_orphan_release
+ Well...
+ only iolock_written can be a problem, and our rules require that
+ only phase-change writeout can set writeback.  So the cleaner can never
+ wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
+ really need a lock, just 'wait_writeback'.
+
+08Aug2010
+ So the cleaner is in run_orphans, dir_handle_orphan, pin_dblock,
+ iolock_written.
+ It is in writeback, waiting on 74/BIGNUM from file.c:329.  So writepage
+ tried to write a block in a directory .. but it is PinPending, so that
+ must have been set after writepage got it...
+ lafs_dir_handle_orphan gets an async lock, then sets PinPending.
+ If writepage is before that, it will have the lock and dir_handle will
+ try later.
+ If writepage is after, it will block on the lock, or see PinPending and
+ release the lock.
+ So someone else must be clearing PinPending!
+  - checkpoint clears and re-sets under the lock, so that is safe
+  - dir.c clears under i_mutex
+      dir_handle_orphans always holds i_mutex ... or does it.
+  - refile drops it when the last non-lru reference goes.
+  - inode_map_new_abort clears it for the inode
+ No, none of those - just a bad test on the result of
+ iolock_written_async ;-(
+
+ Now have an interesting deadlock.
+ rm in lafs_delete_inode in inode_map_free is waiting for the block to
+ flush, which requires the cleaner.
+ The cleaner thread in inode_handle_orphan is calling erase_dblock
+ on the same inode, which blocks while inode_map_free has it locked....
+ no, not the same block - it is just waiting for writeout, which requires
+ the cleaner.
+ lafs_erase_dblock from inode_map_free must be async!
+ pin_dblock in lafs_orphan_release must be too.... no - only the setting of
+ PinPending needs to be async or outside the cleaner, which it is.
+
+ Ok, got that fixed.  All seems happy again, time for a commit.
+