56/ Review roll-forward completely.
+57/ learn about FS_HAS_SUBTYPE and document it.
+
+58/ Consider embedding symlinks and device files in directory.
+ Need owner/group/perm for device file, but not for symlink.
+ Can we create unique inode numbers?
+ hard links for dev-files would be problematic.
+
26June2010
Investigating 5a
writeback, do I need to wait, or can I just mark it dirty in
commit_write? If no checksum and no duplication applies, this should
be fine.
+
+16July2010
+ BUT e.g. dir operations are in particular phases. If the dirblock
+ is pinned to the old phase, we need to flush it, then wait for io
+ to complete. So we need lafs_phase_wait as well as iolock_written.
+ This is already done by pin_dblock.
+ I wonder if we need a way to accelerate pinned blocks that are being
+ waited for - probably not, they should be done early.
+
+ So we probably want to iolock after phase_wait in pin_dblock.
+ Though dir.c pins early.
+ I need to review all of this and get it right.
+
+ So:
+ - we aren't allowed to block much holding checkpoint_lock as
+ checkpoint_start waits for that. However phase_wait will only
+ block if a new checkpoint has started already, so there is no
+ chance of phase_wait ever blocking checkpoint_start.
+ So it is safe to call phase_wait under checkpoint_lock.
+ phase_wait will wait until the block is written, added back to
+ the lru as clean, then found and flipped... I wonder if that is
+ good - it keeps the parent from being a leaf, and so written, until
+ the child write has completed.
+ We want to phase-flip a block as soon as it is allocated by cluster_flush.
+
+ With directory blocks, i_mutex stops other changes, so an early iolock_written
+ will leave the block clean and phase won't be an issue.
+
+ With inode-map blocks.. we:
+ set B_Pinned to ensure no-one writes except for phase change
+ do that after lock_written so it starts safe.
+ once we have checkpointlock, wait for phase if needed.
+ then lock_written again which should be instant but ensures
+ that block is locked while we change it...
+
+ I think I want
+ - refile to call phase flip if index is not dirty and is in wrong phase
+ and has no pinned children in that phase.
+ - Only clear PinPending if we have i_mutex or refcnt == 0
+ - before transaction:
+ lock_written / set PinPending / unlock
+ then inside cluster_lock
+ lock_written pin / change / dirty / unlock
+ it will only wait for writeout if phase changed.
+ so don't need phase_wait
+ but want pre-pin then pin_dblock
+ Transactions are:
+ dir create/delete/update - DONE
+ inode allocate/deallocate - on inode map DONE
+ setattr DONE
+ orphan set/change/discard
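+ A sketch of that before/inside ordering with stand-in functions (none of
+ these are the real lafs calls; the trace just pins down the intended order):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the pre-pin transaction pattern: every name here
 * is a stand-in, and the functions only record the order of the steps. */
static char trace[256];
static void step(const char *s) { strcat(trace, s); strcat(trace, ";"); }

static void lock_written(void)    { step("lock_written"); }
static void set_pin_pending(void) { step("set_PinPending"); }
static void unlock_block(void)    { step("unlock"); }
static void cluster_lock(void)    { step("cluster_lock"); }
static void pin_block(void)       { step("pin"); }
static void change_block(void)    { step("change"); }
static void mark_dirty(void)      { step("dirty"); }
static void cluster_unlock(void)  { step("cluster_unlock"); }

/* before the transaction: pre-pin so the block starts safe */
static void prepin(void)
{
    lock_written();       /* wait out any current writeout            */
    set_pin_pending();    /* block cannot now be written past a phase */
    unlock_block();
}

/* the transaction proper, inside cluster_lock */
static void transaction(void)
{
    cluster_lock();
    lock_written();       /* instant unless the phase changed */
    pin_block();
    change_block();
    mark_dirty();
    unlock_block();
    cluster_unlock();
}

const char *run_transaction(void)
{
    trace[0] = 0;
    prepin();
    transaction();
    return trace;
}
```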
+
+ Orphans are a little different as when we compact the
+ file, the orphan file block 'owned' by the orphan block
+ can change. As long as we keep them all PinPending it
+ should be fine though.
+ I think that every block in the orphan file will always be
+ PinPending ???
+
+ OK - done most of that.
+ Early phase_flip is awkward. We need an iolock to phase_flip,
+ and we don't have one. The phase_flip could cause incorporation
+ which cannot happen until the write completes. So I guess
+ we leave it as it is.
+
+
+ FIXME what about inode data blocks - cluster_allocate is removing
+ PinPending after making them dirty from the index block..
+
+ If all free inode numbers are B_Claimed, I don't think we allocate
+ a new block... yes we do, as 'restarted' is local to the caller.
+
+ Also
+ each device has a number of flags
+ - new metadata can go here
+ - new data can go here
+ - clean data can go here
+ - clean metadata can go here
+ - non-logged segments allowed
+ - priority clean - any segment can be cleaned
+ - dev is shared and read-only - no state-block updates
+
+ state block needs a uuid for an ro-filesystem that this is
+ layered on.
+
+ Is metadata an issue?
+ We might want it on a faster device, but ditto for directories
+ and for some data. So probably skip that.
+
+ Have separate segment tables for:
+ - can have new data
+ - can have clean data but not new. (this often empty)
+
+ Clean data can go to new-not-clean if nothing else
+ new data can go to clean-not-new ?? if not sync??
+ Maybe call them 'prefer clean' and 'prefer new'
+
+ I think we want:
+ 'no sync new' - don't write new data, unless it is in big chunks and
+ can wait for checkpoint to be 'synced'
+ 'no write' - never write anything - this is readonly.
+ used for removing a device from the fs.
+
+ A 'no sync new' device can have single-block segments.
+ This doesn't allow compression, but avoids any need to clean.
+ In this case we don't store youth and the segusage is 32 bits per segment.
+ That means - for 1K block size - about 0.4% of the device used for segusage.
+ That feels high. For 4K, 1/1024, so a giga per terabyte.
+ Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
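+ A quick check of that arithmetic as a sketch (the helper name and the
+ parts-per-million units are mine, not lafs code):

```c
#include <assert.h>

/* Overhead of a 32-bit (4-byte) segusage entry when every segment is a
 * single block, in parts-per-million of the device. */
long segusage_ppm(long block_size_bytes)
{
    return 4L * 1000000L / block_size_bytes;
}
```

+ So 1K blocks give ~3906 ppm (about 0.4%), and 4K blocks give ~976 ppm,
+ i.e. roughly a gigabyte of segusage per terabyte of device.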
+
+ Other segusage for 29 snaps is 1/million of space used.
+ So we 'waste' 0.1% of device for no secondary cleaning.
+ Can still do defrag though.
+
+ Clearing a snapshot on a 1TB device potentially writes 1GB of data!!
+ As does creating a snapshot.
+
+18jul2010
+ If lafs were cluster enabled we would want multiple checkpoint clusters,
+ one for each node. When a node crashes some node would need to find and
+ roll-forward. For single node failure, it is enough to broadcast cluster
+ address to all others. For whole-cluster failure, need to either list all
+ in superblock or link from main write cluster.
+
+ When writing to multiple devices we may want multiple write clusters
+ active for new data. These all need to be findable from checkpoint cluster
+ so linking sounds good.
+ Having a single 'fork' link in the cluster head might work but doesn't scale
+ to a large cluster. It doesn't need to be committed to the others, nor does
+ the checkpoint end, so that should be ok.
+ Could have a special group_head to list other clusters for roll forward.
+ If we put fsnum first, a large value - 0xffffffff - could easily mean
+ something else
+
+ Or every cluster head could point to an alternate stream, and if we want many
+ quickly, each simply points to another, so we create a chain across all writers.
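+ The chain idea could look something like this (struct and names are
+ hypothetical, just to make the traversal concrete):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: each cluster head points to one alternate stream, and the
 * alternates chain on, so every active writer is reachable from the
 * checkpoint cluster during roll-forward. */
struct cluster_head {
    uint64_t addr;               /* device address of this cluster  */
    struct cluster_head *alt;    /* next writer's cluster, or NULL  */
};

/* count the writers reachable from the checkpoint cluster */
int count_writers(struct cluster_head *checkpoint)
{
    int n = 0;
    for (struct cluster_head *c = checkpoint; c; c = c->alt)
        n++;
    return n;
}
```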
+
+
+ Another issue...
+ When we 'sync' we don't wait for blocks until after the checkpoint is started,
+ and we know that will be driven through to CheckpointEnd which will commit and
+ release everything.
+ However 'fsync' doesn't have the same guarantee. The sync_page call will ensure
+ the data has been written, but we don't know it is safe until the next
+ header is written. So we need to push out the next cluster promptly.
+
+ So if sync_page is called on a page in writeback, then we mark the cluster as
+ synchronous. When a sync cluster completes, the next (or even next+1) clusters
+ are flushed out promptly. Hopefully they won't be empty on a reasonably busy system,
+ but it is OK if they are.
+
+ If a block is in writeback for the cleaner.. then as the cluster is VerifyNone,
+ the block will be released as soon as the write completes.
+
+ So: to clarify sync_page:
+ This can be called when page is in writeback or locked.
+ If locked there is nothing we can do except maybe unplug the read queue.
+ If page is in writeback and block is dirty, then it is probably in
+ a cluster queue and we should flush the cluster and the next.
+ If page is in writeback and block is not dirty, but is writeback,
+ just flush one cluster.
+ But we don't want these cluster flushes to start while the previous is
+ still outstanding else we stop new requests from being added.
+ So as soon as the cluster can be flushed we flush, but no sooner.
+ I guess we use FlushNeeded and make that be less hasty.
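+ Those three cases could be encoded roughly like this (the enum and
+ function are mine, not lafs code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the sync_page cases above. */
enum sync_action { SYNC_NOTHING, SYNC_FLUSH_TWO, SYNC_FLUSH_ONE };

enum sync_action sync_page_action(bool page_writeback,
                                  bool block_dirty, bool block_writeback)
{
    if (!page_writeback)
        return SYNC_NOTHING;    /* page only locked: at most unplug reads   */
    if (block_dirty)
        return SYNC_FLUSH_TWO;  /* likely queued: flush this cluster + next */
    if (block_writeback)
        return SYNC_FLUSH_ONE;  /* just flush one cluster                   */
    return SYNC_NOTHING;
}
```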
+
+19July2010
+
+ superblocks....
+ We currently have a superblock for each device.
+ I cannot see a good reason for that.
+ We can just bdev_claim for 'this' filesystem.
+ Rather we should have a number of anon superblocks,
+ one for each fileset, then one for each snapshot.
+ Do we use different fs types? probably yes
+ lafs - main filesystem made from devices
+ lafs_subset - subordinate fileset, given a path to fileset object
+ can have 'create' option when given an empty directory.
+ lafs_snap - snapshot - given a path to filesys and textname.
+
+ Cannot create a snap of a subset, only of the whole filesystem
+ Is it OK to mount either a snap of a subset or a subset of a snap?
+ It probably is, so we need to use the same filesystem type for both.
+ Maybe lafs_sub or sublafs. Needs path to directory.
+ can be given 'snap=foo'.
+ No: a given filesystem may not exist in a snapshot. You need to
+ mount the snapshot first, then the subset of the snapshot.
+ So we have three types as above. All subsets as 'lafs_subset',
+ whether they are subset of main or of snapshot.
+
+ Should we be able to create a snapshot or subset without mounting it?
+ It doesn't really seem necessary but might be elegant..
+
+ remount doesn't seem the right way to edit a filesystem as it forces
+ some cache flushing.
+ What do we want to edit?
+ - add device, remove device
+ - add/remove snapshot by name
+ - add/remove subset? Not needed, just mkdir/rmdir and mount to convert
+ empty dir to subset.
+ - change cleaner settings??
+ Could have remount as an option. If that is a problem, find another option.
+
+ While cleaning (which is always) we potentially need all superblocks
+ available as we might need to load blocks in those filesystems to
+ relocate them.
+ Unfortunately each super needs to be in a global list so there is a cost
+ in having them appear and disappear. I guess that is not a big deal. They
+ are refcounted and will disappear cleanly when the count hits zero.
+
+ So:
+ DONE - change all prime_sb->blocksize refs to fs->blocksize
+ DONE - create an anon sb for the main filesystem
+ DONE - discard the device sbs, just bd_claim the devices and add to list
+ - use lafs_subset for creating/mounting subsets.
+
+ Changed s_fs_info to point to the TypeInodeFile for the super, but
+ for root/snapshot that doesn't exist early enough to differentiate the
+ super in sget.
+ So we make an inode before the super exists and attach it after.
+ Need to do all that get_new_inode does.
+ inode_stat.nr_inodes++ - just don't generic_forget the inode
+ add to inode_in_use - seems pointless - just set i_list to something
+ add to sb->s_inodes - if we don't it won't flush - maybe that is good?
+ add to hash - don't want
+ i_state == lock|new - only really needed if hashed.
+ but there is lots of initialisation in alloc_inode that we cannot access!!
+
+ Problem is that we need s_fs_info to uniquely identify the fs with something
+ that can be set in the spinlock, so allocating an inode is out.
+ And also to get to the filesystem metadata which is in the inode.
+ I guess we allocate a little something that stores identifier and later inode.
+ for lafs we use uuid
+ for subset we use just the inode
+ for snapshot we use fs and number
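+ That 'little something' might look like this (all names hypothetical; the
+ real structs are whatever ends up in the lafs code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct inode;    /* opaque here */
struct fs;

/* Enough identity to compare inside sget's spinlock, plus the root
 * inode to be attached later. */
struct lafs_sb_key {
    enum { KEY_MAIN, KEY_SUBSET, KEY_SNAP } type;
    union {
        uint8_t uuid[16];                 /* main fs: the uuid      */
        struct inode *subset_root;        /* subset: just the inode */
        struct { struct fs *fs; uint32_t num; } snap;  /* snapshot  */
    } u;
    struct inode *root;                   /* attached after sget    */
};

int lafs_sb_key_match(const struct lafs_sb_key *a, const struct lafs_sb_key *b)
{
    if (a->type != b->type)
        return 0;
    switch (a->type) {
    case KEY_MAIN:   return memcmp(a->u.uuid, b->u.uuid, 16) == 0;
    case KEY_SUBSET: return a->u.subset_root == b->u.subset_root;
    case KEY_SNAP:   return a->u.snap.fs == b->u.snap.fs &&
                            a->u.snap.num == b->u.snap.num;
    }
    return 0;
}

int lafs_sb_key_demo(void)
{
    struct lafs_sb_key a = { .type = KEY_SNAP }, b = { .type = KEY_SNAP };
    a.u.snap.fs = b.u.snap.fs = NULL;     /* same (hypothetical) fs */
    a.u.snap.num = b.u.snap.num = 5;
    if (!lafs_sb_key_match(&a, &b))
        return 0;
    b.u.snap.num = 6;                     /* different snapshot number */
    return !lafs_sb_key_match(&a, &b);
}
```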
+
+
+25July2010
+ superblocks:
+ - sget gives us an active super_block. We need to attach to a vfsmnt
+ using simple_set_mnt, or call deactivate_locked_super.
+ - sget's set should call set_anon_super
+ - kill_sb (called by deactivate_super) should then call kill_anon_super
+
+ If we have a vfsmnt, we have an active reference, so we can atomic_inc
+ s_active safely. So use this to allow snapshots and subsets to hold a
+ ref on the prime_sb and thence on the 'fs'.
+
+26July2010
+ - DONE need to set MS_ACTIVE somewhere!!
+ - FIXME if an inode is being dropped when iget comes in, it gets confused
+ and the inode appears to be deleted.
+
+ We cannot really break the dblock <-> inode link until after write_inode_now,
+ but there is no call-back before generic_detach_inode is complete.
+ The last is write_inode which is only called if I_DIRTY_something.
+ Maybe when writeback completes on an inode dblock, we should check if
+ the inode is I_WILL_FREE and if so, we break the link...
+
+ Or maybe when we find my_inode set we can check the block and if it isn't
+ dirty or being deleted we break the link directly... That makes more sense.
+
+ So... what is the deal with freeing inodes???
+ ->iblock is like a hashtable reference. It is not refcounted
+ It gets set under private_lock
+ iblock is freed by memory pressure or lafs_release_index from
+ destroy_inode
+ when refcount of iblock is non-zero, ->dblock ref is counted,
+ else it is not.
+ dblock is set to NULL if I_Destroyed, or when dblock is discarded,
+ (under lafs_hash_lock)
+ and set to 'b' in lafs_iget and lafs_inode_dblock
+
+ We can drop the dblock link as soon as iblock has no reference
+
+ probably get clear_inode to break the link if possible, which it should
+ be on 'forget_inode'. Then lafs_iget can wait on the bit_waitqueue.
+ or maybe do clear_inode itself
+
+ FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
+ dblock is not NULL.
+
+28July2010
+ So: ->dblock and ->my_inode need to be clarified.
+
+ Neither is a counted reference - the idea is that either can be freed and
+ will clear the pointer at that time, so if the pointer is there, the
+ object must exist ... but we need locking for that.
+ ->dblock is reasonably protected by private_lock, though if ->iblock exists
+ we hold a ref of ->dblock so we can access it more safely.
+
+ Need to check getiref_locked knows ->dblock exists when called on iblock
+ and lafs_inode_fillblock
+ yes, both safe!
+
+ But ->my_inode needs locking too so the inode can safely disappear without
+ having to wait for the data block to go. After all, data blocks come in sets,
+ and one block shouldn't keep the others alive along with its inode.
+ So something light-weight like rcu might work.
+ We use call_rcu to free the inode and rcu_readlock to access ->my_inode
+
+ Yes, that will work. Occasionally we will want an igrab too, but not
+ often.
+ Should look into rcu for index hash table and ->iblock as well.
+ Current ->iblock is only cleared when the block is freed .. I guess that is fine...
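+ The access rule sketched, with comments marking where rcu_read_lock /
+ call_rcu would sit in the kernel (types and names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the ->my_inode rule: snapshot the pointer once inside the
 * read-side critical section and only use it there; the freeing side
 * clears the pointer first, then frees the inode after a grace period. */
struct sk_inode  { int ino; };
struct sk_dblock { struct sk_inode *my_inode; /* cleared before free */ };

/* reader: returns the inode number, or -1 if the inode is gone */
int read_my_inode(struct sk_dblock *b)
{
    int ret = -1;
    /* rcu_read_lock(); */
    struct sk_inode *ino = b->my_inode;   /* snapshot the pointer once */
    if (ino)
        ret = ino->ino;                   /* safe until read unlock    */
    /* rcu_read_unlock(); */
    return ret;
}

/* freeing side: clear the pointer, then free via call_rcu so readers
 * still inside a grace period are not broken */
void drop_my_inode(struct sk_dblock *b)
{
    b->my_inode = NULL;
    /* call_rcu(&inode->rcu_head, free_fn); in the kernel */
}
```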
+
+
+31Jul2010
+ rcu protection of ->my_inode
+ A/ orphan inodes - are they protected?
+ B/ orphan blocks - are the inodes of those protected? Probably...
+
+ inodes are 'orphan' for two reasons
+ 1/ a truncate is in progress
+ 2/ there are no remaining links, so inode should be truncated/deleted
+ on restart.
+
+ The second precludes us from holding a refcount on any orphan inode,
+ else it would never get deleted.
+ So we must assert that an inode with I_Deleting or I_Trunc has an implied
+ reference and so delete must be delayed... not quite.
+ If we set I_Trunc but not I_Deleting, then we igrab the inode until
+ I_Trunc is cleared. While we hold the igrab, I_Deleting cannot possibly
+ be set as that is set when last ref is dropped.
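+ A toy model of that invariant (not the kernel's inode; just the
+ refcounting, to check the reasoning):

```c
#include <assert.h>
#include <stdbool.h>

/* I_Deleting is only set when the last reference is dropped, so holding
 * an igrab for the duration of I_Trunc excludes I_Deleting. */
struct toy_inode {
    int  count;
    bool i_trunc;
    bool i_deleting;
};

static bool igrab_toy(struct toy_inode *ino)
{
    if (ino->count == 0)
        return false;           /* already on its way out */
    ino->count++;
    return true;
}

static void iput_toy(struct toy_inode *ino)
{
    if (--ino->count == 0)
        ino->i_deleting = true; /* last ref gone: delete may proceed */
}

/* set I_Trunc and take the igrab that protects the truncate */
bool start_truncate(struct toy_inode *ino)
{
    if (!igrab_toy(ino))
        return false;
    ino->i_trunc = true;
    return true;
}

/* returns true if I_Deleting was (correctly) held off until now */
bool end_truncate(struct toy_inode *ino)
{
    bool deleting_during = ino->i_deleting;
    ino->i_trunc = false;
    iput_toy(ino);
    return !deleting_during;
}

bool truncate_demo(void)
{
    struct toy_inode ino = { .count = 1 };
    if (!start_truncate(&ino))
        return false;
    iput_toy(&ino);             /* last external ref dropped */
    if (ino.i_deleting)
        return false;           /* held off by the truncate ref */
    if (!end_truncate(&ino))
        return false;
    return ino.i_deleting;      /* now delete can proceed */
}
```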
+
+01Aug2010
+ FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
+ .. and in lafs_orphan_release
+ Well... only iolock_written can be a problem, and our rules require that
+ only phase-change writeout can set writeback. So the cleaner can never
+ wait for writeout here. Maybe it can wait for a lock, and maybe we don't
+ really need a lock, just 'wait_writeback'.
+08Aug2010
+ So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
+ It is waiting for writeback on 74/BIGNUM from file.c:329. So writepage
+ tried to write a block in a directory .. but it is PinPending so that
+ must have been set after writepage got it...
+ lafs_dir_handle_orphan gets an async lock, then sets PinPending.
+ If write_page is before that, it will have the lock and dir_handle will try later.
+ If write_page is after it will block on the lock, or see PinPending and
+ release the lock.
+ So someone else must be clearing PinPending!
+ - checkpoint clears and re-sets under the lock, so that is safe
+ - dir.c clears under i_mutex
+ dir_handle_orphans always holds i_mutex ... or does it.
+ - refile drops when the last non-lru reference goes.
+ - inode_map_new_abort clears for inode
+ No, not that - just a bad test on the result of iolock_written_async ;-(
+
+ Now have an interesting deadlock.
+ rm in lafs_delete_inode in inode_map_free is waiting for the block to
+ flush which requires the cleaner.
+ The cleaner thread in inode-handle_orphan is calling erase_dblock
+ on the same inode which blocks while inode_map_free has it locked....
+ no, not same block - just waiting for writeout which requires cleaner.
+ lafs_erase_dblock from inode_map_free must be async!
+ pin_dblock in lafs_orphan_release must too.... no - only the setting of
+ PinPending needs to be async or outside of the cleaner, which it is.
+
+ Ok, got that fixed. All seems happy again, time for a commit.
+
+