From 1fed726e9f053cebfc43e32cd5ad938e84cfee9d Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 4 Mar 2011 12:47:31 +1100 Subject: [PATCH] README update Signed-off-by: NeilBrown --- README | 106 ++++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 87 insertions(+), 19 deletions(-) diff --git a/README b/README index a5e15f4..6c3d454 100644 --- a/README +++ b/README @@ -5127,39 +5127,39 @@ DONE 15az/ Sanity check all values in cluster head during roll-forward i.e. in roll_valid. If the head isn't complete, we can still use this to commit some previous checkpoints. -15ba/ roll forward should not BUG on bad data like inodefile in +DONE 15ba/ roll forward should not BUG on bad data like inodefile in non-primary filesystem. -15bb/ Do I need to sync something before copying an update over part +DONE 15bb/ Do I need to sync something before copying an update over part of an inode, then reloading the inode. -15bc/ Handle DescHole in roll forward. +DONE 15bc/ Handle DescHole in roll forward. -15bd/ Call lafs_add_block_address from writeback rather than iolock +DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock in roll forward, just for consistency. -15be/ Confirm various files loaded at mount time (segusage, orphan ...) +DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...) are actually the correct type. -15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing +DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing a lookup - or at least we can test for that. lafs_seg_apply_all has similar problems and needs a good solution. -15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent +DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent if parent splits. See what to do about that. -15bh/ after roll-forward, check that free_blocks hasn't gone negative. +DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative. 
or handle if it has. DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first. to twostage. -15bj/ Make sure .last link in segtracker is kepts uptodate, particularly in +DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in segdelete. -15bk/ make sure get_cleanable doesn't lose a race before calling add_clean +DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean -15bl/ better checks for 'valid state block address' in valid_devblock +DONE 15bl/ better checks for 'valid state block address' in valid_devblock include that segment_count is credible also in valid_stateblock @@ -5173,12 +5173,13 @@ DONE 15bp/ review all superblocks - maybe use more anon?? 15bq/ check readonly status in lafs_get_sb -15br/ sync_fs should probably wait for something if 'wait'. +DONE 15br/ sync_fs should probably wait for something if 'wait'. -15bs/ set f_fsid properly in lafs_statfs +DONE 15bs/ set f_fsid properly in lafs_statfs - - use new write_begin / write_end - - review how we ensure that credit remain with block. +DONE - use new write_begin / write_end + +15bt/ - review how we ensure that credit remain with block. 15ca/ When pin inode data block, pin it as well as index block I think It is still kept of the leaf list until the index block is done with @@ -5205,7 +5206,10 @@ DONE 15bp/ review all superblocks - maybe use more anon?? 15cc/ free any stray B_ASync block found in destroy_inode 15cd/ Some code assumes a cluster header does not exceed 1 page. - Is this safe? Is in true? Is it enforced? + Is this safe? Is it true? Is it enforced? + roll-forward now handles large cluster_head. + Need cleaner to handle it, and need to possibly write large + cluster head when making new clusters. 15ce/ classify BUGs as - internal logic errors @@ -5300,7 +5304,7 @@ DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather than just iolock, for consistency... 
DONE 36f/ What to do if table becomes full when add_block_address in roll_block ?? -36g/ Write roll_mini for directories. +DONE 36g/ Write roll_mini for directories. DONE 36h/ In roll_one, use the cluster counting code to find block number and make sure we don't exceed the segment. DONE 36i/ add more general error checking to lafs_mount - @@ -5362,8 +5366,8 @@ DONE 52/ NFS export Need owner/group/perm for device file, but not for symlink. Can we create unique inode numbers? hard links for dev-files would be problematic. - What do we gain? Maybe something for sort symlinks. - 40 seems a ood length to et 70% of symlinks. + What do we gain? Maybe something for short symlinks. + 40 seems a good length to get 70% of symlinks. 59/ Fix NeedFlush handling so we don't drop-then-retake a mutex as that isn't sensible. @@ -5396,6 +5400,8 @@ DONE 52/ NFS export If a cross-directory rename happens care is needed: either flush updates first or ensure that a flush does happen before the cross-directory update is flushed. + Note that if the target of a rename is a directory, it must also be fully + flushed before the rename can proceed. 26June2010 Investigating 5a @@ -6973,3 +6979,65 @@ WritePhase - what is that all about? It is OK to delay the write-out of these until an fsync, and not bother if a checkpoint happens. So add that to th TODO list - item 66. + +28feb2010 + - roll forward directory updates ... I wonder if I got it right :-)(untested). + + + I don't seem to have easy-access notes about the various meaning of + 'width' and 'stride' + + width: The number of independent devices across which the (virtual) device + is placed. The normal goal is to write 'width' blocks on every single write. + On a RAID4/5/6 this will avoid the need to pre-read for parity calculations, + and it will keep all devices equally busy with writes. + The 'width' blocks probably aren't consecutive. 
+ + There are two different layouts - one with width*stride <= segment_size + and one with width*stride > segment_size. + + width*stride <= segment_size + This is a traditional striped layout like RAID0/4/5/6. + The 'stride' is the chunk size, so 'width*stride' is the stripe size, + and segment_size must be a multiple of this. + In this case all addresses in a single segment are contiguous. We don't + necessarily write them in order if we want to write less than one stripe. + segment_offset will normally be a multiple of width*stride though this isn't + enforced as one could have a partition with a non-aligned start. + + width*stride > segment_size + This implies a concatenated layout. If parity-redundancy is in use then the + blocks which combine to form a stripe are 'stride' blocks apart. + The benefit of this layout is that an extra drive can be added by simply + zeroing it and joining it to the array - no re-stripe needed. + This will make all stripes slightly larger so at first the space will not + be available. As cleaning happens the space will gradually become + available. This still requires restriping, but unlike a normal + raid5 restripe, the space becomes available in small amounts immediately, + and when there is no demand for more space, the re-striping (cleaning) can happen + at a very low priority with no cost. + + In this case the blocks in a segment are not contiguous. + 'segment_size/width' blocks are, then there is a large gap (in virtual address + space) to the next chunk. + + The segment_offset is an amount of space which is free at the start of + each device. 0..segment_offset and stride..stride+segment_offset etc + do not contain data and can be used for metadata. + + When width > 1 it makes sense to replicate each state block across + every device - as we want to write the whole stripe anyway. + For now we only write and read the first two copies at the beginning, and + the last two at the end... 
+ + Question: what do we want to do about metadata on flash devices? We really + don't want a small number of locations to store the metadata, but a large + number that we search through - possibly a binary search. + These could be all at start/end or scattered throughout the device. + The latter would make it impossible to find efficiently - there is no way to + create useful linkage without writing something else at start or end. + As many devices optimise for random writes where the FAT table would be, + it makes sense to just put the metadata there and not at the end. + We should allow one 'page' for each metadatum, which probably means + 32K. + So we should allow all state blocks to be near the start. -- 2.39.5
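
The two width/stride layouts in the notes above can be sketched as a small
address calculation. This is only one reading of the prose, not the lafs
implementation: the function names, the argument order, and the order in
which the per-device runs are listed are all assumptions made for
illustration. Sizes are in blocks.

```python
def layout_kind(width, stride, segment_size):
    """Classify the layout the way the notes do (hypothetical helper)."""
    if width * stride <= segment_size:
        return "striped"
    return "concatenated"


def segment_addresses(seg, width, stride, segment_size, segment_offset):
    """Virtual block addresses covered by segment number `seg`.

    Assumption: segments are numbered from 0 and packed back to back
    after the free space described by segment_offset.
    """
    if width * stride <= segment_size:
        # Striped: a segment is a whole number of stripes and all of
        # its addresses are contiguous.
        if segment_size % (width * stride) != 0:
            raise ValueError("segment_size must be a multiple of width*stride")
        start = segment_offset + seg * segment_size
        return list(range(start, start + segment_size))

    # Concatenated: device `dev` is assumed to occupy virtual addresses
    # dev*stride .. (dev+1)*stride, with segment_offset blocks free at the
    # start of each. A segment takes segment_size/width contiguous blocks
    # on every device, and those runs are 'stride' blocks apart.
    if segment_size % width != 0:
        raise ValueError("segment_size must be a multiple of width")
    run = segment_size // width
    addrs = []
    for dev in range(width):
        start = dev * stride + segment_offset + seg * run
        addrs.extend(range(start, start + run))
    return addrs


if __name__ == "__main__":
    # Two devices, 1000 blocks apart, 8-block segments, 10 blocks reserved.
    print(layout_kind(2, 1000, 8))
    print(segment_addresses(3, 2, 1000, 8, 10))
```

With these made-up numbers, segment 3 is two runs of 4 blocks, one per
device, separated by a large gap in virtual address space. That is the
"'segment_size/width' are contiguous, then a gap" behaviour the notes describe.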