26 Sep, 2014

1 commit

  • commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

    aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs, but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper,
    pagecache_get_page(), which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.
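
    For illustration, a simplified sketch of the shape of this helper (the real
    pagecache_get_page() in mm/filemap.c also takes separate gfp masks for the
    radix tree, and handles FGP_NOFS/FGP_NOWAIT and more error cases):

        struct page *pagecache_get_page(struct address_space *mapping,
                                        pgoff_t index, int fgp_flags,
                                        gfp_t gfp_mask)
        {
                struct page *page;

                page = find_get_page(mapping, index);
                if (page) {
                        if (fgp_flags & FGP_LOCK)
                                lock_page(page);
                        /* already visible: atomic mark_page_accessed() is needed */
                        if (fgp_flags & FGP_ACCESSED)
                                mark_page_accessed(page);
                        return page;
                }

                if (!(fgp_flags & FGP_CREAT))
                        return NULL;

                page = __page_cache_alloc(gfp_mask);
                if (!page)
                        return NULL;

                /* not visible yet: non-atomic initialisation is safe */
                if (fgp_flags & FGP_ACCESSED)
                        init_page_accessed(page);

                /* add_to_page_cache_lru() returns with the page locked */
                if (add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
                        page_cache_release(page);
                        return NULL;
                }
                return page;
        }

        /* The old entry points become thin wrappers, for example: */
        static inline struct page *find_or_create_page(struct address_space *mapping,
                                                       pgoff_t index, gfp_t gfp)
        {
                return pagecache_get_page(mapping, index,
                                          FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
        }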

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare, but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously were not, but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                              3.15.0-rc3             3.15.0-rc3
                                 vanilla            accessed-v2
    ext3   Max elapsed    13.9900 (  0.00%)    11.5900 ( 17.16%)
    tmpfs  Max elapsed     0.5100 (  0.00%)     0.4900 (  3.92%)
    btrfs  Max elapsed    12.8100 (  0.00%)    12.7800 (  0.23%)
    ext4   Max elapsed    18.6000 (  0.00%)    13.3400 ( 28.28%)
    xfs    Max elapsed    12.5600 (  0.00%)     2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

             samples  percentage
    ext3       86107      0.9783  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    ext3       23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    ext3        5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    ext4       64566      0.8961  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    ext4        5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    ext4        2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    xfs        62126      1.7675  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    xfs         1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    xfs          103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    btrfs      10655      0.1338  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    btrfs       2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    btrfs        587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    tmpfs      59562      3.2628  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    tmpfs       1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    tmpfs         94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     

30 Jul, 2013

2 commits

  • This patch fixes mishandling of the sbi->n_orphans variable.

    If users issue many f2fs_unlink() calls, check_orphan_space() can be contended.
    In that case, sbi->n_orphans may be read incorrectly, so f2fs_unlink() can fall
    into a wrong state, which results in the failure of add_orphan_inode().

    So, let's increment sbi->n_orphans to reserve a slot before doing the actual
    orphan inode work. After that, release sbi->n_orphans by calling
    release_orphan_inode or remove_orphan_inode.
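
    A rough sketch of this reserve/release pattern (the lock and limit field
    names here are assumptions for illustration, not necessarily the exact
    f2fs code):

        int acquire_orphan_inode(struct f2fs_sb_info *sbi)
        {
                int err = 0;

                mutex_lock(&sbi->orphan_inode_mutex);
                if (sbi->n_orphans >= sbi->max_orphans)
                        err = -ENOSPC;          /* no room: caller fails the unlink */
                else
                        sbi->n_orphans++;       /* reserve the slot up front */
                mutex_unlock(&sbi->orphan_inode_mutex);
                return err;
        }

        void release_orphan_inode(struct f2fs_sb_info *sbi)
        {
                mutex_lock(&sbi->orphan_inode_mutex);
                BUG_ON(sbi->n_orphans == 0);
                sbi->n_orphans--;               /* undo the reservation */
                mutex_unlock(&sbi->orphan_inode_mutex);
        }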

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • As we remove only the target single node, list_for_each is enough; in order
    to clean up the code, we use list_for_each_entry instead.
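
    Illustrative before/after of the cleanup (the entry struct and slab names
    are assumed for the example):

        /* before: walk raw list_head nodes and convert each one by hand */
        list_for_each(this, head) {
                orphan = list_entry(this, struct orphan_inode_entry, list);
                if (orphan->ino == ino) {
                        list_del(&orphan->list);
                        kmem_cache_free(orphan_entry_slab, orphan);
                        break;
                }
        }

        /* after: list_for_each_entry does the container_of() for us */
        list_for_each_entry(orphan, head, list) {
                if (orphan->ino == ino) {
                        list_del(&orphan->list);
                        kmem_cache_free(orphan_entry_slab, orphan);
                        break;
                }
        }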

    Signed-off-by: Gu Zheng
    Signed-off-by: Jaegeuk Kim

    Gu Zheng
     

02 Jul, 2013

1 commit

  • While calculating CRC for the checkpoint block, we use __u32, but when storing
    the crc value to the disk, we use __le32.

    Let's fix the inconsistency.
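
    The fix amounts to converting explicitly at the point where the value hits
    the on-disk buffer, roughly (sketch; offset handling simplified):

        __u32 crc = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset));

        /* store in the on-disk (little-endian) layout explicitly */
        *((__le32 *)((unsigned char *)ckpt +
                     le32_to_cpu(ckpt->checksum_offset))) = cpu_to_le32(crc);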

    Reported-and-Tested-by: Oded Gabbay
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

07 Jun, 2013

1 commit

  • It is possible that iput is skipped after iget during the recovery.

    In recover_dentry():
        dir = f2fs_iget();
        ...
        if (de && inode->i_ino == le32_to_cpu(de->ino))
                goto out;

    In this case, this dir cannot be added to the dirty_dir_inode_list.
    The actual linking is done only when set_page_dirty() is called.

    So let's add this newly got inode into the list explicitly, and put it at the
    end of the recovery routine.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

28 May, 2013

4 commits

  • - iget/iput flow in the dentry recovery process

    1. *dir* = f2fs_iget
    2. set FI_DELAY_IPUT on *dir*
    3. add *dir* to the dirty_dir_list
       - __f2fs_add_link
       - recover_dentry()
    4. iput *dir* by remove_dirty_dir_inode
       - sync_dirty_dir_inodes
       - write_checkpoint

    If *dir*'s i_count is not 1 (i.e., the root dir), remove_dirty_dir_inode is
    called later and then iput is triggered again due to the FI_DELAY_IPUT flag.
    So, let's unset the flag properly once iput has been triggered.
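
    A minimal sketch of that idea (the exact call site in f2fs may differ; the
    flag helpers named here are the usual f2fs ones):

        /* only the first remover performs the delayed iput */
        if (is_inode_flag_set(F2FS_I(dir), FI_DELAY_IPUT)) {
                clear_inode_flag(F2FS_I(dir), FI_DELAY_IPUT);
                iput(dir);
        }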

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • If some unwritten blocks remain from the recovery, we should not call
    iput on that directory inode.
    Otherwise, we can lose some dentry blocks after the recovery.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Some counters are needed only for statistical information while debugging,
    so they can be controlled using CONFIG_F2FS_STAT_FS, pushing the usage of a
    few variables under this flag.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Amit Sahrawat
    Signed-off-by: Jaegeuk Kim

    Namjae Jeon
     
  • During the dentry recovery routine, recover_inode() triggers __f2fs_add_link
    with its directory inode.

    In the following scenario, a bug is captured.
    1. dir = f2fs_iget(pino)
    2. __f2fs_add_link(dir, name)
    3. iput(dir)
    -> f2fs_evict_inode() hits BUG_ON(atomic_read(fi->dirty_dents))

    Kernel BUG at ffffffffa01c0676 [verbose debug info unavailable]
    [] f2fs_evict_inode+0x276/0x300 [f2fs]
    Call Trace:
    [] evict+0xb0/0x1b0
    [] iput+0x105/0x190
    [] recover_fsync_data+0x3bc/0x1070 [f2fs]
    [] ? io_schedule+0xaa/0xd0
    [] ? __wait_on_bit_lock+0x7b/0xc0
    [] ? __lock_page+0x67/0x70
    [] ? kmem_cache_alloc+0x31/0x140
    [] ? __d_instantiate+0x92/0xf0
    [] ? security_d_instantiate+0x1b/0x30
    [] ? d_instantiate+0x54/0x70

    This means that we should flush all the dentry pages between iget and iput().
    But during the recovery routine this is not allowed, for consistency reasons,
    so we have to wait until the whole recovery process finishes.
    Then, write_checkpoint flushes all the dirty dentry blocks, and we can nicely
    put the stale dir inodes from the dirty_dir_inode_list.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

26 Apr, 2013

1 commit

  • Previously, background GC submits many 4KB read requests to load victim blocks
    and/or its (i)node blocks.

    ...
    f2fs_gc : f2fs_readpage: ino = 1, page_index = 0xb61, blkaddr = 0x3b964ed
    f2fs_gc : block_rq_complete: 8,16 R () 499854968 + 8 [0]
    f2fs_gc : f2fs_readpage: ino = 1, page_index = 0xb6f, blkaddr = 0x3b964ee
    f2fs_gc : block_rq_complete: 8,16 R () 499854976 + 8 [0]
    f2fs_gc : f2fs_readpage: ino = 1, page_index = 0xb79, blkaddr = 0x3b964ef
    f2fs_gc : block_rq_complete: 8,16 R () 499854984 + 8 [0]
    ...

    However, since many of these IOs are sequential, we can give the IO scheduler
    a chance to merge them.
    In order to do that, let's use blk_plug.

    ...
    f2fs_gc : f2fs_iget: ino = 143
    f2fs_gc : f2fs_readpage: ino = 143, page_index = 0x1c6, blkaddr = 0x2e6ee
    f2fs_gc : f2fs_iget: ino = 143
    f2fs_gc : f2fs_readpage: ino = 143, page_index = 0x1c7, blkaddr = 0x2e6ef
    : block_rq_complete: 8,16 R () 1519616 + 8 [0]
    : block_rq_complete: 8,16 R () 1519848 + 8 [0]
    : block_rq_complete: 8,16 R () 1520432 + 96 [0]
    : block_rq_complete: 8,16 R () 1520536 + 104 [0]
    : block_rq_complete: 8,16 R () 1521008 + 112 [0]
    : block_rq_complete: 8,16 R () 1521440 + 152 [0]
    : block_rq_complete: 8,16 R () 1521688 + 144 [0]
    : block_rq_complete: 8,16 R () 1522128 + 192 [0]
    : block_rq_complete: 8,16 R () 1523256 + 328 [0]
    ...

    Note that this issue should also be addressed in the checkpoint and some
    readahead flows.
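
    The pattern is simply to wrap the burst of small reads in a plug (sketch
    only; the loop body is a placeholder for the GC readahead described above):

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* submit the many 4KB read requests for victim blocks and
         * their (i)node blocks here, e.g. via f2fs_readpage() */
        blk_finish_plug(&plug);         /* flush: adjacent requests get merged */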

    Reviewed-by: Namjae Jeon
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

09 Apr, 2013

1 commit

  • In the previous version, f2fs uses global locks according to the usage types,
    such as directory operations, block allocation, block write, and so on.

    Reference the following lock types in f2fs.h.
    enum lock_type {
            RENAME,         /* for renaming operations */
            DENTRY_OPS,     /* for directory operations */
            DATA_WRITE,     /* for data write */
            DATA_NEW,       /* for data allocation */
            DATA_TRUNC,     /* for data truncate */
            NODE_NEW,       /* for node allocation */
            NODE_TRUNC,     /* for node truncate */
            NODE_WRITE,     /* for node write */
            NR_LOCK_TYPE,
    };

    In that case, we lose performance in multi-threaded environments, since each
    type of operation must be conducted one at a time.

    In order to address the problem, let's share the locks globally with a mutex
    array regardless of the operation type.
    So, let users grab a mutex and perform their jobs in parallel as much as
    possible.

    For this, I propose a new global lock scheme as follows.

    0. Data structure
    - f2fs_sb_info -> mutex_lock[NR_GLOBAL_LOCKS]
    - f2fs_sb_info -> node_write

    1. mutex_lock_op(sbi)
    - try to get an available lock from the array.
    - returns the index of the acquired lock variable.

    2. mutex_unlock_op(sbi, index of the lock)
    - unlock the given index of the lock.

    3. mutex_lock_all(sbi)
    - grab all the locks in the array before the checkpoint.

    4. mutex_unlock_all(sbi)
    - release all the locks in the array after checkpoint.

    5. block_operations()
    - call mutex_lock_all()
    - sync_dirty_dir_inodes()
    - grab node_write
    - sync_node_pages()

    Note that,
    the pairs of mutex_lock_op()/mutex_unlock_op() and
    mutex_lock_all()/mutex_unlock_all() should be used together.
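
    A sketch of how such a lock array can be driven (the fs_lock/next_lock_num
    field names and the trylock-then-fallback scan are illustrative, not
    necessarily the exact implementation):

        static int mutex_lock_op(struct f2fs_sb_info *sbi)
        {
                int i;

                /* grab any currently free slot */
                for (i = 0; i < NR_GLOBAL_LOCKS; i++)
                        if (mutex_trylock(&sbi->fs_lock[i]))
                                return i;

                /* all busy: block on one slot instead of spinning */
                i = sbi->next_lock_num++ % NR_GLOBAL_LOCKS;
                mutex_lock(&sbi->fs_lock[i]);
                return i;
        }

        static void mutex_unlock_op(struct f2fs_sb_info *sbi, int ilock)
        {
                mutex_unlock(&sbi->fs_lock[ilock]);
        }

        /* the checkpoint excludes everyone by taking every slot */
        static void mutex_lock_all(struct f2fs_sb_info *sbi)
        {
                int i;

                for (i = 0; i < NR_GLOBAL_LOCKS; i++)
                        mutex_lock(&sbi->fs_lock[i]);
        }

        static void mutex_unlock_all(struct f2fs_sb_info *sbi)
        {
                int i;

                for (i = 0; i < NR_GLOBAL_LOCKS; i++)
                        mutex_unlock(&sbi->fs_lock[i]);
        }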

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

03 Apr, 2013

1 commit

  • This patch removes a bitmap for victim segments selected by foreground GC, and
    modifies the other bitmap for victim segments selected by background GC.

    1) foreground GC bitmap
    : We don't need to manage this, since we keep just one previous victim section
    number instead of the whole victim history.
    F2fs uses the victim section number in order not to allocate the currently
    GC'ed section to the current active logs.

    2) background GC bitmap
    : This bitmap is used to avoid selecting victims repeatedly by background GCs.
    In addition, the victims can still be selected by foreground GCs, since
    there is no need to read victim blocks during foreground GCs.

    Since the foreground GC reclaims segments in section units, it'd be better
    to manage this bitmap at section granularity.

    Reviewed-by: Namjae Jeon
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

20 Mar, 2013

1 commit

  • This patch reduces redundant locking and unlocking of pages during read
    operations.
    In f2fs_readpage, let's use wait_on_page_locked() instead of lock_page.
    Then, only when we finally need to modify any data, let's lock the page so
    that we can avoid lock contention.

    [readpage rule]
    - The f2fs_readpage returns unlocked page, or released page too in error cases.
    - Its caller should handle read error, -EIO, after locking the page, which
    indicates read completion.
    - Its caller should check PageUptodate after grab_cache_page.
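
    For illustration, a caller following these rules might look roughly like
    this (sketch only; error-path details are simplified):

        page = grab_cache_page(mapping, index);
        if (!page)
                return -ENOMEM;

        if (!PageUptodate(page)) {                      /* rule 3 */
                err = f2fs_readpage(sbi, page, blkaddr, READ_SYNC);
                if (err)
                        return err;                     /* rule 1: page already released */
                lock_page(page);                        /* rule 2: lock implies completion */
                if (!PageUptodate(page)) {
                        f2fs_put_page(page, 1);
                        return -EIO;                    /* rule 2: caller handles read error */
                }
        }
        /* safe to modify the page contents here, then unlock/put it */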

    Signed-off-by: Changman Lee
    Reviewed-by: Namjae Jeon
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

12 Feb, 2013

4 commits

  • This patch makes clearer the ambiguous f2fs_gc flow as follows.

    1. Remove intermediate checkpoint condition during f2fs_gc
    (i.e., should_do_checkpoint() and GC_BLOCKED)

    2. Remove unnecessary return values of f2fs_gc because of #1.
    (i.e., GC_NODE, GC_OK, etc)

    3. Simplify write_checkpoint() because of #2.

    4. Clarify the main f2fs_gc flow.
    o monitor how many sections were freed during one iteration of
    do_garbage_collect().
    o do more GC without checkpoints if we can't get enough free sections.
    o do a checkpoint once we've got enough free sections through foreground GCs.

    5. Adopt the thread-logging (Slack-Space-Recycle) scheme more aggressively on
    data log types. See get_ssr_segment().

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • F2FS_SET_SB_DIRT is called in inc_page_count and
    it is directly called one more time in the next line.

    Signed-off-by: Changman Lee
    Signed-off-by: Jaegeuk Kim

    Changman Lee
     
  • For the code
    > prev = list_entry(orphan->list.prev, typeof(*prev), list);
    if orphan->list.prev == head, it can't get the right prev entry.
    Instead, we can use the list cursor 'this' to add the new entry.
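
    In other words (illustrative sketch, not the verbatim patch), insert
    relative to the cursor itself rather than deriving a prev entry:

        list_for_each(this, head) {
                orphan = list_entry(this, struct orphan_inode_entry, list);
                if (orphan->ino > ino)
                        break;          /* insert before this entry */
        }
        /* works whether we broke out mid-list or ran off the end
         * (this == head): the new node lands just before 'this' */
        list_add_tail(&new->list, this);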

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jaegeuk Kim

    majianpeng
     
  • This patch enhances the checkpoint routine to cope with IO errors.

    Basically, f2fs detects IO errors from end_io_write, and such errors can
    occur during data, node, or meta page writes.

    In the previous code, when an IO error occurred during writes, f2fs set a
    flag, CP_ERROR_FLAG, in the raw checkpoint buffer which will be written to
    disk. Afterwards, write_checkpoint() would check the flag and remount f2fs
    read-only (ro).

    However, even once f2fs is remounted ro, dirty checkpoint pages can still be
    freely written to disk by the flusher or kswapd in the background.
    In such a case, after a cold reboot, f2fs would restore the checkpoint data
    having CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting
    f2fs ro again.

    Therefore, let's prevent any checkpoint page (meta) writes once an IO error
    occurs, and remount f2fs ro right away at that moment.

    Reported-by: Oliver Winker
    Signed-off-by: Jaegeuk Kim
    Reviewed-by: Namjae Jeon

    Jaegeuk Kim
     

11 Dec, 2012

3 commits

  • As pointed out by Randy Dunlap, this patch removes all usage of "/**" for comment
    blocks. Instead, just use "/*".

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch should resolve the bugs reported by the sparse tool.
    Initial reports were written by "kbuild test robot" managed by fengguang.wu.

    On my local machines, I've also tested by running:
    > make C=2 CF="-D__CHECK_ENDIAN__"

    Accordingly, I've found lots of warnings and bugs related to endian
    conversion, and I've fixed them all for now.
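
    A typical class of fix this check uncovers looks like the following
    (generic before/after example, not a specific hunk from the patch):

        /* before: sparse warns "restricted __le32 degrades to integer" */
        static bool ino_matches(struct f2fs_dir_entry *de, nid_t ino)
        {
                return de->ino == ino;          /* mixes __le32 with cpu order */
        }

        /* after: convert explicitly at the boundary */
        static bool ino_matches(struct f2fs_dir_entry *de, nid_t ino)
        {
                return le32_to_cpu(de->ino) == ino;
        }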

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This adds functions required by the checkpoint operations.

    Basically, f2fs adopts a roll-back model with checkpoint blocks written in the
    CP area. The checkpoint procedure is as follows.

    - write_checkpoint()
    1. block_operations() freezes VFS calls.
    2. submit cached bios.
    3. flush_nat_entries() writes NAT pages updated by dirty NAT entries.
    4. flush_sit_entries() writes SIT pages updated by dirty SIT entries.
    5. do_checkpoint() writes,
    - checkpoint block (#0)
    - orphan inode blocks
    - summary blocks made by active logs
    - checkpoint block (copy of #0)
    6. unblock_operations()

    In order to provide an address space for meta pages, f2fs_sb_info has a special
    inode, namely meta_inode. This patch also adds the address space operations for
    meta_inode.
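
    Condensed, the sequence above looks roughly like this (sketch only; real
    argument lists and error handling are omitted):

        void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
        {
                block_operations(sbi);                  /* 1. freeze VFS calls */

                f2fs_submit_bio(sbi, DATA, true);       /* 2. submit cached bios */
                f2fs_submit_bio(sbi, NODE, true);
                f2fs_submit_bio(sbi, META, true);

                flush_nat_entries(sbi);                 /* 3. dirty NAT entries -> NAT pages */
                flush_sit_entries(sbi);                 /* 4. dirty SIT entries -> SIT pages */

                /* 5. CP block #0, orphan inode blocks, summary blocks,
                 *    and finally the copy of CP block #0 */
                do_checkpoint(sbi, is_umount);

                unblock_operations(sbi);                /* 6. thaw VFS calls */
        }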

    Signed-off-by: Chul Lee
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim