14 Jan, 2016

1 commit

  • Pull f2fs updates from Jaegeuk Kim:
    "This series adds two ioctls to control cached data and fragmented
    files. Most of the rest fixes missing error cases and bugs that we
    have not covered so far. Summary:

    Enhancements:
    - support an ioctl to execute online file defragmentation
    - support an ioctl to flush cached data
    - speed up shrinking of extent_cache entries
    - handle broken superblock
    - refactor dirty inode management infra
    - revisit f2fs_map_blocks to handle more cases
    - reduce global lock coverage
    - add detecting user's idle time

    Major bug fixes:
    - fix data race condition on cached nat entries
    - fix error cases of volatile and atomic writes"

    * tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (87 commits)
    f2fs: should unset atomic flag after successful commit
    f2fs: fix wrong memory condition check
    f2fs: monitor the number of background checkpoint
    f2fs: detect idle time depending on user behavior
    f2fs: introduce time and interval facility
    f2fs: skip releasing nodes in chindless extent tree
    f2fs: use atomic type for node count in extent tree
    f2fs: recognize encrypted data in f2fs_fiemap
    f2fs: clean up f2fs_balance_fs
    f2fs: remove redundant calls
    f2fs: avoid unnecessary f2fs_balance_fs calls
    f2fs: check the page status filled from disk
    f2fs: introduce __get_node_page to reuse common code
    f2fs: check node id earily when readaheading node page
    f2fs: read isize while holding i_mutex in fiemap
    Revert "f2fs: check the node block address of newly allocated nid"
    f2fs: cover more area with nat_tree_lock
    f2fs: introduce max_file_blocks in sbi
    f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink
    f2fs: introduce zombie list for fast shrinking extent trees
    ...

    Linus Torvalds
     

09 Jan, 2016

2 commits


31 Dec, 2015

1 commit

  • Otherwise, we can get mismatched largest extent information.

    One example is:
    1. mount f2fs w/ extent_cache
    2. make a small extent
    3. umount
    4. mount f2fs w/o extent_cache
    5. update the largest extent
    6. umount
    7. mount f2fs w/ extent_cache
    8. get the old extent made by #2

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

17 Dec, 2015

1 commit


16 Dec, 2015

1 commit


09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go - holding an arbitrary
    number of kmaps for a long time is a great way to deadlock the
    system.

    new helper (inode_nohighmem(inode)) needs to be used for pagecache
    symlinks inodes; done for all in-tree cases. page_follow_link_light()
    instrumented to yell about anything missed.

    Signed-off-by: Al Viro

    Al Viro
     

10 Oct, 2015

1 commit


25 Aug, 2015

2 commits

  • In the following call stack, if we unluckily lose every chance to truncate
    the inode page in remove_inode_page, we eventually add the previously
    allocated nid to the free nid cache. This nid has NID_NEW status and
    NEW_ADDR in its blkaddr pointer:

    - f2fs_create
     - f2fs_add_link
      - __f2fs_add_link
       - init_inode_metadata
        - new_inode_page
         - new_node_page
          - set_node_addr(, NEW_ADDR)
        - f2fs_init_acl       failed
        - remove_inode_page   failed
     - handle_failed_inode
      - remove_inode_page     failed
      - iput
       - f2fs_evict_inode
        - remove_inode_page   failed
        - alloc_nid_failed    cache a nid with valid blkaddr: NEW_ADDR

    This may not only leak the resources of the previous inode, but may also
    cause incorrect use of the previous blkaddr stored in the node entry of
    this nid when the nid is reused by others.

    This patch tries to add the inode to the orphan list if we fail to truncate
    it, so that we get a second chance to release it in the orphan recovery
    flow.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • According to commit 5f16f3225b06 ("ext4: atomically set inode->i_flags in
    ext4_set_inode_flags()").

    Signed-off-by: Zhang Zhen
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Zhang Zhen
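
The referenced ext4 change replaces a plain read-modify-write of the flags word with a compare-and-exchange loop, so that a concurrent update of unrelated flag bits is not lost. The following is a minimal userspace sketch of that idea using C11 atomics. It is not the kernel code: the function name set_mask_bits_sketch and the DEMO_* flag values are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_SYNC      0x01u
#define DEMO_APPEND    0x02u
#define DEMO_IMMUTABLE 0x04u
#define DEMO_NOATIME   0x08u

/*
 * Replace the bits selected by `mask` with `new_bits` in one atomic step.
 * Bits outside `mask` keep whatever value other threads gave them.
 * Returns the value that was stored.
 */
static unsigned int set_mask_bits_sketch(_Atomic unsigned int *flags,
                                         unsigned int mask,
                                         unsigned int new_bits)
{
    unsigned int old = atomic_load(flags);
    unsigned int updated;

    do {
        updated = (old & ~mask) | (new_bits & mask);
        /* On failure `old` is refreshed with the current value; retry. */
    } while (!atomic_compare_exchange_weak(flags, &old, updated));

    return updated;
}
```

The loop recomputes the new value from the freshly observed word on every retry, which is what distinguishes it from clearing and setting bits in two separate, non-atomic steps.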
     

05 Aug, 2015

5 commits

  • If we clear the inline data/dentry flag in handle_failed_inode, we will
    fail to decrement the stat count of inline data/dentry in f2fs_evict_inode
    because the inode no longer carries the flag. So remove the wrong clearing.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • In handle_failed_inode, there is a potential deadlock which can happen
    in the call path below:

    - f2fs_create
     - f2fs_lock_op                 down_read(cp_rwsem)
     - f2fs_add_link
      - __f2fs_add_link
       - init_inode_metadata
        - f2fs_init_security        failed
        - truncate_blocks           failed
     - handle_failed_inode
      - f2fs_truncate
       - truncate_blocks(..,true)
                                    - write_checkpoint
                                     - block_operations
                                      - f2fs_lock_all   down_write(cp_rwsem)
        - f2fs_lock_op              down_read(cp_rwsem)
    So in this path, we pass parameter to f2fs_truncate to make sure
    cp_rwsem in truncate_blocks will not be locked again.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
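
One reading of the call path above is that a thread already holding cp_rwsem for read tries to take it for read again while a checkpoint is queued for write, so the nested reader waits behind the writer and the writer waits for the first reader. The sketch below models that with a toy lock whose functions return false where a real rwsem would sleep. All names (toy_rwsem, toy_truncate, and so on) are hypothetical, and the toy is only an assumption-laden illustration of why passing a "do not lock" parameter helps, not the f2fs implementation.

```c
#include <stdbool.h>

/* Toy stand-in for cp_rwsem: counts readers, remembers a queued writer. */
struct toy_rwsem {
    int readers;
    bool writer_waiting;
};

/* Returns false where a real rwsem would put the caller to sleep. */
static bool toy_down_read(struct toy_rwsem *s)
{
    if (s->writer_waiting)
        return false;      /* new readers queue behind a waiting writer */
    s->readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
    s->readers--;
}

/* Returns false (and records the waiter) while readers hold the lock. */
static bool toy_down_write(struct toy_rwsem *s)
{
    if (s->readers > 0) {
        s->writer_waiting = true;
        return false;
    }
    return true;
}

/* Truncation helper that takes the lock only when the caller asks for it. */
static bool toy_truncate(struct toy_rwsem *s, bool lock)
{
    if (lock && !toy_down_read(s))
        return false;      /* nested read lock: would never be granted */
    /* ... truncate work would happen here ... */
    if (lock)
        toy_up_read(s);
    return true;
}
```

A caller that already holds the read lock passes lock = false, which mirrors the parameter described in the commit message.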
     
  • This patch adds a stat counting the inodes that use inline xattrs, for
    display in debugfs.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • We don't need to handle the duplicate extent information.

    The integrated rule is:
    - update on-disk extent with largest one tracked by in-memory extent_cache
    - destroy extent_tree for the truncation case
    - drop per-inode extent_cache by shrinker

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Before iput is called, the inode number used by a bad inode can be reassigned
    to another new inode, resulting in abnormal behavior on the new inode.
    This should not happen for the new inode.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

02 Jun, 2015

1 commit

  • This patch applies the following ext4 patch:

    ext4 crypto: use per-inode tfm structure

    As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
    we read or write a page. Instead we can use a single tfm hanging off
    the inode's crypt_info structure for all of our encryption needs for
    that inode, since the tfm can be used by multiple crypto requests in
    parallel.

    Also use cmpxchg() to avoid races that could result in crypt_info
    structure getting doubly allocated or doubly freed.

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
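
The cmpxchg() usage described above follows a common publish-once pattern: build the object privately, then install it only if the slot is still empty, and discard the private copy if another thread won the race. Below is a minimal userspace sketch of that pattern with C11 atomics. The types crypt_info_sketch and inode_sketch and the function get_crypt_info are hypothetical stand-ins, not the ext4 or f2fs structures.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-inode crypto state. */
struct crypt_info_sketch {
    int key_id;
};

struct inode_sketch {
    _Atomic(struct crypt_info_sketch *) crypt_info;
};

/*
 * Return the inode's crypt info, allocating it on first use.
 * If two threads race, exactly one allocation is published; the loser
 * frees its own copy and returns the winner's pointer.
 */
static struct crypt_info_sketch *get_crypt_info(struct inode_sketch *inode,
                                                int key_id)
{
    struct crypt_info_sketch *ci = atomic_load(&inode->crypt_info);
    struct crypt_info_sketch *expected = NULL;

    if (ci)
        return ci;

    ci = malloc(sizeof(*ci));
    if (!ci)
        return NULL;
    ci->key_id = key_id;

    /* Install only if the slot is still NULL. */
    if (atomic_compare_exchange_strong(&inode->crypt_info, &expected, ci))
        return ci;

    /* Someone else installed first: drop ours, use theirs. */
    free(ci);
    return expected;
}
```

Because the slot changes from NULL to a valid pointer exactly once, the object can be neither allocated twice nor freed twice by racing callers.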
     

29 May, 2015

2 commits

  • This patch implements encryption support for symlink.

    Signed-off-by: Uday Savagaonkar
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch activates the following APIs for encryption support.

    The rules quoted by ext4 are:
    - An unencrypted directory may contain encrypted or unencrypted files
    or directories.
    - All files or directories in a directory must be protected using the
    same key as their containing directory.
    - Encrypted inode for regular file should not have inline_data.
    - Encrypted symlink and directory may have inline_data and inline_dentry.

    This patch activates the following APIs.
    1. f2fs_link : validate context
    2. f2fs_lookup : ''
    3. f2fs_rename : ''
    4. f2fs_create/f2fs_mkdir : inherit its dir's context
    5. f2fs_direct_IO : do buffered io for regular files
    6. f2fs_open : check encryption info
    7. f2fs_file_mmap : ''
    8. f2fs_setattr : ''
    9. f2fs_file_write_iter : '' (Called by sys_io_submit)
    10. f2fs_fallocate : do not support fcollapse
    11. f2fs_evict_inode : free_encryption_info

    Signed-off-by: Michael Halcrow
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

11 Apr, 2015

4 commits

  • This patch fixes the below warning.

    sparse warnings: (new ones prefixed by >>)

    >> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
    >> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer

    Reported-by: kbuild test robot
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch tries to preserve the last extent info from the extent tree cache
    in the on-disk inode, so that we can reuse it next time for better
    performance.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • With the normal extent info cache, we record the largest extent mapping between
    logical block and physical block in the extent info, and we persist the extent
    info in the on-disk inode.

    When the extent tree cache is enabled, if the on-disk inode has extent info and
    that extent is not a small fragmented mapping, we had better load it into the
    extent tree cache when the inode is loaded. This way we have a better chance of
    hitting the extent tree cache instead of spending more time reading the dnode
    page for the block address.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This patch avoids some punch_hole overhead when releasing volatile data.
    If the volatile data was not written yet, we can just zero out the first page.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

04 Mar, 2015

2 commits

  • This patch enables rb-tree based extent cache in f2fs.

    When we mount with "-o extent_cache", f2fs will try to add recently accessed
    page-block mappings into the rb-tree based extent cache as much as possible,
    instead of the original single extent info cache.

    This way, f2fs can provide a more effective cache between the dnode page cache
    and the disk. It offers a high hit ratio with less memory when the dnode page
    cache is reclaimed in low-memory environments.

    Storage: Sandisk sd card 64g
    1. append write file (offset: 0, size: 128M);
    2. overwrite file (offset: 2M, size: 1M);
    3. overwrite file (offset: 4M, size: 1M);
    ...
    4. overwrite file (offset: 48M, size: 1M);
    ...
    5. overwrite file (offset: 112M, size: 1M);
    6. sync
    7. echo 3 > /proc/sys/vm/drop_caches
    8. read file (size: 128M, unit: 4k, count: 32768)
       (time dd if=/mnt/f2fs/128m bs=4k count=32768)

    Extent Hit Ratio:
                   before          patched
    Hit Ratio      121 / 1071      1071 / 1071

    Performance:
                   before          patched
    real           0m37.051s       0m35.556s
    user           0m0.040s        0m0.026s
    sys            0m2.990s        0m2.251s

    Memory Cost:
                   before          patched
    Tree Count:    0               1 (size: 24 bytes)
    Node Count:    0               45 (size: 1440 bytes)

    v3:
    o retest and give more details of the test results.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Move ext_lock out of struct extent_info, then in the following patches we can
    use variables with struct extent_info type as a parameter to pass pure data.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
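
To make the idea of a multi-extent cache concrete, the sketch below maps a file block offset to an on-disk block address by searching a set of non-overlapping extents. It deliberately swaps the rb-tree for a sorted array with binary search, which has the same lookup behaviour but is shorter to show. The struct extent_sketch layout and the function lookup_extent are hypothetical and are not the f2fs data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a run of contiguous blocks. */
struct extent_sketch {
    uint32_t fofs;  /* first file block offset covered by the extent */
    uint32_t blk;   /* on-disk block address of that first block */
    uint32_t len;   /* number of contiguous blocks */
};

/*
 * Look up `pgofs` in an array of extents sorted by fofs and not overlapping.
 * On a hit, store the block address in *blkaddr and return 1.
 * On a miss, return 0 (the caller would fall back to reading the dnode page).
 */
static int lookup_extent(const struct extent_sketch *ext, size_t n,
                         uint32_t pgofs, uint32_t *blkaddr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pgofs < ext[mid].fofs) {
            hi = mid;
        } else if (pgofs >= ext[mid].fofs + ext[mid].len) {
            lo = mid + 1;
        } else {
            *blkaddr = ext[mid].blk + (pgofs - ext[mid].fofs);
            return 1;
        }
    }
    return 0;
}
```

With a single cached extent, any access outside that one run is a miss. Keeping several extents per inode is what lets the fragmented file in the test above reach a hit on every lookup.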
     

10 Jan, 2015

2 commits

  • We use kzalloc to allocate memory in __recover_inline_status, and use this
    all-zero memory to check the inline data content of the inode page by
    comparing them. This is inefficient and unnecessary, so let's check the
    inline data content directly.

    Signed-off-by: Chao Yu
    [Jaegeuk Kim: make the code more neat]
    Signed-off-by: Jaegeuk Kim

    Chao Yu
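
The change described above amounts to scanning the region in place instead of allocating a zeroed buffer and comparing against it. A minimal userspace sketch of such a direct check follows. The function name region_has_data is hypothetical, and the byte-wise loop is an assumption made for clarity rather than a copy of the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if any byte in [p, p + len) is non-zero.
 * No scratch allocation is needed, and the scan stops at the first
 * non-zero byte instead of always comparing the whole region.
 */
static bool region_has_data(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0)
            return true;
    }
    return false;
}
```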
     
  • This patch adds two new ioctls to release in-memory pages grabbed by atomic
    writes.
    o f2fs_ioc_abort_volatile_write
      - If the transaction failed, all the grabbed pages and data should be written.
    o f2fs_ioc_release_volatile_write
      - This is to enhance the performance of PERSIST mode in sqlite.

    In order to avoid huge memory consumption which causes OOM, this patch changes
    volatile writes to use normal dirty pages, whose flushing to the disk is
    blocked as long as the system does not suffer from memory pressure.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

09 Dec, 2014

1 commit


05 Nov, 2014

1 commit

  • This patch simplifies the inline_data usage with the following rules.
    1. inline_data is set during the file creation.
    2. If a write is requested beyond the range that inline_data can hold,
       f2fs converts that inode permanently.
    3. No case converts a non-inline_data inode to inline_data.
    4. The inline_data flag should be changed under the inode page lock.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

04 Nov, 2014

3 commits


08 Oct, 2014

1 commit

  • This patch adds support for volatile writes which keep data pages in memory
    until f2fs_evict_inode is called by iput.

    For instance, we can use this feature for the sqlite database as follows.
    While supporting atomic writes for main database file, we can keep its journal
    data temporarily in the page cache by the following sequence.

    1. open
       -> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
    2. writes
       : keep all the data in the page cache.
    3. flush to the database file with atomic writes
       a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
       b. writes
       c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
    4. close
       -> drop the cached data

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

07 Oct, 2014

1 commit

  • This patch introduces a very limited functionality for atomic write support.
    In order to support atomic write, this patch adds two ioctls:
    o F2FS_IOC_START_ATOMIC_WRITE
    o F2FS_IOC_COMMIT_ATOMIC_WRITE

    The database engine should be aware of the following sequence.
    1. open
       -> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
    2. writes
       : all the written data will be treated as atomic pages.
    3. commit
       -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
       : this flushes all the data blocks to the disk, which the f2fs recovery
         procedure will expose as all or nothing.
    4. repeat from #2.

    The IO patterns should be:

       ,- START_ATOMIC_WRITE             ,- COMMIT_ATOMIC_WRITE
    CP | D D D D D D | FSYNC | D D D D | FSYNC ...
                       `- COMMIT_ATOMIC_WRITE

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
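
The sequence above can be sketched from the application side as follows. This is a minimal illustration, not a reference implementation: the ioctl numbers are defined locally on the assumption that they match the kernel's definitions (magic 0xf5, commands 1 and 2), so prefer the kernel's own header where one is available. The helper name atomic_update_sketch is hypothetical, and error handling is reduced to returning -1.

```c
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed values; check them against the kernel headers before relying on them. */
#ifndef F2FS_IOCTL_MAGIC
#define F2FS_IOCTL_MAGIC 0xf5
#endif
#ifndef F2FS_IOC_START_ATOMIC_WRITE
#define F2FS_IOC_START_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 1)
#endif
#ifndef F2FS_IOC_COMMIT_ATOMIC_WRITE
#define F2FS_IOC_COMMIT_ATOMIC_WRITE _IO(F2FS_IOCTL_MAGIC, 2)
#endif

/*
 * Write `len` bytes at offset `off` as one atomic update on an already
 * open file descriptor. Returns 0 on success and -1 on any failure,
 * including file systems that do not understand these ioctls.
 */
static int atomic_update_sketch(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n;

    /* Step 1: mark subsequent writes as part of an atomic transaction. */
    if (ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE) < 0)
        return -1;

    /* Step 2: ordinary writes; the data is treated as atomic pages. */
    n = pwrite(fd, buf, len, off);
    if (n < 0 || (size_t)n != len)
        return -1;  /* nothing committed; cleanup is left to the caller */

    /* Step 3: commit, flushing the data blocks as a unit. */
    if (ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE) < 0)
        return -1;

    return 0;
}
```

A database engine would call such a helper once per transaction and repeat from the write step for the next one, as the sequence describes.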
     

01 Oct, 2014

1 commit


16 Sep, 2014

1 commit


10 Sep, 2014

1 commit


04 Sep, 2014

1 commit


05 Aug, 2014

1 commit

  • When an inode is evicted, all the page cache belonging to this inode should
    be released, including the xattr node page. Previously we didn't do this;
    this patch fixes the issue.

    v2:
    o reposition invalidate_mapping_pages() to the right place suggested by
    Jaegeuk Kim.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

29 Jul, 2014

1 commit

  • This patch introduces an inode number list that tracks inodes with appended
    or updated data writes since the last checkpoint.
    This will be used at fsync to determine whether the recovery information
    should be written or not.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
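
As a rough illustration of the bookkeeping described above, the sketch below keeps a small set of inode numbers that is filled on data writes, consulted at fsync time, and emptied at checkpoint. It uses a fixed-size array for brevity. The container choice, the capacity MAX_TRACKED and all function names are assumptions for illustration, and how f2fs actually uses the answer at fsync is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64  /* hypothetical capacity for the sketch */

/* Set of inode numbers with data writes since the last checkpoint. */
struct ino_set {
    uint32_t ino[MAX_TRACKED];
    size_t count;
};

static bool ino_set_contains(const struct ino_set *s, uint32_t ino)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->ino[i] == ino)
            return true;
    }
    return false;
}

/* Record a write to `ino`; returns false only if the set is full. */
static bool ino_set_add(struct ino_set *s, uint32_t ino)
{
    if (ino_set_contains(s, ino))
        return true;
    if (s->count == MAX_TRACKED)
        return false;
    s->ino[s->count++] = ino;
    return true;
}

/* A checkpoint makes everything durable, so the set starts over. */
static void ino_set_checkpoint(struct ino_set *s)
{
    s->count = 0;
}
```

At fsync, a lookup in such a set tells the caller whether the inode has had data writes since the last checkpoint.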
     

25 Jul, 2014

1 commit

  • Andrey Tsyvarev reported:
    "Using memory error detector reveals the following use-after-free error
    in 3.15.0:

    AddressSanitizer: heap-use-after-free in f2fs_evict_inode
    Read of size 8 by thread T22279:
    [] f2fs_evict_inode+0x102/0x2e0 [f2fs]
    [] evict+0x15f/0x290
    [< inlined >] iput+0x196/0x280 iput_final
    [] iput+0x196/0x280
    [] f2fs_put_super+0xd6/0x170 [f2fs]
    [] generic_shutdown_super+0xc5/0x1b0
    [] kill_block_super+0x4d/0xb0
    [] deactivate_locked_super+0x66/0x80
    [] deactivate_super+0x68/0x80
    [] mntput_no_expire+0x198/0x250
    [< inlined >] SyS_umount+0xe9/0x1a0 SYSC_umount
    [] SyS_umount+0xe9/0x1a0
    [] system_call_fastpath+0x16/0x1b

    Freed by thread T3:
    [] f2fs_i_callback+0x27/0x30 [f2fs]
    [< inlined >] rcu_process_callbacks+0x2d6/0x930 __rcu_reclaim
    [< inlined >] rcu_process_callbacks+0x2d6/0x930 rcu_do_batch
    [< inlined >] rcu_process_callbacks+0x2d6/0x930 invoke_rcu_callbacks
    [< inlined >] rcu_process_callbacks+0x2d6/0x930 __rcu_process_callbacks
    [] rcu_process_callbacks+0x2d6/0x930
    [] __do_softirq+0x142/0x380
    [] run_ksoftirqd+0x30/0x50
    [] smpboot_thread_fn+0x197/0x280
    [] kthread+0x148/0x160
    [] ret_from_fork+0x7c/0xb0

    Allocated by thread T22276:
    [] f2fs_alloc_inode+0x2d/0x170 [f2fs]
    [] iget_locked+0x10a/0x230
    [] f2fs_iget+0x35/0xa80 [f2fs]
    [] f2fs_fill_super+0xb53/0xff0 [f2fs]
    [] mount_bdev+0x1de/0x240
    [] f2fs_mount+0x10/0x20 [f2fs]
    [] mount_fs+0x55/0x220
    [] vfs_kern_mount+0x66/0x200
    [< inlined >] do_mount+0x2b4/0x1120 do_new_mount
    [] do_mount+0x2b4/0x1120
    [< inlined >] SyS_mount+0xb2/0x110 SYSC_mount
    [] SyS_mount+0xb2/0x110
    [] system_call_fastpath+0x16/0x1b

    The buggy address ffff8800587866c8 is located 48 bytes inside
    of 680-byte region [ffff880058786698, ffff880058786940)

    Memory state around the buggy address:
    ffff880058786100: ffffffff ffffffff ffffffff ffffffff
    ffff880058786200: ffffffff ffffffff ffffffrr rrrrrrrr
    ffff880058786300: rrrrrrrr rrffffff ffffffff ffffffff
    ffff880058786400: ffffffff ffffffff ffffffff ffffffff
    ffff880058786500: ffffffff ffffffff ffffffff fffffffr
    >ffff880058786600: rrrrrrrr rrrrrrrr rrrfffff ffffffff
    ^
    ffff880058786700: ffffffff ffffffff ffffffff ffffffff
    ffff880058786800: ffffffff ffffffff ffffffff ffffffff
    ffff880058786900: ffffffff rrrrrrrr rrrrrrrr rrrr....
    ffff880058786a00: ........ ........ ........ ........
    ffff880058786b00: ........ ........ ........ ........
    Legend:
    f - 8 freed bytes
    r - 8 redzone bytes
    . - 8 allocated bytes
    x=1..7 - x allocated bytes + (8-x) redzone bytes

    Investigation shows, that f2fs_evict_inode, when called for
    'meta_inode', uses invalidate_mapping_pages() for 'node_inode'.
    But 'node_inode' is deleted before 'meta_inode' in f2fs_put_super via
    iput().

    It seems that in common usage scenario this use-after-free is benign,
    because 'node_inode' remains partially valid data even after
    kmem_cache_free().
    But things may change if, while 'meta_inode' is evicted in one f2fs
    filesystem, another (mounted) f2fs filesystem requests inode from cache,
    and formerly 'node_inode' of the first filesystem is returned."

    Nids for both meta_inode and node_inode are reserved, so it's not necessary
    for us to invalidate pages which will never be allocated.
    To fix this issue, let's skip needlessly invalidating pages for
    {meta,node}_inode in f2fs_evict_inode.

    Reported-by: Andrey Tsyvarev
    Tested-by: Andrey Tsyvarev
    Signed-off-by: Gu Zheng
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu