22 Dec, 2020

3 commits

  • Add a new "compress_mode" mount option to control the file compression
    mode. It supports "fs" and "user". In "fs" mode (the default), f2fs
    does automatic compression on compression-enabled files. In "user"
    mode, f2fs disables automatic compression and gives the user
    discretion over choosing the target file and the timing, i.e. the
    user can do manual compression/decompression on compression-enabled
    files using ioctls.
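
    A minimal userspace sketch of the "user" flow, for a file on an f2fs
    partition mounted with -o compress_mode=user; the ioctl magic and
    command numbers below are assumptions based on this patch series, so
    verify them against your kernel's f2fs headers before use:

    /* compress_file.c - manually compress one compression-enabled file */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define F2FS_IOCTL_MAGIC         0xf5                      /* assumed */
    #define F2FS_IOC_DECOMPRESS_FILE _IO(F2FS_IOCTL_MAGIC, 23) /* assumed */
    #define F2FS_IOC_COMPRESS_FILE   _IO(F2FS_IOCTL_MAGIC, 24) /* assumed */

    int main(int argc, char **argv)
    {
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* the user, not f2fs, picks the target file and the timing */
        if (ioctl(fd, F2FS_IOC_COMPRESS_FILE) < 0)
            perror("F2FS_IOC_COMPRESS_FILE");
        close(fd);
        return 0;
    }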

    Signed-off-by: Daeho Jeong
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Daeho Jeong
     
  • section is dirty, but dirty_secmap may not be set

    Reported-by: Jia Yang
    Fixes: da52f8ade40b ("f2fs: get the right gc victim section when section has several segments")
    Cc:
    Signed-off-by: Jack Qiu
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jack Qiu
     
  • Lei Li reported an issue: if foreground operations are frequent, the
    background checkpoint may always be skipped due to the check below,
    resulting in losing more data after a sudden power-cut.

    f2fs_balance_fs_bg()
    ...
        if (!is_idle(sbi, REQ_TIME) &&
            (!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
            return;

    E.g:
    cp_interval = 5 second
    idle_interval = 2 second
    foreground operation interval = 1 second (append 1 byte per second into file)

    In such a case, no matter when f2fs_balance_fs_bg() is called,
    is_idle(, REQ_TIME) returns false, resulting in skipping the
    background checkpoint.

    This patch changes the trigger condition as below to make it more
    reasonable (a hedged sketch follows this list):
    - trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceed the
    threshold;
    - skip triggering sync_fs() if there is any background inflight IO, or
    if there was a foreground operation recently while cp_rwsem is being
    held by someone.
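
    A hedged sketch of the reworked f2fs_balance_fs_bg() tail;
    is_inflight_io() and f2fs_time_over() follow the description above and
    the helpers this patch introduces, but the exact final form may differ:

    /* background inflight IO, or a recent foreground op while someone
     * holds cp_rwsem: don't trigger sync_fs() */
    if (is_inflight_io(sbi, REQ_TIME) ||
        (!f2fs_time_over(sbi, REQ_TIME) &&
         rwsem_is_contended(&sbi->cp_rwsem)))
        return;

    /* dirty nats/nodes or prefree segments exceed the threshold, or the
     * periodic checkpoint timeout expired: trigger sync_fs() */
    if (excess_dirty_nats(sbi) || excess_dirty_nodes(sbi) ||
        excess_prefree_segs(sbi) || f2fs_time_over(sbi, CP_TIME))
        f2fs_sync_fs(sbi->sb, true);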

    Reported-by: Lei Li
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

14 Oct, 2020

2 commits

  • This patch changes f2fs_flush_device_cache() to skip issuing flush for
    nobarrier case.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • First problem is we hit BUG_ON() in f2fs_get_sum_page given EIO on
    f2fs_get_meta_page_nofail().

    The quick fix was not to return any error and loop infinitely, but
    syzbot caught a case where it enters that loop from a fuzzed image. It
    turned out we abused f2fs_get_meta_page_nofail() as in the call stack
    below.

    - f2fs_fill_super
     - f2fs_build_segment_manager
      - build_sit_entries
       - get_current_sit_page

    INFO: task syz-executor178:6870 can't die for more than 143 seconds.
    task:syz-executor178 state:R
    stack:26960 pid: 6870 ppid: 6869 flags:0x00004006
    Call Trace:

    Showing all locks held in the system:
    1 lock held by khungtaskd/1179:
    #0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
    1 lock held by systemd-journal/3920:
    1 lock held by in:imklog/6769:
    #0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
    1 lock held by syz-executor178/6870:
    #0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229

    Actually, we didn't have to use _nofail in this case, since we could
    already return the error to mount(2) through the error handler.

    As a result, this patch tries to 1) remove _nofail callers as much as
    possible, and 2) deal with the error case in the last remaining
    caller, f2fs_get_sum_page().
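
    A hedged sketch of the resulting error handling at a former _nofail
    call site: the page lookup now returns an ERR_PTR that propagates back
    to mount(2) instead of looping or hitting BUG_ON():

    page = f2fs_get_meta_page(sbi, GET_SUM_BLOCK(sbi, segno));
    if (IS_ERR(page))
        return PTR_ERR(page);   /* e.g. -EIO reaches the mount error path */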

    Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

15 Sep, 2020

1 commit

  • After commit 0b6d4ca04a86 ("f2fs: don't return vmalloc() memory from
    f2fs_kmalloc()"), f2fs_k{m,z}alloc() will never return vmalloc()'ed
    memory, so clean up to use kfree() instead of kvfree() to free the
    memory allocated by them.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

12 Sep, 2020

1 commit

  • There are several issues in the current background GC algorithm:
    - the valid block count is one of the key factors in the cost overhead
    calculation, so even if a segment's age is young or it is located in a
    hot area, the CB algorithm will still choose a segment with few valid
    blocks as the victim, which is not appropriate;
    - GCed data/node blocks go to the existing logs no matter whether the
    contained data's update frequency is the same or not, so they may mix
    hot and cold data again;
    - the GC allocator mainly uses LFS-type segments, so it consumes free
    segments more quickly.

    This patch introduces a new algorithm named age-threshold-based
    garbage collection to solve the above issues. There are three main
    steps (a standalone sketch of the selection logic follows the steps):

    1. select a source victim:
    - set an age threshold and select candidates based on that threshold:
    e.g. with 0 meaning youngest and 100 meaning oldest, an age threshold
    of 80 selects the dirty segments whose age is in the range [80, 100]
    as candidates;
    - set a candidate_ratio threshold and select candidates based on the
    ratio, so that we can shrink the candidates to the oldest segments;
    - select the segment with the fewest valid blocks as the source, in
    order to migrate blocks with minimum cost;

    2. select a target victim:
    - select candidates based on the age threshold;
    - set a candidate_radius threshold and search for candidates whose age
    is around the source victim's; the searching radius should be less
    than the radius threshold;
    - select the segment with the most valid blocks as the target, in
    order to avoid the target segment itself being migrated soon.

    3. merge valid blocks from the source victim into the target victim
    with the SSR allocator.
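
    As referenced above, a self-contained userspace sketch of the step 1
    source-victim selection (age threshold, candidate_ratio, then fewest
    valid blocks); all values and names are illustrative, not the
    kernel's:

    #include <stdio.h>
    #include <stdlib.h>

    struct seg { unsigned age; unsigned valid_blocks; };

    static int by_age_desc(const void *a, const void *b)
    {
        const struct seg *x = a, *y = b;
        return (int)y->age - (int)x->age;   /* oldest first */
    }

    int main(void)
    {
        struct seg dirty[] = {
            { 95, 300 }, { 88, 40 }, { 82, 500 }, { 60, 10 }, { 91, 128 },
        };
        int n = sizeof(dirty) / sizeof(dirty[0]), cand = 0;
        unsigned age_threshold = 80, candidate_ratio = 50;
        struct seg *victim;

        /* candidates: segments with age in [age_threshold, 100] */
        qsort(dirty, n, sizeof(dirty[0]), by_age_desc);
        while (cand < n && dirty[cand].age >= age_threshold)
            cand++;
        /* shrink to the oldest candidate_ratio% of the candidates */
        cand = cand * candidate_ratio / 100;
        if (cand == 0)
            cand = 1;
        /* fewest valid blocks == minimum migration cost */
        victim = &dirty[0];
        for (int i = 1; i < cand; i++)
            if (dirty[i].valid_blocks < victim->valid_blocks)
                victim = &dirty[i];
        printf("source victim: age=%u, valid_blocks=%u\n",
               victim->age, victim->valid_blocks);
        return 0;
    }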

    Test steps:
    - create 160 dirty segments:
    * half of them have 128 valid blocks per segment
    * the rest have 384 valid blocks per segment
    - run background GC

    Benefit: both the GC count and the block movement count decrease
    noticeably:

    - Before:
    - Valid: 86
    - Dirty: 1
    - Prefree: 11
    - Free: 6001 (6001)

    GC calls: 162 (BG: 220)
    - data segments : 160 (160)
    - node segments : 2 (2)
    Try to move 41454 blocks (BG: 41454)
    - data blocks : 40960 (40960)
    - node blocks : 494 (494)

    IPU: 0 blocks
    SSR: 0 blocks in 0 segments
    LFS: 41364 blocks in 81 segments

    - After:

    - Valid: 87
    - Dirty: 0
    - Prefree: 4
    - Free: 6008 (6008)

    GC calls: 75 (BG: 76)
    - data segments : 74 (74)
    - node segments : 1 (1)
    Try to move 12813 blocks (BG: 12813)
    - data blocks : 12544 (12544)
    - node blocks : 269 (269)

    IPU: 0 blocks
    SSR: 12032 blocks in 77 segments
    LFS: 855 blocks in 2 segments

    Signed-off-by: Chao Yu
    [Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

11 Sep, 2020

6 commits

  • Then we can add a specified entry into the rb-tree with its 64-bit
    segment time as the key.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Don't let f2fs inner GC ruin the original aging degree of segments.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Previously, once we updated one block in a segment, we updated the
    segment's mtime to the latest time, making an aged segment become the
    freshest and causing GC with the cost-benefit algorithm to miss such
    segments. So this patch changes to record mtime as the average block
    update time instead of the last update time.

    There is no need to reset mtime for a prefree segment: since
    se->valid_blocks is zero, the old se->mtime won't carry any weight in
    the calculation below:

    se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
                        se->valid_blocks + 1);
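
    A standalone illustration of the weighted average (and of why prefree
    segments need no reset); the helper is a userspace rewrite of the
    kernel expression above, not the kernel function itself:

    #include <stdio.h>

    /* each already-counted valid block carries the old mtime; the newly
     * written block carries "now" */
    static unsigned long long avg_mtime(unsigned long long mtime,
                                        unsigned long long valid_blocks,
                                        unsigned long long now)
    {
        return (mtime * valid_blocks + now) / (valid_blocks + 1);
    }

    int main(void)
    {
        /* a segment averaged at t=100 with 3 valid blocks absorbs a block
         * written at t=200: (100 * 3 + 200) / 4 = 125 */
        printf("%llu\n", avg_mtime(100, 3, 200));
        /* prefree: valid_blocks == 0, the old mtime carries no weight */
        printf("%llu\n", avg_mtime(100, 0, 200));   /* prints 200 */
        return 0;
    }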

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • The previous implementation of aligned pinfile allocation would:
    - allocate a new segment on the cold data log no matter whether the
    last used segment is partially used or not, which makes IOs more
    random;
    - force concurrent cold data/GCed IO into the warm data area, which
    can have a bad effect on hot/cold data separation.

    In this patch, we introduce a new type of log named 'inmem curseg';
    the differences from a normal curseg are:
    - it reuses existing segment types (CURSEG_XXX_NODE/DATA);
    - it only exists in memory; its segno, blkofs and summary will not be
    persisted into the checkpoint area.

    With this new feature, we can enhance the scalability of logs, and
    special allocators can be created for specific purposes:
    - a pure LFS allocator for aligned pinfile allocation or file
    defragmentation
    - a pure SSR allocator for a later feature

    So let's update aligned pinfile allocation to use this new inmem
    curseg framework (a hedged sketch follows).
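
    A hedged sketch of how the in-memory log types can sit next to the
    persisted ones; the PINNED/ATGC names match what f2fs later uses, but
    treat the exact layout as illustrative:

    enum {
        CURSEG_HOT_DATA = 0,
        CURSEG_WARM_DATA,
        CURSEG_COLD_DATA,
        CURSEG_HOT_NODE,
        CURSEG_WARM_NODE,
        CURSEG_COLD_NODE,
        NR_PERSISTENT_LOG,      /* the above go into the checkpoint area */
        /* inmem cursegs: segno, blkofs, summary stay in memory only */
        CURSEG_COLD_DATA_PINNED = NR_PERSISTENT_LOG, /* pure LFS, pinfile */
        CURSEG_ALL_DATA_ATGC,                        /* pure SSR, later */
        NR_CURSEG_TYPE,
    };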

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been
    converted to unsigned long type, we don't need to do type casting
    again.

    Signed-off-by: Xiaojun Wang
    Reported-by: Jack Qiu
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Xiaojun Wang
     
  • NVMe Zoned Namespace devices can have a zone-capacity less than the
    zone-size. Zone-capacity indicates the maximum number of sectors that
    are usable in a zone, beginning from the first sector of the zone.
    This makes the sectors after the zone-capacity up to the zone-size
    unusable. This patch set tracks zone-size and zone-capacity in zoned
    devices and calculates the usable blocks per segment and the usable
    segments per section.

    If zone-capacity is less than zone-size, mark only those segments
    which start before zone-capacity as free segments. All segments at and
    beyond zone-capacity are treated as permanently used segments. In
    cases where zone-capacity does not align with the segment size, the
    last segment will start before zone-capacity and end beyond it. For
    such spanning segments, only sectors within the zone-capacity are
    used.

    During writes and GC, manage the usable segments in a section and the
    usable blocks per segment. Segments which are beyond zone-capacity are
    never allocated and do not need to be garbage collected; only the
    segments before zone-capacity need to be garbage collected. For
    spanning segments, write to blocks only up to zone-capacity, based on
    the number of usable blocks in that segment.

    Zone-capacity is device specific and cannot be configured by the user.
    Since NVMe ZNS device zones are sequential-write-only, a block device
    with conventional zones or any normal block device is needed alongside
    the ZNS device for the metadata operations of f2fs.

    A typical nvme-cli output of a zoned device shows zone start and capacity
    and write pointer as below:

    SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
    SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
    SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ

    Here the zone size is 64MB and the capacity is 49MB; WP is at the zone
    start as the zones are in the EMPTY state. For each zone, only zone
    start + 49MB is usable area; any lba/sector after 49MB cannot be read
    or written to, and the drive will fail any attempt to do so. So the
    second zone starts at 64MB and is usable till 113MB (64 + 49), the
    range between 113MB and 128MB is again unusable, the next zone starts
    at 128MB, and so on.
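
    A standalone arithmetic check of the example above, assuming f2fs's
    default 2MB segment (4096 sectors of 512B); values are taken from the
    nvme-cli output:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long zone_size = 0x20000; /* 64MB in 512B sectors */
        unsigned long long zone_cap  = 0x18800; /* 49MB in 512B sectors */
        unsigned long long seg_secs  = 4096;    /* 2MB f2fs segment */

        /* segments fully below zone-capacity stay free/allocatable */
        printf("usable segments per zone: %llu\n", zone_cap / seg_secs);
        /* a spanning segment only uses its sectors below zone-capacity */
        printf("usable sectors in spanning segment: %llu\n",
               zone_cap % seg_secs);
        /* everything from zone-capacity to zone-size is unusable */
        printf("unusable sectors per zone: %llu\n", zone_size - zone_cap);
        return 0;
    }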

    Signed-off-by: Aravind Ramesh
    Signed-off-by: Damien Le Moal
    Signed-off-by: Niklas Cassel
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Aravind Ramesh
     

09 Sep, 2020

1 commit

  • Commit da52f8ade40b ("f2fs: get the right gc victim section when
    section has several segments") added code to count the blocks of each
    section using variables of type 'unsigned short', which is 2 bytes on
    many systems. However, the counts can be larger than the 2-byte range,
    and the type conversion results in wrong values. Especially when an
    f2fs section has as many blocks as USHRT_MAX + 1, the count is handled
    as 0. This triggers an eternal loop in init_dirty_segmap() at mount
    time. Fix this by changing the type of the variables to block_t.
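
    A short standalone demonstration of the truncation that was fixed:

    #include <stdio.h>

    int main(void)
    {
        /* e.g. a section of 128 segments * 512 blocks = USHRT_MAX + 1 */
        unsigned int blocks = 65536;
        unsigned short count = (unsigned short)blocks;  /* wraps to 0 */

        printf("%u blocks counted as %hu\n", blocks, count);
        return 0;   /* a zero count made init_dirty_segmap() loop forever */
    }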

    Fixes: da52f8ade40b ("f2fs: get the right gc victim section when section has several segments")
    Signed-off-by: Shin'ichiro Kawasaki
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Shin'ichiro Kawasaki
     

08 Jul, 2020

6 commits

  • Added a new gc_urgent mode, GC_URGENT_LOW, in which F2FS lowers the
    bar for idle checking in order to process outstanding discard commands
    and do GC a bit more aggressively.

    Signed-off-by: Daeho Jeong
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Daeho Jeong
     
  • Split segment allocation into two independent functions:
    - f2fs_allocate_new_segment() for allocating a segment of a specified
    type
    - f2fs_allocate_new_segments() for allocating all data-type segments

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • When f2fs_ioc_gc_range performs GC on multiple segments, the return
    value of f2fs_ioc_gc_range is determined by the GC of the last
    segment. If that fails, f2fs_ioc_gc_range is considered to have failed
    even though some of the previous segments' GC succeeded. Therefore, we
    fix this by redefining the return value of the get-victim ops and
    adding exception handling to f2fs_gc. In particular: 1) if the target
    has no valid blocks, it will go on; 2) if the target section has valid
    block(s) but is the current section, we will remind the caller.

    Signed-off-by: Qilong Zhang
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Qilong Zhang
     
  • Use the validity of @fio to indicate whether the caller wants to
    serialize IOs in io.io_list; @add_list then becomes redundant, so
    remove it.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • - To avoid a race between checkpoint and quota file writeback, we only
    need to hold the read lock of node_write in the writeback path.
    - The node_write lock currently covers all LFS data write paths, which
    is not necessary; we only need to hold node_write on the write path of
    a quota file.

    This refactors commit ca7f76e68074 ("f2fs: fix wrong discard space").

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • To avoid polluting the global symbol namespace.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

19 Jun, 2020

1 commit

  • Assume each section has 4 segments:
    .___________________________.
    |_Segment0_|_..._|_Segment3_|
    . .
    . .
    .__________.
    |_section0_|

    Segments 0~2 have 0 valid blocks; segment 3 has 512 valid blocks.
    GC will fail in this scenario if we want to GC section0, because
    none of the 4 segments in section0 is dirty. So we should use a dirty
    section bitmap instead of the dirty segment bitmap to get the right
    victim section (a standalone illustration follows).
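
    The section below is a perfectly good victim (512 of 2048 blocks
    valid), yet no per-segment dirty bit is set, so a victim search over
    the segment bitmap misses it:

    #include <stdio.h>

    int main(void)
    {
        unsigned valid[4] = { 0, 0, 0, 512 };   /* 512-block segments */
        unsigned sec_valid = 0;
        int seg_dirty = 0;

        for (int i = 0; i < 4; i++) {
            sec_valid += valid[i];
            /* a segment is dirty only when partially valid */
            if (valid[i] > 0 && valid[i] < 512)
                seg_dirty = 1;
        }
        printf("section: %u/2048 valid, dirty segment bits: %s\n",
               sec_valid, seg_dirty ? "set" : "none");
        return 0;
    }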

    Signed-off-by: Jack Qiu
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jack Qiu
     

30 May, 2020

1 commit

  • Under heavy fsstress, we may trigger a panic while issuing discard,
    because __check_sit_bitmap() detects that a discard command may erase
    valid data blocks. The root cause is the race described in the stack
    below: since we removed the lock when flushing quota data, quota data
    writeback may race with write_checkpoint(), causing inconsistency
    between the cached discard entry and the segment bitmap.

    - f2fs_write_checkpoint
     - block_operations
      - set_sbi_flag(sbi, SBI_QUOTA_SKIP_FLUSH)
     - f2fs_flush_sit_entries
      - add_discard_addrs
       - __set_bit_le(i, (void *)de->discard_map);
                            - f2fs_write_data_pages
                             - f2fs_write_single_data_page
                               : inode is quota one, cp_rwsem won't be locked
                              - f2fs_do_write_data_page
                               - f2fs_allocate_data_block
                                - f2fs_wait_discard_bio
                                  : discard entry has not been added yet.
                                - update_sit_entry
     - f2fs_clear_prefree_segments
      - f2fs_issue_discard
        : add discard entry

    In order to fix this, this patch uses node_write to serialize
    f2fs_allocate_data_block and checkpoint.
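
    A hedged sketch of the serialization; allocate_block_for_quota() is a
    hypothetical stand-in for the allocation path, not an f2fs function:

    /* writeback side: take node_write shared around block allocation */
    down_read(&sbi->node_write);
    allocate_block_for_quota(sbi);      /* updates SIT under the lock */
    up_read(&sbi->node_write);

    /* checkpoint side: block_operations() takes node_write exclusively,
     * so discard entries and the segment bitmap can no longer diverge */
    down_write(&sbi->node_write);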

    Fixes: 435cbab95e39 ("f2fs: fix quota_sync failure due to f2fs_lock_op")
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

18 Apr, 2020

3 commits

  • When a discard_cmd needs to be split due to dpolicy->max_requests, the
    remaining length will either be merged into another cmd or a new
    discard_cmd will be created. In this case, dcc->undiscard_blks is
    double-accounted for the remaining len, which makes the stats show an
    incorrect value.

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Sahitya Tummala
     
  • In case a discard_cmd is split into several bios, the dc->error
    must not be overwritten once an error is reported by a bio. Also,
    move it under dc->lock.

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Sahitya Tummala
     
  • F2FS already has a default timeout of 5 secs for discards that
    can be issued during umount, but it can take more than the 5 sec
    timeout if the underlying UFS device queue is already full and there
    are no more available free tags to be used. Fix this by submitting a
    small batch of discard requests so that it won't cause the device
    queue to be full at any time and thus doesn't incur its wait time
    in the umount context.

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Sahitya Tummala
     

31 Mar, 2020

1 commit

  • Data flush can generate heavy IO and cause long latency during the
    flush, so it's not appropriate to trigger it in a foreground
    operation.

    And also, we may face below potential deadlock during data flush:
    - f2fs_write_multi_pages
     - f2fs_write_raw_pages
      - f2fs_write_single_data_page
       - f2fs_balance_fs
        - f2fs_balance_fs_bg
         - f2fs_sync_dirty_inodes
          - filemap_fdatawrite  -- stuck on flushing the same cluster

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

20 Mar, 2020

4 commits

  • In order to avoid polluting the global slab cache namespace.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • As Geert Uytterhoeven reported regarding the parameter HZ/50 in:

    congestion_wait(BLK_RW_ASYNC, HZ/50);

    on some platforms, HZ can be less than 50; an unexpected timeout of 0
    jiffies will then be set in congestion_wait().

    This patch introduces a macro, DEFAULT_IO_TIMEOUT, wrapping the
    determinate value msecs_to_jiffies(20), to replace HZ/50 and avoid
    this issue.

    Quoted from Geert Uytterhoeven:

    "A timeout of HZ means 1 second.
    HZ/50 means 20 ms, but has the risk of being zero, if HZ < 50.

    If you want to use a timeout of 20 ms, you best use msecs_to_jiffies(20),
    as that takes care of the special cases, and never returns 0."
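
    A standalone demonstration of the integer-division pitfall:

    #include <stdio.h>

    int main(void)
    {
        /* congestion_wait(BLK_RW_ASYNC, HZ/50) truncates to 0 if HZ < 50 */
        int hz_values[] = { 1000, 250, 100, 32 };

        for (int i = 0; i < 4; i++)
            printf("HZ=%4d -> HZ/50 = %d jiffies\n",
                   hz_values[i], hz_values[i] / 50);
        /* msecs_to_jiffies(20) rounds up instead and never returns 0 */
        return 0;
    }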

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This patch removes the F2FS_MOUNT_ADAPTIVE and F2FS_MOUNT_LFS mount
    options and adds F2FS_OPTION.fs_mode with the two statuses below to
    indicate the filesystem mode:

    enum {
        FS_MODE_ADAPTIVE,   /* use both lfs/ssr allocation */
        FS_MODE_LFS,        /* use lfs allocation only */
    };

    It can enhance code readability and the fs mode's scalability.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Let's show mounted time.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

18 Jan, 2020

3 commits

  • A mutex lock won't serialize callers fairly; in order to avoid
    starving an unlucky caller, let's use an rwsem lock instead.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Setting 0x40 in /sys/fs/f2fs/dev/ipu_policy gives a way to turn off
    the bio cache, which is useful to check whether the block layer using
    a hardware encryption engine merges IOs correctly.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch tries to support compression in f2fs.

    - A new term named cluster is defined as the basic unit of
    compression; a file can be divided into multiple clusters logically.
    One cluster includes 4 << n (n >= 0) logical pages, the compression
    size is also the cluster size, and each cluster can be compressed or
    not.

    - In the cluster metadata layout, one special flag is used to indicate
    whether a cluster is a compressed or a normal one; for a compressed
    cluster, the following metadata maps the cluster to [1, 4 << n - 1]
    physical blocks, in which f2fs stores data including the compress
    header and the compressed data.

    - In order to eliminate write amplification during overwrite, F2FS
    only supports compression on write-once files: data can be compressed
    only when all logical blocks in the file are valid and the cluster
    compress ratio is lower than a specified threshold.

    - To enable compression on regular inode, there are three ways:
    * chattr +c file
    * chattr +c dir; touch dir/file
    * mount w/ -o compress_extension=ext; touch file.ext

    Compress metadata layout:
    [Dnode Structure]
    +-----------------------------------------------+
    | cluster 1 | cluster 2 | ......... | cluster N |
    +-----------------------------------------------+
    . . . .
    . . . .
    . Compressed Cluster . . Normal Cluster .
    +----------+---------+---------+---------+ +---------+---------+---------+---------+
    |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
    +----------+---------+---------+---------+ +---------+---------+---------+---------+
    . .
    . .
    . .
    +-------------+-------------+----------+----------------------------+
    | data length | data chksum | reserved | compressed data |
    +-------------+-------------+----------+----------------------------+
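
    A hedged sketch of the on-disk header implied by the last row of the
    layout above; the field names mirror f2fs's struct compress_data, but
    verify against the actual source before relying on the layout:

    struct compress_data {
        __le32 clen;            /* data length */
        __le32 chksum;          /* data chksum */
        __le32 reserved[4];     /* reserved */
        u8 cdata[];             /* compressed data */
    };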

    Changelog:

    20190326:
    - fix error handling of read_end_io().
    - remove unneeded comments in f2fs_encrypt_one_page().

    20190327:
    - fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
    - don't jump into loop directly to avoid uninitialized variables.
    - add TODO tag in error path of f2fs_write_cache_pages().

    20190328:
    - fix wrong merge condition in f2fs_read_multi_pages().
    - check compressed file in f2fs_post_read_required().

    20190401
    - allow overwrite on non-compressed cluster.
    - check cluster meta before writing compressed data.

    20190402
    - don't preallocate blocks for compressed file.

    - add lz4 compress algorithm
    - process multiple post-read works in one workqueue;
    f2fs currently processes post-read work in multiple workqueues, which
    shows low performance due to the scheduling overhead of multiple
    workqueues executing in order.

    20190921
    - compress: support buffered overwrite
    C: compress cluster flag
    V: valid block address
    N: NEW_ADDR

    One cluster contains 4 blocks:

    before overwrite    after overwrite

    - VVVV          ->  CVNN
    - CVNN          ->  VVVV

    - CVNN          ->  CVNN
    - CVNN          ->  CVVV

    - CVVV          ->  CVNN
    - CVVV          ->  CVVV

    20191029
    - add kconfig F2FS_FS_COMPRESSION to isolate compression-related
    code, and kconfig F2FS_FS_{LZO,LZ4} to cover the backend algorithms.
    note: will remove the lzo backend if Jaegeuk agrees to that too.
    - update codes according to Eric's comments.

    20191101
    - apply fixes from Jaegeuk

    20191113
    - apply fixes from Jaegeuk
    - split workqueue for fsverity

    20191216
    - apply fixes from Jaegeuk

    20200117
    - fix to avoid NULL pointer dereference

    [Jaegeuk Kim]
    - add tracepoint for f2fs_{,de}compress_pages()
    - fix many bugs and add some compression stats
    - fix overwrite/mmap bugs
    - address 32bit build error, reported by Geert.
    - bug fixes when handling errors and i_compressed_blocks

    Reported-by:
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

16 Jan, 2020

3 commits

  • Remove the duplicate sbi->aw_cnt stats counter that tracks the number
    of atomic files currently open (it also shows an incorrect value
    sometimes). Use the more reliable sbi->atomic_files to show in the
    stats.

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Sahitya Tummala
     
  • To catch f2fs bugs in write pointer handling code for zoned block
    devices, check write pointers of non-open zones that current segments do
    not point to. Do this check at mount time, after the fsync data recovery
    and current segments' write pointer consistency fix. Or when fsync data
    recovery is disabled by mount option, do the check when there is no fsync
    data.

    Check two items, comparing write pointers with the valid block maps in
    SIT. The first item checks zones with no valid blocks: when there are
    no valid blocks in a zone, the write pointer should be at the start of
    the zone; if not, the next write operation to the zone will cause an
    unaligned write error, so reset the write pointer to the zone start.

    The second item checks the write pointer position against the last
    valid block in the zone. It is unexpected for the last valid block
    position to be beyond the write pointer; in such a case, report it as
    a bug. No fix is required for such a zone, because the zone is not
    selected for the next write operation until it gets discarded.

    Signed-off-by: Shin'ichiro Kawasaki
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Shin'ichiro Kawasaki
     
  • On sudden f2fs shutdown, write pointers of zoned block devices can go
    further, but f2fs metadata keeps the current segments at positions
    before those write operations. After remounting f2fs, this
    inconsistency causes write operations away from the write pointers,
    and an "Unaligned write command" error is reported.

    To avoid the error, during the mount operation compare the current
    segments with the write pointers of the open zones they point to. If a
    write pointer position is not aligned with the current segment
    position, assign a new zone to the current segment. Also check that
    the newly assigned zone has its write pointer at the zone start; if
    not, reset the write pointer of the zone.

    Perform the consistency check during fsync recovery. So as not to lose
    fsync data, do the check after the fsync data is restored and before
    the checkpoint commit that flushes data at the current segment
    positions. So as not to conflict with the kworker's dirty data/node
    flush, do the fix within SBI_POR_DOING protection.

    Signed-off-by: Shin'ichiro Kawasaki
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Shin'ichiro Kawasaki
     

01 Dec, 2019

1 commit

  • Pull f2fs updates from Jaegeuk Kim:
    "In this round, we've introduced fairly small number of patches as below.

    Enhancements:
    - improve the in-place-update IO flow
    - allocate segment to guarantee no GC for pinned files

    Bug fixes:
    - fix updatetime in lazytime mode
    - potential memory leak in f2fs_listxattr
    - record parent inode number in rename2 correctly
    - fix deadlock in f2fs_gc along with atomic writes
    - avoid needless data migration in GC"

    * tag 'f2fs-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
    f2fs: stop GC when the victim becomes fully valid
    f2fs: expose main_blkaddr in sysfs
    f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
    f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
    f2fs: show f2fs instance in printk_ratelimited
    f2fs: fix potential overflow
    f2fs: fix to update dir's i_pino during cross_rename
    f2fs: support aligned pinned file
    f2fs: avoid kernel panic on corruption test
    f2fs: fix wrong description in document
    f2fs: cache global IPU bio
    f2fs: fix to avoid memory leakage in f2fs_listxattr
    f2fs: check total_segments from devices in raw_super
    f2fs: update multi-dev metadata in resize_fs
    f2fs: mark recovery flag correctly in read_raw_super_block()
    f2fs: fix to update time in lazytime mode

    Linus Torvalds