09 Oct, 2020

1 commit

  • KoWei reported crash during raid5 reshape:

    [ 1032.252932] Oops: 0002 [#1] SMP PTI
    [...]
    [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
    [...]
    [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
    [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
    [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
    [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
    [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
    [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
    [ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
    [ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
    [ 1032.252962] Call Trace:
    [ 1032.252969] ? async_memcpy+0x179/0x1000 [async_memcpy]
    [ 1032.252977] ? raid5_release_stripe+0x8e/0x110 [raid456]
    [ 1032.252982] handle_stripe_expansion+0x15a/0x1f0 [raid456]
    [ 1032.252988] handle_stripe+0x592/0x1270 [raid456]
    [ 1032.252993] handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
    [ 1032.252999] raid5d+0x35c/0x550 [raid456]
    [ 1032.253002] ? schedule+0x42/0xb0
    [ 1032.253006] ? schedule_timeout+0x10e/0x160
    [ 1032.253011] md_thread+0x97/0x160
    [ 1032.253015] ? wait_woken+0x80/0x80
    [ 1032.253019] kthread+0x104/0x140
    [ 1032.253022] ? md_start_sync+0x60/0x60
    [ 1032.253024] ? kthread_park+0x90/0x90
    [ 1032.253027] ret_from_fork+0x35/0x40

    This is because cache_size_mutex was unlocked too early in
    resize_stripes(), which races with grow_one_stripe():
    grow_one_stripe() can allocate a stripe with the wrong pool_size.

    Fix this issue by unlocking cache_size_mutex after updating pool_size.
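
    A minimal sketch of the resulting locking order in resize_stripes(),
    using only the names from the message (illustrative, not the exact
    diff):

    mutex_lock(&conf->cache_size_mutex);
    /* ... release old stripes, install newly sized ones ... */
    conf->pool_size = newsize;              /* publish the new size first */
    mutex_unlock(&conf->cache_size_mutex);  /* only now may grow_one_stripe() run */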

    Cc: # v4.4+
    Reported-by: KoWei Sung
    Signed-off-by: Song Liu

    Song Liu
     

25 Sep, 2020

11 commits

  • When trying to resize stripe_size, we also need to free the old
    shared page array and allocate a new one.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • When reshaping the array, we try to reuse the shared pages of the old
    stripe_head, and allocate more for the new one if needed.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • In the current implementation, grow_buffers() uses alloc_page() to
    allocate the buffers for each stripe_head, i.e. it allocates a page
    for each dev[i] in the stripe_head.

    After making stripe_size a configurable value via a sysfs entry, this
    means we always allocate 64KB buffers but use only 4KB of them when
    stripe_size is 4KB on arm64 with a 64KB page size.

    To avoid wasting memory, we let multiple sh->dev share one real page.
    That means multiple sh->dev[i].page point to the same page with
    different offsets. An example with 64K PAGE_SIZE and 4K stripe_size:

                  64K PAGE_SIZE
    +---+---+---+---+------------------------------+
    |   |   |   |   |                              |
    |   |   |   |   |                              |
    +-+-+-+-+-+-+-+-+------------------------------+
      ^   ^   ^   ^
      |   |   |   +-----------------------------------+
      |   |   |                                       |
      |   |   +------------------------+              |
      |   |                            |              |
      |   +---------------+            |              |
      |                   |            |              |
      +-------+           |            |              |
              |           |            |              |
        +-----------+------------+------------+-------------+
    sh  | offset(0) | offset(4K) | offset(8K) | offset(12K) |
        +-----------+------------+------------+-------------+
        +----> dev[0].page  dev[1].page  dev[2].page  dev[3].page

    A new 'pages' array is added to stripe_head to record the shared
    pages used by this stripe_head. They are allocated in grow_buffers()
    and freed in shrink_buffers().

    With page sharing, the users of sh->dev[i].page need to take care of
    the related page offset: the page of an issued bio and the page
    passed to the xor computation functions. Thanks to the previously
    added support for different page offsets, we just need to set the
    correct dev[i].offset here.
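
    A sketch of the resulting mapping, assuming the 'pages' array named
    above and a conf->stripe_size field (illustrative, not the exact
    kernel code):

    int cnt = PAGE_SIZE / conf->stripe_size;    /* devs per shared page */

    for (i = 0; i < disks; i++) {
            sh->dev[i].page   = sh->pages[i / cnt];
            sh->dev[i].offset = (i % cnt) * conf->stripe_size;
    }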

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • For now, the asynchronous raid6 recovery computation functions require
    a common offset for all pages. But we expect them to support different
    page offsets after introducing shared stripe pages. Do that by simply
    adding a page offset wherever a page address is referenced. Then,
    replace the old interfaces with the new ones in raid6 and raid6test.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • For now, the syndrome compute functions require a common offset in the
    pages array. However, we expect them to support different offsets when
    we try to use shared pages in the following patches. Simply convert
    them by adding a page offset wherever a page address is referenced.

    Since the only callers of async_gen_syndrome() and async_syndrome_val()
    are in raid6, we don't preserve the old interfaces but modify them
    directly. After that, replace the old interfaces with the new ones in
    raid6 and raid6test.
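
    A sketch of the resulting interface shape, with a per-source offsets
    array alongside the pages (the exact kernel signature may differ
    slightly):

    struct dma_async_tx_descriptor *
    async_gen_syndrome(struct page **blocks, unsigned int *offsets,
                       int disks, size_t len,
                       struct async_submit_ctl *submit);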

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • Replace async_xor() and async_xor_val() with the newly introduced
    interfaces async_xor_offs() and async_xor_val_offs() in raid456.
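
    For reference, a sketch of the new xor interface, carrying one offset
    per source page (treat the exact signature as approximate):

    struct dma_async_tx_descriptor *
    async_xor_offs(struct page *dest, unsigned int offset,
                   struct page **src_list, unsigned int *src_offs,
                   int src_cnt, size_t len,
                   struct async_submit_ctl *submit);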

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • ops_run_biofill() and ops_run_biodrain() call async_copy_data() to
    copy sh->dev[i].page from or to a bio page. For now, this implies
    that the offset of dev[i].page is 0, but we want to support different
    page offsets in the following patches.

    Thus, pass the page offset to these functions and replace
    'page_offset' with 'page_offset + poff'.
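
    For instance, the copy inside async_copy_data() would account for the
    device page offset roughly like this ('poff' as named above; the
    local variable names are illustrative):

    tx = async_memcpy(*page, bio_page, page_offset + poff,
                      b_offset, clen, &submit);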

    No functional change.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • Add a new member, offset, to struct r5dev. It indicates the offset of
    the related dev[i].page. For now, since each device has a private
    page, the value is always 0. Thus, we set offset to 0 when allocating
    pages in grow_buffers() and resize_stripes().

    To support different page offsets later, we use the page offset
    rather than '0' directly for async_memcpy() and ops_run_io().

    We will add support for different page offsets to the xor computation
    functions in the following patches. To avoid repeatedly allocating a
    new array each time, we add a memory region to the scribble buffer to
    record the offsets.
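
    A sketch of the struct change described above (fields abridged):

    struct r5dev {
            /* ... */
            struct page *page;
            unsigned int offset;    /* offset of this dev's data in page */
            /* ... */
    };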

    No functional change.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • The BDI_CAP_STABLE_WRITES flag is one of the few bits of information
    in the backing_dev_info shared between the block drivers and the
    writeback code. To help untangle the dependency, replace it with a
    queue flag and a superblock flag derived from it. This also helps
    with the case of e.g. a file system requiring stable writes due to
    its own checksumming, but not forcing it on other users of the block
    device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute,
    which is also writable for easier testing.
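
    A sketch of how a driver now requests stable pages under this scheme
    (flag name per the series; illustrative):

    /* instead of: bdi->capabilities |= BDI_CAP_STABLE_WRITES; */
    blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);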

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers shouldn't really mess with the readahead size, as that is a VM
    concept. Instead, set it based on the optimal I/O size by lifting the
    algorithm from the md driver when registering the disk. Also set
    bdi->io_pages there by applying the same scheme based on max_sectors.
    To ensure the limits work well for stacking drivers, a new helper is
    added to update the readahead limits from the block limits, which is
    also called from disk_stack_limits.
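
    The lifted algorithm amounts to roughly the following sketch (the
    helper name and details are approximations of the series):

    void blk_queue_update_readahead(struct request_queue *q)
    {
            /* readahead: twice the optimal I/O size, at least the VM default */
            q->backing_dev_info->ra_pages =
                    max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
            q->backing_dev_info->io_pages =
                    queue_max_sectors(q) >> (PAGE_SHIFT - 9);
    }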

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Jan Kara
    Reviewed-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The raid5 and raid10 drivers currently update the read-ahead size,
    but not the optimal I/O size on reshape. To prepare for deriving the
    read-ahead size from the optimal I/O size make sure it is updated
    as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Acked-by: Song Liu
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

29 Aug, 2020

1 commit

  • Pull block fixes from Jens Axboe:

    - nbd timeout fix (Hou)

    - device size fix for loop LOOP_CONFIGURE (Martijn)

    - MD pull from Song with raid5 stripe size fix (Yufen)

    * tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
    md/raid5: make sure stripe_size as power of two
    loop: Set correct device size when using LOOP_CONFIGURE
    nbd: restore default timeout when setting it to zero

    Linus Torvalds
     

28 Aug, 2020

1 commit

  • Commit 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs
    entry") made stripe_size a configurable value, but only required it
    to be a multiple of 4KB.

    In fact, we should make sure stripe_size is a power of two. Otherwise,
    stripe_shift, which is the result of ilog2, cannot represent the real
    stripe_size, and stripe_hash() and stripe_hash_locks_hash() may get
    unexpected values.
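
    A sketch of the strengthened sysfs validation (condition approximate):

    /* reject sizes an ilog2()-derived stripe_shift cannot represent */
    if (new == 0 || new % 4096 != 0 || new > PAGE_SIZE ||
        !is_power_of_2(new))
            return -EINVAL;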

    Fixes: 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry")
    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants
    with the new pseudo-keyword macro fallthrough[1]. Also remove
    fall-through markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
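
    The conversion has this shape (hypothetical switch; the case names
    are placeholders):

    switch (state) {
    case STATE_PREP:
            prepare();
            fallthrough;    /* replaces the old comment */
    case STATE_RUN:
            run();
            break;
    }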

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which caused build failures in
    various situations due to the lockdep additions to seqcount that
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    The sequence counter now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside from the type and the initializer, no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have already been addressed independently of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from <asm/smp.h>
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

06 Aug, 2020

1 commit

  • Pull block driver updates from Jens Axboe:

    - NVMe:
    - ZNS support (Aravind, Keith, Matias, Niklas)
    - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
    Dongli, Max, Sagi)

    - null_blk zone capacity support (Aravind)

    - MD:
    - raid5/6 fixes (ChangSyun)
    - Warning fixes (Damien)
    - raid5 stripe fixes (Guoqing, Song, Yufen)
    - sysfs deadlock fix (Junxiao)
    - raid10 deadlock fix (Vitaly)

    - struct_size conversions (Gustavo)

    - Set of bcache updates/fixes (Coly)

    * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
    md/raid5: Allow degraded raid6 to do rmw
    md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
    raid5: don't duplicate code for different paths in handle_stripe
    raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
    md: print errno in super_written
    md/raid5: remove the redundant setting of STRIPE_HANDLE
    md: register new md sysfs file 'uuid' read-only
    md: fix max sectors calculation for super 1.0
    nvme-loop: remove extra variable in create ctrl
    nvme-loop: set ctrl state connecting after init
    nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
    nvme-multipath: fix logic for non-optimized paths
    nvme-rdma: fix controller reset hang during traffic
    nvme-tcp: fix controller reset hang during traffic
    nvmet: introduce the passthru Kconfig option
    nvmet: introduce the passthru configfs interface
    nvmet: Add passthru enable/disable helpers
    nvmet: add passthru code to process commands
    nvme: export nvme_find_get_ns() and nvme_put_ns()
    nvme: introduce nvme_ctrl_get_by_path()
    ...

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

03 Aug, 2020

4 commits

  • Degraded raid6 always does reconstruct-write now. With raid6 xor
    support, we can do rmw in degraded raid6. This patch avoids many read
    IOs and improves performance.

    If the failed disk is P, Q, or the disk we want to write to, we may
    still need to do reconstruct-write in maximally degraded raid6. In
    this situation we cannot read enough data in handle_stripe_dirtying(),
    so we have to set force_rcw in handle_stripe_fill() to read all data.

    Reviewed-by: Alex Wu
    Reviewed-by: BingJing Chang
    Reviewed-by: Danny Shih
    Signed-off-by: ChangSyun Peng
    Signed-off-by: Song Liu

    ChangSyun Peng
     
  • In degraded raid5, we need to read parity to do reconstruct-write
    when a data disk fails. However, we cannot read parity from
    handle_stripe_dirtying() in force reconstruct-write mode.

    Reproducible steps:

    1. Create a degraded raid5:
       mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
    2. Set rmw_level to 0:
       echo 0 > /sys/block/md2/md/rmw_level
    3. Issue IO to the raid5 device.

    Now some IO may get stuck in raid5. We can use handle_stripe_fill() to
    read the parity in this situation.

    Cc: # v4.4+
    Reviewed-by: Alex Wu
    Reviewed-by: BingJing Chang
    Reviewed-by: Danny Shih
    Signed-off-by: ChangSyun Peng
    Signed-off-by: Song Liu

    ChangSyun Peng
     
  • As we can see, R5_LOCKED is set and s.locked is increased whether
    R5_ReWrite is set or not, so move this to the common path.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • The flag is already set before comparing rcw with rmw, so it is not
    necessary to do it again.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     

29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows associating a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
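
    In raid5 the conversion looks roughly like this sketch (conf field
    names per drivers/md/raid5.h):

    /* before: plain counter, no lockdep association */
    seqcount_init(&conf->gen_lock);

    /* after: seqcount_spinlock_t, associated with the spinlock that
     * serializes its writers */
    seqcount_spinlock_init(&conf->gen_lock, &conf->device_lock);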

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Song Liu
    Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de

    Ahmed S. Darwish
     

22 Jul, 2020

3 commits

  • Add a new 'stripe_size' sysfs entry to set and show stripe_size.
    stripe_size should not be bigger than PAGE_SIZE, and it is required
    to be a multiple of 4096. We can adjust stripe_size by writing a
    value into the sysfs entry, e.g. to set stripe_size to 16KB:

    echo 16384 > /sys/block/md1/md/stripe_size

    Show current stripe_size value:

    cat /sys/block/md1/md/stripe_size

    When PAGE_SIZE is equal to 4096, 'stripe_size' can only be read.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • In RAID5, if the issued bio size is bigger than stripe_size, it will
    be split into stripe_size units and processed one by one. Even for
    sizes less than stripe_size, RAID5 also requests data from disk in
    units of at least stripe_size.

    Currently, stripe_size is equal to PAGE_SIZE. Since filesystems
    usually issue bios in 4KB units, there is no problem when PAGE_SIZE
    is 4KB. But with a 64KB PAGE_SIZE, a bio from the filesystem requests
    4KB of data while RAID5 issues IO of at least stripe_size (64KB) each
    time, which wastes disk bandwidth and xor computation.

    To avoid this waste, we want to make stripe_size configurable. This
    patch just sets the default stripe_size to 4096. Users can also set a
    value bigger than 4KB for special requirements, such as when the
    issued io size is known to be more than 4KB.

    To evaluate the new feature, we create a raid5 device '/dev/md5' with
    4 SSD disks and test it on an arm64 machine with a 64KB PAGE_SIZE.

    1) We format /dev/md5 with mkfs.ext4 and mount it with the default
    configuration on the /mnt directory. Then we test it with dbench
    using the command: dbench -D /mnt -t 1000 10. Results:

    'stripe_size = 64KB'

    Operation        Count    AvgLat    MaxLat
    ------------------------------------------
    NTCreateX      9805011     0.021    64.728
    Close          7202525     0.001     0.120
    Rename          415213     0.051    44.681
    Unlink         1980066     0.079    93.147
    Deltree            240     1.793     6.516
    Mkdir              120     0.004     0.007
    Qpathinfo      8887512     0.007    37.114
    Qfileinfo      1557262     0.001     0.030
    Qfsinfo        1629582     0.012     0.152
    Sfileinfo       798756     0.040    57.641
    Find           3436004     0.019    57.782
    WriteX         4887239     0.021    57.638
    ReadX         15370483     0.005    37.818
    LockX            31934     0.003     0.022
    UnlockX          31933     0.001     0.021
    Flush           687205    13.302   530.088

    Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms
    -------------------------------------------------------

    'stripe_size = 4KB'

    Operation        Count    AvgLat    MaxLat
    ------------------------------------------
    NTCreateX     11999166     0.021    36.380
    Close          8814128     0.001     0.122
    Rename          508113     0.051    29.169
    Unlink         2423242     0.070    38.141
    Deltree            300     1.885     7.155
    Mkdir              150     0.004     0.006
    Qpathinfo     10875921     0.007    35.485
    Qfileinfo      1905837     0.001     0.032
    Qfsinfo        1994304     0.012     0.125
    Sfileinfo       977450     0.029    26.489
    Find           4204952     0.019     9.361
    WriteX         5981890     0.019    27.804
    ReadX         18809742     0.004    33.491
    LockX            39074     0.003     0.025
    UnlockX          39074     0.001     0.014
    Flush           841022    10.712   458.848

    Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms
    -------------------------------------------------------

    This shows that setting stripe_size to 4KB yields higher throughput
    (376.777 vs 307.799 MB/sec) and lower latency than setting it to
    64KB.

    2) We evaluate IO throughput for /dev/md5 with fio using this config:

    [4KB randwrite]
    direct=1
    numjob=2
    iodepth=64
    ioengine=libaio
    filename=/dev/md5
    bs=4KB
    rw=randwrite

    [1MB write]
    direct=1
    numjob=2
    iodepth=64
    ioengine=libaio
    filename=/dev/md5
    bs=1MB
    rw=write

    The results are as follows:

                  | stripe_size(64KB) | stripe_size(4KB)
    --------------+-------------------+------------------
    4KB randwrite |       15MB/s      |      100MB/s
    --------------+-------------------+------------------
    1MB write     |      1000MB/s     |      700MB/s

    The results show that when the issued IO size is large (the 1MB write
    case), 64KB stripe_size has much higher throughput, while for 4KB
    randwrite, where the IOs issued to the device are smaller, 4KB
    stripe_size performs better.

    Normally, the default value (4096) gives relatively good performance.
    But if each issued io is bigger than 4096, setting a value larger
    than 4096 may give better performance.

    Here, we just set the default stripe_size to 4096; support for
    setting a different stripe_size via the sysfs interface follows in
    the next patch.

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     
  • Convert the macros STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to
    RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT().

    This patch prepares for the following adjustable stripe_size. It does
    not change any existing functionality.
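
    For now the converted macros presumably reduce to the old constants;
    a sketch:

    /* same values as before, but taking conf for later flexibility */
    #define RAID5_STRIPE_SIZE(conf)     STRIPE_SIZE
    #define RAID5_STRIPE_SECTORS(conf)  STRIPE_SECTORS
    #define RAID5_STRIPE_SHIFT(conf)    STRIPE_SHIFT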

    Signed-off-by: Yufen Yu
    Signed-off-by: Song Liu

    Yufen Yu
     

17 Jul, 2020

4 commits

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     
  • We can't guarantee whether a batched stripe is set with STRIPE_HANDLE,
    since there are lots of functions that could set the flag, such as
    sync_request, ops_complete_* and end_{read,write}_request etc.

    Also, clear_batch_ready, called in handle_stripe, ensures that the
    batched list can't continue to be handled by handle_stripe.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • To help people understand the function better, let's put the comment
    in the right place.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • We tried to only put the head sh of a batch list onto handle_list, so
    that handle_stripe doesn't handle the other members of the batch
    list. However, we still got the call trace in break_stripe_batch_list.

    [593764.644269] stripe state: 2003
    kernel: [593764.644299] ------------[ cut here ]------------
    kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456]
    [...]
    kernel: [593764.644363] Call Trace:
    kernel: [593764.644370] handle_stripe+0x907/0x20c0 [raid456]
    kernel: [593764.644376] ? __wake_up_common_lock+0x89/0xc0
    kernel: [593764.644379] handle_active_stripes.isra.57+0x35f/0x570 [raid456]
    kernel: [593764.644382] ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456]
    kernel: [593764.644385] raid5d+0x480/0x6a0 [raid456]
    kernel: [593764.644390] ? md_thread+0x11f/0x160
    kernel: [593764.644392] md_thread+0x11f/0x160
    kernel: [593764.644394] ? wait_woken+0x80/0x80
    kernel: [593764.644396] kthread+0xfc/0x130
    kernel: [593764.644398] ? find_pers+0x70/0x70
    kernel: [593764.644399] ? kthread_create_on_node+0x70/0x70
    kernel: [593764.644401] ret_from_fork+0x1f/0x30

    As we can see, the stripe was set with STRIPE_ACTIVE and
    STRIPE_HANDLE, and only handle_stripe could set those flags and then
    return. Since the stripe was already in the batch list, we need to
    return earlier, before setting the two flags.

    And after digging a little into the git history, especially commit
    3664847d95e6 ("md/raid5: fix a race condition in stripe batch"), it
    seems a batched stripe could still be handled by handle_stripe, so
    handle_stripe needs to return earlier if clear_batch_ready() returns
    true.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     

16 Jul, 2020

1 commit

  • Remove the if statement around the calls to sysfs_link_rdev() to avoid
    the compilation warning "suggest braces around empty body in an ‘if’
    statement" when compiling with W=1.

    Also fix function description comments to avoid kdoc format warnings.

    Signed-off-by: Damien Le Moal
    Signed-off-by: Song Liu

    Damien Le Moal
     

15 Jul, 2020

1 commit

  • The following deadlock was captured. The first process is holding
    'kernfs_mutex' and hung on io. The io was staged in the
    'r1conf.pending_bio_list' of the raid1 device; this pending bio list
    would be flushed by the second process 'md127_raid1', but that
    process was hung on 'kernfs_mutex'. Using sysfs_notify_dirent_safe()
    to replace sysfs_notify() fixes it (a sketch of the replacement
    follows the traces below). There were other sysfs_notify() calls
    invoked from the io path; all of them have been removed.

    PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file"
    #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
    #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
    #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
    #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
    #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
    #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
    #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
    #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
    #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
    #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
    #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
    #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
    #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
    #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
    #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
    #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
    #16 [ffffb87c4df37958] new_slab at ffffffff9a251430
    #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
    #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
    #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
    #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
    #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
    #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
    #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
    #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
    #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
    #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
    #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
    #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
    #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
    #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
    #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
    #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
    #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
    RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905
    RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890
    RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c
    R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0
    R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff
    ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

    PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1"
    #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
    #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
    #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
    #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
    #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
    #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
    #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
    #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
    #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
    #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
    #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
    #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
    #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
    #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
    #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344
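
    The replacement pattern, sketched (md caches kernfs nodes such as
    rdev->sysfs_state; treat the exact field names as illustrative):

    /* before: name-based lookup takes kernfs_mutex and can deadlock */
    sysfs_notify(&rdev->kobj, NULL, "state");

    /* after: uses the cached kernfs node, no kernfs_mutex needed */
    sysfs_notify_dirent_safe(rdev->sysfs_state);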

    Signed-off-by: Junxiao Bi
    Signed-off-by: Song Liu

    Junxiao Bi
     

09 Jul, 2020

1 commit

  • Except for pktcdvd, the only places setting congested bits are file
    systems that allocate their own backing_dev_info structures. And
    pktcdvd is a deprecated driver that isn't useful in stacked setups
    either. So remove the dead congested_fn stacking infrastructure.

    Signed-off-by: Christoph Hellwig
    Acked-by: Song Liu
    Acked-by: David Sterba
    [axboe: fixup unused variables in bcache/request.c]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 May, 2020

2 commits

  • The code comments of scribble_alloc() have been outdated for a while.
    This patch updates the comments in the function header for the new
    parameter list.

    Suggested-by: Song Liu
    Signed-off-by: Coly Li
    Signed-off-by: Song Liu

    Coly Li
     
  • Using the GFP_NOIO flag to call scribble_alloc() from resize_chunk()
    does not have the expected behavior: kvmalloc_array() inside
    scribble_alloc(), which receives the GFP_NOIO flag, will eventually
    call kmalloc_node() to allocate physically contiguous pages.

    Now that we have the memalloc scope APIs in
    mddev_suspend()/mddev_resume() to prevent memory-reclaim I/O during
    the raid array suspend context, calling kvmalloc_array() with the
    GFP_KERNEL flag avoids the recursive-I/O deadlock as expected.

    This patch removes the useless gfp flags from the parameter list of
    scribble_alloc() and calls kvmalloc_array() with the GFP_KERNEL flag.
    The incorrect GFP_NOIO flag does not exist anymore.
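
    The shape of the change, sketched (signatures approximate):

    /* before: caller-supplied gfp flags, GFP_NOIO from resize_chunk() */
    addr = kvmalloc_array(num, obj_size, flags);

    /* after: no gfp parameter; the memalloc scope set up by
     * mddev_suspend() already prevents reclaim I/O here */
    addr = kvmalloc_array(num, obj_size, GFP_KERNEL);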

    Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
    Suggested-by: Michal Hocko
    Signed-off-by: Coly Li
    Signed-off-by: Song Liu

    Coly Li
     

12 Dec, 2019

1 commit

  • With commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ("raid5: don't
    set STRIPE_HANDLE to stripe which is in batch list"), we don't want
    to set the STRIPE_HANDLE flag for an sh which is already in a batch
    list.

    However, the stripe which is the head of a batch list should set this
    flag; otherwise a panic could happen inside init_stripe() at
    BUG_ON(sh->batch_head). It is reproducible with raid5 on top of
    nvdimm devices, as Xiao observed.

    Thanks to Xiao for the effort to verify the change.

    Fixes: 6ce220dd2f8ea ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list")
    Reported-by: Xiao Ni
    Tested-by: Xiao Ni
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     

15 Nov, 2019

1 commit