30 Dec, 2019

1 commit

  • We ran into a problem with an mpt3sas-based controller, where we would
    see random (and hard to reproduce) file corruption. The issue seemed
    specific to this controller, but wasn't specific to the file system.
    After a lot of debugging, we found out that it's caused by segments
    spanning a 4G memory boundary. This shouldn't happen, as the default
    setting for segment boundary masks is 4G.

    Turns out there are two issues in get_max_segment_size():

    1) The default segment boundary mask is bypassed

    2) The segment start address isn't taken into account when checking
    segment boundary limit

    Fix these two issues by no longer bypassing the segment boundary
    check when the mask is set to the default value, and by taking into
    account the actual start address of the request when checking if a
    segment needs splitting.

    Cc: stable@vger.kernel.org # v5.1+
    Reviewed-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei

    Dropped const on the page pointer, ppc page_to_phys() doesn't mark the
    page as const...

    Signed-off-by: Jens Axboe

    Ming Lei
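
The boundary check described in the commit above can be illustrated with a small userspace sketch. This is not the kernel's get_max_segment_size(); the helper name max_segment_len() and its parameters are hypothetical, and the only assumption taken from the commit text is that the segment's start address must be masked against the boundary before the remaining room is computed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: largest segment length (in bytes) that may start at
 * physical address `start` without crossing the boundary described by
 * `boundary_mask` (0xffffffff for a 4G boundary) or exceeding `max_seg`.
 */
static uint64_t max_segment_len(uint64_t start, uint64_t boundary_mask,
                                uint64_t max_seg)
{
    assert(max_seg > 0);

    /* Position of the start address inside its boundary window. */
    uint64_t offset = start & boundary_mask;

    /* Bytes left before the next boundary; wraps to 0 for an all-ones mask. */
    uint64_t to_boundary = boundary_mask - offset + 1;

    if (to_boundary == 0 || to_boundary > max_seg)
        return max_seg;
    return to_boundary;
}
```

A segment starting 4096 bytes below a 4G boundary is capped at 4096 bytes even when the queue would otherwise allow 64K segments, which is the situation the start-address check is meant to catch.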
     

29 Dec, 2019

1 commit

  • Some filesystems, such as vfat, may send a bio that crosses the device
    boundary, and worse, an IO request starting within the device boundary
    can contain more than one segment past EOD.

    Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    tries to fix this issue by returning -EIO in this situation. However,
    this deprives fs user code of the chance to handle -EIO, and
    sync_inodes_sb() may then hang forever.

    Also, the current truncation of the last segment is dangerous because it
    updates the last bvec: the bvec table is no longer immutable, and fs bio
    users may not be able to retrieve the truncated pages via
    bio_for_each_segment_all() in their .end_io callback.

    Fix this issue by supporting multi-segment truncation. The approach is
    also simpler:

    - just update the bio size, since the block layer can build the correct
    bvec from the updated bio size. The bvec table then becomes truly
    immutable.

    - zero all truncated segments for a read bio

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
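
The two steps listed in the commit above (shrink only the size, zero the truncated tail for reads) can be sketched with a simplified userspace model. The types struct seg and struct io and the function io_truncate() are hypothetical stand-ins, not the kernel's struct bio or its truncation helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for a bio and its segment table. */
struct seg { unsigned char *buf; size_t len; };
struct io  { struct seg *segs; size_t nr_segs; size_t size; int is_read; };

/*
 * Shrink io->size to new_size. The segment table itself is never modified;
 * for reads, every byte past new_size is zeroed, across all segments.
 */
static void io_truncate(struct io *io, size_t new_size)
{
    size_t pos = 0; /* byte offset of the current segment within the I/O */

    assert(io != NULL);
    if (new_size >= io->size)
        return;

    if (io->is_read) {
        for (size_t i = 0; i < io->nr_segs && pos < io->size; i++) {
            struct seg *s = &io->segs[i];
            size_t end = pos + s->len;

            if (end > io->size)
                end = io->size;
            if (end > new_size) {
                /* First byte of this segment that lies past new_size. */
                size_t from = pos < new_size ? new_size - pos : 0;

                memset(s->buf + from, 0, (end - pos) - from);
            }
            pos += s->len;
        }
    }
    io->size = new_size;
}
```

Because only the size field changes, code that later walks the full segment table still sees every original segment.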
     

21 Dec, 2019

7 commits

  • These were added to blkdev_ioctl() in v4.4 but not
    blkdev_compat_ioctl, so add them now.

    Cc: # v4.4+
    Fixes: bbd3e064362e ("block: add an API for Persistent Reservations")
    Signed-off-by: Arnd Bergmann

    Fold in followup patch from Arnd with missing pr.h header include.

    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • These were added to blkdev_ioctl() in linux-5.5 but not
    blkdev_compat_ioctl, so add them now.

    Fixes: e876df1fe0ad ("block: add zone open, close and finish ioctl support")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl,
    so add them now.

    Cc: # v4.20+
    Fixes: 72cd87576d1d ("block: Introduce BLKGETZONESZ ioctl")
    Fixes: 65e4e3eee83d ("block: Introduce BLKGETNRZONES ioctl")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • These were added to blkdev_ioctl() but not blkdev_compat_ioctl,
    so add them now.

    Cc: # v4.10+
    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • While running a fuzz test, I got the following memleak report:

    BUG: memory leak
    unreferenced object 0xffff88837af80000 (size 4096):
    comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ...............
    backtrace:
    [] bio_alloc_bioset+0x393/0x590
    [] bio_copy_user_iov+0x300/0xcd0
    [] blk_rq_map_user_iov+0x2f1/0x5f0
    [] blk_rq_map_user+0xf2/0x160
    [] sg_common_write.isra.21+0x1094/0x1870
    [] sg_write.part.25+0x5d9/0x950
    [] sg_write+0x5f/0x8c
    [] __vfs_write+0x7c/0x100
    [] vfs_write+0x1c3/0x500
    [] ksys_write+0xf9/0x200
    [] do_syscall_64+0x9f/0x4f0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If __blk_rq_map_user_iov() fails in blk_rq_map_user_iov(), the bio(s)
    allocated before the failure will leak. The refcount of each bio is
    initialized to 1 and increased to 2 by calling bio_get(), but
    __blk_rq_unmap_user() only decreases it to 1, so the bio cannot be
    freed. Fix it by calling blk_rq_unmap_user().

    Reviewed-by: Bob Liu
    Reported-by: Hulk Robot
    Signed-off-by: Yang Yingliang
    Signed-off-by: Jens Axboe

    Yang Yingliang
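
The reference counting described in the commit above can be modelled with a toy example. All names here (struct obj, obj_get(), obj_put(), unmap_leaky(), unmap_full()) are hypothetical; the sketch only shows why dropping one of two references leaves the object allocated, and makes no claim about how the kernel functions are implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted bio. */
struct obj { int refs; int freed; };

static void obj_init(struct obj *o)
{
    assert(o != NULL);
    o->refs = 1;   /* initial reference, as in the commit text */
    o->freed = 0;
}

static void obj_get(struct obj *o) { o->refs++; }

static void obj_put(struct obj *o)
{
    if (--o->refs == 0)
        o->freed = 1;   /* a real implementation would free memory here */
}

/* Error path that drops only the extra reference: the object leaks. */
static void unmap_leaky(struct obj *o) { obj_put(o); }

/* Error path that drops the extra and the initial reference. */
static void unmap_full(struct obj *o) { obj_put(o); obj_put(o); }
```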
     
  • Avoid that running test nvme/012 from the blktests suite triggers the
    following false positive lockdep complaint:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch fixes the following sparse warnings:

    block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types)
    block/bsg-lib.c:269:19: expected int sts
    block/bsg-lib.c:269:19: got restricted blk_status_t [usertype]
    block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types)
    block/bsg-lib.c:286:16: expected restricted blk_status_t
    block/bsg-lib.c:286:16: got int [assigned] sts

    Cc: Martin Wilck
    Fixes: d46fe2cb2dce ("block: drop device references in bsg_queue_rq()")
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

18 Dec, 2019

1 commit

  • Non-mq devs do not honor REQ_NOWAIT, so give the caller a chance to
    repeat the request gracefully on an -EAGAIN error.

    The problem is easily reproduced using io_uring:

    mkfs.ext4 /dev/ram0
    mount /dev/ram0 /mnt

    # Preallocate a file
    dd if=/dev/zero of=/mnt/file bs=1M count=1

    # Start fio with io_uring and get -EIO
    fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file

    Signed-off-by: Roman Penyaev
    Signed-off-by: Jens Axboe

    Roman Penyaev
     

17 Dec, 2019

1 commit

  • When over-budget IOs are force-issued through root cgroup,
    iocg_kick_delay() adjusts the async delay accordingly but doesn't
    actually schedule async throttle for the issuing task. This bug is
    pretty well masked because sooner or later the offending threads are
    gonna get directly throttled on regular IOs or have async delay
    scheduled by mem_cgroup_throttle_swaprate().

    However, it can affect control quality on filesystem metadata heavy
    operations. Let's fix it by invoking blkcg_schedule_throttle() when
    iocg_kick_delay() says async delay is needed.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org
    Reported-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Dec, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - stable fix for the bi_size overflow. Not a corruption issue, but a
    case where we could merge but disallowed it (Andreas)

    - NVMe pull request via Keith, with various fixes.

    - MD pull request from Song.

    - Merge window regression fix for the rq passthrough stats (Logan)

    - Remove unused blkcg_drain_queue() function (Guoqing)

    * tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
    blk-cgroup: remove blkcg_drain_queue
    block: fix NULL pointer dereference in account statistics with IDE
    md: make sure desc_nr less than MD_SB_DISKS
    md: raid1: check rdev before reference in raid1_sync_request func
    raid5: need to set STRIPE_HANDLE for batch head
    block: fix "check bi_size overflow before merge"
    nvme/pci: Fix read queue count
    nvme/pci Limit write queue sizes to possible cpus
    nvme/pci: Fix write and poll queue types
    nvme/pci: Remove last_cq_head
    nvme: Namepace identification descriptor list is optional
    nvme-fc: fix double-free scenarios on hw queues
    nvme: else following return is not needed
    nvme: add error message on mismatching controller ids
    nvme_fc: add module to ops template to allow module references
    nvmet-loop: Avoid preallocating big SGL for data
    nvme-fc: Avoid preallocating big SGL for data
    nvme-rdma: Avoid preallocating big SGL for data

    Linus Torvalds
     

13 Dec, 2019

1 commit


12 Dec, 2019

1 commit

  • The IDE driver creates some passthru requests which never get
    submitted to the block layer in such a way that blk_account_io_start()
    gets called. However, the driver still calls __blk_mq_end_request() in
    ide_end_rq(), which calls blk_account_io_completion(), which in turn
    tries to dereference req->part, which is never set. See
    ide_prep_sense() for an example of where these requests come from.

    To fix this, blk_account_io_completion() and blk_account_io_done()
    should do nothing if req->part is not set.

    The back trace of this bug is:

    BUG: kernel NULL pointer dereference, address: 000002ac
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    *pde = 00000000
    Oops: 0002 [#1]
    CPU: 0 PID: 237 Comm: kworker/0:1H Not tainted
    5.4.0-rc2-00011-g48d9b0d43105e #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1
    04/01/2014
    Workqueue: kblockd drive_rq_insert_work
    EIP: blk_account_io_completion+0x7a/0xf0
    Code: 89 54 24 08 31 d2 89 4c 24 04 31 c9 c7 04 24 02 00 00 00 c1 ee
    09 e8 f5 21 a6 ff e8 70 5c a7 ff 8b 53 60 8d 04 bd 00 00 00 00 b4
    02 ac 02 00 00 8b 9a 88 02 00 00 85 db 74 11 85 d2 74 51 8b
    EAX: 00000000 EBX: f5b80000 ECX: 00000000 EDX: 00000000
    ESI: 00000000 EDI: 00000000 EBP: f3031e70 ESP: f3031e54
    DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010046
    CR0: 80050033 CR2: 000002ac CR3: 03c25000 CR4: 000406d0
    Call Trace:

    blk_update_request+0x85/0x420
    ide_end_rq+0x38/0xa0
    ide_complete_rq+0x3d/0x70
    cdrom_newpc_intr+0x258/0xba0
    ide_intr+0x135/0x250
    __handle_irq_event_percpu+0x3e/0x250
    handle_irq_event_percpu+0x1f/0x50
    handle_irq_event+0x32/0x60
    handle_level_irq+0x6c/0x110
    handle_irq+0x72/0xa0

    do_IRQ+0x45/0xad
    common_interrupt+0x115/0x11c

    Fixes: 48d9b0d43105 ("block: account statistics for passthrough requests")
    Reported-by: kernel test robot
    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Jens Axboe

    Logan Gunthorpe
     

10 Dec, 2019

2 commits

  • This partially reverts commit e3a5d8e386c3fb973fa75f2403622a8f3640ec06.

    Commit e3a5d8e386c3 ("check bi_size overflow before merge") adds a bio_full
    check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail
    when the last bi_io_vec has been reached. Instead, what we want here is only
    the bi_size overflow check.

    Fixes: e3a5d8e386c3 ("block: check bi_size overflow before merge")
    Cc: stable@vger.kernel.org # v5.4+
    Reviewed-by: Ming Lei
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Jens Axboe

    Andreas Gruenbacher
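
The distinction made in the commit above, between a size overflow check and a "table is full" check, can be sketched as follows. Both helper names are hypothetical and the 32-bit byte counter is an assumption; the point is only that merging into the last existing vector needs no free slot, so a full table alone should not block the merge.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: would adding `len` bytes overflow a 32-bit counter? */
static bool size_would_overflow(unsigned int cur_size, unsigned int len)
{
    return cur_size > UINT_MAX - len;
}

/*
 * Hypothetical merge decision for appending `len` bytes to the LAST
 * vector that is already in use.
 */
static bool can_merge_into_last_vec(unsigned int cur_size, unsigned int len,
                                    unsigned int used_vecs,
                                    unsigned int max_vecs)
{
    assert(used_vecs > 0 && used_vecs <= max_vecs);

    /* A full table (used_vecs == max_vecs) is fine: no new slot is needed. */
    return !size_would_overflow(cur_size, len);
}
```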
     
  • Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
    at places where these are defined. Later patches will remove the unused
    definition of FIELD_SIZEOF().

    This patch is generated using the following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do
        if [[ "$file" =~ $EXCLUDE_FILES ]]; then
            continue
        fi
        sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done

    Signed-off-by: Pankaj Bharadiya
    Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Acked-by: David Miller # for net

    Pankaj Bharadiya
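
For readers unfamiliar with the macro named in the commit above, here is a minimal userspace sketch. The macro body is written to match the usual kernel definition as far as known; struct example_hdr and copy_tag() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size of a struct member, computed without needing an instance. */
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))

/* Hypothetical struct, for illustration only. */
struct example_hdr {
    unsigned char tag[6];
    unsigned short flags;
};

/* Copy at most sizeof(tag) bytes into the header's tag field. */
static void copy_tag(struct example_hdr *h, const unsigned char *src, size_t n)
{
    assert(n <= sizeof_field(struct example_hdr, tag));
    memcpy(h->tag, src, n);
}
```

The operand of sizeof is not evaluated, so the null pointer in the macro is never dereferenced.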
     

06 Dec, 2019

1 commit

  • 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moves
    bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
    and bio_endio(). This looks wrong because a bio may be freed
    without bio_endio() being called; for example, blk_rq_unprep_clone()
    is called from dm_mq_queue_rq() when the underlying queue of dm-mpath
    is busy.

    So commit 7c20f11680a4 causes a memory leak of bio integrity data.

    Fix this issue by re-adding bio_integrity_free() to bio_uninit().

    Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Justin Tee

    Add commit log, and simplify/fix the original patch written by Justin.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Justin Tee
     

05 Dec, 2019

1 commit

  • bio->bi_blkg will be NULL when the request has been issued in a way
    that bypasses the block layer, as shown in the following oops:

    Internal error: Oops: 96000005 [#1] SMP
    CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4
    Call trace:
    percpu_counter_add_batch+0x38/0x4c8
    bfqg_stats_update_legacy_io+0x9c/0x280
    bfq_insert_requests+0xbac/0x2190
    blk_mq_sched_insert_request+0x288/0x670
    blk_execute_rq_nowait+0x140/0x178
    blk_execute_rq+0x8c/0x140
    sg_io+0x604/0x9c0
    scsi_cmd_ioctl+0xe38/0x10a8
    scsi_cmd_blk_ioctl+0xac/0xe8
    sd_ioctl+0xe4/0x238
    blkdev_ioctl+0x590/0x20e0
    block_ioctl+0x60/0x98
    do_vfs_ioctl+0xe0/0x1b58
    ksys_ioctl+0x80/0xd8
    __arm64_sys_ioctl+0x40/0x78
    el0_svc_handler+0xc4/0x270

    so ensure its validity before using it.

    Fixes: fd41e60331b1 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     

04 Dec, 2019

1 commit

  • The current zone revalidation code has a major problem in that it
    doesn't update the zone size and q->nr_zones atomically, leading
    to a short window where an out of bounds access to the zone arrays
    is possible.

    To fix this, move the setting of the zone size into the critical
    section of blk_revalidate_disk_zones so that it gets updated together
    with the zone bitmaps and q->nr_zones. This also slightly simplifies
    the caller, as the zone size is deduced from report_zones.

    This change also makes it possible to check for a power of two zone
    size in generic code.

    Reported-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
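
The power of two check mentioned in the commit above is commonly written as a single bit test. The sketch below is a userspace illustration; is_pow2() and zone_index() are hypothetical names, and the shift-based zone lookup is shown only as one reason such a constraint is convenient.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n has exactly one bit set. */
static bool is_pow2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: index of the zone containing `sector`, assuming
 * every zone spans `zone_sectors` sectors and that value is a power of two.
 */
static uint64_t zone_index(uint64_t sector, uint64_t zone_sectors)
{
    unsigned int shift = 0;

    assert(is_pow2(zone_sectors));

    /* log2 of the zone size, so the division becomes a shift. */
    while ((zone_sectors >> shift) != 1)
        shift++;
    return sector >> shift;
}
```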
     

03 Dec, 2019

5 commits


02 Dec, 2019

1 commit

  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     

26 Nov, 2019

3 commits

  • Pull disk revalidation updates from Jens Axboe:
    "This continues the work that Jan Kara started to thoroughly cleanup
    and consolidate how we handle rescans and revalidations"

    * tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block:
    block: move clearing bd_invalidated into check_disk_size_change
    block: remove (__)blkdev_reread_part as an exported API
    block: fix bdev_disk_changed for non-partitioned devices
    block: move rescan_partitions to fs/block_dev.c
    block: merge invalidate_partitions into rescan_partitions
    block: refactor rescan_partitions

    Linus Torvalds
     
  • Pull zoned block device update from Jens Axboe:
    "Enhancements and improvements to the zoned device support"

    * tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block:
    scsi: sd_zbc: Remove set but not used variable 'buflen'
    block: rework zone reporting
    scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()
    null_blk: Add zone_nr_conv to features
    null_blk: clean up report zones
    null_blk: clean up the block device operations
    block: Remove partition support for zoned block devices
    block: Simplify report zones execution
    block: cleanup the !zoned case in blk_revalidate_disk_zones
    block: Enhance blk_revalidate_disk_zones()

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "Due to more granular branches, this one is small and will be followed
    with other core branches that add specific features. I meant to just
    have a core and drivers branch, but due to external dependencies we
    ended up adding a few more that are also core.

    The changes are:

    - Fixes and improvements for the zoned device support (Ajay, Damien)

    - sed-opal table writing and datastore UID (Revanth)

    - blk-cgroup (and bfq) stat fixes (Tejun)

    - Improvements to the block stats tracking (Pavel)

    - Fix for overrunning sysfs buffer for large number of CPUs (Ming)

    - Optimization for small IO (Ming, Christoph)

    - Fix typo in RWH lifetime hint (Eugene)

    - Dead code removal and documentation (Bart)

    - Reduction in memory usage for queue and tag set (Bart)

    - Kerneldoc header documentation (André)

    - Device/partition revalidation fixes (Jan)

    - Stats tracking for flush requests (Konstantin)

    - Various other little fixes here and there (et al)"

    * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
    Revert "block: split bio if the only bvec's length is > SZ_4K"
    block: add iostat counters for flush requests
    block,bfq: Skip tracing hooks if possible
    block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
    blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
    block: Don't disable interrupts in trigger_softirq()
    sbitmap: Delete sbitmap_any_bit_clear()
    blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
    block: split bio if the only bvec's length is > SZ_4K
    block: still try to split bio if the bvec crosses pages
    blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
    blk-cgroup: reimplement basic IO stats using cgroup rstat
    blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
    blk-throtl: stop using blkg->stat_bytes and ->stat_ios
    bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
    bfq-iosched: relocate bfqg_*rwstat*() helpers
    block: add zone open, close and finish ioctl support
    block: add zone open, close and finish operations
    block: Simplify REQ_OP_ZONE_RESET_ALL handling
    block: Remove REQ_OP_ZONE_RESET plugging
    ...

    Linus Torvalds
     

22 Nov, 2019

2 commits

  • We really don't need this, as the slow path will do the right thing
    anyway.

    This reverts commit 6952a7f8446ee85ea9d10ab87b64797a031eaae3.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Requests that trigger flushing of the volatile writeback cache to disk
    (barriers) have a significant effect on overall performance.

    The block layer has a sophisticated engine for combining several flush
    requests into one, but there are no statistics for the flushes actually
    executed by the disk. Requests which trigger flushes are usually
    barriers, i.e. zero-size writes.

    This patch adds two iostat counters to /sys/class/block/$dev/stat and
    /proc/diskstats: the count of completed flush requests and their total
    time.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

21 Nov, 2019

1 commit

  • In most cases blk_tracing is not active, but the bfq_log_bfqq macro
    generates pid_str unconditionally, which results in significant overhead.

    ## Test
    modprobe null_blk
    echo bfq > /sys/block/nullb0/queue/scheduler
    fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
    --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

    # Results
    | | baseline | w/ patch | gain |
    | iops | 113.19K | 126.42K | +11% |

    Acked-by: Paolo Valente
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

19 Nov, 2019

1 commit

  • In function 'activate_lsp', rather than hard-coding the short atom
    header (0x83), we need to let the function 'add_short_atom_header'
    append the header based on the parameter being appended.

    The parameter has been defined in Section 3.1.2.1 of
    https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_Single_User_Mode_v1-00_r1-00-Final.pdf

    Reviewed-by: Jon Derrick
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Jens Axboe

    Revanth Rajashekar
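
To show where a value such as 0x83 comes from, the sketch below builds a short atom header byte from its parts. The bit layout (0x80 identifier, 0x20 byte-string flag, 0x10 sign flag, low four bits for length) is an assumption based on the TCG short atom encoding, and the constant and function names are hypothetical rather than copied from the sed-opal driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed short atom layout: 1 0 B S n n n n */
#define SHORT_ATOM_ID         0x80u
#define SHORT_ATOM_BYTESTRING 0x20u
#define SHORT_ATOM_SIGNED     0x10u
#define SHORT_ATOM_LEN_MASK   0x0Fu

/* Hypothetical helper: compute the header byte for a short atom. */
static uint8_t short_atom_header(bool bytestring, bool has_sign,
                                 unsigned int len)
{
    assert(len <= SHORT_ATOM_LEN_MASK);

    uint8_t atom = SHORT_ATOM_ID;

    if (bytestring)
        atom |= SHORT_ATOM_BYTESTRING;
    if (has_sign)
        atom |= SHORT_ATOM_SIGNED;
    atom |= (uint8_t)(len & SHORT_ATOM_LEN_MASK);
    return atom;
}
```

Under this layout, an unsigned integer of length 3 yields 0x83, the value that the commit says was previously hard-coded.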
     

18 Nov, 2019

1 commit


15 Nov, 2019

1 commit


14 Nov, 2019

6 commits

  • In general drivers should never mess with partition tables directly.
    Unfortunately s390 and loop do for somewhat historic reasons, but they
    can use bdev_disk_changed directly instead when we export it as they
    satisfy the sanity checks we have in __blkdev_reread_part.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Stefan Haberland [dasd]
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We still have to set the capacity to 0 if invalidating, or call
    revalidate_disk if not, even if the disk has no partitions. Fix
    that by merging rescan_partitions into bdev_disk_changed and just
    stubbing out blk_add_partitions and blk_drop_partitions for
    non-partitioned devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Large parts of rescan_partitions aren't about partitions, and
    moving it to block_dev.c will allow for some further cleanups by
    merging it into its only caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A lot of the logic in invalidate_partitions and rescan_partitions is
    shared. Merge the two functions to simplify things. There is a small
    behavior change in that we now send the kevent change notice also if we
    were not invalidating but no partitions were found, which seems like
    the right thing to do.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out a helper that adds one single partition, and another one
    calling that dealing with the parsed_partitions state. This makes
    it much more obvious how we clean up all state and start again when
    using the rescan label.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
    they deserve I/O plugging"), to prevent the service guarantees of a
    bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
    scheduled for service, even if empty (see comments in
    __bfq_bfqq_expire() for details). But, if no process will send
    requests to the bfq_queue any longer, then there is no point in
    keeping the bfq_queue scheduled for service.

    In addition, keeping the bfq_queue scheduled for service, but with no
    process reference any longer, may cause the bfq_queue to be freed when
    descheduled from service. But this is assumed to never happen, and
    causes a UAF if it happens. This, in turn, caused crashes [1, 2].

    This commit fixes this issue by descheduling an empty bfq_queue when
    it remains with no process reference.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
    [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Reported-by: Chris Evich
    Reported-by: Patrick Dung
    Reported-by: Thorsten Schubert
    Tested-by: Thorsten Schubert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente