07 Dec, 2015

2 commits

  • The following commit which went into mainline through the networking tree

    3b13758f51de ("cgroups: Allow dynamically changing net_classid")

    conflicts in net/core/netclassid_cgroup.c with the following pending
    fix in cgroup/for-4.4-fixes.

    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

    The former separates out update_classid() from cgrp_attach() and
    updates it to walk all fds of all tasks in the target css so that it
    can be used from both migration and config change paths. The latter
    drops @css from cgrp_attach().

    Resolve the conflict by making cgrp_attach() call update_classid()
    with the css from the first task. We could revive @tset walking in
    cgrp_attach(), but net_cls is v1-only and there is always exactly
    one target css during a v1 migration, so this is fine.
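
    A rough sketch of the resolved cgrp_attach() (modeled on the two
    commits above; the exact merged code may differ slightly):

        static void cgrp_attach(struct cgroup_taskset *tset)
        {
                struct cgroup_subsys_state *css;

                /* net_cls is v1-only: all tasks in @tset share one css */
                cgroup_taskset_first(tset, &css);
                update_classid(css,
                               (void *)(unsigned long)css_cls_state(css)->classid);
        }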

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Cc: Nina Schiff

    Tejun Heo
     
  • Pull SCSI fixes from James Bottomley:
    "This is quite a bumper crop of fixes: three from Arnd correcting
    various build issues in some configurations, a lock recursion in
    qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a
    potential null deref in st, a revert of a bdi registration fix that
    turned out to cause even more problems, a set of fixes to allow people
    who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
    and a couple of fixes for issues turned up by the hyper-v storvsc
    driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
    Revert "scsi: Fix a bdi reregistration race"
    mpt3sas: Add dummy Kconfig option for backwards compatibility
    Fix a memory leak in scsi_host_dev_release()
    block/sd: Fix device-imposed transfer length limits
    scsi_debug: fix prevent_allow+verify regressions
    MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
    sd: Make discard granularity match logical block size when LBPRZ=1
    scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
    scsi: advansys needs ISA dma api for ISA support
    scsi_sysfs: protect against double execution of __scsi_remove_device()
    st: fix potential null pointer dereference.
    scsi: report 'INQUIRY result too short' once per host
    advansys: fix big-endian builds
    qla2xxx: Fix rwlock recursion
    hpsa: logical vs bitwise AND typo
    mvsas: don't allow negative timeouts
    mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag

    Linus Torvalds
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory
    css of P1. Now if memory is enabled on P1's subtree_control,
    memory csses should be created on both A and B, and A's processes
    should be moved to the former and B's processes to the latter.
    IOW, enabling controllers can cause atomic migrations into
    different csses.

    The core cgroup migration logic has been updated accordingly, but
    the controller migration methods haven't and still assume that all
    tasks migrate to a single target css; furthermore, the methods
    were fed the css in which subtree_control was updated, which is
    the parent of the target csses. The pids controller depends on the
    migration methods to move charges, and this made the controller
    attribute charges to the wrong csses, often triggering the
    following warning by driving a counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the
    three migration methods, ->can_attach(), ->cancel_attach() and
    ->attach(), and by updating the cgroup_taskset iteration helpers
    to also return the destination css in addition to the task being
    migrated. All controllers are updated accordingly; a before/after
    sketch of the resulting method shape follows the list below.

    * Controllers which don't care whether there is a single target
    css or several can be converted trivially. cpu, io, freezer,
    perf, netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's a single
    source and destination and thus already doesn't support the v2
    hierarchy. The only change made by this patchset is how that
    single destination css is obtained.

    * memory's migration path already doesn't do anything on v2. How
    the single destination css is obtained is updated, and the prep
    stage of mem_cgroup_can_attach() is reordered to accommodate the
    change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
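
    As a before/after sketch of what this means for a controller's
    methods (foo_attach() is a hypothetical callback; real
    per-controller bodies differ):

        /* before: one css passed in, assumed to cover every task */
        static void foo_attach(struct cgroup_subsys_state *css,
                               struct cgroup_taskset *tset)
        {
                struct task_struct *task;

                cgroup_taskset_for_each(task, tset)
                        ; /* charge @task against the single @css */
        }

        /* after: the iteration helper yields each task's own css */
        static void foo_attach(struct cgroup_taskset *tset)
        {
                struct task_struct *task;
                struct cgroup_subsys_state *css;

                cgroup_taskset_for_each(task, css, tset)
                        ; /* charge @task against its own @css */
        }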

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

30 Nov, 2015

1 commit

  • When a cloned request is retried on other queues, it always needs
    to be checked against the queue limits of that queue.
    Otherwise the calculations for nr_phys_segments might be wrong,
    leading to a crash in scsi_init_sgtable().

    To clarify this, the patch renames blk_rq_check_limits() to
    blk_cloned_rq_check_limits() and removes the symbol export, as the
    function should only be used for cloned requests and never needs
    to be exported.
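
    A sketch of the check's shape, based on the 4.3-era
    blk_rq_check_limits() (the renamed function keeps the same body):

        static int blk_cloned_rq_check_limits(struct request_queue *q,
                                              struct request *rq)
        {
                if (blk_rq_sectors(rq) >
                    blk_queue_get_max_sectors(q, rq->cmd_flags))
                        return -EIO;

                /*
                 * Recompute nr_phys_segments against this queue's
                 * limits: the clone was built for a queue with
                 * possibly different segment restrictions.
                 */
                blk_recalc_rq_segments(rq);
                if (rq->nr_phys_segments > queue_max_segments(q))
                        return -EIO;

                return 0;
        }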

    Cc: Mike Snitzer
    Cc: Ewan Milne
    Cc: Jeff Moyer
    Signed-off-by: Hannes Reinecke
    Fixes: e2a60da74 ("block: Clean up special command handling logic")
    Cc: stable@vger.kernel.org # 3.7+
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

26 Nov, 2015

3 commits

  • Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
    partition of sda is mounted (and will fail with EINVAL if pointed
    at a partition). But it will pass if the entire block device is
    formatted with a filesystem and mounted. I don't think this makes
    sense; partitioning should surely not ever change out from under
    a mounted device.

    So check for bdev->bd_super, and fail that with -EBUSY as well.
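
    The check itself is small; a sketch of the rescan path after the
    change (sitting next to the existing bd_part_count test):

        /*
         * A mounted partition or a filesystem on the whole device
         * both make a partition rescan unsafe.
         */
        if (bdev->bd_part_count || bdev->bd_super)
                return -EBUSY;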

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Eric Sandeen
     
  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs (the resulting
    derivation is sketched after this list).

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.
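
    A sketch of the resulting derivation of the effective limit for FS
    requests (helper names as in that era's blk-settings.c; a sketch,
    not the literal patch):

        /* device limit and default cap both bound max_sectors */
        max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        limits->max_sectors = max_sectors;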

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
  • This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.

    Jan writes:

    --

    Thanks for report! After some investigation I found out we allocate
    elevator specific data in __get_request() only for non-flush requests. And
    this is actually required since the flush machinery uses the space in
    struct request for something else. Doh. So my patch is just wrong and not
    easy to fix since at the time __get_request() is called we are not sure
    whether the flush machinery will be used in the end. Jens, please revert
    1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!

    I'm somewhat surprised that you can reliably hit the race where flushing
    gets disabled for the device just while the request is in flight. But I
    guess during boot it makes some sense.

    --

    So let's just revert it; we can fix the queue run manually after
    the fact. The race is rare enough that it didn't trigger in
    testing, as it requires the specific disable-while-in-flight
    scenario.

    Jens Axboe
     

25 Nov, 2015

2 commits

  • We only added the request to the request list for the !blk-mq case,
    so we should only delete it in that case as well.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block layer fixes from Jens Axboe:
    "A round of fixes/updates for the current series.

    This looks a little bigger than it is, but that's mainly because we
    pushed the lightnvm enabled null_blk change out of the merge window so
    it could be updated a bit. The rest of the volume is also mostly
    lightnvm. In particular:

    - Lightnvm. Various fixes, additions, updates from Matias and
    Javier, as well as from Wenwei Tao.

    - NVMe:
    - Fix for potential arithmetic overflow from Keith.
    - Also from Keith, ensure that we reap pending completions from
    a completion queue before deleting it. Fixes kernel crashes
    when resetting a device with IO pending.
    - Various little lightnvm related tweaks from Matias.

    - Fixup flushes to go through the IO scheduler, for the cases where a
    flush is not required. Fixes a case in CFQ where we would be
    idling and not see this request, hence not break the idling. From
    Jan Kara.

    - Use list_{first,prev,next} in elevator.c for cleaner code. From
    Geliang Tang.

    - Fix for a warning trigger on btrfs and raid on single queue blk-mq
    devices, where we would flush plug callbacks with preemption
    disabled. From me.

    - A mac partition validation fix from Kees Cook.

    - Two merge fixes from Ming, marked stable. A third part is adding a
    new warning so we'll notice this quicker in the future, if we screw
    up the accounting.

    - Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"

    * 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
    blk-merge: warn if figured out segment number is bigger than nr_phys_segments
    blk-merge: fix blk_bio_segment_split
    block: fix segment split
    blk-mq: fix calling unplug callbacks with preempt disabled
    mac: validate mac_partition is within sector
    mtip32xx: use formatting capability of kthread_create_on_node
    NVMe: reap completion entries when deleting queue
    lightnvm: add free and bad lun info to show luns
    lightnvm: keep track of block counts
    nvme: lightnvm: use admin queues for admin cmds
    lightnvm: missing free on init error
    lightnvm: wrong return value and redundant free
    null_blk: do not del gendisk with lightnvm
    null_blk: use device addressing mode
    null_blk: use ppa_cache pool
    NVMe: Fix possible arithmetic overflow for max segments
    blk-flush: Queue through IO scheduler when flush not required
    null_blk: register as a LightNVM device
    elevator: use list_{first,prev,next}_entry
    lightnvm: cleanup queue before target removal
    ...

    Linus Torvalds
     

24 Nov, 2015

3 commits

  • We have seen lots of reports of this kind of issue, so add a
    warning in blk-merge. That way it can be triggered easily, instead
    of depending on warnings/bugs surfacing in individual drivers.
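
    The warning itself amounts to a one-liner at the end of the
    request-to-scatterlist mapping path (a sketch):

        /* the mapped segment count must never exceed the precomputed one */
        WARN_ON(nsegs > rq->nr_phys_segments);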

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit bdced438acd83a ("block: setup bi_phys_segments after
    splitting") introduced the computation of bio->bi_phys_segments
    during bio splitting.

    Unfortunately neither bio->bi_seg_front_size nor
    bio->bi_seg_back_size is computed, so too many physical segments
    may be counted for one request, since these two fields are used to
    check whether a segment can span two bios.

    This patch fixes the issue by computing the two variables in
    blk_bio_segment_split().
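
    A sketch of the added bookkeeping in blk_bio_segment_split()
    (variable names as in that era's blk-merge.c; hedged):

        /* record the boundary segment sizes on the split bio */
        if (nsegs == 1 && seg_size > front_seg_size)
                front_seg_size = seg_size;
        bio->bi_seg_front_size = front_seg_size;
        if (seg_size > bio->bi_seg_back_size)
                bio->bi_seg_back_size = seg_size;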

    Fixes: bdced438acd83a ("block: setup bi_phys_segments after splitting")
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Inside blk_bio_segment_split(), the previous-bvec pointer (bvprvp)
    is always made to point at the iterator's local variable, which is
    wrong because that variable is overwritten on each iteration; fix
    it by pointing at the stable local variable 'bvprv' instead.
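
    In essence (a sketch of the one-line bug and its fix, inside the
    bio_for_each_segment(bv, bio, iter) loop):

        /* buggy: bv is the iterator's local copy, overwritten on
         * the next iteration */
        bvprvp = &bv;

        /* fixed: keep a stable copy in bvprv and point at that */
        bvprv = bv;
        bvprvp = &bvprv;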

    Fixes: 5014c311baa2b ("block: fix bogus compiler warnings in blk-merge.c")
    Cc: stable@kernel.org #4.3
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

21 Nov, 2015

1 commit

  • Liu reported that running certain parts of xfstests threw the
    following error:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
    in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
    3 locks held by kworker/u16:0/6:
    #0: ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
    #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
    #2: (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
    CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
    Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
    Workqueue: writeback wb_workfn (flush-btrfs-108)
    ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
    0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
    ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
    Call Trace:
    [] dump_stack+0x4f/0x74
    [] ___might_sleep+0x185/0x240
    [] __might_sleep+0x52/0x90
    [] __alloc_pages_nodemask+0x268/0x410
    [] ? sched_clock_local+0x1c/0x90
    [] ? local_clock+0x21/0x40
    [] ? __lock_release+0x420/0x510
    [] ? __lock_acquired+0x16c/0x3c0
    [] alloc_pages_current+0xc5/0x210
    [] ? rbio_is_full+0x55/0x70 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] ? _raw_spin_unlock_irqrestore+0x40/0x60
    [] full_stripe_write+0x5a/0xc0 [btrfs]
    [] __raid56_parity_write+0x39/0x60 [btrfs]
    [] run_plug+0x11b/0x140 [btrfs]
    [] btrfs_raid_unplug+0x23/0x70 [btrfs]
    [] blk_flush_plug_list+0x82/0x1f0
    [] blk_sq_make_request+0x1f9/0x740
    [] ? generic_make_request_checks+0x222/0x7c0
    [] ? blk_queue_enter+0x124/0x310
    [] ? blk_queue_enter+0x92/0x310
    [] generic_make_request+0x172/0x2c0
    [] ? generic_make_request+0x164/0x2c0
    [] submit_bio+0x70/0x140
    [] ? rbio_add_io_page+0x99/0x150 [btrfs]
    [] finish_rmw+0x4d9/0x600 [btrfs]
    [] full_stripe_write+0x9c/0xc0 [btrfs]
    [] raid56_parity_write+0xef/0x160 [btrfs]
    [] btrfs_map_bio+0xe3/0x2d0 [btrfs]
    [] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
    [] submit_one_bio+0x74/0xb0 [btrfs]
    [] submit_extent_page+0xe5/0x1c0 [btrfs]
    [] __extent_writepage_io+0x408/0x4c0 [btrfs]
    [] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
    [] __extent_writepage+0x218/0x3a0 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
    [] extent_writepages+0x52/0x70 [btrfs]
    [] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
    [] btrfs_writepages+0x27/0x30 [btrfs]
    [] do_writepages+0x23/0x40
    [] __writeback_single_inode+0x89/0x4d0
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x15f/0x480
    [] writeback_sb_inodes+0x2d2/0x480
    [] ? down_read_trylock+0x57/0x60
    [] ? trylock_super+0x25/0x60
    [] ? rcu_read_lock_sched_held+0x4f/0x90
    [] __writeback_inodes_wb+0x8c/0xc0
    [] wb_writeback+0x2b5/0x500
    [] ? mark_held_locks+0x78/0xa0
    [] ? __local_bh_enable_ip+0x68/0xc0
    [] ? wb_do_writeback+0x62/0x310
    [] wb_do_writeback+0xc1/0x310
    [] ? set_worker_desc+0x79/0x90
    [] wb_workfn+0x92/0x330
    [] process_one_work+0x223/0x730
    [] ? process_one_work+0x173/0x730
    [] ? worker_thread+0x18f/0x430
    [] worker_thread+0x11d/0x430
    [] ? maybe_create_worker+0xf0/0xf0
    [] ? maybe_create_worker+0xf0/0xf0
    [] kthread+0xef/0x110
    [] ? schedule_tail+0x1e/0xd0
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x3f/0x70
    [] ? __init_kthread_worker+0x70/0x70

    The issue is that we've got the software context pinned
    (preemption disabled) while calling blk_flush_plug_list(), which
    flushes callbacks that are allowed to sleep. btrfs and raid have
    such callbacks.

    Flip the checks around a bit, so we can enable preemption a bit
    earlier and flush plugs without having preemption disabled.

    This only affects blk-mq driven devices, and only those that
    register a single queue.
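
    The hazard, sketched (in this era blk_mq_get_ctx()/blk_mq_put_ctx()
    bracket a get_cpu()/put_cpu() pair, so preemption is off in
    between):

        ctx = blk_mq_get_ctx(q);          /* get_cpu(): preemption off */

        blk_flush_plug_list(plug, false); /* callbacks may sleep: BUG */

        blk_mq_put_ctx(ctx);              /* put_cpu(): preemption on */

    The fix reorders things so the ctx is dropped before the plug list
    is flushed.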

    Reported-by: Liu Bo
    Tested-by: Liu Bo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2015

2 commits

  • If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a
    single 512-byte sector would be read (secsize / 512). However, the
    partition structure would be located past the end of that buffer
    (at offset secsize % 512).
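
    A sketch of the validation (local names as in the fix; hedged):

        /* only dereference the map entry if it fits in the buffer */
        partoffset = secsize % 512;
        if (partoffset + sizeof(*part) > datasize)
                return -1;
        part = (struct mac_partition *)(data + partoffset);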

    Signed-off-by: Kees Cook
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Kees Cook
     
  • Fix use-after-free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Nov, 2015

2 commits

  • Currently blk_insert_flush() just adds a flush request to
    q->queue_head when no flush is required. That completely bypasses
    the IO scheduler, so e.g. CFQ can be idling waiting for a new
    request to arrive and will idle through the whole window
    unnecessarily. Luckily this only happens in rare cases, as the
    checks in generic_make_request_checks() usually clear the FLUSH
    and FUA flags early if they are not needed.

    When no flushing is actually required, we can easily fix the
    problem by properly queueing the request through the IO scheduler
    (a sketch follows below). Ideally the IO scheduler should also be
    made aware of requests queued via blk_flush_queue_rq(). However,
    inserting flush requests through the IO scheduler can have
    unwanted side effects: because of flush batching, delaying a flush
    request in the IO scheduler would delay all flush requests,
    possibly coming from other processes. So we keep adding those
    requests directly to q->queue_head.
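
    A sketch of the queueing change in blk_insert_flush() for the
    no-flush-needed case (hedged; the blk-mq branch is untouched):

        if ((policy & REQ_FSEQ_DATA) &&
            !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
                if (q->mq_ops) {
                        blk_mq_insert_request(rq, false, false, true);
                } else {
                        /* was: list_add_tail(&rq->queuelist, &q->queue_head); */
                        q->elevator->type->ops.elevator_add_req_fn(q, rq);
                }
                return;
        }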

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • To make the intention clearer, use list_{first,prev,next}_entry
    instead of list_entry.
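
    A representative before/after, using the request list as an
    example:

        /* before: open-coded list_entry() on a neighbouring node */
        next = list_entry(rq->queuelist.next, struct request, queuelist);

        /* after: the intent is in the name */
        next = list_next_entry(rq, queuelist);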

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

08 Nov, 2015

3 commits

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.
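
    A sketch of how a waiter consumes this, modeled on the direct-io
    path of the time (bio_done stands in for the real completion
    flag):

        blk_qc_t cookie = submit_bio(READ, bio);

        while (!bio_done) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (!(iocb->ki_flags & IOCB_HIPRI) ||
                    !blk_poll(bdev_get_queue(bdev), cookie))
                        io_schedule();  /* polling off or nothing found */
        }
        set_current_state(TASK_RUNNING);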

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • Return a cookie, blk_qc_t, from the blk-mq make request functions, that
    allows a later caller to uniquely identify a specific IO. The cookie
    doesn't mean anything to the caller, but the caller can use it to later
    pass back to the block layer. The block layer can then identify the
    hardware queue and request from that cookie.
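
    Internally the cookie just packs the tag and the hardware queue
    number (a trimmed sketch of that era's blk_types.h helpers):

        #define BLK_QC_T_NONE   -1U
        #define BLK_QC_T_SHIFT  16

        static inline blk_qc_t blk_tag_to_qc_t(unsigned int tag,
                                               unsigned int queue_num)
        {
                return tag | (queue_num << BLK_QC_T_SHIFT);
        }

        static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
        {
                return cookie >> BLK_QC_T_SHIFT;
        }

        static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
        {
                return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
        }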

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • No functional changes in this patch, but it prepares us for returning
    a more useful cookie related to the IO that was queued up.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     

07 Nov, 2015

3 commits

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks
    outside of the current pid namespace. This is in contrast both to
    the other modes of setpriority and to the example of kill(-1). Fix
    this. getpriority and ioprio have the same failure mode; fix them
    too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.
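
    A sketch of the PRIO_USER fix described above (task_pid_vnr()
    returns 0 for tasks not visible in the caller's pid namespace):

        do_each_thread(g, p) {
                /* skip tasks outside the current pid namespace */
                if (uid_eq(task_uid(p), uid) && task_pid_vnr(p))
                        error = set_one_prio(p, niceval, error);
        } while_each_thread(g, p);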

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     
  • __GFP_WAIT was used to signal that the caller was in atomic
    context and could not sleep. Now it is possible to distinguish
    between true atomic context and callers that are not willing to
    sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will
    still wake. As clearing __GFP_WAIT behaves differently, there is a
    risk that people will clear the wrong flags. This patch renames
    __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does --
    setting it allows all reclaim activity, clearing it prevents it.
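
    The resulting definition is simply the union of the two reclaim
    bits (as in gfp.h of that era):

        #define __GFP_RECLAIM \
                ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))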

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mm, page_alloc: distinguish between being unable to sleep,
    unwilling to sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers
    that hold spinlocks or are in interrupts. They are expected to be
    high priority and have access to one of two watermarks lower than
    "min", which can be referred to as the "atomic reserve".
    __GFP_HIGH users get access to the first lower watermark and can
    be called the "high priority reserve".

    Over time, callers acquired a requirement to not block when
    fallback options were available. Some have abused __GFP_WAIT,
    leading to a situation where an optimistic allocation with a
    fallback option can access atomic reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly
    atomic, cannot sleep and have no alternative. High priority users
    continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies
    callers that can sleep and are willing to enter direct reclaim.
    __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd
    for background reclaim. __GFP_WAIT is redefined as a caller that
    is willing to enter direct reclaim and wake kswapd for background
    reclaim.

    This patch then converts a number of call sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking whether they are non-blocking should
    use the helper gfpflags_allow_blocking() where possible (sketched
    after this list). This is because checking for __GFP_WAIT, as was
    done historically, can now trigger false positives. Some
    exceptions like dm-crypt.c exist, where the code's intent is
    clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due
    to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
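
    The helper mentioned above reduces to a single bit test (sketch):

        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }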

    The first key hazard to watch out for is callers that removed
    __GFP_WAIT and were depending on access to atomic reserves for
    inconspicuous reasons. In some cases it may be appropriate for
    them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

05 Nov, 2015

3 commits

  • Pull block reservation support from Jens Axboe:
    "This adds support for persistent reservations, both at the core level,
    as well as for sd and NVMe"

    [ Background from the docs: "Persistent Reservations allow restricting
    access to block devices to specific initiators in a shared storage
    setup. All implementations are expected to ensure the reservations
    survive a power loss and cover all connections in a multi path
    environment" ]

    * 'for-4.4/reservations' of git://git.kernel.dk/linux-block:
    NVMe: Precedence error in nvme_pr_clear()
    nvme: add missing endianess annotations in nvme_pr_command
    NVMe: Add persistent reservation ops
    sd: implement the Persistent Reservation API
    block: add an API for Persistent Reservations
    block: cleanup blkdev_ioctl

    Linus Torvalds
     
  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared
    tag sets at init time unless we do a state change. This cuts boot
    times down a lot on systems with thousands of devices using
    scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

03 Nov, 2015

1 commit

  • Zhangqing Luo reported long boot times on a system with thousands of
    LUNs when scsi-mq was enabled. He narrowed the problem down to
    blk_mq_add_queue_tag_set, where every queue is frozen in order to set
    the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
    added before it in sequence, which involves waiting for an RCU grace
    period for each one. We don't need to do this. After the second queue
    is added, only new queues need to be initialized with the shared tag.
    We can do that by percolating the flag up to the blk_mq_tag_set, and
    updating the newly added queue's hctxs if the flag is set.

    This problem was introduced by commit 0d2602ca30e41 ("blk-mq:
    improve support for shared tags maps").
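
    A sketch of the resulting logic (helper names hedged; the real
    patch also updates existing queues when the flag first flips):

        static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                             struct request_queue *q)
        {
                mutex_lock(&set->tag_list_lock);

                /* second queue added: mark the whole set shared, once */
                if (!list_empty(&set->tag_list) &&
                    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
                        set->flags |= BLK_MQ_F_TAG_SHARED;
                        blk_mq_update_tag_set_depth(set, true);
                }

                /* later queues simply inherit the flag from the set */
                if (set->flags & BLK_MQ_F_TAG_SHARED)
                        queue_set_hctx_shared(q, true);
                list_add_tail(&q->tag_set_list, &set->tag_list);

                mutex_unlock(&set->tag_list_lock);
        }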

    Reported-and-tested-by: Jason Luo
    Reviewed-by: Ming Lei
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

28 Oct, 2015

1 commit

  • In commit b49a087 ("block: remove split code in
    blkdev_issue_{discard,write_same}"), the discard_granularity and
    alignment checks were removed. Ideally, with late bio splitting,
    the upper layers shouldn't need to depend on the device's limits.

    Christoph reported a discard regression on the HGST Ultrastar
    SN100 NVMe device when running mkfs.xfs. We have not found the
    root cause yet.

    This patch re-adds the discard_granularity and alignment checks by
    reverting the related changes in commit b49a087. The good thing is
    that we can now remove the 2G discard size cap and just use
    UINT_MAX to avoid bi_size overflow.
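
    A sketch of the size handling in blkdev_issue_discard() after the
    change (granularity-aligned splitting, with each bio capped so the
    32-bit bi_size cannot overflow):

        /* zero granularity means unknown; treat it as one sector */
        granularity = max(q->limits.discard_granularity >> 9, 1U);
        alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;

        /* cap each bio: bi_size is 32 bits and counts bytes */
        req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);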

    Reviewed-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lin
     

22 Oct, 2015

6 commits

  • The stat files on the root cgroup show stats for the whole system
    and usually don't contain any information which isn't available
    through the usual system monitoring mechanisms. Some controllers
    skip collecting these duplicate stats to optimize cases where
    cgroup isn't used, and later try to emulate the result on demand.

    This leads to complexities and subtle differences in the
    information shown through different channels. This is entirely
    unnecessary, and cgroup v2 is dropping stat files which are
    duplicates from all controllers. This patch removes "io.stat" from
    the root hierarchy.
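
    Mechanically this is a one-flag change on the file definition (a
    sketch; cgroup v2 skips creating CFTYPE_NOT_ON_ROOT files on the
    root cgroup):

        static struct cftype blkcg_files[] = {
                {
                        .name = "stat",
                        .flags = CFTYPE_NOT_ON_ROOT,
                        .seq_show = blkcg_print_stat,
                },
                { }     /* terminate */
        };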

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Vivek Goyal

    Tejun Heo
     
  • Most of the time, flushing a plug is the hottest I/O path, so mark
    the ctx as pending once after all requests in the list have been
    inserted.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The trace point traces plug events per request queue, not per
    task, so we should check the request count in the plug list for
    the current queue rather than for the current task.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • After bio splitting was introduced, a bio can be split and marked
    as NOMERGE because it is too fat to be merged, so check
    bio_mergeable() earlier (sketched below) to avoid trying to merge
    it unnecessarily.
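
    The helper being checked (as defined in bio.h of that era):

        static inline bool bio_mergeable(struct bio *bio)
        {
                if (bio->bi_rw & REQ_NOMERGE_FLAGS)
                        return false;

                return true;
        }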

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It isn't necessary to try to merge a bio which is marked as
    NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The split bio is already too fat to merge, so mark it as NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei