15 Jan, 2012

2 commits

  • Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
    will pass the command to the underlying block device. This is
    well-known, but it is also a large security problem when (via Unix
    permissions, ACLs, SELinux or a combination thereof) a program or user
    needs to be granted access only to part of the disk.

    This patch lets partitions forward a small set of harmless ioctls;
    others are logged with printk so that we can see which ioctls are
    actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
    Of course it was being sent to a (partition on a) hard disk, so it would
    have failed with ENOTTY and the patch isn't changing anything in
    practice. Still, I'm treating it specially to avoid spamming the logs.

    In principle, this restriction should include programs running with
    CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
    /dev/sdb, it still should not be able to read/write outside the
    boundaries of /dev/sda2, independent of its capabilities. For now,
    however, programs with CAP_SYS_RAWIO will still be allowed to send the
    ioctls; their actions will still be logged.

    This patch does not affect the non-libata IDE driver. However, that
    driver already tests for bd != bd->bd_contains before issuing some
    ioctls; it could be restricted further to forbid these ioctls even for
    programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO. (A sketch of the
    whitelisting idea follows this entry.)

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    [ Make it also print the command name when warning - Linus ]
    Signed-off-by: Linus Torvalds

    Paolo Bonzini
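
    For illustration, a minimal sketch of the whitelisting idea described
    above. This is not the actual patch: sketch_verify_command() is a
    hypothetical name, and only the general shape (whole-device passthrough,
    a small whitelist, quiet failure for CDROM_GET_CAPABILITY, logging plus
    CAP_SYS_RAWIO passthrough for the rest) follows the description.

        #include <linux/blkdev.h>
        #include <linux/capability.h>
        #include <linux/cdrom.h>
        #include <linux/errno.h>
        #include <linux/printk.h>
        #include <scsi/scsi_ioctl.h>
        #include <scsi/sg.h>

        static int sketch_verify_command(struct block_device *bd, unsigned int cmd)
        {
                if (bd == bd->bd_contains)
                        return 0;               /* whole device: allow everything */

                switch (cmd) {
                case SG_GET_VERSION_NUM:
                case SCSI_IOCTL_GET_IDLUN:
                case SCSI_IOCTL_GET_BUS_NUMBER:
                        return 0;               /* harmless; forward to the device */
                case CDROM_GET_CAPABILITY:
                        /* common but pointless on disks; fail without log spam */
                        return -ENOIOCTLCMD;
                default:
                        printk_ratelimited(KERN_WARNING
                                "ioctl 0x%x not permitted on a partition\n", cmd);
                        if (capable(CAP_SYS_RAWIO))
                                return 0;       /* still allowed, but logged */
                        return -ENOIOCTLCMD;
                }
        }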
     
  • Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

    The function will then be enhanced to detect partition block devices
    and, in that case, subject the ioctls to whitelisting (a sketch of the
    wrapper's shape follows this entry).

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Linus Torvalds

    Paolo Bonzini
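
    A sketch of the wrapper's likely shape, assuming it is named along the
    lines of scsi_cmd_blk_ioctl() and simply derives the queue and disk from
    the block device before delegating (the body is illustrative, not quoted
    from the patch):

        #include <linux/blkdev.h>

        int scsi_cmd_blk_ioctl(struct block_device *bd, fmode_t mode,
                               unsigned int cmd, void __user *arg)
        {
                /* later extended to whitelist ioctls when bd is a partition */
                return scsi_cmd_ioctl(bd->bd_disk->queue, bd->bd_disk,
                                      mode, cmd, arg);
        }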
     

23 Nov, 2011

1 commit

  • struct request_queue is allocated with __GFP_ZERO so its "node" field is
    zero before initialization. This causes an oops if node 0 is offline in
    the page allocator because its zonelists are not initialized. From Dave
    Young's dmesg:

    SRAT: Node 1 PXM 2 0-d0000000
    SRAT: Node 1 PXM 2 100000000-330000000
    SRAT: Node 0 PXM 1 330000000-630000000
    Initmem setup node 1 0000000000000000-000000000affb000
    ...
    Built 1 zonelists in Node order, mobility grouping on.
    ...
    BUG: unable to handle kernel paging request at 0000000000001c08
    IP: [] __alloc_pages_nodemask+0xb5/0x870

    and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
    zonelist->_zonerefs.

    The fix is to initialize q->node at the time of allocation so the correct
    node is passed to the slab allocator later.

    Since blk_init_allocated_queue_node() is no longer needed, merge it
    with blk_init_allocated_queue(). (A sketch of the allocation-time fix
    follows this entry.)

    [rientjes@google.com: changelog, initializing q->node]
    Cc: stable@vger.kernel.org [2.6.37+]
    Reported-by: Dave Young
    Signed-off-by: Mike Snitzer
    Signed-off-by: David Rientjes
    Tested-by: Dave Young
    Signed-off-by: Jens Axboe

    Mike Snitzer
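
    Illustratively, the fix amounts to recording the node at allocation
    time; an abbreviated sketch along the lines of blk_alloc_queue_node(),
    not the verbatim patch:

        struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
        {
                struct request_queue *q;

                q = kmem_cache_alloc_node(blk_requestq_cachep,
                                          gfp_mask | __GFP_ZERO, node_id);
                if (!q)
                        return NULL;

                /* don't leave q->node as node 0 from __GFP_ZERO */
                q->node = node_id;
                ...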
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <string.h>
    net: inet_timewait_sock doesnt need <module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • The <linux/module.h> header pretty much brings in the kitchen sink
    along with it, so it should be avoided wherever reasonably possible in
    terms of being included from other commonly used <linux/something.h>
    files, as it results in a measurable increase in compile times.

    The worst culprit was probably device.h since it is used everywhere.
    This file also had an implicit dependency/usage of mutex.h which was
    masked by module.h, and is also fixed here at the same time.

    There are over a dozen other headers that simply forward-declare the
    struct instead of pulling in the whole file, so follow their lead and
    simply make it a few more (the pattern is sketched after this entry).

    Most of the implicit dependencies on module.h being present by
    these headers pulling it in have been now weeded out, so we can
    finally make this change with hopefully minimal breakage.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
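
    The pattern in question, as a generic example: a header that only stores
    a pointer can forward-declare the struct instead of including
    <linux/module.h> (some_subsys.h and its contents are hypothetical):

        /* some_subsys.h */
        #ifndef _SOME_SUBSYS_H
        #define _SOME_SUBSYS_H

        struct module;                  /* forward declaration, no module.h */

        struct some_subsys_driver {
                struct module *owner;   /* a pointer needs no struct size */
        };

        #endif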
     

19 Oct, 2011

2 commits


21 Sep, 2011

1 commit

  • Thus spake Andrew Morton:

    "And I have the usual maintainability whine. If someone comes up to
    vmscan.c and sees it calling blk_start_plug(), how are they supposed to
    work out why that call is there? They go look at the blk_start_plug()
    definition and it is undocumented. I think we can do better than this?"

    Adapted from the LWN article (http://lwn.net/Articles/438256/) by Jens
    Axboe and from an earlier attempt by Shaohua Li to document blk-plug.
    (A usage sketch of the plugging API follows this entry.)

    [akpm@linux-foundation.org: grammatical and spelling tweaks]
    Signed-off-by: Suresh Jayaraman
    Cc: Shaohua Li
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
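
    For reference, a minimal sketch of how the documented API is used by a
    caller (submit_many() and its arguments are hypothetical; submit_bio()
    took an rw argument in this era):

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static void submit_many(struct bio **bios, int nr)
        {
                struct blk_plug plug;
                int i;

                blk_start_plug(&plug);  /* requests gather on the task's plug list */
                for (i = 0; i < nr; i++)
                        submit_bio(READ, bios[i]);
                blk_finish_plug(&plug); /* flush the batch to the queue */
        }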
     

12 Sep, 2011

3 commits


24 Aug, 2011

1 commit

  • Clean up the code a little. attempt_plug_merge() traverses the plug
    list anyway, so we can do the request counting there; that also reduces
    stack size a bit.
    The motivation is that I suspected we might need to count requests per
    queue (a task could be handling multiple disks in the meantime), but my
    tests don't show that it's worth doing. If somebody proves we should do
    it, the change below will make that easier (a sketch follows this
    entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
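
    A sketch of the idea: the counting rides along with the plug-list walk
    that attempt_plug_merge() performs anyway (illustrative shape only, not
    the verbatim patch):

        /* caller has already checked that current->plug is non-NULL */
        static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
                                       unsigned int *request_count)
        {
                struct blk_plug *plug = current->plug;
                struct request *rq;

                *request_count = 0;
                list_for_each_entry_reverse(rq, &plug->list, queuelist) {
                        (*request_count)++;     /* count while scanning */
                        if (rq->q == q && bio_attempt_back_merge(q, rq, bio))
                                return true;
                }
                return false;
        }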
     

16 Aug, 2011

1 commit

  • Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae ("block: reimplement
    FLUSH/FUA to support merge") introduced a performance regression when
    running any sort of fsyncing workload using dm-multipath and certain
    storage (in our case, an HP EVA). The test I ran was fs_mark, and it
    dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
    that dm-multipath always advertised flush+fua support and passed
    commands on down the stack, where those flags used to get stripped off.
    The above commit changed that behavior:

    static inline struct request *__elv_next_request(struct request_queue *q)
    {
            struct request *rq;

            while (1) {
    -               while (!list_empty(&q->queue_head)) {
    +               if (!list_empty(&q->queue_head)) {
                            rq = list_entry_rq(q->queue_head.next);
    -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
    -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
    -                               return rq;
    -                       rq = blk_do_flush(q, rq);
    -                       if (rq)
    -                               return rq;
    +                       return rq;
                    }

    Note that previously, a command would come in here, have
    REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

    struct request *blk_do_flush(struct request_queue *q, struct request *rq)
    {
            unsigned int fflags = q->flush_flags; /* may change, cache it */
            bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
            bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
            bool do_postflush = has_flush && !has_fua &&
                                (rq->cmd_flags & REQ_FUA);
            unsigned skip = 0;
            ...
            if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                    rq->cmd_flags &= ~REQ_FLUSH;
                    if (!has_fua)
                            rq->cmd_flags &= ~REQ_FUA;
                    return rq;
            }

    So, the flush machinery was bypassed in such cases (q->flush_flags == 0
    && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

    Now, however, we don't get into the flush machinery at all. Instead,
    __elv_next_request just hands a request with flush and fua bits set to
    the scsi_request_fn, even if the underlying request_queue does not
    support flush or fua.

    The agreed upon approach is to fix the flush machinery to allow
    stacking. While this isn't used in practice (since there is only one
    request-based dm target, and that target will now reflect the flush
    flags of the underlying device), it does future-proof the solution, and
    make it function as designed.

    In order to make this work, I had to add a field to the struct request,
    inside the flush structure (to store the original req->end_io). Shaohua
    had suggested overloading the union with rb_node and completion_data,
    but the completion data is used by device mapper and can also be used by
    other drivers. So, I didn't see a way around the additional field.

    I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
    the lost performance. Comments and other testers, as always, are
    appreciated.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

01 Aug, 2011

1 commit

  • This moves the FC class's bsg code to the block layer and makes it a
    library so that other classes like iSCSI and SAS can use it.

    It is helpful because working with the request queue, bios, creating
    scatterlists, etc. is a pain that the LLD does not have to worry about
    for normal IOs and should not have to worry about for bsg requests.

    Signed-off-by: Mike Christie
    Signed-off-by: Jens Axboe

    Mike Christie
     

24 Jul, 2011

1 commit

  • Some systems benefit from completions always being steered to the strict
    requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up completely overloaded with work, while
    the others (including the original submitter) have power left to
    spare.

    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.

    echo 2 > /sys/block/<dev>/queue/rq_affinity

    Cc: Christoph Hellwig
    Cc: Roland Dreier
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

14 Jul, 2011

1 commit

  • Reorder request_queue to remove 16 bytes of alignment padding in 64-bit
    builds.

    On my config this shrinks the structure from 1608 to 1592 bytes, so it
    needs one fewer cacheline.

    Also, trivially move the open bracket { onto the same line as the
    structure name to make it easier to grep. (A generic illustration of
    the padding technique follows this entry.)

    Signed-off-by: Richard Kennedy
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Richard Kennedy
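
    The general technique, shown on a generic example rather than
    request_queue itself: on 64-bit, grouping the smaller members removes
    alignment holes.

        /* 24 bytes on x86-64: 4 bytes of padding after each int,
         * because the pointer must be 8-byte aligned. */
        struct padded {
                int a;
                void *p;
                int b;
        };

        /* 16 bytes: the two ints now share one 8-byte slot. */
        struct reordered {
                void *p;
                int a;
                int b;
        };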
     

08 Jul, 2011

2 commits

  • I was often confused about why preemption is not disabled when changing
    the blk_plug list. Add comments here in case others have similar
    concerns.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • When I tested an fio script with a big I/O depth, I found that total
    throughput drops compared to a relatively small I/O depth. The reason is
    that the thread accumulates big requests in its plug list, which causes
    some delays (surely this depends on CPU speed).
    We'd be better off with a threshold on the request count: once it is
    reached, there is no request merging, and queue lock contention isn't
    severe when pushing per-task requests to the queue, so the main
    advantages of block plugging no longer apply and we can force a flush of
    the plug list (a sketch follows this entry).
    With this, my test throughput actually increases and is almost equal to
    that of the small I/O depth. Another side effect is that irq-off time
    decreases in blk_flush_plug_list() for big I/O depth.
    BLK_MAX_REQUEST_COUNT is chosen arbitrarily; 16 is effective at reducing
    lock contention for me, but I'm open here, 32 is OK in my tests too.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
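
    A sketch of the resulting behavior in the plugged submission path (the
    constant matches the description; the surrounding fragment is
    illustrative):

        #define BLK_MAX_REQUEST_COUNT 16

                /* after linking the new request into the plug list: */
                list_add_tail(&req->queuelist, &plug->list);
                if (++request_count >= BLK_MAX_REQUEST_COUNT)
                        blk_flush_plug_list(plug, false);   /* cap the batch */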
     

01 Jul, 2011

1 commit


13 Jun, 2011

1 commit


31 May, 2011

1 commit

  • Since those macro-defined functions require an additional semicolon
    from the caller, they could cause syntax errors when used in if-else
    statements (a generic illustration follows this entry).

    Signed-off-by: Namhyung Kim
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Namhyung Kim
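
    The classic hazard, as a generic illustration (FOO_BROKEN/FOO_FIXED and
    the helpers are made-up names): a macro whose expansion already ends in
    a semicolon leaves an empty extra statement behind, detaching a
    following else.

        #define FOO_BROKEN(q)   do_foo(q);      /* trailing ';' in expansion */
        #define FOO_FIXED(q)    do { do_foo(q); } while (0)

        if (cond)
                FOO_FIXED(q);   /* ok: do-while swallows the caller's ';' */
        else                    /* "else without if" error with FOO_BROKEN */
                do_bar(q);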
     

21 May, 2011

1 commit

  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've had
    churn in the core area that makes it difficult to handle patches for
    e.g. cfq or blk-throttle. Instead of requiring that they be based on
    older versions with bugs that have since been fixed in the -rc cycle,
    merge in 2.6.39 final.

    Also fixes up conflicts in the below files.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 May, 2011

1 commit

  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs, and return 0 in the alignment and dzd fields for
    devices that don't support discard. (A sketch of the stacking rule
    follows this entry.)

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
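
    A sketch of the stacking rule: the stacked driver starts with the
    feature enabled and each incorporated bottom device ANDs its own
    capability in, so a single non-dzd component disables it for the whole
    stack (sketch_stack_dzd() is hypothetical; this kind of combining lives
    in blk_stack_limits()):

        static void sketch_stack_dzd(struct queue_limits *t,        /* stacked */
                                     const struct queue_limits *b)  /* bottom */
        {
                t->discard_zeroes_data &= b->discard_zeroes_data;
        }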
     

07 May, 2011

2 commits

  • In some drives, flush requests are non-queueable: while a flush request
    is running, normal read/write requests can't run. If the block layer
    dispatches such a request, the driver can't handle it and requeues it.
    Tejun suggested we hold the queue while a flush is running, which avoids
    the unnecessary requeue and can also improve performance. For example,
    given the requests flush1, write1, flush2: flush1 is dispatched, then
    the queue is held, so write1 isn't inserted into the queue. After flush1
    finishes, flush2 is dispatched; since the disk cache is already clean,
    flush2 finishes very soon, so it looks as if flush2 were folded into
    flush1.

    In my test, the queue holding completely solves a regression introduced by
    commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

    which causes about 20% regression running a sysbench fileio
    workload.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
     
  • Flush requests aren't queueable in some drives. Add a flag to let the
    driver notify the block layer about this; we can optimize flush
    performance with that knowledge (a usage sketch follows this entry).

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
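
    From a driver's point of view, the flag would be set with the helper
    this patch adds, assuming it follows the usual blk_queue_* pattern:

        /* in the driver's probe/init path, for a drive that cannot
         * queue a flush alongside other I/O: */
        blk_queue_flush_queueable(q, false);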
     

19 Apr, 2011

1 commit

  • We are currently using this flag to check whether it's safe
    to call into ->request_fn(). If it is set, we punt to kblockd.
    But we get a lot of false positives and excessive punts to
    kblockd, which hurts performance.

    The only real abuser of this infrastructure is SCSI. So export the
    async queue run and convert SCSI over to using it. There's room for
    improvement in that SCSI need not always use the async call, but this
    fixes our performance issue and they can fix that up in due time.
    (A usage sketch follows this entry.)

    Signed-off-by: Jens Axboe

    Jens Axboe
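
    A sketch of the exported helper's use: a path that must not call into
    ->request_fn() directly (for recursion or stack-depth reasons) punts the
    queue run to kblockd instead.

        /* e.g. from a completion or otherwise atomic context: */
        blk_run_queue_async(q);     /* kblockd runs the queue for us */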
     

18 Apr, 2011

3 commits


16 Apr, 2011

1 commit

  • Linus correctly observes that the most important dispatch cases
    are now done from kblockd, this isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd, and only
    use that from the schedule() path (sketched after this entry).

    Signed-off-by: Jens Axboe

    Jens Axboe
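
    Roughly, the schedule() side then reduces to the following (a sketch of
    the caller in the scheduler):

        /* just before the task goes to sleep: */
        if (blk_needs_flush_plug(tsk))
                blk_schedule_flush_plug(tsk);   /* offload flush to kblockd */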
     

15 Apr, 2011

2 commits

  • For the explicit unplugging, we'd prefer to kick things off
    immediately and not pay the penalty of the latency to switch
    to kblockd. So let blk_finish_plug() do the run inline, while
    the implicit-on-schedule-out unplug will punt to kblockd.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's a bit of a mess currently. task->plug is being cleared
    and reset in __blk_finish_plug(), and blk_finish_plug() is
    testing for a NULL plug which cannot happen even from schedule()
    anymore since it uses blk_needs_flush_plug() to determine
    whether to call into this function at all.

    So get rid of some of the cruft.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Apr, 2011

1 commit


06 Apr, 2011

1 commit

  • The current block integrity (DIF/DIX) support in DM is verifying that
    all devices' integrity profiles match during DM device resume (which
    is past the point of no return). To some degree that is unavoidable
    (stacked DM devices force this late checking). But for most DM
    devices (which aren't stacking on other DM devices) the ideal time to
    verify all integrity profiles match is during table load.

    Introduce the notion of an "initialized" integrity profile: a profile
    that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
    template. Add blk_integrity_is_initialized() to allow checking if a
    profile was initialized.

    Update DM integrity support to:
    - check that all devices with _initialized_ integrity profiles match
      during table load; uninitialized profiles (e.g. for the underlying DM
      device(s) of a stacked DM device) are ignored
    - disallow a table load that would result in an integrity profile that
      conflicts with a DM device's existing (in-use) integrity profile
    - avoid clearing an existing integrity profile
    - validate that all integrity profiles match during resume; if they
      don't, all we can do is report the mismatch (during resume we're past
      the point of no return)

    (A sketch of the table-load check follows this entry.)

    Signed-off-by: Mike Snitzer
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
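
    A sketch of the table-load check, assuming a DM-style iteration over the
    table's devices (dd and the surrounding loop are illustrative):

        /* during table load: ignore devices whose integrity profile was
         * never initialized, e.g. the underlying device of a DM stack */
        if (!blk_integrity_is_initialized(dd->dm_dev.bdev->bd_disk))
                continue;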
     

12 Mar, 2011

1 commit


10 Mar, 2011

4 commits

  • Conflicts:
    block/blk-core.c
    block/blk-flush.c
    drivers/md/raid1.c
    drivers/md/raid10.c
    drivers/md/raid5.c
    fs/nilfs2/btnode.c
    fs/nilfs2/mdt.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging, and
    delay users have been converted to use the new API for that. So let's
    kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we use plugging for that, but as plugging is going away,
    we need an alternative mechanism.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Mar, 2011

1 commit

  • This merge creates two set of conflicts. One is simple context
    conflicts caused by removal of throtl_scheduled_delayed_work() in
    for-linus and removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The
    conflict isn't trivial, but the resolution is straightforward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo