28 Feb, 2013

3 commits

  • Convert to the much saner new idr interface. Both bsg and genhd
    protect the idr with a mutex, making preloading unnecessary.
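
    A minimal sketch of the new-style allocation pattern this conversion
    moves to; the idr/mutex names mirror genhd but are illustrative here:

        #include <linux/idr.h>
        #include <linux/mutex.h>

        static DEFINE_MUTEX(ext_devt_mutex);
        static DEFINE_IDR(ext_devt_idr);

        static int example_alloc_id(void *ptr)
        {
                int id;

                mutex_lock(&ext_devt_mutex);
                /* one call replaces the old idr_pre_get()/idr_get_new()
                 * pair; the mutex already serializes callers, so no
                 * preloading is needed */
                id = idr_alloc(&ext_devt_idr, ptr, 0, 0, GFP_KERNEL);
                mutex_unlock(&ext_devt_mutex);

                return id;  /* >= 0 on success, -ENOMEM/-ENOSPC on error */
        }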

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • idr allocation in blk_alloc_devt() wasn't synchronized against lookup
    and removal, and its limit check was off by one - 1 << MINORBITS is
    the number of minors allowed, not the maximum allowed minor.

    Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
    checking.
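
    A hedged sketch of the renamed constant and the corrected comparison
    (surrounding code illustrative):

        /* NR_EXT_DEVT counts the extended minors, so valid indices are
         * 0 .. NR_EXT_DEVT - 1 */
        #define NR_EXT_DEVT (1 << MINORBITS)

        /* the old check, "idx > MAX_EXT_DEVT", let one extra id through */
        if (idx >= NR_EXT_DEVT) {
                idr_remove(&ext_devt_idr, idx);
                return -EBUSY;
        }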

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • While adding and removing a lot of disks and partitions, this
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because the dev_t is freed while the number is
    still in use and idr_get_new() is not protected on every use. The fix
    adds a mutex where one was missing and moves the dev_t free function so
    that it is called after device_del().
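
    A sketch of the teardown ordering the fix establishes (function name
    and placement illustrative):

        static void example_del_gendisk(struct gendisk *disk)
        {
                /* tear down sysfs first, removing the /dev/block/MAJ:MIN
                 * symlink that triggered the duplicate warning... */
                device_del(disk_to_dev(disk));

                /* ...and only then release the dev_t, so a new disk cannot
                 * be handed a number that is still visible in sysfs */
                blk_free_devt(disk_to_dev(disk)->devt);
        }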

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

24 Feb, 2013

1 commit

  • Apply the newly introduced pm_runtime_set_memalloc_noio() to block
    devices so that the PM core will teach mm not to allocate memory with
    GFP_IOFS when calling the runtime_resume and runtime_suspend callbacks
    for block devices and their ancestors.
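
    A minimal sketch of the call being applied (placement in the disk
    add/remove paths is illustrative):

        #include <linux/pm_runtime.h>

        static void example_disk_pm_mark(struct device *ddev, bool enable)
        {
                /* flags ddev and all its ancestors so their runtime PM
                 * callbacks run with memalloc_noio in effect, i.e. no
                 * GFP_IOFS allocations during resume/suspend */
                pm_runtime_set_memalloc_noio(ddev, enable);
        }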

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     

22 Feb, 2013

2 commits

  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize this page snapshotting in one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relax the wait_on_page_writeback calls so that waiting
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will give back the performance lost to unnecessary waiting
    since the original stable page write patchset went into 3.0. Sorry about
    not fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.
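
    As a sketch, the opt-in amounts to a mount-time superblock flag; the
    hook placement is illustrative, and ext3 is the only in-tree user:

        /* in the filesystem's fill_super: snapshot pages via bounce
         * buffers instead of waiting on writeback */
        sb->s_flags |= MS_SNAP_STABLE;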

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the results of using dbench to measure
    latencies on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    WriteX 85624 0.152 33.078
    ReadX 272090 0.010 61.210
    Flush 12129 36.219 168.260

    Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms

    3.8.0-rc3 + patches:
    WriteX 86082 0.141 30.928
    ReadX 273358 0.010 36.124
    Flush 12214 34.800 165.689

    Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    WriteX 125739 0.028 104.343
    ReadX 399070 0.005 4.115
    Flush 17851 25.004 131.390

    Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms

    3.8.0-rc3 + patches:
    WriteX 123529 0.028 6.299
    ReadX 392434 0.005 4.287
    Flush 17549 25.120 188.687

    Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    WriteX 67122 0.083 82.355
    ReadX 212719 0.005 2.828
    Flush 9547 47.561 147.418

    Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms

    3.8.0-rc3 + patches:
    WriteX 64898 0.101 71.631
    ReadX 206673 0.005 7.123
    Flush 9190 47.963 219.034

    Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_for_page_writeback() if nothing requires stable pages.
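
    A hedged sketch of the conditional wait this flag eventually enables;
    the helper names follow the patchset, but the exact bodies are
    illustrative:

        void example_wait_for_stable_page(struct page *page)
        {
                struct backing_dev_info *bdi =
                        page->mapping->backing_dev_info;

                /* only devices that declared the capability force a wait */
                if (!bdi_cap_stable_pages_required(bdi))
                        return;

                wait_on_page_writeback(page);
        }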

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

20 Feb, 2013

2 commits

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

23 Jan, 2013

1 commit

  • Block layer allows selecting an elevator which is built as a module to
    be selected as system default via kernel param "elevator=". This is
    achieved by automatically invoking request_module() whenever a new
    block device is initialized and the elevator is not available.

    This led to an interesting deadlock problem involving async and module
    init. Block device probing running off an async job invokes
    request_module(). While the module is being loaded, it performs
    async_synchronize_full() which ends up waiting for the async job which
    is already waiting for request_module() to finish, leading to
    deadlock.

    Invoking request_module() from deep in block device init path is
    already nasty in itself. It seems best to avoid these situations from
    the beginning by moving on-demand module loading out of block init
    path.

    The previous patch made sure that the default elevator module is
    loaded early during boot if available. This patch removes on-demand
    loading of the default elevator from elevator init path. As the
    module would have been loaded during boot, userland-visible behavior
    difference should be minimal.
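
    Roughly, the elevator lookup becomes the following sketch, where the
    bool gates the request_module() fallback (see the v2 note below for
    the parameter name):

        static struct elevator_type *example_elevator_get(const char *name,
                                                          bool try_loading)
        {
                struct elevator_type *e;

                spin_lock(&elv_list_lock);
                e = elevator_find(name);
                if (!e && try_loading) {
                        /* only the early-boot caller asks for this now */
                        spin_unlock(&elv_list_lock);
                        request_module("%s-iosched", name);
                        spin_lock(&elv_list_lock);
                        e = elevator_find(name);
                }
                spin_unlock(&elv_list_lock);
                return e;
        }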

    For more details, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1420814

    v2: The bool parameter was named @request_module which conflicted with
    request_module(). This built okay w/ CONFIG_MODULES because
    request_module() was defined as a macro. W/o CONFIG_MODULES, it
    causes build breakage. Rename the parameter to @try_loading.
    Reported by Fengguang.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

19 Jan, 2013

1 commit

  • This patch adds default module loading and uses it to load the default
    block elevator. During boot, it's called right after initramfs or
    initrd is made available and right before control is passed to
    userland. This ensures that as long as the modules are available in
    the usual places in initramfs, initrd or the root filesystem, the
    default modules are loaded as soon as possible.

    This will replace the on-demand elevator module loading from elevator
    init path.

    v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
    robot.
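
    A sketch of the boot-time hook; the names follow the patch description,
    the bodies are illustrative:

        void __init example_load_default_elevator_module(void)
        {
                struct elevator_type *e;

                if (!chosen_elevator[0])
                        return;         /* no elevator= parameter given */

                spin_lock(&elv_list_lock);
                e = elevator_find(chosen_elevator);
                spin_unlock(&elv_list_lock);

                if (!e)
                        request_module("%s-iosched", chosen_elevator);
        }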

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

20 Dec, 2012

2 commits

  • Remove a race condition which causes a warning in disk_clear_events().
    This is a race between disk_clear_events() and disk_flush_events():
    ev->clearing will be altered by disk_flush_events() even though we are
    blocking event checking through disk_clear_events(). If this happens
    after ev->clearing was cleared for disk_clear_events(), this can cause
    the WARN_ON_ONCE() in that function to be triggered.

    This change also has disk_clear_events() not go through a workqueue.
    Since we have to wait for the work to complete, we should just call the
    function directly. Also, since this work cannot be put on a freezable
    workqueue, it will have to contend with increased demand, so calling the
    function directly avoids this.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Derek Basehore
    Cc: Mandeep Singh Baines
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Derek Basehore
     
  • In disk_clear_events, do not put work on system_nrt_freezable_wq.
    Instead, put it on system_nrt_wq.

    There is a race between probing a usb and suspending the device. Since
    probing a usb calls disk_clear_events, which puts work on a frozen
    workqueue, probing cannot finish after the workqueue is frozen. However,
    suspending cannot finish until the usb probe is finished, so we get a
    deadlock, causing the system to reboot.

    The way to reproduce this bug is to wake up from suspend with a usb
    storage device plugged in, or plugging in a usb storage device right
    before suspend. The window of time is on the order of time it takes to
    probe the usb device. As long as the workqueues are frozen before the
    call to add_disk within sd_probe_async finishes, there will be a deadlock
    (which calls blkdev_get, sd_open, check_disk_change, then
    disk_clear_events). This is not difficult to reproduce after figuring out
    the timings.

    [akpm@linux-foundation.org: fix up comment]
    Signed-off-by: Derek Basehore
    Reviewed-by: Mandeep Singh Baines
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Derek Basehore
     

18 Dec, 2012

4 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton : (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc/<pid>/fdinfo/<fd> output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • Currently only block_dev and uprobes use percpu_rw_semaphore, so add
    the config option, selected by BLOCK || UPROBES.

    Signed-off-by: Oleg Nesterov
    Cc: Anton Arapov
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Mikulas Patocka
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific lock,
    allowing this to be pulled out of md, and now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     
  • Pull block layer core updates from Jens Axboe:
    "Here are the core block IO bits for 3.8. The branch contains:

    - The final version of the surprise device removal fixups from Bart.

    - Don't hide EFI partitions under advanced partition types. It's
    fairly widespread these days. This is especially dangerous for
    systems that have both msdos and efi partition tables, where you
    want to keep them in sync.

    - Cleanup of using -1 instead of the proper NUMA_NO_NODE

    - Export control of bdi flusher thread CPU mask and default to using
    the home node (if known) from Jeff.

    - Export unplug tracepoint for MD.

    - Core improvements from Shaohua. Reinstate the recursive merge, as
    the original bug has been fixed. Add plugging for discard and also
    fix a problem handling non pow-of-2 discard limits.

    There's a trivial merge in block/blk-exec.c due to a fix that went
    into 3.7-rc at a later point than -rc4 where this is based."

    * 'for-3.8/core' of git://git.kernel.dk/linux-block:
    block: export block_unplug tracepoint
    block: add plug for blkdev_issue_discard
    block: discard granularity might not be power of 2
    deadline: Allow 0ms deadline latency, increase the read speed
    partitions: enable EFI/GPT support by default
    bsg: Remove unused function bsg_goose_queue()
    block: Make blk_cleanup_queue() wait until request_fn finished
    block: Avoid scheduling delayed work on a dead queue
    block: Avoid that request_fn is invoked on a dead queue
    block: Let blk_drain_queue() caller obtain the queue lock
    block: Rename queue dead flag
    bdi: add a user-tunable cpu_list for the bdi flusher threads
    block: use NUMA_NO_NODE instead of -1
    block: recursive merge requests
    block CFQ: avoid moving request to different queue

    Linus Torvalds
     

15 Dec, 2012

3 commits

  • This allows stacked devices (like md/raid5) to provide blktrace
    tracing, including unplug events.
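
    The change is essentially the one-line export, sketched here:

        EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);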

    Reported-by: Fengguang Wu
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • Last post of this patch appears lost, so I resend this.

    Now that discard merge works, add a plug for blkdev_issue_discard. This
    will help discard request merging, especially in the raid0 case. In
    raid0, a big discard request is split into small requests, and if a
    correct plug is added, such small requests can be merged in the lower
    layer.
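
    A sketch of the shape of the change: wrap the submission loop in a
    plug so the split requests sit on a per-task list and can merge before
    dispatch (loop body elided, names illustrative):

        static void example_issue_discards(struct block_device *bdev,
                                           sector_t sector, sector_t nr_sects)
        {
                struct blk_plug plug;

                blk_start_plug(&plug);
                /* ...build and submit one discard bio per granule here;
                 * adjacent ones can now be merged before dispatch... */
                blk_finish_plug(&plug);
        }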

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In the MD raid case, discard granularity might not be a power of 2;
    for example, a 4-disk raid5 has a discard granularity of 3*chunk_size.
    Correct the calculation for such cases.
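
    A sketch of the arithmetic point (values illustrative):

        static sector_t example_align_discard(struct request_queue *q,
                                              sector_t nr_sects)
        {
                /* granularity in sectors; on a 4-disk raid5 this can be
                 * 3 * chunk_size, which is not a power of two */
                unsigned int granularity =
                        max(q->limits.discard_granularity >> 9, 1U);

                /* bitmasking (nr &= ~(granularity - 1)) only rounds
                 * correctly for powers of two; modulo handles any value
                 * (real kernel code would use sector_div() here) */
                return nr_sects - (nr_sects % granularity);
        }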

    Reported-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

13 Dec, 2012

1 commit

  • Pull cgroup changes from Tejun Heo:
    "A lot of activities on cgroup side. The big changes are focused on
    making cgroup hierarchy handling saner.

    - cgroup_rmdir() had peculiar semantics - it allowed cgroup
    destruction to be vetoed by individual controllers and tried to
    drain refcnt synchronously. The vetoing never worked properly and
    caused good deal of contortions in cgroup. memcg was the last
    reamining user. Michal Hocko removed the usage and cgroup_rmdir()
    path has been simplified significantly. This was done in a
    separate branch so that the memcg people can base further memcg
    changes on top.

    - The above allowed cleaning up cgroup lifecycle management and
    implementation of generic cgroup iterators which are used to
    improve hierarchy support.

    - cgroup_freezer updated to allow migration in and out of a frozen
    cgroup and handle hierarchy. If a cgroup is frozen, all descendant
    cgroups are frozen.

    - netcls_cgroup and netprio_cgroup updated to handle hierarchy
    properly.

    - Various fixes and cleanups.

    - Two merge commits. One to pull in memcg and rmdir cleanups (needed
    to build iterators). The other pulled in cgroup/for-3.7-fixes for
    device_cgroup fixes so that further device_cgroup patches can be
    stacked on top."

    Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
    commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
    touching code close to commit 2ef37d3fe4 ("memcg: Simplify
    mem_cgroup_force_empty_list error handling") in for-3.8)

    * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
    cgroup: update Documentation/cgroups/00-INDEX
    cgroup_rm_file: don't delete the uncreated files
    cgroup: remove subsystem files when remounting cgroup
    cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
    cgroup: warn about broken hierarchies only after css_online
    cgroup: list_del_init() on removed events
    cgroup: fix lockdep warning for event_control
    cgroup: move list add after list head initilization
    netprio_cgroup: allow nesting and inherit config on cgroup creation
    netprio_cgroup: implement netprio[_set]_prio() helpers
    netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
    netprio_cgroup: reimplement priomap expansion
    netprio_cgroup: shorten variable names in extend_netdev_table()
    netprio_cgroup: simplify write_priomap()
    netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
    cgroup: remove obsolete guarantee from cgroup_task_migrate.
    cgroup: add cgroup->id
    cgroup, cpuset: remove cgroup_subsys->post_clone()
    cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
    cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
    ...

    Linus Torvalds
     

06 Dec, 2012

7 commits

  • The Kconfig currently enables MSDOS partitions by default because they
    are assumed to be essential, but it's necessary to enable "advanced
    partition selection" in order to get GPT support. IMO GPT partitions
    are becoming common enough to deserve the same treatment MSDOS
    partitions get.

    (Side note: I got bit by a disk that had MSDOS and GPT partition
    tables, but for some reason the MSDOS table was different from the
    GPT one. I was stupid enough to disable "advanced partition
    selection" in my .config, which disabled GPT partitioning and made
    my btrfs pool unbootable because it couldn't find the partitions)

    Signed-off-by: Diego Calleja
    Signed-off-by: Jens Axboe

    Diego Calleja
     
  • The function bsg_goose_queue() does not have any in-tree callers,
    so let's remove it.

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Some request_fn implementations, e.g. scsi_request_fn(), unlock
    the queue lock internally. This may result in multiple threads
    executing request_fn for the same queue simultaneously. Keep
    track of the number of active request_fn calls and make sure that
    blk_cleanup_queue() waits until all active request_fn invocations
    have finished. A block driver may start cleaning up resources
    needed by its request_fn as soon as blk_cleanup_queue() has finished,
    so blk_cleanup_queue() must wait for all outstanding request_fn
    invocations to finish.
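
    A hedged sketch of the counting scheme described above; the field name
    follows the patch, the placement is illustrative:

        static void example_run_queue_uncond(struct request_queue *q)
        {
                if (unlikely(blk_queue_dead(q)))
                        return;

                q->request_fn_active++; /* blk_drain_queue() waits on this */
                q->request_fn(q);       /* may drop + retake queue_lock */
                q->request_fn_active--;
        }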

    Signed-off-by: Bart Van Assche
    Reported-by: Chanho Min
    Cc: James Bottomley
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Running a queue must continue after it has been marked dying until
    it has been marked dead, so the function blk_run_queue_async() must
    not schedule delayed work after blk_cleanup_queue() has marked the
    queue dead. Hence add a test for that queue state in
    blk_run_queue_async() and make sure that queue_unplugged() invokes
    that function with the queue lock held. This prevents the queue state
    from changing after it has been tested and before mod_delayed_work()
    is invoked. Drop the queue dying test in queue_unplugged() since it is
    now superfluous: __blk_run_queue() already tests whether or not the
    queue is dead.
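
    A sketch of the resulting check; the caller holds the queue lock, so
    the state cannot change between the test and mod_delayed_work() (exact
    form illustrative):

        void example_run_queue_async(struct request_queue *q)
        {
                if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
                        mod_delayed_work(kblockd_workqueue,
                                         &q->delay_work, 0);
        }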

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() has finished, so request_fn
    must not be invoked after draining has finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Let the caller of blk_drain_queue() obtain the queue lock to improve
    readability of the patch called "Avoid that request_fn is invoked on
    a dead queue".

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set, queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

23 Nov, 2012

3 commits

  • After we've done __elv_add_request() and __blk_run_queue() in
    blk_execute_rq_nowait(), the request might finish and be freed
    immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
    isn't safe afterwards, because if it isn't, rq might be gone.
    Instead, check beforehand and stash the result in a temporary.

    This fixes crashes in blk_execute_rq_nowait() I get occasionally when
    running with lots of memory debugging options enabled -- I think this
    race is usually harmless because the window for rq to be reallocated
    is so small.
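
    A sketch of the fix: sample the request type while rq is still
    guaranteed to be alive, then test the stashed value (function shape
    abbreviated and illustrative):

        void example_execute_rq_nowait(struct request_queue *q,
                                       struct request *rq, int at_head,
                                       rq_end_io_fn *done)
        {
                int where = at_head ? ELEVATOR_INSERT_FRONT
                                    : ELEVATOR_INSERT_BACK;
                bool is_pm_resume = rq->cmd_type == REQ_TYPE_PM_RESUME;

                rq->end_io = done;

                spin_lock_irq(q->queue_lock);
                __elv_add_request(q, rq, where);
                __blk_run_queue(q);
                /* rq may already be freed here; don't dereference it */
                if (is_pm_resume)
                        q->request_fn(q);
                spin_unlock_irq(q->queue_lock);
        }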

    Signed-off-by: Roland Dreier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Roland Dreier
     
  • The MSDOS/MBR partition table includes a 32-bit unique ID, often referred
    to as the NT disk signature. When combined with a partition number within
    the table, this can form a unique ID similar in concept to EFI/GPT's
    partition UUID. Constructing and recording this value in struct
    partition_meta_info allows MSDOS partitions to be referred to on the
    kernel command-line using the following syntax:

    root=PARTUUID=0002dd75-01
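
    A sketch of how the ID string is composed (values from the example
    above; the helper is hypothetical):

        static void example_fill_partuuid(char *uuid, size_t len,
                                          u32 disksig, int slot)
        {
                /* e.g. disksig 0x0002dd75, slot 1 -> "0002dd75-01" */
                snprintf(uuid, len, "%08x-%02x", disksig, slot);
        }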

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     
  • This will allow other types of UUID to be stored here, aside from true
    UUIDs. This also simplifies code that uses this field, since it's usually
    constructed from, used as, or compared to other strings.

    Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
    /PARTNROFF option was found to be present. However, this modifies the
    input string, and causes subsequent calls to devt_from_partuuid() not to
    see the /PARTNROFF option, which causes different results. In order to
    avoid misleading future maintainers, this parameter is marked const.

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     

09 Nov, 2012

1 commit

  • In a workload, thread 1 accesses a, a+2, ..., while thread 2 accesses
    a+1, a+3, .... When the requests are flushed to the queue, a and a+1
    are merged to (a, a+1), and a+2 and a+3 to (a+2, a+3), but (a, a+1)
    and (a+2, a+3) aren't merged.

    If we do a recursive merge for such interleaved access, the throughput
    of some workloads improves. A recent workload I'm checking is swap; the
    change below boosts throughput by around 5% ~ 10%.
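
    A hedged sketch of the recursive back-merge loop (the helpers exist in
    the block layer of this era, but the exact form is illustrative):

        static bool example_recursive_backmerge(struct request_queue *q,
                                                struct request *rq)
        {
                struct request *__rq;
                bool ret = false;

                /* a merged request may itself now be mergeable with the
                 * next request found by the hash lookup, so keep going */
                while (1) {
                        __rq = elv_rqhash_find(q, blk_rq_pos(rq));
                        if (!__rq || !blk_attempt_req_merge(q, __rq, rq))
                                break;

                        ret = true;
                        rq = __rq;
                }
                return ret;
        }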

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

06 Nov, 2012

3 commits

  • A request is queued in the cfqq->fifo list. It looks possible that we
    move a request from one cfqq to another in the request merge case. In
    such a case, adjusting the fifo list order doesn't make sense and is
    impossible without iterating the whole fifo list.

    My test does hit a case where the two cfqqs are different, but it didn't
    cause a kernel crash, maybe because the fifo list isn't used frequently.
    Anyway, from the code logic, this is buggy.

    I think we can re-enable the recursive merge logic after this is fixed.
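
    The guard the fix would add in the merge path looks roughly like this
    sketch (fragment from a cfq_merged_requests()-style function,
    illustrative):

        /* only touch the fifo ordering when both requests belong to the
         * same cfqq; the cross-cfqq move is the buggy case above */
        if (cfqq == RQ_CFQQ(next) &&
            time_before(rq_fifo_time(next), rq_fifo_time(rq))) {
                list_move(&rq->queuelist, &next->queuelist);
                rq_set_fifo_time(rq, rq_fifo_time(next));
        }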

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull rmdir updates into for-3.8 so that further callback updates can
    be put on top. This pull created a trivial conflict between the
    following two commits.

    8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
    ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")

    The former added a field to cgroup_subsys and the latter removed one
    from it. They happen to be colocated causing the conflict. Keeping
    what's added and removing what's removed resolves the conflict.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • All ->pre_destroy() implementations return 0 now, which is the only
    allowed return value. Make it return void.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Vivek Goyal

    Tejun Heo
     

26 Oct, 2012

1 commit

  • My workload is a raid5 with 16 disks, using our filesystem to write in
    direct-io mode.

    I used blktrace and found these messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the above messages, we can see that rq[W 7493144 + 104] and
    rq[W 7493120 + 24] do not merge, because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    bio(7493144) comes first and bio(7493120) later, so the subsequent bios
    are divided into two parts. When flushing the plug list,
    elv_attempt_insert_merge only supports back merges, not front merges,
    so rq[7493120 + 24] can't merge with rq[7493144 + 104].

    From my test, I found this situation accounts for about 25% of cases
    on our system. With this patch, the situation no longer occurs.
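
    A sketch consistent with the fix described above: extend the plug-list
    sort comparator so requests are ordered by start sector within a
    queue, letting back merges suffice at flush time (names and exact form
    illustrative):

        static int example_plug_rq_cmp(void *priv, struct list_head *a,
                                       struct list_head *b)
        {
                struct request *rqa =
                        container_of(a, struct request, queuelist);
                struct request *rqb =
                        container_of(b, struct request, queuelist);

                /* group by queue first, then order by start sector */
                if (rqa->q != rqb->q)
                        return rqa->q < rqb->q ? -1 : 1;
                return blk_rq_pos(rqa) < blk_rq_pos(rqb) ? -1 : 1;
        }

    blk_flush_plug_list() would pass this comparator to list_sort() before
    inserting the requests.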

    Signed-off-by: Jianpeng Ma
    CC:Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma