18 May, 2011

1 commit

  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs and return 0 in the alignment and dzd fields for
    devices that don't support discard.

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
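
The stacking rule described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct limits_sketch` and the three helper names are hypothetical stand-ins, and only the discard_zeroes_data field is modelled.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a device's queue limits. */
struct limits_sketch {
    bool discard_zeroes_data;
};

/* A stacking driver starts out with the feature enabled. */
void set_stacking_defaults(struct limits_sketch *lim)
{
    lim->discard_zeroes_data = true;
}

/* A low-level driver starts with the feature cleared and must opt in. */
void set_low_level_defaults(struct limits_sketch *lim)
{
    lim->discard_zeroes_data = false;
}

/* Fold one component device into the stacked limits: the feature
 * survives only if every component supports it. */
void stack_limits_sketch(struct limits_sketch *top,
                         const struct limits_sketch *bottom)
{
    top->discard_zeroes_data =
        top->discard_zeroes_data && bottom->discard_zeroes_data;
}
```

Starting the stacked device at "enabled" and AND-ing in each component means a single device without dzd support is enough to disable the feature for the whole stack.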
     

09 May, 2011

1 commit

  • Stephen reports:

    -----

    After merging the block tree, today's linux-next build (x86_64
    allmodconfig) produced this warning:

    fs/partitions/check.c: In function 'part_discard_alignment_show':
    fs/partitions/check.c:263: warning: format '%u' expects type 'unsigned int', but argument 3 has type 'long long unsigned int'

    Introduced by commit ("block: Remove extra discard_alignment from
    hd_struct")

    -----

    Fix it up by just removing the cast, we return an int already.

    Reported-by: Stephen Rothwell
    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 May, 2011

7 commits

  • Currently, hd_struct.discard_alignment is only used when we
    show /sys/block/sdx/sdx/discard_alignment. So remove it and
    calculate the value on demand when it is shown.

    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     
  • Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
    bios fails because the underlying device does not support discard
    requests. However, if the device is, for example, a dm device composed
    of devices of which some support discard and some do not, it is ok for
    some bios to fail with EOPNOTSUPP, but that does not mean that discard
    is not supported at all.

    This commit removes the check for bios that failed with EOPNOTSUPP and
    changes blkdev_issue_discard() to return operation not supported if and
    only if the device does not actually support it, not just part of the
    device as some bios might indicate.

    This change also fixes a problem with the BLKDISCARD ioctl(), which now
    works correctly on such dm devices.

    Signed-off-by: Lukas Czerner
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
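
The return-value policy described above can be sketched as follows. This is a user-space model under stated assumptions: `issue_discard_sketch()` and its parameters are hypothetical, the per-bio completion statuses are passed in as an array, and mapping any other bio failure to -EIO is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * bio_errors[] holds the completion status of each discard bio:
 * 0 on success, a negative errno on failure (hypothetical model).
 */
int issue_discard_sketch(bool queue_supports_discard,
                         const int *bio_errors, size_t nr_bios)
{
    int ret = 0;

    /* Only the device as a whole decides "not supported". */
    if (!queue_supports_discard)
        return -EOPNOTSUPP;

    for (size_t i = 0; i < nr_bios; i++) {
        /* One component lacking discard is not a failure. */
        if (bio_errors[i] == -EOPNOTSUPP)
            continue;
        if (bio_errors[i] < 0)
            ret = -EIO;
    }
    return ret;
}
```

With this shape, a stacked device whose components only partly support discard reports success, while a device that does not advertise discard at all still gets -EOPNOTSUPP.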
     
  • In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do
    not need to check for -EOPNOTSUPP specifically in case of error. Also,
    there is no need for the submit: label, because there is no way to leave
    the while loop without an error, and in that case we really want to exit
    rather than try again. Also remove the check for (sz == 0), since at
    that point sz can never be zero.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • Currently we wait for every submitted REQ_DISCARD bio separately, which
    can have the unwanted consequence of repeatedly flushing the queue.
    Instead, submit bios in batches and wait for the entire batch, narrowing
    the window in which other I/O can get in.

    Use bio_batch_end_io() and struct bio_batch for that purpose, the same
    ones used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so
    we always set !BIO_UPTODATE in the case of error and remove the check
    for bb, since we are the only user of this function and we always set
    this.

    Remove bio_get()/bio_put() from blkdev_issue_discard() since
    bio_alloc() and bio_batch_end_io() are doing the same thing, hence they
    are not needed anymore.

    I have done simple dd testing with surprising results. The script I have
    used is:

    for i in $(seq 10); do
    echo $i
    dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
    sleep 5
    done
    /usr/bin/time -f %e ./blkdiscard /dev/sdc1

    Running time of BLKDISCARD on the whole device (elapsed seconds):

    with patch    without patch
    0.95          15.58

    So we can see that in this artificial test the kernel with the patch
    applied is approx 16x faster in discarding the device.

    Signed-off-by: Lukas Czerner
    CC: Dmitry Monakhov
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
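
The batching idea can be illustrated with a small single-threaded model. It is only a sketch: `struct bio_batch_sketch` and the `batch_*` helpers are hypothetical names, and the real code waits on a completion rather than polling a counter.

```c
#include <stdbool.h>

/* Hypothetical model of one batch of in-flight bios. */
struct bio_batch_sketch {
    int submitted;  /* bios handed to the device */
    int done;       /* completions seen so far */
    bool uptodate;  /* cleared as soon as any bio reports an error */
};

void batch_init(struct bio_batch_sketch *bb)
{
    bb->submitted = 0;
    bb->done = 0;
    bb->uptodate = true;
}

/* Submitting does not wait; it only accounts for the new bio. */
void batch_submit(struct bio_batch_sketch *bb)
{
    bb->submitted++;
}

/* Completion handler: record the error, then count the completion. */
void batch_end_io(struct bio_batch_sketch *bb, int err)
{
    if (err)
        bb->uptodate = false;
    bb->done++;
}

/* The submitter waits once, for the whole batch. */
bool batch_finished(const struct bio_batch_sketch *bb)
{
    return bb->done == bb->submitted;
}
```

The point of the design is that the submitter blocks once per batch instead of once per bio, so other I/O has fewer opportunities to interleave with the discards.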
     
  • Enable non-queueable flush flag for SATA.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
     
  • In some drives, flush requests are non-queueable. While a flush request
    is running, normal read/write requests can't run. If the block layer
    dispatches such a request, the driver can't handle it and requeues it.
    Tejun suggested we can hold the queue while a flush is running. This
    avoids unnecessary requeues and can also improve performance. For
    example, say we have requests flush1, write1, flush2. flush1 is
    dispatched, then the queue is held and write1 isn't inserted into the
    queue. After flush1 finishes, flush2 is dispatched. Since the disk
    cache is already clean, flush2 finishes very quickly, so it looks as if
    flush2 is folded into flush1.

    In my test, the queue holding completely solves a regression introduced by
    commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

    which causes about 20% regression running a sysbench fileio
    workload.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
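
A very reduced model of the queue hold follows. It is a sketch under assumptions: `struct queue_sketch`, `may_dispatch()` and `flush_done()` are hypothetical names, and the model only answers "may this request be dispatched now?" without reproducing the kernel's flush machinery.

```c
#include <stdbool.h>

enum req_kind_sketch { REQ_WRITE_SKETCH, REQ_FLUSH_SKETCH };

struct queue_sketch {
    bool flush_not_queueable;  /* set by the driver for such drives */
    bool flush_in_flight;      /* a flush has been dispatched */
};

/* Returns true if the request may be dispatched to the driver now. */
bool may_dispatch(struct queue_sketch *q, enum req_kind_sketch kind)
{
    /* Hold the queue instead of dispatching something the driver
     * would only requeue. */
    if (q->flush_not_queueable && q->flush_in_flight)
        return false;
    if (kind == REQ_FLUSH_SKETCH)
        q->flush_in_flight = true;
    return true;
}

/* Called when the in-flight flush completes; releases the hold. */
void flush_done(struct queue_sketch *q)
{
    q->flush_in_flight = false;
}
```

For drives that can queue flushes the flag stays clear and nothing is ever held, so the behaviour for those drives is unchanged.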
     
  • Flush requests aren't queueable in some drives. Add a flag to let the
    driver notify the block layer about this. We can optimize flush
    performance with this knowledge.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
     

06 May, 2011

2 commits


22 Apr, 2011

3 commits

  • Disk event code automatically blocks events on excl write. This is
    primarily to avoid issuing polling commands while burning is in
    progress. This behavior doesn't fit other types of devices with
    removable media where polling commands don't have adverse side
    effects and door locking usually doesn't exist.

    This patch introduces a new genhd flag which controls the auto-blocking
    behavior and uses it to enable auto-blocking only on optical devices.

    Note for stable: 2.6.38 and later only

    Cc: stable@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
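
The flag-gated behavior could look roughly like this. The flag name, its value and `struct disk_sketch` are hypothetical and chosen for illustration only; the sketch shows just the decision, not the event-polling code.

```c
/* Hypothetical per-disk flag: block events on exclusive write opens. */
#define DISK_FL_BLOCK_EVENTS_ON_EXCL_WRITE_SKETCH 0x1u

struct disk_sketch {
    unsigned int flags;
    int events_blocked;  /* nesting count of event blocks */
};

/* Exclusive write open: block event polling only if the disk asked
 * for it (in this sketch, only optical drives would set the flag). */
void open_exclusive_for_write(struct disk_sketch *disk)
{
    if (disk->flags & DISK_FL_BLOCK_EVENTS_ON_EXCL_WRITE_SKETCH)
        disk->events_blocked++;
}
```

Making the behavior opt-in per disk keeps burning safe on optical drives while leaving polling active on other removable media.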
     
  • __blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
    which leads to ghost partition devices lingering after medium removal
    is known to both the kernel and userland. The behavior also creates a
    subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
    there's no medium, clears the ghost partitions, which is exploited to
    work around the problem from userland.

    Fix it by updating __blkdev_get() to issue partition rescan after
    -ENOMEDIUM too.

    This was reported in the following bz.

    https://bugzilla.kernel.org/show_bug.cgi?id=13029

    Note for stable: 2.6.38 and later only

    Cc: stable@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: David Zeuthen
    Reported-by: Martin Pitt
    Reported-by: Kay Sievers
    Tested-by: Kay Sievers
    Cc: Alan Cox
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cdrom_open() called check_disk_change() after the rest of the open path
    succeeded, which leads to the following bizarre behavior.

    * After media change, if the device is opened without O_NONBLOCK,
    open_for_data() naturally fails with -ENOMEDIUM and
    check_disk_change() is never called. The media is known to be gone
    and the open failure makes it obvious to the userland but device
    invalidation never happens.

    * But if the device is opened with O_NONBLOCK, all the checks are
    bypassed and cdrom_open() doesn't notice that the media is not there
    and check_disk_change() is called and invalidation happens.

    There's nothing to be gained by avoiding calling check_disk_change()
    on open failure. Common cases end up calling check_disk_change()
    anyway. All we get is inconsistent behavior.

    Fix it by moving check_disk_change() invocation to the top of
    cdrom_open() so that it always gets called regardless of how the rest
    of open proceeds.

    Note for stable: 2.6.38 and later only

    Cc: stable@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Amit Shah
    Tested-by: Amit Shah
    Signed-off-by: Jens Axboe

    Tejun Heo
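
The reordering can be sketched with a small model. All names ending in `_sketch` are hypothetical, the device state is reduced to three booleans, and the fallback value for ENOMEDIUM is only there for platforms whose errno.h lacks it.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ENOMEDIUM
#define ENOMEDIUM 123  /* fallback for non-Linux hosts (assumption) */
#endif

struct cdrom_sketch {
    bool medium_present;
    bool media_changed;  /* change not yet processed */
    bool invalidated;    /* stale state has been dropped */
};

void check_disk_change_sketch(struct cdrom_sketch *cd)
{
    if (cd->media_changed) {
        cd->invalidated = true;
        cd->media_changed = false;
    }
}

int cdrom_open_sketch(struct cdrom_sketch *cd, bool nonblock)
{
    /* Called first, so it runs no matter how the open ends. */
    check_disk_change_sketch(cd);

    if (nonblock)
        return 0;  /* O_NONBLOCK bypasses the medium checks */
    if (!cd->medium_present)
        return -ENOMEDIUM;
    return 0;
}
```

With the check at the top, the blocking open that fails and the O_NONBLOCK open that succeeds both end with the device invalidated, which removes the inconsistency described above.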
     

19 Apr, 2011

10 commits

  • Linus Torvalds
     
  • * 'for-39-rc4' of git://codeaurora.org/quic/kernel/davidb/linux-msm:
    msm: timer: fix missing return value
    msm: Remove extraneous ffa device check

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: xen-kbdfront - fix mouse getting stuck after save/restore
    Input: estimate number of events per packet
    Input: evdev - indicate buffer overrun with SYN_DROPPED
    Input: document event types and codes and their intended use
    Input: add KEY_IMAGES specifically for AL Image Browser
    Input: twl4030_keypad - fix potential NULL dereference in twl4030_kp_probe()
    Input: h3600_ts - fix error handling at connect
    Input: twl4030_keypad - avoid potential NULL-pointer dereference

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: add blk_run_queue_async
    block: blk_delay_queue() should use kblockd workqueue
    md: fix up raid1/raid10 unplugging.
    md: incorporate new plugging into raid5.
    md: provide generic support for handling unplug callbacks.
    md - remove old plugging code.
    md/dm - remove remains of plug_fn callback.
    md: use new plugging interface for RAID IO.
    block: drop queue lock before calling __blk_run_queue() for kblockd punt
    Revert "block: add callback function for unplug notification"
    block: Enhance new plugging support to support general callbacks

    Linus Torvalds
     
  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/powermac: Build fix with SMP and CPU hotplug
    powerpc/perf_event: Skip updating kernel counters if register value shrinks
    powerpc: Don't write protect kernel text with CONFIG_DYNAMIC_FTRACE enabled
    powerpc: Fix oops if scan_dispatch_log is called too early
    powerpc/pseries: Use a kmem cache for DTL buffers
    powerpc/kexec: Fix regression causing compile failure on UP
    powerpc/85xx: disable Suspend support if SMP enabled
    powerpc/e500mc: Remove CPU_FTR_MAYBE_CAN_NAP/CPU_FTR_MAYBE_CAN_DOZE
    powerpc/book3e: Fix CPU feature handling on 64-bit e5500
    powerpc: Check device status before adding serial device
    powerpc/85xx: Don't add disabled PCIe devices

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
    Btrfs: fix free space cache leak
    Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
    Btrfs end_bio_extent_readpage should look for locked bits
    Btrfs: don't force chunk allocation in find_free_extent
    Btrfs: Check validity before setting an acl
    Btrfs: Fix incorrect inode nlink in btrfs_link()
    Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
    Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
    Btrfs: make uncache_state unconditional
    btrfs: using cached extent_state in set/unlock combinations
    Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
    Btrfs: fix subvolume mount by name problem when default mount subvolume is set
    fix user annotation in ioctl.c
    Btrfs: check for duplicate iov_base's when doing dio reads
    btrfs: properly handle overlapping areas in memmove_extent_buffer
    Btrfs: fix memory leaks in btrfs_new_inode()
    Btrfs: check for duplicate iov_base's when doing dio reads
    Btrfs: reuse the extent_map we found when calling btrfs_get_extent
    Btrfs: do not use async submit for small DIO io's
    Btrfs: don't split dio bios if we don't have to
    ...

    Linus Torvalds
     
  • Rather than pass in some random truncated offset to the pid-related
    functions, check that the offset is in range up-front.

    This is just cleanup, the previous commit fixed the real problem.

    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • next_pidmap() just quietly accepted whatever 'last' pid that was passed
    in, which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
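
The range check can be illustrated with a toy bitmap. This is a sketch, not the kernel's pidmap code: `PID_LIMIT_SKETCH` is a deliberately tiny hypothetical limit, and the map is a flat byte array rather than an array of pages.

```c
#define PID_LIMIT_SKETCH 64u  /* hypothetical, tiny stand-in for PID_MAX_LIMIT */

/* Returns the next pid in use after 'last', or -1 if there is none. */
int next_pid_sketch(const unsigned char *map, unsigned int last)
{
    /* Reject out-of-range input before any index arithmetic. */
    if (last >= PID_LIMIT_SKETCH)
        return -1;

    /* last + 1 may equal the limit; the loop bound handles that. */
    for (unsigned int pid = last + 1; pid < PID_LIMIT_SKETCH; pid++) {
        if ((map[pid / 8] >> (pid % 8)) & 1u)
            return (int)pid;
    }
    return -1;
}
```

As the message notes, going one past the end via "last+1" is harmless because the loop condition checks the bound; only a wildly out-of-range 'last' needs to be rejected up front.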
     
  • Mouse gets "stuck" after restore of PV guest but buttons are in working
    condition.

    If driver has been configured for ABS coordinates at start it will get
    XENKBD_TYPE_POS events and then suddenly after restore it'll start getting
    XENKBD_TYPE_MOTION events, that will be dropped later and they won't get
    into user-space.

    Regression was introduced by hunk 5 and 6 of
    5ea5254aa0ad269cfbd2875c973ef25ab5b5e9db
    ("Input: xen-kbdfront - advertise either absolute or relative
    coordinates").

    Driver on restore should ask xen for request-abs-pointer again if it is
    available. So restore parts that did it before 5ea5254.

    Acked-by: Olaf Hering
    Signed-off-by: Igor Mammedov
    [v1: Expanded the commit description]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Dmitry Torokhov

    Igor Mammedov
     
  • Calculate a default based on the number of ABS axes, REL axes,
    and MT slots for the device during input device registration.

    Signed-off-by: Jeff Brown
    Reviewed-by: Henrik Rydberg
    Signed-off-by: Dmitry Torokhov

    Jeff Brown
     

18 Apr, 2011

16 commits

  • The free space caching code was recently reworked to
    cache all the pages it needed instead of using find_get_page everywhere.

    One loop was missed though, so it ended up leaking pages. This fixes
    it to use our page array instead of find_get_page.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Instead of overloading __blk_run_queue to force an offload to kblockd
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed
    as the check we do when the workqueue items runs should be enough.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Reported-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We just need to make sure that an unplug event wakes up the md
    thread, which is exactly what mddev_check_plugged does.

    Also remove some plug-related code that is no longer needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • In raid5 plugging is used for 2 things:
    1/ collecting writes that require a bitmap update
    2/ collecting writes in the hope that we can create full
    stripes - or at least more-full.

    We now release these different sets of stripes when plug_cnt
    is zero.

    Also in make_request, we call mddev_check_plugged to hopefully increase
    plug_cnt, and wake up the thread at the end if plugging wasn't
    achieved for some reason.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When an md device adds a request to a queue, it can call
    mddev_check_plugged.
    If this succeeds then we know that the md thread will be woken up
    shortly, and ->plug_cnt will be non-zero until then, so some
    processing can be delayed.

    If it fails, then no unplug callback is expected and the make_request
    function needs to do whatever is required to make the request happen.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md has some plugging infrastructure for RAID5 to use because the
    normal plugging infrastructure required a 'request_queue', and when
    called from dm, RAID5 doesn't have one of those available.

    This relied on the ->unplug_fn callback which doesn't exist any more.

    So remove all of that code, both in md and raid5. Subsequent patches
    will restore the plugging functionality.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Now that unplugging is done differently, the unplug_fn callback is
    never called, so it can be completely discarded.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md/raid submits a lot of IO from the various raid threads.
    So add start/finish plug calls to those so that some
    plugging happens.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we know we are going to punt to kblockd, we can drop the queue
    lock before calling into __blk_run_queue() since it only does a
    safe bit test and a workqueue call. Since kblockd needs to grab
    this very lock as one of the first things it does, it's a good
    optimization to drop the lock before waking kblockd.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • MD can't use this since it really requires us to be able to
    keep more than a single piece of state for the unplug. Commit
    048c9374 added the required support for MD, so get rid of this
    now unused code.

    This reverts commit f75664570d8b75469cc468f23c2b27220984983b.

    Conflicts:

    block/blk-core.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • md/raid requires an unplug callback, but as it does not use
    requests the current code cannot provide one.

    So allow arbitrary callbacks to be attached to the blk_plug.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
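
Attaching arbitrary callbacks to a plug can be pictured as a small linked list that is drained on unplug. The sketch below is a user-space illustration with hypothetical names (`struct plug_sketch`, `plug_attach()`, and so on); it does not reproduce the kernel's blk_plug layout.

```c
#include <stddef.h>

/* One callback attached to a plug (hypothetical layout). */
struct plug_cb_sketch {
    struct plug_cb_sketch *next;
    void (*callback)(struct plug_cb_sketch *cb);
    int fired;  /* how often the callback has run */
};

struct plug_sketch {
    struct plug_cb_sketch *cb_list;
};

void plug_start(struct plug_sketch *plug)
{
    plug->cb_list = NULL;
}

/* Any submitter, request-based or not, can hook into the unplug. */
void plug_attach(struct plug_sketch *plug, struct plug_cb_sketch *cb)
{
    cb->next = plug->cb_list;
    plug->cb_list = cb;
}

/* Unplug: detach and run every callback exactly once. */
void plug_finish(struct plug_sketch *plug)
{
    while (plug->cb_list != NULL) {
        struct plug_cb_sketch *cb = plug->cb_list;
        plug->cb_list = cb->next;
        cb->callback(cb);
    }
}

/* Example callback: stands in for "wake up the md thread". */
void wake_thread_sketch(struct plug_cb_sketch *cb)
{
    cb->fired++;
}
```

Because each callback carries its own structure, a user such as md can keep as much per-unplug state as it needs by embedding the callback node in a larger structure.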
     
  • Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • Because of speculative event roll back, it is possible for some event counters
    to decrease between reads on POWER7. This causes a problem with the way that
    counters are updated. Delta values are calculated in a 64 bit value and the
    top 32 bits are masked. If the register value has decreased, this leaves us
    with a very large positive value added to the kernel counters. This patch
    protects against this by skipping the update if the delta would be negative.
    This can lead to a lack of precision in the counter values, but from my testing
    the value is typically fewer than 10 samples at a time.

    Signed-off-by: Eric B Munson
    Cc: stable@kernel.org
    Signed-off-by: Benjamin Herrenschmidt

    Eric B Munson
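
The arithmetic problem and the guard can be shown in a few lines. This is a sketch: the function names are hypothetical, and the threshold of 256 used to tell a small rollback apart from a genuine 32-bit wrap is an illustrative assumption, not something stated in the message above.

```c
#include <stdint.h>

/* Unguarded update: 64-bit subtraction, top 32 bits masked off. */
uint64_t naive_delta_sketch(uint64_t prev, uint64_t val)
{
    return (val - prev) & 0xffffffffull;
}

/* Guarded update: treat a small backwards step as a rollback and
 * skip it (threshold of 256 is an assumption of this sketch). */
uint64_t guarded_delta_sketch(uint64_t prev, uint64_t val)
{
    uint64_t delta = (val - prev) & 0xffffffffull;

    if (prev > val && (prev - val) < 256)
        delta = 0;
    return delta;
}
```

For example, a register that drops from 100 to 90 yields 0xfffffff6 from the unguarded version, which is the "very large positive value" the message describes, while the guarded version returns 0. A genuine wrap from 0xfffffff0 to 0x10 still yields 0x20.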
     
  • This problem was noticed on an MPC855T platform. Ftrace did oops
    when trying to write to the kernel text segment.

    Many thanks to Joakim for finding the root cause of this problem.

    Signed-off-by: Stefan Roese
    Cc: Joakim Tjernlund
    Cc: Benjamin Herrenschmidt
    Cc: Steven Rostedt
    Signed-off-by: Benjamin Herrenschmidt

    Stefan Roese
     
  • We currently enable interrupts before the dispatch log for the boot
    cpu is set up. If a timer interrupt comes in early enough we oops in
    scan_dispatch_log:

    Unable to handle kernel paging request for data at address 0x00000010

    ...

    .scan_dispatch_log+0xb0/0x170
    .account_system_vtime+0xa0/0x220
    .irq_enter+0x88/0xc0
    .do_IRQ+0x48/0x230

    The patch below adds a check to scan_dispatch_log to ensure the
    dispatch log has been allocated.

    Signed-off-by: Anton Blanchard
    Cc:
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard