27 Sep, 2012

1 commit


26 Sep, 2012

3 commits

  • blkdev_mmap() isn't used outside of fs/block_dev.c, mark it as
    static.

    Reported-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    Fengguang Wu
     
  • This avoids cache line bouncing when many processes lock the semaphore
    for read.

    New percpu lock implementation

    The lock consists of an array of percpu unsigned integers, a boolean
    variable and a mutex.

    When we take the lock for read, we enter rcu read section, check for a
    "locked" variable. If it is false, we increase a percpu counter on the
    current cpu and exit the rcu section. If "locked" is true, we exit the
    rcu section, take the mutex and drop it (this waits until a writer
    has finished) and retry.

    Unlocking for read just decreases percpu variable. Note that we can
    unlock on a different cpu than where we locked; in this case the
    counter underflows. The sum of all percpu counters represents the number
    of processes that hold the lock for read.

    When we need to lock for write, we take the mutex, set "locked" variable
    to true and synchronize rcu. Since RCU has been synchronized, no
    processes can create new read locks. We wait until the sum of percpu
    counters is zero - when it is, there are no readers in the critical
    section.
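
As a rough illustration of the counter arithmetic described above (not the kernel implementation), the following single-threaded C model leaves out RCU and the mutex entirely. The names percpu_rw_model, model_read_lock() and the value NR_CPUS = 4 are made up for this sketch.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for the model */

/* Single-threaded model of the lock state: no RCU, no real mutex. */
struct percpu_rw_model {
    unsigned int counters[NR_CPUS]; /* one reader counter per CPU */
    bool locked;                    /* set by a writer */
};

/* Reader lock: bump the counter of the CPU we are running on. */
bool model_read_lock(struct percpu_rw_model *l, int cpu)
{
    if (l->locked)
        return false; /* the real code would wait on the mutex and retry */
    l->counters[cpu]++;
    return true;
}

/* Reader unlock: may run on a different CPU, so a counter can wrap. */
void model_read_unlock(struct percpu_rw_model *l, int cpu)
{
    l->counters[cpu]--;
}

/* Number of active readers: wrapped counters cancel out in the sum. */
unsigned int model_readers(const struct percpu_rw_model *l)
{
    unsigned int sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += l->counters[cpu];
    return sum;
}
```

Locking on cpu 0 and unlocking on cpu 1 leaves counters[1] at UINT_MAX, yet the unsigned sum is 0 again, which is the condition a writer waits for.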

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • The kernel may crash when block size is changed and I/O is issued
    simultaneously.

    Because some subsystems (udev or lvm) may read any block device anytime,
    the bug actually puts any code that changes a block device size in
    jeopardy.

    The crash can be reproduced if you insert "msleep(1000)" into
    blkdev_get_blocks just before "bh->b_size = max_blocks <<
    inode->i_blkbits;".
    Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
    While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0"
    You get a BUG.

    The direct and non-direct I/O is written with the assumption that block
    size does not change. It doesn't seem practical to fix these crashes
    one-by-one: there may be many crash possibilities when block size changes
    at a certain place and it is impossible to find them all and verify the
    code.

    This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
    taken for read during I/O. It is taken for write when changing block
    size. Consequently, block size can't be changed while I/O is being
    submitted.

    For asynchronous I/O, the patch only prevents block size change while
    the I/O is being submitted. The block size can change when the I/O is in
    progress or when the I/O is being finished. This is acceptable because
    there are no accesses to block size when asynchronous I/O is being
    finished.

    The patch prevents block size changing while the device is mapped with
    mmap.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

21 Sep, 2012

2 commits

  • A queue newly allocated with blk_alloc_queue_node() has only
    QUEUE_FLAG_BYPASS set. For request-based drivers,
    blk_init_allocated_queue() is called and q->queue_flags is overwritten
    with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
    initial bypass is still in effect.

    In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into
    q->queue_flags instead of overwriting it.
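
The difference between overwriting and OR-ing can be shown with a small standalone sketch. The bit values below are hypothetical and do not match the kernel's flag numbering; queue_model, init_overwrite() and init_or() are names invented for the example.

```c
/* Hypothetical bit values; the kernel's actual flag numbers differ. */
#define QUEUE_FLAG_BYPASS  (1UL << 0)
#define QUEUE_FLAG_DEFAULT ((1UL << 1) | (1UL << 2))

struct queue_model {
    unsigned long queue_flags;
};

/* Old behaviour: plain assignment discards any flag set earlier. */
void init_overwrite(struct queue_model *q)
{
    q->queue_flags = QUEUE_FLAG_DEFAULT;
}

/* Behaviour described by the fix: add the defaults, keep the rest. */
void init_or(struct queue_model *q)
{
    q->queue_flags |= QUEUE_FLAG_DEFAULT;
}
```

Starting from a queue that only has the BYPASS bit set, init_overwrite() loses it while init_or() keeps it alongside the defaults.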

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • …_init_allocated_queue()

    b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues bypassed on allocation to avoid switching on and off
    bypass mode on a queue being initialized. Some drivers allocate and
    then destroy a lot of queues without fully initializing them and
    incurring bypass latency overhead on each of them could add up to
    significant overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue() which is called for
    any userland-visible queue regardless of its type.

    I believe this is correct because I don't think there is any block
    driver which needs or wants working elevator and blk-cgroup on a queue
    which isn't visible to userland. If there are such users, we need a
    different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
     

20 Sep, 2012

5 commits

  • Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
    way of blkdev_issue_zeroout().

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • If the device supports WRITE SAME, use that to optimize zeroing of
    blocks. If the device does not support WRITE SAME or if the operation
    fails, fall back to writing zeroes the old-fashioned way.
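
The try-then-fall-back pattern described here might look like the following userspace sketch. It operates on an in-memory buffer rather than a block device, and zero_dev, dev_write_same() and dev_zeroout() are hypothetical names, not the block layer API.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical in-memory "device" used only for this sketch. */
struct zero_dev {
    unsigned char *data;
    size_t nr_sectors;
    int supports_write_same;
    unsigned int write_same_calls; /* how often the fast path was taken */
};

/* Replicate one sector across a range; -1 if the device cannot do it. */
int dev_write_same(struct zero_dev *dev, size_t start, size_t count,
                   const unsigned char *sector)
{
    if (!dev->supports_write_same)
        return -1;
    for (size_t i = 0; i < count; i++)
        memcpy(dev->data + (start + i) * SECTOR_SIZE, sector, SECTOR_SIZE);
    dev->write_same_calls++;
    return 0;
}

/* Zero a range: try the replicate operation first, then fall back. */
int dev_zeroout(struct zero_dev *dev, size_t start, size_t count)
{
    static const unsigned char zero_sector[SECTOR_SIZE];

    if (start > dev->nr_sectors || count > dev->nr_sectors - start)
        return -1; /* range outside the device */
    if (dev_write_same(dev, start, count, zero_sector) == 0)
        return 0;
    /* Fallback: write the zeroes explicitly. */
    memset(dev->data + start * SECTOR_SIZE, 0, count * SECTOR_SIZE);
    return 0;
}
```

The caller sees the same result either way; only the amount of data that would have to cross the bus differs.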

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • - blk_check_merge_flags() verifies that cmd_flags / bi_rw are
    compatible. This function is called for both req-req and req-bio
    merging.

    - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
    to query the maximum sector count for a given request or queue. The
    calls will return the right value from the queue limits given the
    type of command (RW, discard, write same, etc.)
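
A simplified model of the two helpers, assuming one limit per command type and treating "compatible" as "same command type" (the real flag check is more detailed). The names limits_model, model_get_max_sectors() and model_can_merge() are invented for this sketch.

```c
enum cmd_type { CMD_RW, CMD_DISCARD, CMD_WRITE_SAME };

/* Hypothetical subset of per-queue limits. */
struct limits_model {
    unsigned int max_sectors;
    unsigned int max_discard_sectors;
    unsigned int max_write_same_sectors;
};

/* Pick the limit that applies to the given kind of command. */
unsigned int model_get_max_sectors(const struct limits_model *lim,
                                   enum cmd_type type)
{
    switch (type) {
    case CMD_DISCARD:
        return lim->max_discard_sectors;
    case CMD_WRITE_SAME:
        return lim->max_write_same_sectors;
    default:
        return lim->max_sectors;
    }
}

/* Merge only commands of the same kind whose total fits the limit. */
int model_can_merge(const struct limits_model *lim,
                    enum cmd_type a, unsigned int a_sectors,
                    enum cmd_type b, unsigned int b_sectors)
{
    if (a != b)
        return 0;
    return a_sectors + b_sectors <= model_get_max_sectors(lim, a);
}
```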

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() in a few places. This is
    done to distinguish true reads and writes from other fs type requests
    that carry a payload (e.g. write same).

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

13 Sep, 2012

1 commit

  • Remove useless kfree() and clean up code related to the removal.

    The semantic patch that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @r exists@
    position p1,p2;
    expression x;
    @@

    if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; }

    @unchanged exists@
    position r.p1,r.p2;
    expression e

    Signed-off-by: Peter Senna Tschudin
    Signed-off-by: Jens Axboe

    Peter Senna Tschudin
     

09 Sep, 2012

10 commits

    blk_queue_congestion_threshold() is already called in
    blk_queue_make_request(), so the later call is a duplicate and has been
    removed.

    Signed-off-by: Jaehoon Chung
    Signed-off-by: Kyungmin Park
    Signed-off-by: Jens Axboe

    Jaehoon Chung
     
  • Instead of using simple_strtoul which "converts" invalid numbers to 0,
    use strict_strtoul and perform error checking to ensure that userspace
    passes us a valid unsigned long. This addresses problems with functions
    such as writev, which might want to write a trailing newline -- the
    newline should rightfully be rejected, but the value preceding it
    should be preserved.
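
A userspace sketch of the contrast between lenient and strict parsing, built on the C library's strtoul() rather than the kernel helpers. It assumes that a single trailing newline is tolerated, which is how the kernel's strict parsers are documented to behave; parse_lenient() and parse_strict() are hypothetical names.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Lenient: garbage input silently becomes 0, as with simple_strtoul. */
unsigned long parse_lenient(const char *s)
{
    return strtoul(s, NULL, 10);
}

/*
 * Strict: returns 0 and stores the value on success, a negative errno
 * otherwise. One trailing newline is accepted (assumption of the sketch).
 */
int parse_strict(const char *s, unsigned long *out)
{
    char *end;
    unsigned long val;

    if (!isdigit((unsigned char)s[0]))
        return -EINVAL; /* rejects empty input, signs and whitespace */
    errno = 0;
    val = strtoul(s, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL; /* trailing garbage */
    *out = val;
    return 0;
}
```

With this, "abc" is an error instead of 0, while "128\n" still yields 128.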

    Fixes BZ#46981.

    Signed-off-by: Dave Reisner
    Signed-off-by: Jens Axboe

    Dave Reisner
     
  • Previously, there was bio_clone() but it only allocated from the fs bio
    set; as a result various users were open coding it and using
    __bio_clone().

    This changes bio_clone() to become bio_clone_bioset(), and then we add
    bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
    the functionality the last patch added.

    This will also help in a later patch changing how bio cloning works.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Boaz Harrosh
    CC: Jeff Garzik
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
    Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
    differently because there was some almost-duplicated code - this fixes
    some of that.

    The important change is that previously bio_kmalloc() always set
    bi_io_vec = bi_inline_vecs, even if nr_iovecs == 0 - unlike
    bio_alloc_bioset(). This would cause bio_has_data() to return true; I
    don't know if this resulted in any actual bugs but it was certainly
    wrong.

    bio_kmalloc() and bio_alloc_bioset() also have different arbitrary
    limits on nr_iovecs - 1024 (UIO_MAXIOV) for bio_kmalloc(), 256
    (BIO_MAX_PAGES) for bio_alloc_bioset(). This patch doesn't fix that, but
    at least they're enforced closer together and hopefully they will be
    fixed in a later patch.

    This'll also help with some future cleanups - there are a fair number of
    functions that allocate bios (e.g. bio_clone()), and now they don't have
    to be duplicated for bio_alloc(), bio_alloc_bioset(), and bio_kmalloc().

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    v7: Re-add dropped comments, improve patch description
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that we've got generic code for freeing bios allocated from bio
    pools, this isn't needed anymore.

    This patch also makes bio_free() static, since without bi_destructor
    there should be no need for it to be called anywhere else.

    bio_free() is now only called from bio_put, so we can refactor those a
    bit - move some code from bio_put() to bio_free() and kill the redundant
    bio->bi_next = NULL.

    v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
    v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
    v7: No #define BIO_KMALLOC_POOL anymore

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • This is prep work for killing bi_destructor - previously, pktcdvd had
    its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
    necessitating its own bi_destructor implementation.

    v5: Un-reorder some functions, to make the patch easier to review

    Signed-off-by: Kent Overstreet
    Acked-by: Jiri Kosina
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Reusing bios is something that's been highly frowned upon in the past,
    but driver code keeps doing it anyways. If it's going to happen anyways,
    we should provide a generic method.

    This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
    was open coding it, by doing a bio_init() and resetting bi_destructor.

    This required reordering struct bio, but the block layer is not yet
    nearly fast enough for any cacheline effects to matter here.

    v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
    bio->bi_flags are saved.
    v6: Further commenting verbosity, per Tejun
    v9: Add a function comment
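
Why a reset helper wants the struct reordered can be seen in a small model: fields to be cleared sit at the front, so one memset() up to a marker field does the job, and only the upper flag bits are carried over. The struct layout, MODEL_RESET_BITS = 4 and all names below are assumptions for illustration, not the actual struct bio.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_RESET_BITS 4 /* hypothetical: low flag bits cleared on reset */

struct bio_model {
    /* Fields cleared by a reset come first. */
    unsigned long sector;
    unsigned long flags;
    unsigned int size;
    /* Everything from max_vecs onwards survives a reset. */
    unsigned int max_vecs;
    void *pool;
};

/* Number of leading bytes wiped by a reset. */
#define MODEL_RESET_BYTES offsetof(struct bio_model, max_vecs)

void bio_model_reset(struct bio_model *bio)
{
    /* Remember the flag bits above MODEL_RESET_BITS. */
    unsigned long keep = bio->flags & (~0UL << MODEL_RESET_BITS);

    memset(bio, 0, MODEL_RESET_BYTES);
    bio->flags = keep;
}
```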

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Previously, dm_rq_clone_bio_info needed to be freed by the bio's
    destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
    This gets rid of a memory allocation and means we can kill
    dm_rq_bio_destructor.

    The _rq_bio_info_cache kmem cache is unused now and needs to be deleted,
    but due to the way io_pool is used and overloaded this looks not quite
    trivial so I'm leaving it for a later patch.

    v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun

    Signed-off-by: Kent Overstreet
    CC: Alasdair Kergon
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that bios keep track of where they were allocated from,
    bio_integrity_alloc_bioset() becomes redundant.

    Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
    related functions and make them use bio->bi_pool.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Martin K. Petersen
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • With the old code, when you allocate a bio from a bio pool you have to
    implement your own destructor that knows how to find the bio pool the
    bio was originally allocated from.

    This adds a new field to struct bio (bi_pool) and changes
    bio_alloc_bioset() to use it. This makes various bio destructors
    unnecessary, so they're then deleted.
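
The idea of recording the originating pool in the object itself can be sketched in plain C. Here the "pool" is just a counter and allocation always uses malloc(); obj_pool, pooled_bio, pooled_bio_alloc() and pooled_bio_put() are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical pool that only tracks how many objects are outstanding. */
struct obj_pool {
    unsigned int outstanding;
};

struct pooled_bio {
    struct obj_pool *pool; /* where this object came from; NULL = no pool */
    int refcount;
};

struct pooled_bio *pooled_bio_alloc(struct obj_pool *pool)
{
    struct pooled_bio *bio = malloc(sizeof(*bio));

    if (!bio)
        return NULL;
    bio->pool = pool;
    bio->refcount = 1;
    if (pool)
        pool->outstanding++;
    return bio;
}

/* Generic release: no per-user destructor needed to find the pool. */
void pooled_bio_put(struct pooled_bio *bio)
{
    if (--bio->refcount > 0)
        return;
    if (bio->pool)
        bio->pool->outstanding--;
    free(bio);
}
```

Because the object carries its own pool pointer, a single put function serves every allocator, which is what makes the individual destructors redundant.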

    v6: Explain the temporary if statement in bio_put

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Nicholas Bellinger
    CC: Lars Ellenberg
    Acked-by: Tejun Heo
    Acked-by: Nicholas Bellinger
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

07 Sep, 2012

4 commits

  • Pull ARM SoC bug fixes from Olof Johansson:
    "Mostly Renesas and Atmel bugfixes this time, targeting boot and build
    problems. A couple of patches for gemini and kirkwood as well. On a
    whole nothing very controversial."

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    ARM: gemini: fix the gemini build
    ARM: shmobile: armadillo800eva: enable rw rootfs mount
    ARM: Kirkwood: Fix 'SZ_1M' undeclared here for db88f6281-bp-setup.c
    ARM: shmobile: mackerel: fixup usb module order
    ARM: shmobile: armadillo800eva: fixup: sound card detection order
    ARM: shmobile: marzen: fixup smsc911x id for regulator
    ARM: at91/feature-removal-schedule: delay at91_mci removal
    ARM: mach-shmobile: armadillo800eva: Enable power button as wakeup source
    ARM: mach-shmobile: armadillo800eva: Fix GPIO buttons descriptions
    ARM: at91/dts: remove partial parameter in at91sam9g25ek.dts
    ARM: at91/clock: fix PLLA overclock warning
    ARM: at91: fix rtc-at91sam9 irq issue due to sparse irq support
    ARM: at91: fix system timer irq issue due to sparse irq support
    ARM: shmobile: sh73a0: fixup RELOC_BASE of intca_irq_pins_desc

    Linus Torvalds
     
  • Pull a hwmon fix from Guenter Roeck:
    "One patch, fixing DIV_ROUND_CLOSEST to support negative dividends.

    While the changes are not in the drivers/hwmon directory, the problem
    primarily affects hwmon drivers, and it makes sense to push the patch
    through the hwmon tree."

    * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
    linux/kernel.h: Fix DIV_ROUND_CLOSEST to support negative dividends
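
The rounding problem with negative dividends can be reproduced with two small functions. This is a sketch for a positive divisor only and is not the text of the kernel macro; div_round_naive() and div_round_closest() are names chosen for the example.

```c
/* Naive rounding: adding d/2 is only correct for non-negative x. */
long div_round_naive(long x, long d)
{
    return (x + d / 2) / d;
}

/*
 * Sign-aware rounding for a positive divisor d: C division truncates
 * toward zero, so negative dividends need d/2 subtracted instead.
 */
long div_round_closest(long x, long d)
{
    return x >= 0 ? (x + d / 2) / d : (x - d / 2) / d;
}
```

For x = -5 and d = 3 the exact quotient is about -1.67; the naive version returns -1, the sign-aware one returns -2.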

    Linus Torvalds
     
  • Pull kbuild fixes from Michal Marek:
    "These are two fixes that should go into 3.6. The link-vmlinux.sh one
    is obvious.

    The other one fixes make firmware_install with certain configurations,
    where a file in the toplevel firmware tree gets installed first, and
    $(INSTALL_FW_PATH)/$$(dir ) results in /lib/firmware/./, which
    confuses make 3.82 for some reason."

    * 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    firmware: fix directory creation rule matching with make 3.82
    link-vmlinux.sh: Fix stray "echo" in error message

    Linus Torvalds
     
  • Trivially triggerable, found by trinity:

    kernel BUG at mm/mempolicy.c:2546!
    Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
    Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
    RIP: mpol_to_str+0x156/0x360

    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     

06 Sep, 2012

9 commits

  • Pull MMC fixes from Chris Ball:
    - a firmware bug on several Samsung MoviNAND eMMC models causes
    permanent corruption on the device when secure erase and secure trim
    requests are made, so we disable those requests on these eMMC devices.
    - atmel-mci: fix a hang with some SD cards by waiting for not-busy flag.
    - dw_mmc: low-power mode breaks SDIO interrupts; fix PIO error handling;
    fix handling of error interrupts.
    - mxs-mmc: fix deadlocks; fix compile error due to dma.h arch change.
    - omap: fix broken PIO mode causing memory corruption.
    - sdhci-esdhc: fix card detection.

    * tag 'mmc-fixes-for-3.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
    mmc: omap: fix broken PIO mode
    mmc: card: Skip secure erase on MoviNAND; causes unrecoverable corruption.
    mmc: dw_mmc: Disable low power mode if SDIO interrupts are used
    mmc: dw_mmc: fix error handling in PIO mode
    mmc: dw_mmc: correct mishandling error interrupt
    mmc: dw_mmc: amend using error interrupt status
    mmc: atmel-mci: not busy flag has also to be used for read operations
    mmc: sdhci-esdhc: break out early if clock is 0
    mmc: mxs-mmc: fix deadlock caused by recursion loop
    mmc: mxs-mmc: fix deadlock in SDIO IRQ case
    mmc: bfin_sdh: fix dma_desc_array build error

    Linus Torvalds
     
  • Fix the following compile error on UML.

    arch/um/os-Linux/time.c: In function 'deliver_alarm':
    arch/um/os-Linux/time.c:117:3: error: too few arguments to function 'alarm_handler'
    arch/um/os-Linux/internal.h:1:6: note: declared here

    The error was introduced by commit d3c1cfcd ("um: pass siginfo to guest
    process") in 3.6-rc1.

    Signed-off-by: Miklos Szeredi
    CC: Martin Pärtel
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Allocate a structure not a pointer to it !

    Signed-off-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Pull powerpc fixes from Benjamin Herrenschmidt:
    "Here are a few fixes for 3.6 that were piling up while I was away or
    busy (I was mostly MIA a week or two before San Diego).

    Some fixes from Anton fixing up issues with our relatively new DSCR
    control feature, and a few other fixes that are either regressions or
    bugs nasty enough to warrant not waiting."

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: Don't use __put_user() in patch_instruction
    powerpc: Make sure IPI handlers see data written by IPI senders
    powerpc: Restore correct DSCR in context switch
    powerpc: Fix DSCR inheritance in copy_thread()
    powerpc: Keep thread.dscr and thread.dscr_inherit in sync
    powerpc: Update DSCR on all CPUs when writing sysfs dscr_default
    powerpc/powernv: Always go into nap mode when CPU is offline
    powerpc: Give hypervisor decrementer interrupts their own handler
    powerpc/vphn: Fix arch_update_cpu_topology() return value

    Linus Torvalds
     
  • Pull GPIO fixes from Linus Walleij:
    "These are some GPIO regression fixes for v3.6:
    - Erroneous debug message from of_get_named_gpio_flags()
    - Make sure the MC9S08DZ60 GPIO driver depend on I2C being compiled
    in (not module) or allmodconfig breaks.
    - Check return value from irq_alloc_descs() in the Emma Mobile GPIO
    driver.
    - Assign the owner field for the rdc321x driver so the module won't
    be removed if it has active GPIOs."

    * tag 'gpio-fixes-for-v3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
    gpio: rdc321x: Prevent removal of modules exporting active GPIOs
    gpio: em: Fix checking return value of irq_alloc_descs
    gpio: mc9s08dz60: Fix build error if I2C=m
    gpio: Fix debug message in of_get_named_gpio_flags()

    Linus Torvalds
     
  • Pull sound fixes from Takashi Iwai:
    "There is nothing scary here, only small fixes for HD-audio and
    USB-audio:
    - EPSS regression fix and GPIO fix for HD-audio IDT codecs
    - A series of USB-audio regression fixes that are found since 3.5
    kernel"

    * tag 'sound-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: snd-usb: fix cross-interface streaming devices
    ALSA: snd-usb: fix calls to next_packet_size
    ALSA: snd-usb: restore delay information
    ALSA: snd-usb: use list_for_each_safe for endpoint resources
    ALSA: snd-usb: Fix URB cancellation at stream start
    ALSA: hda - Don't trust codec EPSS bit for IDT 92HD83xx & co
    ALSA: hda - Avoid unnecessary parameter read for EPSS
    ALSA: hda - Do not set GPIOs for speakers on IDT if there are no speakers

    Linus Torvalds
     
  • Pull fbdev fixes from Florian Tobias Schandinat:
    - a fix by Paul Cercueil to prevent a possible buffer overflow
    - a fix by Bruno Prémont to prevent a rare sleep in invalid context
    - a fix by Julia Lawall for a double free in auo_k190x
    - a fix by Dan Carpenter to prevent a division by zero in mb862xxfb
    - a regression fix by Tomi Valkeinen for the SDI output in OMAP
    - a fix by Grazvydas Ignotas to fix the console colors in OMAP

    * tag 'fbdev-fixes-for-3.6-1' of git://github.com/schandinat/linux-2.6:
    OMAPFB: fix framebuffer console colors
    OMAPDSS: Fix SDI PLL locking
    video: mb862xxfb: prevent divide by zero bug
    drivers/video/auo_k190x.c: drop kfree of devm_kzalloc's data
    fbcon: Fix bit_putcs() call to kmalloc(s, GFP_KERNEL)
    fbcon: prevent possible buffer overflow.

    Linus Torvalds
     
  • Pull ubi fix from Artem Bityutskiy:
    "A single small fix for memory deallocation: we allocated memory using
    'kmem_cache_alloc()' but were freeing it using 'kfree()' in some
    cases. Now we fix this by using 'kmem_cache_free()' instead."

    * tag 'upstream-3.6-rc5' of git://git.infradead.org/linux-ubi:
    UBI: fix a horrible memory deallocation bug

    Linus Torvalds
     
  • Commit 644595f89620 ("compat: Handle COMPAT_USE_64BIT_TIME in
    net/socket.c") introduced a bug where the helper functions to take
    either a 64-bit or compat time[spec|val] got the arguments in the wrong
    order, passing the kernel stack pointer off as a user pointer (and vice
    versa).

    Because of the user address range check, that in turn causes an
    EFAULT: the user pointer range check fails for the kernel address,
    incorrectly resulting in a failed system call for 32-bit processes on
    a 64-bit kernel.

    On odder architectures like HP-PA (with separate user/kernel address
    spaces), it can be used to read kernel memory.

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

05 Sep, 2012

5 commits

    patch_instruction() can be called very early on ppc32, when the kernel
    isn't yet running at its linked address. That can cause the
    !is_kernel_addr() test in __put_user() to trip and call might_sleep()
    which is very bad at that point during boot.

    Use a lower level function instead for now, at least until we get to
    rework ppc32 boot process to do the code patching later, like ppc64
    does.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • We have been observing hangs, both of KVM guest vcpu tasks and more
    generally, where a process that is woken doesn't properly wake up and
    continue to run, but instead sticks in TASK_WAKING state. This
    happens because the update of rq->wake_list in ttwu_queue_remote()
    is not ordered with the update of ipi_message in
    smp_muxed_ipi_message_pass(), and the reading of rq->wake_list in
    scheduler_ipi() is not ordered with the reading of ipi_message in
    smp_ipi_demux(). Thus it is possible for the IPI receiver not to see
    the updated rq->wake_list and therefore conclude that there is nothing
    for it to do.

    In order to make sure that anything done before smp_send_reschedule()
    is ordered before anything done in the resulting call to scheduler_ipi(),
    this adds barriers in smp_muxed_ipi_message_pass() and smp_ipi_demux().
    The barrier in smp_muxed_ipi_message_pass() is a full barrier to ensure that
    there is a full ordering between the smp_send_reschedule() caller and
    scheduler_ipi(). In smp_ipi_demux(), we use xchg() rather than
    xchg_local() because xchg() includes release and acquire barriers.
    Using xchg() rather than xchg_local() makes sense given that
    ipi_message is not just accessed locally.
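
As a loose userspace analogy (the kernel uses its own barrier and xchg primitives, not C11 atomics), the sender/receiver pairing can be modelled with sequentially consistent atomic operations. cpu_ipi_model, ipi_model_pass() and ipi_model_demux() are invented names.

```c
#include <stdatomic.h>

/* Hypothetical per-CPU mailbox: one bit per pending message type. */
struct cpu_ipi_model {
    atomic_ulong messages;
};

/*
 * Sender: any data the receiver needs must be written before this call.
 * The default (seq_cst) read-modify-write orders those writes before
 * the message bit becomes visible.
 */
void ipi_model_pass(struct cpu_ipi_model *cpu, int msg)
{
    atomic_fetch_or(&cpu->messages, 1UL << msg);
    /* the real code would now raise the interrupt on the target CPU */
}

/*
 * Receiver: atomically take all pending bits. The ordered exchange
 * ensures later reads see what the sender wrote before setting them.
 */
unsigned long ipi_model_demux(struct cpu_ipi_model *cpu)
{
    return atomic_exchange(&cpu->messages, 0UL);
}
```

A relaxed, CPU-local exchange on the receiver side would still clear the bits but would not give the ordering guarantee, which is the distinction the commit draws between xchg_local() and xchg().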

    This moves the barrier between setting the message and calling the
    cause_ipi() function into the individual cause_ipi implementations.
    Most of them -- those that used outb, out_8 or similar -- already had
    a full barrier because out_8 etc. include a sync before the MMIO
    store. This adds an explicit barrier in the two remaining cases.

    These changes made no measurable difference to the speed of IPIs as
    measured using a simple ping-pong latency test across two CPUs on
    different cores of a POWER7 machine.

    The analysis of the reason why processes were not waking up properly
    is due to Milton Miller.

    Cc: stable@vger.kernel.org # v3.0+
    Reported-by: Milton Miller
    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     
  • During a context switch we always restore the per thread DSCR value.
    If we aren't doing explicit DSCR management
    (ie thread.dscr_inherit == 0) and the default DSCR changed while
    the process has been sleeping we end up with the wrong value.

    Check thread.dscr_inherit and select the default DSCR or per thread
    DSCR as required.
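
The selection rule fits in a few lines. This sketch only models the decision, not the register write; thread_model and dscr_to_restore() are hypothetical names.

```c
/* Hypothetical subset of the per-thread state. */
struct thread_model {
    unsigned long dscr; /* per-thread value */
    int dscr_inherit;   /* non-zero: thread manages DSCR explicitly */
};

/* Value to load into the DSCR when switching to thread t. */
unsigned long dscr_to_restore(const struct thread_model *t,
                              unsigned long dscr_default)
{
    return t->dscr_inherit ? t->dscr : dscr_default;
}
```

A thread that never set its own DSCR therefore picks up a changed default the next time it is scheduled in.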

    This was found with the following test case, when running with
    more threads than CPUs (ie forcing context switching):

    http://ozlabs.org/~anton/junkcode/dscr_default_test.c

    With the four patches applied I can run a combination of all
    test cases successfully at the same time:

    http://ozlabs.org/~anton/junkcode/dscr_default_test.c
    http://ozlabs.org/~anton/junkcode/dscr_explicit_test.c
    http://ozlabs.org/~anton/junkcode/dscr_inherit_test.c

    Signed-off-by: Anton Blanchard
    Cc: # 3.0+
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • If the default DSCR is non zero we set thread.dscr_inherit in
    copy_thread() meaning the new thread and all its children will ignore
    future updates to the default DSCR. This is not intended and is
    a change in behaviour that a number of our users have hit.

    We just need to inherit thread.dscr and thread.dscr_inherit from
    the parent which ends up being much simpler.

    This was found with the following test case:

    http://ozlabs.org/~anton/junkcode/dscr_default_test.c

    Signed-off-by: Anton Blanchard
    Cc: # 3.0+
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • When we update the DSCR either via emulation of mtspr(DSCR) or via
    a change to dscr_default in sysfs we don't update thread.dscr.
    We will eventually update it at context switch time but there is
    a period where thread.dscr is incorrect.

    If we fork at this point we will copy the old value of thread.dscr
    into the child. To avoid this, always keep thread.dscr in sync with
    reality.

    This issue was found with the following testcase:

    http://ozlabs.org/~anton/junkcode/dscr_inherit_test.c

    Signed-off-by: Anton Blanchard
    Cc: # 3.0+
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard