30 Nov, 2017

40 commits

  • commit 67f2519fe2903c4041c0e94394d14d372fe51399 upstream.

    guard_bio_eod() needs to look at the partition capacity, not just the
    capacity of the whole device, when determining if truncation is
    necessary.

    [ 60.268688] attempt to access beyond end of device
    [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
    [ 60.268693] buffer_io_error: 2 callbacks suppressed
    [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

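    A minimal sketch of the partition-aware limit (the helper name is
    illustrative; this shows the idea, not the literal upstream diff):

        static sector_t bio_max_sector(struct bio *bio)
        {
                sector_t maxsector = get_capacity(bio->bi_disk);
                struct hd_struct *part;

                if (bio->bi_partno) {
                        /* Truncate against the partition the bio
                         * targets, not the whole disk. */
                        part = disk_get_part(bio->bi_disk, bio->bi_partno);
                        if (part) {
                                maxsector = part_nr_sects_read(part);
                                disk_put_part(part);
                        }
                }
                return maxsector;
        }
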
    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit 91af8300d9c1d7c6b6a2fd754109e08d4798b8d8 upstream.

    In bcache code, sysfs entries are created before all resources get
    allocated, e.g. allocation thread of a cache set.

    There is a possibility of a NULL pointer dereference if a resource
    is accessed before it is initialized. Indeed, Jorg Bornschein caught
    one on the cache set allocation thread and got a kernel oops.

    The reason for this bug is that when bch_bucket_alloc() is called
    during cache set registration and attaching, ca->alloc_thread is not
    yet allocated and initialized, so calling wake_up_process() on
    ca->alloc_thread triggers a NULL pointer dereference. A simple and
    fast fix is to check, before waking up ca->alloc_thread, whether it
    is allocated, and only wake it up when it is not NULL.

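    A sketch of the check described above (surrounding code elided):

        /* Only wake the allocator once it actually exists. */
        if (ca->alloc_thread)
                wake_up_process(ca->alloc_thread);
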
    Signed-off-by: Coly Li
    Reported-by: Jorg Bornschein
    Cc: Kent Overstreet
    Reviewed-by: Michael Lyle
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • commit b11270853fa3654f08d4a6a03b23ddb220512d8d upstream.

    The WARN_ON(!key->len) in set_secret() in net/ceph/crypto.c is hit if a
    user tries to add a key of type "ceph" with an invalid payload as
    follows (assuming CONFIG_CEPH_LIB=y):

    echo -e -n '\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
    | keyctl padd ceph desc @s

    This can be hit by fuzzers. As this is merely bad input and not a
    kernel bug, replace the WARN_ON() with return -EINVAL.

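    A sketch of the change in set_secret() (context elided):

        /* A zero-length key is bad userspace input, not a kernel bug,
         * so fail gracefully instead of warning. */
        if (!key->len)
                return -EINVAL;
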
    Fixes: 7af3ea189a9a ("libceph: stop allocating a new cipher on every crypto request")
    Signed-off-by: Eric Biggers
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit db86be3a12d0b6e5c5b51c2ab2a48f06329cb590 upstream.

    We're freeing the list iterator so we should be using the _safe()
    version of hlist_for_each_entry().

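    A sketch of the pattern, assuming the daemon-hash walk in
    ecryptfs_release_messaging() (identifiers from the eCryptfs
    messaging code):

        struct ecryptfs_daemon *daemon;
        struct hlist_node *n;

        /* The _safe variant caches the next node up front, so the
         * loop body may free the current entry. */
        hlist_for_each_entry_safe(daemon, n, &ecryptfs_daemon_hash[i],
                                  euid_chain)
                ecryptfs_exorcise_daemon(daemon);
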
    Fixes: 88b4a07e6610 ("[PATCH] eCryptfs: Public key transport mechanism")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit a0b3bc855374c50b5ea85273553485af48caf2f7 upstream.

    fscrypt_initialize(), which allocates the global bounce page pool when
    an encrypted file is first accessed, uses "double-checked locking" to
    try to avoid locking fscrypt_init_mutex. However, it doesn't use any
    memory barriers, so it's theoretically possible for a thread to observe
    a bounce page pool which has not been fully initialized. This is a
    classic bug with "double-checked locking".

    While "only a theoretical issue" in the latest kernel, in pre-4.8
    kernels the pointer that was checked was not even the last to be
    initialized, so it was easily possible for a crash (NULL pointer
    dereference) to happen. This was changed only incidentally by the large
    refactor to use fs/crypto/.

    Solve both problems in a trivial way that can easily be backported: just
    always take the mutex. It's theoretically less efficient, but it
    shouldn't be noticeable in practice as the mutex is only acquired very
    briefly once per encrypted file.

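    A sketch of the simplified locking (the signature and the
    pool-allocation helper are simplified/made up for brevity):

        int fscrypt_initialize(void)
        {
                int err = 0;

                /* No lockless fast path: the mutex is cheap because
                 * we only get here once per encrypted file. */
                mutex_lock(&fscrypt_init_mutex);
                if (!fscrypt_bounce_page_pool)
                        err = fscrypt_alloc_bounce_page_pool();
                mutex_unlock(&fscrypt_init_mutex);
                return err;
        }
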
    Later I'd like to make this use a helper macro like DO_ONCE(). However,
    DO_ONCE() runs in atomic context, so we'd need to add a new macro that
    allows blocking.

    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 31ccb1f7ba3cfe29631587d451cf5bb8ab593550 upstream.

    There is a race condition between nilfs_dirty_inode() and
    nilfs_set_file_dirty().

    When a file is opened, nilfs_dirty_inode() is called to update the
    access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a
    separate transaction. __nilfs_mark_inode_dirty() caches the ifile
    buffer_head in the i_bh field of the inode info structure and marks it
    as dirty.

    After some data was written to the file in another transaction, the
    function nilfs_set_file_dirty() is called, which adds the inode to the
    ns_dirty_files list.

    Then the segment construction calls nilfs_segctor_collect_dirty_files(),
    which goes through the ns_dirty_files list and checks the i_bh field.
    If there is a cached buffer_head in i_bh it is not marked as dirty
    again.

    Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
    transactions, it is possible that a segment construction that writes out
    the ifile occurs in-between the two. If this happens the inode is not
    on the ns_dirty_files list, but its ifile block is still marked as dirty
    and written out.

    In the next segment construction, the data for the file is written out
    and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root
    is written into the i_bh block, which is not dirty, because it was
    written out in another segment construction.

    As a result the bmap update can be lost, which leads to file system
    corruption. Either the virtual block address points to an unallocated
    DAT block, or the DAT entry will be reused for something different.

    The error can remain undetected for a long time. A typical error
    message would be one of the "bad btree" errors or a warning that a DAT
    entry could not be found.

    This bug can be reproduced reliably by a simple benchmark that creates
    and overwrites millions of 4k files.

    Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
    Signed-off-by: Andreas Rohner
    Signed-off-by: Ryusuke Konishi
    Tested-by: Andreas Rohner
    Tested-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andreas Rohner
     
  • commit ecc0c469f27765ed1e2b967be0aa17cee1a60b76 upstream.

    Currently if the autofs kernel module gets an error when writing to
    the pipe which links to the daemon, then it marks the whole
    mountpoint as catatonic, and it will stop working.

    It is possible that the error is transient. This can happen if the
    daemon is slow and more than 16 requests queue up. If a subsequent
    process tries to queue a request, and is then signalled, the write to
    the pipe will return -ERESTARTSYS and autofs will take that as total
    failure.

    So change the code to treat -ERESTARTSYS and -ENOMEM as transient
    failures which only abort the current request, not the whole
    mountpoint.

    It isn't a crash or a data corruption, but having autofs mountpoints
    suddenly stop working is rather inconvenient.

    Ian said:

    : And given the problems with a half dozen (or so) user space applications
    : consuming large amounts of CPU under heavy mount and umount activity this
    : could happen more easily than we expect.

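    A sketch of the new handling around the pipe write (names as in
    fs/autofs4/waitq.c):

        switch (ret = autofs4_write(sbi, pipe, &pkt, pktsz)) {
        case 0:
                break;
        case -ENOMEM:
        case -ERESTARTSYS:
                /* Transient: fail just this request. */
                autofs4_wait_release(sbi, wq->wait_queue_token, ret);
                break;
        default:
                /* Anything else still poisons the mountpoint. */
                autofs4_catatonic_mode(sbi);
                break;
        }
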
    Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Acked-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 5d03a6613957785e94af7a4a6212ad4af66aa5c2 upstream.

    There is a race in the current z3fold implementation between
    do_compact() called in a work queue context and the page release
    procedure when the page's kref goes to 0.

    do_compact() may be waiting for the page lock, which is released by
    release_z3fold_page_locked() right before putting the page onto the
    "stale" list, and then the page may be freed while do_compact()
    modifies its contents.

    The mechanism currently implemented to handle that (checking the
    PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref
    counter to guarantee that the page is not released if its compaction is
    scheduled. It then becomes compaction function's responsibility to
    decrease the counter and quit immediately if the page was actually
    freed.

    Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
    Signed-off-by: Vitaly Wool
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • commit bfa62a52cad93686bb8d8171ea5288813248a7c6 upstream.

    The ENOENT USB error means "specified interface or endpoint does
    not exist or is not enabled". Mark the device as not present when we
    encounter this error, similar to what we already do for the ENODEV
    error.

    Otherwise we can end up in an infinite loop in
    rt2x00usb_work_rxdone(), because we keep removing RX entries from
    the queue and putting them back, forever.

    We could hit a similar situation if URB submission kept failing with
    some other error, so we should consider limiting the number of
    entries processed by the rxdone work. But for now, since this patch
    fixes a reproducible soft lockup on single-processor systems and
    matches the meaning of the ENOENT error, let's apply it.

    The patch adds the additional ENOENT check not only in the RX kick
    routine, but also in the other places where we check for the ENODEV
    error.

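    The shape of the check, sketched under the assumption that the
    device-present flag is DEVICE_STATE_PRESENT from rt2x00.h (the
    patch's exact helper may differ):

        if (urb->status == -ENODEV || urb->status == -ENOENT)
                clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
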
    Reported-by: Richard Genoud
    Debugged-by: Richard Genoud
    Signed-off-by: Stanislaw Gruszka
    Tested-by: Richard Genoud
    Signed-off-by: Kalle Valo
    Signed-off-by: Greg Kroah-Hartman

    Stanislaw Gruszka
     
  • commit 409fcace9963c1e8d2cb0f7ac62e8b34d47ef979 upstream.

    Fix the final phase of <CLASS|MADDF|MSUBF|MAX|MAXA|MIN|MINA>.<D|S>
    emulation. Provide proper generation of the SIGFPE signal and
    updating of the debugfs FP exception stats in cases where any
    exception flags were set in preceding phases of emulation.

    The CLASS.<D|S> instruction may generate the "Unimplemented
    Operation" FP exception. <MADDF|MSUBF>.<D|S> instructions may
    generate the "Inexact", "Unimplemented Operation", "Invalid
    Operation", "Overflow", and "Underflow" FP exceptions.
    <MAX|MAXA|MIN|MINA>.<D|S> instructions can generate the
    "Unimplemented Operation" and "Invalid Operation" FP exceptions.

    The proper final processing of the cases when any FP exception flag
    is set is achieved by replacing the "break" statement with a "goto
    copcsr" statement. With this solution, the patch brings the final
    phase of emulation of the above instructions in line with the
    previously implemented emulation of other related FPU instructions
    (ADD, SUB, etc.).

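    The pattern, sketched for one affected case (copcsr is the
    emulator's common exit that updates the FCSR, bumps the debugfs
    stats and raises SIGFPE):

        rv.d = ieee754dp_fmax(fs, ft);
        /* was: break;  -- any IEEE exception flags raised above
         * were silently dropped */
        goto copcsr;
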
    Fixes: 38db37ba069f ("MIPS: math-emu: Add support for the MIPS R6 CLASS FPU instruction")
    Fixes: e24c3bec3e8e ("MIPS: math-emu: Add support for the MIPS R6 MADDF FPU instruction")
    Fixes: 83d43305a1df ("MIPS: math-emu: Add support for the MIPS R6 MSUBF FPU instruction")
    Fixes: a79f5f9ba508 ("MIPS: math-emu: Add support for the MIPS R6 MAX{, A} FPU instruction")
    Fixes: 4e9561b20e2f ("MIPS: math-emu: Add support for the MIPS R6 MIN{, A} FPU instruction")
    Signed-off-by: Aleksandar Markovic
    Cc: Ralf Baechle
    Cc: Douglas Leung
    Cc: Goran Ferenc
    Cc: "Maciej W. Rozycki"
    Cc: Miodrag Dinic
    Cc: Paul Burton
    Cc: Petar Jovanovic
    Cc: Raghu Gandham
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17581/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Aleksandar Markovic
     
  • commit 56a46acf62af5ba44fca2f3f1c7c25a2d5385b19 upstream.

    The WLAN LED on the Linksys WRT54GSv1 is active low, but the software
    treats it as active high. Fix the inverted logic.

    Fixes: 7bb26b169116 ("MIPS: BCM47xx: Fix LEDs on WRT54GS V1.0")
    Signed-off-by: Mirko Parthey
    Looks-ok-by: Rafał Miłecki
    Cc: Hauke Mehrtens
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16071/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mirko Parthey
     
  • commit 547da673173de51f73887377eb275304775064ad upstream.

    Fix a regression introduced by commit 7aeb753b5353 ("MIPS: Implement
    task_user_regset_view.") and activated by commit 6a9c001b7ec3
    ("MIPS: Switch ELF core dumper to use regsets."), which caused n32
    processes to dump o32 core files by failing to set the EF_MIPS_ABI2
    flag in the ELF core file header's `e_flags' member:

    $ file tls-core
    tls-core: ELF 32-bit MSB executable, MIPS, N32 MIPS64 rel2 version 1 (SYSV), [...]
    $ ./tls-core
    Aborted (core dumped)
    $ file core
    core: ELF 32-bit MSB core file MIPS, MIPS-I version 1 (SYSV), SVR4-style
    $

    Previously the flag was set as the result of a:

    #define ELF_CORE_EFLAGS EF_MIPS_ABI2

    statement placed in arch/mips/kernel/binfmt_elfn32.c, however in the
    regset case, i.e. when CORE_DUMP_USE_REGSET is set, ELF_CORE_EFLAGS is
    no longer used by `fill_note_info' in fs/binfmt_elf.c, and instead the
    `->e_flags' member of the regset view chosen is. We have the views
    defined in arch/mips/kernel/ptrace.c, however only an o32 and an n64
    one, and the latter is used for n32 as well. Consequently an o32 core
    file is incorrectly dumped from n32 processes (the ELF32 vs ELF64 class
    is chosen elsewhere, and the 32-bit one is correctly selected for n32).

    Correct the issue then by defining an n32 regset view and using it as
    appropriate. Issue discovered in GDB testing.

    Fixes: 7aeb753b5353 ("MIPS: Implement task_user_regset_view.")
    Signed-off-by: Maciej W. Rozycki
    Cc: Ralf Baechle
    Cc: Djordje Todorovic
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17617/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Maciej W. Rozycki
     
  • commit 3cad14d56adbf7d621fc5a35db42f3acc0a2d6e8 upstream.

    arch/mips/boot/dts/brcm/bcm96358nb4ser.dts does not exist, so
    we cannot build bcm96358nb4ser.dtb.

    Signed-off-by: Masahiro Yamada
    Fixes: 695835511f96 ("MIPS: BMIPS: rename bcm96358nb4ser to bcm6358-neufbox4-sercom")
    Acked-by: James Hogan
    Signed-off-by: Rob Herring
    Signed-off-by: Greg Kroah-Hartman

    Masahiro Yamada
     
  • commit 22b8ba765a726d90e9830ff6134c32b04f12c10f upstream.

    32-bit kernels can be configured to support MIPS64, in which case
    neither CONFIG_64BIT nor CONFIG_CPU_MIPS32_R* will be set. This causes
    the CP0_Status.FR checks at the point of floating point register save
    and restore to be compiled out, which results in odd FP registers not
    being saved or restored to the task or signal context even when
    CP0_Status.FR is set.

    Fix the ifdefs to use CONFIG_CPU_MIPSR2 and CONFIG_CPU_MIPSR6, which are
    enabled for the relevant revisions of either MIPS32 or MIPS64, along
    with some other CPUs such as Octeon (r2), Loongson1 (r2), XLP (r2),
    Loongson 3A R2.

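    The shape of the guard change (a before/after sketch, as described
    above):

        /* before: compiled out on 32-bit kernels built for MIPS64 */
        #if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
            defined(CONFIG_CPU_MIPS32_R6)

        /* after: true for any r2/r6 CPU, whether MIPS32 or MIPS64 */
        #if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR6)
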
    The suspect code originates from commit 597ce1723e0f ("MIPS: Support for
    64-bit FP with O32 binaries") in v3.14, however the code in
    __enable_fpu() was consistent and refused to set FR=1, falling back to
    software FPU emulation. This was suboptimal but should be functionally
    correct.

    Commit fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6
    CPU") in v4.2 (and stable tagged back to 4.0) later introduced the bug
    by updating __enable_fpu() to set FR=1 but failing to update the other
    similar ifdefs to enable FR=1 state handling.

    Fixes: fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6 CPU")
    Signed-off-by: James Hogan
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16739/
    Signed-off-by: Greg Kroah-Hartman

    James Hogan
     
  • commit c7fd89a6407ea3a44a2a2fa12d290162c42499c4 upstream.

    Building 32-bit MIPS64r2 kernels produces warnings like the following
    on certain toolchains (such as GNU assembler 2.24.90, but not GNU
    assembler 2.28.51) since commit 22b8ba765a72 ("MIPS: Fix MIPS64 FP
    save/restore on 32-bit kernels"), due to the exposure of fpu_save_16odd
    from fpu_save_double and fpu_restore_16odd from fpu_restore_double:

    arch/mips/kernel/r4k_fpu.S:47: Warning: float register should be even, was 1
    ...
    arch/mips/kernel/r4k_fpu.S:59: Warning: float register should be even, was 1
    ...

    This appears to be because .set mips64r2 does not change the FPU ABI to
    64-bit when -march=mips64r2 (or e.g. -march=xlp) is provided on the
    command line on that toolchain, from the default FPU ABI of 32-bit due
    to the -mabi=32. This makes access to the odd FPU registers invalid.

    Fix by explicitly changing the FPU ABI with .set fp=64 directives in
    fpu_save_16odd and fpu_restore_16odd, and moving the undefine of fp up
    in asmmacro.h so fp doesn't turn into $30.

    Fixes: 22b8ba765a72 ("MIPS: Fix MIPS64 FP save/restore on 32-bit kernels")
    Signed-off-by: James Hogan
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17656/
    Signed-off-by: Greg Kroah-Hartman

    James Hogan
     
  • commit 8a74d29d541cd86569139c6f3f44b2d210458071 upstream.

    A DM device with a mix of discard capabilities (due to some underlying
    devices not having discard support) _should_ just return -EOPNOTSUPP for
    the region of the device that doesn't support discards (even if only by
    way of the underlying driver formally not supporting discards). BUT,
    that does ask the underlying driver to handle something that it never
    advertised support for. In doing so we're exposing users to the
    potential for a underlying disk driver hanging if/when a discard is
    issued a the device that is incapable and never claimed to support
    discards.

    Fix this by requiring that each DM target in a DM table provide discard
    support as a prereq for a DM device to advertise support for discards.

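    A sketch of the resulting rule (names as in drivers/md/dm-table.c;
    the per-device iteration is elided):

        static bool dm_table_supports_discards(struct dm_table *t)
        {
                unsigned int i;

                for (i = 0; i < dm_table_get_num_targets(t); i++) {
                        struct dm_target *ti = dm_table_get_target(t, i);

                        /* Every target must opt in... */
                        if (!ti->num_discard_bios)
                                return false;
                        /* ...and its underlying devices must actually
                         * support discards too. */
                }
                return true;
        }
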
    This may cause some configurations that were happily supporting discards
    (even in the face of a mix of discard support) to stop supporting
    discards -- but the risk of users hitting driver hangs, and forced
    reboots, outweighs supporting those fringe mixed discard
    configurations.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit b9a41d21dceadf8104812626ef85dc56ee8a60ed upstream.

    The following BUG_ON was hit when testing repeat creation and removal of
    DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
    [] dm_get_from_kobject+0x34/0x3a
    [] dm_attr_show+0x2b/0x5e
    [] ? mutex_lock+0x26/0x44
    [] sysfs_kf_seq_show+0x83/0xcf
    [] kernfs_seq_show+0x23/0x25
    [] seq_read+0x16f/0x325
    [] kernfs_fop_read+0x3a/0x13f
    [] __vfs_read+0x26/0x9d
    [] ? security_file_permission+0x3c/0x44
    [] ? rw_verify_area+0x83/0xd9
    [] vfs_read+0x8f/0xcf
    [] ? __fdget_pos+0x12/0x41
    [] SyS_read+0x4b/0x76
    [] system_call_fastpath+0x12/0x71

    The bug can be easily triggered if an extra delay (e.g. 10ms) is added
    between the test of DMF_FREEING & DMF_DELETING and dm_get() in
    dm_get_from_kobject().

    To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
    dm_get() are done in an atomic way, so _minor_lock is used.

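    A sketch of the fixed function, close to the upstream shape:

        struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
        {
                struct mapped_device *md;

                md = container_of(kobj, struct mapped_device,
                                  kobj_holder.kobj);

                spin_lock(&_minor_lock);
                if (test_bit(DMF_FREEING, &md->flags) ||
                    test_bit(DMF_DELETING, &md->flags)) {
                        md = NULL;
                        goto out;
                }
                dm_get(md);
        out:
                spin_unlock(&_minor_lock);
                return md;
        }
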
    The other callers of dm_get() have also been checked to be OK: some
    callers invoke dm_get() under _minor_lock, some callers invoke it
    under _hash_lock, and dm_start_request() invokes it after increasing
    md->open_count.

    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit 8593b18ad348733b5d5ddfa0c79dcabf51dff308 upstream.

    Switch the printk() call to the preferred pr_warn() API.

    Fixes: 7e5873d3755c ("MIPS: pci: Add MT7620a PCIE driver")
    Signed-off-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/15321/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    John Crispin
     
  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run
    queue to send its own IPI. As all CPUs with overloaded tasks must be
    scanned regardless of whether one or many CPUs are lowering their
    priority, because there's no current way to find the CPU with the
    highest priority task that can schedule to one of these CPUs, there
    really only needs to be one IPI being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks
    that are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

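    Those helpers boil down to a tiny acquire/release pair over
    rto_loop_start (a sketch):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                /* Only the CPU that flips 0 -> 1 starts the IPI loop. */
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }
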
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 856eb0916d181da6d043cc33e03f54d5c5bbe54a upstream.

    The structure srcu_struct can be very big; its size is proportional
    to the value of CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS
    8192, so the field io_barrier in struct mapped_device occupies 84kB
    in the debugging kernel and 50kB in the non-debugging kernel. The
    large size may result in failure of the function kzalloc_node.

    In order to avoid the allocation failure, we use the function
    kvzalloc_node, this function falls back to vmalloc if a large contiguous
    chunk of memory is not available. This patch also moves the field
    io_barrier to the last position of struct mapped_device - the reason is
    that on many processor architectures, short memory offsets result in
    smaller code than long memory offsets - on x86-64 it reduces code size by
    320 bytes.

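    A sketch of the allocation change (as in drivers/md/dm.c's device
    constructor; error handling elided):

        /* kvzalloc_node() tries kmalloc first and falls back to
         * vmalloc, so a ~50-85kB mapped_device no longer fails when
         * physically contiguous memory is fragmented. */
        md = kvzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
        if (!md)
                return NULL;

        /* ...and the matching free path uses kvfree(md). */
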
    Note to stable kernel maintainers - the kernels 4.11 and older don't have
    the function kvzalloc_node, you can use the function vzalloc_node instead.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 5455f92b54e516995a9ca45bbf790d3629c27a93 upstream.

    If ovl_check_origin() fails, we should put upperdentry. We have a reference
    on it by now. So goto out_put_upper instead of out.

    Fixes: a9d019573e88 ("ovl: lookup non-dir copy-up-origin by file handle")
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Vivek Goyal
     
  • commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.

    The default max_cache_size_bytes for dm-bufio is meant to be the lesser
    of 25% of the size of the vmalloc area and 2% of the size of lowmem.
    However, on 32-bit systems the intermediate result in the expression

    (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100

    overflows, causing the wrong result to be computed. For example, on a
    32-bit system where the vmalloc area is 520093696 bytes, the result is
    1174405 rather than the expected 130023424, which makes the maximum
    cache size much too small (far less than 2% of lowmem). This causes
    severe performance problems for dm-verity users on affected systems.

    Fix this by using mult_frac() to correctly multiply by a percentage. Do
    this for all places in dm-bufio that multiply by a percentage. Also
    replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
    to the comment is now defined in include/linux/vmalloc.h.

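    The 32-bit arithmetic from the example above, worked out
    (mult_frac() divides before multiplying, per include/linux/kernel.h):

        /* naive: 520093696 * 25 = 13002342400 wraps mod 2^32 to
         *        117440512; / 100 = 1174405          (far too small)
         *
         * mult_frac(520093696, 25, 100)
         *   = (520093696 / 100) * 25 + ((520093696 % 100) * 25) / 100
         *   = 5200936 * 25 + (96 * 25) / 100
         *   = 130023400 + 24 = 130023424             (correct)
         */
        mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
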
    Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
    Fixes: 95d402f057f2 ("dm: add bufio")
    Signed-off-by: Eric Biggers
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 9dc112e2daf87b40607fd8d357e2d7de32290d45 upstream.

    It is very normal to see allocation failure, especially with blk-mq
    request_queues, so it's unnecessary to report this error and annoy
    people.

    In practice this 'blk_get_request() returned -11' error gets logged
    quite frequently when a blk-mq DM multipath device sees heavy IO.

    This change is marked for stable@ because the annoying message in
    question was included in stable@ commit 7083abbbf.

    Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 114e025968b5990ad0b57bf60697ea64ee206aac upstream.

    The SCSI layer allows ZBC drives to have a smaller last runt zone. For
    such a device, specifying the entire capacity for a dm-zoned target
    table entry fails because the specified capacity is not aligned on a
    device zone size indicated in the request queue structure of the
    device.

    Fix this problem by ignoring the last runt zone in the entry length
    when setting up the dm-zoned target (ctr method) and when iterating
    table entries of the target (iterate_devices method). This allows
    dm-zoned users to still easily set up a target using the entire
    device capacity (as mandated by dm-zoned) or the aligned capacity
    excluding the last runt zone.

    While at it, replace direct references to the device queue chunk_sectors
    limit with calls to the accessor blk_queue_zone_sectors().

    Reported-by: Peter Desnoyers
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 0440d5c0ca9744b92a07aeb6df0a9a75db6f4280 upstream.

    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-crypt checks if bv_offset is aligned on page size and these checks
    fail with slub_debug and XFS.

    Fix this bug by removing the bv_offset checks. Switch to checking if
    bv_len is aligned instead of bv_offset (this check should be sufficient
    to prevent overruns if a bio with too small bv_len is received).

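    A sketch of the changed check (shape only; cc->sector_size is the
    target's configured encryption sector size):

        /* before: rejected the unaligned offsets slub_debug produces */
        if (unlikely(bv_in.bv_offset & (cc->sector_size - 1)))
                return -EIO;

        /* after: only the length must be sector-aligned */
        if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
                return -EIO;
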
    Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size")
    Reported-by: Bruno Prémont
    Tested-by: Bruno Prémont
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit d1260e2a3f85f4c1010510a15f89597001318b1b upstream.

    When a DM cache in writeback mode moves data between the slow and fast
    device it can often avoid a copy if the triggering bio either:

    i) covers the whole block (no point copying if we're about to overwrite it)
    ii) the migration is a promotion and the origin block is currently discarded

    Prior to this fix there was a race with case (ii). The discard status
    was checked with a shared lock held (rather than exclusive). This meant
    another bio could run in parallel and write data to the origin, removing
    the discard state. After the promotion the parallel write would have
    been lost.

    With this fix the discard status is re-checked once the exclusive
    lock has been acquired. If the block is no longer discarded it falls
    back to the slower full copy path.

    Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     
  • commit 95b1369a9638cfa322ad1c0cde8efbe524059884 upstream.

    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-integrity checks if bv_offset is aligned on page size and this
    check fails with slub_debug and XFS.

    Fix this bug by removing the bv_offset check, leaving only the check for
    bv_len.

    Fixes: 7eada909bfd7 ("dm: add integrity target")
    Reported-by: Bruno Prémont
    Reviewed-by: Milan Broz
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 9ceace3c9c18c67676e75141032a65a8e01f9a7a upstream.

    This commit adds the PCI ID for the Raven platform.

    Signed-off-by: Vijendar Mukunda
    Signed-off-by: Takashi Iwai
    Signed-off-by: Greg Kroah-Hartman

    Vijendar Mukunda
     
  • commit f2ddaf8dfd4a5071ad09074d2f95ab85d35c8a1e upstream.

    Extend the Cavium ThunderX ACS quirk to cover more device IDs and restrict
    it to only Root Ports.

    Signed-off-by: Vadim Lomovtsev
    [bhelgaas: changelog, stable tag]
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Vadim Lomovtsev
     
  • commit 7f342678634f16795892677204366e835e450dda upstream.

    The Cavium ThunderX (CN8XXX) family of PCIe Root Ports does not advertise
    an ACS capability. However, the RTL internally implements similar
    protection as if ACS had Request Redirection, Completion Redirection,
    Source Validation, and Upstream Forwarding features enabled.

    Change Cavium ACS capabilities quirk flags accordingly.

    Fixes: b404bcfbf035 ("PCI: Add ACS quirk for all Cavium devices")
    Signed-off-by: Vadim Lomovtsev
    [bhelgaas: tidy changelog, comment, stable tag]
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Vadim Lomovtsev
     
  • commit 79aa801e899417a56863d6713f76c4e108856000 upstream.

    The effective_affinity_mask is always set when an interrupt is assigned in
    __assign_irq_vector() -> apic->cpu_mask_to_apicid(), e.g. for struct apic
    apic_physflat: -> default_cpu_mask_to_apicid() ->
    irq_data_update_effective_affinity(), but it looks like d->common->affinity
    remains all-1's before user space or the kernel changes it later.

    In the early allocation/initialization phase of an IRQ, we should use the
    effective_affinity_mask, otherwise Hyper-V may not deliver the interrupt to
    the expected CPU. Without the patch, if we assign 7 Mellanox ConnectX-3
    VFs to a 32-vCPU VM, one of the VFs may fail to receive interrupts.

    Tested-by: Adrian Suhov
    Signed-off-by: Dexuan Cui
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Jake Oshins
    Cc: Jork Loeser
    Cc: Stephen Hemminger
    Cc: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • commit c00054f540bf81e592e1fee709b0bdbf20f478b5 upstream.

    Previously we programmed the LTR_L1.2_THRESHOLD in the parent (upstream)
    device using the capability pointer of the *child* (downstream) device,
    which corrupted some random word of the parent's config space.

    Use the parent's L1 SS capability pointer to program its
    LTR_L1.2_THRESHOLD.

    Fixes: aeda9adebab8 ("PCI/ASPM: Configure L1 substate settings")
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Vidya Sagar
    CC: Rajat Jain
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     
  • commit 94ac327e043ee40d7fc57b54541da50507ef4e99 upstream.

    Every Port that supports the L1.2 substate advertises its Port
    Common_Mode_Restore_Time, i.e., the time the Port requires to re-establish
    common mode when exiting L1.2 (see PCIe r3.1, sec 7.33.2).

    Per sec 5.5.3.3.1, when exiting L1.2, the Downstream Port (the device at
    the upstream end of the link) must send TS1 training sequences for at least
    T(COMMONMODE) after it detects electrical idle exit on the Link. We want
    this to be long enough for both ends of the Link, so we should set it to
    the maximum of the Port Common_Mode_Restore_Time for the upstream and
    downstream components on the Link.

    Previously we only looked at the Port Common_Mode_Restore_Time of the
    upstream device, so if the downstream device required more time, we didn't
    program the upstream device's T(COMMONMODE) correctly.

    Fixes: f1f0366dd6be ("PCI/ASPM: Calculate and save the L1.2 timing parameters")
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Vidya Sagar
    Acked-by: Rajat Jain
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     
  • commit 7978db344719dab1e56d05e6fc04aaaddcde0a5e upstream.

    The for_each_available_child_of_node() loop in _of_add_opp_table_v2()
    doesn't drop the reference to "np" on errors. Fix that.

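    A sketch of the fix inside the loop (helper names approximate):

        for_each_available_child_of_node(opp_np, np) {
                ret = _opp_add_static_v2(opp_table, dev, np);
                if (ret) {
                        dev_err(dev, "%s: Failed to add OPP, %d\n",
                                __func__, ret);
                        /* The iterator only drops the reference when
                         * it advances; on early exit, put it here. */
                        of_node_put(np);
                        goto free_table;
                }
        }
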
    Fixes: 274659029c9d (PM / OPP: Add support to parse "operating-points-v2" bindings)
    Signed-off-by: Tobias Jordan
    [ VK: Improved commit log. ]
    Signed-off-by: Viresh Kumar
    Reviewed-by: Stephen Boyd
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Tobias Jordan
     
  • commit 6a468d5990ecd1c2d07dd85f8633bbdd0ba61c40 upstream.

    We can end up sleeping for a while waiting for the dead timeout,
    which means we could get the per-request timer to fire. We did
    handle this case, but if the dead timeout happened right after we
    submitted, we'd either tear down the connection or possibly requeue
    as we're handling an error, and race with the endio, which can lead
    to panics and other hilarity.

    Fixes: 560bc4b39952 ("nbd: handle dead connections")
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit ff57dc94faec023abc267cdc45766fccff497557 upstream.

    If we have a pending signal or the user kills their application then
    it'll bring down the whole device, which is less than awesome. Instead
    wait uninterruptible for the dead timeout so we're sure we gave it our
    best shot.

    Fixes: 560bc4b39952 ("nbd: handle dead connections")
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 0d63785c6b94b5d2f095f90755825f90eea791f5 upstream.

    The mvneta controller provides an 8-bit register to update the
    pending Tx descriptor counter, so a maximum of 255 Tx descriptors
    can be added at once. In the current code the mvneta_txq_pend_desc_add
    function assumes the caller takes care of this limit, but that is
    not the case. In some situations (xmit_more flag), more than 255
    descriptors are added.
    When this happens, the Tx descriptor counter register is updated with a
    wrong value, which breaks the whole Tx queue management.

    This patch fixes the issue by allowing the mvneta_txq_pend_desc_add
    function to process more than 255 Tx descriptors.

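    A sketch of the reworked function (the xmit_more pending
    bookkeeping is elided; register names as in mvneta.c):

        static void mvneta_txq_pend_desc_add(struct mvneta_port *pp,
                                             struct mvneta_tx_queue *txq,
                                             int pend_desc)
        {
                u32 val;

                /* The register only takes 8 bits, so feed it the
                 * descriptors in chunks of at most 255. */
                do {
                        val = min(pend_desc, 255);
                        mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id),
                                    val);
                        pend_desc -= val;
                } while (pend_desc > 0);
        }
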
    Fixes: 2a90f7e1d5d0 ("net: mvneta: add xmit_more support")
    Signed-off-by: Simon Guinot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Simon Guinot
     
  • commit 05a67cc258e75ac9758e6f13d26337b8be51162a upstream.

    There is a typo inside the pinmux setup code. The function is called
    refclk and not reclk.

    Fixes: 53263a1c6852 ("MIPS: ralink: add mt7628an support")
    Signed-off-by: Mathias Kresin
    Acked-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16047/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mathias Kresin
     
  • commit 8ef4b43cd3794d63052d85898e42424fd3b14d24 upstream.

    According to the datasheet the REFCLK pin is shared with GPIO#37 and
    the PERST pin is shared with GPIO#36.

    Fixes: 53263a1c6852 ("MIPS: ralink: add mt7628an support")
    Signed-off-by: Mathias Kresin
    Acked-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16046/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mathias Kresin
     
  • commit a3f143106596d739e7fbc4b84c96b1475247d876 upstream.

    __cmpxchg64_local_generic() is atomic only with respect to tasks
    and interrupts on the same CPU (that's what the 'local' means). We
    can't use it to implement cmpxchg64() in SMP configurations.

    So, for 32-bit SMP configurations:

    - Don't define cmpxchg64()
    - Don't enable HAVE_VIRT_CPU_ACCOUNTING_GEN, which requires it

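    The shape of the resulting guard in asm/cmpxchg.h (a sketch):

        #ifdef CONFIG_64BIT
        #define cmpxchg64(ptr, o, n)    cmpxchg((ptr), (o), (n))
        #elif !defined(CONFIG_SMP)
        /* Only atomic against this CPU - fine without SMP. */
        #define cmpxchg64(ptr, o, n)    cmpxchg64_local((ptr), (o), (n))
        #endif
        /* 32-bit SMP: no cmpxchg64() is provided at all. */
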
    Fixes: e2093c7b03c1 ("MIPS: Fall back to generic implementation of ...")
    Fixes: bb877e96bea1 ("MIPS: Add support for full dynticks CPU time accounting")
    Signed-off-by: Ben Hutchings
    Cc: Ralf Baechle
    Cc: Deng-Cheng Zhu
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17413/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings