27 Nov, 2018

1 commit

  • [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]

    When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
    for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
    Juno or HiKey960), we get the following report:

    [ 0.748225] Call trace:
    [ 0.750685] lockdep_assert_cpus_held+0x30/0x40
    [ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
    [ 0.760137] build_sched_domains+0x1034/0x1108
    [ 0.764601] sched_init_domains+0x68/0x90
    [ 0.768628] sched_init_smp+0x30/0x80
    [ 0.772309] kernel_init_freeable+0x278/0x51c
    [ 0.776685] kernel_init+0x10/0x108
    [ 0.780190] ret_from_fork+0x10/0x18

    The static_key in question is 'sched_asym_cpucapacity' introduced by
    commit:

    df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")

    In this particular case, we enable it because smp_prepare_cpus() will
    end up fetching the capacity-dmips-mhz entry from the devicetree,
    so we already have some asymmetry detected when entering sched_init_smp().

    This didn't get detected in tip/sched/core because we were missing:

    commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

    Calls to build_sched_domains() post sched_init_smp() will hold the
    hotplug lock, it just so happens that this very first call is a
    special case. As stated by a comment in sched_init_smp(), "There's no
    userspace yet to cause hotplug operations" so this is a harmless
    warning.

    However, to both respect the semantics of underlying
    callees and make lockdep happy, take the hotplug lock in
    sched_init_smp(). This also satisfies the comment atop
    sched_init_domains() that says "Callers must hold the hotplug lock".
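
    The shape of such a change, sketched here under the assumption that the
    hotplug read lock simply brackets the existing domain setup (the actual
    hunk may differ in detail):

        void __init sched_init_smp(void)
        {
                ...
                /*
                 * There's no userspace yet to cause hotplug operations, but
                 * take the hotplug lock anyway to satisfy lockdep and the
                 * locking requirements of the callees.
                 */
                cpus_read_lock();
                mutex_lock(&sched_domains_mutex);
                sched_init_domains(cpu_active_mask);
                mutex_unlock(&sched_domains_mutex);
                cpus_read_unlock();
                ...
        }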

    Reported-by: Sudeep Holla
    Tested-by: Sudeep Holla
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: morten.rasmussen@arm.com
    Cc: quentin.perret@arm.com
    Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

23 Nov, 2018

1 commit

  • This reverts commit 8a13906ae519b3ed95cd0fb73f1098b46362f6c4 which is
    commit 53c613fe6349994f023245519265999eed75957f upstream.

    It's not ready for the stable trees as there are major slowdowns
    involved with this patch.

    Reported-by: Jiri Kosina
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: "SchauflerCasey"
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 Nov, 2018

3 commits

  • commit fd5f7cde1b85d4c8e09ca46ce948e008a2377f64 upstream.

    This patch, basically, reverts commit 6b97a20d3a79 ("printk:
    set may_schedule for some of console_trylock() callers").
    That commit was a mistake, it introduced a big dependency
    on the scheduler, by enabling preemption under console_sem
    in printk()->console_unlock() path, which is rather too
    critical. The patch did not significantly reduce the
    possibilities of printk() lockups, but made it possible to
    stall printk(), as has been reported by Tetsuo Handa [1].

    Another issue is that preemption under console_sem also
    interferes with Steven Rostedt's hand-off scheme, by making
    it possible to sleep with console_sem both in console_unlock()
    and in vprintk_emit(), after acquiring the console_sem
    ownership (anywhere between printk_safe_exit_irqrestore() in
    console_trylock_spinning() and printk_safe_enter_irqsave()
    in console_unlock()). This makes hand-off less likely and,
    at the same time, may result in a significant amount of
    pending logbuf messages. A preempted console_sem owner makes
    it impossible for other CPUs to emit logbuf messages, but
    does not make it impossible for other CPUs to append new
    messages to the logbuf.

    Reinstate the old behavior and make printk() non-preemptible.
    Should any printk() lockup reports arrive they must be handled
    in a different way.
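
    A sketch of what "non-preemptible" means here, assuming the change is
    confined to console_trylock() (details may differ from the actual patch):

        int console_trylock(void)
        {
                if (down_trylock_console_sem())
                        return 0;
                if (console_suspended) {
                        up_console_sem();
                        return 0;
                }
                console_locked = 1;
                /*
                 * Never allow rescheduling while console_sem was taken via
                 * the trylock path; the owner must flush without sleeping.
                 */
                console_may_schedule = 0;
                return 1;
        }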

    [1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL@I-love.SAKURA.ne.jp
    Fixes: 6b97a20d3a79 ("printk: set may_schedule for some of console_trylock() callers")
    Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
    To: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Tejun Heo
    Cc: akpm@linux-foundation.org
    Cc: linux-mm@kvack.org
    Cc: Cong Wang
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jan Kara
    Cc: Mathieu Desnoyers
    Cc: Byungchul Park
    Cc: Pavel Machek
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Tetsuo Handa
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Petr Mladek
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 568fb6f42ac6851320adaea25f8f1b94de14e40a upstream.

    Since commit ad67b74d2469 ("printk: hash addresses printed with %p"),
    all pointers printed with %p are printed with hashed addresses
    instead of real addresses in order to avoid leaking addresses in
    dmesg and syslog. But this applies to kdb too, which is unfortunate:

    Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry
    kdb> ps
    15 sleeping system daemon (state M) processes suppressed,
    use 'ps A' to see all.
    Task Addr Pid Parent [*] cpu State Thread Command
    0x(ptrval) 329 328 1 0 R 0x(ptrval) *sh

    0x(ptrval) 1 0 0 0 S 0x(ptrval) init
    0x(ptrval) 3 2 0 0 D 0x(ptrval) rcu_gp
    0x(ptrval) 4 2 0 0 D 0x(ptrval) rcu_par_gp
    0x(ptrval) 5 2 0 0 D 0x(ptrval) kworker/0:0
    0x(ptrval) 6 2 0 0 D 0x(ptrval) kworker/0:0H
    0x(ptrval) 7 2 0 0 D 0x(ptrval) kworker/u2:0
    0x(ptrval) 8 2 0 0 D 0x(ptrval) mm_percpu_wq
    0x(ptrval) 10 2 0 0 D 0x(ptrval) rcu_preempt

    The whole purpose of kdb is to debug, and for debugging real addresses
    need to be known. In addition, data displayed by kdb doesn't go into
    dmesg.

    This patch replaces all %p by %px in kdb in order to display real
    addresses.
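
    For illustration only (hypothetical calls, not hunks from the patch), the
    difference is simply the printk format specifier used by kdb_printf():

        kdb_printf("Task Addr 0x%p\n",  (void *)p);   /* hashed, or (ptrval) */
        kdb_printf("Task Addr 0x%px\n", (void *)p);   /* real address */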

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Signed-off-by: Daniel Thompson
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit dded2e159208a9edc21dd5c5f583afa28d378d39 upstream.

    On a powerpc 8xx, 'btc' fails as follows:

    Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0x0

    When booting the kernel with 'debug_boot_weak_hash', it fails as well:

    Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0xba99ad80

    On other platforms, Oopses have been observed too, see
    https://github.com/linuxppc/linux/issues/139

    This is due to btc calling 'btt' with %p pointer as an argument.

    This patch replaces %p with %px to get the real pointer value, as
    expected by 'btt'.

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

14 Nov, 2018

11 commits

  • commit 1ae80cf31938c8f77c37a29bbe29e7f1cd492be8 upstream.

    The map-in-map frequently serves as a mechanism for atomic
    snapshotting of state that a BPF program might record. The current
    implementation is dangerous to use in this way, however, since
    userspace has no way of knowing when all programs that might have
    retrieved the "old" value of the map may have completed.

    This change ensures that map update operations on map-in-map map types
    always wait for all references to the old map to drop before returning
    to userspace.
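
    A sketch of the kind of wait described, assuming a helper called from the
    map update path (the helper name is illustrative):

        static void maybe_wait_bpf_programs(struct bpf_map *map)
        {
                /*
                 * Wait for running BPF programs to complete so that, once the
                 * update returns, userspace knows no program can still see
                 * the old inner map.
                 */
                if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
                    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
                        synchronize_rcu();
        }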

    Signed-off-by: Daniel Colascione
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Alexei Starovoitov
    [fengc@google.com: 4.14 backport: adjust context]
    Signed-off-by: Chenbo Feng
    Signed-off-by: Greg Kroah-Hartman

    Daniel Colascione
     
  • commit 746a923b863a1065ef77324e1e43f19b1a3eab5c upstream.

    Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
    threaded irqs") made detection of spurious interrupts work for threaded
    handlers by:

    a) incrementing a counter every time the thread returns IRQ_HANDLED, and
    b) checking whether that counter has increased every time the thread is
    woken.

    However for oneshot interrupts, the commit unmasks the interrupt before
    incrementing the counter. If another interrupt occurs right after
    unmasking but before the counter is incremented, that interrupt is
    incorrectly considered spurious:

    time
    | irq_thread()
    | irq_thread_fn()
    | action->thread_fn()
    | irq_finalize_oneshot()
    | unmask_threaded_irq() /* interrupt is unmasked */
    |
    | /* interrupt fires, incorrectly deemed spurious */
    |
    | atomic_inc(&desc->threads_handled); /* counter is incremented */
    v

    This is observed with a hi3110 CAN controller receiving data at high volume
    (from a separate machine sending with "cangen -g 0 -i -x"): The controller
    signals a huge number of interrupts (hundreds of millions per day) and
    every second there are about a dozen which are deemed spurious.

    In theory with high CPU load and the presence of higher priority tasks, the
    number of incorrectly detected spurious interrupts might increase beyond
    the 99,900 threshold and cause disablement of the interrupt.

    In practice it just increments the spurious interrupt count. But that can
    cause people to waste time investigating it over and over.

    Fix it by moving the accounting before the invocation of
    irq_finalize_oneshot().
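
    A sketch of the reordering in one of the threaded-handler wrappers
    (assuming the irq_forced_thread_fn() style; details may differ):

        static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
                                                struct irqaction *action)
        {
                irqreturn_t ret;

                local_bh_disable();
                ret = action->thread_fn(action->irq, action->dev_id);
                /* account *before* the interrupt is unmasked again */
                if (ret == IRQ_HANDLED)
                        atomic_inc(&desc->threads_handled);
                irq_finalize_oneshot(desc, action);
                local_bh_enable();
                return ret;
        }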

    [ tglx: Folded change log update ]

    Fixes: 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs")
    Signed-off-by: Lukas Wunner
    Signed-off-by: Thomas Gleixner
    Cc: Mathias Duckeck
    Cc: Akshay Bhat
    Cc: Casey Fitzpatrick
    Cc: stable@vger.kernel.org # v3.16+
    Link: https://lkml.kernel.org/r/1dfd8bbd16163940648045495e3e9698e63b50ad.1539867047.git.lukas@wunner.de
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     
  • commit 277fcdb2cfee38ccdbe07e705dbd4896ba0c9930 upstream.

    log_buf_len_setup does not check its input argument before passing it
    to simple_strtoull. The argument will be a NULL pointer if
    "log_buf_len" is given on the command line without a value, which
    causes the following panic.

    PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0
    [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1
    [ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] simple_strtoull+0x29/0x70
    [ 0.000000] memparse+0x26/0x90
    [ 0.000000] log_buf_len_setup+0x17/0x22
    [ 0.000000] do_early_param+0x57/0x8e
    [ 0.000000] parse_args+0x208/0x320
    [ 0.000000] ? rdinit_setup+0x30/0x30
    [ 0.000000] parse_early_options+0x29/0x2d
    [ 0.000000] ? rdinit_setup+0x30/0x30
    [ 0.000000] parse_early_param+0x36/0x4d
    [ 0.000000] setup_arch+0x336/0x99e
    [ 0.000000] start_kernel+0x6f/0x4ee
    [ 0.000000] x86_64_start_reservations+0x24/0x26
    [ 0.000000] x86_64_start_kernel+0x6f/0x72
    [ 0.000000] secondary_startup_64+0xa4/0xb0

    This patch adds a check to prevent the panic.
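
    A sketch of the added check, assuming it sits at the top of
    log_buf_len_setup():

        static int __init log_buf_len_setup(char *str)
        {
                u64 size;

                /* "log_buf_len" was given without a value: nothing to parse */
                if (!str)
                        return -EINVAL;

                size = memparse(str, &str);
                /* ... existing rounding and assignment of the new length ... */
                return 0;
        }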

    Link: http://lkml.kernel.org/r/1538239553-81805-1-git-send-email-zhe.he@windriver.com
    Cc: stable@vger.kernel.org
    Cc: rostedt@goodmis.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: He Zhe
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Greg Kroah-Hartman

    He Zhe
     
  • commit 6a32c2469c3fbfee8f25bcd20af647326650a6cf upstream.

    Building any configuration with 'make W=1' produces a warning:

    kernel/bounds.c:16:6: warning: no previous prototype for 'foo' [-Wmissing-prototypes]

    When also passing -Werror, this prevents us from building any other files.
    Nobody ever calls the function, but we can't make it 'static' either
    since we want the compiler output.

    Calling it 'main' instead however avoids the warning, because gcc
    does not insist on having a declaration for main.
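
    A sketch of the rename (kernel/bounds.c is only compiled to extract
    constants from its generated assembly; it is never linked or run):

        /* was: void foo(void); gcc does not demand a prototype for main */
        int main(void)
        {
                /* The enum constants to put into include/generated/bounds.h */
                DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
                DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
                ...
                return 0;
        }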

    Link: http://lkml.kernel.org/r/20181005083313.2088252-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reported-by: Kieran Bingham
    Reviewed-by: Kieran Bingham
    Cc: David Laight
    Cc: Masahiro Yamada
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit a36700589b85443e28170be59fa11c8a104130a5 upstream.

    While fixing an out of bounds array access in known_siginfo_layout
    reported by the kernel test robot it became apparent that the same bug
    exists in siginfo_layout and affects copy_siginfo_from_user32.

    The straightforward fix, which guards against making this mistake
    in the future and should keep the code size small, is to just take an
    unsigned signal number instead of a signed signal number, as I did to
    fix known_siginfo_layout.
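
    A sketch of the signature change (the body indexes lookups by signal
    number, so an unsigned parameter removes the negative-index case):

        /* before */
        enum siginfo_layout siginfo_layout(int sig, int si_code);

        /* after: a negative value can no longer index off the table */
        enum siginfo_layout siginfo_layout(unsigned sig, int si_code);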

    Cc: stable@vger.kernel.org
    Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • [ Upstream commit 3597dfe01d12f570bc739da67f857fd222a3ea66 ]

    Instead of playing whack-a-mole and changing SEND_SIG_PRIV to
    SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init
    gets signals sent by the kernel, stop allowing a pid namespace init to
    ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is
    only supposed to be able to ignore signals sent from itself and
    children with SIG_DFL.

    Fixes: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • [ Upstream commit 819319fc93461c07b9cdb3064f154bd8cfd48172 ]

    Make reuse_unused_kprobe() return an error code if it fails to
    reuse an unused kprobe for an optprobe, instead of calling
    BUG_ON().

    Signed-off-by: Masami Hiramatsu
    Cc: Anil S Keshavamurthy
    Cc: David S . Miller
    Cc: Linus Torvalds
    Cc: Naveen N . Rao
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • [ Upstream commit 22839869f21ab3850fbbac9b425ccc4c0023926f ]

    The sigaltstack(2) system call fails with -ENOMEM if the new alternative
    signal stack is found to be smaller than SIGMINSTKSZ. On architectures
    such as arm64, where the native value for SIGMINSTKSZ is larger than
    the compat value, this can result in an unexpected error being reported
    to a compat task. See, for example:

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385

    This patch fixes the problem by extending do_sigaltstack to take the
    minimum signal stack size as an additional parameter, allowing the
    native and compat system call entry code to pass in their respective
    values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
    been defined by the architecture.
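
    A sketch of the approach (parameter name and call sites are assumptions):

        /* fall back to the native limit if the architecture has no override */
        #ifndef COMPAT_SIGMINSTKSZ
        #define COMPAT_SIGMINSTKSZ SIGMINSTKSZ
        #endif

        static int do_sigaltstack(const stack_t *ss, stack_t *oss,
                                  unsigned long sp, size_t min_ss_size)
        {
                ...
                if (ss_size < min_ss_size)
                        return -ENOMEM;
                ...
        }

        /* native entry passes SIGMINSTKSZ, compat entry COMPAT_SIGMINSTKSZ */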

    Cc: Arnd Bergmann
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Oleg Nesterov
    Reported-by: Steve McIntyre
    Tested-by: Steve McIntyre
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • [ Upstream commit 9506a7425b094d2f1d9c877ed5a78f416669269b ]

    It was found that when debug_locks was turned off because of a problem
    found by the lockdep code, the system performance could drop quite
    significantly when the lock_stat code was also configured into the
    kernel. For instance, parallel kernel build time on a 4-socket x86-64
    server nearly doubled.

    Further analysis into the cause of the slowdown traced back to the
    frequent call to debug_locks_off() from the __lock_acquired() function
    probably due to some inconsistent lockdep states with debug_locks
    off. The debug_locks_off() function did an unconditional atomic xchg
    to write a 0 value into debug_locks which had already been set to 0.
    This led to severe cacheline contention in the cacheline that held
    debug_locks. As debug_locks is referenced in quite a few different
    places in the kernel, this greatly slowed down the system performance.

    To prevent that thrashing of the debug_locks cacheline, lock_acquired()
    and lock_contended() now check the state of debug_locks before
    proceeding. The debug_locks_off() function is also modified to check
    debug_locks before calling __debug_locks_off().
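
    A sketch of the two checks described above:

        /* lib/debug_locks.c: avoid the atomic xchg when already off */
        int debug_locks_off(void)
        {
                if (debug_locks && __debug_locks_off()) {
                        if (!debug_locks_silent) {
                                console_verbose();
                                return 1;
                        }
                }
                return 0;
        }

        /* kernel/locking/lockdep.c: bail out early in the lock_stat hooks */
        if (unlikely(!lock_stat || !debug_locks))
                return;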

    Signed-off-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • [ Upstream commit 9845c49cc9bbb317a0bc9e9cf78d8e09d54c9af0 ]

    The comment and the code around the update_min_vruntime() call in
    dequeue_entity() are not in agreement.

    From commit:

    b60205c7c558 ("sched/fair: Fix min_vruntime tracking")

    I think that we want to update min_vruntime when a task is sleeping/migrating.
    So, the check is inverted there - fix it.

    Signed-off-by: Song Muchun
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
    Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Song Muchun
     
  • commit 53c613fe6349994f023245519265999eed75957f upstream.

    STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
    (once enabled) prevents cross-hyperthread control of decisions made by
    indirect branch predictors.

    Enable this feature if

    - the CPU is vulnerable to spectre v2
    - the CPU supports SMT and has SMT siblings online
    - spectre_v2 mitigation autoselection is enabled (default)

    After some previous discussion, this leaves STIBP on all the time, as wrmsr
    on crossing kernel boundary is a no-no. This could perhaps later be a bit
    more optimized (like disabling it in NOHZ, experiment with disabling it in
    idle, etc) if needed.

    Note that the synchronization of the mask manipulation via newly added
    spec_ctrl_mutex is currently not strictly needed, as the only updater is
    already being serialized by cpu_add_remove_lock, but let's make this a
    little bit more future-proof.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: "SchauflerCasey"
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

10 Nov, 2018

2 commits

  • commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.

    With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
    distribute_cfs_runtime may not empty the throttled_list before it runs
    out of runtime to distribute. In that case, due to the change from
    c06f04c7048 to put throttled entries at the head of the list, later entries
    on the list will starve. Essentially, the same X processes will get pulled
    off the list, given CPU time and then, when expired, get put back on the
    head of the list where distribute_cfs_runtime will give runtime to the same
    set of processes leaving the rest.

    Fix the issue by setting a bit in struct cfs_bandwidth when
    distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
    decide to put the throttled entry on the tail or the head of the list. The
    bit is set/cleared by the callers of distribute_cfs_runtime while they hold
    cfs_bandwidth->lock.

    This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
    the live system. In some cases you can simply look at the throttled list and
    see the later entries are not changing:

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -976050
    2 ffff90b56cb2cc00 -484925
    3 ffff90b56cb2bc00 -658814
    4 ffff90b56cb2ba00 -275365
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -994147
    2 ffff90b56cb2cc00 -306051
    3 ffff90b56cb2bc00 -961321
    4 ffff90b56cb2ba00 -24490
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    Sometimes it is easier to see by finding a process getting starved and looking
    at the sched_info:

    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },
    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },

    Signed-off-by: Phil Auld
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
    Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Phil Auld
     
  • commit 0962590e553331db2cc0aef2dc35c57f6300dbbe upstream.

    ALU operations on pointers such as scalar_reg += map_value_ptr are
    handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
    and range in the register state share a union, so transferring state
    through dst_reg->range = ptr_reg->range is just buggy as any new
    map_ptr in the dst_reg is then truncated (or null) for subsequent
    checks. Fix this by adding a raw member and use it for copying state
    over to dst_reg.
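
    A sketch of the 'raw' member and the copy it enables:

        struct bpf_reg_state {
                ...
                union {
                        /* valid when type == PTR_TO_PACKET */
                        u16 range;
                        /* valid for map-related pointer types */
                        struct bpf_map *map_ptr;
                        /* max size from any of the above */
                        unsigned long raw;
                };
                ...
        };

        /* in adjust_ptr_min_max_vals(): copy whichever member is live */
        dst_reg->raw = ptr_reg->raw;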

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Daniel Borkmann
    Cc: Edward Cree
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Edward Cree
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     

04 Nov, 2018

4 commits

  • [ Upstream commit ba6b8de423f8d0dee48d6030288ed81c03ddf9f0 ]

    Relying on map_release hook to decrement the reference counts when a
    map is removed only works if the map is not being pinned. In the
    pinned case the ref is decremented immediately and the BPF programs
    released. After this BPF programs may not be in-use which is not
    what the user would expect.

    This patch moves the release logic into bpf_map_put_uref() and brings
    sockmap in-line with how a similar case is handled in prog array maps.
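
    A sketch of the resulting release path, assuming a map_release_uref
    callback of the kind used by prog array maps:

        static void bpf_map_put_uref(struct bpf_map *map)
        {
                if (atomic_dec_and_test(&map->usercnt)) {
                        /* sockmap and prog arrays drop their program refs
                         * here, whether or not the map is pinned
                         */
                        if (map->ops->map_release_uref)
                                map->ops->map_release_uref(map);
                }
        }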

    Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit e4a02ed2aaf447fa849e3254bfdb3b9b01e1e520 ]

    If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
    in an arm64 virtual machine results in the following
    traceback if 8 CPUs are enabled:

    DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
    WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
    ...
    Call trace:
    __mutex_unlock_slowpath()
    ww_mutex_unlock()
    test_cycle_work()
    process_one_work()
    worker_thread()
    kthread()
    ret_from_fork()

    If requesting b_mutex fails with -EDEADLK, the error variable
    is reassigned to the return value from calling ww_mutex_lock
    on a_mutex again. If this call fails, a_mutex is not locked.
    It is, however, unconditionally unlocked subsequently, causing
    the reported warning. Fix the problem by using two error variables.

    With this change, the selftest still fails as follows:

    cyclic deadlock not resolved, ret[7/8] = -35

    However, the traceback is gone.

    Signed-off-by: Guenter Roeck
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks")
    Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Guenter Roeck
     
  • [ Upstream commit cd6fb677ce7e460c25bdd66f689734102ec7d642 ]

    Some of the scheduling tracepoints allow the perf_tp_event
    code to write to a ring buffer under a different CPU than the
    one the code is running on.

    This results in corrupted ring buffer data demonstrated in
    following perf commands:

    # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 10 groups == 400 processes run

    Total time: 0.383 [sec]
    [ perf record: Woken up 8 times to write data ]
    0x42b890 [0]: failed to process type: -1765585640
    [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

    # perf report --stdio
    0x42b890 [0]: failed to process type: -1765585640

    The reason for the corruption is that some of the scheduling
    tracepoints have __perf_task defined and thus allow storing data to
    another CPU's ring buffer:

    sched_waking
    sched_wakeup
    sched_wakeup_new
    sched_stat_wait
    sched_stat_sleep
    sched_stat_iowait
    sched_stat_blocked

    The perf_tp_event function first stores samples for the current-CPU
    events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
    perf_swevent_event(event, count, &data, regs);

    It then iterates over the 'task' events and stores the sample
    for any of the task's events that pass the tracepoint checks:

    ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

    list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
    continue;
    if (event->attr.config != entry->type)
    continue;

    perf_swevent_event(event, count, &data, regs);
    }

    The above code can race with the same code running on another CPU,
    ending up with two CPUs trying to store into the same ring
    buffer, which is specifically not allowed.

    This patch prevents the problem by only letting events on the
    current CPU receive the sample.

    NOTE: this requires the use of (per-task-)per-cpu buffers for this
    feature to work; perf-record does this.

    Signed-off-by: Jiri Olsa
    [peterz: small edits to Changelog]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Vagin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
    Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Jiri Olsa
     
  • [ Upstream commit a9f9772114c8b07ae75bcb3654bd017461248095 ]

    When we unregister a PMU, we fail to serialize the @pmu_idr properly.
    Fix that by doing the entire thing under pmu_lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

20 Oct, 2018

1 commit

  • commit 15d36fecd0bdc7510b70a0e5ec6671140b3fce0c upstream.

    When pmem namespaces created are smaller than section size, this can
    cause an issue during removal and gpf was observed:

    general protection fault: 0000 [#1] SMP PTI
    CPU: 36 PID: 3941 Comm: ndctl Tainted: G W 4.14.28-1.el7uek.x86_64 #2
    task: ffff88acda150000 task.stack: ffffc900233a4000
    RIP: 0010:__put_page+0x56/0x79
    Call Trace:
    devm_memremap_pages_release+0x155/0x23a
    release_nodes+0x21e/0x260
    devres_release_all+0x3c/0x48
    device_release_driver_internal+0x15c/0x207
    device_release_driver+0x12/0x14
    unbind_store+0xba/0xd8
    drv_attr_store+0x27/0x31
    sysfs_kf_write+0x3f/0x46
    kernfs_fop_write+0x10f/0x18b
    __vfs_write+0x3a/0x16d
    vfs_write+0xb2/0x1a1
    SyS_write+0x55/0xb9
    do_syscall_64+0x79/0x1ae
    entry_SYSCALL_64_after_hwframe+0x3d/0x0

    Add code to check whether we have a mapping already in the same section
    and prevent additional mappings from being created if that is the case.

    Link: http://lkml.kernel.org/r/152909478401.50143.312364396244072931.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Cc: Dan Williams
    Cc: Robert Elliott
    Cc: Jeff Moyer
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
     

18 Oct, 2018

1 commit

  • commit 479adb89a97b0a33e5a9d702119872cc82ca21aa upstream.

    A cgroup which is already a threaded domain may be converted into a
    threaded cgroup if the prerequisite conditions are met. When this
    happens, all threaded descendant should also have their ->dom_cgrp
    updated to the new threaded domain cgroup. Unfortunately, this
    propagation was missing leading to the following failure.

    # cd /sys/fs/cgroup/unified
    # cat cgroup.subtree_control # show that no controllers are enabled

    # mkdir -p mycgrp/a/b/c
    # echo threaded > mycgrp/a/b/cgroup.type

    At this point, the hierarchy looks as follows:

    mycgrp [d]
    a [dt]
    b [t]
    c [inv]

    Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

    # echo threaded > mycgrp/a/cgroup.type

    By this point, we now have a hierarchy that looks as follows:

    mycgrp [dt]
    a [t]
    b [t]
    c [inv]

    But, when we try to convert the node "c" from "domain invalid" to
    "threaded", we get ENOTSUP on the write():

    # echo threaded > mycgrp/a/b/c/cgroup.type
    sh: echo: write error: Operation not supported

    This patch fixes the problem by

    * Moving the open-coded ->dom_cgrp save and restoration in
    cgroup_enable_threaded() into cgroup_{save|restore}_control() so
    that multiple cgroups can be handled.

    * Updating all threaded descendants' ->dom_cgrp to point to the new
    dom_cgrp when enabling threaded mode.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: "Michael Kerrisk (man-pages)"
    Reported-by: Amin Jamali
    Reported-by: Joao De Almeida Pereira
    Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
    Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

13 Oct, 2018

1 commit

  • commit befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7 upstream.

    It is possible that a failure can occur during the scheduling of a
    pinned event. The initial portion of perf_event_read_local() contains
    the various error checks an event should pass before it can be
    considered valid. Ensure that the potential scheduling failure
    of a pinned event is checked for and that a meaningful error is returned.
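
    A sketch of the added check (the exact error code returned is an
    assumption here):

        /* in perf_event_read_local(), after the existing validity checks */
        if (event->attr.pinned && event->oncpu != smp_processor_id()) {
                /* the pinned event could not be scheduled on this CPU */
                ret = -EBUSY;
                goto out;
        }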

    Suggested-by: Peter Zijlstra
    Signed-off-by: Reinette Chatre
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: fenghua.yu@intel.com
    Cc: tony.luck@intel.com
    Cc: acme@kernel.org
    Cc: gavin.hindman@intel.com
    Cc: jithu.joseph@intel.com
    Cc: dave.hansen@intel.com
    Cc: hpa@zytor.com
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Reinette Chatre
     

10 Oct, 2018

1 commit

  • commit b799207e1e1816b09e7a5920fbb2d5fcf6edd681 upstream.

    When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I
    assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
    is sufficient to just truncate the output to 32 bits; and so I just moved
    the register size coercion that used to be at the start of the function to
    the end of the function.

    That assumption is true for almost every op, but not for 32-bit right
    shifts, because those can propagate information towards the least
    significant bit. Fix it by always truncating inputs for 32-bit ops to 32
    bits.

    Also get rid of the coerce_reg_to_size() after the ALU op, since that has
    no effect.
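
    A sketch of truncating the inputs instead of the output, as described
    above:

        /* in adjust_scalar_min_max_vals() */
        if (BPF_CLASS(insn->code) != BPF_ALU64) {
                /* 32-bit ALU ops are (32,32)->32: truncate both inputs so a
                 * right shift cannot propagate bits from the upper half
                 */
                coerce_reg_to_size(dst_reg, 4);
                coerce_reg_to_size(&src_reg, 4);
        }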

    Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification")
    Acked-by: Daniel Borkmann
    Signed-off-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

04 Oct, 2018

5 commits

  • [ Upstream commit 9b2e0388bec8ec5427403e23faff3b58dd1c3200 ]

    When sockmap code is using the stream parser it also handles the write
    space events in order to handle the case where (a) verdict redirects
    skb to another socket and (b) the sockmap then sends the skb but due
    to memory constraints (or other EAGAIN errors) needs to do a retry.

    But the initial code missed a third case where the
    skb_send_sock_locked() triggers an sk_wait_event(). A typical case
    would be when the sndbuf size is exceeded. If this happens, because we
    do not pass the write_space event to the lower layers, we never wake
    up the event and it will wait for sndtimeo. Which, as noted in the ktls
    fix, may be rather large and look like a hang to the user.

    To reproduce the best test is to reduce the sndbuf size and send
    1B data chunks to stress the memory handling. To fix this pass the
    event from the upper layer to the lower layer.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     
  • [ Upstream commit 9f2d1e68cf4d641def734adaccfc3823d3575e6c ]

    Livepatch modules are special in that we preserve their entire symbol
    tables in order to be able to apply relocations after module load. The
    unwanted side effect of this is that undefined (SHN_UNDEF) symbols of
    livepatch modules are accessible via the kallsyms api and this can
    confuse symbol resolution in livepatch (klp_find_object_symbol()) and
    cause subtle bugs in livepatch.

    Have the module kallsyms api skip over SHN_UNDEF symbols. These symbols
    are usually not available for normal modules anyway as we cut down their
    symbol tables to just the core (non-undefined) symbols, so this should
    really just affect livepatch modules. Note that this patch doesn't
    affect the display of undefined symbols in /proc/kallsyms.
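
    A sketch of the kind of filter described, assuming the module kallsyms
    iterators:

        for (i = 0; i < kallsyms->num_symtab; i++) {
                const Elf_Sym *sym = &kallsyms->symtab[i];

                /* livepatch modules keep SHN_UNDEF entries; don't expose them */
                if (sym->st_shndx == SHN_UNDEF)
                        continue;
                /* ... existing lookup/reporting ... */
        }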

    Reported-by: Josh Poimboeuf
    Tested-by: Josh Poimboeuf
    Reviewed-by: Josh Poimboeuf
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jessica Yu
     
  • [ Upstream commit 78c9c4dfbf8c04883941445a195276bb4bb92c76 ]

    The posix timer overrun handling is broken because the forwarding functions
    can return a huge number of overruns which does not fit in an int. As a
    consequence timer_getoverrun(2) and siginfo::si_overrun can turn into
    random number generators.

    The k_clock::timer_forward() callbacks return a 64 bit value now. Make
    k_itimer::ti_overrun[_last] 64bit as well, so the kernel internal
    accounting is correct. Remove the temporary (int) casts.

    Add a helper function which clamps the overrun value returned to user space
    via timer_getoverrun(2) or siginfo::si_overrun limited to a positive value
    between 0 and INT_MAX. INT_MAX is an indicator for user space that the
    overrun value has been clamped.
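
    A sketch of such a clamping helper (name and exact form are assumptions):

        static int timer_overrun_to_int(struct k_itimer *timr, int baseval)
        {
                s64 sum = timr->it_overrun_last + (s64)baseval;

                /* INT_MAX tells user space the value was clamped */
                return sum > (s64)INT_MAX ? INT_MAX : (int)sum;
        }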

    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 6fec64e1c92d5c715c6d0f50786daa7708266bde ]

    The posix timer ti_overrun handling is broken because the forwarding
    functions can return a huge number of overruns which does not fit in an
    int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn
    into random number generators.

    As a first step to address that let the timer_forward() callbacks return
    the full 64 bit value.

    Cast it to (int) temporarily until k_itimer::ti_overrun is converted to
    64bit and the conversion to user space visible values is sanitized.

    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Link: https://lkml.kernel.org/r/20180626132704.922098090@linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 5f936e19cc0ef97dbe3a56e9498922ad5ba1edef ]

    Air Icy reported:

    UBSAN: Undefined behaviour in kernel/time/alarmtimer.c:811:7
    signed integer overflow:
    1529859276030040771 + 9223372036854775807 cannot be represented in type 'long long int'
    Call Trace:
    alarm_timer_nsleep+0x44c/0x510 kernel/time/alarmtimer.c:811
    __do_sys_clock_nanosleep kernel/time/posix-timers.c:1235 [inline]
    __se_sys_clock_nanosleep kernel/time/posix-timers.c:1213 [inline]
    __x64_sys_clock_nanosleep+0x326/0x4e0 kernel/time/posix-timers.c:1213
    do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290

    alarm_timer_nsleep() uses ktime_add() to add the current time and the
    relative expiry value. ktime_add() has no sanity checks so the addition
    can overflow when the relative timeout is large enough.

    Use ktime_add_safe() which has the necessary sanity checks in place and
    limits the result to the valid range.
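
    A sketch of the one-line change in alarm_timer_nsleep():

        /* before: can overflow for huge relative timeouts */
        exp = ktime_add(now, exp);

        /* after: saturates at KTIME_MAX instead of wrapping */
        exp = ktime_add_safe(now, exp);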

    Fixes: 9a7adcf5c6de ("timers: Posix interface for alarm-timers")
    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1807020926360.1595@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 Sep, 2018

3 commits

  • Commit 0a0e0829f990 ("nohz: Fix missing tick reprogram when interrupting an
    inline softirq") got backported to stable trees and now causes the NOHZ
    softirq pending warning to trigger. It's not an upstream issue as the NOHZ
    update logic has been changed there.

    The problem occurs when a softirq-disabled section gets interrupted and,
    on return from interrupt, the tick/nohz state is evaluated, which can
    then observe pending soft interrupts. These soft interrupts are
    legitimately pending because they cannot be processed as long as soft
    interrupts are disabled, and the interrupted code will correctly process
    them when soft interrupts are reenabled.

    Add a check for softirqs disabled to the pending check to prevent the
    warning.

    Reported-by: Grygorii Strashko
    Reported-by: John Crispin
    Signed-off-by: Thomas Gleixner
    Tested-by: Grygorii Strashko
    Tested-by: John Crispin
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Anna-Maria Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 2d898915ccf4838c ("nohz: Fix missing tick reprogram when interrupting an inline softirq")
    Acked-by: Frederic Weisbecker
    Tested-by: Geert Uytterhoeven

    Thomas Gleixner
     
  • commit d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe upstream.

    When a task which previously ran on a given CPU is remotely queued to
    wake up on that same CPU, there is a period where the task's state is
    TASK_WAKING and its vruntime is not normalized. This is not accounted
    for in vruntime_normalized() which will cause an error in the task's
    vruntime if it is switched from the fair class during this time.

    For example if it is boosted to RT priority via rt_mutex_setprio(),
    rq->min_vruntime will not be subtracted from the task's vruntime but
    it will be added again when the task returns to the fair class. The
    task's vruntime will have been erroneously doubled and the effective
    priority of the task will be reduced.

    Note this will also lead to inflation of all vruntimes since the doubled
    vruntime value will become the rq's min_vruntime when other tasks leave
    the rq. This leads to repeated doubling of the vruntime and priority
    penalty.

    Fix this by recognizing a WAKING task's vruntime as normalized only if
    sched_remote_wakeup is true. This indicates a migration, in which case
    the vruntime would have been normalized in migrate_task_rq_fair().
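
    A sketch of the narrowed condition in vruntime_normalized():

        /*
         * A remotely queued wakeup (sched_remote_wakeup) had its vruntime
         * normalized in migrate_task_rq_fair(); a task queued to wake on
         * its previous CPU did not.
         */
        if (!se->sum_exec_runtime ||
            (p->state == TASK_WAKING && p->sched_remote_wakeup))
                return true;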

    Based on a similar patch from John Dias.

    Suggested-by: Peter Zijlstra
    Tested-by: Dietmar Eggemann
    Signed-off-by: Steve Muckle
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Redpath
    Cc: John Dias
    Cc: Linus Torvalds
    Cc: Miguel de Dios
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Paul Turner
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: kernel-team@android.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steve Muckle
     
  • commit 83f365554e47997ec68dc4eca3f5dce525cd15c3 upstream.

    When reducing ring buffer size, pages are removed by scheduling a work
    item on each CPU for the corresponding CPU ring buffer. After the pages
    are removed from ring buffer linked list, the pages are free()d in a
    tight loop. The loop does not give up the CPU until all pages are removed.
    In the worst case, when a lot of pages are to be freed, this can
    cause a system stall.

    After the pages are removed from the list, the free() can happen while
    the work is rescheduled. Call cond_resched() in the loop to prevent the
    system hangup.

    Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com

    Cc: stable@vger.kernel.org
    Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
    Reported-by: Jason Behmer
    Signed-off-by: Vaibhav Nagarnaik
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Vaibhav Nagarnaik
     

26 Sep, 2018

4 commits

  • [ Upstream commit 8fe5c5a937d0f4e84221631833a2718afde52285 ]

    When a new task wakes-up for the first time, its initial utilization
    is set to half of the spare capacity of its CPU. The current
    implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
    directly as a capacity reference. As a result, on a big.LITTLE system, a
    new task waking up on an idle little CPU will be given ~512 of util_avg,
    even if the CPU's capacity is significantly less than that.

    Fix this by computing the spare capacity with arch_scale_cpu_capacity().
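
    A sketch of the change, assuming the arch_scale_cpu_capacity() signature
    of that era:

        /* in post_init_entity_util_avg() */
        struct cfs_rq *cfs_rq = cfs_rq_of(se);
        long cpu_scale = arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
        long cap = (long)(cpu_scale - cfs_rq->avg.util_avg) / 2;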

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     
  • [ Upstream commit 76e079fefc8f62bd9b2cd2950814d1ee806e31a5 ]

    wake_woken_function() synchronizes with wait_woken() as follows:

    [wait_woken]                          [wake_woken_function]

    entry->flags &= ~wq_flag_woken;       condition = true;
    smp_mb();                             smp_wmb();
    if (condition)                        wq_entry->flags |= wq_flag_woken;
        break;

    This commit replaces the above smp_wmb() with an smp_mb() in order to
    guarantee that either wait_woken() sees the wait condition being true
    or the store to wq_entry->flags in woken_wake_function() follows the
    store in wait_woken() in the coherence order (so that the former can
    eventually be observed by wait_woken()).

    The commit also fixes a comment associated to set_current_state() in
    wait_woken(): the comment pairs the barrier in set_current_state() to
    the above smp_wmb(), while the actual pairing involves the barrier in
    set_current_state() and the barrier executed by the try_to_wake_up()
    in wake_woken_function().

    Signed-off-by: Andrea Parri
    Signed-off-by: Paul E. McKenney
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akiyks@gmail.com
    Cc: boqun.feng@gmail.com
    Cc: dhowells@redhat.com
    Cc: j.alglave@ucl.ac.uk
    Cc: linux-arch@vger.kernel.org
    Cc: luc.maranget@inria.fr
    Cc: npiggin@gmail.com
    Cc: parri.andrea@gmail.com
    Cc: stern@rowland.harvard.edu
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     
  • [ Upstream commit baa2a4fdd525c8c4b0f704d20457195b29437839 ]

    audit_add_watch stores locally krule->watch without taking a reference
    on watch. Then, it calls audit_add_to_parent, and uses the watch stored
    locally.

    Unfortunately, it is possible that audit_add_to_parent updates
    krule->watch. When that happens, it also drops a reference on the
    watch, which could free the watch.

    How to reproduce (with KASAN enabled):

    auditctl -w /etc/passwd -F success=0 -k test_passwd
    auditctl -w /etc/passwd -F success=1 -k test_passwd2

    The second call to auditctl triggers the use-after-free, because
    audit_add_to_parent updates krule->watch to use a previously existing
    watch and drops the reference to the newly created watch.

    To fix the issue, we grab a reference of watch and we release it at the
    end of the function.
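
    A sketch of the reference handling in audit_add_watch():

        int audit_add_watch(struct audit_krule *krule, struct list_head **list)
        {
                struct audit_watch *watch = krule->watch;
                ...
                /* audit_add_to_parent() may swap krule->watch and drop the
                 * reference on the watch still used below, so pin it here
                 */
                audit_get_watch(watch);
                ...
                audit_put_watch(watch);   /* drop our extra reference */
                return ret;
        }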

    Signed-off-by: Ronny Chevalier
    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ronny Chevalier
     
  • commit 02e184476eff848273826c1d6617bb37e5bcc7ad upstream.

    Perf can record user stack data in response to a synchronous request, such
    as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
    end up reading user stack data using __copy_from_user_inatomic() under
    set_fs(KERNEL_DS). I think this conflicts with the intention of using
    set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
    when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.

    So fix this by forcing USER_DS when recording user stack data.
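
    A sketch of the fix in the user-stack sampling path (helper names are
    taken from the perf output code and may differ):

        mm_segment_t fs;
        ...
        /* dump the user stack with user access semantics, not KERNEL_DS */
        fs = get_fs();
        set_fs(USER_DS);
        rem = __output_copy_user(handle, (void *)sp, dump_size);
        set_fs(fs);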

    Signed-off-by: Yabin Cui
    Acked-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
    Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yabin Cui
     

20 Sep, 2018

2 commits

  • [ Upstream commit 363e934d8811d799c88faffc5bfca782fd728334 ]

    timer_base::must_forward_clock is indicating that the base clock might be
    stale due to a long idle sleep.

    The forwarding of the base clock takes place in the timer softirq or when a
    timer is enqueued to a base which is idle. If the enqueue of timer to an
    idle base happens from a remote CPU, then the following race can happen:

    CPU0                                   CPU1
    run_timer_softirq                      mod_timer

                                           base = lock_timer_base(timer);
    base->must_forward_clk = false
                                           if (base->must_forward_clk)
                                               forward(base); -> skipped

                                           enqueue_timer(base, timer, idx);
                                           -> idx is calculated high due to
                                              stale base
                                           unlock_timer_base(timer);
    base = lock_timer_base(timer);
    forward(base);

    The root cause is that timer_base::must_forward_clk is cleared outside the
    timer_base::lock held region, so the remote queuing CPU observes it as
    cleared, but the base clock is still stale. This can cause large
    granularity values for timers, i.e. the accuracy of the expiry time
    suffers.

    Prevent this by clearing the flag with timer_base::lock held, so that the
    forwarding takes place before the cleared flag is observable by a remote
    CPU.

    Signed-off-by: Gaurav Kohli
    Signed-off-by: Thomas Gleixner
    Cc: john.stultz@linaro.org
    Cc: sboyd@kernel.org
    Cc: linux-arm-msm@vger.kernel.org
    Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Gaurav Kohli
     
  • commit 69fa6eb7d6a64801ea261025cce9723d9442d773 upstream.

    When a teardown callback fails, the CPU hotplug code brings the CPU back to
    the previous state. The previous state becomes the new target state. The
    rollback happens in undo_cpu_down() which increments the state
    unconditionally even if the state is already the same as the target.

    As a consequence the next CPU hotplug operation will start at the wrong
    state. This is easy to observe when __cpu_disable() fails.

    Prevent the unconditional undo by checking the state vs. target before
    incrementing state and fix up the consequently wrong conditional in the
    unplug code which handles the failure of the final CPU take down on the
    control CPU side.

    Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
    Reported-by: Neeraj Upadhyay
    Signed-off-by: Thomas Gleixner
    Tested-by: Geert Uytterhoeven
    Tested-by: Sudeep Holla
    Tested-by: Neeraj Upadhyay
    Cc: josh@joshtriplett.org
    Cc: peterz@infradead.org
    Cc: jiangshanlai@gmail.com
    Cc: dzickus@redhat.com
    Cc: brendan.jackman@arm.com
    Cc: malat@debian.org
    Cc: sramana@codeaurora.org
    Cc: linux-arm-msm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner