30 Nov, 2017

4 commits

  • commit 4f8413a3a799c958f7a10a6310a451e6b8aef5ad upstream.

    When requesting a shared interrupt, we assume that the firmware
    support code (DT or ACPI) has called irqd_set_trigger_type
    already, so that we can retrieve it and check that the requester
    is being reasonable.

    Unfortunately, we still have non-DT, non-ACPI systems around,
    and these guys won't call irqd_set_trigger_type before requesting
    the interrupt. The consequence is that we fail the request that
    would have worked before.

    We can either chase all these use cases (boring), or address it
    in core code (easier). Let's have a per-irq_desc flag that
    indicates whether irqd_set_trigger_type has been called, and
    let's just check it when checking for a shared interrupt.
    If it hasn't been set, just take whatever the interrupt
    requester asks.
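
    The check described above can be sketched as a small user-space model.
    The struct, the TRIGGER_* constants and the function names are
    hypothetical stand-ins for illustration, not the kernel's definitions;
    only the idea of a "trigger type was set" flag comes from the text.

```c
#include <stdbool.h>

/* Hypothetical trigger encodings for this model only. */
#define TRIGGER_RISING  0x1u
#define TRIGGER_FALLING 0x2u

/* Minimal stand-in for the per-interrupt descriptor state. */
struct irq_desc_model {
    unsigned int trigger_type;
    bool trigger_was_set;   /* has anyone configured a trigger type yet? */
};

static void model_set_trigger_type(struct irq_desc_model *d, unsigned int type)
{
    d->trigger_type = type;
    d->trigger_was_set = true;
}

/* Returns true when a shared requester asking for `requested` is compatible. */
static bool shared_trigger_ok(struct irq_desc_model *d, unsigned int requested)
{
    unsigned int oldtype;

    if (d->trigger_was_set) {
        /* Firmware (or an earlier requester) already chose a type. */
        oldtype = d->trigger_type;
    } else {
        /* Nothing configured yet: take whatever the requester asks. */
        oldtype = requested;
        model_set_trigger_type(d, oldtype);
    }
    return oldtype == requested;
}
```

    In this model the first requester on an unconfigured interrupt always
    succeeds, and later requesters must match the type it established.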

    Fixes: 382bd4de6182 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
    Reported-and-tested-by: Petr Cvek
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. As all CPUs with overloaded tasks must be scanned
    regardless if there's one or many CPUs lowering their priority, because
    there's no current way to find the CPU with the highest priority task that
    can schedule to one of these CPUs, there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process.
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task.
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - The cpu to send the next IPI to, also protected by
                     the rto_lock.

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT task that is
    queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.
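
    The loop described above can be modelled in user space. This is a
    single-threaded sketch under stated assumptions: rto_lock is omitted,
    the CPU mask is a plain bit mask, NR_CPUS_MODEL stands in for
    nr_cpu_ids, and "sending an IPI" is reduced to returning the target
    CPU. The struct and the tell_cpu_to_push_model() and
    rto_next_cpu_model() names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8   /* hypothetical machine size, plays nr_cpu_ids */

struct root_domain_model {
    unsigned int rto_mask;      /* bit n set: CPU n has overloaded RT tasks */
    atomic_int rto_loop_start;  /* non-zero while a CPU kicks off the loop */
    atomic_int rto_loop_next;   /* bumped by each CPU lowering its priority */
    int rto_loop;               /* last rto_loop_next value acted upon */
    int rto_cpu;                /* CPU holding the IPI, >= NR_CPUS_MODEL if none */
};

static bool rto_start_trylock(atomic_int *v)
{
    int expected = 0;
    return atomic_compare_exchange_strong(v, &expected, 1);
}

static void rto_start_unlock(atomic_int *v)
{
    atomic_store(v, 0);
}

/* Next CPU to receive the IPI, or NR_CPUS_MODEL when the loop is finished. */
static int rto_next_cpu_model(struct root_domain_model *rd)
{
    for (;;) {
        int start = (rd->rto_cpu >= NR_CPUS_MODEL) ? 0 : rd->rto_cpu + 1;

        for (int cpu = start; cpu < NR_CPUS_MODEL; cpu++) {
            if (rd->rto_mask & (1u << cpu)) {
                rd->rto_cpu = cpu;
                return cpu;
            }
        }
        rd->rto_cpu = NR_CPUS_MODEL;

        /* End of mask: restart only if a CPU lowered its priority meanwhile. */
        int next = atomic_load(&rd->rto_loop_next);
        if (rd->rto_loop == next)
            return NR_CPUS_MODEL;
        rd->rto_loop = next;
    }
}

/* Called by a CPU that schedules in a lower priority task. */
static int tell_cpu_to_push_model(struct root_domain_model *rd)
{
    int cpu = NR_CPUS_MODEL;

    atomic_fetch_add(&rd->rto_loop_next, 1);
    if (!rto_start_trylock(&rd->rto_loop_start))
        return NR_CPUS_MODEL;   /* another CPU is kicking off the loop */

    if (rd->rto_cpu >= NR_CPUS_MODEL)   /* no IPI currently in flight */
        cpu = rto_next_cpu_model(rd);

    rto_start_unlock(&rd->rto_loop_start);
    return cpu;   /* the caller would send the IPI to this CPU */
}
```

    In this model a second CPU that lowers its priority while the IPI is
    in flight only bumps rto_loop_next; the mismatch with rto_loop at the
    end of the mask is what makes the pass start over.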

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o   CPU1 is waiting for expedited wait to complete:

        sync_rcu_exp_select_cpus
            rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
            IPI sent to CPU5

        synchronize_sched_expedited_wait
            ret = swait_event_timeout(rsp->expedited_wq,
                                      sync_rcu_preempt_exp_done(rnp_root),
                                      jiffies_stall);

        expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o   CPU5 handles IPI and fails to acquire rq lock.

        Handles IPI
            sync_sched_exp_handler
                resched_cpu
                    returns while failing to try lock acquire rq->lock
                need_resched is not set

    o   CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
        idle (schedule() is not called).

    o   CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.

    Reported-by: Neeraj Upadhyay
    Suggested-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.

    'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
      freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
      1.2 GHz.
    - Now if the utilization doesn't change in the next request, then the
      next target frequency will still be 780 MHz and it will match with
      cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
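
    The scenario above can be replayed with a small user-space model. The
    struct, the three-entry frequency table and the function names are
    hypothetical; the reset_cache switch exists only to compare the
    behaviour with and without resetting the cached value, which is the
    remedy the message describes.

```c
#include <stdbool.h>

/* Hypothetical model of the governor state, frequencies in MHz. */
struct sugov_policy_model {
    unsigned int cached_raw_freq;   /* last raw request seen */
    unsigned int next_freq;         /* frequency chosen last time */
};

/* Round a raw request up to the nearest entry of a made-up table. */
static unsigned int resolve_freq_model(unsigned int raw)
{
    static const unsigned int table[] = { 400, 800, 1200 };

    for (unsigned int i = 0; i < 3; i++)
        if (table[i] >= raw)
            return table[i];
    return table[2];
}

static unsigned int get_next_freq_model(struct sugov_policy_model *sg,
                                        unsigned int raw)
{
    /* Fast path: same raw request as last time, reuse the last choice. */
    if (raw == sg->cached_raw_freq && sg->next_freq != 0)
        return sg->next_freq;

    sg->cached_raw_freq = raw;
    return resolve_freq_model(raw);
}

static unsigned int update_model(struct sugov_policy_model *sg,
                                 unsigned int raw, bool busy, bool reset_cache)
{
    unsigned int next = get_next_freq_model(sg, raw);

    if (busy && next < sg->next_freq) {
        /* CPU was not idle recently: do not reduce the frequency. */
        next = sg->next_freq;
        if (reset_cache)
            sg->cached_raw_freq = 0;   /* cache no longer matches next_freq */
    }
    sg->next_freq = next;
    return next;
}
```

    Without the reset, a second 780 MHz request on a no-longer-busy CPU
    hits the fast path and returns the stale 1200 MHz; with the reset it
    is resolved again and yields 800 MHz.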

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

24 Nov, 2017

1 commit

  • commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.

    The pending-callbacks check in rcu_prepare_for_idle() is backwards.
    It should accelerate if there are pending callbacks, but the check
    rather uselessly accelerates only if there are no callbacks. This commit
    therefore inverts this check.

    Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Neeraj Upadhyay
     

10 Nov, 2017

1 commit

  • Pull final power management fixes from Rafael Wysocki:
    "These fix a regression in the schedutil cpufreq governor introduced by
    a recent change and blacklist Dell XPS13 9360 from using the Low Power
    S0 Idle _DSM interface which triggers serious problems on one of these
    machines.

    Specifics:

    - Prevent the schedutil cpufreq governor from using the utilization
    of a wrong CPU in some cases which started to happen after one of
    the recent changes in it (Chris Redpath).

    - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
    interface as that causes serious issue (related to NVMe) to appear
    on one of these machines, even though the other Dells XPS13 9360 in
    somewhat different HW configurations behave correctly (Rafael
    Wysocki)"

    * tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
    cpufreq: schedutil: Examine the correct CPU when we update util

    Linus Torvalds
     

09 Nov, 2017

1 commit


07 Nov, 2017

1 commit

  • Pull workqueue fix from Tejun Heo:
    "Another fix for a really old bug.

    It only affects drain_workqueue() which isn't used often and even then
    triggers only during a pretty small race window, so it isn't too
    surprising that it stayed hidden for so long.

    The fix is straight-forward and low-risk. Kudos to Li Bin for
    reporting and fixing the bug"

    * 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Fix NULL pointer dereference

    Linus Torvalds
     

06 Nov, 2017

1 commit


05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we no longer always read the utilization for the CPU we
    are running the governor on; instead we read it for the CPU
    which we've been told has updated utilization. This is stored in
    sugov_cpu->cpu.

    The value is set in sugov_register() but we clear it in sugov_start()
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().
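
    A minimal user-space sketch of the ordering problem, assuming the
    clearing is a memset() of the per-CPU structure. The struct and the
    model_* function names are hypothetical and only mirror the roles of
    the register and start steps named above.

```c
#include <string.h>

#define NR_CPUS_MODEL 4   /* hypothetical number of CPUs in the policy */

struct sugov_cpu_model {
    unsigned int cpu;     /* which CPU's utilization to read */
    unsigned long util;
};

/* Register step: records the CPU number once. */
static void model_register(struct sugov_cpu_model *cpus)
{
    for (unsigned int i = 0; i < NR_CPUS_MODEL; i++)
        cpus[i].cpu = i;
}

/* Start step before the fix: wipes the structure, losing .cpu. */
static void model_start_buggy(struct sugov_cpu_model *cpus)
{
    for (unsigned int i = 0; i < NR_CPUS_MODEL; i++)
        memset(&cpus[i], 0, sizeof(cpus[i]));
}

/* Start step after the fix: all initialization happens here. */
static void model_start_fixed(struct sugov_cpu_model *cpus)
{
    for (unsigned int i = 0; i < NR_CPUS_MODEL; i++) {
        memset(&cpus[i], 0, sizeof(cpus[i]));
        cpus[i].cpu = i;
    }
}
```

    After model_start_buggy() every entry reports CPU 0, which matches the
    symptom described in the message.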

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

04 Nov, 2017

1 commit


03 Nov, 2017

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …el/git/gregkh/driver-core

    Pull initial SPDX identifiers from Greg KH:
    "License cleanup: add SPDX license identifiers to some files

    Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boiler plate
    text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:

    - file had no licensing information in it.

    - file was a */uapi/* one with no licensing information in it,

    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side by side results from
    the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files created by Philippe Ombredanne.
    Philippe prepared the base worksheet, and did an initial spot review
    of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537
    files assessed. Kate Stewart did a file by file comparison of the
    scanner results in the spreadsheet to determine which SPDX license
    identifier(s) to be applied to the file. She confirmed any
    determination that was not immediately clear with lawyers working with
    the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:

    - Files considered eligible had to be source code files.

    - Make and config files were included as candidates if they contained
    >5 lines of source

    - File already had some variant of a license header in it (even if <5
    lines).

    All documentation files were explicitly excluded.

    The following heuristics were used to determine which SPDX license
    identifiers to apply.

    - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.

    For non */uapi/* files that summary was:

    SPDX license identifier                             # files
    ---------------------------------------------------|-------
    GPL-2.0                                               11139

    and resulted in the first patch in this series.

    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
    was:

    SPDX license identifier                             # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                         930

    and resulted in the second patch in this series.

    - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point). Results summary:

    SPDX license identifier                             # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                         270
    GPL-2.0+ WITH Linux-syscall-note                        169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
    LGPL-2.1+ WITH Linux-syscall-note                        15
    GPL-1.0+ WITH Linux-syscall-note                         14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
    LGPL-2.0+ WITH Linux-syscall-note                         4
    LGPL-2.1 WITH Linux-syscall-note                          3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1

    and that resulted in the third patch in this series.

    - when the two scanners agreed on the detected license(s), that
    became the concluded license(s).

    - when there was disagreement between the two scanners (one detected
    a license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.

    - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply
    (and which scanner probably needed to revisit its heuristics).

    - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.

    - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.

    In total, over 70 hours of logged manual review was done on the
    spreadsheet to determine the SPDX license identifiers to apply to the
    source files by Kate, Philippe, Thomas and, in some cases,
    confirmation by lawyers working with the Linux Foundation.

    Kate also obtained a third independent scan of the 4.13 code base from
    FOSSology, and compared selected files where the other two scanners
    disagreed against that SPDX file, to see if there were new insights.
    The Windriver scanner is based on an older version of FOSSology in
    part, so they are related.

    Thomas did random spot checks in about 500 files from the spreadsheets
    for the uapi headers and agreed with SPDX license identifier in the
    files he inspected. For the non-uapi files Thomas did random spot
    checks in about 15000 files.

    In the initial set of patches against 4.14-rc6, 3 files were found to have
    copy/paste license identifier errors, and have been fixed to reflect
    the correct identifier.

    Additionally Philippe spent 10 hours this week doing a detailed manual
    inspection and review of the 12,461 patched files from the initial
    patch version early this week with:

    - a full scancode scan run, collecting the matched texts, detected
    license ids and scores

    - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct

    - reviewing anything where there was no detection but the patch
    license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
    applied SPDX license was correct

    This produced a worksheet with 20 files needing minor correction. This
    worksheet was then exported into 3 different .csv files for the
    different types of files to be modified.

    These .csv files were then reviewed by Greg. Thomas wrote a script to
    parse the csv files and add the proper SPDX tag to the file, in the
    format that the file expected. This script was further refined by Greg
    based on the output to detect more types of files automatically and to
    distinguish between header and source .c files (which need different
    comment types.) Finally Greg ran the script using the .csv files to
    generate the patches.

    Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
    Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

    * tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    License cleanup: add SPDX license identifier to uapi header files with a license
    License cleanup: add SPDX license identifier to uapi header files with no license
    License cleanup: add SPDX GPL-2.0 license identifier to files with no license

    Linus Torvalds
     

02 Nov, 2017

6 commits

  • In commit 30d6e0a4190d ("futex: Remove duplicated code and fix undefined
    behaviour"), I made FUTEX_WAKE_OP fail on an invalid op, namely when oparg
    should be treated as a shift and the shift is out of range (< 0 or > 31).

    But strace's test suite does this madness:

    futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
    futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
    futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);

    When I pick the first 0xa0caffee, it decodes as:

    0x80000000 & 0xa0caffee: oparg is shift
    0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
    0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
    0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
    0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18

    That means the op tries to do this:

    (futex |= (1 << (-849))) == -18

    which is completely bogus. The new check of op in the code is:

    if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
            if (oparg < 0 || oparg > 31)
                    return -EINVAL;
            oparg = 1 << oparg;
    }

    which results obviously in the "Invalid argument" errno:

    FAIL: futex
    ===========

    futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
    futex.test: failed test: ../futex failed with code 1

    So let us soften the failure to print only a (ratelimited) message, crop
    the value and continue as if it were right. When userspace keeps up, we
    can switch this to return -EINVAL again.
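
    The decoding walked through above can be reproduced with a short
    user-space sketch. The struct and helper names are hypothetical, and
    crop_shift() assumes that "cropping" means masking the shift count
    with 31; the message does not spell out the exact operation.

```c
#include <stdint.h>

/* Same value as FUTEX_OP_OPARG_SHIFT in the futex UAPI header; unsigned
 * here so that moving it into bit 31 is well defined in user space. */
#define OPARG_SHIFT_FLAG 8u

struct futex_op_fields {
    int is_shift;   /* bit 31: oparg is a shift count */
    int op;         /* bits 28..30 */
    int cmp;        /* bits 24..27 */
    int oparg;      /* bits 12..23, sign-extended */
    int cmparg;     /* bits 0..11, sign-extended */
};

/* Sign-extend the low 12 bits of v. */
static int sign_extend12(uint32_t v)
{
    v &= 0xfffu;
    return (v & 0x800u) ? (int)v - 0x1000 : (int)v;
}

static struct futex_op_fields decode_futex_op(uint32_t encoded_op)
{
    struct futex_op_fields f;

    f.is_shift = (encoded_op & (OPARG_SHIFT_FLAG << 28)) != 0;
    f.op = (int)((encoded_op >> 28) & 0x7u);
    f.cmp = (int)((encoded_op >> 24) & 0xfu);
    f.oparg = sign_extend12(encoded_op >> 12);
    f.cmparg = sign_extend12(encoded_op);
    return f;
}

/* Assumed cropping rule: keep only the low five bits of the shift count. */
static unsigned int crop_shift(int oparg)
{
    return (unsigned int)oparg & 31u;
}
```

    For 0xa0caffee this yields oparg = -849 and cmparg = -18, and under
    the masking assumption the shift count becomes 15.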

    [v2] Do not return 0 immediately, proceed with the cropped value.

    Fixes: 30d6e0a4190d ("futex: Remove duplicated code and fix undefined behaviour")
    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • Pull signal bugfix from Eric Biederman:
    "When making the generic support for SIGEMT conditional on the presence
    of SIGEMT I made a typo that causes it to fail to activate. It was
    noticed comparatively quickly but the bug report just made it to me
    today"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    signal: Fix name of SIGEMT in #if defined() check

    Linus Torvalds
     
  • Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
    added a check for SIGMET and NSIGEMT being defined. That SIGMET should
    in fact be SIGEMT, with SIGEMT being defined in
    arch/{alpha,mips,sparc}/include/uapi/asm/signal.h

    This was actually pointed out by Ben Hutchings in a lwn.net comment
    here: https://lwn.net/Comments/734608/

    Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
    Signed-off-by: Andrew Clayton
    Signed-off-by: "Eric W. Biederman"

    Andrew Clayton
     
  • Guenter reported:
    There is still a problem. When running
    echo 6 > /proc/sys/kernel/watchdog_thresh
    echo 5 > /proc/sys/kernel/watchdog_thresh
    repeatedly, the message

    NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

    stops after a while (after ~10-30 iterations, with fluctuations).
    Maybe watchdog_cpus needs to be atomic ?

    That's correct as this again is affected by the asynchronous nature of the
    smpboot thread unpark mechanism.

    CPU 0                     CPU1                    CPU2
    write(watchdog_thresh, 6)
      stop()
        park()
      update()
      start()
        unpark()
                              thread->unpark()
                                cnt++;
    write(watchdog_thresh, 5)                         thread->unpark()
      stop()
        park()                thread->park()
                                cnt--;                  cnt++;
      update()
      start()
        unpark()

    That's not a functional problem, it just affects the informational message.

    Convert watchdog_cpus to atomic_t to prevent the problem

    Reported-and-tested-by: Guenter Roeck
    Signed-off-by: Don Zickus
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com

    Don Zickus
     
  • …fy deferred event destroy")

    Guenter reported a crash in the watchdog/perf code, which is caused by
    cleanup() and enable() running concurrently. The reason for this is:

    The watchdog functions are serialized via the watchdog_mutex and cpu
    hotplug locking, but the enable of the perf based watchdog happens in
    context of the unpark callback of the smpboot thread. But that unpark
    function is not synchronous inside the locking. The unparking of the thread
    just wakes it up and leaves so there is no guarantee when the thread is
    executing.

    If it starts running _before_ the cleanup happened then it will create an
    event and overwrite the dead event pointer. The new event is then cleaned
    up because the event is marked dead.

    lock(watchdog_mutex);
    lockup_detector_reconfigure();
        cpus_read_lock();
        stop();
           park()
        update();
        start();
           unpark()
        cpus_read_unlock();             thread runs()
                                          overwrite dead event ptr
        cleanup();
           free new event, which is active inside perf....
    unlock(watchdog_mutex);

    The park side is safe as that actually waits for the thread to reach
    parked state.

    Commit a33d44843d45 removed the protection against this kind of scenario
    under the stupid assumption that the hotplug serialization and the
    watchdog_mutex cover everything.

    Bring it back.

    Reverts: a33d44843d45 ("watchdog/hardlockup/perf: Simplify deferred event destroy")
    Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
    Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Don Zickus <dzickus@redhat.com>
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos

    Thomas Gleixner
     

01 Nov, 2017

2 commits

  • Dmitry (through syzbot) reported being able to trigger the WARN in
    get_pi_state() and a use-after-free on:

    raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);

    Both are due to this race:

    exit_pi_state_list()                        put_pi_state()

    lock(&curr->pi_lock)
    while() {
        pi_state = list_first_entry(head);
        hb = hash_futex(&pi_state->key);
        unlock(&curr->pi_lock);

                                                dec_and_test(&pi_state->refcount);

        lock(&hb->lock)
        lock(&pi_state->pi_mutex.wait_lock)     // uaf if pi_state free'd
        lock(&curr->pi_lock);

        ....

        unlock(&curr->pi_lock);
        get_pi_state();                         // WARN; refcount==0

    The problem is we take the reference count too late, and don't allow it
    being 0. Fix it by using inc_not_zero() and simply retrying the loop
    when we fail to get a refcount. In that case put_pi_state() should
    remove the entry from the list.
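
    The "increment unless zero" pattern can be sketched with C11 atomics.
    This is a user-space illustration, not the kernel's refcount
    implementation; ref_inc_not_zero() and ref_dec_and_test() are
    hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count != 0). */
static bool ref_inc_not_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;   /* count already hit zero: the object is being freed */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool ref_dec_and_test(atomic_int *ref)
{
    return atomic_fetch_sub(ref, 1) == 1;
}
```

    A caller walking a list would skip and retry when ref_inc_not_zero()
    returns false, instead of touching an object whose last reference was
    just dropped.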

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Gratian Crisan
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: dvhart@infradead.org
    Cc: syzbot
    Cc: syzkaller-bugs@googlegroups.com
    Cc:
    Fixes: c74aef2d06a9 ("futex: Fix pi_state->owner serialization")
    Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that SK_REDIRECT is no longer a valid return code, remove it
    from the UAPI completely. Then do a namespace remapping internal
    to sockmap so SK_REDIRECT is no longer externally visible.

    The patch's primary change is to rename SK_REDIRECT to
    __SK_REDIRECT.

    Reported-by: Alexei Starovoitov
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

30 Oct, 2017

2 commits

  • When queue_work() is used in irq context (not in task context), there is
    a potential case that triggers a NULL pointer dereference.
    ----------------------------------------------------------------
    worker_thread()
    |-spin_lock_irq()
    |-process_one_work()
        |-worker->current_pwq = pwq
        |-spin_unlock_irq()
        |-worker->current_func(work)
        |-spin_lock_irq()
        |-worker->current_pwq = NULL
        |-spin_unlock_irq()

                //interrupt here
                |-irq_handler
                    |-__queue_work()
                        //assuming that the wq is draining
                        |-is_chained_work(wq)
                            |-current_wq_worker()
                            //Here, 'current' is the interrupted worker!
                                |-current->current_pwq is NULL here!
    |-schedule()
    ----------------------------------------------------------------

    Avoid it by checking for task context in current_wq_worker(), and
    if not in task context, we shouldn't use the 'current' to check the
    condition.
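
    A user-space model of the guard, for illustration only. The *_model
    types, the flag value and the explicit in_task_context parameter are
    hypothetical; in the kernel the context would be queried rather than
    passed in.

```c
#include <stdbool.h>
#include <stddef.h>

#define WQ_WORKER_FLAG_MODEL 0x20u   /* hypothetical "task is a worker" flag */

struct pool_workqueue_model { int wq_id; };
struct worker_model { struct pool_workqueue_model *current_pwq; };
struct task_model {
    unsigned int flags;
    struct worker_model *worker;
};

/* Only trust the current task when we really run in task context. */
static struct worker_model *
current_wq_worker_model(const struct task_model *task, bool in_task_context)
{
    if (in_task_context && (task->flags & WQ_WORKER_FLAG_MODEL))
        return task->worker;
    return NULL;
}

/* Is the work being queued from a work item of the same workqueue? */
static bool is_chained_work_model(const struct task_model *task,
                                  bool in_task_context, int wq_id)
{
    struct worker_model *w = current_wq_worker_model(task, in_task_context);

    /* w is NULL in interrupt context, so current_pwq is never touched. */
    return w && w->current_pwq->wq_id == wq_id;
}
```

    With the guard, an interrupt that lands while the worker's current_pwq
    is NULL gets "not chained" instead of dereferencing the NULL pointer.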

    Reported-by: Xiaofei Tan
    Signed-off-by: Li Bin
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
    Cc: stable@vger.kernel.org # v3.9+

    Li Bin
     
  • The following commit:

    864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

    made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
    targets %current's cgroup.

    This breaks perf_event's hierarchical support because events which target one
    of the ancestors get ignored.

    Fix it by using cgroup_is_descendant() test instead of equality.
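
    The difference between the equality test and a descendant test can be
    shown on a toy hierarchy. The struct and is_descendant_model() are
    hypothetical; the sketch assumes, as is usual for such helpers, that a
    group counts as its own descendant.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup: only the parent link matters for this illustration. */
struct cgroup_model {
    const struct cgroup_model *parent;
};

/* True if cgrp is ancestor itself or lies somewhere below it. */
static bool is_descendant_model(const struct cgroup_model *cgrp,
                                const struct cgroup_model *ancestor)
{
    for (; cgrp != NULL; cgrp = cgrp->parent)
        if (cgrp == ancestor)
            return true;
    return false;
}
```

    With a chain root -> a -> b, an event targeting "a" must also count a
    task running in "b"; an equality test (b == a) misses it, while the
    walk up the parent links finds it.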

    Signed-off-by: Tejun Heo
    Acked-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: kernel-team@fb.com
    Cc: stable@vger.kernel.org # v4.9+
    Fixes: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
    Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

29 Oct, 2017

3 commits

  • Pull networking fixes from David Miller:

    1) Fix route leak in xfrm_bundle_create().

    2) In mac80211, validate user rate mask before configuring it. From
    Johannes Berg.

    3) Properly enforce memory limits in fair queueing code, from Toke
    Hoiland-Jorgensen.

    4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet.

    5) Fix TSO header allocation and management in mvpp2 driver, from Yan
    Markman.

    6) Don't take socket lock in BH handler in strparser code, from Tom
    Herbert.

    7) Don't show sockets from other namespaces in AF_UNIX code, from
    Andrei Vagin.

    8) Fix double free in error path of tap_open(), from Girish Moodalbail.

    9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker
    and Alexander Duyck.

    10) Fix DCB mode programming in stmmac driver, from Jose Abreu.

    11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin
    Long.

    12) Properly align SKB head before building SKB in tuntap, from Jason
    Wang.

    13) Avoid matching qdiscs with a zero handle during lookups, from Cong
    Wang.

    14) Fix various endianness bugs in sctp, from Xin Long.

    15) Fix tc filter callback races and add selftests which trigger the
    problem, from Cong Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
    selftests: Introduce a new test case to tc testsuite
    selftests: Introduce a new script to generate tc batch file
    net_sched: fix call_rcu() race on act_sample module removal
    net_sched: add rtnl assertion to tcf_exts_destroy()
    net_sched: use tcf_queue_work() in tcindex filter
    net_sched: use tcf_queue_work() in rsvp filter
    net_sched: use tcf_queue_work() in route filter
    net_sched: use tcf_queue_work() in u32 filter
    net_sched: use tcf_queue_work() in matchall filter
    net_sched: use tcf_queue_work() in fw filter
    net_sched: use tcf_queue_work() in flower filter
    net_sched: use tcf_queue_work() in flow filter
    net_sched: use tcf_queue_work() in cgroup filter
    net_sched: use tcf_queue_work() in bpf filter
    net_sched: use tcf_queue_work() in basic filter
    net_sched: introduce a workqueue for RCU callbacks of tc filter
    sctp: fix some type cast warnings introduced since very beginning
    sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
    sctp: fix some type cast warnings introduced by transport rhashtable
    sctp: fix some type cast warnings introduced by stream reconf
    ...

    Linus Torvalds
     
  • Recent additions to support multiple programs in cgroups impose
    a strict requirement, "all yes is yes, any no is no". To enforce
    this the infrastructure requires the 'no' return code, SK_DROP in
    this case, to be 0.

    To apply these rules to SK_SKB program types the sk_actions return
    codes need to be adjusted.

    This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove
    SK_ABORTED to remove any chance that the API may allow aborted
    program flows to be passed up the stack. This would be incorrect
    behavior and allow programs to break existing policies.
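
Why the 'no' code has to be 0 can be seen in a small sketch. It assumes, as an illustration only, that the verdicts of all attached programs are combined with a bitwise AND; the enum and function names below are hypothetical and not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the adjusted return codes: 'no' is 0, 'yes' is 1. */
enum sk_verdict_model { SK_DROP_MODEL = 0, SK_PASS_MODEL = 1 };

typedef enum sk_verdict_model (*prog_fn)(int pkt);

/* "All yes is yes, any no is no": AND the verdicts of every program. */
static enum sk_verdict_model run_all(const prog_fn *progs, size_t n, int pkt)
{
    unsigned int ret = 1;

    for (size_t i = 0; i < n; i++)
        ret &= (unsigned int)progs[i](pkt);
    return ret ? SK_PASS_MODEL : SK_DROP_MODEL;
}

/* Two toy programs used to exercise the combination rule. */
static enum sk_verdict_model allow_all(int pkt)
{
    (void)pkt;
    return SK_PASS_MODEL;
}

static enum sk_verdict_model drop_odd(int pkt)
{
    return (pkt & 1) ? SK_DROP_MODEL : SK_PASS_MODEL;
}
```

If the 'no' code were non-zero, a single dropping program could be masked by the AND, which is the case the renumbering rules out.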

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • SK_SKB program types use bpf_compute_data to store the end of the
    packet data. However, bpf_compute_data assumes the cb is stored in the
    qdisc layer format. But, for SK_SKB this is the wrong layer of the
    stack for this type.

    It happens to work (sort of!) because in most cases nothing happens
    to be overwritten today. This is very fragile and error prone.
    Fortunately, we have another hole in tcp_skb_cb we can use, so let's
    put the data_end value there.

    Note, SK_SKB program types do not use data_meta; accesses to it are
    rejected by sk_skb_is_valid_access().

    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

23 Oct, 2017

1 commit

  • Pull workqueue fix from Tejun Heo:
    "This is a fix for an old bug in workqueue. Workqueue used a mutex to
    arbitrate who gets to be the manager of a pool. When the manager role
    gets released, the mutex gets unlocked while holding the pool's
    irqsafe spinlock. This can lead to deadlocks as mutex's internal
    spinlock isn't irqsafe. This got discovered by recent fixes to mutex
    lockdep annotations.

    The fix is a bit invasive for rc6 but if anything were wrong with the
    fix it would likely have already blown up in -next, and we want the
    fix in -stable anyway"

    * 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: replace pool->manager_arb mutex with a flag
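
Replacing the mutex with a flag can be sketched as follows. This is a single-threaded model under stated assumptions: the caller is assumed to already hold the pool's spinlock, and `struct pool_model`, `POOL_MANAGER_ACTIVE_MODEL` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_MANAGER_ACTIVE_MODEL 0x1u /* hypothetical flag bit */

struct pool_model {
    unsigned int flags; /* assumed to be protected by the pool's spinlock */
};

/* Caller is assumed to hold the pool lock; no mutex is touched. */
static bool try_become_manager(struct pool_model *pool)
{
    if (pool->flags & POOL_MANAGER_ACTIVE_MODEL)
        return false; /* somebody else is already managing the pool */
    pool->flags |= POOL_MANAGER_ACTIVE_MODEL;
    return true;
}

/* Releasing the role is a plain bit clear, which is safe under a spinlock. */
static void release_manager(struct pool_model *pool)
{
    pool->flags &= ~POOL_MANAGER_ACTIVE_MODEL;
}
```

Because the role is released by clearing a bit rather than unlocking a mutex, no non-irqsafe lock is taken while the pool's spinlock is held.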

    Linus Torvalds
     

22 Oct, 2017

6 commits

  • Pull smp/hotplug fix from Thomas Gleixner:
    "The recent rework of the callback invocation failed to clean up the
    leftovers of the operation, so under certain circumstances a
    subsequent CPU hotplug operation accesses stale data and crashes.
    Clean it up."

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Reset node state after operation

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "A set of small fixes mostly in the irq drivers area:

    - Make the tango irq chip work correctly, which requires a new
    function in the generic irq chip implementation

    - A set of updates to the GIC-V3 ITS driver removing a bogus BUG_ON()
    and parsing the VCPU table size correctly"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack()
    irqchip/tango: Use irq_gc_mask_disable_and_ack_set
    genirq: generic chip: Add irq_gc_mask_disable_and_ack_set()
    irqchip/gic-v3-its: Add missing changes to support 52bit physical address
    irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size
    irqchip/gic-v3-its: Fix the incorrect BUG_ON in its_init_vpe_domain()
    DT: arm,gic-v3: Update the ITS size in the examples

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A little more than usual this time around. Been travelling, so that is
    part of it.

    Anyways, here are the highlights:

    1) Deal with memcontrol races wrt. listener dismantle, from Eric
    Dumazet.

    2) Handle page allocation failures properly in nfp driver, from Jakub
    Kicinski.

    3) Fix memory leaks in macsec, from Sabrina Dubroca.

    4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.

    5) Several fixes in bnxt_en driver, including preventing potential
    NVRAM parameter corruption from Michael Chan.

    6) Fix for KRACK attacks in wireless, from Johannes Berg.

    7) rtnetlink event generation fixes from Xin Long.

    8) Deadlock in mlxsw driver, from Ido Schimmel.

    9) Disallow arithmetic operations on context pointers in bpf, from
    Jakub Kicinski.

    10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
    Xin Long.

    11) Only TCP is supported for sockmap, make that explicit with a
    check, from John Fastabend.

    12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.

    13) Fix panic in packet_getsockopt(), also from Eric Dumazet.

    14) Add missing locking in hv_sock layer, from Dexuan Cui.

    15) Various aquantia bug fixes, including several statistics handling
    cures. From Igor Russkikh et al.

    16) Fix arithmetic overflow in devmap code, from John Fastabend.

    17) Fix busted socket memory accounting when we get a fault in the tcp
    zero copy paths. From Willem de Bruijn.

    18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
    stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
    ipv6: flowlabel: do not leave opt->tot_len with garbage
    of_mdio: Fix broken PHY IRQ in case of probe deferral
    textsearch: fix typos in library helpers
    rxrpc: Don't release call mutex on error pointer
    net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
    net: stmmac: Fix stmmac_get_rx_hwtstamp()
    net: stmmac: Add missing call to dev_kfree_skb()
    mlxsw: spectrum_router: Configure TIGCR on init
    mlxsw: reg: Add Tunneling IPinIP General Configuration Register
    net: ethtool: remove error check for legacy setting transceiver type
    soreuseport: fix initialization race
    net: bridge: fix returning of vlan range op errors
    sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
    bpf: add test cases to bpf selftests to cover all access tests
    bpf: fix pattern matches for direct packet access
    bpf: fix off by one for range markings with L{T, E} patterns
    bpf: devmap fix arithmetic overflow in bitmap_size calculation
    net: aquantia: Bad udp rate on default interrupt coalescing
    net: aquantia: Enable coalescing management via ethtool interface
    ...

    Linus Torvalds
     
  • Alexander had a test program with direct packet access, where
    the access test was in the form of data + X > data_end. In an
    unrelated change to the program LLVM decided to swap the branches
    and emitted code for the test in form of data + X <= data_end.
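
The two branch layouts of the same bounds check can be sketched in plain C. This assumes the swapped form is the logical negation of `data + X > data_end`, that is `data + X <= data_end`; the function names are hypothetical, and `x` must keep `data + x` inside the underlying buffer for the pointer arithmetic to be valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Form 1: bail out when the access would run past the end of the packet. */
static bool can_read_form_gt(const unsigned char *data,
                             const unsigned char *data_end, size_t x)
{
    if (data + x > data_end)
        return false; /* handle exception */
    return true;      /* access okay */
}

/* Form 2: the same test with the branches swapped. */
static bool can_read_form_le(const unsigned char *data,
                             const unsigned char *data_end, size_t x)
{
    if (data + x <= data_end)
        return true;  /* access okay */
    return false;     /* handle exception */
}
```

Both forms accept and reject exactly the same accesses, so a checker that recognizes only one of them rejects valid programs.
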
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • During review I noticed that the current logic for direct packet
    access marking in check_cond_jmp_op() has an off by one for the
    upper right range border when marking in find_good_pkt_pointers()
    with BPF_JLT and BPF_JLE. It's not really harmful given access
    up to pkt_end is always safe, but we should nevertheless correct
    the range marking before it becomes ABI. If pkt_data' denotes a
    pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end
    in the true branch as well as for pkt_end < pkt_end the verifier simulation cannot
    deduce that a byte load of pkt_data' - 1 would succeed in this
    branch.

    Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • An integer overflow is possible in dev_map_bitmap_size() when
    evaluating BITS_TO_LONGS(), which becomes, after macro
    replacement,

    (((n) + (d) - 1)/ (d))

    where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid the
    overflow, cast to u64 before the arithmetic.
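
A worked example of the wraparound, as a userspace sketch. It models a build where the whole expression is evaluated in 32 bits by fixing the divisor at 32; `BITS_PER_WORD_MODEL` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_WORD_MODEL 32u /* models a build where long is 32 bits */

/* Round-up division done entirely in 32 bits: n + d - 1 can wrap to a small value. */
static uint32_t words_32bit(uint32_t n)
{
    uint32_t sum = (uint32_t)(n + BITS_PER_WORD_MODEL - 1u);

    return sum / BITS_PER_WORD_MODEL;
}

/* Same computation after widening n to 64 bits first: no wraparound. */
static uint64_t words_64bit(uint32_t n)
{
    return ((uint64_t)n + BITS_PER_WORD_MODEL - 1u) / BITS_PER_WORD_MODEL;
}
```

For n = UINT32_MAX the 32-bit version yields 0 words, while the widened version yields 134217728, which is why the cast has to happen before the addition.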

    Reported-by: Richard Weinberger
    Acked-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

21 Oct, 2017

2 commits

  • The recent rework of the cpu hotplug internals changed the usage of the per
    cpu state->node field, but failed to clean it up after usage.

    So subsequent hotplug operations use the stale pointer from a previous
    operation and hand it into the callback functions. The callbacks then
    dereference a pointer which either belongs to a different facility or
    points to freed and potentially reused memory. In either case data
    corruption and crashes are the obvious consequence.

    Reset the node and the last pointers in the per cpu state to NULL after the
    operation which set them has completed.
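
The pattern of the fix can be sketched like this. It is a simplified model, not the hotplug code: `struct cpu_state_model`, `invoke_multi_instance()` and `report_id()` are hypothetical, and the real meaning of the `last` pointer is not modelled beyond being reset.

```c
#include <assert.h>
#include <stddef.h>

struct node_model { int id; };

/* Hypothetical per-cpu state: pointers that are valid only during one operation. */
struct cpu_state_model {
    struct node_model *node;
    struct node_model *last;
};

typedef int (*callback_fn)(struct node_model *node);

/* Run one callback with 'node', then drop the pointers again. */
static int invoke_multi_instance(struct cpu_state_model *st,
                                 struct node_model *node, callback_fn cb)
{
    int ret;

    st->node = node;
    ret = cb(st->node);

    /* The fix: leave nothing behind for the next operation to pick up. */
    st->node = NULL;
    st->last = NULL;
    return ret;
}

/* Toy callback used to exercise the helper. */
static int report_id(struct node_model *node)
{
    return node->id;
}
```

After the call returns, a later operation starts from NULL pointers instead of inheriting a node that may already have been freed.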

    Fixes: 96abb968549c ("smp/hotplug: Allow external multi-instance rollback")
    Reported-by: Tvrtko Ursulin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Boris Ostrovsky
    Cc: "Paul E. McKenney"
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710211606130.3213@nanos

    Thomas Gleixner
     
  • As pointed out by Linus and David, the earlier waitid() fix resulted in
    a (currently harmless) unbalanced user_access_end() call. This fixes it
    to just directly return EFAULT on access_ok() failure.
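
The balance requirement can be illustrated with a counter-based sketch. The `_model` helpers below are hypothetical stand-ins that only count how many user-access windows are open; they do not touch user memory.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int access_depth; /* number of open user-access windows in this model */

static void user_access_begin_model(void) { access_depth++; }
static void user_access_end_model(void)   { access_depth--; }

/* 'ok' stands in for the result of the access_ok() check. */
static int copy_info_model(bool ok)
{
    if (!ok)
        return -EFAULT; /* nothing was opened, so there is nothing to close */

    user_access_begin_model();
    /* ... the stores to user memory would go here ... */
    user_access_end_model();
    return 0;
}
```

Returning directly on the failed check keeps every end call paired with a begin call, so the depth stays at zero on both paths.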

    Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks")
    Acked-by: David Daney
    Cc: Al Viro
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Oct, 2017

5 commits

  • Devmap is used with XDP, which requires CAP_NET_ADMIN, so let's also
    make CAP_NET_ADMIN required to use the map.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Restrict sockmap to CAP_NET_ADMIN.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • SK_SKB BPF programs are run from the socket/tcp context but early in
    the stack before much of the TCP metadata is needed in tcp_skb_cb. So
    we can use some unused fields to place BPF metadata needed for SK_SKB
    programs when implementing the redirect function.

    This allows us to drop the preempt disable logic. It does however
    require an API change so sk_redirect_map() has been updated to
    additionally provide ctx_ptr to skb. Note, we do however continue to
    disable/enable preemption around actual BPF program running to account
    for map updates.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Only TCP sockets have been tested and at the moment the state change
    callback only handles TCP sockets. This adds a check to ensure that
    sockets actually being added are TCP sockets.

    For net-next we can consider UDP support.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Because many of RCU's files have not been included in docbook, a
    number of errors have accumulated. This commit fixes them.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Paul E. McKenney