28 Apr, 2016

1 commit

  • This commit replaces an #ifdef with IS_ENABLED(), saving five lines.
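    As an illustration of the technique, not the code this commit
    changed: IS_ENABLED() turns a Kconfig symbol into a constant 0 or 1
    that an ordinary C `if` can test. The disabled branch is still parsed
    and type-checked, then dropped as dead code. The userspace sketch
    below re-creates the preprocessor trick, modeled on the 4.6-era
    config_enabled() helper in include/linux/kconfig.h. CONFIG_DEMO_FEATURE,
    CONFIG_DEMO_OTHER and count_enabled_paths() are hypothetical names.

```c
/*
 * Userspace re-creation of the trick behind IS_ENABLED(). A symbol
 * defined as 1 pastes into __ARG_PLACEHOLDER_1, which expands to "0,"
 * and shifts the literal 1 into the "val" slot; any other spelling
 * leaves the literal 0 in that slot.
 */
#define __ARG_PLACEHOLDER_1 0,
#define config_enabled(cfg) _config_enabled(cfg)
#define _config_enabled(value) __config_enabled(__ARG_PLACEHOLDER_##value)
#define __config_enabled(arg1_or_junk) ___config_enabled(arg1_or_junk 1, 0)
#define ___config_enabled(__ignored, val, ...) val

/* Built-in (FOO=1) or modular (FOO_MODULE=1) both count as enabled. */
#define IS_ENABLED(option) \
	(config_enabled(option) || config_enabled(option##_MODULE))

/* Hypothetical Kconfig results: one option set, one left undefined. */
#define CONFIG_DEMO_FEATURE 1
/* CONFIG_DEMO_OTHER is deliberately not defined. */

static int count_enabled_paths(void)
{
	int paths = 0;

	/* Replaces: #ifdef CONFIG_DEMO_FEATURE ... #endif */
	if (IS_ENABLED(CONFIG_DEMO_FEATURE))
		paths += 1;
	/* Still compiled and type-checked, but eliminated as dead code. */
	if (IS_ENABLED(CONFIG_DEMO_OTHER))
		paths += 10;
	return paths;
}
```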

    Signed-off-by: Paul E. McKenney
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: corbet@lwn.net
    Cc: dave@stgolabs.net
    Cc: dhowells@redhat.com
    Cc: linux-doc@vger.kernel.org
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1461691328-5429-4-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

13 Apr, 2016

3 commits

  • This function compiles to 1328 bytes of machine code. Three callsites.

    Registering a new lock class is definitely not *that* time-critical;
    there is no need to inline it.

    Signed-off-by: Denys Vlasenko
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460141926-13069-5-git-send-email-dvlasenk@redhat.com
    Signed-off-by: Ingo Molnar

    Denys Vlasenko
     
  • It has been found that paths that invoke cleanups through
    lock_torture_cleanup() can trigger NULL pointer dereferencing
    bugs during the statistics printing phase. This is mainly
    because we should not be calling into statistics before we are
    sure things have been set up correctly.

    Specifically, early checks (and the need for handling this in
    the cleanup call) only include parameter checks and basic
    statistics allocation. Once we start the write/read kthreads,
    we consider the test as started. As such, update the function
    in question to check for the cxt.lwsa writer stats: if they are not
    set, we either have a bogus parameter or an -ENOMEM situation and
    therefore only need to deal with the general torture calls.
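    A minimal sketch of the early-exit check, assuming a hypothetical
    context struct in which a NULL writer-stats pointer (standing in for
    cxt.lwsa) means setup never completed. The demo_* names are
    illustrative only; the real cleanup also stops kthreads and frees
    more state.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for locktorture's context (demo_* names). */
struct demo_stats {
	long n_lock_acquired;
};

struct demo_ctx {
	struct demo_stats *lwsa;   /* writer stats; NULL until setup succeeds */
	bool stats_printed;
	long last_report;
	int torture_cleanups;      /* generic teardown, always performed */
};

/* Reads the writer stats, so it must never run with lwsa == NULL. */
static long demo_print_stats(const struct demo_ctx *cxt)
{
	return cxt->lwsa->n_lock_acquired;
}

static void demo_torture_cleanup(struct demo_ctx *cxt)
{
	/*
	 * No writer stats: init bailed out early (bogus parameter or
	 * -ENOMEM), so skip statistics and do only the generic teardown.
	 */
	if (!cxt->lwsa)
		goto end;

	cxt->last_report = demo_print_stats(cxt);
	cxt->stats_printed = true;
	free(cxt->lwsa);
	cxt->lwsa = NULL;
end:
	cxt->torture_cleanups++;
}
```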

    Reported-and-tested-by: Kefeng Wang
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bobby.prani@gmail.com
    Cc: dhowells@redhat.com
    Cc: dipankar@in.ibm.com
    Cc: dvhart@linux.intel.com
    Cc: edumazet@google.com
    Cc: fweisbec@gmail.com
    Cc: jiangshanlai@gmail.com
    Cc: josh@joshtriplett.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: oleg@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1460476038-27060-2-git-send-email-paulmck@linux.vnet.ibm.com
    [ Improved the changelog. ]
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • For the case of rtmutex torturing we will randomly call into the
    boost() handler, including upon module exiting when the tasks are
    deboosted before stopping. In such cases the task may or may not have
    already been boosted, and therefore the NULL being explicitly passed
    can occur anywhere. Currently we assume that the task is already
    at a higher priority and, in consequence, dereference a NULL pointer.

    This patch fixes an rmmod of locktorture exploding while
    pounding on the rtmutex lock (partial trace):

    task: ffff88081026cf80 ti: ffff880816120000 task.ti: ffff880816120000
    RIP: 0010:[] [] torture_random+0x5/0x60 [torture]
    RSP: 0018:ffff880816123eb0 EFLAGS: 00010206
    RAX: ffff88081026cf80 RBX: ffff880816bfa630 RCX: 0000000000160d1b
    RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    RBP: ffff88081026cf80 R08: 000000000000001f R09: ffff88017c20ca80
    R10: 0000000000000000 R11: 000000000048c316 R12: ffffffffa05d1840
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88203f880000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 0000000001c0a000 CR4: 00000000000406e0
    Stack:
    ffffffffa05d141d ffff880816bfa630 ffffffffa05d1922 ffff88081e70c2c0
    ffff880816bfa630 ffffffff81095fed 0000000000000000 ffffffff8107bf60
    ffff880816bfa630 ffffffff00000000 ffff880800000000 ffff880816123f08
    Call Trace:
    [] torture_rtmutex_boost+0x1d/0x90 [locktorture]
    [] lock_torture_writer+0xe2/0x170 [locktorture]
    [] kthread+0xbd/0xe0
    [] ret_from_fork+0x3f/0x70

    This patch ensures that, when current is not boosted, the random state
    pointer is checked for NULL before it is used; if it is NULL, the
    handler does nothing.
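    The shape of this fix can be modeled in userspace. This is a sketch
    under stated assumptions: demo_boost(), struct demo_rand and the
    simple LCG are hypothetical stand-ins, and instead of changing the
    scheduling policy the function just returns the task's new priority
    class. A NULL random-state pointer means "force a deboost", as on
    module exit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the torture random state. */
struct demo_rand {
	uint64_t state;
};

/* Simple LCG; dereferences trsp, so a NULL pointer would crash here. */
static uint64_t demo_random(struct demo_rand *trsp)
{
	trsp->state = trsp->state * 6364136223846793005ULL +
		      1442695040888963407ULL;
	return trsp->state >> 33;
}

enum demo_prio { DEMO_NORMAL, DEMO_BOOSTED };

/* Returns the task's priority class after one boost() decision. */
static enum demo_prio demo_boost(enum demo_prio cur, struct demo_rand *trsp,
				 unsigned int factor)
{
	if (cur == DEMO_NORMAL) {
		/* The fix: an unboosted task with trsp == NULL is a no-op. */
		if (trsp && demo_random(trsp) % factor == 0)
			return DEMO_BOOSTED;
		return DEMO_NORMAL;
	}
	/* Boosted: trsp == NULL forces the task back to normal priority. */
	if (!trsp || demo_random(trsp) % (factor * 2) == 0)
		return DEMO_NORMAL;
	return DEMO_BOOSTED;
}
```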

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bobby.prani@gmail.com
    Cc: dhowells@redhat.com
    Cc: dipankar@in.ibm.com
    Cc: dvhart@linux.intel.com
    Cc: edumazet@google.com
    Cc: fweisbec@gmail.com
    Cc: jiangshanlai@gmail.com
    Cc: josh@joshtriplett.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: oleg@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1460476038-27060-1-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

04 Apr, 2016

1 commit

  • Fix this:

    kernel/locking/lockdep.c:2051:13: warning: ‘print_collision’ defined but not used [-Wunused-function]
    static void print_collision(struct task_struct *curr,
    ^
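    The changelog does not show the fix itself. One common way to
    silence -Wunused-function for a debug-only helper is to compile it
    under the same configuration guard as its only caller (annotating it
    __maybe_unused is the other usual option). A sketch of the guard
    pattern, with a hypothetical CONFIG_DEMO_DEBUG option and
    hypothetical function names:

```c
#include <stdio.h>

/* Hypothetical debug option; remove this define to compile it out. */
#define CONFIG_DEMO_DEBUG 1

#ifdef CONFIG_DEMO_DEBUG
/* Defined only when its sole caller is built: no unused-function warning. */
static void print_collision_demo(int held_key, int cached_key)
{
	fprintf(stderr, "collision: %d vs %d\n", held_key, cached_key);
}
#endif

/* Returns 1 if no collision was detected (always 1 without debugging). */
static int check_no_collision_demo(int held_key, int cached_key)
{
#ifdef CONFIG_DEMO_DEBUG
	if (held_key != cached_key) {
		print_collision_demo(held_key, cached_key);
		return 0;
	}
#endif
	return 1;
}
```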

    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1459759327-2880-1-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

03 Apr, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "Misc kernel side fixes:

    - fix event leak
    - fix AMD PMU driver bug
    - fix core event handling bug
    - fix build bug on certain randconfigs

    Plus misc tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/amd/ibs: Fix pmu::stop() nesting
    perf/core: Don't leak event in the syscall error path
    perf/core: Fix time tracking bug with multiplexing
    perf jit: genelf makes assumptions about endian
    perf hists: Fix determination of a callchain node's childlessness
    perf tools: Add missing initialization of perf_sample.cpumode in synthesized samples
    perf tools: Fix build break on powerpc
    perf/x86: Move events_sysfs_show() outside CPU_SUP_INTEL
    perf bench: Fix detached tarball building due to missing 'perf bench memcpy' headers
    perf tests: Fix tarpkg build test error output redirection

    Linus Torvalds
     
  • Pull core kernel fixes from Ingo Molnar:
    "This contains the nohz/atomic cleanup/fix for the fetch_or() ugliness
    you noted during the original nohz pull request, plus there's also
    misc fixes:

    - fix liblockdep build bug
    - fix uapi header build bug
    - print more lockdep hash collision info to help debug recent reports
    of hash collisions
    - update MAINTAINERS email address"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS: Update my email address
    locking/lockdep: Print chain_key collision information
    uapi/linux/stddef.h: Provide __always_inline to userspace headers
    tools/lib/lockdep: Fix unsupported 'basename -s' in run_tests.sh
    locking/atomic, sched: Unexport fetch_or()
    timers/nohz: Convert tick dependency mask to atomic_t
    locking/atomic: Introduce atomic_fetch_or()

    Linus Torvalds
     

02 Apr, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Missing device reference in IPSEC input path results in crashes
    during device unregistration. From Subash Abhinov Kasiviswanathan.

    2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

    3) Stats accounting bugs in bcmgenet, from Patri Gynther.

    4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

    5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

    6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

    7) Ageing time of bridge not set properly when not operating over a
    switchdev device. Fix from Haishuang Yan.

    8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

    9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

    10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
    net: mvneta: fix changing MTU when using per-cpu processing
    stmmac: fix MDIO settings
    Revert "stmmac: Fix 'eth0: No PHY found' regression"
    stmmac: fix TX normal DESC
    net: mvneta: use cache_line_size() to get cacheline size
    net: mvpp2: use cache_line_size() to get cacheline size
    net: mvpp2: fix maybe-uninitialized warning
    tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
    net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
    rtnl: fix msg size calculation in if_nlmsg_size()
    fec: Do not access unexisting register in Coldfire
    net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
    net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
    bpf: make padding in bpf_tunnel_key explicit
    ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
    bnxt_en: Fix ethtool -a reporting.
    bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
    bnxt_en: Implement proper firmware message padding.
    ...

    Linus Torvalds
     

31 Mar, 2016

3 commits

  • A sequence of pairs [class_idx -> corresponding chain_key iteration]
    is printed for both the current held_lock chain and the cached chain.

    That exposes the two different class_idx sequences that led to that
    particular hash value.

    This helps with debugging hash chain collision reports.
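    The printed pairs come from folding each class_idx into a running
    64-bit key. The sketch below assumes the rotate-and-XOR fold used by
    lockdep's iterate_chain_key() around v4.6, with
    MAX_LOCKDEP_KEYS_BITS = 13; the demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed width of a class index, as MAX_LOCKDEP_KEYS_BITS. */
#define DEMO_KEYS_BITS 13

/* Rotate the running key left by DEMO_KEYS_BITS, then mix in class_idx. */
static uint64_t demo_iterate_chain_key(uint64_t key, uint32_t class_idx)
{
	return (key << DEMO_KEYS_BITS) ^ (key >> (64 - DEMO_KEYS_BITS)) ^
	       class_idx;
}

/* Print "[class_idx -> chain_key]" pairs and return the final key. */
static uint64_t demo_print_chain(const char *label, const uint32_t *idx,
				 size_t n)
{
	uint64_t key = 0;

	printf("%s:\n", label);
	for (size_t i = 0; i < n; i++) {
		key = demo_iterate_chain_key(key, idx[i]);
		printf("  class_idx:%u -> chain_key:%016llx\n",
		       (unsigned int)idx[i], (unsigned long long)key);
	}
	return key;
}
```

    Running it once for the held-lock chain and once for the cached
    chain shows the two class_idx sequences side by side; equal final
    keys with different sequences are exactly the collision being
    reported.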

    Signed-off-by: Alfredo Alvarez Fernandez
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-fsdevel@vger.kernel.org
    Cc: sedat.dilek@gmail.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     
  • In the error path, event_file not being NULL is used to determine
    whether the event itself still needs to be freed, so fix it up to
    avoid leaking.
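    The ownership rule behind this fix can be sketched as follows: once
    a file has been created around the event, releasing the file
    releases the event, so the error path must free the event directly
    only while event_file is still NULL. The demo_* types and functions
    are hypothetical, and the real perf_event_open() error path is
    considerably more involved.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model: a file created around an event owns that event. */
struct demo_event { int id; };
struct demo_file  { struct demo_event *event; };

static int demo_events_freed;

static void demo_free_event(struct demo_event *event)
{
	free(event);
	demo_events_freed++;
}

static void demo_fput(struct demo_file *file)
{
	demo_free_event(file->event);   /* the file owns the event */
	free(file);
}

/* Returns a file owning a new event, or NULL on a simulated failure. */
static struct demo_file *demo_event_open(bool fail_before_file,
					 bool fail_after_file)
{
	struct demo_file *event_file = NULL;
	struct demo_event *event = calloc(1, sizeof(*event));

	if (!event)
		return NULL;
	if (fail_before_file)
		goto err;

	event_file = malloc(sizeof(*event_file));
	if (!event_file)
		goto err;
	event_file->event = event;

	if (fail_after_file)
		goto err;
	return event_file;

err:
	/* event_file != NULL: the event goes away together with the file. */
	if (event_file)
		demo_fput(event_file);
	else
		demo_free_event(event);
	return NULL;
}
```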

    Reported-by: Leon Yu
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 130056275ade ("perf: Do not double free")
    Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Stephane reported that commit:

    3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")

    introduced a regression wrt. time tracking, as easily observed by:

    > This patch introduce a bug in the time tracking of events when
    > multiplexing is used.
    >
    > The issue is easily reproducible with the following perf run:
    >
    > $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
    > 1.000730239 652,394 branches (66.41%)
    > 1.000730239 597,809 branches (66.41%)
    > 1.000730239 593,870 branches (66.63%)
    > 1.000730239 651,440 branches (67.03%)
    > 1.000730239 656,725 branches (66.96%)
    > 1.000730239 branches
    >
    > One branches event is shown as not having run. Yet, with
    > multiplexing, all events should run especially with a 1s (-I 1000)
    > interval. The delta for time_running comes out to 0. Yet, the event
    > has run because the kernel is actually multiplexing the events. The
    > problem is that the time tracking is the kernel and especially in
    > ctx_sched_out() is wrong now.
    >
    > The problem is that in case that the kernel enters ctx_sched_out() with the
    > following state:
    > ctx->is_active=0x7 event_type=0x1
    > Call Trace:
    > [] dump_stack+0x63/0x82
    > [] ctx_sched_out+0x2bc/0x2d0
    > [] perf_mux_hrtimer_handler+0xf6/0x2c0
    > [] ? __perf_install_in_context+0x130/0x130
    > [] __hrtimer_run_queues+0xf8/0x2f0
    > [] hrtimer_interrupt+0xb7/0x1d0
    > [] local_apic_timer_interrupt+0x38/0x60
    > [] smp_apic_timer_interrupt+0x3d/0x50
    > [] apic_timer_interrupt+0x8c/0xa0
    >
    > In that case, the test:
    > if (is_active & EVENT_TIME)
    >
    > will be false and the time will not be updated. Time must always be updated on
    > sched out.

    Fix this by always updating time if EVENT_TIME was set, as opposed to
    only updating time when EVENT_TIME changed.
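    The scenario from the report (ctx->is_active=0x7, event_type=0x1)
    can be replayed in a small bitmask model. The bit values follow the
    kernel's event_type_t of the time (assumed), and the demo_*
    functions are hypothetical simplifications of ctx_sched_out(), not
    its real body.

```c
#include <stdbool.h>

/* Assumed bit values of event_type_t at the time. */
enum {
	EVENT_FLEXIBLE = 0x1,
	EVENT_PINNED   = 0x2,
	EVENT_TIME     = 0x4,
	EVENT_ALL      = EVENT_FLEXIBLE | EVENT_PINNED,
};

/* New is_active after scheduling out event_type (simplified model). */
static int demo_sched_out_mask(int is_active, int event_type)
{
	int next = is_active & ~event_type;

	if (!(next & EVENT_ALL))
		next = 0;       /* nothing left active: time stops as well */
	return next;
}

/* Old rule: update time only when the EVENT_TIME bit changed. */
static bool demo_old_updates_time(int is_active, int event_type)
{
	int changed = is_active ^ demo_sched_out_mask(is_active, event_type);

	return changed & EVENT_TIME;
}

/* Fixed rule: update time whenever EVENT_TIME was set on entry. */
static bool demo_new_updates_time(int is_active, int event_type)
{
	(void)event_type;
	return is_active & EVENT_TIME;
}
```

    With is_active=0x7 and event_type=0x1, pinned events keep the
    context active, so EVENT_TIME never changes and the old rule skips
    the time update; the fixed rule still performs it.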

    Reported-by: Stephane Eranian
    Tested-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: kan.liang@intel.com
    Cc: namhyung@kernel.org
    Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
    Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Mar, 2016

2 commits

  • This patch functionally reverts:

    5fd7a09cfb8c ("atomic: Export fetch_or()")

    During the merge Linus observed that the generic version of fetch_or()
    was messy:

    " This makes the ugly "fetch_or()" macro that the scheduler used
    internally a new generic helper, and does a bad job at it. "

    e23604edac2a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Now that we have introduced atomic_fetch_or(), fetch_or() is only used
    by the scheduler in order to deal with thread_info flags, whose type
    can vary across architectures.

    Let's confine fetch_or() back to the scheduler so that we encourage
    future users to use the more robust and well-typed atomic_t version
    instead.

    While at it, fetch_or() gets robustified, pasting improvements from a
    previous patch by Ingo Molnar that avoids needless expression
    re-evaluations in the loop.
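    A userspace sketch of a fetch_or() that evaluates each argument
    exactly once and retries a compare-and-swap until the OR sticks. It
    relies on GNU C (statement expressions, __typeof__) and GCC's
    __sync_val_compare_and_swap() in place of the kernel's cmpxchg();
    demo_fetch_or is a hypothetical name and this is not the
    scheduler's exact macro.

```c
/*
 * Returns the old value of *ptr and atomically ORs mask into it.
 * ptr and mask are copied into locals first, so side effects in the
 * arguments happen once even when the CAS loop retries.
 */
#define demo_fetch_or(ptr, mask)					\
	({								\
		__typeof__(ptr) _ptr = (ptr);				\
		__typeof__(mask) _mask = (mask);			\
		__typeof__(*_ptr) _old, _val = *_ptr;			\
									\
		for (;;) {						\
			_old = __sync_val_compare_and_swap(_ptr, _val,	\
							   _val | _mask); \
			if (_old == _val)				\
				break;	/* our OR was applied */	\
			_val = _old;	/* lost a race: retry */	\
		}							\
		_old;							\
	})
```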

    Reported-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The tick dependency mask was initially unsigned long because this is the
    type clear_bit() operates on and the type fetch_or() accepts.

    But now that we have atomic_fetch_or(), we can instead use
    atomic_andnot() to clear the bit. This consolidates the type of our
    tick dependency mask, reduces its size in structures and lets us benefit
    from possible architecture optimizations on atomic_t operations.
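    A userspace sketch of the set/clear pairing using C11
    <stdatomic.h>: atomic_fetch_or() sets a dependency bit, and
    atomic_fetch_and() with the complement plays the role of the
    kernel's atomic_andnot(). The DEMO_DEP_* bits and demo_tick_dep_*()
    helpers are hypothetical; the set helper reports the
    empty-to-non-empty transition, the point where a caller would need
    the tick back.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dependency bits (the kernel has its own TICK_DEP_* set). */
enum {
	DEMO_DEP_POSIX_TIMER = 1 << 0,
	DEMO_DEP_PERF_EVENTS = 1 << 1,
};

/* An atomic_int standing in for an atomic_t tick dependency mask. */
static atomic_int demo_tick_dep_mask;

/* Set a bit; returns true if the mask was empty before (first user). */
static bool demo_tick_dep_set(int bit)
{
	int prev = atomic_fetch_or(&demo_tick_dep_mask, bit);

	return prev == 0;
}

/* Clear a bit: the C11 analogue of an atomic and-not. */
static void demo_tick_dep_clear(int bit)
{
	atomic_fetch_and(&demo_tick_dep_mask, ~bit);
}
```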

    Suggested-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

26 Mar, 2016

3 commits

  • KASAN needs to know whether the allocation happens in an IRQ handler.
    This lets us strip everything below the IRQ entry point to reduce the
    number of unique stack traces needed to be stored.

    Move the definition of __irq_entry to so that the
    users don't need to pull in . Also introduce the
    __softirq_entry macro, which is similar to __irq_entry but puts the
    corresponding functions into the .softirqentry.text section.
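    Putting entry functions into dedicated text sections turns "is this
    frame an IRQ entry?" into an address-range check. The sketch below
    shows the stack-trimming idea with hypothetical names: the section
    bounds are passed in as plain ranges rather than taken from linker
    symbols, and entries[0] is assumed to be the innermost frame.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entry section's [start, end) bounds. */
struct demo_range {
	uintptr_t start, end;
};

static int demo_in_entry_text(uintptr_t pc, const struct demo_range *r,
			      size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (pc >= r[i].start && pc < r[i].end)
			return 1;
	return 0;
}

/*
 * Keep frames up to and including the first IRQ/softirq entry function
 * and drop the interrupted context below it. Returns the new length.
 */
static size_t demo_filter_irq_stack(const uintptr_t *entries,
				    size_t nr_entries,
				    const struct demo_range *r, size_t nr)
{
	for (size_t i = 0; i < nr_entries; i++)
		if (demo_in_entry_text(entries[i], r, nr))
			return i + 1;
	return nr_entries;
}
```

    Cutting at the entry point means allocations made from the same IRQ
    handler share one stored trace regardless of which task happened to
    be interrupted.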

    Signed-off-by: Alexander Potapenko
    Acked-by: Steven Rostedt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • When oom_reaper manages to unmap all the eligible vmas, there shouldn't
    be much freeable memory held by the oom victim left anymore, so it
    makes sense to clear the TIF_MEMDIE flag for the victim and allow the
    OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending or PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down already.

    This patch allows capping the time an OOM victim can keep TIF_MEMDIE
    and thus hold off further global OOM killer actions, provided the oom
    reaper is able to take mmap_sem for the associated mm struct. This is
    not guaranteed now, but further steps should make waiting for mmap_sem
    for write killable, which will help to reduce such lock contention.
    This is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters too
    early.
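    The atomic check-and-clear can be sketched with C11 atomics. This is
    a hypothetical model (demo_* names, a single flag bit) of the rule
    stated above: whichever caller actually clears TIF_MEMDIE is the
    only one that decrements the victim count, so a victim exiting while
    the reaper handles it cannot underflow the counter or wake waiters
    twice.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_TIF_MEMDIE (1u << 0)   /* hypothetical flag bit */

struct demo_task {
	atomic_uint flags;
};

static atomic_int demo_oom_victims;
static int demo_wakeups;            /* plain int: the demo is single-threaded */

/* May be called by the victim itself and by the reaper for the same task. */
static void demo_exit_oom_victim(struct demo_task *tsk)
{
	unsigned int old = atomic_fetch_and(&tsk->flags, ~DEMO_TIF_MEMDIE);

	if (!(old & DEMO_TIF_MEMDIE))
		return;         /* another caller already cleared it */

	if (atomic_fetch_sub(&demo_oom_victims, 1) == 1)
		demo_wakeups++; /* count reached zero: wake the waiters */
}
```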

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This will be needed in the patch "mm, oom: introduce oom reaper".

    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 Mar, 2016

6 commits

  • Add the map_flags attribute to bpf_map_show_fdinfo(), so that tools like
    tc can check for it when loading objects from a pinned entry; e.g.,
    if the user's intent regarding allocation (BPF_F_NO_PREALLOC) differs
    from that of the pinned object, tc can bail out. This is a follow-up to
    6c9059817432 ("bpf: pre-allocate hash map elements"), so that tc can
    still support this with v4.6.
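    A sketch of the check a loader such as tc could perform, assuming
    the fdinfo text contains a "map_flags:" line whose value strtoul()
    can parse in base 0 (e.g. "0x1") and that BPF_F_NO_PREALLOC is bit
    0. The demo_* helpers and the sample text in the test are
    hypothetical, not tc's code.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BPF_F_NO_PREALLOC (1UL << 0)   /* assumed flag value */

/* Parse "map_flags:" from fdinfo-style text; false if the field is absent. */
static bool demo_parse_map_flags(const char *fdinfo, unsigned long *flags)
{
	const char *p = strstr(fdinfo, "map_flags:");

	if (!p)
		return false;   /* e.g. a kernel without this attribute */
	/* strtoul() skips the tab; base 0 accepts "0x1" as well as "0". */
	*flags = strtoul(p + strlen("map_flags:"), NULL, 0);
	return true;
}

/* True only if the pinned map's allocation mode matches the wanted one. */
static bool demo_flags_match(const char *fdinfo, unsigned long wanted)
{
	unsigned long have;

	if (!demo_parse_map_flags(fdinfo, &have))
		return false;
	return (have & DEMO_BPF_F_NO_PREALLOC) ==
	       (wanted & DEMO_BPF_F_NO_PREALLOC);
}
```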

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull more power management and ACPI updates from Rafael Wysocki:
    "The second batch of power management and ACPI updates for v4.6.

    Included are fixups on top of the previous PM/ACPI pull request and
    other material that didn't make it into that one but still should go into 4.6.

    Among other things, there's a fix for an intel_pstate driver issue
    uncovered by recent cpufreq changes, a workaround for a boot hang on
    Skylake-H related to the handling of deep C-states by the platform and
    a PCI/ACPI fix for the handling of IO port resources on non-x86
    architectures plus some new device IDs and similar.

    Specifics:

    - Fix for an intel_pstate driver issue related to the handling of MSR
    updates uncovered by the recent cpufreq rework (Rafael Wysocki).

    - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking fix
    for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

    - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).

    - intel_idle driver update preventing some Skylake-H systems from
    hanging during initialization by disabling deep C-states mishandled
    by the platform in the problematic configurations (Len Brown).

    - Intel Xeon Phi Processor x200 support for intel_idle
    (Dasaratharaman Chandramouli).

    - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the fallback
    C-state on x86 when they are set below its exit latency) and to
    restore the previous behavior of falling back to C1 if the next timer
    event is set far enough in the future; that behavior was changed in
    4.4, which led to an energy consumption regression (Rik van Riel,
    Rafael Wysocki).

    - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).

    - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).

    - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

    - Fix for the ACPI backend of the generic device properties API to
    make it parse non-device (data node only) children of an ACPI
    device correctly (Irina Tirdea).

    - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).

    - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).

    - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven)"

    * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399
    intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
    intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate
    cpufreq: governor: Always schedule work on the CPU running update
    cpufreq: Always update current frequency before startig governor
    cpufreq: Introduce cpufreq_update_current_freq()
    cpufreq: Introduce cpufreq_start_governor()
    cpufreq: powernv: Add sysfs attributes to show throttle stats
    cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
    PCI: ACPI: IA64: fix IO port generic range check
    ACPI / util: cast data to u64 before shifting to fix sign extension
    cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
    cpuidle: menu: Fall back to polling if next timer event is near
    cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
    intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
    cpufreq: Make cpufreq_quick_get() safe to call
    ACPI / property: fix data node parsing in acpi_get_next_subnode()
    ACPI / APD: Add device HID for future AMD UART controller
    ...

    Linus Torvalds
     
  • * pm-avs:
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399

    * pm-clk:
    PM / clk: Add support for obtaining clocks from device-tree

    * pm-devfreq:
    PM / devfreq: Spelling s/frequnecy/frequency/

    * pm-sleep:
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate

    Rafael J. Wysocki
     
  • Pull tracing updates from Steven Rostedt:
    "Nothing major this round. Mostly small clean ups and fixes.

    Some visible changes:

    - A new flag was added to distinguish traces done in NMI context.

    - Preempt tracer now shows functions where preemption is disabled but
    interrupts are still enabled.

    Other notes:

    - Updates were done to function tracing to allow better performance
    with perf.

    - Infrastructure code has been added to allow for a new histogram
    feature for recording live trace event histograms that can be
    configured by simple user commands. The feature itself was just
    finished, but needs a round in linux-next before being pulled.

    This only includes some infrastructure changes that will be needed"

    * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
    tracing: Record and show NMI state
    tracing: Fix trace_printk() to print when not using bprintk()
    tracing: Remove redundant reset per-CPU buff in irqsoff tracer
    x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
    tracing: Fix crash from reading trace_pipe with sendfile
    tracing: Have preempt(irqs)off trace preempt disabled functions
    tracing: Fix return while holding a lock in register_tracer()
    ftrace: Use kasprintf() in ftrace_profile_tracefs()
    ftrace: Update dynamic ftrace calls only if necessary
    ftrace: Make ftrace_hash_rec_enable return update bool
    tracing: Fix typoes in code comment and printk in trace_nop.c
    tracing, writeback: Replace cgroup path to cgroup ino
    tracing: Use flags instead of bool in trigger structure
    tracing: Add an unreg_all() callback to trigger commands
    tracing: Add needs_rec flag to event triggers
    tracing: Add a per-event-trigger 'paused' field
    tracing: Add get_syscall_name()
    tracing: Add event record param to trigger_ops.func()
    tracing: Make event trigger functions available
    tracing: Make ftrace_event_field checking functions available
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "This tree contains various perf fixes on the kernel side, plus three
    hw/event-enablement late additions:

    - Intel Memory Bandwidth Monitoring events and handling
    - the AMD Accumulated Power Mechanism reporting facility
    - more IOMMU events

    ... and a final round of perf tooling updates/fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    perf llvm: Use strerror_r instead of the thread unsafe strerror one
    perf llvm: Use realpath to canonicalize paths
    perf tools: Unexport some methods unused outside strbuf.c
    perf probe: No need to use formatting strbuf method
    perf help: Use asprintf instead of adhoc equivalents
    perf tools: Remove unused perf_pathdup, xstrdup functions
    perf tools: Do not include stringify.h from the kernel sources
    tools include: Copy linux/stringify.h from the kernel
    tools lib traceevent: Remove redundant CPU output
    perf tools: Remove needless 'extern' from function prototypes
    perf tools: Simplify die() mechanism
    perf tools: Remove unused DIE_IF macro
    perf script: Remove lots of unused arguments
    perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
    perf machine: Rename perf_event__preprocess_sample to machine__resolve
    perf tools: Add cpumode to struct perf_sample
    perf tests: Forward the perf_sample in the dwarf unwind test
    perf tools: Remove misplaced __maybe_unused
    perf list: Fix documentation of :ppp
    perf bench numa: Fix assertion for nodes bitfield
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
    cputime fix and two cpuacct cleanups"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cpuacct: Simplify the cpuacct code
    sched/cpuacct: Rename parameter in cpuusage_write() for readability
    sched/fair: Add comments to explain select_idle_sibling()
    sched/fair: Fix fairness issue on migration
    sched/cgroup: Fix/cleanup cgroup teardown/init
    sched/cputime: Fix steal time accounting vs. CPU hotplug

    Linus Torvalds
     

23 Mar, 2016

16 commits

  • When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.
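    The stale-flag sequence can be replayed with a small state model.
    Everything here is hypothetical (demo_* names, a single FW_RESUME
    bit standing in for PM_SUSPEND_FLAG_FW_RESUME); it only shows why
    clearing the flags at the start of hibernation, as suspend-to-RAM
    already does, stops the gratuitous runtime resume.

```c
#include <stdbool.h>

/* Hypothetical stand-in for PM_SUSPEND_FLAG_FW_RESUME. */
#define DEMO_FLAG_FW_RESUME (1u << 1)

static unsigned int demo_pm_flags;

static void demo_clear_flags(void)
{
	demo_pm_flags = 0;
}

/* Suspend-to-RAM: flags cleared at the start, firmware sets FW_RESUME. */
static void demo_suspend_to_ram(void)
{
	demo_clear_flags();
	/* ... platform sleeps; firmware runs on the way back ... */
	demo_pm_flags |= DEMO_FLAG_FW_RESUME;
}

/* Suspend-to-disk; with_fix models clearing the flags at its start. */
static void demo_hibernate(bool with_fix)
{
	if (with_fix)
		demo_clear_flags();
	/* ... image written, machine powered off, image restored ... */
}

/* What a ->complete() callback would consult before a runtime resume. */
static bool demo_resume_check(void)
{
	return demo_pm_flags & DEMO_FLAG_FW_RESUME;
}
```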

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Cc: 4.4+ # 4.4+
    Signed-off-by: Rafael J. Wysocki

    Lukas Wunner
     
  • Use the more common logging method with the eventual goal of removing
    pr_warning altogether.

    Miscellanea:

    - Realign arguments
    - Coalesce formats
    - Add missing space between a few coalesced formats

    Signed-off-by: Joe Perches
    Acked-by: Rafael J. Wysocki [kernel/power/suspend.c]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Add a flag to memremap() for writecombine mappings. Mappings satisfied
    by this flag will not be cached; however, writes may be delayed or
    combined into more efficient bursts. This is most suitable for buffers
    written sequentially by the CPU for use by other DMA devices.

    Signed-off-by: Brian Starkey
    Reviewed-by: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • These patches implement a MEMREMAP_WC flag for memremap(), which can be
    used to obtain writecombine mappings. This is then used for setting up
    dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.

    The motivation is to fix an alignment fault on arm64, and the suggestion
    to implement MEMREMAP_WC for this case was made at [1]. That particular
    issue is handled in patch 4, which makes sure that the appropriate
    memset function is used when zeroing allocations mapped as IO memory.

    This patch (of 4):

    Don't modify the flags input argument to memremap(). MEMREMAP_WB is
    already a special case so we can check for it directly instead of
    clearing flag bits in each mapper.
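    A toy model of the flag handling, assuming (as memremap()'s
    kernel-doc describes) that when several mapping types are requested
    they are tried in a fixed order, WB, WT, then WC, until one
    succeeds. The DEMO_* values, struct demo_arch and demo_memremap()
    are hypothetical, and the model ignores the System RAM checks the
    real function performs; note that it only reads `flags` and never
    modifies it.

```c
#include <stddef.h>

/* Hypothetical flag values mirroring the WB/WT/WC request bits. */
enum {
	DEMO_MEMREMAP_WB = 1 << 0,
	DEMO_MEMREMAP_WT = 1 << 1,
	DEMO_MEMREMAP_WC = 1 << 2,
};

/* Which mapping types the simulated architecture can provide. */
struct demo_arch {
	int can_wb, can_wt, can_wc;
};

/* Returns the mapping type used, or 0 if no requested type succeeded. */
static int demo_memremap(const struct demo_arch *arch, unsigned long flags)
{
	if ((flags & DEMO_MEMREMAP_WB) && arch->can_wb)
		return DEMO_MEMREMAP_WB;
	if ((flags & DEMO_MEMREMAP_WT) && arch->can_wt)
		return DEMO_MEMREMAP_WT;
	if ((flags & DEMO_MEMREMAP_WC) && arch->can_wc)
		return DEMO_MEMREMAP_WC;
	return 0;
}
```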

    Signed-off-by: Brian Starkey
    Cc: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
    padding) of the part of the struct siginfo that is before the union, and
    it is then used to calculate the needed padding (SI_PAD_SIZE) to make
    the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.

    Depending on the target architecture and word width, it equals either
    3 or 4 times sizeof(int).

    Since the very beginning, we had __ARCH_SI_PREAMBLE_SIZE wrong on the
    parisc architecture for the 64-bit kernel build. It's all the more
    frustrating because whether the value was defined correctly can easily
    be checked at compile time.

    This patch adds such a check for the correctness of
    __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
    future architectures from running into the same problem.

    I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
    copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
    make any difference and b) it's used in the Documentation/kmemcheck.txt
    example.

    I ran this patch through the 0-DAY kernel test infrastructure and only
    the parisc architecture triggered as expected. That means that this
    patch should be OK for all major architectures.
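    The kind of compile-time check described above can be sketched in
    userspace with C11 _Static_assert (kernel code would use
    BUILD_BUG_ON()). struct demo_siginfo and the DEMO_* macros are
    hypothetical; the preamble-size rule assumes the union holds a
    pointer, so on 64-bit ABIs the three leading ints are padded to 16
    bytes.

```c
#include <stddef.h>

#define DEMO_SI_MAX_SIZE 128
/* Three ints, padded to pointer alignment on 64-bit ABIs (assumed). */
#define DEMO_SI_PREAMBLE_SIZE \
	(sizeof(void *) == 8 ? 4 * sizeof(int) : 3 * sizeof(int))
#define DEMO_SI_PAD_SIZE \
	((DEMO_SI_MAX_SIZE - DEMO_SI_PREAMBLE_SIZE) / sizeof(int))

struct demo_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		int _pad[DEMO_SI_PAD_SIZE];
		struct { int pid; unsigned int uid; } kill;
		struct { void *addr; } fault;
	} _sifields;
};

/* Fail the build if the hand-written preamble size is wrong. */
_Static_assert(offsetof(struct demo_siginfo, _sifields) ==
	       DEMO_SI_PREAMBLE_SIZE,
	       "preamble size does not match the union's offset");
_Static_assert(sizeof(struct demo_siginfo) == DEMO_SI_MAX_SIZE,
	       "siginfo is not SI_MAX_SIZE bytes");
```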

    Signed-off-by: Helge Deller
    Cc: Stephen Rothwell
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal, it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    uninteresting parts of the kernel (e.g. scheduler, locking) is disabled.

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build the syzkaller system call fuzzer, which
    has found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. The
    coverage of one iteration can be just a dozen basic blocks (e.g. for an
    invalid input). In such a context gcov becomes prohibitively expensive,
    because the reset/collect coverage steps depend on the total number of
    basic blocks/edges in the program (in the case of the kernel, about 2M).
    The cost of kcov depends only on the number of executed basic
    blocks/edges. On top of that, the kernel requires per-thread coverage
    because there are always background threads and unrelated processes that
    also produce coverage. With inlined gcov instrumentation, per-thread
    coverage is not possible.
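    The cost argument above can be illustrated with a toy, single-threaded user-space model. This is not the kcov interface: `trace_buf`, `counters` and all helper names are hypothetical, and the real kcov buffer is shared with user space and updated atomically. Resetting the kcov-style buffer writes one word and collecting reads only the executed entries, while the gcov-style reset and collect steps walk every counter in the program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOTAL_BLOCKS 4096 /* toy stand-in for the ~2M blocks in a kernel */
#define TRACE_CAP 64      /* toy per-thread trace capacity */

/* kcov-style: slot 0 counts recorded PCs, slots 1..n hold the PCs. */
struct trace_buf {
	uintptr_t area[TRACE_CAP + 1];
};

static void trace_reset(struct trace_buf *t)
{
	t->area[0] = 0; /* O(1): independent of program size */
}

static void trace_hit(struct trace_buf *t, uintptr_t pc)
{
	uintptr_t n = t->area[0];

	if (n < TRACE_CAP) { /* silently drop PCs once the buffer is full */
		t->area[n + 1] = pc;
		t->area[0] = n + 1;
	}
}

/* Collecting touches only the n executed entries. */
static size_t trace_collect(const struct trace_buf *t, uintptr_t *out)
{
	size_t n = (size_t)t->area[0];

	memcpy(out, &t->area[1], n * sizeof(*out));
	return n;
}

/* gcov-style: one counter per basic block in the whole program. */
struct counters {
	unsigned int c[TOTAL_BLOCKS];
};

static void counters_reset(struct counters *k)
{
	memset(k->c, 0, sizeof(k->c)); /* O(TOTAL_BLOCKS) on every iteration */
}

static void counters_hit(struct counters *k, size_t block)
{
	k->c[block]++;
}

/* Collecting must scan every counter to find the few that were hit. */
static size_t counters_collect(const struct counters *k, size_t *out, size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < TOTAL_BLOCKS; i++)
		if (k->c[i] && n < cap)
			out[n++] = i;
	return n;
}
```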

    kcov exposes kernel PCs and control flow to user-space, which is
    insecure. However, debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • A couple of functions and variables in the profile implementation are
    used only on SMP systems by the procfs code, but are unused if either
    procfs is disabled or in uniprocessor kernels. gcc prints a harmless
    warning about the unused symbols:

    kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_flip_buffers(void)
    ^
    kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_discard_flip_buffers(void)
    ^
    kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
    static int profile_cpu_callback(struct notifier_block *info,
    ^

    This adds further #ifdefs to the file to annotate exactly in which cases
    they are used. I have built several thousand ARM randconfig kernels with
    this patch applied and no longer get any warnings in this file.
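    A small sketch of the technique: guard a static helper with the same condition as its only caller, so configurations that drop the caller also drop the helper and gcc has nothing to warn about. The `CONFIG_*` defines here are stand-ins toggled by hand, and `demo_read_profile` is a hypothetical caller; the exact conditions used in kernel/profile.c are not reproduced here.

```c
#include <assert.h>

/* Toggle these to mimic UP or !PROC_FS builds; both are hand-set stand-ins. */
#define CONFIG_SMP 1
#define CONFIG_PROC_FS 1

#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
static int flip_count;

/* Only the SMP procfs path calls this, so it is only built in that case. */
static void profile_flip_buffers(void)
{
	flip_count++;
}

/* Hypothetical procfs read handler standing in for the real caller. */
static int demo_read_profile(void)
{
	profile_flip_buffers();
	return flip_count;
}
#else
/* Without SMP or procfs there is nothing to flip and no unused helper. */
static int demo_read_profile(void)
{
	return 0;
}
#endif
```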

    Signed-off-by: Arnd Bergmann
    Cc: Vlastimil Babka
    Cc: Robin Holt
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic
    on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save
    registers even if looping in NMI context") introduced nmi_panic() which
    prevents concurrent/recursive execution of panic(). It also saves
    registers for the crash dump on x86.

    However, there are some cases where NMI handlers still use panic().
    This patch set partially replaces them with nmi_panic() in those cases.

    Even if this patchset is applied, some NMI or similar handlers (e.g. the
    MCE handler) continue to use panic(). This is because I can't test them
    well and actual problems are unlikely to happen. For example, the
    probability that a normal panic and a panic on MCE happen simultaneously
    is very low.

    This patch (of 3):

    Convert nmi_panic() to a proper function and export it instead of
    exporting internal implementation details to modules, for obvious
    reasons.
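    A minimal model of the re-entrance guard described above, using C11 atomics in user space. It assumes a global `panic_cpu` that holds the id of the CPU owning the panic, or `PANIC_CPU_INVALID` when no panic is in progress; `nmi_panic_decide` and the action enum are hypothetical names, and real code would call panic() or park the CPU rather than return a value.

```c
#include <assert.h>
#include <stdatomic.h>

#define PANIC_CPU_INVALID (-1)

/* Id of the CPU currently panicking, or PANIC_CPU_INVALID. */
static atomic_int panic_cpu = PANIC_CPU_INVALID;

enum nmi_panic_action { DO_PANIC, SELF_STOP, ALREADY_PANICKING };

/* Decide what a CPU entering the NMI panic path should do. */
static enum nmi_panic_action nmi_panic_decide(int cpu)
{
	int old = PANIC_CPU_INVALID;

	/* Only the first CPU to swap in its id gets to run panic(). */
	if (atomic_compare_exchange_strong(&panic_cpu, &old, cpu))
		return DO_PANIC;
	/* Another CPU already owns the panic: this one should stop. */
	if (old != cpu)
		return SELF_STOP;
	/* Recursive entry on the panicking CPU: just return. */
	return ALREADY_PANICKING;
}
```

    The compare-and-exchange makes "first CPU wins" a single atomic step, so two CPUs hitting an NMI at the same time cannot both proceed into panic().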

    Signed-off-by: Hidehiro Kawai
    Acked-by: Borislav Petkov
    Acked-by: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Rasmus Villemoes
    Cc: Nicolas Iooss
    Cc: Javi Merino
    Cc: Gobinda Charan Maji
    Cc: "Steven Rostedt (Red Hat)"
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.
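    The decision the fix describes can be sketched as a small pure function. The `SUID_DUMP_*` values follow the fs.suid_dumpable settings named above (2 meaning SUID_DUMP_ROOT); `core_dump_root` and `enum core_root` are hypothetical, and the model only covers pipe patterns ("|...") and absolute paths ("/..."), not relative core_pattern values.

```c
#include <assert.h>

/* fs.suid_dumpable values; SUID_DUMP_ROOT corresponds to setting 2. */
enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1, SUID_DUMP_ROOT = 2 };

/* Where the dump path is resolved from (hypothetical names). */
enum core_root { CORE_PIPE_HELPER, CORE_TASK_ROOT, CORE_INIT_ROOT };

/* Model only: assumes core_pattern is either "|..." or an absolute "/..." path. */
static enum core_root core_dump_root(const char *core_pattern, int dump_mode)
{
	if (core_pattern[0] == '|')
		return CORE_PIPE_HELPER; /* handed to a helper, not opened as a file */
	if (dump_mode == SUID_DUMP_ROOT)
		return CORE_INIT_ROOT;   /* never resolve through the crashing task's root */
	return CORE_TASK_ROOT;
}
```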

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
    This test case (a simplified version of one generated by syzkaller)

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void test(void)
    {
        for (;;) {
            if (fork()) {
                wait(NULL);
                continue;
            }

            ptrace(PTRACE_SEIZE, getppid(), 0, 0);
            ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
            _exit(0);
        }
    }

    int main(void)
    {
        int np;

        for (np = 0; np < 8; ++np)
            if (!fork())
                test();

        while (wait(NULL) > 0)
            ;
        return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of the other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Except on SPARC, this is what the code always did. SPARC compat seccomp
    was buggy, although the impact of the bug was limited because SPARC
    32-bit and 64-bit syscall numbers are the same.

    Signed-off-by: Andy Lutomirski
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Seccomp wants to know the syscall bitness, not the caller task bitness,
    when it selects the syscall whitelist.

    As far as I know, this makes no difference on any architecture, so it's
    not a security problem. (It generates identical code everywhere except
    sparc, and, on sparc, the syscall numbering is the same for both ABIs.)
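    As a sketch of what "syscall bitness, not task bitness" means, the strict-mode whitelist below is selected by the ABI of the syscall being made. The numbers are the x86-64 and i386 syscall numbers for read, write, exit and (rt_)sigreturn; `strict_mode_allows` and `in_list` are hypothetical names. On x86-64 a 64-bit process can still enter the kernel through the 32-bit `int $0x80` path, where number 3 means read rather than close, so the table must follow the syscall's ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Strict-mode set: read, write, exit, (rt_)sigreturn. */
static const int strict_64[] = { 0, 1, 60, 15 }; /* x86-64 numbers */
static const int strict_32[] = { 3, 4, 1, 119 }; /* i386 numbers */

static bool in_list(const int *list, size_t len, int nr)
{
	for (size_t i = 0; i < len; i++)
		if (list[i] == nr)
			return true;
	return false;
}

/* The table is chosen by the ABI of this syscall, not by the task's bitness. */
static bool strict_mode_allows(int nr, bool syscall_is_32bit)
{
	if (syscall_is_32bit)
		return in_list(strict_32, sizeof(strict_32) / sizeof(strict_32[0]), nr);
	return in_list(strict_64, sizeof(strict_64) / sizeof(strict_64[0]), nr);
}
```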

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
    When a new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
    khungtaskd is interrupted and again sleeps for the full timeout duration.

    This means that hung tasks will never be checked if a new timeout is
    written periodically within the old timeout duration, and/or checking for
    hung tasks will be delayed by up to the previous timeout duration. Fix
    this by remembering the last time khungtaskd checked for hung tasks.

    This change will allow other watchdog tasks (if any) to share khungtaskd
    by sleeping for the smallest remaining timeout among all watchdog tasks.
    Running more watchdog tasks from khungtaskd will reduce the possibility
    of printk() collisions between multiple watchdog threads.
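    The "remember the last check" idea can be sketched as computing only the remaining sleep, in seconds. `hung_task_sleep_secs` is a hypothetical helper, and treating a timeout of 0 as "disabled" (returned as -1 here) is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Seconds khungtaskd should still sleep, given when it last checked.
 * Returns 0 when a check is due now, or -1 when checking is disabled.
 */
static long hung_task_sleep_secs(long now, long last_checked, long timeout)
{
	long elapsed = now - last_checked;

	if (timeout == 0)
		return -1;            /* assumption: 0 disables checking */
	if (elapsed >= timeout)
		return 0;             /* overdue: check immediately */
	return timeout - elapsed; /* sleep only for the remainder */
}
```

    With a 120-second timeout and the last check at t=0, a write at t=60 now leads to a 60-second sleep rather than a fresh 120-second one, and lowering the timeout to 30 at t=60 makes the check due immediately.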

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • The latency tracer format has a nice column to indicate IRQ state, but
    this is not able to tell us about NMI state.

    When tracing perf interrupt handlers (which often run in NMI context)
    it is very useful to see how the events nest.
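    A sketch of how one context column can encode nesting, including NMI. The letters are modeled on those documented for ftrace's latency format ('z' for NMI, 'Z' for NMI inside a hardirq, 'H' for hardirq inside a softirq); treat the exact letters and the `CTX_*` bit values as assumptions, and `context_char` as a hypothetical name.

```c
#include <assert.h>

/* Hypothetical context bits for this sketch. */
#define CTX_NMI     0x1u
#define CTX_HARDIRQ 0x2u
#define CTX_SOFTIRQ 0x4u

/* Pick the character shown in the latency-format context column. */
static char context_char(unsigned int ctx)
{
	int nmi = (ctx & CTX_NMI) != 0;
	int hardirq = (ctx & CTX_HARDIRQ) != 0;
	int softirq = (ctx & CTX_SOFTIRQ) != 0;

	if (nmi && hardirq)
		return 'Z'; /* NMI nested inside a hardirq */
	if (nmi)
		return 'z'; /* NMI running */
	if (hardirq && softirq)
		return 'H'; /* hardirq nested inside a softirq */
	if (hardirq)
		return 'h';
	if (softirq)
		return 's';
	return '.';     /* normal context */
}
```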

    Link: http://lkml.kernel.org/r/20160318153022.105068893@infradead.org

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Steven Rostedt

    Peter Zijlstra
     
    The trace_printk() code will allocate extra buffers if the compiler detects
    that a trace_printk() is used. To do this, the format of the trace_printk()
    is saved to the __trace_printk_fmt section, and if that section is bigger
    than zero, the buffers are allocated (along with a message that this has
    happened).

    If trace_printk() uses a format that is not a constant, and thus something
    not guaranteed to be around when the print happens, the compiler optimizes
    the fmt out, as it is not used, and the __trace_printk_fmt section is not
    filled. This means the kernel will not allocate the special buffers needed
    for the trace_printk() and the trace_printk() will not write anything to the
    tracing buffer.

    Adding a "__used" to the variable in the __trace_printk_fmt section will
    keep it around, even though it is set to NULL. This will keep the string
    from being printed in the debugfs/tracing/printk_formats section as it is
    not needed.
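    The effect of `__used` on a section-based registry can be sketched in user-space C with GCC/Clang attributes on an ELF target. The section name `demo_printk_fmt` and the macro `DEMO_FMT_ENTRY` are hypothetical; the `__start_`/`__stop_` symbols are provided by the GNU and LLVM linkers for sections whose names are valid C identifiers. A NULL entry marked `used` still makes the section non-empty, which is the condition that triggers buffer allocation, while it is skipped when formats are listed.

```c
#include <assert.h>
#include <stddef.h>

/* Each entry is a pointer kept in the "demo_printk_fmt" ELF section. */
#define DEMO_FMT_ENTRY(name, fmt) \
	static const char *name __attribute__((used, section("demo_printk_fmt"))) = (fmt)

DEMO_FMT_ENTRY(fmt_const, "value=%d\n"); /* constant format: recorded */
DEMO_FMT_ENTRY(fmt_dynamic, NULL);       /* non-constant format: NULL, slot kept by `used` */

/* Linker-provided bounds of the section. */
extern const char *__start_demo_printk_fmt[];
extern const char *__stop_demo_printk_fmt[];

/* Number of entries; a non-zero count is what triggers buffer allocation. */
static size_t fmt_section_entries(void)
{
	return (size_t)(__stop_demo_printk_fmt - __start_demo_printk_fmt);
}

/* Entries that would be listed as formats (NULL ones are skipped). */
static size_t fmt_listed_entries(void)
{
	size_t n = 0;

	for (const char **p = __start_demo_printk_fmt; p < __stop_demo_printk_fmt; p++)
		if (*p)
			n++;
	return n;
}
```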

    Reported-by: Vlastimil Babka
    Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()"
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"
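    The path adjustment can be sketched as computing a path relative to the namespace root, with ".." components for cgroups that lie outside it. `cgns_visible_path` is a hypothetical helper, not the kernfs API added in this series; it only manipulates absolute, slash-separated path strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Length of the path component starting at p. */
static size_t comp_len(const char *p)
{
	return strcspn(p, "/");
}

static const char *skip_slashes(const char *p)
{
	while (*p == '/')
		p++;
	return p;
}

/*
 * Write into out the path of cgroup `cg` as seen from a namespace rooted
 * at `ns_root`. Returns 0 on success, -1 if out is too small.
 */
static int cgns_visible_path(const char *ns_root, const char *cg, char *out, size_t outlen)
{
	const char *r = skip_slashes(ns_root);
	const char *c = skip_slashes(cg);
	size_t used = 0;
	int n;

	/* Skip the leading components both paths share. */
	while (*r && *c) {
		size_t lr = comp_len(r), lc = comp_len(c);

		if (lr != lc || strncmp(r, c, lr) != 0)
			break;
		r = skip_slashes(r + lr);
		c = skip_slashes(c + lc);
	}
	/* Climb out of the namespace root once per leftover root component. */
	while (*r) {
		n = snprintf(out + used, outlen - used, "/..");
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
		r = skip_slashes(r + comp_len(r));
	}
	/* Then descend into the remaining components of the target. */
	while (*c) {
		size_t lc = comp_len(c);

		n = snprintf(out + used, outlen - used, "/%.*s", (int)lc, c);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
		c = skip_slashes(c + lc);
	}
	/* The namespace root itself is shown as "/". */
	if (used == 0) {
		n = snprintf(out, outlen, "/");
		if (n < 0 || (size_t)n >= outlen)
			return -1;
	}
	return 0;
}
```

    For a namespace rooted at /batchjobs/container_id1, the cgroup /batchjobs/container_id2/sub_cgroup_1 comes out as /../container_id2/sub_cgroup_1.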

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

21 Mar, 2016

1 commit

  • - Use for() instead of while() loop in some functions
    to make the code simpler.

    - Use this_cpu_ptr() instead of per_cpu_ptr() to make the code
    cleaner and a bit faster.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Zhao Lei
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/d8a7ef9592f55224630cb26dea239f05b6398a4e.1458187654.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei