18 Aug, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Buffered powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output that can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memory leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     

13 Aug, 2016

9 commits

  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to make this change before v4.8
    is officially released. Thus, change the helper enum and rename the
    related bits.

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Pull power management fixes from Rafael Wysocki:
    "Two hibernation fixes allowing it to work with the recently added
    randomization of the kernel identity mapping base on x86-64 and one
    cpufreq driver regression fix.

    Specifics:

    - Fix the x86 identity mapping creation helpers to avoid the
    assumption that the base address of the mapping will always be
    aligned at the PGD level, as it may be aligned at the PUD level if
    address space randomization is enabled (Rafael Wysocki).

    - Fix the hibernation core to avoid executing tracing functions
    before restoring the processor state completely during resume
    (Thomas Garnier).

    - Fix a recently introduced regression in the powernv cpufreq driver
    that causes it to crash due to an out-of-bounds array access
    (Akshay Adiga)"

    * tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Misc fixes: a /dev/rtc regression fix, two APIC timer period
    calibration fixes, an ARM clocksource driver fix and a NOHZ
    power use regression fix"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
    x86/timers/apic: Inform TSC deadline clockevent device about recalibration
    x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
    timers: Fix get_next_timer_interrupt() computation
    clocksource/arm_arch_timer: Force per-CPU interrupt to be level-triggered

    Linus Torvalds
     
  • * pm-sleep:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly

    * pm-cpufreq:
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Rafael J. Wysocki
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: cputime fixes, two deadline scheduler fixes and a cgroups
    scheduling fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Fix omitted ticks passed in parameter
    sched/cputime: Fix steal time accounting
    sched/deadline: Fix lock pinning warning during CPU hotplug
    sched/cputime: Mitigate performance regression in times()/clock_gettime()
    sched/fair: Fix typo in sync_throttle()
    sched/deadline: Fix wrap-around in DL heap

    Linus Torvalds
     
  • Restore the processor state before calling any other functions to
    ensure per-CPU variables can be used with KASLR memory randomization.

    Tracing functions use per-CPU variables (GS based on x86) and one was
    called just before restoring the processor state fully. It resulted
    in a double fault when both the tracing & the exception handler
    functions tried to use a per-CPU variable.

    Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume)
    Reported-and-tested-by: Borislav Petkov
    Reported-by: Jiri Kosina
    Tested-by: Rafael J. Wysocki
    Tested-by: Jiri Kosina
    Signed-off-by: Thomas Garnier
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Thomas Garnier
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus two uncore-PMU fixes, an uprobes fix, a
    perf-cgroups fix and an AUX events fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Add enable_box for client MSR uncore
    perf/x86/intel/uncore: Fix uncore num_counters
    uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions
    perf/core: Set cgroup in CPU contexts for new cgroup events
    perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash
    perf probe ppc64le: Fix probe location when using DWARF
    perf probe: Add function to post process kernel trace events
    tools: Sync cpufeatures headers with the kernel
    toops: Sync tools/include/uapi/linux/bpf.h with the kernel
    tools: Sync cpufeatures.h and vmx.h with the kernel
    perf probe: Support signedness casting
    perf stat: Avoid skew when reading events
    perf probe: Fix module name matching
    perf probe: Adjust map->reloc offset when finding kernel symbol from map
    perf hists: Trim libtraceevent trace_seq buffers
    perf script: Add 'bpf-output' field to usage message

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "Misc fixes: lockstat fix, futex fix on !MMU systems, big endian fix
    for qrwlocks and a race fix for pvqspinlocks"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/pvqspinlock: Fix a bug in qstat_read()
    locking/pvqspinlock: Fix double hash race
    locking/qrwlock: Fix write unlock bug on big endian systems
    futex: Assume all mappings are private on !MMU systems

    Linus Torvalds
     
  • Pull irq fix from Ingo Molnar:
    "A fix for an MSI regression"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/msi: Make sure PCI MSIs are activated early

    Linus Torvalds
     

11 Aug, 2016

2 commits

  • Commit:

    f9bcf1e0e014 ("sched/cputime: Fix steal time accounting")

    ... fixes a leak in steal time accounting, but forgets to account
    for the ticks passed in as a parameter, assuming there is only one
    tick to take into account.

    Let's take that parameter into account again.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Wanpeng Li
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: linux-tip-commits@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160811125822.GB4214@lerouge
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Commit:

    57430218317 ("sched/cputime: Count actually elapsed irq & softirq time")

    ... didn't take steal time into consideration when the noirqtime
    kernel parameter is used.

    As Paolo pointed out before:

    | Why not? If idle=poll, for example, any time the guest is suspended (and
    | thus cannot poll) does count as stolen time.

    This patch fixes it by subtracting steal time from idle time accounting
    when the noirqtime parameter is true. The average idle time drops from
    56.8% to 54.75% for a nohz idle KVM guest (noirqtime, idle=poll, four
    vCPUs running on one pCPU).

    Signed-off-by: Wanpeng Li
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

10 Aug, 2016

11 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    It's obviously wrong to set stat to NULL, so let's remove it.
    Otherwise it is always zero when we check the latency of kick/wake.

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
    When the lock holder vCPU is racing with the queue head:

    CPU 0 (lock holder)                   CPU1 (queue head)
    ===================                   =================
    spin_lock();                          spin_lock();
    pv_kick_node():                       pv_wait_head_or_lock():
                                            if (!lp) {
                                              lp = pv_hash(lock, pn);
                                              xchg(&l->locked, _Q_SLOW_VAL);
                                            }
                                            WRITE_ONCE(pn->state, vcpu_halted);
      cmpxchg(&pn->state,
              vcpu_halted, vcpu_hashed);
      WRITE_ONCE(l->locked, _Q_SLOW_VAL);
      (void)pv_hash(lock, pn);

    In this case, the lock holder inserts the pv_node of the queue head
    into the hash table and sets _Q_SLOW_VAL unnecessarily. This patch
    avoids that by restoring/setting the vcpu_hashed state after failing
    adaptive lock spinning.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Pan Xinhui
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    The following warning can be triggered by hot-unplugging the CPU
    on which an active SCHED_DEADLINE task is running:

    WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
    releasing a pinned lock
    Call Trace:
    dump_stack+0x99/0xd0
    __warn+0xd1/0xf0
    ? dl_task_timer+0x1a1/0x2b0
    warn_slowpath_fmt+0x4f/0x60
    ? sched_clock+0x13/0x20
    lock_release+0x690/0x6a0
    ? enqueue_pushable_dl_task+0x9b/0xa0
    ? enqueue_task_dl+0x1ca/0x480
    _raw_spin_unlock+0x1f/0x40
    dl_task_timer+0x1a1/0x2b0
    ? push_dl_task.part.31+0x190/0x190
    WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
    unpinning an unpinned lock
    Call Trace:
    dump_stack+0x99/0xd0
    __warn+0xd1/0xf0
    warn_slowpath_fmt+0x4f/0x60
    lock_unpin_lock+0x181/0x1a0
    dl_task_timer+0x127/0x2b0
    ? push_dl_task.part.31+0x190/0x190

    As per the comment before this code, it's safe to drop the RQ lock
    here, and since we (potentially) change rq, unpin and repin to avoid
    the splat.

    Signed-off-by: Wanpeng Li
    [ Rewrote changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit:

    6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

    fixed a problem whereby clock_nanosleep() followed by clock_gettime()
    could allow a task to wake early. It addressed the problem by calling
    the scheduling class's update_curr() when the cputimer starts.

    That change induced a considerable performance regression in the
    times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls. Some
    debuggers and applications that monitor their own performance
    accidentally depend on the performance of these specific calls.

    This patch mitigates the performance loss by prefetching data into the
    CPU cache, as stalls due to cache misses appear to be where most of the
    time is spent in our benchmarks.

    Here are the performance gains of this patch over v4.7-rc7 on a Sandy
    Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated
    with a variable number of threads, from 2 to 4*num_cpus; the results are
    in seconds and correspond to the average of 10 runs; the percentage gain
    is computed as (before-after)/before, so a positive value is an
    improvement (it's faster). The improvement varies between a few percent
    for 5-20 threads and more than 10% for 2 or >20 threads.

    pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
    2             3.48        3.06 ( 11.83%)
    5             3.33        3.25 (  2.40%)
    8             3.37        3.26 (  3.30%)
    12            3.32        3.37 ( -1.60%)
    21            4.01        3.90 (  2.74%)
    30            3.63        3.36 (  7.41%)
    48            3.71        3.11 ( 16.27%)
    79            3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

    pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
    2             3.65        3.25 ( 11.03%)
    5             3.45        3.17 (  7.92%)
    8             3.52        3.22 (  8.69%)
    12            3.29        3.36 ( -2.04%)
    21            4.07        3.92 (  3.78%)
    30            3.87        3.40 ( 12.17%)
    48            3.79        3.16 ( 16.61%)
    79            3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

    pound_times and pound_clock_gettime are two benchmarks included in
    the MMTests framework. They launch a given number of threads which
    repeatedly call times() or clock_gettime(). The results above can be
    reproduced by cloning MMTests from GitHub and running the "poundtime"
    workload:

    $ git clone https://github.com/gormanm/mmtests.git
    $ cd mmtests
    $ cp configs/config-global-dhp__workload_poundtime config
    $ ./run-mmtests.sh --run-monitor $(uname -r)

    The above will run "poundtime", measuring the kernel currently running
    on the machine; once a new kernel is installed and the machine rebooted,
    running again

    $ cd mmtests
    $ ./run-mmtests.sh --run-monitor $(uname -r)

    will produce results to compare with. A comparison table will be output
    with:

    $ cd mmtests/work/log
    $ ../../compare-kernels.sh

    the table will contain a lot of entries; grepping for "Amean" (as in
    "arithmetic mean") will give the tables presented above. The source code
    for the two benchmarks is reported at the end of this changelog for
    clarity.

    The cache misses addressed by this patch were found using a combination of
    `perf top`, `perf record` and `perf annotate`. The incriminated lines were
    found to be

    struct sched_entity *curr = cfs_rq->curr;

    and

    delta_exec = now - curr->exec_start;

    in the function update_curr() from kernel/sched/fair.c. This patch
    prefetches the data from memory just before update_curr() is called in
    the affected execution path.

    A comparison of the total number of cycles before and after the patch
    follows; the data is obtained using `perf stat -r 10 -ddd `
    running over the same sequence of number of threads used above (a positive
    gain is an improvement):

    threads  cycles before            cycles after             gain

    2         19,699,563,964 +-1.19%   17,358,917,517 +-1.85%  11.88%
    5         47,401,089,566 +-2.96%   45,103,730,829 +-0.97%   4.85%
    8         80,923,501,004 +-3.01%   71,419,385,977 +-0.77%  11.74%
    12       112,326,485,473 +-0.47%  110,371,524,403 +-0.47%   1.74%
    21       193,455,574,299 +-0.72%  180,120,667,904 +-0.36%   6.89%
    30       315,073,519,013 +-1.64%  271,222,225,950 +-1.29%  13.92%
    48       321,969,515,332 +-1.48%  273,353,977,321 +-1.16%  15.10%
    79       337,866,003,422 +-0.97%  289,462,481,538 +-1.05%  14.33%
    110      338,712,691,920 +-0.78%  290,574,233,170 +-0.77%  14.21%
    128      348,384,794,006 +-0.50%  292,691,648,206 +-0.66%  15.99%

    A comparison of cache miss vs total cache loads ratios, before and after
    the patch (again from the `perf stat -r 10 -ddd ` tables):

    threads  L1 misses/total*100  L1 misses/total*100  gain
             (before)             (after)
    2         7.43 +-4.90%         7.36 +-4.70%          0.94%
    5        13.09 +-4.74%        13.52 +-3.73%         -3.28%
    8        13.79 +-5.61%        12.90 +-3.27%          6.45%
    12       11.57 +-2.44%         8.71 +-1.40%         24.72%
    21       12.39 +-3.92%         9.97 +-1.84%         19.53%
    30       13.91 +-2.53%        11.73 +-2.28%         15.67%
    48       13.71 +-1.59%        12.32 +-1.97%         10.14%
    79       14.44 +-0.66%        13.40 +-1.06%          7.20%
    110      15.86 +-0.50%        14.46 +-0.59%          8.83%
    128      16.51 +-0.32%        15.06 +-0.78%          8.78%

    As a final note, the following shows the evolution of performance figures
    in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
    ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
    major source of degradation, mostly unaddressed to this day (figures
    expressed in seconds).

    pound_clock_gettime:

    threads  parent of      6e998916dfe3       4.7-rc7
             6e998916dfe3   itself
    2        2.23           3.68 ( -64.56%)    3.48 (-55.48%)
    5        2.83           3.78 ( -33.42%)    3.33 (-17.43%)
    8        2.84           4.31 ( -52.12%)    3.37 (-18.76%)
    12       3.09           3.61 ( -16.74%)    3.32 ( -7.17%)
    21       3.14           4.63 ( -47.36%)    4.01 (-27.71%)
    30       3.28           5.75 ( -75.37%)    3.63 (-10.80%)
    48       3.02           6.05 (-100.56%)    3.71 (-22.99%)
    79       2.88           6.30 (-118.90%)    3.75 (-30.26%)
    110      2.95           6.46 (-119.00%)    3.81 (-29.24%)
    128      3.05           6.42 (-110.08%)    3.88 (-27.04%)

    pound_times:

    threads  parent of      6e998916dfe3       4.7-rc7
             6e998916dfe3   itself
    2        2.27           3.73 ( -64.71%)    3.65 (-61.14%)
    5        2.78           3.77 ( -35.56%)    3.45 (-23.98%)
    8        2.79           4.41 ( -57.71%)    3.52 (-26.05%)
    12       3.02           3.56 ( -17.94%)    3.29 ( -9.08%)
    21       3.10           4.61 ( -48.74%)    4.07 (-31.34%)
    30       3.33           5.75 ( -72.53%)    3.87 (-16.01%)
    48       2.96           6.06 (-105.04%)    3.79 (-28.10%)
    79       2.88           6.24 (-116.83%)    3.88 (-34.81%)
    110      2.98           6.37 (-114.08%)    3.90 (-31.12%)
    128      3.10           6.35 (-104.61%)    4.00 (-28.87%)

    The source code of the two benchmarks follows. To compile them:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lrt -lpthread
    done

    ==== BEGIN pound_times.c ====

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/times.h>

    struct tms start;

    void *pound(void *threadid)
    {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            times(&end);
            utime = ((int)end.tms_utime - (int)start.tms_utime);
            if (oldutime > utime)
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
            oldutime = utime;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&th[i], NULL, pound, (void *)i);
        pthread_exit(NULL);
        return 0;
    }
    ==== END pound_times.c ====

    ==== BEGIN pound_clock_gettime.c ====

    #include <stdio.h>
    #include <pthread.h>
    #include <time.h>

    void *pound(void *threadid)
    {
        struct timespec ts;
        int rc, i;
        unsigned long prev = 0, this = 0;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
            if (rc < 0)
                perror("clock_gettime");
            this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
            if (0 && this < prev)
                printf("%lu ns timewarp at iteration %d\n", prev - this, i);
            prev = this;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long rc, i;

        for (i = 0; i < NUM_THREADS; i++) {
            rc = pthread_create(&th[i], NULL, pound, (void *)i);
            if (rc < 0)
                perror("pthread_create");
        }
        pthread_exit(NULL);
        return 0;
    }
    ==== END pound_clock_gettime.c ====

    Suggested-by: Mike Galbraith
    Signed-off-by: Giovanni Gherdovich
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
    Signed-off-by: Ingo Molnar

    Giovanni Gherdovich
     
  • We should update cfs_rq->throttled_clock_task, not
    pcfs_rq->throttle_clock_task.

    The effect of this bug was probably occasionally erratic
    group scheduling, particularly in cgroup-intense workloads.

    Signed-off-by: Xunlei Pang
    [ Added changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 55e16d30bd99 ("sched/fair: Rework throttle_count sync")
    Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     
  • Current code in cpudeadline.c has a bug in re-heapifying when adding a
    new element at the end of the heap, because a deadline value of 0 is
    temporarily set in the new elem, then cpudl_change_key() is called
    with the actual elem deadline as param.

    However, the function compares the new deadline to set with the one
    previously in the elem, which is 0. So, if current absolute deadlines
    have grown large enough to be negative as s64, the comparison in
    cpudl_change_key() makes the wrong decision. Instead, as in
    dl_time_before(), the kernel should correctly handle absolute-deadline
    wrap-arounds.

    This patch fixes the problem with a minimally invasive change that
    forces cpudl_change_key() to heapify up in this case.

    Signed-off-by: Tommaso Cucinotta
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luca Abeni
    Cc: Juri Lelli
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it
    Signed-off-by: Ingo Molnar

    Tommaso Cucinotta
     
    There's a perf stat bug that is easy to observe on a machine with only
    one cgroup:

    $ perf stat -e cycles -I 1000 -C 0 -G /
    #          time             counts unit events
        1.000161699      <not counted>      cycles                    /
        2.000355591      <not counted>      cycles                    /
        3.000565154      <not counted>      cycles                    /
        4.000951350      <not counted>      cycles                    /

    We'd expect some output there.

    The underlying problem is that there is an optimization in
    perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
    if the old and new cgroups in a task switch are the same.

    This optimization interacts with the current code in two ways
    that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
    cgroup event matches the current task. These are:

    1. On creation of the first cgroup event in a CPU: In current code,
    cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
    aforesaid optimization, perf_cgroup_sched_in will not run until the
    next cgroup switch in that CPU. This may happen late or never,
    depending on the system's number of cgroups, CPU load, etc.

    2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
    cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled
    in because cpuctx->cgrp == NULL until a cgroup switch occurs and
    perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

    This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
    mirroring what list_del_event does when removing a cgroup event from CPU
    context, as introduced in:

    commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

    With this patch, cpuctx->cgrp is always set/cleared when installing/
    removing the first/last cgroup event in/from the CPU context. With
    cpuctx->cgrp correctly set, event_filter_match() works as intended when
    events are scheduled in/out.

    After the fix, the output is as expected:

    $ perf stat -e cycles -I 1000 -a -G /
    #          time             counts unit events
        1.004699159          627342882      cycles                    /
        2.007397156          615272690      cycles                    /
        3.010019057          616726074      cycles                    /

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Vegard Nossum reported that perf fuzzing generates a NULL
    pointer dereference crash:

    > Digging a bit deeper into this, it seems the event itself is getting
    > created by perf_event_open() and it gets added to the pmu_event_list
    > through:
    >
    > perf_event_open()
    > - perf_event_alloc()
    > - account_event()
    > - account_pmu_sb_event()
    > - attach_sb_event()
    >
    > so at this point the event is being attached but its ->ctx is still
    > NULL. It seems like ->ctx is set just a bit later in
    > perf_event_open(), though.
    >
    > But before that, __schedule() comes along and creates a stack trace
    > similar to the one above:
    >
    > __schedule()
    > - __perf_event_task_sched_out()
    > - perf_iterate_sb()
    > - perf_iterate_sb_cpu()
    > - event_filter_match()
    > - perf_cgroup_match()
    > - __get_cpu_context()
    > - (dereference ctx which is NULL)
    >
    > So I guess the question is... should the event be attached (= put on
    > the list) before ->ctx gets set? Or should the cgroup code check for a
    > NULL ->ctx?

    The latter seems like the simplest solution. Moving the list-add later
    creates a bit of a mess.

    Reported-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Fixes: f2fb6bef9251 ("perf/core: Optimize side-band event delivery")
    Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This reverts commit 874f9c7da9a4acbc1b9e12ca722579fb50e4d142.

    Geert Uytterhoeven reports:
    "This change seems to have an (unintended?) side-effect.

    Before, pr_*() calls without a trailing newline characters would be
    printed with a newline character appended, both on the console and in
    the output of the dmesg command.

    After this commit, no new line character is appended, and the output
    of the next pr_*() call of the same type may be appended, like in:

    - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
    - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
    + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"

    Joe Perches says:
    "No, that is not intentional.

    The newline handling code inside vprintk_emit is a bit involved and
    for now I suggest a revert until this has all the same behavior as
    earlier"

    Reported-by: Geert Uytterhoeven
    Requested-by: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Aug, 2016

3 commits

  • The tick_nohz_stop_sched_tick() routine is not properly
    canceling the sched timer when nothing is pending, because
    get_next_timer_interrupt() is no longer returning KTIME_MAX in
    that case. This causes periodic interrupts when none are needed.

    When determining the next interrupt time, we first use
    __next_timer_interrupt() to get the first expiring timer in the
    timer wheel. If no timer is found, we return the base clock value
    plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
    timer wheel.

    Back in get_next_timer_interrupt(), we set the "expires" value
    by converting the timer wheel expiry (in ticks) to a nsec value.
    But we don't want to do this if the timer wheel expiry value
    indicates no timer; we want to return KTIME_MAX.

    Prior to commit 500462a9de65 ("timers: Switch to a non-cascading
    wheel") we checked base->active_timers to see if any timers
    were active, and if not, we didn't touch the expiry value and so
    properly returned KTIME_MAX. Now we don't have active_timers.

    To fix this, we now just check the timer wheel expiry value to
    see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
    try to compute a new value based on it, but instead simply let the
    KTIME_MAX value in expires remain.

    Fixes: 500462a9de65 "timers: Switch to a non-cascading wheel"
    Signed-off-by: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Thomas Gleixner

    Chris Metcalf
     
  • Bharat Kumar Gogada reported issues with the generic MSI code, where the
    end-point ended up with garbage in its MSI configuration (both for the vector
    and the message).

    It turns out that the two MSI paths in the kernel are doing slightly different
    things:

    generic MSI: disable MSI -> allocate MSI -> enable MSI -> setup EP
    PCI MSI:     disable MSI -> allocate MSI -> setup EP   -> enable MSI

    And it turns out that end-points are allowed to latch the content of the MSI
    configuration registers as soon as MSIs are enabled. In Bharat's case, the
    end-point ends up using whatever was there already, which is not what you
    want.

    In order to make things converge, we introduce a new MSI domain flag
    (MSI_FLAG_ACTIVATE_EARLY) that is unconditionally set for PCI/MSI. When set,
    this flag forces the programming of the end-point as soon as the MSIs are
    allocated.

    A consequence of this is an extra activate call in irq_startup, but
    that should be harmless.
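
    A toy model of why the ordering matters, assuming an end-point that
    latches its message the moment MSI is enabled, as the spec allows
    (names and the 0xdeadbeef placeholder are illustrative, not kernel
    code):

```c
/* Toy MSI end-point: the device snapshots msg_reg at enable time. */
struct msi_ep {
    unsigned int msg_reg;   /* what software last wrote */
    unsigned int latched;   /* what the device actually uses */
    int enabled;
};

static void ep_enable(struct msi_ep *ep)
{
    ep->enabled = 1;
    ep->latched = ep->msg_reg;      /* latch happens here */
}

static void ep_write_msg(struct msi_ep *ep, unsigned int msg)
{
    ep->msg_reg = msg;
}

/* Old generic-MSI order: enable before setup -> stale data latched. */
static unsigned int setup_after_enable(unsigned int msg)
{
    struct msi_ep ep = { .msg_reg = 0xdeadbeef };   /* leftover junk */
    ep_enable(&ep);
    ep_write_msg(&ep, msg);
    return ep.latched;
}

/* Order enforced by MSI_FLAG_ACTIVATE_EARLY: program the EP first. */
static unsigned int setup_before_enable(unsigned int msg)
{
    struct msi_ep ep = { .msg_reg = 0xdeadbeef };
    ep_write_msg(&ep, msg);
    ep_enable(&ep);
    return ep.latched;
}
```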

    tglx:

    - Several people reported a VMWare regression with PCI/MSI-X passthrough. It
    turns out that the patch also cures that issue.

    - We need to have a look at the MSI disable interrupt path, where we write
    the msg to all zeros without disabling MSI in the PCI device. Is that
    correct?

    Fixes: 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage MSI interrupts")
    Reported-and-tested-by: Bharat Kumar Gogada
    Reported-and-tested-by: Foster Snowhill
    Reported-by: Matthias Prager
    Reported-by: Jason Taylor
    Signed-off-by: Marc Zyngier
    Acked-by: Bjorn Helgaas
    Cc: linux-pci@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1468426713-31431-1-git-send-email-marc.zyngier@arm.com
    Signed-off-by: Thomas Gleixner

    Marc Zyngier
     
  • In commit 874f9c7da9a4 ("printk: create pr_ functions"), new
    pr_level defines were added to printk.c.

    These new defines are guarded by an #ifdef CONFIG_PRINTK - however,
    there is already a surrounding #ifdef CONFIG_PRINTK starting much
    earlier, at line 249, which means the newly introduced #ifdef is
    unnecessary.

    Let's remove it to avoid confusion.

    Signed-off-by: Andreas Ziegler
    Cc: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Ziegler
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Aug, 2016

1 commit

  • The introduction of pre-allocated hash elements inadvertently broke
    the behavior of bpf hash maps where users expected to call
    bpf_map_update_elem() without considering that the map can be full.
    Some programs do:
        old_value = bpf_map_lookup_elem(map, key);
        if (old_value) {
            ... prepare new_value on stack ...
            bpf_map_update_elem(map, key, new_value);
        }
    Before pre-alloc the update() for existing element would work even
    in 'map full' condition. Restore this behavior.

    The above program could have updated old_value in place instead of
    calling update(), which would be faster, and most programs use that
    approach; but sometimes the values are large and the programs use
    the update() helper to do an atomic replacement of the element.
    Note that we cannot simply update the element's value in place, as
    the percpu hash map does; we have to allocate num_possible_cpu
    extra elements and use this extra reserve when the map is full.
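
    A toy model of that reserve, assuming a tiny fixed-capacity map (the
    names, sizes, and linear-scan "hash table" are illustrative, not the
    kernel's htab code): the pool holds extra elements beyond the map
    capacity, so update() of an existing key can always allocate a
    replacement element even when the map is full.

```c
#define MAP_MAX   2                  /* illustrative map capacity */
#define NR_CPUS   1                  /* stands in for num_possible_cpu */
#define POOL_SIZE (MAP_MAX + NR_CPUS)

struct elem { int used, key, val; };

static struct elem pool[POOL_SIZE];
static int live;                     /* elements visible in the map */

static struct elem *map_lookup(int key)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].used && pool[i].key == key)
            return &pool[i];
    return 0;
}

static struct elem *alloc_elem(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].used)
            return &pool[i];
    return 0;
}

/* Returns 0 on success, -1 when a *new* key hits a full map. */
static int map_update(int key, int val)
{
    struct elem *old = map_lookup(key);

    if (!old && live == MAP_MAX)
        return -1;                   /* full: only replacements allowed */

    /* The reserve guarantees this succeeds for a replacement. */
    struct elem *n = alloc_elem();
    n->used = 1;
    n->key = key;
    n->val = val;

    if (old)
        old->used = 0;               /* frees the replaced element */
    else
        live++;
    return 0;
}
```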

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

06 Aug, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling fixes and some late tooling updates, plus two perf
    related printk message fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf tests bpf: Use SyS_epoll_wait alias
    perf tests: objdump output can contain multi byte chunks
    perf record: Add --sample-cpu option
    perf hists: Introduce output_resort_cb method
    perf tools: Move config/Makefile into Makefile.config
    perf tests: Add test for bitmap_scnprintf function
    tools lib: Add bitmap_and function
    tools lib: Add bitmap_scnprintf function
    tools lib: Add bitmap_alloc function
    tools lib traceevent: Ignore generated library files
    perf tools: Fix build failure on perl script context
    perf/core: Change log level for duration warning to KERN_INFO
    perf annotate: Plug filename string leak
    perf annotate: Introduce strerror for handling symbol__disassemble() errors
    perf annotate: Rename symbol__annotate() to symbol__disassemble()
    perf/x86: Modify error message in virtualized environment
    perf target: str_error_r() always returns the buffer it receives
    perf annotate: Use pipe + fork instead of popen
    perf evsel: Introduce constructor for cycles event

    Linus Torvalds
     

05 Aug, 2016

1 commit

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     

04 Aug, 2016

9 commits

  • Pull module updates from Rusty Russell:
    "The only interesting thing here is Jessica's patch to add
    ro_after_init support to modules. The rest are all trivia"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    extable.h: add stddef.h so "NULL" definition is not implicit
    modules: add ro_after_init support
    jump_label: disable preemption around __module_text_address().
    exceptions: fork exception table content from module.h into extable.h
    modules: Add kernel parameter to blacklist modules
    module: Do a WARN_ON_ONCE() for assert module mutex not held
    Documentation/module-signing.txt: Note need for version info if reusing a key
    module: Invalidate signatures on force-loaded modules
    module: Issue warnings when tainting kernel
    module: fix redundant test.
    module: fix noreturn attribute for __module_put_and_exit()

    Linus Torvalds
     
  • The current jump_label.h includes bug.h for things such as WARN_ON().
    This makes the header problematic for inclusion by kernel.h or any
    headers that kernel.h includes, since bug.h includes kernel.h (circular
    dependency). The inclusion of atomic.h is similarly problematic. Thus,
    this should make jump_label.h 'includable' from most places.

    Link: http://lkml.kernel.org/r/7060ce35ddd0d20b33bf170685e6b0fab816bdf2.1467837322.git.jbaron@akamai.com
    Signed-off-by: Jason Baron
    Cc: "David S. Miller"
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Heiko Carstens
    Cc: Joe Perches
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • The use of config_enabled() against config options is ambiguous. In
    practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
    author might have used it for the meaning of IS_ENABLED(). Using
    IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
    clearer.

    This commit replaces config_enabled() with IS_ENABLED() where possible.
    This commit is only touching bool config options.

    I noticed two cases where config_enabled() is used against a tristate
    option:

    - config_enabled(CONFIG_HWMON)
    [ drivers/net/wireless/ath/ath10k/thermal.c ]

    - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
    [ drivers/gpu/drm/gma500/opregion.c ]

    I did not touch them because they should be converted to IS_BUILTIN()
    in order to keep the logic, but I was not sure it was the authors'
    intention.
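
    The distinction can be seen with simplified forms of the kernel's
    include/linux/kconfig.h machinery (CONFIG_FOO and CONFIG_BAR below
    are made-up options; Kconfig defines a builtin option to 1 and a
    module option via a _MODULE suffix):

```c
/* What Kconfig generates: =y defines CONFIG_<X> to 1, =m defines
 * CONFIG_<X>_MODULE to 1. CONFIG_FOO and CONFIG_BAR are made up. */
#define CONFIG_FOO 1            /* built in (=y) */
#define CONFIG_BAR_MODULE 1     /* built as a module (=m) */

/* Simplified forms of the include/linux/kconfig.h macros: they expand
 * to 1 when the option token is defined to 1, and to 0 otherwise. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)

#define IS_BUILTIN(option) __is_defined(option)
#define IS_MODULE(option)  __is_defined(option##_MODULE)
#define IS_ENABLED(option) (IS_BUILTIN(option) || IS_MODULE(option))
```

    config_enabled() behaves like IS_BUILTIN(): for a tristate option
    built as a module it yields 0, which may or may not be what the
    author intended.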

    Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Kees Cook
    Cc: Stas Sergeev
    Cc: Matt Redfearn
    Cc: Joshua Kinard
    Cc: Jiri Slaby
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Markos Chandras
    Cc: "Dmitry V. Levin"
    Cc: yu-cheng yu
    Cc: James Hogan
    Cc: Brian Gerst
    Cc: Johannes Berg
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Will Drewry
    Cc: Nikolay Martynov
    Cc: Huacai Chen
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Daniel Borkmann
    Cc: Leonid Yegoshin
    Cc: Rafal Milecki
    Cc: James Cowgill
    Cc: Greg Kroah-Hartman
    Cc: Ralf Baechle
    Cc: Alex Smith
    Cc: Adam Buchbinder
    Cc: Qais Yousef
    Cc: Jiang Liu
    Cc: Mikko Rapeli
    Cc: Paul Gortmaker
    Cc: Denys Vlasenko
    Cc: Brian Norris
    Cc: Hidehiro Kawai
    Cc: "Luis R. Rodriguez"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Roland McGrath
    Cc: Paul Burton
    Cc: Kalle Valo
    Cc: Viresh Kumar
    Cc: Tony Wu
    Cc: Huaitong Han
    Cc: Sumit Semwal
    Cc: Alexei Starovoitov
    Cc: Juergen Gross
    Cc: Jason Cooper
    Cc: "David S. Miller"
    Cc: Oleg Nesterov
    Cc: Andrea Gelmini
    Cc: David Woodhouse
    Cc: Marc Zyngier
    Cc: Rabin Vincent
    Cc: "Maciej W. Rozycki"
    Cc: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Add ro_after_init support for modules by adding a new page-aligned section
    in the module layout (after rodata) for ro_after_init data and enabling RO
    protection for that section after module init runs.

    Signed-off-by: Jessica Yu
    Acked-by: Kees Cook
    Signed-off-by: Rusty Russell

    Jessica Yu
     
  • Steven reported a warning caused by not holding module_mutex or
    rcu_read_lock_sched: his backtrace was corrupted but a quick audit
    found this possible cause. It's wrong anyway...

    Reported-by: Steven Rostedt
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Blacklisting a module in Linux has long been a problem. The current
    procedure is to use rd.blacklist=module_name, however, that doesn't
    cover the case after the initramfs and before a boot prompt (where one
    is supposed to use /etc/modprobe.d/blacklist.conf to blacklist
    runtime loading). Using rd.shell to get an early prompt is hit-or-miss,
    and doesn't cover all situations AFAICT.

    This patch adds this functionality of permanently blacklisting a module
    by its name via the kernel parameter module_blacklist=module_name.

    [v2]: Rusty, use core_param() instead of __setup() which simplifies
    things.

    [v3]: Rusty, undo wreckage from strsep()

    [v4]: Rusty, simpler version of blacklisted()
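
    A sketch of the blacklisted() check, assuming module_blacklist holds
    the comma-separated list parsed from the command line (hard-coded
    here for illustration; the kernel's implementation may differ in
    detail):

```c
#include <string.h>

/* Would be filled by core_param() from the kernel command line;
 * hard-coded here for illustration. */
static const char *module_blacklist = "nouveau,pcspkr";

static int blacklisted(const char *module_name)
{
    const char *p;
    size_t len;

    if (!module_blacklist)
        return 0;

    for (p = module_blacklist; *p; p += len) {
        len = strcspn(p, ",");                  /* next list entry */
        if (strlen(module_name) == len && !memcmp(module_name, p, len))
            return 1;
        if (p[len] == ',')
            len++;                              /* skip the separator */
    }
    return 0;
}
```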

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: linux-doc@vger.kernel.org
    Signed-off-by: Rusty Russell

    Prarit Bhargava
     
  • When running with lockdep enabled, I triggered the WARN_ON() in the
    module code that asserts when module_mutex or rcu_read_lock_sched are
    not held. The issue I have is that this can also be called from the
    dump_stack() code, causing us to enter an infinite loop...

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8fa70 ffff880215e8fa70 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8fac0 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8f7a0 ffff880215e8f7a0 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8f7f0 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8f4d0 ffff880215e8f4d0 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8f520 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    [...]

    Which gives us rather useless information. Worse yet, there's some race
    that causes this, and I seldom trigger it, so I have no idea what
    happened.

    This would not be an issue if that warning was a WARN_ON_ONCE().
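
    The recursion-breaking property of the ONCE variant is easy to model
    (a simplification; the real macro keeps a per-callsite static flag):

```c
static int warned;       /* per-callsite flag in the real macro */
static int dump_count;   /* stands in for the stack dumps emitted */

static int warn_on_once(int condition)
{
    if (condition && !warned) {
        warned = 1;      /* set before dumping: re-entry is a no-op */
        dump_count++;    /* stands in for dump_stack() */
    }
    return condition;
}
```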

    Signed-off-by: Steven Rostedt
    Signed-off-by: Rusty Russell

    Steven Rostedt
     
  • Using a per-register incrementing ID can lead to
    find_good_pkt_pointers() confusing registers that hold
    completely different values. Consider this example:

    0: (bf) r6 = r1
    1: (61) r8 = *(u32 *)(r6 +76)
    2: (61) r0 = *(u32 *)(r6 +80)
    3: (bf) r7 = r8
    4: (07) r8 += 32
    5: (2d) if r8 > r0 goto pc+9
    R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
    6: (bf) r8 = r7
    7: (bf) r9 = r7
    8: (71) r1 = *(u8 *)(r7 +0)
    9: (0f) r8 += r1
    10: (71) r1 = *(u8 *)(r7 +1)
    11: (0f) r9 += r1
    12: (07) r8 += 32
    13: (2d) if r8 > r0 goto pc+1
    R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
    14: (71) r1 = *(u8 *)(r9 +16)
    15: (b7) r7 = 0
    16: (bf) r0 = r7
    17: (95) exit

    We need to get an UNKNOWN_VALUE with imm to force id
    generation so lines 0-5 make r7 a valid packet pointer.
    We then read two different bytes from the packet and
    add them to copies of the constructed packet pointer.
    r8 (line 9) and r9 (line 11) will get the same id of 1,
    independently. When either of them is validated (line
    13) - find_good_pkt_pointers() will also mark the other
    as safe. This leads to access on line 14 being mistakenly
    considered safe.
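
    A toy model of the id matching in find_good_pkt_pointers() (register
    file and state fields heavily simplified): with a per-register
    counter, both r8 and r9 independently end up with id 1, so validating
    r8 also, and wrongly, marks r9 safe.

```c
#define NR_REGS 11

/* Simplified register state: packet pointers carry an id and the
 * byte range they are known to be safe for. */
struct reg_state { int id; int range; };

/* When 'checked' passes a bounds test, every register sharing its id
 * is granted the same safe range -- correct only if equal ids really
 * mean equal pointer values. */
static void find_good_pkt_pointers(struct reg_state *regs,
                                   int checked, int range)
{
    for (int i = 0; i < NR_REGS; i++)
        if (regs[i].id == regs[checked].id)
            regs[i].range = range;
}

/* Reproduces the bug from the listing above: two independently built
 * packet pointers that collided on id 1. */
static int r9_range_after_checking_r8(void)
{
    struct reg_state regs[NR_REGS] = {{0}};

    regs[8].id = 1;     /* r8 = r7 + byte0: per-register counter -> 1 */
    regs[9].id = 1;     /* r9 = r7 + byte1: independently also 1 */
    find_good_pkt_pointers(regs, 8, 32);   /* only r8 was checked */
    return regs[9].range;
}
```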

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Pull tracing fixes from Steven Rostedt:
    "A few updates and fixes:

    - move the suppressing of the __builtin_return_address >0 warning to
    the tracing directory only.

    - metag recordmcount fix for newer glibc's

    - two tracing histogram fixes that were reported by KASAN"

    * tag 'trace-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix use-after-free in hist_register_trigger()
    tracing: Fix use-after-free in hist_unreg_all/hist_enable_unreg_all
    Makefile: Mute warning for __builtin_return_address(>0) for tracing only
    ftrace/recordmcount: Work around for addition of metag magic but not relocations

    Linus Torvalds
     

03 Aug, 2016

1 commit

  • Copy the config fragments from the AOSP common kernel android-4.4
    branch. It is becoming possible to run mainline kernels with Android,
    but the kernel defconfigs don't work as-is and debugging missing config
    options is a pain. Adding the config fragments into the kernel tree
    makes configuring a mainline kernel as simple as:

    make ARCH=arm multi_v7_defconfig android-base.config android-recommended.config

    The following non-upstream config options were removed:

    CONFIG_NETFILTER_XT_MATCH_QTAGUID
    CONFIG_NETFILTER_XT_MATCH_QUOTA2
    CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
    CONFIG_PPPOLAC
    CONFIG_PPPOPNS
    CONFIG_SECURITY_PERF_EVENTS_RESTRICT
    CONFIG_USB_CONFIGFS_F_MTP
    CONFIG_USB_CONFIGFS_F_PTP
    CONFIG_USB_CONFIGFS_F_ACC
    CONFIG_USB_CONFIGFS_F_AUDIO_SRC
    CONFIG_USB_CONFIGFS_UEVENT
    CONFIG_INPUT_KEYCHORD
    CONFIG_INPUT_KEYRESET

    Link: http://lkml.kernel.org/r/1466708235-28593-1-git-send-email-robh@kernel.org
    Signed-off-by: Rob Herring
    Cc: Amit Pundir
    Cc: John Stultz
    Cc: Dmitry Shmidt
    Cc: Rom Lemarchand
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring