09 Oct, 2013

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Various fixlets:

    On the kernel side:

    - fix a race
    - fix a bug in the handling of the perf ring-buffer data page

    On the tooling side:

    - fix the handling of certain corrupted perf.data files
    - fix a bug in 'perf probe'
    - fix a bug in 'perf record + perf sched'
    - fix a bug in 'make install'
    - fix a bug in libaudit feature-detection on certain distros"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf session: Fix infinite loop on invalid perf.data file
    perf tools: Fix installation of libexec components
    perf probe: Fix to find line information for probe list
    perf tools: Fix libaudit test
    perf stat: Set child_pid after perf_evlist__prepare_workload()
    perf tools: Add default handler for mmap2 events
    perf/x86: Clean up cap_user_time* setting
    perf: Fix perf_pmu_migrate_context

    Linus Torvalds
     

05 Oct, 2013

1 commit

  • Pull ACPI and power management fixes from Rafael Wysocki:

    - The resume part of user space driven hibernation (s2disk) is now
    broken after the change that moved the creation of memory bitmaps to
    after the freezing of tasks, because I forgot that the resume utility
    loaded the image before freezing tasks and needed the bitmaps for
    that. The fix adds special handling for that case.

    - One of recent commits changed the export of acpi_bus_get_device() to
    EXPORT_SYMBOL_GPL(), which was technically correct but broke existing
    binary modules using that function including one in particularly
    widespread use. Change it back to EXPORT_SYMBOL().

    - The intel_pstate driver sometimes fails to disable turbo if its
    no_turbo sysfs attribute is set. Fix from Srinivas Pandruvada.

    - One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0
    which still (incorrectly) treats non-NULL as non-error. Fix from
    Philipp Zabel.

    - The SPEAr cpufreq driver uses a wrong variable type in one place
    preventing it from catching errors returned by one of the functions
    called by it. Fix from Sachin Kamat.

    * tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI: Use EXPORT_SYMBOL() for acpi_bus_get_device()
    intel_pstate: fix no_turbo
    cpufreq: cpufreq-cpu0: NULL is a valid regulator, part 2
    cpufreq: SPEAr: Fix incorrect variable type
    PM / hibernate: Fix user space driven resume regression

    Linus Torvalds
     

04 Oct, 2013

1 commit

  • While auditing the list_entry usage due to a trinity bug I found that
    perf_pmu_migrate_context violates the rules for
    perf_event::event_entry.

    The problem is that perf_event::event_entry is an RCU list element, and
    hence we must wait for a full RCU grace period before re-using the
    element after deletion.

    Therefore the usage in perf_pmu_migrate_context() which re-uses the
    entry immediately is broken. For now introduce another list_head into
    perf_event for this specific usage.

    This doesn't actually fix the trinity report because that never goes
    through this code.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Oct, 2013

4 commits

  • The commit facd8b80c67a3cf64a467c4a2ac5fb31f2e6745b
    ("irq: Sanitize invoke_softirq") converted irq exit
    calls of do_softirq() to __do_softirq() on all architectures,
    assuming it was only used there for its irq disablement
    properties.

    But as a side effect, the softirqs processed at the end
    of the hardirq are always called on the current stack
    that is used by irq_exit() instead of the softirq
    stack provided by the archs that override do_softirq().

    The result is mostly safe if the architecture runs irq_exit()
    on a separate irq stack because then softirqs are processed
    on that same stack that is near empty at this stage (assuming
    hardirq aren't nesting).

    Otherwise irq_exit() runs in the task stack and so does the softirq
    too. The interrupted call stack can be randomly deep already and
    the softirq can dig through it even further. To add insult to the
    injury, this softirq can be interrupted by a new hardirq, maximizing
    the chances for a stack overrun as reported in powerpc for example:

    do_IRQ: stack overflow: 1920
    CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
    Call Trace:
    [c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
    [c0000000050a8810] .dump_stack+0x28/0x3c
    [c0000000050a8880] .do_IRQ+0x2b8/0x2c0
    [c0000000050a8930] hardware_interrupt_common+0x154/0x180
    --- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
    LR = .cp_start_xmit+0x390/0x820 [8139cp]
    [c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
    [c0000000050a8e00] .sch_direct_xmit+0x110/0x260
    [c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
    [c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
    [c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
    [c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
    [c0000000050a9130] .dev_queue_xmit+0x428/0x630
    [c0000000050a91d0] .ip_finish_output+0x2a4/0x550
    [c0000000050a9290] .ip_local_out+0x50/0x70
    [c0000000050a9310] .ip_queue_xmit+0x148/0x420
    [c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
    [c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
    [c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
    [c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
    [c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
    [c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
    [c0000000050a9840] .ip_rcv_finish+0x148/0x400
    [c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
    [c0000000050a99d0] .netif_receive_skb+0x44/0x110
    [c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
    [c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
    [c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
    [c0000000050a9c70] .nf_iterate+0x114/0x130
    [c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
    [c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
    [c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
    [c0000000050a9fa0] .netif_receive_skb+0x44/0x110
    [c0000000050aa040] .napi_gro_receive+0xe8/0x120
    [c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
    [c0000000050aa1d0] .net_rx_action+0x1dc/0x310
    [c0000000050aa2b0] .__do_softirq+0x158/0x330
    [c0000000050aa3b0] .irq_exit+0xc8/0x110
    [c0000000050aa430] .do_IRQ+0xdc/0x2c0
    [c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
    --- Exception: 501 at .bad_range+0x1c/0x110
    LR = .get_page_from_freelist+0x908/0xbb0
    [c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
    [c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
    [c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
    [c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
    [c0000000050aac60] .handle_pte_fault+0x814/0xb70
    [c0000000050aad50] .__get_user_pages+0x1a4/0x640
    [c0000000050aae60] .get_user_pages_fast+0xec/0x160
    [c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
    [c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
    [c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
    [c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
    [c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
    [c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm]
    [c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
    [c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
    [c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
    [c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
    [c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
    [c0000000050abd80] .SyS_ioctl+0xd4/0xf0
    [c0000000050abe30] syscall_exit+0x0/0x98

    Since this is a regression, this patch proposes a minimalistic
    and low-risk solution: blindly force the hardirq exit processing of
    softirqs onto the softirq stack. This way we should significantly
    reduce the opportunities for task stack overflow dug by softirqs.

    Longer term solutions may involve extending the hardirq stack coverage to
    irq_exit(), etc...

    Reported-by: Benjamin Herrenschmidt
    Acked-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: #3.9..
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: James Hogan
    Cc: James E.J. Bottomley
    Cc: Helge Deller
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David S. Miller
    Cc: Andrew Morton

    Frederic Weisbecker
     
  • "case 0" in free_pid() assumes that disable_pid_allocation() should
    clear PIDNS_HASH_ADDING before the last pid goes away.

    However this doesn't happen if the first fork() fails to create the
    child reaper which should call disable_pid_allocation().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
    dereference happens upon core dump because argv_split("") returns
    argv[0] == NULL.

    This bug was once fixed by commit 264b83c07a84 ("usermodehelper: check
    subprocess_info->path != NULL") but was by error reintroduced by commit
    7f57cfa4e2aa ("usermodehelper: kill the sub_info->path[0] check").

    This bug seems to exist since 2.6.19 (the version which core dump to
    pipe was added). Depending on kernel version and config, some side
    effect might happen immediately after this oops (e.g. kernel panic with
    2.6.32-358.18.1.el6).

    Signed-off-by: Tetsuo Handa
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Recent commit 8fd37a4 (PM / hibernate: Create memory bitmaps after
    freezing user space) broke the resume part of the user space driven
    hibernation (s2disk), because I forgot that the resume utility
    loaded the image into memory without freezing user space (it still
    freezes tasks after loading the image). This means that during user
    space driven resume we need to create the memory bitmaps at the
    "device open" time rather than at the "freeze tasks" time, so make
    that happen (that's a special case anyway, so it needs to be treated
    in a special way).

    Reported-and-tested-by: Ronald
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

29 Sep, 2013

2 commits

  • …nt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull scheduler, timer and x86 fixes from Ingo Molnar:
    - A context tracking ARM build and functional fix
    - A handful of ARM clocksource/clockevent driver fixes
    - An AMD microcode patch level sysfs reporting fixlet

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arm: Fix build error with context tracking calls

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
    clocksource: of: Respect device tree node status
    clocksource: exynos_mct: Set IRQ affinity when the CPU goes online
    arm: clocksource: mvebu: Use the main timer as clock source from DT

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/microcode/AMD: Fix patch level reporting for family 15h

    Linus Torvalds
     
  • Commit 6072ddc8520b ("kernel: replace strict_strto*() with kstrto*()")
    broke the handling of signed integer types, fix it.

    Signed-off-by: Jean Delvare
    Reported-by: Christian Kujau
    Tested-by: Christian Kujau
    Cc: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     

27 Sep, 2013

1 commit

    ad65782fba50 (context_tracking: Optimize main APIs off case
    with static key) converted the context tracking main APIs to inline
    functions and left the ARM asm callers behind.

    This can easily be fixed by making ARM call the post-static-key
    context tracking functions. We just need to replicate the
    static key checks there. We'll remove these later, once ARM
    supports the context tracking static keys.

    Reported-by: Guenter Roeck
    Reported-by: Russell King
    Signed-off-by: Frederic Weisbecker
    Tested-by: Kevin Hilman
    Cc: Nicolas Pitre
    Cc: Anil Kumar
    Cc: Tony Lindgren
    Cc: Benoit Cousson
    Cc: Guenter Roeck
    Cc: Russell King
    Cc: Kevin Hilman

    Frederic Weisbecker
     

26 Sep, 2013

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Three small fixes"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/balancing: Fix cfs_rq->task_h_load calculation
    sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
    sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Assorted standalone fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Add model number for Avoton Silvermont
    perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
    perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()
    perf: Update ABI comment
    tools lib lk: Uninclude linux/magic.h in debugfs.c
    perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms()
    perf probe: Fix finder to find lines of given function
    perf session: Check for SIGINT in more loops
    perf tools: Fix compile with libelf without get_phdrnum
    perf tools: Fix buildid cache handling of kallsyms with kcore
    perf annotate: Fix objdump line parsing offset validation
    perf tools: Fill in new definitions for madvise()/mmap() flags
    perf tools: Sharpen the libaudit dependencies test

    Linus Torvalds
     

25 Sep, 2013

4 commits

    Commit 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic
    kernel") did some cleanup of the reboot= command line handling, but it
    made reboot_default inoperative.

    The default value of the variable reboot_default should be 1, and if the
    reboot= command line option is not set, the system will use the default
    reboot mode.

    [akpm@linux-foundation.org: fix comment layout]
    Signed-off-by: Li Fei
    Signed-off-by: liu chuansheng
    Acked-by: Robin Holt
    Cc: [3.11.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuansheng Liu
     
    After commit 829199197a43 ("kernel/audit.c: avoid negative sleep
    durations") audit emitters will block forever if the userspace daemon
    cannot handle the backlog.

    After the timeout the waiting loop turns into a busy loop and runs until
    the daemon dies or returns to work. This is a minimal patch for that
    bug.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Luiz Capitulino
    Cc: Richard Guy Briggs
    Cc: Eric Paris
    Cc: Chuck Anderson
    Cc: Dan Duval
    Cc: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    watchdog_thresh controls how often the NMI perf event counter checks the
    per-cpu hrtimer_interrupts counter and blows up if the counter hasn't
    changed since the last check. The counter is updated by the per-cpu
    watchdog_hrtimer hrtimer, which is scheduled with a period of 2/5 of
    watchdog_thresh, guaranteeing that the hrtimer fires at least twice per
    main period. Both the hrtimer and the perf event are started together
    when the watchdog is enabled.

    So far so good. But...

    But what happens when watchdog_thresh is updated from sysctl handler?

    proc_dowatchdog will set a new sampling period and hrtimer callback
    (watchdog_timer_fn) will use the new value in the next round. The
    problem, however, is that nobody tells the perf event that the sampling
    period has changed so it is ticking with the period configured when it
    has been set up.

    This might result in an ear ripping dissonance between perf and hrtimer
    parts if the watchdog_thresh is increased. And even worse it might lead
    to KABOOM if the watchdog is configured to panic on such a spurious
    lockup.

    This patch fixes the issue by updating both the NMI perf event counter
    and the hrtimers if the threshold value has changed.

    The nmi one is disabled and then reinitialized from scratch. This has
    an unpleasant side effect that the allocation of the new event might
    fail theoretically so the hard lockup detector would be disabled for
    such cpus. On the other hand such a memory allocation failure is very
    unlikely because the original event is deallocated right before.

    It would be much nicer if we just changed perf event period but there
    doesn't seem to be any API to do that right now. It is also unfortunate
    that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
    cannot use on_each_cpu() and do the same thing from the per-cpu context.
    The update from the current CPU should be safe because
    perf_event_disable removes the event atomically before it clears the
    per-cpu watchdog_ev so it cannot change anything under running handler
    feet.

    The hrtimer is simply restarted (thanks to Don Zickus who pointed
    this out) if it is queued, because we cannot rely on it firing and
    adapting to the new sampling period before a new NMI event triggers
    (when the threshold is decreased).

    [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
    Signed-off-by: Michal Hocko
    Acked-by: Don Zickus
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Fabio Estevam
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    proc_dowatchdog doesn't synchronize multiple callers, which might lead
    to confusion when two parallel callers race in watchdog_enable_all_cpus
    resp. watchdog_disable_all_cpus (e.g. the watchdog gets enabled even
    though watchdog_thresh was already set to 0).

    This patch adds a local mutex which synchronizes callers to the sysctl
    handler.

    Signed-off-by: Michal Hocko
    Cc: Frederic Weisbecker
    Acked-by: Don Zickus
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 Sep, 2013

4 commits

    Patch a003a2 (sched: Consider runnable load average in move_tasks())
    sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
    always 0. This typo leads to all tasks having weight 0 when load
    balancing in a cpu-cgroup enabled setup. There should obviously be the
    sum of the weights of all runnable tasks there instead. Fix it.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    In the busiest->group_imb case we can come to fix_small_imbalance() with
    local->avg_load > busiest->avg_load. This can result in a wrong
    imbalance fix-up, because of the following check, in which all the
    members are unsigned:

    if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
                (scaled_busy_load_per_task * imbn)) {
            env->imbalance = busiest->load_per_task;
            return;
    }

    As a result we can end up constantly bouncing tasks from one cpu to
    another if there are pinned tasks.

    Fix it by substituting the subtraction with an equivalent addition in
    the check.

    [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
    belonging to different cores on an HT-enabled machine with N logical
    cpus: just look at se.nr_migrations growth. ]

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    In the busiest->group_imb case we can come to calculate_imbalance() with
    local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
    in an imbalance overflow, because it is calculated as follows:

    env->imbalance = min(
            max_pull * busiest->group_power,
            (sds->avg_load - local->avg_load) * local->group_power
    ) / SCHED_POWER_SCALE;

    As a result we can end up constantly bouncing tasks from one cpu to
    another if there are pinned tasks.

    Fix this by skipping the assignment and assuming imbalance=0 in case
    local->avg_load > sds->avg_load.

    [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
    belonging to different cores on an HT-enabled machine with N logical
    cpus: just look at se.nr_migrations growth. ]

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    Solve the problems around the broken definition of the
    perf_event_mmap_page::cap_usr_time and cap_usr_rdpmc fields, which
    used to overlap, partially fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with that fix (merged in v3.12-rc1 and not yet officially
    released), noticed by Vince Weaver, is that the new behavior is not
    detectable by new user-space, and that due to the reuse of the field
    names it's easy to mis-compile a binary if old headers are used on a
    new kernel or new headers are used on an old kernel.

    To solve all that make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries but that's within the ABI, this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Sep, 2013

1 commit

  • sched_info_depart seems to be only called from
    sched_info_switch(), so only on involuntary task switch.

    Fix the comment to match.

    Signed-off-by: Michael S. Tsirkin
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com
    Signed-off-by: Ingo Molnar

    Michael S. Tsirkin
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

12 commits

  • After the last architecture switched to generic hard irqs the config
    options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
    for !CONFIG_GENERIC_HARDIRQS can be removed.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
    The usage of strict_strto*() is not preferred because it is obsolete;
    kstrto*() should be used instead.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • This function dereferences res far too often, so optimize it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
    Since PAGE_ALIGN rounds up to the next page boundary, the value might
    overflow after PAGE_ALIGN, for example when writing the MAX value to
    *.limit_in_bytes.

    $ cat /cgroup/memory/memory.limit_in_bytes
    18446744073709551615

    # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
    bash: echo: write error: Invalid argument

    Some user programs might depend on such behaviour (like libcg: we read
    the value in a snapshot, then use the value to reset the cgroup later),
    and that will cause confusion. So we need to fix it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
    RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • Pull ACPI and power management fixes from Rafael Wysocki:
    "All of these commits are fixes that have emerged recently and some of
    them fix bugs introduced during this merge window.

    Specifics:

    1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

    After the recent ACPIPHP changes we've seen some interesting
    breakage on a system that triggers device check notifications
    during boot for non-existing devices. Although those
    notifications are really spurious, we should be able to deal with
    them nevertheless and that shouldn't introduce too much overhead.
    Four commits to make that work properly.

    2) Memory hotplug and hibernation mutual exclusion rework

    This was meant to be a cleanup, but it happens to fix a classical
    ABBA deadlock between system suspend/hibernation and ACPI memory
    hotplug which is possible if they are started roughly at the same
    time. Three commits rework memory hotplug so that it doesn't
    acquire pm_mutex and make hibernation use device_hotplug_lock
    which prevents it from racing with memory hotplug.

    3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

    The ACPI LPSS driver crashes during boot on Apple Macbook Air with
    Haswell that has slightly unusual BIOS configuration in which one
    of the LPSS device's _CRS method doesn't return all of the
    information expected by the driver. Fix from Mika Westerberg, for
    stable.

    4) ACPICA fix related to Store->ArgX operation

    AML interpreter fix for obscure breakage that causes AML to be
    executed incorrectly on some machines (observed in practice).
    From Bob Moore.

    5) ACPI core fix for PCI ACPI device objects lookup

    There still are cases in which there is more than one ACPI device
    object matching a given PCI device and we don't choose the one
    that the BIOS expects us to choose, so this makes the lookup take
    more criteria into account in those cases.

    6) Fix to prevent cpuidle from crashing in some rare cases

    If the result of cpuidle_get_driver() is NULL, which can happen on
    some systems, cpuidle_driver_ref() will crash trying to use that
    pointer, and Daniel Fu's fix prevents that from happening.

    7) cpufreq fixes related to CPU hotplug

    Stephen Boyd reported a number of concurrency problems with
    cpufreq related to CPU hotplug which are addressed by a series of
    fixes from Srivatsa S Bhat and Viresh Kumar.

    8) cpufreq fix for time conversion in time_in_state attribute

    Time conversion carried out by cpufreq when user space attempts to
    read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
    won't work correctly if cputime_t doesn't map directly to jiffies.
    Fix from Andreas Schwab.

    9) Revert of a troublesome cpufreq commit

    Commit 7c30ed5 (cpufreq: make sure frequency transitions are
    serialized) was intended to address some known concurrency
    problems in cpufreq related to the ordering of transitions, but
    unfortunately it introduced several problems of its own, so I
    decided to revert it now and address the original problems later
    in a more robust way.

    10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

    11) cpufreq fixes related to system suspend/resume

    The recent cpufreq changes that made it preserve CPU sysfs
    attributes over suspend/resume cycles introduced a possible NULL
    pointer dereference that caused it to crash during the second
    attempt to suspend. Three commits from Srivatsa S Bhat fix that
    problem and a couple of related issues.

    12) cpufreq locking fix

    cpufreq_policy_restore() should acquire the lock for reading, but
    it acquires it for writing. Fix from Lan Tianyu"
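
    The ABBA deadlock rework described in item 2 comes down to making
    both paths agree on a single lock order. A minimal user-space
    sketch of that principle (this is not the kernel code; the lock
    names are only stand-ins for device_hotplug_lock and the
    hibernation path's locking):

    ```c
    /* Illustrative only: an ABBA deadlock needs two paths taking the
     * same pair of locks in opposite orders. The fix imposes one
     * order -- memory hotplug no longer takes pm_mutex, and
     * hibernation takes device_hotplug_lock -- so both paths below
     * use the same A-then-B order and cannot deadlock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* stand-in: device_hotplug_lock */
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* stand-in: a subsystem lock */

    /* Both threads acquire A then B; no ABBA cycle is possible. */
    static void *worker(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&lock_a);
            pthread_mutex_lock(&lock_b);
            /* critical section */
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            puts("no deadlock: both paths use the same lock order");
            return 0;
    }
    ```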

    * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
    cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
    cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
    cpufreq: Restructure if/else block to avoid unintended behavior
    cpufreq: Fix crash in cpufreq-stats during suspend/resume
    intel_pstate: Add Haswell CPU models
    Revert "cpufreq: make sure frequency transitions are serialized"
    cpufreq: Use signed type for 'ret' variable, to store negative error values
    cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
    cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
    cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
    cpufreq: Split __cpufreq_remove_dev() into two parts
    cpufreq: Fix wrong time unit conversion
    cpufreq: serialize calls to __cpufreq_governor()
    cpufreq: don't allow governor limits to be changed when it is disabled
    ACPI / bind: Prefer device objects with _STA to those without it
    ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
    ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
    ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
    ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
    ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Various fixes.

    The -g perf report lockup you reported is only partially addressed;
    patches that fix the excessive runtime are still being worked on"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix uncore PCI fixed counter handling
    uprobes: Fix utask->depth accounting in handle_trampoline()
    perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
    perf: Fix up MMAP2 buffer space reservation
    perf tools: Add attr->mmap2 support
    perf kvm: Fix sample_type manipulation
    perf evlist: Fix id pos in perf_evlist__open()
    perf trace: Handle perf.data files with no tracepoints
    perf session: Separate progress bar update when processing events
    perf trace: Check if MAP_32BIT is defined
    perf hists: Fix formatting of long symbol names
    perf evlist: Fix parsing with no sample_id_all bit set
    perf tools: Add test for parsing with no sample_id_all bit
    perf trace: Check control+C more often

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Performance regression fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix load balancing performance regression in should_we_balance()

    Linus Torvalds
     
  • Emmanuel reported that /proc/sched_debug didn't report the right PIDs
    when using namespaces, cure this.

    Reported-by: Emmanuel Deloget
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a small race between copy_process() and cgroup_attach_task()
    where child->se.parent,cfs_rq points to invalid (old) ones.

    parent doing fork()            | someone moving the parent to another cgroup
    -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

    In the worst case, this bug can lead to a use-after-free and cause a panic,
    because it is the new cgroup's refcount that is incremented at (*1),
    so the old cgroup (and its related data) can be freed before (*2).

    In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
    [] place_entity+0x75/0xa0
    [] task_fork_fair+0xaa/0x160
    [] sched_fork+0x6b/0x140
    [] copy_process+0x5b2/0x1450
    [] ? wake_up_new_task+0xd9/0x130
    [] do_fork+0x94/0x460
    [] ? sys_wait4+0xae/0x100
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b
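
    The refcount hazard at (*1)/(*2) can be reduced to a tiny user-space
    sketch (not the kernel code; struct and function names here are
    invented for illustration): the child pins the new group, but its
    scheduler entity still points into the old group's data, which the
    migration path is free to release.

    ```c
    /* Illustrative sketch of the (*1)/(*2) window: the reference the
     * child holds protects the NEW group, while the pointer it still
     * uses refers to the OLD group, so the old data can be "freed"
     * underneath it. */
    #include <stdio.h>
    #include <stdlib.h>

    struct group { int refs; int freed; };

    static struct group *group_get(struct group *g) { g->refs++; return g; }

    static void group_put(struct group *g)
    {
            if (--g->refs == 0)
                    g->freed = 1;   /* stand-in for the real free */
    }

    int main(void)
    {
            struct group *old_grp = calloc(1, sizeof(*old_grp));
            struct group *new_grp = calloc(1, sizeof(*new_grp));

            old_grp->refs = 1;                        /* parent's original ref */

            struct group *child_ref = group_get(new_grp); /* (*1): child pins NEW group */
            group_put(old_grp);                       /* migration drops the OLD group */

            struct group *stale = old_grp;            /* (*2): child still uses old data */
            printf("stale data freed: %d (use-after-free window)\n", stale->freed);

            (void)child_ref;
            free(old_grp);
            free(new_grp);
            return 0;
    }
    ```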

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     

12 Sep, 2013

2 commits

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
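
    The effect of the typo can be simulated in a few lines (this is a
    user-space model, not the kernel loop): tying the decrement to
    ->chained means ordinary, non-chained return instances leak depth.

    ```c
    /* Minimal model of the bug: depth must drop once per handled
     * instance, but the buggy variant only drops it for chained ones,
     * so with normal (non-chained) returns depth never decreases. */
    #include <stdio.h>

    struct ret_instance { int chained; };

    static void drain_buggy(const struct ret_instance *ri, int n, int *depth)
    {
            for (int i = 0; i < n; i++)
                    if (ri[i].chained)      /* bug: decrement tied to ->chained */
                            (*depth)--;
    }

    static void drain_fixed(const struct ret_instance *ri, int n, int *depth)
    {
            for (int i = 0; i < n; i++)
                    (*depth)--;             /* fix: decrement for every instance */
    }

    int main(void)
    {
            struct ret_instance stack[3] = { {0}, {0}, {0} }; /* non-chained returns */
            int d_buggy = 3, d_fixed = 3;

            drain_buggy(stack, 3, &d_buggy);
            drain_fixed(stack, 3, &d_fixed);
            printf("buggy depth=%d fixed depth=%d\n", d_buggy, d_fixed);
            return 0;
    }
    ```

    The buggy drain leaves depth stuck at its starting value, which is
    exactly why the probed task eventually hits MAX_URETPROBE_DEPTH.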

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Gerlando Falauto reported that when HRTICK is enabled, it is
    possible to trigger system deadlocks. These were hard to
    reproduce, as HRTICK has been broken in the past, but seemed
    to be connected to the timekeeping_seq lock.

    Since seqlocks/seqcounts aren't supported by lockdep, I added
    some extra spinlock-based locking and triggered the following
    lockdep output:

    [ 15.849182] ntpd/4062 is trying to acquire lock:
    [ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [] __queue_work+0x145/0x480
    [ 15.850051]
    [ 15.850051] but task is already holding lock:
    [ 15.850051] (timekeeper_lock){-.-.-.}, at: [] do_adjtimex+0x7f/0x100

    [ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock
    [ 15.850051] Possible unsafe locking scenario:
    [ 15.850051]
    [ 15.850051]        CPU0                    CPU1
    [ 15.850051]        ----                    ----
    [ 15.850051]   lock(timekeeper_lock);
    [ 15.850051]                                lock(&p->pi_lock);
    [ 15.850051]                                lock(timekeeper_lock);
    [ 15.850051]   lock(&(&pool->lock)->rlock);
    [ 15.850051]
    [ 15.850051] *** DEADLOCK ***

    The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping:
    Hold timekeepering locks in do_adjtimex and hardpps") in 3.10.

    This patch avoids this deadlock, by moving the call to
    schedule_delayed_work() outside of the timekeeper lock
    critical section.
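
    The pattern behind the fix -- record the side effect under the
    lock, but run it only after dropping the lock -- can be sketched
    in user space (illustrative only; the names below are stand-ins
    for the timekeeper lock and schedule_delayed_work()):

    ```c
    /* Sketch of "don't call into other subsystems while holding your
     * lock": update the protected state and remember what needs doing
     * inside the critical section, then do it lock-free afterwards,
     * so no other subsystem's locks are taken under state_lock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in: timekeeper lock */
    static int clock_was_set;

    /* stand-in for schedule_delayed_work(), which may take other locks */
    static void notify(void)
    {
            puts("deferred work scheduled outside the lock");
    }

    static void adjtimex_like(void)
    {
            int need_notify;

            pthread_mutex_lock(&state_lock);
            clock_was_set = 1;              /* update protected state */
            need_notify = clock_was_set;    /* remember the side effect... */
            pthread_mutex_unlock(&state_lock);

            if (need_notify)                /* ...and run it with no locks held */
                    notify();
    }

    int main(void)
    {
            adjtimex_like();
            return 0;
    }
    ```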

    Reported-by: Gerlando Falauto
    Tested-by: Lin Ming
    Signed-off-by: John Stultz
    Cc: Mathieu Desnoyers
    Cc: stable #3.11, 3.10
    Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz