26 Jan, 2014

1 commit

  • commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5 upstream.

    Serge Hallyn writes:
    > Hi Oleg,
    >
    > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
    > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
    > breaks lxc-attach in 3.12. That code forks a child which does
    > setns() and then does a clone(CLONE_PARENT). That way the
    > grandchild can be in the right namespaces (which the child was
    > not) and be a child of the original task, which is the monitor.
    >
    > lxc-attach in 3.11 was working fine with no side effects that I
    > could see. Is there a real danger in allowing CLONE_PARENT
    > when current->nsproxy->pidns_for_children is not our pidns,
    > or was this done out of an "over-abundance of caution"? Can we
    > safely revert that new extra check?
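
    For reference, the lxc-attach pattern Serge describes boils down to
    something like the following userspace sketch. This is an illustration
    reconstructed from the description above, not lxc's actual code; error
    handling is omitted and the raw-clone argument order assumes x86-64:

        #define _GNU_SOURCE
        #include <sched.h>        /* setns, CLONE_* */
        #include <signal.h>       /* SIGCHLD */
        #include <unistd.h>       /* fork, _exit, syscall */
        #include <sys/syscall.h>  /* SYS_clone */

        static void attach(int pidns_fd)
        {
            pid_t child = fork();
            if (child == 0) {
                /* child: join the container's pid namespace */
                setns(pidns_fd, CLONE_NEWPID);
                /* raw clone; a NULL stack gives fork-like semantics */
                long pid = syscall(SYS_clone, CLONE_PARENT | SIGCHLD,
                                   NULL, NULL, NULL, NULL);
                if (pid == 0) {
                    /* grandchild: lives in the new namespaces and is a
                     * child of the original task (the monitor) */
                }
                _exit(0);
            }
        }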

    The two fundamental things I know we cannot allow are:
    - A shared signal queue aka CLONE_THREAD. Because we compute the pid
    and uid of the signal when we place it in the queue.

    - Changing the pid, and by extension the pid_namespace, of an
    existing process.

    From a parent's perspective there is nothing special about the pid
    namespace that would justify denying CLONE_PARENT, because the parent
    simply won't know or care.

    From the child's perspective, all that is really special is shared
    signal queues.

    User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
    in different pid namespaces is almost certainly going to break because
    it is complicated. But shared signal handlers can look at per thread
    information to know which pid namespace a process is in, so I don't know
    of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
    at the kernel level. It would be absolutely stupid to implement but
    that is a different thing.

    So hmm.

    Because it can do no harm, and because it is a regression, let's
    remove the CLONE_PARENT check and send it to stable.

    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

16 Jan, 2014

4 commits

  • commit 0ac9b1c21874d2490331233b3242085f8151e166 upstream.

    Currently, group entity load-weights are initialized to zero. This
    admits some races with respect to the first time they are re-weighted
    in early use. ( Let g[x] denote the se for "g" on cpu "x". )

    Suppose that we have root->a and that a enters a throttled state,
    immediately followed by a[0]->t1 (the only task running on cpu[0])
    blocking:

    put_prev_task(group_cfs_rq(a[0]), t1)
    put_prev_entity(..., t1)
    check_cfs_rq_runtime(group_cfs_rq(a[0]))
    throttle_cfs_rq(group_cfs_rq(a[0]))

    Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
    time:

    enqueue_task_fair(rq[0], t2)
    enqueue_entity(group_cfs_rq(b[0]), t2)
    enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
    account_entity_enqueue(group_cfs_rq(b[0]), t2)
    update_cfs_shares(group_cfs_rq(b[0]))
    < skipped because b is part of a throttled hierarchy >
    enqueue_entity(group_cfs_rq(a[0]), b[0])
    ...

    We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
    which violates invariants in several code-paths. Eliminate the
    possibility of this by initializing group entity weight.
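
    The fix itself is essentially a one-liner when the group se is set up;
    roughly (a sketch based on the description above, using the fair.c
    helpers of that era):

        /* in init_tg_cfs_entry(), when creating a group entity:
         * guarantee group entities always have a non-zero weight */
        update_load_set(&se->load, NICE_0_LOAD);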

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Cc: Chris J Arges
    Signed-off-by: Greg Kroah-Hartman

    Paul Turner
     
  • commit 927b54fccbf04207ec92f669dce6806848cbec7d upstream.

    __start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
    waiting for the hrtimer to finish. However, if sched_cfs_period_timer
    runs for another loop iteration, the hrtimer can attempt to take
    rq->lock, resulting in deadlock.

    Fix this by ensuring that cfs_b->timer_active is cleared only if the
    _latest_ call to do_sched_cfs_period_timer is returning as idle. Then
    __start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
    that to succeed or timer_active == 1.
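
    The resulting wait loop in __start_cfs_bandwidth() looks roughly like
    this (a sketch; the upstream code may differ in detail):

        /* called with cfs_b->lock held */
        while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
               !hrtimer_try_to_cancel(&cfs_b->period_timer)) {
            /* bounce the lock so the running callback can take it
             * and finish, instead of deadlocking against us */
            raw_spin_unlock(&cfs_b->lock);
            cpu_relax();
            raw_spin_lock(&cfs_b->lock);
            /* if the callback restarted the period timer, we're done */
            if (cfs_b->timer_active)
                return;
        }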

    Signed-off-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Cc: Chris J Arges
    Signed-off-by: Greg Kroah-Hartman

    Ben Segall
     
  • commit db06e78cc13d70f10877e0557becc88ab3ad2be8 upstream.

    hrtimer_expires_remaining does not take internal hrtimer locks and thus
    must be guarded against concurrent __hrtimer_start_range_ns (but
    returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.

    Signed-off-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Cc: Chris J Arges
    Signed-off-by: Greg Kroah-Hartman

    Ben Segall
     
  • commit 1ee14e6c8cddeeb8a490d7b54cd9016e4bb900b4 upstream.

    When we transition cfs_bandwidth_used to false, any currently
    throttled groups will incorrectly return false from cfs_rq_throttled.
    While tg_set_cfs_bandwidth will unthrottle them eventually, currently
    running code (including at least dequeue_task_fair and
    distribute_cfs_runtime) will cause errors.

    Fix this by turning off cfs_bandwidth_used only after unthrottling all
    cfs_rqs.

    Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
    crashes in minutes without the patch, hasn't crashed with it.

    Signed-off-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Cc: Chris J Arges
    Signed-off-by: Greg Kroah-Hartman

    Ben Segall
     

10 Jan, 2014

7 commits

  • commit 20841405940e7be0617612d521e206e4b6b325db upstream.

    There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                      CPU B                      CPU C

                                                          load TLB entry
    make entry PTE/PMD_NUMA
                               fault on entry
                                                          read/write old page
                               start migrating page
                               change PTE/PMD to new page
                                                          read/write old page [*]
    flush TLB
                                                          reload TLB from new entry
                                                          read/write new page
                                                          lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA
    memory may still be accessible if there is a TLB flush pending for
    the mm.

    This should fix both NUMA migration and compaction.
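
    On x86 this means pte_accessible() takes the mm and reports such PTEs
    as accessible while a flush is pending; roughly (sketch):

        static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
        {
            if (pte_flags(a) & _PAGE_PRESENT)
                return true;

            /* PROT_NONE/NUMA ptes may still be cached in remote TLBs
             * until the pending flush completes */
            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                return true;

            return false;
        }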

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
     
  • commit 3c67f474558748b604e247d92b55dfe89654c81d upstream.

    Inaccessible VMAs should not be trapping NUMA hint faults. Skip them.
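
    The check added to the NUMA hinting scan is essentially (sketch):

        /* in task_numa_work(), while walking vmas: skip inaccessible
         * VMAs to avoid confusion between PROT_NONE and NUMA hinting
         * ptes */
        if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
            continue;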

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557 upstream.

    Freezable kthreads and workqueues are fundamentally problematic in
    that they effectively introduce a big kernel lock widely used in the
    kernel and have already been the culprit of several deadlock
    scenarios. This is the latest occurrence.

    During resume, libata rescans all the ports and revalidates all
    pre-existing devices. If it determines that a device has gone
    missing, the device is removed from the system which involves
    invalidating block device and flushing bdi while holding driver core
    layer locks. Unfortunately, this can race with the rest of device
    resume. Because freezable kthreads and workqueues are thawed after
    device resume is complete and block device removal depends on
    freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
    progress, this can lead to deadlock - block device removal can't
    proceed because kthreads are frozen and kthreads can't be thawed
    because device resume is blocked behind block device removal.

    839a8e8660b6 ("writeback: replace custom worker pool implementation
    with unbound workqueue") made this particular deadlock scenario more
    visible but the underlying problem has always been there - the
    original forker task and jbd2 are freezable too. In fact, this is
    highly likely just one of many possible deadlock scenarios given that
    freezer behaves as a big kernel lock and we don't have any debug
    mechanism around it.

    I believe the right thing to do is getting rid of freezable kthreads
    and workqueues. This is something fundamentally broken. For now,
    implement a funny workaround in libata - just avoid doing block device
    hot[un]plug while the system is frozen. Kernel engineering at its
    finest. :(
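
    The workaround amounts to polling pm_freezing before doing the
    hotplug work, along these lines (a sketch; see the v2-v4 notes below
    for the refinements):

        /* in ata_scsi_hotplug(), before removing/probing devices */
        #ifdef CONFIG_FREEZER
            /* device removal deadlocks against frozen kthreads and
             * workqueues; wait for the thaw, polling every 10ms */
            while (pm_freezing)
                msleep(10);
        #endif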

    v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

    v3: Comment updated and polling interval changed to 10ms as suggested
    by Rafael.

    v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
    defined when FREEZER is not configured thus breaking build.
    Reported by kbuild test robot.

    Signed-off-by: Tejun Heo
    Reported-by: Tomaž Šolc
    Reviewed-by: "Rafael J. Wysocki"
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
    Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
    Cc: Greg Kroah-Hartman
    Cc: Len Brown
    Cc: Oleg Nesterov
    Cc: kbuild test robot
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 266ccd505e8acb98717819cef9d91d66c7b237cc upstream.

    ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to
    online_css()") moved the cgroup->subsys[] assignments later into
    cgroup_create() but didn't update the error handling path accordingly,
    leading to the following oops and to leaking later css's after an
    online_css() failure. The oops is from the cgroup destruction path
    being invoked on the partially constructed cgroup, which is not ready
    to handle empty slots in the cgrp->subsys[] array.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] cgroup_destroy_locked+0x118/0x2f0
    PGD a780a067 PUD aadbe067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in:
    CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
    Hardware name:
    task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
    RIP: 0010:[] [] cgroup_destroy_locked+0x118/0x2f0
    RSP: 0018:ffff8800a781bd98 EFLAGS: 00010282
    RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
    RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
    RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
    R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
    R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
    FS: 00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
    Stack:
    ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
    ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
    ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
    Call Trace:
    [] cgroup_mkdir+0x55f/0x5f0
    [] vfs_mkdir+0xee/0x140
    [] SyS_mkdirat+0x6e/0xf0
    [] SyS_mkdir+0x19/0x20
    [] system_call_fastpath+0x16/0x1b

    This patch moves the reference bumping inside the online_css() loop,
    clears css_ar[] as css's are brought online successfully, and updates
    the err_destroy path so that either a css is fully online and
    destroyed by cgroup_destroy_locked() or the error path frees it. This
    creates duplicate css free logic in the error path, but it will be
    cleaned up soon.

    v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
    invoked with a cgroup which doesn't have all css's populated.
    Update cgroup_destroy_locked() so that it skips NULL css's.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reported-by: Vladimir Davydov
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 757dfcaa41844595964f1220f1d33182dae49976 upstream.

    This patch touches the RT group scheduling case.

    Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global)
    rq's priority, while the rt_rq passed to them may not be the top-level
    rt_rq. This is wrong, because changing the priority at a child level
    does not guarantee that the priority is the highest all over the rq.
    So, this leak makes RT balancing unusable.

    A short example: the task with the highest priority among all of the
    rq's RT tasks (no other task has the same priority) wakes up on a
    throttled rt_rq. The rq's cpupri is set to the task's priority
    equivalent, but the real rq->rt.highest_prio.curr is lower.

    The patch below fixes the problem.
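
    The core of that patch is a guard so that only the top-level rt_rq
    updates the rq's cpupri; roughly (sketch):

        static void
        inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
        {
            struct rq *rq = rq_of_rt_rq(rt_rq);

        #ifdef CONFIG_RT_GROUP_SCHED
            /* change the rq's cpupri only if rt_rq is the top queue */
            if (&rq->rt != rt_rq)
                return;
        #endif
            if (rq->online && prio < prev_prio)
                cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
        }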

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    CC: Steven Rostedt
    Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     
  • commit c4602c1c818bd6626178d6d3fcc152d9f2f48ac0 upstream.

    Ftrace currently initializes only the online CPUs. This implementation has
    two problems:
    - If we online a CPU after we enable the function profile, and then run the
    test, we will lose the trace information on that CPU.
    Steps to reproduce:
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # cd /tracing/
    # echo >> set_ftrace_filter
    # echo 1 > function_profile_enabled
    # echo 1 > /sys/devices/system/cpu/cpu1/online
    # run test
    - If we offline a CPU before we enable the function profile, we will
    not clear the trace information when we enable the function profile.
    This will confuse users.
    Steps to reproduce:
    # cd /tracing/
    # echo >> set_ftrace_filter
    # echo 1 > function_profile_enabled
    # run test
    # cat trace_stat/function*
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # echo 0 > function_profile_enabled
    # echo 1 > function_profile_enabled
    # cat trace_stat/function*
    # run test
    # cat trace_stat/function*

    So it is better that we initialize the ftrace profiler for each possible cpu
    every time we enable the function profile instead of just the online ones.
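
    In code terms this is the classic online-to-possible iterator swap
    (sketch):

        /* in ftrace_profile_init(): initialize every possible CPU so a
         * later hotplug doesn't expose uninitialized per-cpu state */
        for_each_possible_cpu(cpu) {
            ret = ftrace_profile_init_cpu(cpu);
            if (ret)
                break;
        }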

    Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com

    Signed-off-by: Miao Xie
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Miao Xie
     
  • commit c97102ba96324da330078ad8619ba4dfe840dbe3 upstream.

    Commit 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic
    kernel") moved reboot= handling to generic code. In the process it also
    removed the code in native_machine_shutdown() which moved the reboot
    process to reboot_cpu/cpu0.

    I guess the thought must have been that all reboot paths call
    migrate_to_reboot_cpu(), so we don't need this special handling. But
    the kexec reboot path (kernel_kexec()) does not call
    migrate_to_reboot_cpu(), so the above change broke kexec. Now reboot
    can happen on a non-boot cpu, and when INIT is sent in the second
    kernel to bring up the BP, it brings down the machine.

    So start calling migrate_to_reboot_cpu() in the kexec reboot path to
    avoid this problem.
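
    The fix is a single call at the top of the non-crash branch of
    kernel_kexec(), roughly (sketch):

        /* in kernel_kexec(), before machine_shutdown(): move to the
         * boot cpu just like the normal reboot path does */
        migrate_to_reboot_cpu();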

    Bisected by WANG Chao.

    Reported-by: Matthew Whitehead
    Reported-by: Dave Young
    Signed-off-by: Vivek Goyal
    Tested-by: Baoquan He
    Tested-by: WANG Chao
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vivek Goyal
     

20 Dec, 2013

3 commits

  • commit f9f9ffc237dd924f048204e8799da74f9ecf40cf upstream.

    throttle_cfs_rq() doesn't check to make sure that period_timer is running,
    and while update_curr/assign_cfs_runtime does, a concurrently running
    period_timer on another cpu could cancel itself between this cpu's
    update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
    in the tg to restart the timer, this causes the cfs_rq to be stranded
    forever.

    Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
    inactive.

    (Also add some sched_debug lines for cfs_bandwidth.)
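
    The core of the change in throttle_cfs_rq() is roughly (sketch):

        /* with cfs_b->lock held: if the period timer raced to cancel
         * itself, restart it so this cfs_rq is guaranteed to be
         * unthrottled eventually */
        if (!cfs_b->timer_active)
            __start_cfs_bandwidth(cfs_b);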

    Tested: make a run/sleep task in a cgroup, loop switching the cgroup
    between 1ms/100ms quota and unlimited, checking for timer_active=0 and
    throttled=1 as a failure. With the throttle_cfs_rq() change commented out
    this fails, with the full patch it passes.

    Signed-off-by: Ben Segall
    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Cc: Chris J Arges
    Signed-off-by: Greg Kroah-Hartman

    Ben Segall
     
  • commit f12d5bfceb7e1f9051563381ec047f7f13956c3c upstream.

    The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge") case
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 4fc9bbf98fd66f879e628d8537ba7c240be2b58e upstream.

    Add a flag to tell the PCI subsystem that the kernel is shutting down
    in preparation to kexec a kernel. Add code in the PCI subsystem to use
    this flag to clear the Bus Master bit on PCI devices only in the case
    of a kexec reboot.

    This fixes a power-off problem on Acer Aspire V5-573G and likely other
    machines and avoids any other issues caused by clearing Bus Master bit on
    PCI devices in normal shutdown path. The problem was introduced by
    b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown").
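
    The PCI shutdown path then clears Bus Master only for kexec, roughly
    (a sketch; upstream the flag is kexec_in_progress, set from
    kernel_kexec()):

        /* in pci_device_shutdown(): on a kexec reboot, stop the device
         * from doing DMA into the next kernel's memory */
        if (kexec_in_progress)
            pci_clear_master(pci_dev);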

    This patch is based on discussion at
    http://marc.info/?l=linux-pci&m=138425645204355&w=2

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
    Reported-by: Chang Liu
    Signed-off-by: Khalid Aziz
    Signed-off-by: Bjorn Helgaas
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Khalid Aziz
     

12 Dec, 2013

2 commits

  • commit ac01810c9d2814238f08a227062e66a35a0e1ea2 upstream.

    When the system enters suspend, it disables all interrupts in
    suspend_device_irqs(), including the interrupts marked EARLY_RESUME.

    On the resume side things are different. The EARLY_RESUME interrupts
    are reenabled in sys_core_ops->resume and the non EARLY_RESUME
    interrupts are reenabled in the normal system resume path.

    When suspend_noirq() fails or suspend is aborted for any other
    reason, we might omit the resume-side call to sys_core_ops->resume(),
    and therefore the interrupts marked EARLY_RESUME are not reenabled and
    stay disabled forever.

    To solve this, enable all irqs unconditionally in irq_resume()
    regardless of whether interrupts marked EARLY_RESUME have already been
    enabled or not.

    This might try to reenable already enabled interrupts in the non
    failure case, but the only affected platform is XEN and it has been
    confirmed that it does not cause any side effects.

    [ tglx: Massaged changelog. ]

    Signed-off-by: Laxman Dewangan
    Acked-by-and-tested-by: Konrad Rzeszutek Wilk
    Acked-by: Heiko Stuebner
    Reviewed-by: Pavel Machek
    Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Laxman Dewangan
     
  • commit 4be77398ac9d948773116b6be4a3c91b3d6ea18c upstream.

    Since commit 1e75fa8be9f (time: Condense timekeeper.xtime
    into xtime_sec - merged in v3.6), there has been a problem
    with the error accounting in the timekeeping code, such that
    when truncating to nanoseconds, we round up to the next nsec,
    but the balancing adjustment to the ntp_error value was dropped.

    This causes 1ns per tick drift forward of the clock.

    In 3.7, this logic was isolated to only GENERIC_TIME_VSYSCALL_OLD
    architectures (s390, ia64, powerpc).

    The fix is simply to balance the accounting and to subtract the
    added nanosecond from ntp_error. This allows the internal long-term
    clock steering to keep the clock accurate.
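
    In code, the balancing looks roughly like this (a sketch of the
    GENERIC_TIME_VSYSCALL_OLD fixup described above):

        /* round xtime_nsec up to the next nanosecond, and account the
         * same amount in ntp_error so long-term steering sees no net
         * change */
        remainder = tk->xtime_nsec & ((1ULL << tk->shift) - 1);
        tk->xtime_nsec -= remainder;
        tk->xtime_nsec += 1ULL << tk->shift;
        tk->ntp_error += remainder << tk->ntp_error_shift;
        tk->ntp_error -= (1ULL << tk->shift) << tk->ntp_error_shift;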

    While this fix removes the regression added in 1e75fa8be9f, the
    ideal solution is to move away from GENERIC_TIME_VSYSCALL_OLD
    and use the new VSYSCALL method, which avoids entirely the
    nanosecond granular rounding, and the resulting short-term clock
    adjustment oscillation needed to keep long term accurate time.

    [ jstultz: Many thanks to Martin for his efforts identifying this
    subtle bug, and providing the fix. ]

    Originally-from: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Andy Lutomirski
    Cc: Paul Turner
    Cc: Steven Rostedt
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Fenghua Yu
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1385149491-20307-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Martin Schwidefsky
     

08 Dec, 2013

1 commit

  • commit a97ad0c4b447a132a322cedc3a5f7fa4cab4b304 upstream.

    The current code requires that the scheduled update of the RTC happens
    in the tick closest to the half of the second. This seems to be
    difficult to achieve reliably. The scheduled work may miss the target
    time by a tick or two and be constantly rescheduled every second.

    Relax the limit to 10 ticks. As a typical RTC drifts in the 11-minute
    update interval by several milliseconds, this shouldn't affect the
    overall accuracy of the RTC much.
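
    The relaxed window check amounts to something like (a sketch; +/- 5
    ticks around the half-second target gives the 10-tick window):

        /* in sync_cmos_clock(): only update the RTC when we are close
         * enough to the middle of the second */
        if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec * 5)
            fail = update_persistent_clock(now);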

    Signed-off-by: Miroslav Lichvar
    Signed-off-by: John Stultz
    Cc: Josh Boyer
    Signed-off-by: Greg Kroah-Hartman

    Miroslav Lichvar
     

05 Dec, 2013

13 commits

  • commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream.

    Juri hit the below lockdep report:

    [ 4.303391] ======================================================
    [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted
    [ 4.303395] ------------------------------------------------------
    [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [] new_slab+0x6c/0x290
    [ 4.303417]
    [ 4.303417] and this task is already holding:
    [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [] blk_execute_rq_nowait+0x5b/0x100
    [ 4.303431] which would create a new lock dependency:
    [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
    [ 4.303436]

    [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
    [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
    [ 4.303922] HARDIRQ-ON-W at:
    [ 4.303923] [] __lock_acquire+0x65a/0x1ff0
    [ 4.303926] [] lock_acquire+0x93/0x140
    [ 4.303929] [] kthreadd+0x86/0x180
    [ 4.303931] [] ret_from_fork+0x7c/0xb0
    [ 4.303933] SOFTIRQ-ON-W at:
    [ 4.303933] [] __lock_acquire+0x68c/0x1ff0
    [ 4.303935] [] lock_acquire+0x93/0x140
    [ 4.303940] [] kthreadd+0x86/0x180
    [ 4.303955] [] ret_from_fork+0x7c/0xb0
    [ 4.303959] INITIAL USE at:
    [ 4.303960] [] __lock_acquire+0x344/0x1ff0
    [ 4.303963] [] lock_acquire+0x93/0x140
    [ 4.303966] [] kthreadd+0x86/0x180
    [ 4.303969] [] ret_from_fork+0x7c/0xb0
    [ 4.303972] }

    Which reports that we take mems_allowed_seq with interrupts enabled. A
    little digging found that this can only be from
    cpuset_change_task_nodemask().

    This is an actual deadlock because an interrupt doing an allocation will
    hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
    forever waiting for the write side to complete.
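
    The fix makes the write side irq-safe, roughly (sketch):

        /* in cpuset_change_task_nodemask(): an interrupt on this cpu
         * must never observe the write section in progress, or the
         * allocator's seqcount reader will spin forever */
        local_irq_disable();
        write_seqcount_begin(&tsk->mems_allowed_seq);
        tsk->mems_allowed = *newmems;
        write_seqcount_end(&tsk->mems_allowed_seq);
        local_irq_enable();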

    Cc: John Stultz
    Cc: Mel Gorman
    Reported-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Tested-by: Juri Lelli
    Acked-by: Li Zefan
    Acked-by: Mel Gorman
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit e605b36575e896edd8161534550c9ea021b03bc0 upstream.

    If a cgroup file implements either read_map() or read_seq_string(),
    such file is served using seq_file by overriding file->f_op to
    cgroup_seqfile_operations, which also overrides the release method to
    single_release() from cgroup_file_release().

    Because cgroup_file_open() originally didn't acquire any resources,
    this used to be fine, but since f7d58818ba42 ("cgroup: pin
    cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
    pins the css (cgroup_subsys_state), which is put by
    cgroup_file_release(). That patch forgot to update the release path
    for seq_files, so each open/release cycle leaks a css reference.

    Fix it by updating cgroup_file_release() to also handle seq_files and
    using it for seq_file release path too.
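
    Shape of the fix (a sketch; the point is that the seq_file ops now
    share the common release so the css reference taken at open time is
    always dropped):

        static const struct file_operations cgroup_seqfile_operations = {
            .read    = seq_read,
            .write   = cgroup_file_write,
            .llseek  = seq_lseek,
            .release = cgroup_file_release,  /* was single_release */
        };

        /* ... with cgroup_file_release() calling single_release() for
         * seq_files in addition to putting the pinned css */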

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream.

    Since be44562613851 ("cgroup: remove synchronize_rcu() from
    cgroup_diput()"), the cgroup destruction path makes use of a
    workqueue. css freeing is performed from a work item from that point
    on, and a later commit, ea15f8ccdb430 ("cgroup: split cgroup
    destruction into two steps"), moved css offlining to a workqueue too.

    As cgroup destruction isn't depended upon for memory reclaim, the
    destruction work items were put on the system_wq; unfortunately, some
    controllers may block in the destruction path for a considerable
    duration while holding cgroup_mutex. As a large part of the
    destruction path is synchronized through cgroup_mutex, when combined
    with a high rate of cgroup removals this has the potential to fill up
    system_wq's max_active of 256.

    Also, it turns out that memcg's css destruction path ends up queueing
    and waiting for work items on system_wq through work_on_cpu(). If
    such operation happens while system_wq is fully occupied by cgroup
    destruction work items, work_on_cpu() can't make forward progress
    because system_wq is full and other destruction work items on
    system_wq can't make forward progress because the work item waiting
    for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

    This can be fixed by queueing destruction work items on a separate
    workqueue. This patch creates a dedicated workqueue -
    cgroup_destroy_wq - for this purpose. As these work items shouldn't
    have inter-dependencies and are mostly serialized by cgroup_mutex
    anyway, a high concurrency level doesn't buy anything, and the
    workqueue's @max_active is set to 1 so that destruction work items are
    executed one by one on each CPU.

    Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
    cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
    separate core_initcall(). In the future, we probably want to reorder
    so that workqueue init happens before cgroup_init().
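
    Concretely, the workqueue is created from its own initcall, roughly
    (sketch following Hugh's note above):

        static struct workqueue_struct *cgroup_destroy_wq;

        static int __init cgroup_wq_init(void)
        {
            /* @max_active == 1: destruction work items run one at a
             * time per CPU; they are serialized by cgroup_mutex anyway */
            cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
            BUG_ON(!cgroup_destroy_wq);
            return 0;
        }
        core_initcall(cgroup_wq_init);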

    Signed-off-by: Tejun Heo
    Reported-by: Hugh Dickins
    Reported-by: Shawn Bohrer
    Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
    Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream.

    An ordered workqueue implements execution ordering by using a single
    pool_workqueue with max_active == 1. On a given pool_workqueue, work
    items are processed in FIFO order, and limiting max_active to 1
    forces the queued work items to be processed one by one.

    Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
    unbound workqueues") accidentally broke this guarantee by applying
    NUMA affinity to ordered workqueues too. On NUMA setups, an ordered
    workqueue would end up with separate pool_workqueues for different
    nodes. Each pool_workqueue still limits max_active to 1 but multiple
    work items may be executed concurrently and out of order depending on
    which node they are queued to.

    Fix it by using dedicated ordered_wq_attrs[] when creating ordered
    workqueues. The new attrs match the unbound ones except that no_numa
    is always set thus forcing all NUMA nodes to share the default
    pool_workqueue.

    While at it, add a sanity check to the workqueue creation path which
    verifies that an ordered workqueue has only the default
    pool_workqueue.
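
    The dedicated attrs are the unbound ones with NUMA affinity forced
    off, roughly (sketch):

        /* at init time: like unbound_std_wq_attrs, but with no_numa set
         * so every node shares the single default pool_workqueue */
        for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
            struct workqueue_attrs *attrs;

            attrs = alloc_workqueue_attrs(GFP_KERNEL);
            attrs->nice = std_nice[i];
            attrs->no_numa = true;
            ordered_wq_attrs[i] = attrs;
        }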

    Signed-off-by: Tejun Heo
    Reported-by: Libin
    Cc: Lai Jiangshan
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 8a56d7761d2d041ae5e8215d20b4167d8aa93f51 upstream.

    Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload"
    fixed module loading and unloading with respect to function tracing, but
    it missed the function graph tracer. If you perform the following

    # cd /sys/kernel/debug/tracing
    # echo function_graph > current_tracer
    # modprobe nfsd
    # echo nop > current_tracer

    You'll get the following oops message:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
    Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
    CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
    0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
    ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] warn_slowpath_common+0x81/0x9b
    [] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
    [] warn_slowpath_null+0x1a/0x1c
    [] __ftrace_hash_rec_update.part.35+0x168/0x1b9
    [] ? __mutex_lock_slowpath+0x364/0x364
    [] ftrace_shutdown+0xd7/0x12b
    [] unregister_ftrace_graph+0x49/0x78
    [] graph_trace_reset+0xe/0x10
    [] tracing_set_tracer+0xa7/0x26a
    [] tracing_set_trace_write+0x8b/0xbd
    [] ? ftrace_return_to_handler+0xb2/0xde
    [] ? __sb_end_write+0x5e/0x5e
    [] vfs_write+0xab/0xf6
    [] ftrace_graph_caller+0x85/0x85
    [] SyS_write+0x59/0x82
    [] ftrace_graph_caller+0x85/0x85
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 940358030751eafb ]---

    The above-mentioned commit didn't go far enough. It covered the
    function tracer by adding checks in __register_ftrace_function(). The
    problem is that the function graph tracer circumvents those checks
    (for a slight efficiency gain when the function graph tracer is
    running together with a function tracer; the gain was not worth it).

    The problem came with ftrace_startup() which should always be called after
    __register_ftrace_function(), if you want this bug to be completely fixed.

    Anyway, this solution moves __register_ftrace_function() inside of
    ftrace_startup() and removes the need to call them both.
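
    After the change, the registration happens inside startup, so no
    caller can skip it; roughly (sketch):

        static int ftrace_startup(struct ftrace_ops *ops, int command)
        {
            int ret;

            if (unlikely(ftrace_disabled))
                return -ENODEV;

            /* registration now lives here, so the function graph
             * tracer gets the same module-safety checks */
            ret = __register_ftrace_function(ops);
            if (ret)
                return ret;

            /* ... existing startup work (hash/record updates) ... */
            return 0;
        }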

    Reported-by: Dave Wysochanski
    Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace")
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit d3aea84a4ace5ff9ce7fb7714cee07bebef681c2 upstream.

    ...to make it clear what the intent behind each record's operation was.

    In many cases you can infer this, based on the context of the syscall
    and the result. In other cases it's not so obvious. For instance, in
    the case where you have a file being renamed over another, you'll have
    two different records with the same filename but different inode info.
    By logging this information we can clearly tell which one was created
    and which was deleted.

    This fixes what was broken in commit bfcec708.
    Commit 79f6530c should also be backported to stable v3.7+.

    Signed-off-by: Jeff Layton
    Signed-off-by: Eric Paris
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 64fbff9ae0a0a843365d922e0057fc785f23f0e3 upstream.

    We leak 4 bytes of kernel stack in the response to an AUDIT_GET
    request because we fail to initialize the mask member of status_set.
    Fix that.
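
    The fix is the canonical cure for such leaks: zero the whole struct
    before filling it in (sketch):

        /* in audit_receive_msg(), AUDIT_GET: uninitialized members
         * (here: mask) must not leak stack bytes to userspace */
        memset(&status_set, 0, sizeof(status_set));
        status_set.enabled = audit_enabled;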

    Cc: Al Viro
    Cc: Eric Paris
    Signed-off-by: Mathias Krause
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Mathias Krause
     
  • commit 4d8fe7376a12bf4524783dd95cbc00f1fece6232 upstream.

    Using the nlmsg_len member of the netlink header to test if the
    message is valid is wrong, as it includes the size of the netlink
    header itself, thereby allowing short netlink messages to pass those
    checks.

    Use nlmsg_len() instead to test for the right message length. The result
    of nlmsg_len() is guaranteed to be non-negative as the netlink message
    already passed the checks of nlmsg_ok().

    Also switch to min_t() to please checkpatch.pl.
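
    For reference, the distinction is (a sketch; nlmsg_len() subtracts
    the header length that the raw field includes):

        /* nlh->nlmsg_len : header + payload (raw field)
         * nlmsg_len(nlh) : payload only                  */
        if (nlmsg_len(nlh) < sizeof(struct audit_status))
            return -EINVAL;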

    Cc: Al Viro
    Cc: Eric Paris
    Signed-off-by: Mathias Krause
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Mathias Krause
     
  • commit 0868a5e150bc4c47e7a003367cd755811eb41e0b upstream.

    When the audit=1 kernel parameter is absent and auditd is not running,
    AUDIT_USER_AVC messages are being silently discarded.

    AUDIT_USER_AVC messages should be sent to userspace using printk(), as
    mentioned in the commit message of 4a4cd633 ("AUDIT: Optimise the
    audit-disabled case for discarding user messages").

    When audit_enabled is 0, audit_receive_msg() discards all user messages
    except for AUDIT_USER_AVC messages. However, audit_log_common_recv_msg()
    refuses to allocate an audit_buffer if audit_enabled is 0. The fix is to
    special case AUDIT_USER_AVC messages in both functions.

    It looks like commit 50397bd1 ("[AUDIT] clean up audit_receive_msg()")
    introduced this bug.
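
    The special-casing is a one-line condition, roughly (sketch):

        /* in audit_log_common_recv_msg() (and correspondingly in
         * audit_receive_msg()): let AUDIT_USER_AVC through even with
         * auditing disabled, so it can fall back to printk() */
        if (!audit_enabled && msg_type != AUDIT_USER_AVC) {
            *ab = NULL;
            return rc;
        }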

    Signed-off-by: Tyler Hicks
    Cc: Al Viro
    Cc: Eric Paris
    Cc: linux-audit@redhat.com
    Acked-by: Kees Cook
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 6a0c7cd33075f6b7f1d80145bb19812beb3fc5c9 upstream.

    I have received a report about the BUG_ON() in free_basic_memory_bitmaps()
    triggering mysteriously during an aborted s2disk hibernation attempt.
    The only way I can explain that is that /dev/snapshot was first
    opened for writing (resume mode), then closed and then opened again
    for reading and closed again without freezing tasks. In that case
    the first invocation of snapshot_open() would set the free_bitmaps
    flag in snapshot_state, which is a static variable. That flag
    wouldn't be cleared later and the second invocation of snapshot_open()
    would just leave it like that, so the subsequent snapshot_release()
    would see data->frozen set and free_basic_memory_bitmaps() would be
    called unnecessarily.

    To prevent that from happening clear data->free_bitmaps in
    snapshot_open() when the file is being opened for reading (hibernate
    mode).

    In addition to that, replace the BUG_ON() in free_basic_memory_bitmaps()
    with a WARN_ON() as the kernel can continue just fine if the condition
    checked by that macro occurs.

    Fixes: aab172891542 (PM / hibernate: Fix user space driven resume regression)
    Reported-by: Oliver Lorenz
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • commit fd432b9f8c7c88428a4635b9f5a9c6e174df6e36 upstream.

    When the system has a lot of highmem (e.g. 16GiB using a 32-bit
    kernel), the code that calculates how much memory we need to
    preallocate in the normal zone may cause an overflow. As Leon has
    analysed:

    It looks like there is an overflow while computing the 'alloc' variable:
    alloc = (3943404 - 1970542) - 1978280 = -5418 (signed)
    And this function goes to err_out.

    Fix this by avoiding that overflow.

    References: https://bugzilla.kernel.org/show_bug.cgi?id=60817
    Reported-and-tested-by: Leon Drugi
    Signed-off-by: Aaron Lu
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Aaron Lu
     
  • commit 98d6f4dd84a134d942827584a3c5f67ffd8ec35f upstream.

    Fedora Ruby maintainer reported latest Ruby doesn't work on Fedora Rawhide
    on ARM. (http://bugs.ruby-lang.org/issues/9008)

    This is because commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no
    RTC device is present) introduced a return of ENOTSUPP when
    clock_get{time,res} can't find an RTC device. However, this is
    incorrect.

    First, ENOTSUPP isn't exported to userland (ENOTSUP or EOPNOTSUPP are
    the closest userland equivalents).

    Second, POSIX and the Linux man pages agree that clock_gettime and
    clock_getres should return EINVAL if the clk_id argument is invalid.
    While the argument could be made that the clockid is valid but just
    not supported on this hardware, that is just a technicality which
    doesn't help userspace applications and only complicates error
    handling.

    Thus, this patch changes the code to use EINVAL.
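
    The change itself is mechanical in the alarmtimer clock callbacks,
    e.g. (sketch):

        /* in alarm_clock_get()/alarm_clock_getres(): */
        if (!alarmtimer_get_rtcdev())
            return -EINVAL;  /* was -ENOTSUPP, which never reaches userland */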

    Cc: Thomas Gleixner
    Cc: Frederic Weisbecker
    Reported-by: Vit Ondruch
    Signed-off-by: KOSAKI Motohiro
    [jstultz: Tweaks to commit message to include full rational]
    Signed-off-by: John Stultz
    Signed-off-by: Greg Kroah-Hartman

    KOSAKI Motohiro
     
  • commit bbfe65c219c638e19f1da5adab1005b2d68ca810 upstream.

    In commit ee23871389 ("genirq: Set irq thread to RT priority on
    creation") we moved the assignment of the thread's priority from the
    thread's function into __setup_irq(). That function may run in user
    context, for instance if the user opens a UART node and the driver
    then requests the irq in its ->open() callback. That user may not have
    CAP_SYS_NICE, so setting the thread to SCHED_FIFO fails and the irq
    thread is left running with the SCHED_OTHER policy.

    This patch uses sched_setscheduler_nocheck() so we omit the
    CAP_SYS_NICE check which is otherwise required to set the SCHED_FIFO
    policy.
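
    The change in __setup_irq() is a straight substitution, roughly
    (sketch):

        /* set the handler thread to SCHED_FIFO without checking the
         * *caller's* CAP_SYS_NICE; the caller may be an unprivileged
         * user merely opening a device node */
        static const struct sched_param param = {
            .sched_priority = MAX_USER_RT_PRIO / 2,
        };
        sched_setscheduler_nocheck(t, SCHED_FIFO, &param);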

    [bigeasy: Rewrite the changelog]

    Signed-off-by: Thomas Pfaff
    Cc: Ivo Sieben
    Link: http://lkml.kernel.org/r/1381489240-29626-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Pfaff
     

30 Nov, 2013

3 commits

  • commit d049f74f2dbe71354d43d393ac3a188947811348 upstream.

    The get_dumpable() return value is not boolean. Most users of the
    function actually want to be testing for non-SUID_DUMP_USER(1) rather than
    SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a
    protected state. Almost all places did this correctly, excepting the two
    places fixed in this patch.

    Wrong logic:
    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
    or
    if (dumpable == 0) { /* be protective */ }
    or
    if (!dumpable) { /* be protective */ }

    Correct logic:
    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
    or
    if (dumpable != 1) { /* be protective */ }

    Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
    user was able to ptrace attach to processes that had dropped privileges to
    that user. (This may have been partially mitigated if Yama was enabled.)

    The macros have been moved into the file that declares get/set_dumpable(),
    which means things like the ia64 code can see them too.

    CVE-2013-2929

    Reported-by: Vasily Kulikov
    Signed-off-by: Kees Cook
    Cc: "Luck, Tony"
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit 12ae030d54ef250706da5642fc7697cc60ad0df7 upstream.

    The current default perf paranoid level is "1", which has
    "perf_paranoid_kernel()" return false, giving normal users access to
    any operations that use it. Unfortunately, this includes function
    tracing, and normal users should not be allowed to enable function
    tracing by default.

    The proper level is "-1" (full perf access), which only
    "perf_paranoid_tracepoint_raw()" grants access to. Use that check
    instead for enabling function tracing.
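
    The resulting permission check gates function trace events on the
    "-1" level, roughly (sketch):

        /* in perf_trace_event_perm(): the function tracepoint is only
         * available with perf_event_paranoid == -1 or CAP_SYS_ADMIN */
        if (ftrace_event_is_function(tp_event) &&
            perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
            return -EPERM;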

    Reported-by: Dave Jones
    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Frederic Weisbecker
    CVE: CVE-2013-2930
    Fixes: ced39002f5ea ("ftrace, perf: Add support to use function tracepoint in perf")
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     
  • commit ea8117478918a4734586d35ff530721b682425be upstream.

    Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
    regressed several workloads and caused excessive reschedule
    interrupts.

    The patch in question failed to notice that the x86 code had an
    inverted sense of the polling state versus the new generic code (x86:
    default polling, generic: default !polling).

    Fix the two prominent x86 mwait based idle drivers and introduce a few
    new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
    usage).

    Also switch the idle routines to using tif_need_resched() which is an
    immediate TIF_NEED_RESCHED test as opposed to need_resched which will
    end up being slightly different.

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: lenb@kernel.org
    Cc: tglx@linutronix.de
    Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

21 Nov, 2013

1 commit

  • commit 057db8488b53d5e4faa0cedb2f39d4ae75dfbdbb upstream.

    Andrey reported the following report:

    ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3
    ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3)
    Accessed by thread T13003:
    #0 ffffffff810dd2da (asan_report_error+0x32a/0x440)
    #1 ffffffff810dc6b0 (asan_check_region+0x30/0x40)
    #2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20)
    #3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260)
    #4 ffffffff812a1065 (__fput+0x155/0x360)
    #5 ffffffff812a12de (____fput+0x1e/0x30)
    #6 ffffffff8111708d (task_work_run+0x10d/0x140)
    #7 ffffffff810ea043 (do_exit+0x433/0x11f0)
    #8 ffffffff810eaee4 (do_group_exit+0x84/0x130)
    #9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30)
    #10 ffffffff81928782 (system_call_fastpath+0x16/0x1b)

    Allocated by thread T5167:
    #0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0)
    #1 ffffffff8128337c (__kmalloc+0xbc/0x500)
    #2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90)
    #3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0)
    #4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40)
    #5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430)
    #6 ffffffff8129b668 (finish_open+0x68/0xa0)
    #7 ffffffff812b66ac (do_last+0xb8c/0x1710)
    #8 ffffffff812b7350 (path_openat+0x120/0xb50)
    #9 ffffffff812b8884 (do_filp_open+0x54/0xb0)
    #10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0)
    #11 ffffffff8129d4b7 (SyS_open+0x37/0x50)
    #12 ffffffff81928782 (system_call_fastpath+0x16/0x1b)

    Shadow bytes around the buggy address:
    ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
    ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    =>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb
    ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
    ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap redzone: fa
    Heap kmalloc redzone: fb
    Freed heap region: fd
    Shadow gap: fe

    The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;'

    Although the crash happened in ftrace_regex_open(), the real bug
    occurred in trace_get_user(), where parser->idx is incremented
    without a check against the size. The way it is triggered is if
    userspace sends in 128 characters (EVENT_BUF_SIZE + 1): the loop that
    reads characters stores the last one and then breaks out because
    there are no more characters. Then the last character is read to
    determine what to do next, and the index is incremented without
    checking the size.

    Then the caller of trace_get_user() usually nulls out the last
    character with a zero, but since the index is equal to the size, it
    writes a NUL character after the allocated space, which can corrupt
    memory.
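
    The fix bounds the index before the final store, roughly (sketch):

        /* in trace_get_user(): only keep the pending character if
         * there is still room for it plus the terminating NUL */
        if (isspace(ch)) {
            parser->buffer[parser->idx] = 0;
            parser->cont = false;
        } else if (parser->idx < parser->size - 1) {
            parser->cont = true;
            parser->buffer[parser->idx++] = ch;
        } else {
            ret = -EINVAL;
            goto out;
        }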

    Luckily, only root user has write access to this file.

    Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home

    Reported-by: Andrey Konovalov
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     

29 Oct, 2013

1 commit

  • The PPC64 people noticed a missing memory barrier and crufty old
    comments in the perf ring buffer code. So update all the comments and
    add the missing barrier.

    When the architecture implements local_t using atomic_long_t there
    will be double barriers issued; but short of introducing more
    conditional barrier primitives this is the best we can do.

    Reported-by: Victor Kaplansky
    Tested-by: Victor Kaplansky
    Signed-off-by: Peter Zijlstra
    Cc: Mathieu Desnoyers
    Cc: michael@ellerman.id.au
    Cc: Paul McKenney
    Cc: Michael Neuling
    Cc: Frederic Weisbecker
    Cc: anton@samba.org
    Cc: benh@kernel.crashing.org
    Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Oct, 2013

3 commits


26 Oct, 2013

1 commit

  • Pull ACPI and power management fixes from
    "These fix two bugs in the intel_pstate driver, a hibernate bug leading
    to nasty resume failures sometimes and acpi-cpufreq initialization bug
    that causes problems to happen during module unload when intel_pstate
    is in use.

    Specifics:

    - Fix for rounding errors in intel_pstate causing CPU utilization to
    be underestimated from Brennan Shacklett.

    - intel_pstate fix to always use the correct max pstate value when
    computing the min pstate from Dirk Brandewie.

    - Hibernation fix for deadlocking resume in cases when the probing of
    the device containing the image is deferred from Russ Dill.

    - acpi-cpufreq fix to prevent the module from staying in memory when
    the driver cannot be registered and then attempting to unregister
    things that have never been registered on exit"

    * tag 'pm+acpi-3.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    acpi-cpufreq: Fail initialization if driver cannot be registered
    PM / hibernate: Move software_resume to late_initcall_sync
    intel_pstate: Correct calculation of min pstate value
    intel_pstate: Improve accuracy by not truncating until final result

    Linus Torvalds