28 Jan, 2015

4 commits

  • commit 7485058eea40783ac142a60c3e799fc66ce72583 upstream.

    Using just the filter for checking for trampolines or regs is not enough
    when updating the code against the records that represent all functions.
    Both the filter hash and the notrace hash need to be checked.
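
    As a rough userspace illustration of why both hashes matter, the sketch
    below models each hash as a plain array. The names ip_set and
    ops_traces_ip() are hypothetical and are not the kernel's data
    structures or API; the only assumption carried over is that an empty
    filter means "all functions" and that notrace excludes entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ftrace hash: a plain array of addresses. */
struct ip_set {
    const unsigned long *ips;
    size_t count;
};

static bool ip_set_contains(const struct ip_set *set, unsigned long ip)
{
    for (size_t i = 0; i < set->count; i++)
        if (set->ips[i] == ip)
            return true;
    return false;
}

/*
 * An address is traced when the filter set is empty (meaning "everything")
 * or contains it, AND the notrace set does not contain it. Looking only at
 * the filter set gives the wrong answer for addresses listed in notrace.
 */
static bool ops_traces_ip(const struct ip_set *filter,
                          const struct ip_set *notrace, unsigned long ip)
{
    bool selected;

    assert(filter != NULL && notrace != NULL);
    selected = filter->count == 0 || ip_set_contains(filter, ip);

    return selected && !ip_set_contains(notrace, ip);
}
```

    With an empty filter and one address in notrace, that address is
    reported as untraced while every other address is still traced.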

    To trigger this bug (using trace-cmd and perf):

    # perf probe -a do_fork
    # trace-cmd start -B foo -e probe
    # trace-cmd record -p function_graph -n do_fork sleep 1

    The trace-cmd record at the end clears the filter before it disables
    function_graph tracing and then that causes the accounting of the
    ftrace function records to become incorrect and causes ftrace to bug.

    Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org

    [ still need to switch old_hash_ops to old_ops_hash ]
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 8f86f83709c585742dea5dd7f0d2b79c43f992ec upstream.

    As the set_ftrace_filter affects both the function tracer as well as the
    function graph tracer, the ops that represent each have a shared
    ftrace_ops_hash structure. This allows both to be updated when the filter
    files are updated.

    But if function graph is enabled and the global_ops (function tracing) ops
    is not, then it is possible that the filter could be changed without the
    update happening for the function graph ops. This will cause the changes
    to not take place and may even cause a ftrace_bug to occur as it could mess
    with the trampoline accounting.

    The solution is to check if the ops uses the shared global_ops filter and
    if the ops itself is not enabled, to check if there's another ops that is
    enabled and also shares the global_ops filter. In that case, the
    modification still needs to be executed.

    Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream.

    Since the rework of the sparse interrupt code to actually free the
    unused interrupt descriptors there exists a race between the /proc
    interfaces to the irq subsystem and the code which frees the interrupt
    descriptor.

    CPU0                            CPU1
                                    show_interrupts()
                                      desc = irq_to_desc(X);
    free_desc(desc)
      remove_from_radix_tree();
      kfree(desc);
                                      raw_spinlock_irq(&desc->lock);

    /proc/interrupts is the only interface which can actively corrupt
    kernel memory via the lock access. /proc/stat can only read from freed
    memory. Extremely hard to trigger, but possible.

    The interfaces in /proc/irq/N/ are not affected by this because the
    removal of the proc file is serialized in procfs against concurrent
    readers/writers. The removal happens before the descriptor is freed.

    For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
    as the descriptor is never freed. It's merely cleared out with the irq
    descriptor lock held. So any concurrent proc access will either see
    the old correct value or the cleared out ones.

    Protect the lookup and access to the irq descriptor in
    show_interrupts() with the sparse_irq_lock.

    Provide kstat_irqs_usr() which is protecting the lookup and access
    with sparse_irq_lock and switch /proc/stat to use it.

    Document the existing kstat_irqs interfaces so it's clear that the
    caller needs to take care about protection. The users of these
    interfaces are either not affected due to SPARSE_IRQ=n or already
    protected against removal.

    Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit a5fd9733a30d18d7ac23f17080e7e07bb3205b69 upstream.

    commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
    use" was merged via the thermal tree without an explicit ack from the
    relevant maintainers.

    The exports are abused by the intel powerclamp driver which implements
    a fake idle state from a sched FIFO task. This causes all kinds of
    wreckage in the NOHZ core code which rightfully assumes that
    tick_nohz_idle_enter/exit() are only called from the idle task itself.

    Recent changes in the NOHZ core lead to a failure of the powerclamp
    driver and now people try to hack completely broken and backwards
    workarounds into the NOHZ core code. This is completely unacceptable
    and just papers over the real problem. There are way more subtle
    issues lurking around the corner.

    The real solution is to fix the powerclamp driver by rewriting it with
    a sane concept, but that's beyond the scope of this.

    So the only solution for now is to remove the calls into the core NOHZ
    code from the powerclamp trainwreck along with the exports.

    Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
    Signed-off-by: Thomas Gleixner
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Frederic Weisbecker
    Cc: Fengguang Wu
    Cc: Frederic Weisbecker
    Cc: Pan Jacob jun
    Cc: LKP
    Cc: Peter Zijlstra
    Cc: Zhang Rui
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

16 Jan, 2015

5 commits

  • commit 3245d6acab981a2388ffb877c7ecc97e763c59d4 upstream.

    wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE, and
    both checks can fail if we race with an EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
    change in between: gcc needs to reload p->exit_state after
    security_task_wait(). In this case ->notask_error will be wrongly
    cleared and do_wait() can hang forever if it was the last eligible
    child.

    Many thanks to Arne who carefully investigated the problem.

    Note: this bug is very old but it was purely theoretical until commit
    b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before
    this commit "-O2" was probably enough to guarantee that the compiler
    won't read ->exit_state twice.
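
    One common way to rule out this kind of double read is to snapshot the
    field once and run every check against the snapshot. The sketch below
    is a userspace illustration only: the state values, READ_ONCE_INT()
    and classify() are hypothetical names, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exit states, loosely modelled on the ones named above. */
enum { STATE_RUNNING = 0, STATE_ZOMBIE = 1, STATE_DEAD = 2, STATE_TRACE = 3 };

enum verdict { IGNORE, REAP, KEEP_WAITING };

/* Force a single load of a value another thread may change concurrently. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/*
 * Take one snapshot of the state and run every check against it, so the
 * DEAD/TRACE check and the ZOMBIE check can never see two different values.
 */
static enum verdict classify(const int *exit_state)
{
    int state;

    assert(exit_state != NULL);
    state = READ_ONCE_INT(*exit_state);

    if (state == STATE_DEAD || state == STATE_TRACE)
        return IGNORE;
    if (state == STATE_ZOMBIE)
        return REAP;
    return KEEP_WAITING;
}
```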

    Signed-off-by: Oleg Nesterov
    Reported-by: Arne Goedeke
    Tested-by: Arne Goedeke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 9fc81d87420d0d3fd62d5e5529972c0ad9eab9cc upstream.

    We allow the PMU driver to change the cpu on which the event
    should be installed. This happened in patch:

    e2d37cd213dc ("perf: Allow the PMU driver to choose the CPU on which to install events")

    This patch also forces all the group members to follow
    the currently opened event's cpu if the group happened
    to be moved.

    This and the change of event->cpu in perf_install_in_context()
    function introduced in:

    0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")

    forces group members to change their event->cpu,
    if the currently-opened-event's PMU changed the cpu
    and there is a group move.

    The above behaviour causes problems for breakpoint events,
    which use event->cpu to touch cpu specific data for
    breakpoint accounting. By changing event->cpu, some
    breakpoint slots were wrongly accounted for the given
    cpu.

    Vince's perf fuzzer hit this issue and caused the following
    WARN on my setup:

    WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
    Can't find any breakpoint slot
    [...]

    This patch changes the group moving code to keep the event's
    original cpu.

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Cc: Yan, Zheng
    Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     
  • commit fd7de1e8d5b2b2b35e71332fafb899f584597150 upstream.

    Locklessly doing is_idle_task(rq->curr) is only okay because of
    RCU protection. The older variant of the broken code checked
    rq->curr == rq->idle instead and therefore didn't need RCU.

    Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
    Signed-off-by: Andy Lutomirski
    Reviewed-by: Chuansheng Liu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 269ad8015a6b2bb1cf9e684da4921eb6fa0a0c88 upstream.

    The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     
  • commit 6a503c3be937d275113b702e0421e5b0720abe8a upstream.

    According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration happens doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80+40/80+120/170 = 1.5809 < 2
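
    The utilization figure quoted above can be reproduced with a few lines
    of C. This is only a sketch of the arithmetic; struct dl_task and
    total_utilization() are hypothetical names, and each pair is read as
    (runtime, period) in the same time unit.

```c
#include <assert.h>
#include <stddef.h>

/* One task, described as (runtime, period) in the same time unit. */
struct dl_task {
    double runtime;
    double period;
};

/* Sum runtime/period over the whole task set. */
static double total_utilization(const struct dl_task *tasks, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].period > 0.0);
        sum += tasks[i].runtime / tasks[i].period;
    }
    return sum;
}
```

    For the taskset {(30, 80), (40, 80), (120, 170)} the sum stays below
    2, the number of cores in the example.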

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     

09 Jan, 2015

13 commits

  • commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream.

    alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.
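
    The shape of the fix can be sketched in userspace as a reference taken
    up front and dropped again on the failure path. struct ns, ns_get(),
    ns_put() and alloc_in_ns() are hypothetical names used for
    illustration; they are not the kernel's pid namespace API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted namespace; not the kernel's pid namespace. */
struct ns {
    int refs;
    bool allocation_disabled;
};

static void ns_get(struct ns *ns)
{
    ns->refs++;
}

static void ns_put(struct ns *ns)
{
    assert(ns->refs > 0);
    ns->refs--;
}

/*
 * Take the reference first, as described above, and make sure the failure
 * path drops it again. Returns 0 on success, -1 on failure.
 */
static int alloc_in_ns(struct ns *ns)
{
    ns_get(ns);

    if (ns->allocation_disabled)
        goto out_put;

    return 0;       /* success: the new object keeps the reference */

out_put:
    ns_put(ns);     /* without this, each failed allocation leaks a ref */
    return -1;
}
```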

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream.

    A regression was caused by commit 780a7654cee8:
    audit: Make testing for a valid loginuid explicit.
    (which in turn attempted to fix a regression caused by e1760bd)

    When audit_krule_to_data() fills in the rules to get a listing, there was a
    missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.

    This broke userspace by not returning the same information that was sent and
    expected.

    The rule:
    auditctl -a exit,never -F auid=-1
    gives:
    auditctl -l
    LIST_RULES: exit,never f24=0 syscall=all
    when it should give:
    LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all

    Tag it so that it is reported the same way it was set. Create a new
    private flags audit_krule field (pflags) to store it that won't interact with
    the public one from the API.
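
    A simplified model of the round trip may help. In the sketch below the
    field numbers are assumed to mirror the audit UAPI values (24 matches
    the "f24" in the listing above), while struct rule, the PFLAG_ name
    and both helpers are hypothetical and handle a single field only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values assumed to mirror the audit UAPI; treat them as illustrative. */
#define FIELD_LOGINUID        9
#define FIELD_LOGINUID_SET    24
#define UID_UNSET             ((uint32_t)-1)

/* Hypothetical private flag: the rule arrived in the legacy auid=-1 form. */
#define PFLAG_LOGINUID_LEGACY 0x1

struct rule {
    uint32_t field;
    uint32_t val;
    uint32_t pflags;    /* private, never exposed through the API */
};

/* On the way in: rewrite "auid == -1" as "loginuid_set == 0" and tag it. */
static void rule_from_user(struct rule *r, uint32_t field, uint32_t val)
{
    assert(r != NULL);
    r->field = field;
    r->val = val;
    r->pflags = 0;

    if (field == FIELD_LOGINUID && val == UID_UNSET) {
        r->field = FIELD_LOGINUID_SET;
        r->val = 0;
        r->pflags |= PFLAG_LOGINUID_LEGACY;
    }
}

/* On the way out: a tagged rule is reported the same way it was set. */
static void rule_to_user(const struct rule *r, uint32_t *field, uint32_t *val)
{
    assert(r != NULL);
    *field = r->field;
    *val = r->val;

    if ((r->pflags & PFLAG_LOGINUID_LEGACY) && r->val == 0) {
        *field = FIELD_LOGINUID;
        *val = UID_UNSET;
    }
}
```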

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     
  • commit 3640dcfa4fd00cd91d88bb86250bdb496f7070c0 upstream.

    Commit f1dc4867 ("audit: anchor all pid references in the initial pid
    namespace") introduced a find_vpid() call when adding/removing audit
    rules with PID/PPID filters; unfortunately this is problematic as
    find_vpid() only works if there is a task with the associated PID
    alive on the system. The following commands demonstrate a simple
    reproducer.

    # auditctl -D
    # auditctl -l
    # autrace /bin/true
    # auditctl -l

    This patch resolves the problem by simply using the PID provided by
    the user without any additional validation, e.g. no calls to check to
    see if the task/PID exists.

    Cc: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Acked-by: Eric Paris
    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • commit 54dc77d974a50147d6639dac6f59cb2c29207161 upstream.

    Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
    audit_log_end(), which can come from any context (aka even a sleeping context)
    GFP_KERNEL can't be used. Since the audit_buffer knows what context it should
    use, pass that down and use that.

    See: https://lkml.org/lkml/2014/12/16/542

    BUG: sleeping function called from invalid context at mm/slab.c:2849
    in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
    2 locks held by sulogin/885:
    #0: (&sig->cred_guard_mutex){+.+.+.}, at: [] prepare_bprm_creds+0x28/0x8b
    #1: (tty_files_lock){+.+.+.}, at: [] selinux_bprm_committing_creds+0x55/0x22b
    CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
    Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
    ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
    ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
    0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
    Call Trace:
    [] dump_stack+0x50/0xa8
    [] ___might_sleep+0x1b6/0x1be
    [] __might_sleep+0x119/0x128
    [] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
    [] kmem_cache_alloc+0x43/0x1c9
    [] __alloc_skb+0x42/0x1a3
    [] skb_copy+0x3e/0xa3
    [] audit_log_end+0x83/0x100
    [] ? avc_audit_pre_callback+0x103/0x103
    [] common_lsm_audit+0x441/0x450
    [] slow_avc_audit+0x63/0x67
    [] avc_has_perm+0xca/0xe3
    [] inode_has_perm+0x5a/0x65
    [] selinux_bprm_committing_creds+0x98/0x22b
    [] security_bprm_committing_creds+0xe/0x10
    [] install_exec_creds+0xe/0x79
    [] load_elf_binary+0xe36/0x10d7
    [] search_binary_handler+0x81/0x18c
    [] do_execveat_common.isra.31+0x4e3/0x7b7
    [] do_execve+0x1f/0x21
    [] SyS_execve+0x25/0x29
    [] stub_execve+0x69/0xa0

    Reported-by: Valdis Kletnieks
    Signed-off-by: Richard Guy Briggs
    Tested-by: Valdis Kletnieks
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     
  • commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream.

    Now that setgroups can be disabled and not reenabled, setting gid_map
    without privilege can now be enabled when setgroups is disabled.

    This restores most of the functionality that was lost when unprivileged
    setting of gid_map was removed. Applications that use this functionality
    will need to check to see if they use setgroups or init_groups, and if they
    don't they can be fixed by simply disabling setgroups before writing to
    gid_map.
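
    A minimal sketch of that ordering is shown below. It assumes the knob
    is the per-process setgroups proc file (for example
    /proc/self/setgroups) and that a single gid is mapped. The helper
    names are hypothetical, and the directory is a parameter only so the
    sketch can be tried against ordinary files.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a whole string to an existing file; 0 on success, -1 on error. */
static int write_str(const char *path, const char *s)
{
    size_t len = strlen(s);
    int fd = open(path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, s, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/*
 * Order matters: "deny" is written to setgroups before gid_map is written.
 * procdir would typically be "/proc/self" in a process that has just
 * entered a new user namespace.
 */
static int deny_setgroups_and_map_gid(const char *procdir,
                                      unsigned int inside_gid,
                                      unsigned int outside_gid)
{
    char path[256], line[64];

    assert(procdir != NULL);

    snprintf(path, sizeof(path), "%s/setgroups", procdir);
    if (write_str(path, "deny") != 0)
        return -1;

    /* One extent: a single gid inside maps to a single gid outside. */
    snprintf(line, sizeof(line), "%u %u 1\n", inside_gid, outside_gid);
    snprintf(path, sizeof(path), "%s/gid_map", procdir);
    return write_str(path, line);
}
```

    A caller would check the return value and fall back to a privileged
    helper if either write is refused.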

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

    - Expose the knob to user space through a proc file /proc//setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream.

    Generalize id_map_mutex so it can be used for more state of a user namespace.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream.

    If you did not create the user namespace and are allowed
    to write to uid_map or gid_map you should already have the necessary
    privilege in the parent user namespace to establish any mapping
    you want so this will not affect userspace in practice.

    Limiting unprivileged uid mapping establishment to the creator of the
    user namespace makes it easier to verify all credentials obtained with
    the uid mapping can be obtained without the uid mapping without
    privilege.

    Limiting unprivileged gid mapping establishment (which is temporarily
    absent) to the creator of the user namespace also ensures that the
    combination of uid and gid can already be obtained without privilege.

    This is part of the fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream.

    setresuid allows the euid to be set to any of uid, euid, suid, and
    fsuid. Therefore it is safe to allow an unprivileged user to map
    their euid and use CAP_SETUID privileged with exactly that uid,
    as no new credentials can be obtained.

    I can not find a combination of existing system calls that allows setting
    uid, euid, suid, and fsuid from the fsuid making the previous use
    of fsuid for allowing unprivileged mappings a bug.
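
    The rule can be pictured with a small predicate. This is a sketch
    under the assumption that an unprivileged mapping is limited to a
    single id; struct extent and unpriv_uid_map_ok() are hypothetical
    names, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One mapping extent: count ids starting at first inside the namespace
 * map to ids starting at lower_first in the parent namespace.
 */
struct extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/*
 * Hypothetical check: without privilege, accept only a single one-id
 * mapping whose parent-side id is the caller's effective uid (not fsuid).
 */
static bool unpriv_uid_map_ok(const struct extent *ext,
                              unsigned int nr_extents, uint32_t caller_euid)
{
    assert(ext != NULL);
    return nr_extents == 1 &&
           ext[0].count == 1 &&
           ext[0].lower_first == caller_euid;
}
```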

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream.

    As any gid mapping will allow, and for backwards compatibility must
    allow, dropping groups, don't allow any gid mappings to be
    established without CAP_SETGID in the parent user namespace.

    For a small class of applications this change breaks userspace
    and removes useful functionality. This small class of applications
    includes tools/testing/selftests/mount/unprivileged-remount-test.c

    Most of the removed functionality will be added back with the addition
    of a one way knob to disable setgroups. Once setgroups is disabled
    setting the gid_map becomes as safe as setting the uid_map.

    For more common applications that set the uid_map and the gid_map
    with privilege this change will have no effect.

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.

    setgroups is unique in not needing a valid mapping before it can be called,
    in the case of setgroups(0, NULL) which drops all supplemental groups.

    The design of the user namespace assumes that CAP_SETGID can not actually
    be used until a gid mapping is established. Therefore add a helper function
    to see if the user namespace gid mapping has been established and call
    that function in the setgroups permission check.
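
    In outline, the permission check then has two conditions instead of
    one. The sketch below is illustrative only; struct userns and
    may_setgroups() are hypothetical names and the capability test is
    reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down view of a user namespace. */
struct userns {
    unsigned int gid_map_extents;   /* 0 until a gid mapping is written */
};

/*
 * setgroups() is permitted only when the caller holds CAP_SETGID in the
 * namespace AND that namespace already has a gid mapping established.
 */
static bool may_setgroups(const struct userns *ns, bool has_cap_setgid)
{
    assert(ns != NULL);
    return has_cap_setgid && ns->gid_map_extents != 0;
}
```

    A freshly created namespace with no gid mapping therefore refuses
    setgroups(0, NULL) even though the caller holds CAP_SETGID there.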

    This is part of the fix for CVE-2014-8989, being able to drop groups
    without privilege using user namespaces.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream.

    The rule is simple. Don't allow anything that wouldn't be allowed
    without unprivileged mappings.

    It was previously overlooked that establishing gid mappings would
    allow dropping groups and potentially gaining permission to files and
    directories that had lesser permissions for a specific group than for
    all other users.

    This is the rule needed to fix CVE-2014-8989 and prevent any other
    security issues with new_idmap_permitted.

    The reason for this rule is that the unix permission model is old and
    there are programs out there somewhere that take advantage of every
    little corner of it. So allowing a uid or gid mapping to be
    established without privilege that would allow anything that would not
    be allowed without that mapping will result in expectations from some
    code somewhere being violated. Violated expectations about the
    behavior of the OS is a long way to say a security issue.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.

    Today there are 3 instances of setgroups and due to an oversight their
    permission checking has diverged. Add a common function so that
    they may all share the same permission checking code.

    This corrects the current oversight in the current permission checks
    and adds a helper to avoid this in the future.

    A user namespace security fix will update this new helper, shortly.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Dec, 2014

1 commit


04 Dec, 2014

1 commit

  • It appears that some SCHEDULE_USER (asm for schedule_user) callers
    in arch/x86/kernel/entry_64.S are called from RCU kernel context,
    and schedule_user will return in RCU user context. This causes RCU
    warnings and possible failures.

    This is intended to be a minimal fix suitable for 3.18.

    Reported-and-tested-by: Dave Jones
    Cc: Oleg Nesterov
    Cc: Frédéric Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Nov, 2014

2 commits

  • x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
    not on non-paranoid returns. I suspect that this is a mistake and that
    the code only works because int3 is paranoid.

    Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
    for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
    from the uprobes code.

    Reported-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Borislav Petkov
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime()

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
            task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                update_rq_clock(rq);
                p->sched_class->update_curr(rq);
              }

    If the stats are read for a stomp machine task, aka 'migration/N' and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris's observation that this happens faster when he runs the fork bomb
    makes sense as the fork bomb will kick migration threads more often so
    the probability to hit the issue will increase.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc we have
    other means to hit the idle case.
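
    The pattern is the usual one for callback tables: a class with nothing
    to account still supplies a callback that does nothing, so generic
    code can call through the pointer unconditionally. The sketch below
    uses hypothetical names (sched_class_sketch, refresh_runtime()) and is
    not the scheduler code itself.

```c
#include <assert.h>
#include <stddef.h>

struct rq {
    unsigned long updates;
};

/* Hypothetical, minimal scheduling class: one mandatory callback. */
struct sched_class_sketch {
    void (*update_curr)(struct rq *rq);
};

static void update_curr_fair(struct rq *rq)
{
    rq->updates++;
}

/* Classes with nothing to account still provide a do-nothing callback. */
static void update_curr_noop(struct rq *rq)
{
    (void)rq;
}

static const struct sched_class_sketch fair_class = {
    .update_curr = update_curr_fair,
};

static const struct sched_class_sketch stop_class = {
    .update_curr = update_curr_noop,
};

/* Generic code calls through the pointer without a per-class NULL check. */
static void refresh_runtime(const struct sched_class_sketch *cls,
                            struct rq *rq)
{
    assert(cls->update_curr != NULL);
    cls->update_curr(rq);
}
```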

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

22 Nov, 2014

2 commits


16 Nov, 2014

4 commits

  • Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in Y being smaller than X.

    The reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because, on start in thread_group_cputimer(), we
    initialize the cputimer's sum_exec_runtime with thread runtime that is
    not yet accounted, and then add that runtime to the running cputimer again
    on the scheduler tick, making its sum_exec_runtime bigger than the actual
    threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which ensures that the stats of
    running threads are up to date, so that on the next scheduler tick we
    account only the runtime that elapsed since the cputimer started. That
    also ensures a consistent state between the cpu times of individual
    threads and the cpu time of the process consisting of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
        ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1) ;

        return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
        clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

        return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
        int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
        int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

        return after_i - before_i;
    }

    int main(void)
    {
        int result = 0;
        pthread_t th;

        pthread_barrier_init(&barrier, NULL, 2);

        if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }

        pthread_barrier_wait(&barrier);

        /* The test. */
        struct timespec before, after, sleeptimeabs;
        int64_t sleepdiff, diffabs;
        const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

        /* The relative nanosleep. Not sure why this is needed, but its presence
           seems to make it easier to reproduce the problem. */
        if (do_nanosleep(0, &sleeptime) != 0) {
            perror("clock_nanosleep");
            return 1;
        }

        /* Get the current time. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
            perror("clock_gettime[2]");
            return 1;
        }

        /* Compute the absolute sleep time based on the current time. */
        uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
        sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
        sleeptimeabs.tv_nsec = nsec % 1000000000;

        /* Sleep for the computed time. */
        if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
            perror("absolute clock_nanosleep");
            return 1;
        }

        /* Get the time after the sleep. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
            perror("clock_gettime[3]");
            return 1;
        }

        /* The time after sleep should always be equal to or after the absolute sleep
           time passed to clock_nanosleep. */
        sleepdiff = tsdiff(&sleeptimeabs, &after);
        if (sleepdiff < 0) {
            printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
            result = 1;

            printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
            printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
            printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
        }

        /* The difference between the timestamps taken before and after the
           clock_nanosleep call should be equal to or more than the duration of the
           sleep. */
        diffabs = tsdiff(&before, &after);
        if (diffabs < sleeptime.tv_nsec) {
            printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
            result = 1;
        }

        pthread_cancel(th);

        return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.
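
    The effect is easy to see with made-up numbers. In the sketch below
    struct task_sketch and the three helpers are hypothetical; they only
    model "runtime already accounted" plus "pending delta" and are not the
    cpu-timer code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical numbers: runtime already accounted, plus a pending delta. */
struct task_sketch {
    uint64_t accounted_ns;  /* runtime folded into the stats so far */
    uint64_t pending_ns;    /* runtime since the last accounting point */
};

/* Up-to-date runtime: accounted time plus the pending delta, once. */
static uint64_t task_runtime(const struct task_sketch *t)
{
    assert(t != NULL);
    return t->accounted_ns + t->pending_ns;
}

/* The pattern described above: the delta is added a second time. */
static uint64_t sample_double_counted(const struct task_sketch *t)
{
    return task_runtime(t) + t->pending_ns;
}

/* The consistent sample: the pending delta is counted exactly once. */
static uint64_t sample_correct(const struct task_sketch *t)
{
    return task_runtime(t);
}
```

    With 1000 ns accounted and 50 ns pending, the double-counted sample
    reads 1100 ns instead of 1050 ns, so it runs ahead by one delta.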

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive) we can end up migrating and selecting
    oneself as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When a CPU is hotplugged out, we call perf_remove_from_context() (via
    perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
    context, but leave siblings grouped together. Freeing of these events is
    left to the mercy of the usual refcounting.

    When a CPU-bound event's refcount drops to zero we cross-call to
    __perf_remove_from_context() to clean it up, detaching grouped siblings.

    This works when the relevant CPU is online, but will fail if the CPU is
    currently offline, and we won't detach the event from its siblings
    before freeing the event, leaving the sibling list corrupt. If the
    sibling list is later walked (e.g. because the CPU came online again
    before a remaining sibling's refcount drops to zero), we will walk the
    now corrupted siblings list, potentially dereferencing garbage values.

    Given that the events should never be scheduled again (as we removed
    them from their context), we can simply detach siblings when the CPU
    goes down in the first place. If the CPU comes back online, the
    redundant call to __perf_remove_from_context() is safe.

    Reported-by: Drew Richardson
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent.weaver@maine.edu
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

15 Nov, 2014

1 commit

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are three regression fixes, two recent (generic power domains,
    suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
    one more machine having problems with Windows 8 compatibility, a minor
    cpufreq driver fix (cpufreq-dt) and a fixup for new callback
    definitions (generic power domains).

    Specifics:

    - Fix a crash in the suspend-to-idle code path introduced by a recent
    commit that forgot to check a pointer against NULL before
    dereferencing it (Dmitry Eremin-Solenikov).

    - Fix a boot crash on Exynos5 introduced by a recent commit making
    that platform use generic Device Tree bindings for power domains
    which exposed a weakness in the generic power domains framework
    leading to that crash (Ulf Hansson).

    - Fix a crash during system resume on systems where cpufreq depends
    on Operating Performance Points (OPP) for functionality, but
    CONFIG_OPP is not set. This leads the cpufreq driver registration
    to fail, but the resume code attempts to restore the pre-suspend
    cpufreq configuration (which does not exist) nevertheless and
    crashes. From Geert Uytterhoeven.

    - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
    problems if it is reported as Windows 8 compatible to the BIOS
    (Adam Lee).

    - Fix swapped arguments in an error message in the cpufreq-dt driver
    (Abhilash Kesavan).

    - Fix up the prototypes of new callbacks in struct generic_pm_domain
    to make them more useful. Users of those callbacks will be added
    in 3.19 and it's better for them to be based on the correct struct
    definition in mainline from the start. From Ulf Hansson and Kevin
    Hilman"

    * tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / Domains: Fix initial default state of the need_restore flag
    PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
    PM / Domains: Change prototype for the attach and detach callbacks
    cpufreq: Avoid crash in resume on SMP without OPP
    cpufreq: cpufreq-dt: Fix arguments in clock failure error message
    ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546

    Linus Torvalds
     

14 Nov, 2014

2 commits

  • Commit 69361eef9056 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
    but failed to update the comments for print_tainted(). So, update the
    comments.

    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Pull audit fixes from Paul Moore:
    "After he sent the initial audit pull request for 3.18, Eric asked me
    to take over the management of the audit tree, hence this pull request
    to fix a couple of problems with audit.

    As you can see below, the changes are minimal: adding some whitespace
    to a string so userspace parses it correctly, and fixing a problem
    with audit's usage of fsnotify that was causing audit watch rules to
    be lost. Neither of these patches were very controversial on the
    mailing lists and they fix real problems, getting them into 3.18 would
    be a good thing"

    * 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
    audit: keep inode pinned
    audit: AUDIT_FEATURE_CHANGE message format missing delimiting space

    Linus Torvalds
     

12 Nov, 2014

1 commit

  • Audit rules disappear when an inode they watch is evicted from the cache.
    This is likely not what we want.

    The guilty commit is "fsnotify: allow marks to not pin inodes in core",
    which didn't take into account that audit_tree adds watches with a zero
    mask.

    Adding any mask should fix this.

    Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org # 2.6.36+
    Signed-off-by: Paul Moore

    Miklos Szeredi
     

11 Nov, 2014

2 commits

  • If the read loop in trace_buffers_splice_read() keeps failing due to
    memory allocation failures without reading even a single page then this
    function will keep busy looping.

    Remove the risk for that by exiting the function if memory allocation
    failures are seen.
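
    The general pattern of leaving a fill loop on allocation failure,
    rather than retrying indefinitely, might look like the sketch below.
    This is not the kernel code: the toy_* names, the callback-based
    allocator, and the choice of -ENOMEM as the return value are all
    assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical allocator callback: returns NULL on failure. */
typedef void *(*toy_alloc_fn)(void);

/* Toy allocators used only to exercise the loop below. */
static char toy_page;
static void *toy_alloc_ok(void)   { return &toy_page; }
static void *toy_alloc_fail(void) { return NULL; }

/*
 * Fill up to max_pages slots. On allocation failure the loop stops
 * instead of spinning: it returns the pages gathered so far, or
 * -ENOMEM if not even one page could be allocated.
 */
static int toy_fill_pages(toy_alloc_fn alloc, void **pages, int max_pages)
{
    int n = 0;

    while (n < max_pages) {
        void *p = alloc();

        if (!p)
            return n ? n : -ENOMEM;
        pages[n++] = p;
    }
    return n;
}
```

    The key point is only that a failed allocation ends the loop with a
    definite result, so the caller cannot busy loop on it.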

    Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in

    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     
  • On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
    lockup:

    # trace-cmd record -e raw_syscalls:* -F false
    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
    ...
    Call Trace:
    [] ? __wake_up_common+0x90/0x90
    [] wait_on_pipe+0x35/0x40
    [] tracing_buffers_splice_read+0x2e3/0x3c0
    [] ? tracing_stats_read+0x2a0/0x2a0
    [] ? _raw_spin_unlock+0x2b/0x40
    [] ? do_read_fault+0x21b/0x290
    [] ? handle_mm_fault+0x2ba/0xbd0
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] ? trace_buffer_lock_reserve+0x22/0x60
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] do_splice_to+0x6d/0x90
    [] SyS_splice+0x7c1/0x800
    [] tracesys_phase2+0xd3/0xd8

    The problem is this: tracing_buffers_splice_read() calls
    ring_buffer_wait() to wait for data in the ring buffers. The buffers
    are not empty so ring_buffer_wait() returns immediately. But
    tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
    meaning it only wants to read a full page. When the full page is not
    available, tracing_buffers_splice_read() tries to wait again with
    ring_buffer_wait(), which again returns immediately, and so on.

    Fix this by adding a "full" argument to ring_buffer_wait() which will
    make ring_buffer_wait() wait until the writer has left the reader's
    page, i.e. until full-page reads will succeed.
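
    The mismatch between the wait condition and the read condition can be
    modelled in a few lines of user-space C. This is only a sketch: the
    toy_* names and the two-flag buffer state are hypothetical
    simplifications, not the ring buffer's real data structures.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified view of a per-CPU buffer. */
struct toy_buffer {
    bool empty;                 /* no data at all in the buffer */
    bool writer_on_reader_page; /* writer has not left the reader's page */
};

/* Would a wait return immediately? With full == false, any data is
 * enough; with full == true, the writer must have left the reader's
 * page as well. */
static bool toy_wait_ready(const struct toy_buffer *b, bool full)
{
    if (b->empty)
        return false;
    if (full && b->writer_on_reader_page)
        return false;
    return true;
}

/* A full-page read only succeeds once the writer has moved on. */
static bool toy_read_full_page_ok(const struct toy_buffer *b)
{
    return !b->empty && !b->writer_on_reader_page;
}
```

    With data present but the writer still on the reader's page,
    toy_wait_ready(b, false) reports ready while toy_read_full_page_ok(b)
    fails, which is the spin described above. Passing full == true makes
    the two conditions agree.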

    Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in

    Cc: stable@vger.kernel.org # 3.16+
    Fixes: b1169cc69ba9 ("tracing: Remove mock up poll wait function")
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access in this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
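
    The shape of such an early exit can be sketched in plain C as follows.
    The function name, its parameters, and the -1 return value are
    hypothetical; the sketch only shows guarding the 'level - 1' index
    against a zero level and is not the scheduler code itself.

```c
#include <stddef.h>

/*
 * Hypothetical helper: store the last of 'level' distance entries in
 * *out. With level == 0 the index 'level - 1' would wrap around and
 * read out of bounds, so bail out before touching the array.
 */
static int toy_max_distance(const int *distances, size_t level, int *out)
{
    if (level == 0)
        return -1; /* nothing to read: exit early */

    *out = distances[level - 1];
    return 0;
}
```

    Callers in this sketch check the return value and leave *out untouched
    when no levels exist.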

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

09 Nov, 2014

1 commit