09 Jan, 2015

13 commits

  • commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream.

    alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
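
The reference-count leak described in this commit can be illustrated with a small user-space model. This is only a sketch: `toy_ns`, `toy_get_ns()`, `toy_put_ns()` and `toy_alloc_pid()` are hypothetical stand-ins, not the kernel's actual structures or functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pid namespace: a reference count plus the flag that
 * is set once pid allocation has been disabled by the exiting reaper. */
struct toy_ns {
    int refcount;
    bool alloc_disabled;
};

static void toy_get_ns(struct toy_ns *ns)
{
    ns->refcount++;
}

static void toy_put_ns(struct toy_ns *ns)
{
    assert(ns->refcount > 0);
    ns->refcount--;
}

/* Returns 0 on success (the new pid keeps the reference), -1 on failure. */
static int toy_alloc_pid(struct toy_ns *ns)
{
    toy_get_ns(ns);             /* reference is taken up front */

    if (ns->alloc_disabled) {
        toy_put_ns(ns);         /* the fix: drop it again on this failure path */
        return -1;
    }
    return 0;
}
```

Without the `toy_put_ns()` call on the failure path, every failed allocation would leave the count one higher than it should be, so the namespace could never be freed.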
     
  • commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream.

    A regression was caused by commit 780a7654cee8:
    audit: Make testing for a valid loginuid explicit.
    (which in turn attempted to fix a regression caused by e1760bd)

    When audit_krule_to_data() fills in the rules to get a listing, there was a
    missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.

    This broke userspace by not returning the same information that was sent and
    expected.

    The rule:
    auditctl -a exit,never -F auid=-1
    gives:
    auditctl -l
    LIST_RULES: exit,never f24=0 syscall=all
    when it should give:
    LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all

    Tag it so that it is reported the same way it was set. Create a new
    private flags audit_krule field (pflags) to store it that won't interact with
    the public one from the API.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
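
The round trip this commit restores can be sketched in user space as follows. The field number 24 is taken from the "f24" in the broken listing above; the value 9 for the legacy field, the flag name, and all `toy_` identifiers are assumptions made for this illustration rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Field numbers and flag used by this sketch (assumed, see lead-in). */
enum { TOY_AUDIT_LOGINUID = 9, TOY_AUDIT_LOGINUID_SET = 24 };
#define TOY_PFLAG_LOGINUID_LEGACY 0x1u
#define TOY_UID_UNSET 0xffffffffu        /* what "auid=-1" looks like */

struct toy_field { uint32_t type; uint32_t val; };
struct toy_krule { struct toy_field f; uint32_t pflags; };

/* Userspace -> kernel: "auid=-1" is stored as "loginuid_set == 0". */
static struct toy_krule toy_data_to_krule(struct toy_field in)
{
    struct toy_krule k = { .f = in, .pflags = 0 };

    if (in.type == TOY_AUDIT_LOGINUID && in.val == TOY_UID_UNSET) {
        k.f.type = TOY_AUDIT_LOGINUID_SET;
        k.f.val = 0;
        k.pflags |= TOY_PFLAG_LOGINUID_LEGACY;  /* remember how it was set */
    }
    return k;
}

/* Kernel -> userspace listing: undo the conversion for tagged rules only. */
static struct toy_field toy_krule_to_data(const struct toy_krule *k)
{
    struct toy_field out = k->f;

    if (out.type == TOY_AUDIT_LOGINUID_SET &&
        (k->pflags & TOY_PFLAG_LOGINUID_LEGACY)) {
        out.type = TOY_AUDIT_LOGINUID;
        out.val = TOY_UID_UNSET;
    }
    return out;
}
```

Keeping the tag in a private field means a rule that was genuinely written with the newer field type is listed unchanged, while a legacy rule is reported the way it was set.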
     
  • commit 3640dcfa4fd00cd91d88bb86250bdb496f7070c0 upstream.

    Commit f1dc4867 ("audit: anchor all pid references in the initial pid
    namespace") introduced a find_vpid() call when adding/removing audit
    rules with PID/PPID filters; unfortunately this is problematic as
    find_vpid() only works if there is a task with the associated PID
    alive on the system. The following commands demonstrate a simple
    reproducer.

    # auditctl -D
    # auditctl -l
    # autrace /bin/true
    # auditctl -l

    This patch resolves the problem by simply using the PID provided by
    the user without any additional validation, e.g. no calls to check to
    see if the task/PID exists.

    Cc: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Acked-by: Eric Paris
    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • commit 54dc77d974a50147d6639dac6f59cb2c29207161 upstream.

    Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
    audit_log_end(), which can come from any context (aka even a sleeping context)
    GFP_KERNEL can't be used. Since the audit_buffer knows what context it should
    use, pass that down and use that.

    See: https://lkml.org/lkml/2014/12/16/542

    BUG: sleeping function called from invalid context at mm/slab.c:2849
    in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
    2 locks held by sulogin/885:
    #0: (&sig->cred_guard_mutex){+.+.+.}, at: [] prepare_bprm_creds+0x28/0x8b
    #1: (tty_files_lock){+.+.+.}, at: [] selinux_bprm_committing_creds+0x55/0x22b
    CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
    Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
    ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
    ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
    0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
    Call Trace:
    [] dump_stack+0x50/0xa8
    [] ___might_sleep+0x1b6/0x1be
    [] __might_sleep+0x119/0x128
    [] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
    [] kmem_cache_alloc+0x43/0x1c9
    [] __alloc_skb+0x42/0x1a3
    [] skb_copy+0x3e/0xa3
    [] audit_log_end+0x83/0x100
    [] ? avc_audit_pre_callback+0x103/0x103
    [] common_lsm_audit+0x441/0x450
    [] slow_avc_audit+0x63/0x67
    [] avc_has_perm+0xca/0xe3
    [] inode_has_perm+0x5a/0x65
    [] selinux_bprm_committing_creds+0x98/0x22b
    [] security_bprm_committing_creds+0xe/0x10
    [] install_exec_creds+0xe/0x79
    [] load_elf_binary+0xe36/0x10d7
    [] search_binary_handler+0x81/0x18c
    [] do_execveat_common.isra.31+0x4e3/0x7b7
    [] do_execve+0x1f/0x21
    [] SyS_execve+0x25/0x29
    [] stub_execve+0x69/0xa0

    Reported-by: Valdis Kletnieks
    Signed-off-by: Richard Guy Briggs
    Tested-by: Valdis Kletnieks
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     
  • commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream.

    Now that setgroups can be disabled and not reenabled, setting gid_map
    without privilege can now be enabled when setgroups is disabled.

    This restores most of the functionality that was lost when unprivileged
    setting of gid_map was removed. Applications that use this functionality
    will need to check to see if they use setgroups or initgroups, and if they
    don't they can be fixed by simply disabling setgroups before writing to
    gid_map.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

    - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
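
The rules listed in this commit, together with the related setgroups and gid_map changes in this series, can be modelled as a small state machine. This is a hedged sketch of the rules as the messages describe them, not the kernel implementation; every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct toy_userns {
    bool setgroups_allowed;   /* the knob: "allow" or "deny" */
    bool gid_map_set;         /* has gid_map been written yet? */
};

/* Descendant namespaces inherit the knob from their parent. */
static struct toy_userns toy_create_userns(const struct toy_userns *parent)
{
    struct toy_userns ns = { .setgroups_allowed = true, .gid_map_set = false };

    if (parent)
        ns.setgroups_allowed = parent->setgroups_allowed;
    return ns;
}

/* Models a write to the setgroups proc file. Returns 0 or -1. */
static int toy_write_setgroups(struct toy_userns *ns, const char *value)
{
    bool allow;

    if (strcmp(value, "allow") == 0)
        allow = true;
    else if (strcmp(value, "deny") == 0)
        allow = false;
    else
        return -1;

    if (ns->gid_map_set)
        return -1;            /* too late: gid_map is already set */
    if (allow && !ns->setgroups_allowed)
        return -1;            /* "deny" is a one-way switch */

    ns->setgroups_allowed = allow;
    return 0;
}

/* Unprivileged gid_map writes are only safe once setgroups is denied. */
static int toy_write_gid_map(struct toy_userns *ns, bool privileged_in_parent)
{
    if (!privileged_in_parent && ns->setgroups_allowed)
        return -1;
    ns->gid_map_set = true;
    return 0;
}

/* setgroups needs the capability, an established gid mapping, and the knob. */
static bool toy_may_setgroups(const struct toy_userns *ns, bool has_cap_setgid)
{
    return has_cap_setgid && ns->gid_map_set && ns->setgroups_allowed;
}
```

In this model an unprivileged caller first writes "deny" and only then writes the gid mapping; after that, setgroups stays unavailable in the namespace and in any namespace created below it.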
     
  • commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream.

    Generalize id_map_mutex so it can be used for more state of a user namespace.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream.

    If you did not create the user namespace and are allowed
    to write to uid_map or gid_map you should already have the necessary
    privilege in the parent user namespace to establish any mapping
    you want so this will not affect userspace in practice.

    Limiting unprivileged uid mapping establishment to the creator of the
    user namespace makes it easier to verify all credentials obtained with
    the uid mapping can be obtained without the uid mapping without
    privilege.

    Limiting unprivileged gid mapping establishment (which is temporarily
    absent) to the creator of the user namespace also ensures that the
    combination of uid and gid can already be obtained without privilege.

    This is part of the fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream.

    setresuid allows the euid to be set to any of uid, euid, suid, and
    fsuid. Therefore it is safe to allow an unprivileged user to map
    their euid and use CAP_SETUID privilege with exactly that uid,
    as no new credentials can be obtained.

    I can not find a combination of existing system calls that allows setting
    uid, euid, suid, and fsuid from the fsuid making the previous use
    of fsuid for allowing unprivileged mappings a bug.

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream.

    As any gid mapping will allow, and must allow for backwards
    compatibility, dropping groups, don't allow any gid mappings to be
    established without CAP_SETGID in the parent user namespace.

    For a small class of applications this change breaks userspace
    and removes useful functionality. This small class of applications
    includes tools/testing/selftests/mount/unprivilged-remount-test.c

    Most of the removed functionality will be added back with the addition
    of a one way knob to disable setgroups. Once setgroups is disabled
    setting the gid_map becomes as safe as setting the uid_map.

    For more common applications that set the uid_map and the gid_map
    with privilege this change will have no effect.

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.

    setgroups is unique in not needing a valid mapping before it can be called,
    in the case of setgroups(0, NULL) which drops all supplemental groups.

    The design of the user namespace assumes that CAP_SETGID can not actually
    be used until a gid mapping is established. Therefore add a helper function
    to see if the user namespace gid mapping has been established and call
    that function in the setgroups permission check.

    This is part of the fix for CVE-2014-8989, being able to drop groups
    without privilege using user namespaces.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream.

    The rule is simple. Don't allow anything that wouldn't be allowed
    without unprivileged mappings.

    It was previously overlooked that establishing gid mappings would
    allow dropping groups and potentially gaining permission to files and
    directories that had lesser permissions for a specific group than for
    all other users.

    This is the rule needed to fix CVE-2014-8989 and prevent any other
    security issues with new_idmap_permitted.

    The reason for this rule is that the unix permission model is old and
    there are programs out there somewhere that take advantage of every
    little corner of it. So allowing a uid or gid mapping to be
    established without privilege that would allow anything that would not
    be allowed without that mapping will result in expectations from some
    code somewhere being violated. Violated expectations about the
    behavior of the OS is a long way to say a security issue.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.

    Today there are 3 instances of setgroups and due to an oversight their
    permission checking has diverged. Add a common function so that
    they may all share the same permission checking code.

    This corrects the current oversight in the current permission checks
    and adds a helper to avoid this in the future.

    A user namespace security fix will update this new helper, shortly.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Dec, 2014

1 commit


04 Dec, 2014

1 commit

  • It appears that some SCHEDULE_USER (asm for schedule_user) callers
    in arch/x86/kernel/entry_64.S are called from RCU kernel context,
    and schedule_user will return in RCU user context. This causes RCU
    warnings and possible failures.

    This is intended to be a minimal fix suitable for 3.18.

    Reported-and-tested-by: Dave Jones
    Cc: Oleg Nesterov
    Cc: Frédéric Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Nov, 2014

2 commits

  • x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
    not on non-paranoid returns. I suspect that this is a mistake and that
    the code only works because int3 is paranoid.

    Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
    for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
    from the uprobes code.

    Reported-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Borislav Petkov
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Chris bisected a NULL pointer dereference in task_sched_runtime() to
    commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
    inconsistency'.

    Chris observed crashes in atop or other /proc walking programs when he
    started fork bombs on his machine. He assumed that this is a new exit
    race, but that does not make any sense when looking at that commit.

    What's interesting is that the commit provides update_curr callbacks
    for all scheduling classes except stop_task and idle_task.

    While nothing can ever hit that via the clock_nanosleep() and
    clock_gettime() interfaces, which have been the target of the commit in
    question, the author obviously forgot that there are other code paths
    which invoke task_sched_runtime()

    do_task_stat()
      thread_group_cputime_adjusted()
        thread_group_cputime()
          task_cputime()
          task_sched_runtime()
            if (task_current(rq, p) && task_on_rq_queued(p)) {
              update_rq_clock(rq);
              p->sched_class->update_curr(rq);
            }

    If the stats are read for a stomp machine task, aka 'migration/N' and
    that task is current on its cpu, this will happily call the NULL pointer
    of stop_task->update_curr. Ooops.

    Chris's observation that this happens faster when he runs the fork bomb
    makes sense as the fork bomb will kick migration threads more often so
    the probability to hit the issue will increase.

    Add the missing update_curr callbacks to the scheduler classes stop_task
    and idle_task. While idle tasks cannot be monitored via /proc we have
    other means to hit the idle case.

    Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
    Reported-by: Chris Mason
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Stanislaw Gruszka
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
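
The crash and its fix can be modelled with a table of function pointers. The sketch below uses hypothetical `toy_` types and is not the scheduler's real data layout; it only shows why a class that is called unconditionally needs at least an empty callback.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq {
    unsigned long long curr_runtime;   /* runtime already accounted */
    unsigned long long pending;        /* runtime not yet accounted */
};

struct toy_sched_class {
    void (*update_curr)(struct toy_rq *rq);
};

/* A class that really accounts the pending runtime. */
static void toy_update_curr_fair(struct toy_rq *rq)
{
    rq->curr_runtime += rq->pending;
    rq->pending = 0;
}

/* The fix: an empty callback instead of a NULL pointer. */
static void toy_update_curr_stop(struct toy_rq *rq)
{
    (void)rq;
}

static const struct toy_sched_class toy_fair_class = {
    .update_curr = toy_update_curr_fair,
};
static const struct toy_sched_class toy_stop_class = {
    .update_curr = toy_update_curr_stop,
};

/* Calls the class callback unconditionally, as the quoted path above does. */
static unsigned long long toy_task_sched_runtime(const struct toy_sched_class *cls,
                                                 struct toy_rq *rq)
{
    assert(cls->update_curr != NULL);   /* would have been NULL before the fix */
    cls->update_curr(rq);
    return rq->curr_runtime;
}
```

With the no-op callback in place, reading the runtime of a task in the stop class simply returns the value that was already accounted.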
     

22 Nov, 2014

2 commits


16 Nov, 2014

4 commits

  • Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
    test case at the cost of breaking another one. After that commit, calling
    clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
    in Y being smaller than X.

    A reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because, on start, thread_group_cputimer() initializes
    the cputimer's sum_exec_runtime with thread runtime that has not yet been
    accounted, and that runtime is then added to the running cputimer again on
    the scheduler tick, making its sum_exec_runtime bigger than the actual
    runtime of the threads.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which ensures that the stats of
    running threads are up to date, so that on the next scheduler tick we
    account only the runtime that elapsed since the cputimer started. That also
    ensures a consistent state between the cpu times of individual threads and
    the cpu time of the process consisting of those threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
    ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1) ;

        return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
        clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

        return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
        int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
        int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

        return after_i - before_i;
    }

    int main(void)
    {
        int result = 0;
        pthread_t th;

        pthread_barrier_init(&barrier, NULL, 2);

        if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }

        pthread_barrier_wait(&barrier);

        /* The test. */
        struct timespec before, after, sleeptimeabs;
        int64_t sleepdiff, diffabs;
        const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

        /* The relative nanosleep. Not sure why this is needed, but its presence
           seems to make it easier to reproduce the problem. */
        if (do_nanosleep(0, &sleeptime) != 0) {
            perror("clock_nanosleep");
            return 1;
        }

        /* Get the current time. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
            perror("clock_gettime[2]");
            return 1;
        }

        /* Compute the absolute sleep time based on the current time. */
        uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
        sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
        sleeptimeabs.tv_nsec = nsec % 1000000000;

        /* Sleep for the computed time. */
        if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
            perror("absolute clock_nanosleep");
            return 1;
        }

        /* Get the time after the sleep. */
        if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
            perror("clock_gettime[3]");
            return 1;
        }

        /* The time after sleep should always be equal to or after the absolute sleep
           time passed to clock_nanosleep. */
        sleepdiff = tsdiff(&sleeptimeabs, &after);
        if (sleepdiff < 0) {
            printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
            result = 1;

            printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
            printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
            printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
        }

        /* The difference between the timestamps taken before and after the
           clock_nanosleep call should be equal to or more than the duration of the
           sleep. */
        diffabs = tsdiff(&before, &after);
        if (diffabs < sleeptime.tv_nsec) {
            printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
            result = 1;
        }

        pthread_cancel(th);

        return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • While looking over the cpu-timer code I found that we appear to add
    the delta for the calling task twice, through:

    cpu_timer_sample_group()
      thread_group_cputimer()
        thread_group_cputime()
          times->sum_exec_runtime += task_sched_runtime();

      *sample = cputime.sum_exec_runtime + task_delta_exec();

    Which would make the sample run ahead, making the sleep short.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
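
The double accounting is easy to see with concrete numbers. The following sketch assumes a simplified model in which each task has an already-accounted runtime plus a pending delta; the `toy_` names are hypothetical and only mirror the roles of the functions quoted above.

```c
#include <assert.h>
#include <stdint.h>

struct toy_task {
    uint64_t accounted_ns;   /* runtime already folded into the task */
    uint64_t delta_ns;       /* runtime since the last accounting point */
};

/* Plays the role of task_sched_runtime(): accounted time plus pending delta. */
static uint64_t toy_task_sched_runtime(const struct toy_task *t)
{
    return t->accounted_ns + t->delta_ns;
}

/* Plays the role of the group sum: it already includes every task's delta. */
static uint64_t toy_group_sum(const struct toy_task *tasks, int n)
{
    uint64_t sum = 0;

    for (int i = 0; i < n; i++)
        sum += toy_task_sched_runtime(&tasks[i]);
    return sum;
}

/* Buggy sample: adds the caller's delta a second time. */
static uint64_t toy_sample_buggy(const struct toy_task *tasks, int n, int caller)
{
    return toy_group_sum(tasks, n) + tasks[caller].delta_ns;
}

/* Fixed sample: the group sum alone is already complete. */
static uint64_t toy_sample_fixed(const struct toy_task *tasks, int n)
{
    return toy_group_sum(tasks, n);
}
```

For two tasks with 1000 ns and 2000 ns accounted and a 50 ns pending delta on the caller, the fixed sample is 3050 ns while the buggy one reads 3100 ns, which is how the sample runs ahead and the sleep ends early.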
     
  • Because the whole numa task selection stuff runs with preemption
    enabled (it's long and expensive) we can end up migrating and selecting
    oneself as a swap target. This doesn't really work out well -- we end
    up trying to acquire the same lock twice for the swap migrate -- so
    avoid this.

    Reported-and-Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When a CPU is hotplugged out, we call perf_remove_from_context() (via
    perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
    context, but leave siblings grouped together. Freeing of these events is
    left to the mercy of the usual refcounting.

    When a CPU-bound event's refcount drops to zero we cross-call to
    __perf_remove_from_context() to clean it up, detaching grouped siblings.

    This works when the relevant CPU is online, but will fail if the CPU is
    currently offline, and we won't detach the event from its siblings
    before freeing the event, leaving the sibling list corrupt. If the
    sibling list is later walked (e.g. because the CPU came online again
    before a remaining sibling's refcount drops to zero), we will walk the
    now corrupted siblings list, potentially dereferencing garbage values.

    Given that the events should never be scheduled again (as we removed
    them from their context), we can simply detach siblings when the CPU
    goes down in the first place. If the CPU comes back online, the
    redundant call to __perf_remove_from_context() is safe.

    Reported-by: Drew Richardson
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent.weaver@maine.edu
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

15 Nov, 2014

1 commit

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are three regression fixes, two recent (generic power domains,
    suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
    one more machine having problems with Windows 8 compatibility, a minor
    cpufreq driver fix (cpufreq-dt) and a fixup for new callback
    definitions (generic power domains).

    Specifics:

    - Fix a crash in the suspend-to-idle code path introduced by a recent
    commit that forgot to check a pointer against NULL before
    dereferencing it (Dmitry Eremin-Solenikov).

    - Fix a boot crash on Exynos5 introduced by a recent commit making
    that platform use generic Device Tree bindings for power domains
    which exposed a weakness in the generic power domains framework
    leading to that crash (Ulf Hansson).

    - Fix a crash during system resume on systems where cpufreq depends
    on Operation Performance Points (OPP) for functionality, but
    CONFIG_OPP is not set. This leads the cpufreq driver registration
    to fail, but the resume code attempts to restore the pre-suspend
    cpufreq configuration (which does not exist) nevertheless and
    crashes. From Geert Uytterhoeven.

    - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
    problems if it is reported as Windows 8 compatible to the BIOS
    (Adam Lee).

    - Fix swapped arguments in an error message in the cpufreq-dt driver
    (Abhilash Kesavan).

    - Fix up the prototypes of new callbacks in struct generic_pm_domain
    to make them more useful. Users of those callbacks will be added
    in 3.19 and it's better for them to be based on the correct struct
    definition in mainline from the start. From Ulf Hansson and Kevin
    Hilman"

    * tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / Domains: Fix initial default state of the need_restore flag
    PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
    PM / Domains: Change prototype for the attach and detach callbacks
    cpufreq: Avoid crash in resume on SMP without OPP
    cpufreq: cpufreq-dt: Fix arguments in clock failure error message
    ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546

    Linus Torvalds
     

14 Nov, 2014

2 commits

  • Commit 69361eef9056 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
    but failed to update the comments for print_tainted(). So, update the
    comments.

    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Pull audit fixes from Paul Moore:
    "After he sent the initial audit pull request for 3.18, Eric asked me
    to take over the management of the audit tree, hence this pull request
    to fix a couple of problems with audit.

    As you can see below, the changes are minimal: adding some whitespace
    to a string so userspace parses it correctly, and fixing a problem
    with audit's usage of fsnotify that was causing audit watch rules to
    be lost. Neither of these patches were very controversial on the
    mailing lists and they fix real problems, getting them into 3.18 would
    be a good thing"

    * 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
    audit: keep inode pinned
    audit: AUDIT_FEATURE_CHANGE message format missing delimiting space

    Linus Torvalds
     

12 Nov, 2014

1 commit

  • Audit rules disappear when an inode they watch is evicted from the cache.
    This is likely not what we want.

    The guilty commit is "fsnotify: allow marks to not pin inodes in core",
    which didn't take into account that audit_tree adds watches with a zero
    mask.

    Adding any mask should fix this.

    Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org # 2.6.36+
    Signed-off-by: Paul Moore

    Miklos Szeredi
     

11 Nov, 2014

2 commits

  • If the read loop in trace_buffers_splice_read() keeps failing due to
    memory allocation failures without reading even a single page then this
    function will keep busy looping.

    Remove the risk for that by exiting the function if memory allocation
    failures are seen.

    Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in

    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     
  • On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
    lockup:

    # trace-cmd record -e raw_syscalls:* -F false
    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
    ...
    Call Trace:
    [] ? __wake_up_common+0x90/0x90
    [] wait_on_pipe+0x35/0x40
    [] tracing_buffers_splice_read+0x2e3/0x3c0
    [] ? tracing_stats_read+0x2a0/0x2a0
    [] ? _raw_spin_unlock+0x2b/0x40
    [] ? do_read_fault+0x21b/0x290
    [] ? handle_mm_fault+0x2ba/0xbd0
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] ? trace_buffer_lock_reserve+0x22/0x60
    [] ? trace_event_buffer_lock_reserve+0x40/0x80
    [] do_splice_to+0x6d/0x90
    [] SyS_splice+0x7c1/0x800
    [] tracesys_phase2+0xd3/0xd8

    The problem is this: tracing_buffers_splice_read() calls
    ring_buffer_wait() to wait for data in the ring buffers. The buffers
    are not empty so ring_buffer_wait() returns immediately. But
    tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
    meaning it only wants to read a full page. When the full page is not
    available, tracing_buffers_splice_read() tries to wait again with
    ring_buffer_wait(), which again returns immediately, and so on.

    Fix this by adding a "full" argument to ring_buffer_wait() which will
    make ring_buffer_wait() wait until the writer has left the reader's
    page, i.e. until full-page reads will succeed.

    Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in

    Cc: stable@vger.kernel.org # 3.16+
    Fixes: b1169cc69ba9 ("tracing: Remove mock up poll wait function")
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
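
The mismatch between the wait condition and the read condition can be captured in a few predicates. This is a simplified model with hypothetical `toy_` names, not the ring buffer code itself; it assumes a full-page read succeeds only once the writer has left the reader's page.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_rb {
    bool empty;                   /* no data at all in the buffer */
    bool writer_on_reader_page;   /* writer still filling the page to be read */
};

/* Old wait condition: satisfied as soon as the buffer is not empty. */
static bool toy_wait_done_old(const struct toy_rb *rb)
{
    return !rb->empty;
}

/* New wait condition: with full set, also require the writer to have moved on. */
static bool toy_wait_done(const struct toy_rb *rb, bool full)
{
    if (rb->empty)
        return false;
    if (full && rb->writer_on_reader_page)
        return false;
    return true;
}

/* Read condition: a full-page read fails while the writer is on that page. */
static bool toy_read_page(const struct toy_rb *rb, bool full)
{
    if (rb->empty)
        return false;
    if (full && rb->writer_on_reader_page)
        return false;
    return true;
}
```

In the model, the old wait condition can hold while a full-page read still fails, which is the busy loop; passing the same `full` flag to both makes a finished wait imply a successful read.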
     

10 Nov, 2014

1 commit

  • On latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
    __slab_alloc+0x4b4/0x4f0
    __kmalloc_track_caller+0x15f/0x1e0
    kstrdup+0x44/0x90
    alloc_vfsmnt+0xb0/0x2c0
    vfs_kern_mount+0x35/0x190
    kern_mount_data+0x25/0x50
    pid_ns_prepare_proc+0x19/0x50
    alloc_pid+0x5e2/0x630
    copy_process.part.41+0xdf5/0x2aa0
    do_fork+0xf5/0x460
    kernel_thread+0x21/0x30
    rest_init+0x1e/0x90
    start_kernel+0x522/0x531
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
    ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
    ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
    ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
    ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
    ^
    ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
    access in this line:

    sched_max_numa_distance = sched_domains_numa_distance[level - 1];

    Fix this by exiting from sched_init_numa() earlier.
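
    A minimal userspace sketch of the guard described above, not the
    actual kernel change. The function name max_numa_distance() and its
    parameters are hypothetical stand-ins for the tail of
    sched_init_numa() and for sched_domains_numa_distance[]:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the end of sched_init_numa(): 'distances'
 * plays the role of sched_domains_numa_distance[] and 'level' is the
 * number of distinct NUMA distances that were found.
 */
int max_numa_distance(const int *distances, int level)
{
	assert(level >= 0);

	/* With level == 0, distances[level - 1] would read distances[-1]. */
	if (level == 0)
		return 0;	/* exit early, there is nothing to record */

	return distances[level - 1];
}
```

    With the early exit in place, a call with level == 0 never touches
    the array, so even a NULL 'distances' pointer is harmless.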

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

09 Nov, 2014

1 commit


04 Nov, 2014

1 commit

  • sched_move_task() is the only interface to change sched_task_group:
    cpu_cgrp_subsys methods and autogroup_move_group() use it.

    Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
    is ordered with other users of sched_move_task(). This means we do not
    need RCU here: if we've dereferenced a tg here, the .attach method
    hasn't been called for it yet.

    Thus, we should pass "true" to task_css_check() to silence lockdep
    warnings.

    Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
    Reported-by: Oleg Nesterov
    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

01 Nov, 2014

8 commits

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are fixes received after my previous pull request plus one that
    has been in the works for quite a while, but its previous version
    caused problems to happen, so it's been deferred till now.

    Fixed are two recent regressions (MFD enumeration and cpufreq-dt),
    ACPI EC regression introduced in 3.17, system suspend error code path
    regression introduced in 3.15, an older bug related to recovery from
    failing resume from hibernation and a cpufreq-dt driver issue related
    to operation performance points.

    Specifics:

    - Fix a crash on r8a7791/koelsch during resume from system suspend
    caused by a recent cpufreq-dt commit (Geert Uytterhoeven).

    - Fix an MFD enumeration problem introduced by a recent commit adding
    ACPI support to the MFD subsystem that exposed a weakness in the
    ACPI core causing ACPI enumeration to be applied to all devices
    associated with one ACPI companion object, although it should be
    used for one of them only (Mika Westerberg).

    - Fix an ACPI EC regression introduced during the 3.17 cycle causing
    some Samsung laptops to misbehave as a result of a workaround
    targeted at some Acer machines. That includes a revert of a commit
    that went too far and a quirk for the Acer machines in question.
    From Lv Zheng.

    - Fix a regression in the system suspend error code path introduced
    during the 3.15 cycle that causes it to fail to take errors from
    asynchronous execution of "late" suspend callbacks into account
    (Imre Deak).

    - Fix a long-standing bug in the hibernation resume error code path
    that fails to roll back everything correctly on "freeze" callback
    errors and leaves some devices in a "suspended" state causing more
    breakage to happen subsequently (Imre Deak).

    - Make the cpufreq-dt driver disable operation performance points
    that are not supported by the VR connected to the CPU voltage plane
    with acceptable tolerance instead of constantly failing voltage
    scaling later on (Lucas Stach)"

    * tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
    Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
    cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
    PM / Sleep: fix recovery during resuming from hibernation
    PM / Sleep: fix async suspend_late/freeze_late error handling
    ACPI: Use ACPI companion to match only the first physical device
    cpufreq: cpufreq-dt: disable unsupported OPPs

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A bit has accumulated, but it's been a week or so since my last batch
    of post-merge-window fixes, so...

    1) Missing module license in netfilter reject module, from Pablo.
    Lots of people ran into this.

    2) Off by one in mac80211 baserate calculation, from Karl Beldan.

    3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
    op, which broke use of it with bonding. From Ian Morgan.

    4) Checking of skb_gso_segment()'s return value was not all
    encompassing, it can return an SKB pointer, a pointer error, or
    NULL. Fix from Florian Westphal.

    This is crummy, and longer term will be fixed to just return error
    pointers or a real SKB.

    6) Encapsulation offloads not being handled by
    skb_gso_transport_seglen(). From Florian Westphal.

    7) Fix deadlock in TIPC stack, from Ying Xue.

    8) Fix performance regression from using rhashtable for netlink
    sockets. The problem was the synchronize_net() invoked for every
    socket destroy. From Thomas Graf.

    9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
    on NET. From Alexei Starovoitov.

    10) In qdisc_create(), use the correct interface to allocate
    ->cpu_bstats, otherwise the u64_stats_sync member isn't
    initialized properly. From Sabrina Dubroca.

    11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.

    12) nf_tables_newchain() was erroneously expecting error pointers from
    netdev_alloc_pcpu_stats(). It only returns a valid pointer or
    NULL. From Sabrina Dubroca.

    13) Fix use-after-free in _decode_session6(), from Li RongQing.

    14) When we set the TX flow hash on a socket, we mistakenly do so
    before we've nailed down the final source port. Move the setting
    deeper to fix this. From Sathya Perla.

    15) NAPI budget accounting in amd-xgbe driver was counting descriptors
    instead of full packets, fix from Thomas Lendacky.

    16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
    Zhang.

    17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
    Mehrtens.

    18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
    that something that ends up being vmalloc memory can't be passed
    to the crypto hash routines via scatter-gather lists. From Eric
    Dumazet.

    19) Fix regression in promiscuous mode enabling in cdc-ether, from
    Olivier Blin.

    20) Bucket eviction and frag entry killing can race with each other,
    causing an unlink of the object from the wrong list. Fix from
    Nikolay Aleksandrov.

    21) Missing initialization of spinlock in cxgb4 driver, from Anish
    Bhatt.

    22) Do not cache ipv4 routing failures, otherwise if the sysctl for
    forwarding is subsequently enabled this won't be seen. From
    Nicolas Cavallari"
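
    Item 4 above describes a function whose return value has three
    cases: a valid pointer, an error pointer, or NULL. The sketch below
    shows how a caller can tell the three apart. It is a userspace
    approximation: err_ptr(), ptr_err() and is_err() are re-creations
    modelled on the kernel's error-pointer helpers, and classify_segs()
    and its enum are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace re-creation). */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer value lies in the top MAX_ERRNO addresses. */
int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
long ptr_err(const void *ptr)
{
	assert(is_err(ptr));
	return (long)(intptr_t)ptr;
}

enum seg_result { SEG_OK, SEG_NULL, SEG_ERROR };

/* Hypothetical caller-side check covering all three return cases. */
enum seg_result classify_segs(const void *segs)
{
	if (segs == NULL)
		return SEG_NULL;	/* neither a buffer nor an error */
	if (is_err(segs))
		return SEG_ERROR;	/* ptr_err(segs) holds -errno */
	return SEG_OK;			/* a real pointer, safe to use */
}
```

    A caller that tests only for an error pointer would treat NULL as a
    valid buffer, and one that tests only for NULL would dereference an
    error pointer, which is why both checks are needed.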

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
    drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
    drivers: net: cpsw: Fix broken loop condition in switch mode
    net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
    stmmac: pci: set default of the filter bins
    net: smc91x: Fix gpios for device tree based booting
    mpls: Allow mpls_gso to be built as module
    mpls: Fix mpls_gso handler.
    r8152: stop submitting intr for -EPROTO
    netfilter: nft_reject_bridge: restrict reject to prerouting and input
    netfilter: nft_reject_bridge: don't use IP stack to reject traffic
    netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
    netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
    netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
    drivers/net: macvtap and tun depend on INET
    drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
    drivers/net: Disable UFO through virtio
    net: skb_fclone_busy() needs to detect orphaned skb
    gre: Use inner mac length when computing tunnel length
    mlx4: Avoid leaking steering rules on flow creation error flow
    net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various scheduler fixes all over the place: three SCHED_DL fixes,
    three sched/numa fixes, two generic race fixes and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/dl: Fix preemption checks
    sched: Update comments for CLONE_NEWNS
    sched: stop the unbound recursion in preempt_schedule_context()
    sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
    sched/fair: Care divide error in update_task_scan_period()
    sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
    sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
    sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
    sched: Fix race between task_group and sched_task_group

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus on the kernel side:

    - a revert for a newly introduced PMU driver which isn't complete yet
    and where we ran out of time with fixes (to be tried again in
    v3.19) - this makes up for a large chunk of the diffstat.

    - compilation warning fixes

    - a printk message fix

    - event_idx usage fixes/cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf probe: Trivial typo fix for --demangle
    perf tools: Fix report -F dso_from for data without branch info
    perf tools: Fix report -F dso_to for data without branch info
    perf tools: Fix report -F symbol_from for data without branch info
    perf tools: Fix report -F symbol_to for data without branch info
    perf tools: Fix report -F mispredict for data without branch info
    perf tools: Fix report -F in_tx for data without branch info
    perf tools: Fix report -F abort for data without branch info
    perf tools: Make CPUINFO_PROC an array to support different kernel versions
    perf callchain: Use global caching provided by libunwind
    perf/x86/intel: Revert incomplete and undocumented Broadwell client support
    perf/x86: Fix compile warnings for intel_uncore
    perf: Fix typos in sample code in the perf_event.h header
    perf: Fix and clean up initialization of pmu::event_idx
    perf: Fix bogus kernel printk
    perf diff: Add missing hists__init() call at tool start

    Linus Torvalds
     
  • Pull futex fixes from Ingo Molnar:
    "This contains two futex fixes: one fixes a race condition, the other
    clarifies shared/private futex comments"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Fix a race condition between REQUEUE_PI and task death
    futex: Mention key referencing differences between shared and private futexes

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "The tree contains two RCU fixes and a compiler quirk comment fix"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Make rcu_barrier() understand about missing rcuo kthreads
    compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
    rcu: More on deadlock between CPU hotplug and expedited grace periods

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "As you requested in the rc2 release mail the timer department serves
    you a few real bug fixes:

    - Fix the probe logic of the architected arm/arm64 timer
    - Plug a stack info leak in posix-timers
    - Prevent a shift out of bounds issue in the clockevents core"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/ARM64: arch-timer: fix arch_timer_probed logic
    clockevents: Prevent shift out of bounds
    posix-timers: Fix stack info leak in timer_create()

    Linus Torvalds
     
  • …/git/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "ARM has system calls outside the NR_syscalls range, and the generic
    tracing system does not support that and without checks, it can cause
    an oops to be reported.

    Rabin Vincent added checks in the return code on syscall events to
    make sure that the system call number is within the range that tracing
    knows about, and if not, simply ignores the system call.

    The system call tracing infrastructure needs to be rewritten to handle
    these cases better, but for now, to keep from oopsing, this patch will
    do"

    * tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/syscalls: Ignore numbers outside NR_syscalls' range
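
    The range check described in the message above can be sketched in
    userspace as follows. NR_SYSCALLS, events_seen[], trace_syscall()
    and events_for() are hypothetical names chosen for this
    illustration, and the table size is arbitrary:

```c
#include <assert.h>
#include <stddef.h>

#define NR_SYSCALLS 8	/* hypothetical size of the tracing table */

static int events_seen[NR_SYSCALLS];

/*
 * Record one event for syscall_nr. Returns 1 if it was recorded and 0
 * if the number is outside the range the table knows about, in which
 * case the event is simply ignored.
 */
int trace_syscall(int syscall_nr)
{
	if (syscall_nr < 0 || syscall_nr >= NR_SYSCALLS)
		return 0;

	events_seen[syscall_nr]++;
	return 1;
}

/* Number of events recorded for an in-range syscall number. */
int events_for(int syscall_nr)
{
	assert(syscall_nr >= 0 && syscall_nr < NR_SYSCALLS);
	return events_seen[syscall_nr];
}
```

    Without the check, an out-of-range number would index past the end
    of the table, which is the kind of access the message says can lead
    to an oops.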

    Linus Torvalds