11 Feb, 2015

2 commits

  • This patch introduces generic code to perform PM domain look-up using
    device tree and automatically bind devices to their PM domains.

    Generic device tree bindings are introduced to specify PM domains of
    devices in their device tree nodes.

    Backwards compatibility with legacy Samsung-specific PM domain bindings
    is provided, but for now the new code is not compiled when
    CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code.
    This will change as soon as the Exynos PM domain code gets converted to
    use the generic framework in a further patch.

    This patch was originally submitted by Tomasz Figa when he was employed
    by Samsung.

    Link: http://marc.info/?l=linux-pm&m=139955349702152&w=2
    Signed-off-by: Ulf Hansson
    Acked-by: Rob Herring
    Tested-by: Philipp Zabel
    Reviewed-by: Kevin Hilman
    Signed-off-by: Rafael J. Wysocki
    (cherry picked from commit aa42240ab2544a8bcb2efb400193826f57f3175e)
    (cherry picked from commit 4a2d7a846761e3b86e08b903e5a1a088686e2181)

    Tomasz Figa
     
  • This reverts commit 4aa055cb0634bc8d0389070104fe6aa7cfa99b8c.
    Signed-off-by: Robin Gong

    (cherry picked from commit e599f64de890a60a3b9884dd5838c43472f145e2)

    Robin Gong
     

16 Jan, 2015

2 commits

    Move the x86_64 idle notifiers, originally by Andi Kleen and Venkatesh
    Pallipadi, to generic code.

    Change-Id: Idf29cda15be151f494ff245933c12462643388d5
    Acked-by: Nicolas Pitre
    Signed-off-by: Todd Poynor

    Todd Poynor
     
  • This patch introduces generic code to perform power domain look-up using
    device tree and automatically bind devices to their power domains.
    Generic device tree binding is introduced to specify power domains of
    devices in their device tree nodes.

    Backwards compatibility with legacy Samsung-specific power domain
    bindings is provided, but for now the new code is not compiled when
    CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code. This
    will change as soon as the Exynos power domain code gets converted to use
    the generic framework in a further patch.

    Signed-off-by: Tomasz Figa
    Reviewed-by: Mark Brown
    Reviewed-by: Kevin Hilman
    Reviewed-by: Philipp Zabel
    [on i.MX6 GK802]
    Tested-by: Philipp Zabel
    Reviewed-by: Ulf Hansson
    [shawn.guo: http://thread.gmane.org/gmane.linux.kernel.samsung-soc/31029]
    Signed-off-by: Shawn Guo

    Tomasz Figa
     

09 Jan, 2015

11 commits

  • commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream.

    alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.
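
    For illustration, a minimal userspace sketch of the refcount pattern the
    fix restores (the struct and helpers below are stand-ins for the kernel's
    get_pid_ns()/put_pid_ns(), not the actual code):

    #include <stdio.h>

    struct pid_ns { int refcount; };

    /* illustrative stand-ins for the kernel's get_pid_ns()/put_pid_ns() */
    static struct pid_ns *get_pid_ns(struct pid_ns *ns) { ns->refcount++; return ns; }
    static void put_pid_ns(struct pid_ns *ns) { ns->refcount--; }

    static int pid_allocation_disabled = 1; /* pretend the child_reaper is exiting */

    /* alloc_pid() sketch: the reference is taken up front, so every failure
     * path has to drop it again to keep the namespace refcount balanced */
    static struct pid_ns *alloc_pid(struct pid_ns *ns)
    {
        get_pid_ns(ns);
        if (pid_allocation_disabled) {
            put_pid_ns(ns);          /* the call the original code forgot */
            return NULL;
        }
        return ns;                   /* success: keep the reference */
    }

    int main(void)
    {
        struct pid_ns ns = { .refcount = 1 };

        alloc_pid(&ns);
        printf("refcount after failed alloc_pid(): %d\n", ns.refcount); /* prints 1 */
        return 0;
    }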

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream.

    A regression was caused by commit 780a7654cee8 ("audit: Make testing for
    a valid loginuid explicit"), which in turn attempted to fix a regression
    caused by e1760bd.

    When audit_krule_to_data() fills in the rules to get a listing, there was a
    missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.

    This broke userspace by not returning the same information that was sent and
    expected.

    The rule:
    auditctl -a exit,never -F auid=-1
    gives:
    auditctl -l
    LIST_RULES: exit,never f24=0 syscall=all
    when it should give:
    LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all

    Tag it so that it is reported the same way it was set. Create a new
    private flags audit_krule field (pflags) to store it that won't interact with
    the public one from the API.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     
  • commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream.

    Now that setgroups can be disabled and not re-enabled, setting gid_map
    without privilege can be allowed when setgroups is disabled.

    This restores most of the functionality that was lost when unprivileged
    setting of gid_map was removed. Applications that use this functionality
    will need to check to see if they use setgroups or init_groups, and if they
    don't they can be fixed by simply disabling setgroups before writing to
    gid_map.
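
    As an illustration, a minimal userspace sketch of that adjusted sequence
    (error handling trimmed; the identity mappings below are examples only):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
            perror(path);
        if (fd >= 0)
            close(fd);
    }

    int main(void)
    {
        char map[64];

        unshare(CLONE_NEWUSER);                      /* new user namespace */
        write_file("/proc/self/setgroups", "deny");  /* must happen before gid_map */
        snprintf(map, sizeof(map), "0 %u 1", (unsigned)getuid());
        write_file("/proc/self/uid_map", map);
        snprintf(map, sizeof(map), "0 %u 1", (unsigned)getgid());
        write_file("/proc/self/gid_map", map);       /* now permitted without CAP_SETGID */
        return 0;
    }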

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

    - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream.

    Generalize id_map_mutex so it can be used for more state of a user namespace.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream.

    If you did not create the user namespace and are allowed
    to write to uid_map or gid_map you should already have the necessary
    privilege in the parent user namespace to establish any mapping
    you want so this will not affect userspace in practice.

    Limiting unprivileged uid mapping establishment to the creator of the
    user namespace makes it easier to verify all credentials obtained with
    the uid mapping can be obtained without the uid mapping without
    privilege.

    Limiting unprivileged gid mapping establishment (which is temporarily
    absent) to the creator of the user namespace also ensures that the
    combination of uid and gid can already be obtained without privilege.

    This is part of the fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream.

    setresuid allows the euid to be set to any of uid, euid, suid, and
    fsuid. Therefore it is safe to allow an unprivileged user to map
    their euid and use CAP_SETUID privileged with exactly that uid,
    as no new credentials can be obtained.

    I can not find a combination of existing system calls that allows setting
    uid, euid, suid, and fsuid from the fsuid making the previous use
    of fsuid for allowing unprivileged mappings a bug.

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream.

    As any gid mapping will allow, and must allow for backwards
    compatibility, dropping groups, don't allow any gid mappings to be
    established without CAP_SETGID in the parent user namespace.

    For a small class of applications this change breaks userspace
    and removes useful functionality. This small class of applications
    includes tools/testing/selftests/mount/unprivileged-remount-test.c

    Most of the removed functionality will be added back with the addition
    of a one way knob to disable setgroups. Once setgroups is disabled
    setting the gid_map becomes as safe as setting the uid_map.

    For more common applications that set the uid_map and the gid_map
    with privilege this change will have no effect.

    This is part of a fix for CVE-2014-8989.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.

    setgroups is unique in not needing a valid mapping before it can be called,
    in the case of setgroups(0, NULL) which drops all supplemental groups.

    The design of the user namespace assumes that CAP_SETGID can not actually
    be used until a gid mapping is established. Therefore add a helper function
    to see if the user namespace gid mapping has been established and call
    that function in the setgroups permission check.
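
    A rough sketch of the shape of that check (struct layout and function
    names here are assumptions for illustration, not the kernel's exact code):

    #include <stdbool.h>

    /* illustrative stand-ins; the real types live in include/linux/user_namespace.h */
    struct uid_gid_map { unsigned int nr_extents; };
    struct user_namespace { struct uid_gid_map gid_map; };

    /* helper sketch: a gid mapping counts as "established" once at least
     * one extent has been written to gid_map */
    bool userns_gid_map_established(const struct user_namespace *ns)
    {
        return ns->gid_map.nr_extents != 0;
    }

    /* setgroups() permission-check sketch calling the helper */
    bool may_call_setgroups(const struct user_namespace *ns, bool has_cap_setgid)
    {
        return has_cap_setgid && userns_gid_map_established(ns);
    }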

    This is part of the fix for CVE-2014-8989, being able to drop groups
    without privilege using user namespaces.

    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream.

    The rule is simple. Don't allow anything that wouldn't be allowed
    without unprivileged mappings.

    It was previously overlooked that establishing gid mappings would
    allow dropping groups and potentially gaining permission to files and
    directories that had lesser permissions for a specific group than for
    all other users.

    This is the rule needed to fix CVE-2014-8989 and prevent any other
    security issues with new_idmap_permitted.

    The reason for this rule is that the unix permission model is old and
    there are programs out there somewhere that take advantage of every
    little corner of it. So allowing a uid or gid mapping to be
    established without privilege that would allow anything that would not
    be allowed without that mapping will result in expectations from some
    code somewhere being violated. Violated expectations about the
    behavior of the OS is a long way of saying a security issue.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.

    Today there are 3 instances of setgroups and due to an oversight their
    permission checking has diverged. Add a common function so that
    they may all share the same permission checking code.

    This corrects the oversight in the current permission checks
    and adds a helper to avoid this in the future.

    A user namespace security fix will update this new helper, shortly.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

07 Dec, 2014

1 commit

  • commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream.

    x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
    not on non-paranoid returns. I suspect that this is a mistake and that
    the code only works because int3 is paranoid.

    Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
    for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
    from the uprobes code.

    Reported-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Borislav Petkov
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     

22 Nov, 2014

6 commits

  • commit b3f207855f57b9c8f43a547a801340bb5cbc59e5 upstream.

    When running a 32-bit userspace on a 64-bit kernel (eg. i386
    application on x86_64 kernel or 32-bit arm userspace on arm64
    kernel) some of the perf ioctls must be treated with special
    care, as they have a pointer size encoded in the command.

    For example, PERF_EVENT_IOC_ID in 32-bit world will be encoded
    as 0x80042407, but a 64-bit kernel will expect 0x80082407. As a
    result the ioctl will fail, returning -ENOTTY.

    This patch solves the problem by adding a compat_ioctl file operation
    that fixes up the size.
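
    The core of the fixup is rewriting the size field of the ioctl command;
    a self-contained sketch of just that rewrite, using the example values
    from above (the _IOC field layout constants are assumed, x86 layout, and
    the real handler only does this for the affected perf commands):

    #include <stdio.h>

    #define IOC_SIZESHIFT 16
    #define IOC_SIZEMASK  0x3fffU

    static unsigned int fixup_compat_cmd(unsigned int cmd)
    {
        unsigned int size = (cmd >> IOC_SIZESHIFT) & IOC_SIZEMASK;

        /* a 32-bit caller encoded sizeof(void *) == 4; the 64-bit kernel
         * expects 8, so rewrite the size field before dispatching */
        if (size == sizeof(unsigned int)) {
            cmd &= ~(IOC_SIZEMASK << IOC_SIZESHIFT);
            cmd |= sizeof(void *) << IOC_SIZESHIFT;
        }
        return cmd;
    }

    int main(void)
    {
        printf("0x%08x -> 0x%08x\n", 0x80042407U, fixup_compat_cmd(0x80042407U));
        return 0;
    }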

    Reported-by: Drew Richardson
    Signed-off-by: Pawel Moll
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Link: http://lkml.kernel.org/r/1402671812-9078-1-git-send-email-pawel.moll@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: David Ahern
    Signed-off-by: Greg Kroah-Hartman

    Pawel Moll
     
  • commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream.

    The rcu_gp_kthread_wake() function checks for three conditions before
    waking up grace period kthreads:

    * Is the thread we are trying to wake up the current thread?
    * Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
    * Is there no thread created for this flavour, hence nothing to wake up?

    If any one of these conditions is true, we do not call wake_up().
    It was found that there are quite a few avoidable wake ups both during
    idle time and under stress induced by rcutorture.

    Idle:

    Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
    Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

    rcutorture:

    Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
    Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

    Here case{1-3} are the cases listed above. We can avoid these wake
    ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
    period kthreads.

    There is a comment about an implied barrier supplied by the wake_up()
    logic. This barrier is necessary for the awakened thread to see the
    updated ->gp_flags. This flag is always being updated with the root node
    lock held. Also, the awakened thread tries to acquire the root node lock
    before reading ->gp_flags because of which there is proper ordering.

    Hence this commit tries to avoid calling wake_up() whenever we can by
    using the rcu_gp_kthread_wake() function.
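
    A rough sketch of the three-condition check (the struct fields and names
    below are illustrative stand-ins, not the kernel's exact code):

    #include <stdbool.h>

    struct task { int dummy; };
    struct rcu_state {
        struct task *gp_kthread;    /* grace-period kthread, NULL until created */
        unsigned int gp_flags;      /* kthread waits for this to become non-zero */
    };

    /* rcu_gp_kthread_wake() sketch: skip the wake-up in the three cases
     * listed above, otherwise wake the grace-period kthread */
    bool rcu_gp_kthread_wake_sketch(struct rcu_state *rsp, struct task *current_task)
    {
        if (current_task == rsp->gp_kthread ||   /* case 1: self-wake */
            !rsp->gp_flags ||                    /* case 2: nothing to do */
            !rsp->gp_kthread)                    /* case 3: kthread not created yet */
            return false;
        /* wake_up(&rsp->gp_wq) would go here in the kernel */
        return true;
    }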

    Signed-off-by: Pranith Kumar
    CC: Mathieu Desnoyers
    Signed-off-by: Paul E. McKenney
    Cc: Kamal Mostafa
    Signed-off-by: Greg Kroah-Hartman

    Pranith Kumar
     
  • commit 48a7639ce80cf279834d0d44865e49ecd714f37d upstream.

    The rcu_start_gp_advanced() function currently uses irq_work_queue()
    to defer wakeups of the RCU grace-period kthread. This deferring
    is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
    structure's lock, meaning that RCU cannot call any of the scheduler's
    wake-up functions while holding one of these locks.

    Unfortunately, the second and subsequent calls to irq_work_queue() are
    ignored, and the first call will be ignored (aside from queuing the work
    item) if the scheduler-clock tick is turned off. This is OK for many
    uses, especially those where irq_work_queue() is called from an interrupt
    or softirq handler, because in those cases the scheduler-clock-tick state
    will be re-evaluated, which will turn the scheduler-clock tick back on.
    On the next tick, any deferred work will then be processed.

    However, this strategy does not always work for RCU, which can be invoked
    at process level from idle CPUs. In this case, the tick might never
    be turned back on, indefinitely deferring a grace-period start request.
    Note that the RCU CPU stall detector cannot see this condition, because
    there is no RCU grace period in progress. Therefore, we can (and do!)
    see long tens-of-seconds stalls in grace-period handling. In theory,
    we could see a full grace-period hang, but rcutorture testing to date
    has seen only the tens-of-seconds stalls. Event tracing demonstrates
    that irq_work_queue() is being called repeatedly to no effect during
    these stalls: The "newreq" event appears repeatedly from a task that is
    not one of the grace-period kthreads.

    In theory, irq_work_queue() might be fixed to avoid this sort of issue,
    but RCU's requirements are unusual and it is quite straightforward to pass
    wake-up responsibility up through RCU's call chain, so that the wakeup
    happens when the offending locks are released.

    This commit therefore makes this change. The rcu_start_gp_advanced(),
    rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
    __note_gp_changes(), and rcu_start_gp() functions now return a boolean
    which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
    does the wakeup when it is necessary and safe to do so: No self-wakes,
    no wake-ups if the ->gp_flags field indicates there is no need (as in
    someone else did the wake-up before we got around to it), and no wake-ups
    before the grace-period kthread has been created.

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Reviewed-by: Josh Triplett
    [ Pranith: backport to 3.13-stable: just rcu_gp_kthread_wake(),
    prereq for 2aa792e "rcu: Use rcu_gp_kthread_wake() to wake up grace
    period kthreads" ]
    Signed-off-by: Pranith Kumar
    Signed-off-by: Kamal Mostafa
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 799b601451b21ebe7af0e6e8f6e2ccd4683c5064 upstream.

    Audit rules disappear when an inode they watch is evicted from the cache.
    This is likely not what we want.

    The guilty commit is "fsnotify: allow marks to not pin inodes in core",
    which didn't take into account that audit_tree adds watches with a zero
    mask.

    Adding any mask should fix this.

    Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 897f1acbb6702ddaa953e8d8436eee3b12016c7e upstream.

    Add a space between subj= and feature= fields to make them parsable.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     
  • commit 9ef91514774a140e468f99d73d7593521e6d25dc upstream.

    When an AUDIT_GET_FEATURE message is sent from userspace to the kernel, it
    should reply with a message tagged as an AUDIT_GET_FEATURE type with a struct
    audit_feature. The current reply is a message tagged as an AUDIT_GET
    type with a struct audit_feature.

    This appears to have been a cut-and-paste error in commit b0fed40.

    Reported-by: Steve Grubb
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Greg Kroah-Hartman

    Richard Guy Briggs
     

15 Nov, 2014

8 commits

  • commit f1e3a0932f3a9554371792a7daaf1e0eb19f66d5 upstream.

    Probability of use-after-free isn't zero in this place.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     
  • commit 6891c4509c792209c44ced55a60f13954cb50ef4 upstream.

    If userland creates a timer without specifying a sigevent info, we'll
    create one ourselves, using a stack-local variable. In particular, we will
    use the timer ID as sival_int. But as sigev_value is a union containing
    a pointer and an int, that assignment will only partially initialize
    sigev_value on systems where the size of a pointer is bigger than the
    size of an int. On such systems we'll copy the uninitialized stack bytes
    from the timer_create() call to userland when the timer actually fires
    and we're going to deliver the signal.

    Initialize sigev_value with 0 to plug the stack info leak.
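
    A minimal sketch of the idea, zeroing the structure before filling it in
    (this mirrors the described fix but is written as a standalone userspace
    helper, not the timer_create() code itself):

    #include <signal.h>
    #include <string.h>

    /* zero the whole sigevent before filling in the fields, so the pointer
     * half of sigev_value never carries stack garbage */
    void build_default_sigevent(struct sigevent *ev, int timer_id)
    {
        memset(ev, 0, sizeof(*ev));           /* plugs the info leak */
        ev->sigev_notify = SIGEV_SIGNAL;
        ev->sigev_signo = SIGALRM;
        ev->sigev_value.sival_int = timer_id; /* sival_int only partially overlays sival_ptr */
    }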

    Found in the PaX patch, written by the PaX Team.

    Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
    Signed-off-by: Mathias Krause
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Cc: PaX Team
    Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Mathias Krause
     
  • commit 94fb823fcb4892614f57e59601bb9d4920f24711 upstream.

    If a device's dev_pm_ops::freeze callback fails during the QUIESCE
    phase, we don't roll back things correctly by calling the thaw and complete
    callbacks. This could leave some devices in a suspended state in case of
    an error during resuming from hibernation.

    Signed-off-by: Imre Deak
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Imre Deak
     
  • commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.

    PM freezer relies on having all tasks frozen by the time devices are
    getting frozen so that no task will touch them while they are getting
    frozen. But OOM killer is allowed to kill an already frozen task in
    order to handle an OOM situation. In order to protect from late wake ups
    OOM killer is disabled after all tasks are frozen. This, however, still
    keeps a window open when a killed task didn't manage to die by the time
    freeze_processes finishes.

    Reduce the race window by checking all tasks after OOM killer has been
    disabled. This is still not race free completely unfortunately because
    oom_killer_disable cannot stop an already ongoing OOM killer so a task
    might still wake up from the fridge and get killed without
    freeze_processes noticing. Full synchronization of OOM and freezer is,
    however, too heavy weight for this highly unlikely case.

    Introduce and check oom_kills counter which gets incremented early when
    the allocator enters __alloc_pages_may_oom path and only check all the
    tasks if the counter changes during the freezing attempt. The counter
    is updated so early to reduce the race window since allocator checked
    oom_killer_disabled which is set by PM-freezing code. A false positive
    will push the PM-freezer into a slow path but that is not a big deal.

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 51fae6da640edf9d266c94f36bc806c63c301991 upstream.

    Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
    before deferring) OOM killer relies on being able to thaw a frozen task
    to handle OOM situation but a3201227f803 (freezer: make freezing() test
    freeze conditions in effect instead of TIF_FREEZE) has reorganized the
    code and stopped clearing freeze flag in __thaw_task. This means that
    the target task only wakes up and goes into the fridge again because the
    freezing condition hasn't changed for it. This reintroduces the bug
    fixed by f660daac474c6f.

    Fix the issue by checking for TIF_MEMDIE thread flag in
    freezing_slow_path and exclude the task from freezing completely. If a
    task was already frozen it would get woken by __thaw_task from OOM killer
    and get out of freezer after rechecking freezing().
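
    The shape of that check, as a self-contained sketch (the flag bit and
    names below are stand-ins; the real check lives in kernel/freezer.c):

    #include <stdbool.h>

    struct task { unsigned long flags; };
    #define TIF_MEMDIE_FLAG (1UL << 0)   /* assumed bit, for illustration only */

    /* freezing_slow_path() sketch: a task picked by the OOM killer
     * (TIF_MEMDIE) is excluded from freezing entirely, so thawing it
     * actually lets it exit instead of bouncing back into the fridge */
    bool freezing_slow_path_sketch(const struct task *p, bool freezing_requested)
    {
        if (p->flags & TIF_MEMDIE_FLAG)
            return false;            /* never consider an OOM victim freezable */
        return freezing_requested;
    }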

    Changes since v1
    - put TIF_MEMDIE check into freezing_slow_path rather than in __refrigerator
    as per Oleg
    - return __thaw_task into oom_scan_process_thread because
    oom_kill_process will not wake task in the fridge because it is
    sleeping uninterruptible

    [mhocko@suse.cz: rewrote the changelog]
    Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
    Signed-off-by: Cong Wang
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • commit d3051b489aa81ca9ba62af366149ef42b8dae97c upstream.

    A panic was seen in the following situation.

    There are two threads running on the system. The first thread is a system
    monitoring thread that is reading /proc/modules. The second thread is
    loading and unloading a module (in this example I'm using my simple
    dummy-module.ko). Note, in the "real world" this occurred with the qlogic
    driver module.

    When doing this, the following panic occurred:

    ------------[ cut here ]------------
    kernel BUG at kernel/module.c:3739!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
    CPU: 37 PID: 186343 Comm: cat Tainted: GF O-------------- 3.10.0+ #7
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
    RIP: 0010:[] [] module_flags+0xb5/0xc0
    RSP: 0018:ffff88080fa7fe18 EFLAGS: 00010246
    RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
    RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
    RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
    R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
    FS: 00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
    ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
    ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
    Call Trace:
    [] m_show+0x19c/0x1e0
    [] seq_read+0x16e/0x3b0
    [] proc_reg_read+0x3d/0x80
    [] vfs_read+0x9c/0x170
    [] SyS_read+0x58/0xb0
    [] system_call_fastpath+0x16/0x1b
    Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
    RIP [] module_flags+0xb5/0xc0
    RSP

    Consider the two processes running on the system.

    CPU 0 (/proc/modules reader)
    CPU 1 (loading/unloading module)

    CPU 0 opens /proc/modules, and starts displaying data for each module by
    traversing the modules list via fs/seq_file.c:seq_open() and
    fs/seq_file.c:seq_read(). For each module in the modules list, seq_read
    does

    op->start()   /* m_start(): takes module_mutex */
    op->show()    /* m_show(): checks mod->state == MODULE_STATE_UNFORMED */
    op->stop()    /* m_stop(): releases module_mutex */
    ...

    The other thread, CPU 1, in unloading the module calls the syscall
    delete_module() defined in kernel/module.c. The module_mutex is acquired
    for a short time, and then released. free_module() is called without the
    module_mutex. free_module() then sets mod->state = MODULE_STATE_UNFORMED,
    also without the module_mutex. Some additional code is called and then the
    module_mutex is reacquired to remove the module from the modules list:

    /* Now we can delete it from the lists */
    mutex_lock(&module_mutex);
    stop_machine(__unlink_module, mod, NULL);
    mutex_unlock(&module_mutex);

    This is the sequence of events that leads to the panic.

    CPU 1 is removing dummy_module via delete_module(). It acquires the
    module_mutex, and then releases it. CPU 1 has NOT set dummy_module->state to
    MODULE_STATE_UNFORMED yet.

    CPU 0, which is reading the /proc/modules, acquires the module_mutex and
    acquires a pointer to the dummy_module which is still in the modules list.
    CPU 0 calls m_show for dummy_module. The check in m_show() for
    MODULE_STATE_UNFORMED passed for dummy_module even though it is being
    torn down.

    Meanwhile CPU 1, which has been continuing to remove dummy_module without
    holding the module_mutex, now calls free_module() and sets
    dummy_module->state to MODULE_STATE_UNFORMED.

    CPU 0 now calls module_flags() with dummy_module and ...

    static char *module_flags(struct module *mod, char *buf)
    {
    int bx = 0;

    BUG_ON(mod->state == MODULE_STATE_UNFORMED);

    and BOOM.

    Acquire and release the module_mutex lock around the setting of
    MODULE_STATE_UNFORMED in the teardown path, which should resolve the
    problem.
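
    A self-contained sketch of that teardown-path change (pthread mutex and
    local state stand in for module_mutex and mod->state from kernel/module.c):

    #include <pthread.h>

    enum module_state { MODULE_STATE_LIVE, MODULE_STATE_UNFORMED };
    static pthread_mutex_t module_mutex = PTHREAD_MUTEX_INITIALIZER;
    static enum module_state mod_state = MODULE_STATE_LIVE;

    /* the state change in free_module() now happens under module_mutex, so
     * an m_show() reader either sees the module before the transition or
     * not at all */
    void mark_module_unformed(void)
    {
        pthread_mutex_lock(&module_mutex);
        mod_state = MODULE_STATE_UNFORMED;
        pthread_mutex_unlock(&module_mutex);
    }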

    Testing: In the unpatched kernel I can panic the system within 1 minute by
    doing

    while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done

    and

    while (true) do cat /proc/modules; done

    in separate terminals.

    In the patched kernel I was able to run just over one hour without seeing
    any issues. I also verified the output of panic via sysrq-c and the output
    of /proc/modules looks correct for all three states for the dummy_module.

    dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
    dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
    dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)

    Signed-off-by: Prarit Bhargava
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Rusty Russell
    Signed-off-by: Greg Kroah-Hartman

    Prarit Bhargava
     
  • commit 66339c31bc3978d5fff9c4b4cb590a861def4db2 upstream.

    dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
    Probability of use-after-free isn't zero here.

    Also add lockdep assert into dl_bw_cpus().

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     
  • commit 086ba77a6db00ed858ff07451bedee197df868c9 upstream.

    ARM has some private syscalls (for example, set_tls(2)) which lie
    outside the range of NR_syscalls. If any of these are called while
    syscall tracing is being performed, out-of-bounds array access will
    occur in the ftrace and perf sys_{enter,exit} handlers.

    # trace-cmd record -e raw_syscalls:* true && trace-cmd report
    ...
    true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
    true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
    true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
    true-653 [000] 384.675988: sys_exit: NR 983045 = 0
    ...

    # trace-cmd record -e syscalls:* true
    [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
    [ 17.289590] pgd = 9e71c000
    [ 17.289696] [aaaaaace] *pgd=00000000
    [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [ 17.290169] Modules linked in:
    [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
    [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
    [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
    [ 17.290866] LR is at syscall_trace_enter+0x124/0x184

    Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

    Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    added the check for less than zero, but it should have also checked
    for greater than NR_syscalls.
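
    The added range check amounts to the following (NR_syscalls value here is
    illustrative; the real constant comes from the architecture's unistd.h):

    #define NR_syscalls 400

    /* previously only nr < 0 was rejected; ARM private syscalls such as
     * 983045 (set_tls) also have to be ignored */
    int syscall_nr_in_range(int nr)
    {
        return nr >= 0 && nr < NR_syscalls;
    }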

    Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in

    Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Rabin Vincent
     

31 Oct, 2014

1 commit

  • commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0 upstream.

    Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
    nothing to wake up) changes the futex code to avoid taking a lock when
    there are no waiters. This code has been subsequently fixed in commit
    11d4616bd07f (futex: revert back to the explicit waiter counting code).
    Both the original commit and the fix-up rely on get_futex_key_refs() to
    always imply a barrier.

    However, for private futexes, none of the cases in the switch statement
    of get_futex_key_refs() would be hit and the function completes without
    a memory barrier as required before checking the "waiters" in
    futex_wake() -> hb_waiters_pending(). The consequence is a race with a
    thread waiting on a futex on another CPU, allowing the waker thread to
    read "waiters == 0" while the waiter thread to have read "futex_val ==
    locked" (in kernel).

    Without this fix, the problem (user space deadlocks) can be seen with
    Android bionic's mutex implementation on an arm64 multi-cluster system.
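
    A sketch of the barrier placement in the private-futex case (the offset
    values and smp_mb() stand-in below are illustrative; the shared cases
    already imply the barrier via their reference-taking helpers):

    /* smp_mb() stand-in so the sketch is self-contained (GCC/Clang builtin) */
    #define smp_mb() __sync_synchronize()

    /* stand-ins for the FUT_OFF_* offsets in kernel/futex.c */
    enum futex_key_offset { KEY_OFF_INODE = 1, KEY_OFF_MMSHARED = 2 };

    void get_futex_key_refs_sketch(int key_offset)
    {
        switch (key_offset) {
        case KEY_OFF_INODE:
            /* shared, file-backed: ihold() implies the needed barrier */
            break;
        case KEY_OFF_MMSHARED:
            /* shared anonymous mapping: futex_get_mm() implies a barrier */
            break;
        default:
            /* private futex: none of the cases above are hit, so issue the
             * barrier explicitly before hb_waiters_pending() is checked */
            smp_mb();
            break;
        }
    }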

    Signed-off-by: Catalin Marinas
    Reported-by: Matteo Franchin
    Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
    Acked-by: Davidlohr Bueso
    Tested-by: Mike Galbraith
    Cc: Darren Hart
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     

10 Oct, 2014

5 commits

  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this, the cache hit rate and
    the latency of find_vma(). Improving the hit-rate does not necessarily
    translate in finding the vma any faster, as the overhead of any fancy
    caching schemes can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 50.61% | 19.90 |
    | patched | 73.45% | 13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 75.28% | 11.03 |
    | patched | 88.09% | 9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 70.66% | 17.14 |
    | patched | 91.15% | 12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amounts of cycles can fluctuate between
    anywhere from ~60 to ~116 for the baseline scheme, but this approach
    reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 1.06% | 91.54 |
    | patched | 99.97% | 14.18 |
    +----------------+----------+------------------+
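
    To make the per-thread scheme described above concrete, a rough sketch of
    the cache data structure (the slot count and page-number index policy are
    assumptions drawn from the description, not the patch's exact code):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT    12
    #define VMACACHE_BITS 2                  /* a handful of slots per thread */
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)

    struct vm_area { uintptr_t start, end; };

    /* per-thread cache: a few slots plus the sequence number that was
     * current when the slots were last validated against the mm */
    struct vmacache_sketch {
        uint32_t seqnum;
        struct vm_area *slots[VMACACHE_SIZE];
    };

    /* replacement/lookup policy: the slot is picked from the page number of
     * the faulting address */
    size_t vmacache_slot(uintptr_t addr)
    {
        return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
    }

    /* invalidation: the address space just bumps one 32-bit counter; a
     * thread whose cached seqnum no longer matches treats its slots as empty */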

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Davidlohr Bueso
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively fast paths.

    Since the naming, get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
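
    The resulting call pattern looks roughly like this (the begin/retry
    stand-ins and try_allocation() below are local stubs so the sketch is
    self-contained, not the kernel implementations):

    #include <stdbool.h>
    #include <stddef.h>

    static unsigned int mems_seq;
    static unsigned int read_mems_allowed_begin(void) { return mems_seq; }
    static bool read_mems_allowed_retry(unsigned int seq) { return seq != mems_seq; }

    static void *try_allocation(void) { return NULL; }  /* hypothetical allocation step */

    void *alloc_with_retry(void)
    {
        void *page;
        unsigned int seq;

        do {
            seq = read_mems_allowed_begin();
            page = try_allocation();
            /* note the inverted sense: retry() returns true when mems_allowed
             * changed underneath us and the allocation should be retried */
        } while (!page && read_mems_allowed_retry(seq));

        return page;
    }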

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.

    timeval_to_jiffies tried to round a timeval up to an integral number
    of jiffies, but the logic for doing so was incorrect: intervals
    corresponding to exactly N jiffies would become N+1. This manifested
    itself particularly when repeatedly stopping and starting an itimer:

    setitimer(ITIMER_PROF, &val, NULL);
    setitimer(ITIMER_PROF, NULL, &val);

    would add a full tick to val, _even if it was exactly representable in
    terms of jiffies_ (say, the result of a previous rounding.) Doing
    this repeatedly would cause unbounded growth in val. So fix the math.

    Here's what was wrong with the conversion: we essentially computed
    (eliding seconds)

    jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

    by using scaling arithmetic, which took the best approximation of
    NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
    x/(2^USEC_JIFFIE_SC), and computed:

    jiffies = (usec * x) >> USEC_JIFFIE_SC

    and rounded this calculation up in the intermediate form (since we
    can't necessarily exactly represent TICK_NSEC in usec.) But the
    scaling arithmetic is a (very slight) *over*approximation of the true
    value; that is, instead of dividing by (1 usec/ 1 jiffie), we
    effectively divided by (1 usec/1 jiffie)-epsilon (rounding
    down). This would normally be fine, but we want to round timeouts up,
    and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
    would be fine if our division was exact, but dividing this by the
    slightly smaller factor was equivalent to adding just _over_ 1 to the
    final result (instead of just _under_ 1, as desired.)

    In particular, with HZ=1000, we consistently computed that 10000 usec
    was 11 jiffies; the same was true for any exact multiple of
    TICK_NSEC.

    We could possibly still round in the intermediate form, adding
    something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
    convert usec->nsec, round in nanoseconds, and then convert using
    time*spec*_to_jiffies. This adds one constant multiplication, and is
    not observably slower in microbenchmarks on recent x86 hardware.
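
    The arithmetic idea, in isolation (this is only the rounding concept, not
    the kernel's implementation, which reuses the timespec conversion path;
    HZ and the derived TICK_NSEC below are illustrative):

    #include <stdio.h>

    #define HZ            1000UL
    #define NSEC_PER_USEC 1000UL
    #define NSEC_PER_SEC  1000000000UL
    #define TICK_NSEC     (NSEC_PER_SEC / HZ)

    static unsigned long usecs_to_jiffies_rounded(unsigned long usec)
    {
        unsigned long long nsec = (unsigned long long)usec * NSEC_PER_USEC;

        /* round up to a whole tick in nanoseconds, avoiding the
         * over-approximation of the old scaled-multiply path */
        return (unsigned long)((nsec + TICK_NSEC - 1) / TICK_NSEC);
    }

    int main(void)
    {
        /* prints 10, where the buggy conversion yielded 11 */
        printf("%lu usec -> %lu jiffies\n", 10000UL, usecs_to_jiffies_rounded(10000UL));
        return 0;
    }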

    Tested: the following program:

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/time.h>

    int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
            struct itimerval prev;
            setitimer(ITIMER_PROF, &zero, &prev);
            /* on old kernels, this goes up by TICK_USEC every iteration */
            printf("previous value: %ld %ld %ld %ld\n",
                   prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                   prev.it_value.tv_sec, prev.it_value.tv_usec);
            setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
    }

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Paul Turner
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Reviewed-by: Paul Turner
    Reported-by: Aaron Jacobs
    Signed-off-by: Andrew Hunter
    [jstultz: Tweaked to apply to 3.17-rc]
    Signed-off-by: John Stultz
    [bwh: Backported to 3.16: adjust filename]
    Signed-off-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Andrew Hunter
     
  • commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.

    Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
    fixed one bug but in the process caused another one. The reset is to
    update the header page, but that fix also changed the way the cached
    reads were updated. The cache reads are used to test if an iterator
    needs to be updated or not.

    A ring buffer iterator, when created, disables writes to the ring buffer
    but does not stop other readers or consuming reads from happening.
    Although all readers are synchronized via a lock, they are only
    synchronized when in the ring buffer functions. Those functions may
    be called by any number of readers. The iterator continues down when
    it's not interrupted by a consuming reader. If a consuming read
    occurs, the iterator starts from the beginning of the buffer.

    The way the iterator sees that a consuming read has happened since
    its last read is by checking the reader "cache". The cache holds the
    last counts of the read and the reader page itself.

    Commit 651e22f2701b changed what was saved by the cache_read when
    the rb_iter_reset() occurred, making the iterator never match the cache.
    Then if the iterator calls rb_iter_reset(), it will go into an
    infinite loop by checking if the cache doesn't match, doing the reset
    and retrying, just to see that the cache still doesn't match! Which
    should never happen as the reset is supposed to set the cache to the
    current value and there's locks that keep a consuming reader from
    having access to the data.

    Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.

    Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
    calling perf_event_free_task() when failing sched_fork() we will not yet
    have done the memset() on ->perf_event_ctxp[] and will therefore try and
    'free' the inherited contexts, which are still in use by the parent
    process. This is bad..

    Suggested-by: Oleg Nesterov
    Reported-by: Oleg Nesterov
    Reported-by: Sylvain 'ythier' Hitier
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

06 Oct, 2014

4 commits

  • commit 43e8317b0bba1d6eb85f38a4a233d82d7c20d732 upstream.

    Use the observation that, for platform-dependent sleep states
    (PM_SUSPEND_STANDBY, PM_SUSPEND_MEM), a given state is either
    always supported or always unsupported and store that information
    in pm_states[] instead of calling valid_state() every time we
    need to check it.

    Also do not use valid_state() for PM_SUSPEND_FREEZE, which is always
    valid, and move the pm_test_level validity check for PM_SUSPEND_FREEZE
    directly into enter_state().

    Signed-off-by: Rafael J. Wysocki
    Cc: Brian Norris
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • commit 27ddcc6596e50cb8f03d2e83248897667811d8f6 upstream.

    To allow sleep states corresponding to the "mem", "standby" and
    "freeze" lables to be different from the pm_states[] indexes of
    those strings, introduce struct pm_sleep_state, consisting of
    a string label and a state number, and turn pm_states[] into an
    array of objects of that type.

    This modification should not lead to any functional changes.
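
    The shape of the new entries, per the description above (field names are
    assumptions for illustration, not copied from kernel/power/suspend.c):

    typedef int suspend_state_t;

    struct pm_sleep_state {
        const char *label;        /* "mem", "standby", "freeze" */
        suspend_state_t state;    /* PM_SUSPEND_* value, may differ from the index */
    };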

    Signed-off-by: Rafael J. Wysocki
    Cc: Brian Norris
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • commit 3577af70a2ce4853d58e57d832e687d739281479 upstream.

    We saw a kernel soft lockup in perf_remove_from_context(),
    it looks like the `perf` process, when exiting, could not go
    out of the retry loop. Meanwhile, the target process was forking
    a child. So either the target process should execute the smp
    function call to deactivate the event (if it was running) or it should
    do a context switch which deactivates the event.

    It seems we optimize out a context switch in perf_event_context_sched_out(),
    and what's more important, we still test an obsolete task pointer when
    retrying, so no one actually would deactivate that event in this situation.
    Fix it directly by reloading the task pointer in perf_remove_from_context().

    This should cure the above soft lockup.

    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • commit 474e941bed9262f5fa2394f9a4a67e24499e5926 upstream.

    Locks the k_itimer's it_lock member when handling the alarm timer's
    expiry callback.

    The regular posix timers defined in posix-timers.c have this lock held
    during timeout processing because their callbacks are routed through
    posix_timer_fn(). The alarm timers follow a different path, so they
    ought to grab the lock somewhere else.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Signed-off-by: Richard Larocque
    Signed-off-by: John Stultz
    Signed-off-by: Greg Kroah-Hartman

    Richard Larocque