11 Feb, 2015
2 commits
-
This patch introduces generic code to perform PM domain look-up using
device tree and automatically bind devices to their PM domains.Generic device tree bindings are introduced to specify PM domains of
devices in their device tree nodes.Backwards compatibility with legacy Samsung-specific PM domain bindings
is provided, but for now the new code is not compiled when
CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code.
This will change as soon as the Exynos PM domain code gets converted to
use the generic framework in further patch.This patch was originally submitted by Tomasz Figa when he was employed
by Samsung.Link: http://marc.info/?l=linux-pm&m=139955349702152&w=2
Signed-off-by: Ulf Hansson
Acked-by: Rob Herring
Tested-by: Philipp Zabel
Reviewed-by: Kevin Hilman
Signed-off-by: Rafael J. Wysocki
(cherry picked from commit aa42240ab2544a8bcb2efb400193826f57f3175e)
(cherry picked from commit 4a2d7a846761e3b86e08b903e5a1a088686e2181) -
This reverts commit 4aa055cb0634bc8d0389070104fe6aa7cfa99b8c.
Signed-off-by: Robin Gong(cherry picked from commit e599f64de890a60a3b9884dd5838c43472f145e2)
16 Jan, 2015
2 commits
-
Move the x86_64 idle notifiers originally by Andi Kleen and Venkatesh
Pallipadi to generic.Change-Id: Idf29cda15be151f494ff245933c12462643388d5
Acked-by: Nicolas Pitre
Signed-off-by: Todd Poynor -
This patch introduces generic code to perform power domain look-up using
device tree and automatically bind devices to their power domains.
Generic device tree binding is introduced to specify power domains of
devices in their device tree nodes.Backwards compatibility with legacy Samsung-specific power domain
bindings is provided, but for now the new code is not compiled when
CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code. This
will change as soon as Exynos power domain code gets converted to use
the generic framework in further patch.Signed-off-by: Tomasz Figa
Reviewed-by: Mark Brown
Reviewed-by: Kevin Hilman
Reviewed-by: Philipp Zabel
[on i.MX6 GK802]
Tested-by: Philipp Zabel
Reviewed-by: Ulf Hansson
[shawn.guo: http://thread.gmane.org/gmane.linux.kernel.samsung-soc/31029]
Signed-off-by: Shawn Guo
09 Jan, 2015
11 commits
-
commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream.
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
fails because disable_pid_allocation() was called by the exiting
child_reaper.We could simply move get_pid_ns() down to successful return, but this fix
tries to be as trivial as possible.Signed-off-by: Oleg Nesterov
Reviewed-by: "Eric W. Biederman"
Cc: Aaron Tomlin
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: Sterling Alexander
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman -
commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream.
A regression was caused by commit 780a7654cee8:
audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.This broke userspace by not returning the same information that was sent and
expected.The rule:
auditctl -a exit,never -F auid=-1
gives:
auditctl -l
LIST_RULES: exit,never f24=0 syscall=all
when it should give:
LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=allTag it so that it is reported the same way it was set. Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.Signed-off-by: Richard Guy Briggs
Signed-off-by: Paul Moore
Signed-off-by: Greg Kroah-Hartman -
commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream.
Now that setgroups can be disabled and not reenabled, setting gid_map
without privielge can now be enabled when setgroups is disabled.This restores most of the functionality that was lost when unprivileged
setting of gid_map was removed. Applications that use this functionality
will need to check to see if they use setgroups or init_groups, and if they
don't they can be fixed by simply disabling setgroups before writing to
gid_map.Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.
- Expose the knob to user space through a proc file /proc//setgroups
A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.A value of "allow" means the segtoups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream.
Generalize id_map_mutex so it can be used for more state of a user namespace.
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream.
If you did not create the user namespace and are allowed
to write to uid_map or gid_map you should already have the necessary
privilege in the parent user namespace to establish any mapping
you want so this will not affect userspace in practice.Limiting unprivileged uid mapping establishment to the creator of the
user namespace makes it easier to verify all credentials obtained with
the uid mapping can be obtained without the uid mapping without
privilege.Limiting unprivileged gid mapping establishment (which is temporarily
absent) to the creator of the user namespace also ensures that the
combination of uid and gid can already be obtained without privilege.This is part of the fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream.
setresuid allows the euid to be set to any of uid, euid, suid, and
fsuid. Therefor it is safe to allow an unprivileged user to map
their euid and use CAP_SETUID privileged with exactly that uid,
as no new credentials can be obtained.I can not find a combination of existing system calls that allows setting
uid, euid, suid, and fsuid from the fsuid making the previous use
of fsuid for allowing unprivileged mappings a bug.This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream.
As any gid mapping will allow and must allow for backwards
compatibility dropping groups don't allow any gid mappings to be
established without CAP_SETGID in the parent user namespace.For a small class of applications this change breaks userspace
and removes useful functionality. This small class of applications
includes tools/testing/selftests/mount/unprivilged-remount-test.cMost of the removed functionality will be added back with the addition
of a one way knob to disable setgroups. Once setgroups is disabled
setting the gid_map becomes as safe as setting the uid_map.For more common applications that set the uid_map and the gid_map
with privilege this change will have no affect.This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.Reviewed-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream.
The rule is simple. Don't allow anything that wouldn't be allowed
without unprivileged mappings.It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it. So allowing a uid or gid mapping to be
established without privielge that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated. Violated expectations about the
behavior of the OS is a long way to say a security issue.Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman -
commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.
Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged. Add a common function so that
they may all share the same permission checking code.This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.A user namespace security fix will update this new helper, shortly.
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman
07 Dec, 2014
1 commit
-
commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream.
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.Reported-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Borislav Petkov
Signed-off-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
22 Nov, 2014
6 commits
-
commit b3f207855f57b9c8f43a547a801340bb5cbc59e5 upstream.
When running a 32-bit userspace on a 64-bit kernel (eg. i386
application on x86_64 kernel or 32-bit arm userspace on arm64
kernel) some of the perf ioctls must be treated with special
care, as they have a pointer size encoded in the command.For example, PERF_EVENT_IOC_ID in 32-bit world will be encoded
as 0x80042407, but 64-bit kernel will expect 0x80082407. In
result the ioctl will fail returning -ENOTTY.This patch solves the problem by adding code fixing up the
size as compat_ioctl file operation.Reported-by: Drew Richardson
Signed-off-by: Pawel Moll
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Link: http://lkml.kernel.org/r/1402671812-9078-1-git-send-email-pawel.moll@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: David Ahern
Signed-off-by: Greg Kroah-Hartman -
commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream.
The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:* Is the thread we are trying to wake up the current thread?
* Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
* Is there no thread created for this flavour, hence nothing to wake up?If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.Idle:
Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0rcutorture:
Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.There is a comment about an implied barrier supplied by the wake_up()
logic. This barrier is necessary for the awakened thread to see the
updated ->gp_flags. This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.Signed-off-by: Pranith Kumar
CC: Mathieu Desnoyers
Signed-off-by: Paul E. McKenney
Cc: Kamal Mostafa
Signed-off-by: Greg Kroah-Hartman -
commit 48a7639ce80cf279834d0d44865e49ecd714f37d upstream.
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.Signed-off-by: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Reviewed-by: Josh Triplett
[ Pranith: backport to 3.13-stable: just rcu_gp_kthread_wake(),
prereq for 2aa792e "rcu: Use rcu_gp_kthread_wake() to wake up grace
period kthreads" ]
Signed-off-by: Pranith Kumar
Signed-off-by: Kamal Mostafa
Signed-off-by: Greg Kroah-Hartman -
commit 799b601451b21ebe7af0e6e8f6e2ccd4683c5064 upstream.
Audit rules disappear when an inode they watch is evicted from the cache.
This is likely not what we want.The guilty commit is "fsnotify: allow marks to not pin inodes in core",
which didn't take into account that audit_tree adds watches with a zero
mask.Adding any mask should fix this.
Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
Signed-off-by: Miklos Szeredi
Signed-off-by: Paul Moore
Signed-off-by: Greg Kroah-Hartman -
commit 897f1acbb6702ddaa953e8d8436eee3b12016c7e upstream.
Add a space between subj= and feature= fields to make them parsable.
Signed-off-by: Richard Guy Briggs
Signed-off-by: Paul Moore
Signed-off-by: Greg Kroah-Hartman -
commit 9ef91514774a140e468f99d73d7593521e6d25dc upstream.
When an AUDIT_GET_FEATURE message is sent from userspace to the kernel, it
should reply with a message tagged as an AUDIT_GET_FEATURE type with a struct
audit_feature. The current reply is a message tagged as an AUDIT_GET
type with a struct audit_feature.This appears to have been a cut-and-paste-eo in commit b0fed40.
Reported-by: Steve Grubb
Signed-off-by: Richard Guy Briggs
Signed-off-by: Greg Kroah-Hartman
15 Nov, 2014
8 commits
-
commit f1e3a0932f3a9554371792a7daaf1e0eb19f66d5 upstream.
Probability of use-after-free isn't zero in this place.
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Paul E. McKenney
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 6891c4509c792209c44ced55a60f13954cb50ef4 upstream.
If userland creates a timer without specifying a sigevent info, we'll
create one ourself, using a stack local variable. Particularly will we
use the timer ID as sival_int. But as sigev_value is a union containing
a pointer and an int, that assignment will only partially initialize
sigev_value on systems where the size of a pointer is bigger than the
size of an int. On such systems we'll copy the uninitialized stack bytes
from the timer_create() call to userland when the timer actually fires
and we're going to deliver the signal.Initialize sigev_value with 0 to plug the stack info leak.
Found in the PaX patch, written by the PaX Team.
Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
Signed-off-by: Mathias Krause
Cc: Oleg Nesterov
Cc: Brad Spengler
Cc: PaX Team
Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman -
commit 94fb823fcb4892614f57e59601bb9d4920f24711 upstream.
If a device's dev_pm_ops::freeze callback fails during the QUIESCE
phase, we don't rollback things correctly calling the thaw and complete
callbacks. This could leave some devices in a suspended state in case of
an error during resuming from hibernation.Signed-off-by: Imre Deak
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman -
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per RafaelFixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Signed-off-by: Michal Hocko
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman -
commit 51fae6da640edf9d266c94f36bc806c63c301991 upstream.
Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
before deferring) OOM killer relies on being able to thaw a frozen task
to handle OOM situation but a3201227f803 (freezer: make freezing() test
freeze conditions in effect instead of TIF_FREEZE) has reorganized the
code and stopped clearing freeze flag in __thaw_task. This means that
the target task only wakes up and goes into the fridge again because the
freezing condition hasn't changed for it. This reintroduces the bug
fixed by f660daac474c6f.Fix the issue by checking for TIF_MEMDIE thread flag in
freezing_slow_path and exclude the task from freezing completely. If a
task was already frozen it would get woken by __thaw_task from OOM killer
and get out of freezer after rechecking freezing().Changes since v1
- put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator
as per Oleg
- return __thaw_task into oom_scan_process_thread because
oom_kill_process will not wake task in the fridge because it is
sleeping uninterruptible[mhocko@suse.cz: rewrote the changelog]
Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
Signed-off-by: Cong Wang
Signed-off-by: Michal Hocko
Acked-by: Oleg Nesterov
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman -
commit d3051b489aa81ca9ba62af366149ef42b8dae97c upstream.
A panic was seen in the following sitation.
There are two threads running on the system. The first thread is a system
monitoring thread that is reading /proc/modules. The second thread is
loading and unloading a module (in this example I'm using my simple
dummy-module.ko). Note, in the "real world" this occurred with the qlogic
driver module.When doing this, the following panic occurred:
------------[ cut here ]------------
kernel BUG at kernel/module.c:3739!
invalid opcode: 0000 [#1] SMP
Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
CPU: 37 PID: 186343 Comm: cat Tainted: GF O-------------- 3.10.0+ #7
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
RIP: 0010:[] [] module_flags+0xb5/0xc0
RSP: 0018:ffff88080fa7fe18 EFLAGS: 00010246
RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
FS: 00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
Call Trace:
[] m_show+0x19c/0x1e0
[] seq_read+0x16e/0x3b0
[] proc_reg_read+0x3d/0x80
[] vfs_read+0x9c/0x170
[] SyS_read+0x58/0xb0
[] system_call_fastpath+0x16/0x1b
Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
RIP [] module_flags+0xb5/0xc0
RSPConsider the two processes running on the system.
CPU 0 (/proc/modules reader)
CPU 1 (loading/unloading module)CPU 0 opens /proc/modules, and starts displaying data for each module by
traversing the modules list via fs/seq_file.c:seq_open() and
fs/seq_file.c:seq_read(). For each module in the modules list, seq_read
doesop->start() show() stop() state == MODULE_STATE_UNFORMED);
...The other thread, CPU 1, in unloading the module calls the syscall
delete_module() defined in kernel/module.c. The module_mutex is acquired
for a short time, and then released. free_module() is called without the
module_mutex. free_module() then sets mod->state = MODULE_STATE_UNFORMED,
also without the module_mutex. Some additional code is called and then the
module_mutex is reacquired to remove the module from the modules list:/* Now we can delete it from the lists */
mutex_lock(&module_mutex);
stop_machine(__unlink_module, mod, NULL);
mutex_unlock(&module_mutex);This is the sequence of events that leads to the panic.
CPU 1 is removing dummy_module via delete_module(). It acquires the
module_mutex, and then releases it. CPU 1 has NOT set dummy_module->state to
MODULE_STATE_UNFORMED yet.CPU 0, which is reading the /proc/modules, acquires the module_mutex and
acquires a pointer to the dummy_module which is still in the modules list.
CPU 0 calls m_show for dummy_module. The check in m_show() for
MODULE_STATE_UNFORMED passed for dummy_module even though it is being
torn down.Meanwhile CPU 1, which has been continuing to remove dummy_module without
holding the module_mutex, now calls free_module() and sets
dummy_module->state to MODULE_STATE_UNFORMED.CPU 0 now calls module_flags() with dummy_module and ...
static char *module_flags(struct module *mod, char *buf)
{
int bx = 0;BUG_ON(mod->state == MODULE_STATE_UNFORMED);
and BOOM.
Acquire and release the module_mutex lock around the setting of
MODULE_STATE_UNFORMED in the teardown path, which should resolve the
problem.Testing: In the unpatched kernel I can panic the system within 1 minute by
doingwhile (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done
and
while (true) do cat /proc/modules; done
in separate terminals.
In the patched kernel I was able to run just over one hour without seeing
any issues. I also verified the output of panic via sysrq-c and the output
of /proc/modules looks correct for all three states for the dummy_module.dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)Signed-off-by: Prarit Bhargava
Reviewed-by: Oleg Nesterov
Signed-off-by: Rusty Russell
Signed-off-by: Greg Kroah-Hartman -
commit 66339c31bc3978d5fff9c4b4cb590a861def4db2 upstream.
dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
Probability of use-after-free isn't zero here.Also add lockdep assert into dl_bw_cpus().
Signed-off-by: Kirill Tkhai
Signed-off-by: Peter Zijlstra (Intel)
Cc: Paul E. McKenney
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 086ba77a6db00ed858ff07451bedee197df868c9 upstream.
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
Signed-off-by: Rabin Vincent
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
31 Oct, 2014
1 commit
-
commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0 upstream.
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.Signed-off-by: Catalin Marinas
Reported-by: Matteo Franchin
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
Acked-by: Davidlohr Bueso
Tested-by: Mike Galbraith
Cc: Darren Hart
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Paul E. McKenney
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
10 Oct, 2014
5 commits
-
commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: Linus Torvalds
Reviewed-by: Michel Lespinasse
Cc: Oleg Nesterov
Tested-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Greg Kroah-Hartman -
commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.
Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.Signed-off-by: Peter Zijlstra
Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Greg Kroah-Hartman -
commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing
this repeatedly would cause unbounded growth in val. So fix the math.Here's what was wrong with the conversion: we essentially computed
(eliding seconds)jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)
by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:jiffies = (usec * x) >> USEC_JIFFIE_SC
and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.Tested: the following program:
int main() {
struct itimerval zero = {{0, 0}, {0, 0}};
/* Initially set to 10 ms. */
struct itimerval initial = zero;
initial.it_interval.tv_usec = 10000;
setitimer(ITIMER_PROF, &initial, NULL);
/* Save and restore several times. */
for (size_t i = 0; i < 10; ++i) {
struct itimerval prev;
setitimer(ITIMER_PROF, &zero, &prev);
/* on old kernels, this goes up by TICK_USEC every iteration */
printf("previous value: %ld %ld %ld %ld\n",
prev.it_interval.tv_sec, prev.it_interval.tv_usec,
prev.it_value.tv_sec, prev.it_value.tv_usec);
setitimer(ITIMER_PROF, &prev, NULL);
}
return 0;
}Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Paul Turner
Cc: Richard Cochran
Cc: Prarit Bhargava
Reviewed-by: Paul Turner
Reported-by: Aaron Jacobs
Signed-off-by: Andrew Hunter
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings
Signed-off-by: Greg Kroah-Hartman -
commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.
Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.Commit 651e22f2701b changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman -
commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.
Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try and
'free' the inherited contexts, which are still in use by the parent
process. This is bad..Suggested-by: Oleg Nesterov
Reported-by: Oleg Nesterov
Reported-by: Sylvain 'ythier' Hitier
Signed-off-by: Peter Zijlstra (Intel)
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
06 Oct, 2014
4 commits
-
commit 43e8317b0bba1d6eb85f38a4a233d82d7c20d732 upstream.
Use the observation that, for platform-dependent sleep states
(PM_SUSPEND_STANDBY, PM_SUSPEND_MEM), a given state is either
always supported or always unsupported and store that information
in pm_states[] instead of calling valid_state() every time we
need to check it.Also do not use valid_state() for PM_SUSPEND_FREEZE, which is always
valid, and move the pm_test_level validity check for PM_SUSPEND_FREEZE
directly into enter_state().Signed-off-by: Rafael J. Wysocki
Cc: Brian Norris
Signed-off-by: Greg Kroah-Hartman -
commit 27ddcc6596e50cb8f03d2e83248897667811d8f6 upstream.
To allow sleep states corresponding to the "mem", "standby" and
"freeze" lables to be different from the pm_states[] indexes of
those strings, introduce struct pm_sleep_state, consisting of
a string label and a state number, and turn pm_states[] into an
array of objects of that type.This modification should not lead to any functional changes.
Signed-off-by: Rafael J. Wysocki
Cc: Brian Norris
Signed-off-by: Greg Kroah-Hartman -
commit 3577af70a2ce4853d58e57d832e687d739281479 upstream.
We saw a kernel soft lockup in perf_remove_from_context(),
it looks like the `perf` process, when exiting, could not go
out of the retry loop. Meanwhile, the target process was forking
a child. So either the target process should execute the smp
function call to deactive the event (if it was running) or it should
do a context switch which deactives the event.It seems we optimize out a context switch in perf_event_context_sched_out(),
and what's more important, we still test an obsolete task pointer when
retrying, so no one actually would deactive that event in this situation.
Fix it directly by reloading the task pointer in perf_remove_from_context().This should cure the above soft lockup.
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: Peter Zijlstra
Cc: Paul Mackerras
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 474e941bed9262f5fa2394f9a4a67e24499e5926 upstream.
Locks the k_itimer's it_lock member when handling the alarm timer's
expiry callback.The regular posix timers defined in posix-timers.c have this lock held
during timout processing because their callbacks are routed through
posix_timer_fn(). The alarm timers follow a different path, so they
ought to grab the lock somewhere else.Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Richard Cochran
Cc: Prarit Bhargava
Cc: Sharvil Nanavati
Signed-off-by: Richard Larocque
Signed-off-by: John Stultz
Signed-off-by: Greg Kroah-Hartman