Eric Lee / smarc-fsl-linux-kernel

24 May, 2016

1 commit

17b0573d7 prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable ... Browse Code »

PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.

Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: Alex Thorlton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-05-24 08:04:14 +0800

18 Mar, 2016

1 commit

da8b44d5a timer: convert timer_slack_ns from unsigned long to u64 ... Browse Code »

This patchset introduces a /proc//timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc//timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real 0m10.747s
user 0m0.001s
sys 0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long. This means that on 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64. This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz
Cc: Arjan van de Ven
Cc: Thomas Gleixner
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Kees Cook
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Stultz
2016-03-18 06:09:34 +0800

21 Jan, 2016

1 commit

ddf1d398e prctl: take mmap sem for writing to protect against others ... Browse Code »

An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.

proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);

These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:

kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace 97882617ae9c6818 ]---

Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.

Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.

The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.

The second patch is optional and puts locking around aformentioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.

This patch (of 2):

The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.

The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.

Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.

Signed-off-by: Mateusz Guzik
Acked-by: Cyrill Gorcunov
Cc: Alexey Dobriyan
Cc: Jarod Wilson
Cc: Jan Stancek
Cc: Al Viro
Cc: Anshuman Khandual
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mateusz Guzik
2016-01-21 09:09:18 +0800

07 Nov, 2015

1 commit

8639b4613 pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode ... Browse Code »

setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
the current pid namespace. This is in contrast to both the other modes of
setpriority and the example of kill(-1). Fix this. getpriority and
ioprio have the same failure mode, fix them too.

Eric said:

: After some more thinking about it this patch sounds justifiable.
:
: My goal with namespaces is not to build perfect isolation mechanisms
: as that can get into ill defined territory, but to build well defined
: mechanisms. And to handle the corner cases so you can use only
: a single namespace with well defined results.
:
: In this case you have found the two interfaces I am aware of that
: identify processes by uid instead of by pid. Which quite frankly is
: weird. Unfortunately the weird unexpected cases are hard to handle
: in the usual way.
:
: I was hoping for a little more information. Changes like this one we
: have to be careful of because someone might be depending on the current
: behavior. I don't think they are and I do think this make sense as part
: of the pid namespace.

Signed-off-by: Ben Segall
Cc: Oleg Nesterov
Cc: Al Viro
Cc: Ambrose Feinstein
Acked-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ben Segall
2015-11-07 09:50:42 +0800

10 Jul, 2015

1 commit

90f8572b0 vfs: Commit to never having exectuables on proc and sysfs. ... Browse Code »

Today proc and sysfs do not contain any executable files. Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.

Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.

Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.

The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.

This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.

Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc. Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-10 23:39:25 +0800

26 Jun, 2015

1 commit

4a00e9df2 prctl: more prctl(PR_SET_MM_*) checks ... Browse Code »

Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
consistent view of mm->arg_start et al fields, but not enough. In
particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
but don't check that the start address is lower than the end address _at
all_.

Consolidate all consistency checks, so there will be no difference in
the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.

The program below makes both ARGV and ENVP areas be reversed. It makes
/proc/$PID/cmdline show garbage (it doesn't oops by luck).

#include
#include
#include

enum {PAGE_SIZE=4096};

int main(void)
{
void *p;

p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

#define PR_SET_MM 35
#define PR_SET_MM_ARG_START 8
#define PR_SET_MM_ARG_END 9
#define PR_SET_MM_ENV_START 10
#define PR_SET_MM_ENV_END 11
prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0);

pause();
return 0;
}

[akpm@linux-foundation.org: tidy code, tweak comment]
Signed-off-by: Alexey Dobriyan
Acked-by: Cyrill Gorcunov
Cc: Jarod Wilson
Cc: Jan Stancek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2015-06-26 08:00:37 +0800

09 Jun, 2015

1 commit

46a6e0cf1 x86/mpx: Clean up the code by not passing a task pointer around when unnecessary ... Browse Code »

The MPX code can only work on the current task. You can not,
for instance, enable MPX management in another process or
thread. You can also not handle a fault for another process or
thread.

Despite this, we pass a task_struct around prolifically. This
patch removes all of the task struct passing for code paths
where the code can not deal with another task (which turns out
to be all of them).

This has no functional changes. It's just a cleanup.

Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Andrew Morton
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar

Dave Hansen
2015-06-09 18:24:30 +0800

17 Apr, 2015

1 commit

6e399cd14 prctl: avoid using mmap_sem for exe_file serialization ... Browse Code »

Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
of calling set_mm_exe_file() which requires some form of serialization --
mmap_sem in this case. For archs that do not have atomic rmw instructions
we still fallback to a spinlock alternative, so this should always be
safe. As such, we only need the mmap_sem for looking up the backing
vm_file, which can be done sharing the lock. Naturally, this means we
need to manually deal with both the new and old file reference counting,
and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
probably be deleted in the future anyway.

Signed-off-by: Davidlohr Bueso
Suggested-by: Oleg Nesterov
Acked-by: Oleg Nesterov
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2015-04-17 21:04:07 +0800

16 Apr, 2015

1 commit

2813893f8 kernel: conditionally support non-root users, groups and capabilities ... Browse Code »

There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda
Reviewed-by: Josh Triplett
Acked-by: Geert Uytterhoeven
Tested-by: Paul E. McKenney
Reviewed-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Iulia Manda
2015-04-16 07:35:22 +0800

01 Mar, 2015

1 commit

39afb5ee4 kernel/sys.c: fix UNAME26 for 4.0 ... Browse Code »

There's a uname workaround for broken userspace which can't handle kernel
versions of 3.x. Update it for 4.x.

Signed-off-by: Jon DeVree
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jon DeVree
2015-03-01 01:57:51 +0800

22 Feb, 2015

1 commit

a135c717d Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus ... Browse Code »

Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS:

- a number of fixes that didn't make the 3.19 release.

- a number of cleanups.

- preliminary support for Cavium's Octeon 3 SOCs which feature up to
48 MIPS64 R3 cores with FPU and hardware virtualization.

- support for MIPS R6 processors.

Revision 6 of the MIPS architecture is a major revision of the MIPS
architecture which does away with many of original sins of the
architecture such as branch delay slots. This and other changes in
R6 require major changes throughout the entire MIPS core
architecture code and make up for the lion share of this pull
request.

- finally some preparatory work for eXtendend Physical Address
support, which allows support of up to 40 bit of physical address
space on 32 bit processors"

[ Ahh, MIPS can't leave the PAE brain damage alone. It's like
every CPU architect has to make that mistake, but pee in the snow
by changing the TLA. But whether it's called PAE, LPAE or XPA,
it's horrid crud - Linus ]

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits)
MIPS: sead3: Corrected get_c0_perfcount_int
MIPS: mm: Remove dead macro definitions
MIPS: OCTEON: irq: add CIB and other fixes
MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs.
MIPS: OCTEON: More OCTEONIII support
MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits.
MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup.
MIPS: OCTEON: Update octeon-model.h code for new SoCs.
MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX
MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h
MIPS: OCTEON: Implement the core-16057 workaround
MIPS: OCTEON: Delete unused COP2 saving code
MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register
MIPS: OCTEON: Save and restore CP2 SHA3 state
MIPS: OCTEON: Fix FP context save.
MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs
MIPS: boot: Provide more uImage options
MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h
MIPS: ip22-gio: Remove legacy suspend/resume support
mips: pci: Add ifdef around pci_proc_domain
...

Linus Torvalds
2015-02-22 11:41:38 +0800

12 Feb, 2015

1 commit

9791554b4 MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS ... Browse Code »

Userland code may be built using an ABI which permits linking to objects
that have more restrictive floating point requirements. For example,
userland code may be built to target the O32 FPXX ABI. Such code may be
linked with other FPXX code, or code built for either one of the more
restrictive FP32 or FP64. When linking with more restrictive code, the
overall requirement of the process becomes that of the more restrictive
code. The kernel has no way to know in advance which mode the process
will need to be executed in, and indeed it may need to change during
execution. The dynamic loader is the only code which will know the
overall required mode, and so it needs to have a means to instruct the
kernel to switch the FP mode of the process.

This patch introduces 2 new options to the prctl syscall which provide
such a capability. The FP mode of the process is represented as a
simple bitmask combining a number of mode bits mirroring those present
in the hardware. Userland can either retrieve the current FP mode of
the process:

mode = prctl(PR_GET_FP_MODE);

or modify the current FP mode of the process:

err = prctl(PR_SET_FP_MODE, new_mode);

Signed-off-by: Paul Burton
Cc: Matthew Fortune
Cc: Markos Chandras
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/8899/
Signed-off-by: Ralf Baechle

Paul Burton
2015-02-12 19:30:29 +0800

23 Jan, 2015

1 commit

e9d1b4f3c x86, mpx: Strictly enforce empty prctl() args ... Browse Code »

Description from Michael Kerrisk. He suggested an identical patch
to one I had already coded up and tested.

commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
that unused arguments are zero, as is done in many existing prctl()s
and as should be done for all new prctl()s. This patch adds the
required checks.

Suggested-by: Andy Lutomirski
Suggested-by: Michael Kerrisk
Signed-off-by: Dave Hansen
Cc: Dave Hansen
Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner

Dave Hansen
2015-01-23 04:11:06 +0800

18 Nov, 2014

1 commit

fe3d197f8 x86, mpx: On-demand kernel allocation of bounds tables ... Browse Code »

This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).

Long Description:

This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-popualated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Based-on-patch-by: Qiaowei Ren
Signed-off-by: Dave Hansen
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner

Dave Hansen
2014-11-18 07:58:53 +0800

13 Oct, 2014

1 commit

faafcba3b Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:

- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)

- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

- sched/numa updates and optimizations (Rik van Riel)

- sysbench speedup (Vincent Guittot)

- capacity calculation cleanups/refactoring (Vincent Guittot)

- Various cleanups to thread group iteration (Oleg Nesterov)

- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)

- various sched/deadline fixes

... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...

Linus Torvalds
2014-10-13 22:23:15 +0800

10 Oct, 2014

6 commits

0baae41ea kernel/sys.c: compat sysinfo syscall: fix undefined behavior ... Browse Code »

Fix undefined behavior and compiler warning by replacing right shift 32
with upper_32_bits macro

Signed-off-by: Scotty Bauer
Cc: Clemens Ladisch
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Scotty Bauer
2014-10-10 10:26:04 +0800
ec94fc3d5 kernel/sys.c: whitespace fixes ... Browse Code »

Fix minor errors and warning messages in kernel/sys.c. These errors were
reported by checkpatch while working with some modifications in sys.c
file. Fixing this first will help me to improve my further patches.

ERROR: trailing whitespace - 9
ERROR: do not use assignment in if condition - 4
ERROR: spaces required around that '?' (ctx:VxO) - 10
ERROR: switch and case should be at the same indent - 3

total 26 errors & 3 warnings fixed.

Signed-off-by: vishnu.ps
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

vishnu.ps
2014-10-10 10:26:04 +0800
96dad67ff mm: use VM_BUG_ON_MM where possible ... Browse Code »

Dump the contents of the relevant struct_mm when we hit the bug condition.

Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2014-10-10 10:25:58 +0800
f606b77f1 prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation ... Browse Code »

During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.

A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.

Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.

prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

struct prctl_mm_map {
__u64 start_code;
__u64 end_code;
__u64 start_data;
__u64 end_data;
__u64 start_brk;
__u64 brk;
__u64 start_stack;
__u64 arg_start;
__u64 arg_end;
__u64 env_start;
__u64 env_end;
__u64 *auxv;
__u32 auxv_size;
__u32 exe_fd;
};

All members except @exe_fd correspond ones of struct mm_struct. To figure
out which available values these members may take here are meanings of the
members.

- start_code, end_code: represent bounds of executable code area
- start_data, end_data: represent bounds of data area
- start_brk, brk: used to calculate bounds for brk() syscall
- start_stack: used when accounting space needed for command
line arguments, environment and shmat() syscall
- arg_start, arg_end, env_start, env_end: represent memory area
supplied for command line arguments and environment variables
- auxv, auxv_size: carries auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for executable link (/proc/self/exe)

Thus we apply the following requirements to the values

1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
interval.

2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
VMAs (say a program maps own new .text and .data segments during execution)
the rest of members should belong to VMA which must exist.

3) Addresses must be ordered, ie @start_ member must not be greater or
equal to appropriate @end_ member.

4) As in regular Elf loading procedure we require that @start_brk and
@brk be greater than @end_data.

5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
exceed existing limit. Same applies to RLIMIT_STACK.

6) Auxiliary vector size must not exceed existing one (which is
predefined as AT_VECTOR_SIZE and depends on architecture).

7) File descriptor passed in @exe_file should be pointing
to executable file (because we use existing prctl_set_mm_exe_file_locked
helper it ensures that the file we are going to use as exe link has all
required permission granted).

Now about where these members are involved inside kernel code:

- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

- @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
also they are considered if there enough space for brk() syscall
result if RLIMIT_DATA is set;

- @start_brk shown in /proc/$pid/stat output and accounted in brk()
syscall if RLIMIT_DATA is set; also this member is tested to
find a symbolic name of mmap event for perf system (we choose
if event is generated for "heap" area); one more aplication is
selinux -- we test if a process has PROCESS__EXECHEAP permission
if trying to make heap area being executable with mprotect() syscall;

- @brk is a current value for brk() syscall which lays inside heap
area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
provides new memory area to a user space upon brk() completion the
mm::brk is updated to carry new value;

Both @start_brk and @brk are actively used in /proc/$pid/maps
and /proc/$pid/smaps output to find a symbolic name "heap" for
VMA being scanned;

- @start_stack is printed out in /proc/$pid/stat and used to
find a symbolic name "stack" for task and threads in
/proc/$pid/maps and /proc/$pid/smaps output, and as the same
as with @start_brk -- perf system uses it for event naming.
Also kernel treat this member as a start address of where
to map vDSO pages and to check if there is enough space
for shmat() syscall;

- @arg_start, @arg_end, @env_start and @env_end are printed out
in /proc/$pid/stat. Another access to the data these members
represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
Any attempt to read these areas kernel tests with access_process_vm
helper so a user must have enough rights for this action;

- @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
speaking kernel doesn't care much about which exactly data is
sitting there because it is solely for userspace;

- @exe_fd is referred from /proc/$pid/exe and when generating
coredump. We uses prctl_set_mm_exe_file_locked helper to update
this member, so exe-file link modification remains one-shot
action.

Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory). Still we require the
caller to be at least user-namespace root user.

I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.

To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.

[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Acked-by: Andrew Vagin
Tested-by: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2014-10-10 10:25:55 +0800
71fe97e18 prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file ... Browse Code »

Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out
and rename the helper to prctl_set_mm_exe_file_locked(). This will allow
to reuse this function in a next patch.

Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2014-10-10 10:25:55 +0800
8764b338b mm: use may_adjust_brk helper ... Browse Code »

Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2014-10-10 10:25:55 +0800

08 Sep, 2014

1 commit

e78c34967 time, signal: Protect resource use statistics with seqlock ... Browse Code »

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alex Thorlton
Cc: Andrew Morton
Cc: Daeseok Youn
Cc: David Rientjes
Cc: Dongsheng Yang
Cc: Geert Uytterhoeven
Cc: Guillaume Morin
Cc: Ionut Alexa
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Li Zefan
Cc: Michal Hocko
Cc: Michal Schmidt
Cc: Oleg Nesterov
Cc: Vladimir Davydov
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar

Rik van Riel
2014-09-08 14:17:01 +0800

19 Jul, 2014

1 commit

1d4457f99 sched: move no_new_privs into new atomic flags ... Browse Code »

Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:38 +0800

22 May, 2014

1 commit

7aa2c016d sched: Consolidate open coded implementations of nice level frobbing into nice_t… ... Browse Code »

…o_rlimit() and rlimit_to_nice()

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Dongsheng Yang
2014-05-22 17:16:36 +0800

08 Apr, 2014

1 commit

a0715cc22 mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE ... Browse Code »

Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It
also adds a prctl control which allows us to set the THP disable bit in
mm->def_flags so that VMs will pick up the setting as they are created.

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Acked-by: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: Johannes Weiner
Cc: David Rientjes
Cc: Paolo Bonzini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex Thorlton
2014-04-08 07:35:52 +0800

23 Feb, 2014

1 commit

c4a4d2f43 sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE ... Browse Code »

Signed-off-by: Dongsheng Yang
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Oleg Nesterov
Cc: Robin Holt
Cc: Al Viro
Cc: Kees Cook
Cc: "Eric W. Biederman"
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Ingo Molnar

Dongsheng Yang
2014-02-23 01:16:19 +0800

24 Jan, 2014

2 commits

2e1f38358 kernel/sys.c: k_getrusage() can use while_each_thread() ... Browse Code »

Change k_getrusage() to use while_each_thread(), no changes in the
compiled code.

Signed-off-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Al Viro
Cc: Kees Cook
Reviewed-by: Sameer Nanda
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:02 +0800
98611e4e6 exec: kill task_struct->did_exec ... Browse Code »

We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually
exclusive. The patch kills ->did_exec because it has a single user.

Signed-off-by: Oleg Nesterov
Acked-by: KOSAKI Motohiro
Cc: Al Viro
Cc: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:02 +0800

13 Nov, 2013

1 commit

81e41ea25 kernel/sys.c: remove obsolete #include <linux/kexec.h> ... Browse Code »

Commit 15d94b82565e ("reboot: move shutdown/reboot related functions to
kernel/reboot.c") moved all kexec-related functionality to
kernel/reboot.c, so kernel/sys.c no longer needs to include
.

Signed-off-by: Geert Uytterhoeven
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Geert Uytterhoeven
2013-11-13 11:09:13 +0800

31 Aug, 2013

1 commit

c7b96acf1 userns: Kill nsown_capable it makes the wrong thing easy ... Browse Code »

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-08-31 14:44:11 +0800

10 Jul, 2013

2 commits

15d94b825 reboot: move shutdown/reboot related functions to kernel/reboot.c ... Browse Code »

This patch is preparatory. It moves reboot related syscall, etc
functions from kernel/sys.c to kernel/reboot.c.

Signed-off-by: Robin Holt
Cc: H. Peter Anvin
Cc: Russ Anderson
Cc: Robin Holt
Cc: Russell King
Cc: Guan Xuetao
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2013-07-10 01:33:29 +0800
0efbee708 reboot: remove -stable friendly PF_THREAD_BOUND define ... Browse Code »

Remove the prior patch's #define for easier backporting to the stable
releases.

Signed-off-by: Robin Holt
Cc: H. Peter Anvin
Cc: Russ Anderson
Cc: Robin Holt
Cc: Russell King
Cc: Guan Xuetao
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2013-07-10 01:33:29 +0800

04 Jul, 2013

3 commits

81dabb464 exit.c: unexport __set_special_pids() ... Browse Code »

Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.

And rename it to set_special_pids(), another helper with this name has
gone away.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-04 07:08:02 +0800
45c64940c kernel/sys.c:do_sysinfo(): use get_monotonic_boottime() ... Browse Code »

Change do_sysinfo() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Acked-by: John Stultz
Cc: Tomas Janousek
Cc: Tomas Smetana
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-04 07:07:42 +0800
7ec75e1ca kernel/sys.c: sys_reboot(): fix malformed panic message ... Browse Code »

If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will
stay on the same line with the next message, so append a '\n'.

Signed-off-by: liguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

liguang
2013-07-04 07:07:41 +0800

13 Jun, 2013

1 commit

cf7df378a reboot: rigrate shutdown/reboot to boot cpu ... Browse Code »

We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").

The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.

This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.

Signed-off-by: Robin Holt
Tested-by: Shawn Guo
Cc: "Srivatsa S. Bhat"
Cc: H. Peter Anvin
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Russ Anderson
Cc: Robin Holt
Cc: Russell King
Cc: Guan Xuetao
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2013-06-13 07:29:44 +0800

01 May, 2013

3 commits

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800
52b369415 kernel/sys.c: make prctl(PR_SET_MM) generally available ... Browse Code »

The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:

start_code, end_code, start_data, end_data, start_brk, brk,
start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file. This includes:

1. Restoring processes from a checkpoint-file (by all potential
user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
and high-availablity).
4. Starting a process from an executable format that is not supported
by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
executable that, for confidentiality, licensing or other reasons,
may not be written to the local file-systems.

The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled. The patch removes those ifdefs.

Signed-off-by: Amnon Shiloh
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amnon Shiloh
2013-05-01 08:04:09 +0800
4a22f1663 kernel/timer.c: move some non timer related syscalls to kernel/sys.c ... Browse Code »

Andrew Morton noted:

akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
SYSCALL_DEFINE1(alarm, unsigned int, seconds)
SYSCALL_DEFINE0(getpid)
SYSCALL_DEFINE0(getppid)
SYSCALL_DEFINE0(getuid)
SYSCALL_DEFINE0(geteuid)
SYSCALL_DEFINE0(getgid)
SYSCALL_DEFINE0(getegid)
SYSCALL_DEFINE0(gettid)
SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

Only one of those should be in kernel/timer.c. Who wrote this thing?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Rothwell
Acked-by: Thomas Gleixner
Cc: Guenter Roeck
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen Rothwell
2013-05-01 08:04:03 +0800

09 Apr, 2013

1 commit

6f389a8f1 PM / reboot: call syscore_shutdown() after disable_nonboot_cpus() ... Browse Code »

As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
subsystems PM) say, syscore_ops operations should be carried with one
CPU on-line and interrupts disabled. However, after commit f96972f2d
(kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
syscore_shutdown() is called before disable_nonboot_cpus(), so break
the rules. We have a MIPS machine with a 8259A PIC, and there is an
external timer (HPET) linked at 8259A. Since 8259A has been shutdown
too early (by syscore_shutdown()), disable_nonboot_cpus() runs without
timer interrupt, so it hangs and reboot fails. This patch call
syscore_shutdown() a little later (after disable_nonboot_cpus()) to
avoid reboot failure, this is the same way as poweroff does.

For consistency, add disable_nonboot_cpus() to kernel_halt().

Signed-off-by: Huacai Chen
Cc:
Signed-off-by: Rafael J. Wysocki

Huacai Chen
2013-04-09 04:10:40 +0800