Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

16 Aug, 2014

1 commit

ff7e0055b module: Clean up ro/nx after early module load failures ... Browse Code »

The commit

4982223e51e8 module: set nx before marking module MODULE_STATE_COMING.

introduced a regression: if a module fails to parse its arguments or
if mod_sysfs_setup fails, then the module's memory will be freed
while still read-only. Anything that reuses that memory will crash
as soon as it tries to write to it.

Cc: stable@vger.kernel.org # v3.16
Cc: Rusty Russell
Signed-off-by: Andy Lutomirski
Signed-off-by: Rusty Russell

Andy Lutomirski
2014-08-16 03:17:00 +0800

15 Aug, 2014

4 commits

c9d26423e Merge tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull more ACPI and power management updates from Rafael Wysocki:
"These are a couple of regression fixes, cpuidle menu governor
optimizations, fixes for ACPI proccessor and battery drivers,
hibernation fix to avoid problems related to the e820 memory map,
fixes for a few cpufreq drivers and a new version of the suspend
profiling tool analyze_suspend.py.

Specifics:

- Fix for an ACPI-based device hotplug regression introduced in 3.14
that causes a kernel panic to trigger when memory hot-remove is
attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen

- Fix for a cpufreq regression introduced in 3.16 that triggers a
"sleeping function called from invalid context" bug in
dev_pm_opp_init_cpufreq_table() from Stephen Boyd

- ACPI battery driver fix for a warning message added in 3.16 that
prints silly stuff sometimes from Mariusz Ceier

- Hibernation fix for safer handling of mismatches in the 820 memory
map between the configurations during image creation and during the
subsequent restore from Chun-Yi Lee

- ACPI processor driver fix to handle CPU hotplug notifications
correctly during system suspend/resume from Lan Tianyu

- Series of four cpuidle menu governor cleanups that also should
speed it up a bit from Mel Gorman

- Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus
Pargmann and Uwe Kleine-König

- Version 3.0 of the analyze_suspend.py suspend profiling tool from
Todd E Brandt"

* tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / battery: Fix warning message in acpi_battery_get_state()
PM / tools: analyze_suspend.py: update to v3.0
cpufreq: arm_big_little: fix module license spec
cpufreq: speedstep-smi: fix decimal printf specifiers
ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove()
cpufreq: OPP: Avoid sleeping while atomic
cpufreq: cpu0: Do not print error message when deferring
cpufreq: integrator: Use set_cpus_allowed_ptr
PM / hibernate: avoid unsafe pages in e820 reserved regions
ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events
cpuidle: menu: Lookup CPU runqueues less
cpuidle: menu: Call nr_iowait_cpu less times
cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
cpuidle: menu: Use shifts when calculating averages where possible

Linus Torvalds
2014-08-15 08:13:46 +0800
0680eb1f4 timekeeping: Another fix to the VSYSCALL_OLD update_vsyscall ... Browse Code »

Benjamin Herrenschmidt pointed out that I further missed modifying
update_vsyscall after the wall_to_mono value was changed to a
timespec64. This causes issues on powerpc32, which expects a 32bit
timespec.

This patch fixes the problem by properly converting from a timespec64 to
a timespec before passing the value on to the arch-specific vsyscall
logic.

[ Thomas is currently on vacation, but reviewed it and wanted me to send
this fix on to you directly. ]

Cc: LKML
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Benjamin Herrenschmidt
Reported-by: Benjamin Herrenschmidt
Reviewed-by: Thomas Gleixner
Signed-off-by: John Stultz
Signed-off-by: Linus Torvalds

John Stultz
2014-08-15 01:04:11 +0800
1d508f8ac Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc ... Browse Code »

Pull more powerpc updates from Ben Herrenschmidt:
"Here are some more powerpc bits for 3.17, essentially fixes.

The biggest series, also aimed at -stable, is from Aneesh and is the
result of weeks and weeks of debugging to find out why the heck or THP
implementation was occasionally triggering multi-hit errors in our
level 1 TLB. It ended up being a combination of issues including
subtleties as to how we should invalidate those special 'MPSS' pages
we use to allow the use of 16M pages inside 4K/64K "base page size"
segments (you really have to love our MMU !)

Another interesting one in the "OMG" category is the series from
Michael adding memory barriers to spin_is_locked(). That's also the
result of many days of debugging to figure out why the semaphore code
would occasionally crash in ways that made no sense. It ended up
being some creative lock stacking that was defeated by the fact that
our locks allow a load inside the locked section to be re-ordered with
the load of the lock value itself (I'm still of two mind about whether
to kill that once and for all by putting a heavier barrier back into
our lock implementation...). The fixes come with a long explanation
in the cset comments, feel free to read it if you feel like having a
headache today"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
powerpc/thp: Add tracepoints to track hugepage invalidate
powerpc/mm: Use read barrier when creating real_pte
powerpc/thp: Use ACCESS_ONCE when loading pmdp
powerpc/thp: Invalidate with vpn in loop
powerpc/thp: Handle combo pages in invalidate
powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte
powerpc/thp: Don't recompute vsid and ssize in loop on invalidate
powerpc/thp: Add write barrier after updating the valid bit
powerpc: reorder per-cpu NUMA information's initialization
powerpc/perf/hv-24x7: Use kmem_cache_free
powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info
powerpc: Hard disable interrupts in xmon
powerpc: remove duplicate definition of TEXASR_FS
powerpc/pseries: Avoid deadlock on removing ddw
powerpc/pseries: Failure on removing device node
powerpc/boot: Use correct zlib types for comparison
powerpc/powernv: Interface to register/unregister opal dump region
printk: Add function to return log buffer address and size
powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS
powerpc/ppc476: Disable BTAC
...

Linus Torvalds
2014-08-15 00:14:07 +0800
311bf6d1c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security ... Browse Code »

Pull seccomp fix from James Morris.

BUG(!spin_is_locked()) really doesn't work very well in UP
configurations without any actual spinlock state. Which is very much
why we have that "assert_spin_lock()" function for this.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock

Linus Torvalds
2014-08-15 00:09:48 +0800

13 Aug, 2014

1 commit

14c4000a8 printk: Add function to return log buffer address and size ... Browse Code »

Platforms like IBM Power Systems supports service processor
assisted dump. It provides interface to add memory region to
be captured when system is crashed.

During initialization/running we can add kernel memory region
to be collected.

Presently we don't have a way to get the log buffer base address
and size. This patch adds support to return log buffer address
and size.

Signed-off-by: Vasant Hegde
Signed-off-by: Benjamin Herrenschmidt
Acked-by: Andrew Morton

Vasant Hegde
2014-08-13 13:13:44 +0800

12 Aug, 2014

3 commits

e67ee1019 Merge branches 'pm-sleep', 'pm-cpufreq' and 'pm-cpuidle' ... Browse Code »

* pm-sleep:
PM / hibernate: avoid unsafe pages in e820 reserved regions

* pm-cpufreq:
cpufreq: arm_big_little: fix module license spec
cpufreq: speedstep-smi: fix decimal printf specifiers
cpufreq: OPP: Avoid sleeping while atomic
cpufreq: cpu0: Do not print error message when deferring
cpufreq: integrator: Use set_cpus_allowed_ptr

* pm-cpuidle:
cpuidle: menu: Lookup CPU runqueues less
cpuidle: menu: Call nr_iowait_cpu less times
cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
cpuidle: menu: Use shifts when calculating averages where possible

Rafael J. Wysocki
2014-08-12 05:19:48 +0800
69f6a34bd seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock ... Browse Code »

Current upstream kernel hangs with mips and powerpc targets in
uniprocessor mode if SECCOMP is configured.

Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
Turns out that code such as
BUG_ON(!spin_is_locked(&list_lock));
can not be used in uniprocessor mode because spin_is_locked() always
returns false in this configuration, and that assert_spin_locked()
exists for that very purpose and must be used instead.

Fixes: dbd952127d11 ("seccomp: introduce writer locking")
Cc: Kees Cook
Signed-off-by: Guenter Roeck
Signed-off-by: Kees Cook

Guenter Roeck
2014-08-12 04:29:12 +0800
f6f993328 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"Stuff in here:

- acct.c fixes and general rework of mnt_pin mechanism. That allows
to go for delayed-mntput stuff, which will permit mntput() on deep
stack without worrying about stack overflows - fs shutdown will
happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
series without introducing tons of stack overflows on new mntput()
call chains it introduces.
- Bruce's d_splice_alias() patches
- more Miklos' rename() stuff.
- a couple of regression fixes (stable fodder, in the end of branch)
and a fix for API idiocy in iov_iter.c.

There definitely will be another pile, maybe even two. I'd like to
get Eric's series in this time, but even if we miss it, it'll go right
in the beginning of for-next in the next cycle - the tricky part of
prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
fix copy_tree() regression
__generic_file_write_iter(): fix handling of sync error after DIO
switch iov_iter_get_pages() to passing maximal number of pages
fs: mark __d_obtain_alias static
dcache: d_splice_alias should detect loops
exportfs: update Exporting documentation
dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
dcache: remove unused d_find_alias parameter
dcache: d_obtain_alias callers don't all want DISCONNECTED
dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
dcache: d_splice_alias mustn't create directory aliases
dcache: close d_move race in d_splice_alias
dcache: move d_splice_alias
namei: trivial fix to vfs_rename_dir comment
VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
cifs: support RENAME_NOREPLACE
hostfs: support rename flags
shmem: support RENAME_EXCHANGE
shmem: support RENAME_NOREPLACE
btrfs: add RENAME_NOREPLACE
...

Linus Torvalds
2014-08-12 02:44:11 +0800

11 Aug, 2014

1 commit

c8d6637d0 Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux ... Browse Code »

Pull module updates from Rusty Russell:
"This finally applies the stricter sysfs perms checking we pulled out
before last merge window. A few stragglers are fixed (thanks
linux-next!)"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
arch/powerpc/platforms/powernv/opal-dump.c: fix world-writable sysfs files
arch/powerpc/platforms/powernv/opal-elog.c: fix world-writable sysfs files
drivers/video/fbdev/s3c2410fb.c: don't make debug world-writable.
ARM: avoid ARM binutils leaking ELF local symbols
scripts: modpost: Remove numeric suffix pattern matching
scripts: modpost: fix compilation warning
sysfs: disallow world-writable files.
module: return bool from within_module*()
module: add within_module() function
modules: Fix build error in moduleloader.h

Linus Torvalds
2014-08-11 12:31:58 +0800

10 Aug, 2014

3 commits

fc335c1b6 Merge tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull trace file read iterator fixes from Steven Rostedt:
"This contains a fix for two long standing bugs. Both of which are
rarely ever hit, and requires the user to do something that users
rarely do. It took a few special test cases to even trigger this bug,
and one of them was just one test in the process of finishing up as
another one started.

Both bugs have to do with the ring buffer iterator rb_iter_peek(), but
one is more indirect than the other.

The fist bug fix is simply an increase in the safety net loop counter.
The counter makes sure that the rb_iter_peek() only iterates the
number of times we expect it can, and no more. Well, there was one
way it could iterate one more than we expected, and that caused the
ring buffer to shutdown with a nasty warning. The fix was simply to
up that counter by one.

The other bug has to be with rb_iter_reset() (called by
rb_iter_peek()). This happens when a user reads both the trace_pipe
and trace files. The trace_pipe is a consuming read and does not use
the ring buffer iterator, but the trace file is not a consuming read
and does use the ring buffer iterator. When the trace file is being
read, if it detects that a consuming read occurred, it resets the
iterator and starts over. But the reset code that does this
(rb_iter_reset()), checks if the reader_page is linked to the ring
buffer or not, and will look into the ring buffer itself if it is not.
This is wrong, as it should always try to read the reader page first.
Not to mention, the code that looked into the ring buffer did it
wrong, and used the header_page "read" offset to start reading on that
page. That offset is bogus for pages in the writable ring buffer, and
was corrupting the iterator, and it would start returning bogus
events"

* tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Always reset iterator to reader page
ring-buffer: Up rb_iter_peek() loop count to 3

Linus Torvalds
2014-08-10 08:29:36 +0800
77e40aae7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace updates from Eric Biederman:
"This is a bunch of small changes built against 3.16-rc6. The most
significant change for users is the first patch which makes setns
drmatically faster by removing unneded rcu handling.

The next chunk of changes are so that "mount -o remount,.." will not
allow the user namespace root to drop flags on a mount set by the
system wide root. Aks this forces read-only mounts to stay read-only,
no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
mounts to stay no exec and it prevents unprivileged users from messing
with a mounts atime settings. I have included my test case as the
last patch in this series so people performing backports can verify
this change works correctly.

The next change fixes a bug in NFS that was discovered while auditing
nsproxy users for the first optimization. Today you can oops the
kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
with pid namespaces. I rebased and fixed the build of the
!CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
that no one to my knowledge bases anything on my tree fixing the typo
in place seems more responsible that requiring a typo-fix to be
backported as well.

The last change is a small semantic cleanup introducing
/proc/thread-self and pointing /proc/mounts and /proc/net at it. This
prevents several kinds of problemantic corner cases. It is a
user-visible change so it has a minute chance of causing regressions
so the change to /proc/mounts and /proc/net are individual one line
commits that can be trivially reverted. Unfortunately I lost and
could not find the email of the original reporter so he is not
credited. From at least one perspective this change to /proc/net is a
refgression fix to allow pthread /proc/net uses that were broken by
the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
proc: Implement /proc/thread-self to point at the directory of the current thread
proc: Have net show up under /proc//task/
NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
namespaces: Use task_lock and not rcu to protect nsproxy

Linus Torvalds
2014-08-10 08:10:41 +0800
63b12bdb0 Merge branch 'signal-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc ... Browse Code »

Pull arch signal handling cleanup from Richard Weinberger:
"This patch series moves all remaining archs to the get_signal(),
signal_setup_done() and sigsp() functions.

Currently these archs use open coded variants of the said functions.
Further, unused parameters get removed from get_signal_to_deliver(),
tracehook_signal_handler() and signal_delivered().

At the end of the day we save around 500 lines of code."

* 'signal-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (43 commits)
powerpc: Use sigsp()
openrisc: Use sigsp()
mn10300: Use sigsp()
mips: Use sigsp()
microblaze: Use sigsp()
metag: Use sigsp()
m68k: Use sigsp()
m32r: Use sigsp()
hexagon: Use sigsp()
frv: Use sigsp()
cris: Use sigsp()
c6x: Use sigsp()
blackfin: Use sigsp()
avr32: Use sigsp()
arm64: Use sigsp()
arc: Use sigsp()
sas_ss_flags: Remove nested ternary if
Rip out get_signal_to_deliver()
Clean up signal_delivered()
tracehook_signal_handler: Remove sig, info, ka and regs
...

Linus Torvalds
2014-08-10 00:58:12 +0800

09 Aug, 2014

27 commits

8e7d83810 kexec: verify the signature of signed PE bzImage ... Browse Code »

This is the final piece of the puzzle of verifying kernel image signature
during kexec_file_load() syscall.

This patch calls into PE file routines to verify signature of bzImage. If
signature are valid, kexec_file_load() succeeds otherwise it fails.

Two new config options have been introduced. First one is
CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be
validly signed otherwise kernel load will fail. If this option is not
set, no signature verification will be done. Only exception will be when
secureboot is enabled. In that case signature verification should be
automatically enforced when secureboot is enabled. But that will happen
when secureboot patches are merged.

Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option
enables signature verification support on bzImage. If this option is not
set and previous one is set, kernel image loading will fail because kernel
does not have support to verify signature of bzImage.

I tested these patches with both "pesign" and "sbsign" signed bzImages.

I used signing_key.priv key and signing_key.x509 cert for signing as
generated during kernel build process (if module signing is enabled).

Used following method to sign bzImage.

pesign
======
- Convert DER format cert to PEM format cert
openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform
PEM

- Generate a .p12 file from existing cert and private key file
openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in
signing_key.x509.PEM

- Import .p12 file into pesign db
pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign

- Sign bzImage
pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign
-c "Glacier signing key - Magrathea" -s

sbsign
======
sbsign --key signing_key.priv --cert signing_key.x509.PEM --output
/boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+

Patch details:

Well all the hard work is done in previous patches. Now bzImage loader
has just call into that code and verify whether bzImage signature are
valid or not.

Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG.
This option enforces that kernel has to be validly signed otherwise kernel
load will fail. If this option is not set, no signature verification will
be done. Only exception will be when secureboot is enabled. In that case
signature verification should be automatically enforced when secureboot is
enabled. But that will happen when secureboot patches are merged.

Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option
enables signature verification support on bzImage. If this option is not
set and previous one is set, kernel image loading will fail because kernel
does not have support to verify signature of bzImage.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Cc: Matt Fleming
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:33 +0800
dd5f72607 kexec: support for kexec on panic using new system call ... Browse Code »

This patch adds support for loading a kexec on panic (kdump) kernel usning
new system call.

It prepares ELF headers for memory areas to be dumped and for saved cpu
registers. Also prepares the memory map for second kernel and limits its
boot to reserved areas only.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:33 +0800
27f48d3e6 kexec-bzImage64: support for loading bzImage using 64bit entry ... Browse Code »

This is loader specific code which can load bzImage and set it up for
64bit entry. This does not take care of 32bit entry or real mode entry.

32bit mode entry can be implemented if somebody needs it.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:33 +0800
12db5562e kexec: load and relocate purgatory at kernel load time ... Browse Code »

Load purgatory code in RAM and relocate it based on the location.
Relocation code has been inspired by module relocation code and purgatory
relocation code in kexec-tools.

Also compute the checksums of loaded kexec segments and store them in
purgatory.

Arch independent code provides this functionality so that arch dependent
bootloaders can make use of it.

Helper functions are provided to get/set symbol values in purgatory which
are used by bootloaders later to set things like stack and entry point of
second kernel etc.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
cb1052581 kexec: implementation of new syscall kexec_file_load ... Browse Code »

Previous patch provided the interface definition and this patch prvides
implementation of new syscall.

Previously segment list was prepared in user space. Now user space just
passes kernel fd, initrd fd and command line and kernel will create a
segment list internally.

This patch contains generic part of the code. Actual segment preparation
and loading is done by arch and image specific loader. Which comes in
next patch.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
f0895685c kexec: new syscall kexec_file_load() declaration ... Browse Code »

This is the new syscall kexec_file_load() declaration/interface. I have
reserved the syscall number only for x86_64 so far. Other architectures
(including i386) can reserve syscall number when they enable the support
for this new syscall.

Signed-off-by: Vivek Goyal
Cc: Michael Kerrisk
Cc: Borislav Petkov
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
8c86e70ac resource: provide new functions to walk through resources ... Browse Code »
13

I have added two more functions to walk through resources.

Currently walk_system_ram_range() deals with pfn and /proc/iomem can
contain partial pages. By dealing in pfn, callback function loses the
info that last page of a memory range is a partial page and not the full
page. So I implemented walk_system_ram_res() which returns u64 values to
callback functions and now it properly return start and end address.

walk_system_ram_range() uses find_next_system_ram() to find the next ram
resource. This in turn only travels through siblings of top level child
and does not travers through all the nodes of the resoruce tree. I also
need another function where I can walk through all the resources, for
example figure out where "GART" aperture is. Figure out where ACPI memory
is.

So I wrote another function walk_iomem_res() which walks through all
/proc/iomem resources and returns matches as asked by caller. Caller can
specify "name" of resource, start and end and flags.

Got rid of find_next_system_ram_res() and instead implemented more generic
find_next_iomem_res() which can be used to traverse top level children
only based on an argument.

Signed-off-by: Vivek Goyal
Cc: Yinghai Lu
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
255aedd90 kexec: use common function for kimage_normal_alloc() and kimage_crash_alloc() ... Browse Code »

kimage_normal_alloc() and kimage_crash_alloc() are doing lot of similar
things and differ only little. So instead of having two separate
functions create a common function kimage_alloc_init() and pass it the
"flags" argument which tells whether it is normal kexec or kexec_on_panic.
And this function should be able to deal with both the cases.

This consolidation also helps later where we can use a common function
kimage_file_alloc_init() to handle normal and crash cases for new file
based kexec syscall.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
dabe78628 kexec: move segment verification code in a separate function ... Browse Code »

Previously do_kimage_alloc() will allocate a kimage structure, copy
segment list from user space and then do the segment list sanity
verification.

Break down this function in 3 parts. do_kimage_alloc_init() to do actual
allocation and basic initialization of kimage structure.
copy_user_segment_list() to copy segment list from user space and
sanity_check_segment_list() to verify the sanity of segment list as passed
by user space.

In later patches, I need to only allocate kimage and not copy segment list
from user space. So breaking down in smaller functions enables re-use of
code at other places.

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
7d3e2bca2 kexec: rename unusebale_pages to unusable_pages ... Browse Code »

Let's use the more common "unusable".

This patch was originally written and posted by Boris. I am including it
in this patch series.

Signed-off-by: Borislav Petkov
Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
8370edea8 bin2c: move bin2c in scripts/basic ... Browse Code »

This patch series does not do kernel signature verification yet. I plan
to post another patch series for that. Now distributions are already
signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
those signatures.

Primary goal of this patchset is to prepare groundwork so that kernel
image can be signed and signatures be verified during kexec load. This
should help with two things.

- It should allow kexec/kdump on secureboot enabled machines.

- In general it can help even without secureboot. By being able to verify
kernel image signature in kexec, it should help with avoiding module
signing restrictions. Matthew Garret showed how to boot into a custom
kernel, modify first kernel's memory and then jump back to old kernel and
bypass any policy one wants to.

This patch (of 15):

Kexec wants to use bin2c and it wants to use it really early in the build
process. See arch/x86/purgatory/ code in later patches.

So move bin2c in scripts/basic so that it can be built very early and
be usable by arch/x86/purgatory/

Signed-off-by: Vivek Goyal
Cc: Borislav Petkov
Cc: Michael Kerrisk
Cc: Yinghai Lu
Cc: Eric Biederman
Cc: H. Peter Anvin
Cc: Matthew Garrett
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: WANG Chao
Cc: Baoquan He
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vivek Goyal
2014-08-09 06:57:32 +0800
9183df25f shm: add memfd_create() syscall ... Browse Code »
39

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It can support sealing and avoids any
connection to user-visible mount-points. Thus, it's not subject to quotas
on mounted file-systems, but can be used like malloc()'ed memory, but with
a file-descriptor to it.

memfd_create() returns the raw shmem file, so calls like ftruncate() can
be used to modify the underlying inode. Also calls like fstat() will
return proper information and mark the file as regular file. If you want
sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to a filesystem size limit. It is still properly accounted to
memcg limits, though, and to the same overcommit or no-overcommit
accounting as all user memory.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Herrmann
2014-08-09 06:57:31 +0800
4bb5f5d93 mm: allow drivers to prevent new writable mappings ... Browse Code »

This patch (of 6):

The i_mmap_writable field counts existing writable mappings of an
address_space. To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set. In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers. This is the same
that we do for i_writecount.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Herrmann
2014-08-09 06:57:31 +0800
934fc295b kernel/acct.c: fix coding style warnings and errors ... Browse Code »

Signed-off-by: Ionut Alexa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ionut Alexa
2014-08-09 06:57:27 +0800
ab602f799 shm: make exit_shm work proportional to task activity ... Browse Code »

This is small set of patches our team has had kicking around for a few
versions internally that fixes tasks getting hung on shm_exit when there
are many threads hammering it at once.

Anton wrote a simple test to cause the issue:

http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

Before applying this patchset, this test code will cause either hanging
tracebacks or pthread out of memory errors.

After this patchset, it will still produce output like:

root@somehost:~# ./bust_shm_exit 1024 160
...
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
INFO: Stall ended before state dump start
...

But the task will continue to run along happily, so we consider this an
improvement over hanging, even if it's a bit noisy.

This patch (of 3):

exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
walks every shared memory segment in the namespace. Thus the amount of
work is related to the number of shm segments in the namespace not the
number of segments that might need to be cleaned.

In addition, this occurs after the task has been notified the thread has
exited, so the number of tasks waiting for the ns shm rwsem can grow
without bound until memory is exausted.

Add a list to the task struct of all shmids allocated by this task. Init
the list head in copy_process. Use the ns->rwsem for locking. Add
segments after id is added, remove before removing from id.

On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
to handling of semaphore undo.

I chose a define for the init sequence since its a simple list init,
otherwise it would require a function call to avoid include loops between
the semaphore code and the task struct. Converting the list_del to
list_del_init for the unshare cases would remove the exit followed by
init, but I left it blow up if not inited.

Signed-off-by: Milton Miller
Signed-off-by: Jack Miller
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Cc: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Miller
2014-08-09 06:57:26 +0800
69361eef9 panic: add TAINT_SOFTLOCKUP ... Browse Code »
13

This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the
system has been in a softlockup state when debugging.

[akpm@linux-foundation.org: apply the taint before calling panic()]
Signed-off-by: Josh Hunt
Cc: Jason Baron
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josh Hunt
2014-08-09 06:57:24 +0800
834b18b23 kernel/gcov/fs.c: remove unnecessary null test before debugfs_remove ... Browse Code »

This fixes checkpatch warning:

WARNING: debugfs_remove(NULL) is safe this check is probably not required

Signed-off-by: Fabian Frederick
Cc: Peter Oberparleiter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2014-08-09 06:57:24 +0800
33144e842 kernel/fork.c: make mm_init_owner static ... Browse Code »

It's only used in fork.c:mm_init().

Signed-off-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-08-09 06:57:23 +0800
4f7d46143 fork: copy mm's vm usage counters under mmap_sem ... Browse Code »

If a forking process has a thread calling (un)mmap (silly but still),
the child process may have some of its mm's vm usage counters (total_vm
and friends) screwed up, because currently they are copied from oldmm
w/o holding any locks (memcpy in dup_mm).

This patch moves the counters initialization to dup_mmap() to be called
under oldmm->mmap_sem, which eliminates any possibility of race.

Signed-off-by: Vladimir Davydov
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-08-09 06:57:23 +0800
ce65cefa5 fork: reset mm->pinned_vm ... Browse Code »

mm->pinned_vm counts pages of mm's address space that were permanently
pinned in memory by increasing their reference counter. The counter was
introduced by commit bc3e53f682d9 ("mm: distinguish between mlocked and
pinned pages"), while before it locked_vm had been used for such pages.

Obviously, we should reset the counter on fork if !CLONE_VM, just like
we do with locked_vm, but currently we don't. Let's fix it.

This patch will fix the contents of /proc/pid/status:VmPin.

ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
RLIMIT_MEMLOCK. It's left from the times when pinned pages were accounted
under locked_vm, but today it looks wrong. It isn't clear how we should
deal with it.

We still have some drivers accounting pinned pages under mm->locked_vm -
this is what commit bc3e53f682d9 was fighting against. It's
infiniband/usnic and vfio.

Signed-off-by: Vladimir Davydov
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Christoph Lameter
Cc: Roland Dreier
Cc: Sean Hefty
Cc: Hal Rosenstock
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-08-09 06:57:23 +0800
41f727fde fork/exec: cleanup mm initialization ... Browse Code »

mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.

We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:

- on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
are zeroed in dup_mmap();

- on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
calling mm_init();

- ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
called before mm_init() on both fork and exec;

- ->context is initialized by init_new_context(), which is called after
mm_init() on both fork and exec;

Let's consolidate all the initializations in mm_init() to make the code
look cleaner.

Signed-off-by: Vladimir Davydov
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-08-09 06:57:23 +0800
ccf94f1b4 proc: constify seq_operations ... Browse Code »

proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.

text data bss dec hex filename
6817 404 1984 9205 23f5 kernel/user_namespace.o-before
6913 308 1984 9205 23f5 kernel/user_namespace.o-after

Signed-off-by: Fabian Frederick
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2014-08-09 06:57:22 +0800
a0be55dee kernel/exit.c: fix coding style warnings and errors ... Browse Code »

Fixed coding style warnings and errors.

Signed-off-by: Ionut Alexa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ionut Alexa
2014-08-09 06:57:22 +0800
4878b14b4 kernel/test_kprobes.c: use current logging functions ... Browse Code »

- Add pr_fmt
- Coalesce formats
- Use current pr_foo() functions instead of printk
- Remove unnecessary "failed" display (already in log level).

Signed-off-by: Fabian Frederick
Cc: Ananth N Mavinakayanahalli
Cc: Anil S Keshavamurthy
Cc: "David S. Miller"
Cc: Masami Hiramatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2014-08-09 06:57:18 +0800
b86280aa4 kernel/kallsyms.c: fix %pB when there's no symbol at the address ... Browse Code »

__sprint_symbol() should restore original address when kallsyms_lookup()
failed to find a symbol. It's reported when dumpstack shows an address in
a dynamically allocated trampoline for ftrace.

[ 1314.612287] [] dump_stack+0x45/0x56
[ 1314.612290] [] ? meminfo_proc_open+0x30/0x30
[ 1314.612293] [] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 1314.612306] [] 0xffffffffa00160c3

You can see a difference in the hex address - c4 and c3. Fix it.

Signed-off-by: Namhyung Kim
Reported-by: Masami Hiramatsu
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Josh Poimboeuf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2014-08-09 06:57:18 +0800
9a3f4d85d page-cgroup: get rid of NR_PCG_FLAGS ... Browse Code »

It's not used anywhere today, so let's remove it.

Signed-off-by: Vladimir Davydov
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-08-09 06:57:18 +0800
747db954c mm: memcontrol: use page lists for uncharge batching ... Browse Code »

Pages are now uncharged at release time, and all sources of batched
uncharges operate on lists of pages. Directly use those lists, and
get rid of the per-task batching state.

This also batches statistics accounting, in addition to the res
counter charges, to reduce IRQ-disabling and re-enabling.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Hugh Dickins
Cc: Tejun Heo
Cc: Vladimir Davydov
Cc: Naoya Horiguchi
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-08-09 06:57:18 +0800