Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

13 Apr, 2014

2 commits

5166701b3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.

Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.

This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...

Linus Torvalds
2014-04-13 05:49:50 +0800
0b747172d Merge git://git.infradead.org/users/eparis/audit ... Browse Code »

Pull audit updates from Eric Paris.

* git://git.infradead.org/users/eparis/audit: (28 commits)
AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
audit: do not cast audit_rule_data pointers pointlesly
AUDIT: Allow login in non-init namespaces
audit: define audit_is_compat in kernel internal header
kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
sched: declare pid_alive as inline
audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
syscall_get_arch: remove useless function arguments
audit: remove stray newline from audit_log_execve_info() audit_panic() call
audit: remove stray newlines from audit_log_lost messages
audit: include subject in login records
audit: remove superfluous new- prefix in AUDIT_LOGIN messages
audit: allow user processes to log from another PID namespace
audit: anchor all pid references in the initial pid namespace
audit: convert PPIDs to the inital PID namespace.
pid: get pid_t ppid of task in init_pid_ns
audit: rename the misleading audit_get_context() to audit_take_context()
audit: Add generic compat syscall support
audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
...

Linus Torvalds
2014-04-13 03:38:53 +0800

08 Apr, 2014

10 commits

16caed319 fault-injection: set bounds on what /proc/self/make-it-fail accepts. ... Browse Code »

/proc/self/make-it-fail is a boolean, but accepts any number, including
negative ones. Change variable to unsigned, and cap upper bound at 1.

[akpm@linux-foundation.org: don't make make_it_fail unsigned]
Signed-off-by: Dave Jones
Reviewed-by: Akinobu Mita
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Jones
2014-04-08 07:36:10 +0800
c4082f36f vmcore: continue vmcore initialization if PT_NOTE is found empty ... Browse Code »

Currently when an empty PT_NOTE is detected, vmcore initialization
fails. It sounds too harsh. Because PT_NOTE could be empty, for
example, one offlined a cpu but never restarted kdump service, and after
crash, PT_NOTE program header is there but no data contains. It's
better to warn about the empty PT_NOTE and continue to initialise
vmcore.

And ultimately the multiple PT_NOTE are merged into a single one, all
empty PT_NOTE are discarded naturally during the merge. So empty
PT_NOTE is not visible to user space and vmcore is as good as expected.

Signed-off-by: WANG Chao
Cc: Vivek Goyal
Cc: HATAYAMA Daisuke
Cc: Greg Pearson
Cc: Baoquan He
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

WANG Chao
2014-04-08 07:36:06 +0800
82e0703b6 include/linux/crash_dump.h: add vmcore_cleanup() prototype ... Browse Code »

Eliminate the following warning in proc/vmcore.c:

fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]

[akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rashika Kheria
2014-04-08 07:36:06 +0800
ad86622b4 wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space ... Browse Code »
14

get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change. Note that this was
possible even before EXIT_TRACE was introduced.

This is not really bad but imho it make sense to hide EXIT_TRACE from
user-space completely. So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.

Signed-off-by: Oleg Nesterov
Cc: Jan Kratochvil
Cc: Michal Schmidt
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:06 +0800
32ed74a4b procfs: make /proc/*/pagemap 0400 ... Browse Code »

The /proc/*/pagemap contain sensitive information and currently its mode
is 0444. Change this to 0400, so the VFS will prevent unprivileged
processes from getting file descriptors on arbitrary privileged
/proc/*/pagemap files.

This reduces the scope of address space leaking and bypasses by protecting
already running processes.

Signed-off-by: Djalal Harouni
Acked-by: Kees Cook
Acked-by: Andy Lutomirski
Cc: Eric W. Biederman
Cc: Al Viro
Cc: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Djalal Harouni
2014-04-08 07:36:05 +0800
35a35046e procfs: make /proc/*/{stack,syscall,personality} 0400 ... Browse Code »

These procfs files contain sensitive information and currently their
mode is 0444. Change this to 0400, so the VFS will be able to block
unprivileged processes from getting file descriptors on arbitrary
privileged /proc/*/{stack,syscall,personality} files.

This reduces the scope of ASLR leaking and bypasses by protecting already
running processes.

Signed-off-by: Djalal Harouni
Acked-by: Kees Cook
Acked-by: Andy Lutomirski
Cc: Eric W. Biederman
Cc: Al Viro
Cc: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Djalal Harouni
2014-04-08 07:36:04 +0800
1c44dbc82 fs/proc/inode.c: use RCU_INIT_POINTER(x, NULL) ... Browse Code »

Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure. And in the
case of the NULL pointer, there is no structure to initialize. So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Monam Agarwal
2014-04-08 07:36:04 +0800
49d063cb3 proc: show mnt_id in /proc/pid/fdinfo ... Browse Code »

Currently we don't have a way how to determing from which mount point
file has been opened. This information is required for proper dumping
and restoring file descriptos due to presence of mount namespaces. It's
possible, that two file descriptors are opened using the same paths, but
one fd references mount point from one namespace while the other fd --
from other namespace.

$ ls -l /proc/1/fd/1
lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null

$ cat /proc/1/fdinfo/1
pos: 0
flags: 0100002
mnt_id: 16

$ cat /proc/1/mountinfo | grep ^16
16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755

Signed-off-by: Andrey Vagin
Acked-by: Pavel Emelyanov
Acked-by: Cyrill Gorcunov
Cc: Rob Landley
Cc: Al Viro
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Vagin
2014-04-08 07:36:04 +0800
f0b5664ba fs/proc/meminfo: meminfo_proc_show(): fix typo in comment ... Browse Code »

It should read "reclaimable slab" and not "reclaimable swap".

Signed-off-by: Luiz Capitulino
Reviewed-by: Rik van Riel
Acked-by: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luiz Capitulino
2014-04-08 07:36:04 +0800
615d6e875 mm: per-thread vma caching ... Browse Code »
7

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: Linus Torvalds
Reviewed-by: Michel Lespinasse
Cc: Oleg Nesterov
Tested-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2014-04-08 07:35:53 +0800

05 Apr, 2014

1 commit

24e7ea3be Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.

Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...

Linus Torvalds
2014-04-05 06:39:39 +0800

04 Apr, 2014

1 commit

91b0abe36 mm + fs: store shadow entries in page cache ... Browse Code »

Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800

03 Apr, 2014

1 commit

b9f2b21a3 Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux ... Browse Code »

Pull devicetree changes from Grant Likely:
"Updates to devicetree core code. This branch contains the following
notable changes:

- add reserved memory binding
- make struct device_node a kobject and remove legacy
/proc/device-tree
- ePAPR conformance fixes
- update in-kernel DTC copy to version v1.4.0
- preparatory changes for dynamic device tree overlays
- minor bug fixes and documentation changes

The most significant change in this branch is the conversion of struct
device_node to be a kobject that is exposed via sysfs and removal of
the old /proc/device-tree code. This simplifies the device tree
handling code and tightens up the lifecycle on device tree nodes.

[updated: added fix for dangling select PROC_DEVICETREE]"

* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
dt: Remove dangling "select PROC_DEVICETREE"
of: Add support for ePAPR "stdout-path" property
of: device_node kobject lifecycle fixes
of: only scan for reserved mem when fdt present
powerpc: add support for reserved memory defined by device tree
arm64: add support for reserved memory defined by device tree
of: add missing major vendors
of: add vendor prefix for SMSC
of: remove /proc/device-tree
of/selftest: Add self tests for manipulation of properties
of: Make device nodes kobjects so they show up in sysfs
arm: add support for reserved memory defined by device tree
drivers: of: add support for custom reserved memory drivers
drivers: of: add initialization code for dynamic reserved memory
drivers: of: add initialization code for static reserved memory
of: document bindings for reserved-memory nodes
Revert "of: fix of_update_property()"
kbuild: dtbs_install: new make target
ARM: mvebu: Allows to get the SoC ID even without PCI enabled
of: Allows to use the PCI translator without the PCI core
...

Linus Torvalds
2014-04-03 05:27:15 +0800

02 Apr, 2014

2 commits

5d826c847 new helper: readlink_copy() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:15 +0800
a21e40877 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull timer updates from Ingo Molnar:
"The main purpose is to fix a full dynticks bug related to
virtualization, where steal time accounting appears to be zero in
/proc/stat even after a few seconds of competing guests running busy
loops in a same host CPU. It's not a regression though as it was
there since the beginning.

The other commits are preparatory work to fix the bug and various
cleanups"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch: Remove stub cputime.h headers
sched: Remove needless round trip nsecs tick conversion of steal time
cputime: Fix jiffies based cputime assumption on steal accounting
cputime: Bring cputime -> nsecs conversion
cputime: Default implementation of nsecs -> cputime conversion
cputime: Fix nsecs_to_cputime() return type cast

Linus Torvalds
2014-04-02 01:16:10 +0800

31 Mar, 2014

1 commit

d88cf7d7b Merge remote-tracking branch 'robh/for-next' into devicetree/next Browse Code »

Grant Likely
2014-03-31 15:10:55 +0800

20 Mar, 2014

1 commit

21a6457a7 proc: Update get proc_pid_cmdline() to use mm.h helpers ... Browse Code »

Re-factor proc_pid_cmdline() to use get_cmdline() helper
from mm.h.

Acked-by: David Rientjes
Acked-by: Stephen Smalley
Acked-by: Richard Guy Briggs

Signed-off-by: William Roberts
Acked-by: Richard Guy Briggs
Signed-off-by: Eric Paris

William Roberts
2014-03-20 22:10:26 +0800

13 Mar, 2014

2 commits

bfc3f0281 cputime: Default implementation of nsecs -> cputime conversion ... Browse Code »

The architectures that override cputime_t (s390, ppc) don't provide
any version of nsecs_to_cputime(). Indeed this cputime_t implementation
by backend only happens when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y under
which the core code doesn't make any use of nsecs_to_cputime().

At least for now.

We are going to make a broader use of it so lets provide a default
version with a per usecs granularity. It should be good enough for most
usecases.

Cc: Ingo Molnar
Cc: Marcelo Tosatti
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Acked-by: Rik van Riel
Signed-off-by: Frederic Weisbecker

Frederic Weisbecker
2014-03-13 22:56:43 +0800
02b9984d6 fs: push sync_filesystem() down to the file system's remount_fs() ... Browse Code »
13

Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs(). This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior. In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o"
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: Evgeniy Dushistov
Cc: Jan Kara
Cc: OGAWA Hirofumi
Cc: Anders Larsen
Cc: Phillip Lougher
Cc: Kees Cook
Cc: Mikulas Patocka
Cc: Petr Vandrovec
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org

Theodore Ts'o
2014-03-13 22:14:33 +0800

12 Mar, 2014

1 commit

8357041a6 of: remove /proc/device-tree ... Browse Code »
13

The same data is now available in sysfs, so we can remove the code
that exports it in /proc and replace it with a symlink to the sysfs
version.

Tested on versatile qemu model and mpc5200 eval board. More testing
would be appreciated.

v5: Fixed up conflicts with mainline changes

Signed-off-by: Grant Likely
Cc: Rob Herring
Cc: Benjamin Herrenschmidt
Cc: David S. Miller
Cc: Nathan Fontenot
Cc: Pantelis Antoniou

Grant Likely
2014-03-12 04:48:32 +0800

11 Mar, 2014

1 commit

70335abb2 fs/proc/base.c: fix GPF in /proc/$PID/map_files ... Browse Code »

The expected logic of proc_map_files_get_link() is either to return 0
and initialize 'path' or return an error and leave 'path' uninitialized.

By the time dname_to_vma_addr() returns 0 the corresponding vma may have
already be gone. In this case the path is not initialized but the
return value is still 0. This results in 'general protection fault'
inside d_path().

Steps to reproduce:

CONFIG_CHECKPOINT_RESTORE=y

fd = open(...);
while (1) {
mmap(fd, ...);
munmap(fd, ...);
}

ls -la /proc/$PID/map_files

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991

Signed-off-by: Artem Fetishev
Signed-off-by: Aleksandr Terekhov
Reported-by:
Acked-by: Pavel Emelyanov
Acked-by: Cyrill Gorcunov
Reviewed-by: "Eric W. Biederman"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Artem Fetishev
2014-03-11 08:26:20 +0800

04 Mar, 2014

1 commit

668f9abbd mm: close PageTail race ... Browse Code »

Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes
Cc: Holger Kiehl
Cc: Christoph Lameter
Cc: Rafael Aquini
Cc: Vlastimil Babka
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: "Kirill A. Shutemov"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-03-04 23:55:47 +0800

11 Feb, 2014

1 commit

38dfac843 vmcore: prevent PT_NOTE p_memsz overflow during header update ... Browse Code »
2

Currently, update_note_header_size_elf64() and
update_note_header_size_elf32() will add the size of a PT_NOTE entry to
real_sz even if that causes real_sz to exceeds max_sz. This patch
corrects the while loop logic in those routines to ensure that does not
happen and prints a warning if a PT_NOTE entry is dropped. If zero
PT_NOTE entries are found or this condition is encountered because the
only entry was dropped, a warning is printed and an error is returned.

One possible negative side effect of exceeding the max_sz limit is an
allocation failure in merge_note_headers_elf64() or
merge_note_headers_elf32() which would produce console output such as
the following while booting the crash kernel.

vmalloc: allocation failure: 14076997632 bytes
swapper/0: page allocation failure: order:0, mode:0x80d2
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7
Call Trace:
dump_stack+0x19/0x1b
warn_alloc_failed+0xf0/0x160
__vmalloc_node_range+0x19e/0x250
vmalloc_user+0x4c/0x70
merge_note_headers_elf64.constprop.9+0x116/0x24a
vmcore_init+0x2d4/0x76c
do_one_initcall+0xe2/0x190
kernel_init_freeable+0x17c/0x207
kernel_init+0xe/0x180
ret_from_fork+0x7c/0xb0

Kdump: vmcore not initialized

kdump: dump target is /dev/sda4
kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/
kdump: saving vmcore-dmesg.txt
Cannot open /proc/vmcore: No such file or directory
kdump: saving vmcore-dmesg.txt failed
kdump: saving vmcore
kdump: saving vmcore failed

This type of failure has been seen on a four socket prototype system
with certain memory configurations. Most PT_NOTE sections have a single
entry similar to:

n_namesz = 0x5
n_descsz = 0x150
n_type = 0x1

Occasionally, a second entry is encountered with very large n_namesz and
n_descsz sizes:

n_namesz = 0x80000008
n_descsz = 0x510ae163
n_type = 0x80000008

Not yet sure of the source of these extra entries, they seem bogus, but
they shouldn't cause crash dump to fail.

Signed-off-by: Greg Pearson
Acked-by: Vivek Goyal
Cc: HATAYAMA Daisuke
Cc: Michael Holzheu
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Greg Pearson
2014-02-11 08:01:40 +0800

24 Jan, 2014

10 commits

185ee40ee fs/proc/array.c: change do_task_stat() to use while_each_thread() ... Browse Code »

Change the remaining next_thread (ab)users to use while_each_thread().

The last user which should be changed is next_tid(), but we can't do this
now.

__exit_signal() and complete_signal() are fine, they actually need
next_thread() logic.

This patch (of 3):

do_task_stat() can use while_each_thread(), no changes in
the compiled code.

Signed-off-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Al Viro
Cc: Kees Cook
Reviewed-by: Sameer Nanda
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:02 +0800
abaf3787a fs/proc: don't use module_init for non-modular core code ... Browse Code »

PROC_FS is a bool, so this code is either present or absent. It will
never be modular, so using module_init as an alias for __initcall is
rather misleading.

Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h to
obviously non-modular code, and that would be ugly at best.

Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change these registrations from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.

Also note that this change uncovers a missing semicolon bug in the
registration of vmcore_init as an initcall.

Signed-off-by: Paul Gortmaker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Gortmaker
2014-01-24 08:37:02 +0800
c1d867a54 fs/proc/proc_devtree.c: remove empty /proc/device-tree when no openfirmware exists. ... Browse Code »

Distribution kernels might want to build in support for /proc/device-tree
for kernels that might end up running on hardware that doesn't support
openfirmware. This results in an empty /proc/device-tree existing.
Remove it if the OFW root node doesn't exist.

This situation actually confuses grub2, resulting in install failures.
grub2 sees the /proc/device-tree and picks the wrong install target cf.
http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311
grub should be more robust, but still, leaving an empty proc dir seems
pointless.

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378.

Signed-off-by: Dave Jones
Cc: Al Viro
Cc: Paul Mackerras
Cc: Josh Boyer
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Jones
2014-01-24 08:37:01 +0800
cdf7e8dde proc: set attributes of pde using accessor functions ... Browse Code »

Use existing accessors proc_set_user() and proc_set_size() to set
attributes. Just a cleanup.

Signed-off-by: Rui Xiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rui Xiang
2014-01-24 08:37:01 +0800
9f6e963f0 proc: fix ->f_pos overflows in first_tid() ... Browse Code »

1. proc_task_readdir()->first_tid() path truncates f_pos to int, this
is wrong even on 64bit.

We could check that f_pos < PID_MAX or even INT_MAX in
proc_task_readdir(), but this patch simply checks the potential
overflow in first_tid(), this check is nop on 64bit. We do not care if
it was negative and the new unsigned value is huge, all we need to
ensure is that we never wrongly return !NULL.

2. Remove the 2nd "nr != 0" check before get_nr_threads(),
nr_threads == 0 is not distinguishable from !pid_task() above.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Michal Hocko
Cc: Sameer Nanda
Cc: Sergey Dyasly
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800
d855a4b79 proc: don't (ab)use ->group_leader in proc_task_readdir() paths ... Browse Code »

proc_task_readdir() does not really need "leader", first_tid() has to
revalidate it anyway. Just pass proc_pid(inode) to first_tid() instead,
it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if
necessary.

The patch also extracts the "inode is dead" code from
pid_delete_dentry(dentry) into the new trivial helper,
proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT
if this dir was removed.

This is a bit racy, but the race is very inlikely and the getdents() after
openndir() can see the empty "." + ".." dir only once.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Michal Hocko
Cc: Sameer Nanda
Cc: Sergey Dyasly
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800
c986c14a6 proc: change first_tid() to use while_each_thread() rather than next_thread() ... Browse Code »

Rerwrite the main loop to use while_each_thread() instead of
next_thread(). We are going to fix or replace while_each_thread(),
next_thread() should be avoided whenever possible.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Michal Hocko
Cc: Sameer Nanda
Cc: Sergey Dyasly
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800
940fe4793 proc: fix the potential use-after-free in first_tid() ... Browse Code »

proc_task_readdir() verifies that the result of get_proc_task() is
pid_alive() and thus its ->group_leader is fine too. However this is not
necessarily true after rcu_read_unlock(), we need to recheck this again
after first_tid() does rcu_read_lock(). Otherwise
leader->thread_group.next (used by next_thread()) can be invalid if the
rcu grace period expires in between.

The race is subtle and unlikely, but still it is possible afaics. To
simplify lets ignore the "likely" case when tid != 0, f_version can be
cleared by proc_task_operations->llseek().

Suppose we have a main thread M and its subthread T. Suppose that f_pos
== 3, iow first_tid() should return T. Now suppose that the following
happens between rcu_read_unlock() and rcu_read_lock():

1. T execs and becomes the new leader. This removes M from
->thread_group but next_thread(M) is still T.

2. T creates another thread X which does exec as well, T
goes away.

3. X creates another subthread, this increments nr_threads.

4. first_tid() does next_thread(M) and returns the already
dead T.

Note also that we need 2. and 3. only because of get_nr_threads() check,
and this check was supposed to be optimization only.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Michal Hocko
Cc: Sameer Nanda
Cc: Sergey Dyasly
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800
74e37200d proc: cleanup/simplify get_task_state/task_state_array ... Browse Code »

get_task_state() and task_state_array[] look confusing and suboptimal, it
is not clear what it can actually report to user-space and
task_state_array[] blows .data for no reason.

1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
clear. TASK_REPORT is self-documenting but it is not clear
what ->exit_state can add.

Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
into TASK_REPORT and use it to calculate the final result.

2. With the change above it is obvious that task_state_array[]
has the unused entries just to make BUILD_BUG_ON() happy.

Change this BUILD_BUG_ON() to use TASK_REPORT rather than
TASK_STATE_MAX and shrink task_state_array[].

3. Turn the "while (state)" loop into fls(state).

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: David Laight
Cc: Geert Uytterhoeven
Cc: Ingo Molnar
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800
e3bba3c3c fs/proc/page.c: add PageAnon check to surely detect thp ... Browse Code »

stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
know that a specified page is thp or not. But sometimes it's not enough
and we fail to detect thp when the thp is on pagevec. This happens only
for a few seconds after LRU list operations, but it makes it difficult
to control our applications depending on this flag.

So this patch adds another check PageAnon to detect thps on pagevec. It
might not give the future extensibility for thp pagecache, but it's OK
at least for now.

Signed-off-by: Naoya Horiguchi
Cc: David Rientjes
Cc: KOSAKI Motohiro
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2014-01-24 08:36:50 +0800

22 Jan, 2014

1 commit

34e431b0a /proc/meminfo: provide estimated available memory ... Browse Code »

Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available. They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is
pretty much guaranteed to be wrong today.

It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction
of system memory on mostly idle systems with lots of files.

Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.

However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.

It is more convenient to provide such an estimate in /proc/meminfo. If
things change in the future, we only have to change it in one place.

Signed-off-by: Rik van Riel
Reported-by: Erik Mouw
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2014-01-22 08:19:43 +0800

13 Dec, 2013

1 commit

ae5758a1a procfs: also fix proc_reg_get_unmapped_area() for !MMU case ... Browse Code »

Commit fad1a86e25e0 ("procfs: call default get_unmapped_area on
MMU-present architectures"), as its title says, took care of only the
MMU case, leaving the !MMU side still in the regressed state (returning
-EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).

From the fad1a86e25e0 changelog:

"Commit c4fe24485729 ("sparc: fix PCI device proc file mmap(2)") added
proc_reg_get_unmapped_area in proc_reg_file_ops and
proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
get_unmapped_area method is not defined for the target procfs file, which
causes regression of mmap on /proc/vmcore.

To address this issue, like get_unmapped_area(), call default
current->mm->get_unmapped_area on MMU-present architectures if
pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation
in the procfs file, is not defined"

Signed-off-by: Jan Beulich
Cc: HATAYAMA Daisuke
Cc: Alexey Dobriyan
Cc: David S. Miller
Cc: [3.12.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Beulich
2013-12-13 10:19:26 +0800

22 Nov, 2013

1 commit

3eaded86a Merge git://git.infradead.org/users/eparis/audit ... Browse Code »

Pull audit updates from Eric Paris:
"Nothing amazing. Formatting, small bug fixes, couple of fixes where
we didn't get records due to some old VFS changes, and a change to how
we collect execve info..."

Fixed conflict in fs/exec.c as per Eric and linux-next.

* git://git.infradead.org/users/eparis/audit: (28 commits)
audit: fix type of sessionid in audit_set_loginuid()
audit: call audit_bprm() only once to add AUDIT_EXECVE information
audit: move audit_aux_data_execve contents into audit_context union
audit: remove unused envc member of audit_aux_data_execve
audit: Kill the unused struct audit_aux_data_capset
audit: do not reject all AUDIT_INODE filter types
audit: suppress stock memalloc failure warnings since already managed
audit: log the audit_names record type
audit: add child record before the create to handle case where create fails
audit: use given values in tty_audit enable api
audit: use nlmsg_len() to get message payload length
audit: use memset instead of trying to initialize field by field
audit: fix info leak in AUDIT_GET requests
audit: update AUDIT_INODE filter rule to comparator function
audit: audit feature to set loginuid immutable
audit: audit feature to only allow unsetting the loginuid
audit: allow unsetting the loginuid (with priv)
audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
audit: loginuid functions coding style
selinux: apply selinux checks on new audit message types
...

Linus Torvalds
2013-11-22 11:18:14 +0800

16 Nov, 2013

1 commit

b26d4cd38 consolidate simple ->d_delete() instances ... Browse Code »

Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates

Signed-off-by: Al Viro

Al Viro
2013-11-16 11:04:17 +0800

15 Nov, 2013

1 commit

652586df9 seq_file: remove "%n" usage from seq_file users ... Browse Code »

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa
Signed-off-by: Kees Cook
Cc: Joe Perches
Cc: David Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tetsuo Handa
2013-11-15 08:32:20 +0800