Eric Lee / smarc-fsl-linux-kernel

20 Jan, 2017

1 commit

00cf64fba sysctl: Drop reference added by grab_header in proc_sys_readdir ... Browse Code »

commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream.

Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference
added by grab_header when return from !dir_emit_dots path.
It can cause any path called unregister_sysctl_table will
wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265] [] schedule+0x3f/0xa0
[ 5535.968817] [] schedule_timeout+0x3db/0x6f0
[ 5535.975346] [] ? wait_for_completion+0x45/0x130
[ 5535.982256] [] wait_for_completion+0xc3/0x130
[ 5535.988972] [] ? wake_up_q+0x80/0x80
[ 5535.994804] [] drop_sysctl_table+0xc4/0xe0
[ 5536.001227] [] drop_sysctl_table+0x77/0xe0
[ 5536.007648] [] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654] [] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657] [] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344] [] partition_sched_domains+0x44/0x450
[ 5536.036447] [] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844] [] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336] [] update_flag+0x11d/0x210
[ 5536.057373] [] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186] [] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899] [] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420] [] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234] [] ? css_killed_work_fn+0x25/0x220
[ 5536.091049] [] cpuset_css_offline+0x35/0x60
[ 5536.097571] [] css_killed_work_fn+0x5c/0x220
[ 5536.104207] [] process_one_work+0x1df/0x710
[ 5536.110736] [] ? process_one_work+0x160/0x710
[ 5536.117461] [] worker_thread+0x12b/0x4a0
[ 5536.123697] [] ? process_one_work+0x710/0x710
[ 5536.130426] [] kthread+0xfe/0x120
[ 5536.135991] [] ret_from_fork+0x1f/0x40
[ 5536.142041] [] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex. The offlining
ends up trying to drain active usages of a sysctl table which apprently
is not happening."
The real reason is that proc_sys_readdir doesn't drop reference added
by grab_header when return from !dir_emit_dots path. So this cpuset
offline path will wait here forever.

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Reported-by: CAI Qian
Tested-by: Yang Shukui
Signed-off-by: Zhou Chengming
Acked-by: Al Viro
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Zhou Chengming
2017-01-20 03:18:04 +0800

28 Oct, 2016

1 commit

06b2849d1 proc: fix NULL dereference when reading /proc/<pid>/auxv ... Browse Code »

Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL. Fix that by checking for NULL mm
and bailing out early. This is also the original behavior changed by
recent commit c5317167854e ("proc: switch auxv to use of __mem_open()").

# cat /proc/2/auxv
Unable to handle kernel NULL pointer dereference at virtual address 000000a8
Internal error: Oops: 17 [#1] PREEMPT SMP ARM
CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
Hardware name: BCM2709
task: ea3b0b00 task.stack: e99b2000
PC is at auxv_read+0x24/0x4c
LR is at do_readv_writev+0x2fc/0x37c
Process cat (pid: 113, stack limit = 0xe99b2210)
Call chain:
auxv_read
do_readv_writev
vfs_readv
default_file_splice_read
splice_direct_to_actor
do_splice_direct
do_sendfile
SyS_sendfile64
ret_fast_syscall

Fixes: c5317167854e ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu
Acked-by: Oleg Nesterov
Acked-by: Michal Hocko
Cc: Al Viro
Cc: Kees Cook
Cc: John Stultz
Cc: Mateusz Guzik
Cc: Janis Danisevskis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Leon Yu
2016-10-28 09:43:43 +0800

25 Oct, 2016

1 commit

272ddc8b3 proc: don't use FOLL_FORCE for reading cmdline and environment ... Browse Code »

Now that Lorenzo cleaned things up and made the FOLL_FORCE users
explicit, it becomes obvious how some of them don't really need
FOLL_FORCE at all.

So remove FOLL_FORCE from the proc code that reads the command line and
arguments from user space.

The mem_rw() function actually does want FOLL_FORCE, because gdd (and
possibly many other debuggers) use it as a much more convenient version
of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
conditional on actually being a ptracer. This does not actually do
that, just moves adds a comment to that effect and moves the gup_flags
settings next to each other.

Signed-off-by: Linus Torvalds

Linus Torvalds
2016-10-25 10:00:44 +0800

23 Oct, 2016

1 commit

86c5bf710 Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull vmap stack fixes from Ingo Molnar:
"This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
accesses that used to be just somewhat questionable are now totally
buggy.

These changes try to do it without breaking the ABI: the fields are
left there, they are just reporting zero, or reporting narrower
information (the maps file change)"

* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
fs/proc: Stop trying to report thread stacks
fs/proc: Stop reporting eip and esp in /proc/PID/stat
mm/numa: Remove duplicated include from mprotect.c

Linus Torvalds
2016-10-23 00:39:10 +0800

20 Oct, 2016

2 commits

b18cb64ea fs/proc: Stop trying to report thread stacks ... Browse Code »

This reverts more of:

b76437579d13 ("procfs: mark thread stack correctly in proc//maps")

... which was partially reverted by:

65376df58217 ("proc: revert /proc//maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA. This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range. This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn
Signed-off-by: Andy Lutomirski
Acked-by: Thomas Gleixner
Cc: Al Viro
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Johannes Weiner
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Linux API
Cc: Peter Zijlstra
Cc: Tycho Andersen
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar

Andy Lutomirski
2016-10-20 15:21:41 +0800
0a1eb2d47 fs/proc: Stop reporting eip and esp in /proc/PID/stat ... Browse Code »

Reporting these fields on a non-current task is dangerous. If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures. (x86_64
used to do this. I bet lots of architectures still do.) With
CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Reported-by: Jann Horn
Signed-off-by: Andy Lutomirski
Acked-by: Thomas Gleixner
Cc: Al Viro
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Linux API
Cc: Peter Zijlstra
Cc: Tetsuo Handa
Cc: Tycho Andersen
Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar

Andy Lutomirski
2016-10-20 15:21:41 +0800

19 Oct, 2016

1 commit

6347e8d5b mm: replace access_remote_vm() write parameter with gup_flags ... Browse Code »

This removes the 'write' argument from access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Lorenzo Stoakes
2016-10-19 23:12:14 +0800

11 Oct, 2016

3 commits

101105b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning

Linus Torvalds
2016-10-11 11:16:43 +0800
3873691e5 Merge remote-tracking branch 'ovl/rename2' into for-linus Browse Code »

Al Viro
2016-10-11 11:02:51 +0800
abb5a14fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.

There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...

Linus Torvalds
2016-10-11 04:04:49 +0800

08 Oct, 2016

10 commits

e55f1d1d1 Merge remote-tracking branch 'jk/vfs' into work.misc Browse Code »

Al Viro
2016-10-08 23:06:08 +0800
81243eacf cred: simpler, 1D supplementary groups ... Browse Code »

Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is
Cc: Vasily Kulikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-10-08 09:46:30 +0800
855af072b mm, proc: fix region lost in /proc/self/smaps ... Browse Code »

Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:

https://bugzilla.redhat.com/show_bug.cgi?id=1365721

Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems. This simple test case abstracted from nvml can easily
reproduce this bug in common environment:

-------------------------- testcase.c -----------------------------

int
is_pmem_proc(const void *addr, size_t len)
{
const char *caddr = addr;

FILE *fp;
if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
printf("!/proc/self/smaps");
return 0;
}

int retval = 0; /* assume false until proven otherwise */
char line[PROCMAXLEN]; /* for fgets() */
char *lo = NULL; /* beginning of current range in smaps file */
char *hi = NULL; /* end of current range in smaps file */
int needmm = 0; /* looking for mm flag for current range */
while (fgets(line, PROCMAXLEN, fp) != NULL) {
static const char vmflags[] = "VmFlags:";
static const char mm[] = " wr";

/* check for range line */
if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
if (needmm) {
/* last range matched, but no mm flag found */
printf("never found mm flag.\n");
break;
} else if (caddr < lo) {
/* never found the range for caddr */
printf("#######no match for addr %p.\n", caddr);
break;
} else if (caddr < hi) {
/* start address is in this range */
size_t rangelen = (size_t)(hi - caddr);

/* remember that matching has started */
needmm = 1;

/* calculate remaining range to search for */
if (len > rangelen) {
len -= rangelen;
caddr += rangelen;
printf("matched %zu bytes in range "
"%p-%p, %zu left over.\n",
rangelen, lo, hi, len);
} else {
len = 0;
printf("matched all bytes in range "
"%p-%p.\n", lo, hi);
}
}
} else if (needmm && strncmp(line, vmflags,
sizeof(vmflags) - 1) == 0) {
if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
printf("mm flag found.\n");
if (len == 0) {
/* entire range matched */
retval = 1;
break;
}
needmm = 0; /* saw what was needed */
} else {
/* mm flag not set for some or all of range */
printf("range has no mm flag.\n");
break;
}
}
}

fclose(fp);

printf("returning %d.\n", retval);
return retval;
}

void *Addr;
size_t Size;

/*
* worker -- the work each thread performs
*/
static void *
worker(void *arg)
{
int *ret = (int *)arg;
*ret = is_pmem_proc(Addr, Size);
return NULL;
}

int main(int argc, char *argv[])
{
if (argc < 2 || argc > 3) {
printf("usage: %s file [env].\n", argv[0]);
return -1;
}

int fd = open(argv[1], O_RDWR);

struct stat stbuf;
fstat(fd, &stbuf);

Size = stbuf.st_size;
Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

close(fd);

pthread_t threads[NTHREAD];
int ret[NTHREAD];

/* kick off NTHREAD threads */
for (int i = 0; i < NTHREAD; i++)
pthread_create(&threads[i], NULL, worker, &ret[i]);

/* wait for all the threads to complete */
for (int i = 0; i < NTHREAD; i++)
pthread_join(threads[i], NULL);

/* verify that all the threads return the same value */
for (int i = 1; i < NTHREAD; i++) {
if (ret[0] != ret[i]) {
printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
ret[0], ret[i]);
}
}

printf("%d", ret[0]);
return 0;
}

It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process

It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:

if (last_addr) {
vma = find_vma(mm, last_addr);
if (vma && (vma = m_next_vma(priv, vma)))
return vma;
}

However, VMA will be lost if the last VMA is gone, e.g:

The process VMA list is A->B->C->D

CPU 0 CPU 1
read() system call
handle VMA B
version = B
return to userspace

unmap VMA B

issue read() again to continue to get
the region info
find_vma(version) will get VMA C
m_next_vma(C) will get VMA D
handle D
!!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA. m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma. This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk. This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.

[mhocko@suse.com: changelog fixes]
Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
Acked-by: Dave Hansen
Signed-off-by: Xiao Guangrong
Signed-off-by: Robert Hu
Acked-by: Michal Hocko
Acked-by: Oleg Nesterov
Cc: Paolo Bonzini
Cc: Dan Williams
Cc: Gleb Natapov
Cc: Marcelo Tosatti
Cc: Stefan Hajnoczi
Cc: Ross Zwisler
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robert Ho
2016-10-08 09:46:30 +0800
4b2bd5fec proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self ... Browse Code »

In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
== current, but the CAP_SYS_NICE doesn't.

Thus while the previous commit was intended to loosen the needed
privileges to modify a processes timerslack, it needlessly restricted a
task modifying its own timerslack via the proc//timerslack_ns
(which is permitted also via the PR_SET_TIMERSLACK method).

This patch corrects this by checking if p == current before checking the
CAP_SYS_NICE value.

This patch applies on top of my two previous patches currently in -mm

Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz
Acked-by: Kees Cook
Cc: "Serge E. Hallyn"
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Todd Kjos
Cc: Colin Cross
Cc: Nick Kralevich
Cc: Dmitry Shmidt
Cc: Elliott Hughes
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Stultz
2016-10-08 09:46:30 +0800
904763e1f proc: add LSM hook checks to /proc/<tid>/timerslack_ns ... Browse Code »

As requested, this patch checks the existing LSM hooks
task_getscheduler/task_setscheduler when reading or modifying the task's
timerslack value.

Previous versions added new get/settimerslack LSM hooks, but since they
checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
suggested we just use the existing ones.

Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz
Cc: Kees Cook
Cc: "Serge E. Hallyn"
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Todd Kjos
Cc: Colin Cross
Cc: Nick Kralevich
Cc: Dmitry Shmidt
Cc: Elliott Hughes
Cc: James Morris
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Stultz
2016-10-08 09:46:30 +0800
7abbaf940 proc: relax /proc/<tid>/timerslack_ns capability requirements ... Browse Code »

When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.

So CAP_SYS_PTRACE was adopted instead for what became the
/proc//timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory. This is considered too high
a privilege level for only needing to change the timerslack.

After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.

So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc//timerslack_ns seems sufficient.

Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.

Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz
Reviewed-by: Nick Kralevich
Acked-by: Serge Hallyn
Acked-by: Kees Cook
Cc: Kees Cook
Cc: "Serge E. Hallyn"
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Todd Kjos
Cc: Colin Cross
Cc: Nick Kralevich
Cc: Dmitry Shmidt
Cc: Elliott Hughes
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Stultz
2016-10-08 09:46:30 +0800
e16e2d8e1 meminfo: break apart a very long seq_printf with #ifdefs ... Browse Code »

Use a specific routine to emit most lines so that the code is easier to
read and maintain.

akpm:
text data bss dec hex filename
2976 8 0 2984 ba8 fs/proc/meminfo.o before
2669 8 0 2677 a75 fs/proc/meminfo.o after

Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
Signed-off-by: Joe Perches
Cc: Andi Kleen
Cc: Alexey Dobriyan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-10-08 09:46:30 +0800
75ba1d07f seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char ... Browse Code »

Allow some seq_puts removals by taking a string instead of a single
char.

[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-10-08 09:46:30 +0800
f7a5f132b proc: faster /proc/*/status ... Browse Code »

top(1) opens the following files for every PID:

/proc/*/stat
/proc/*/statm
/proc/*/status

This patch switches /proc/*/status away from seq_printf().
The result is 13.5% speedup.

Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
12 context-switches # 0.001 K/sec ( +- 1.09% )
1 cpu-migrations # 0.000 K/sec
104 page-faults # 0.010 K/sec ( +- 0.45% )
37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
65,632,764,147 instructions # 1.75 insn per cycle
# 0.13 stalled cycles per insn ( +- 0.00% )
13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

11.263885428 seconds time elapsed ( +- 0.04% )
^^^^^^^^^^^^

AFTER
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
11 context-switches # 0.001 K/sec ( +- 1.54% )
1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
103 page-faults # 0.011 K/sec ( +- 0.60% )
32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
56,012,163,567 instructions # 1.73 insn per cycle
# 0.14 stalled cycles per insn ( +- 0.00% )
11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

9.741247736 seconds time elapsed ( +- 0.07% )
^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
Signed-off-by: Alexey Dobriyan
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-10-08 09:46:30 +0800
0f30206bf fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious ... Browse Code »

Trying to walk all of virtual memory requires architecture specific
knowledge. On x86_64, addresses must be sign extended from bit 48,
whereas on arm64 the top VA_BITS of address space have their own set of
page tables.

clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
provides a test_walk() callback that only expects to be walking over
VMAs. Currently walk_pmd_range() will skip memory regions that don't
have a VMA, reporting them as a hole.

As this call only expects to walk user address space, make it walk 0 to
'highest_vm_end'.

Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
Signed-off-by: James Morse
Acked-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

James Morse
2016-10-08 09:46:28 +0800

07 Oct, 2016

1 commit

14986a34e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.

The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.

To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.

Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.

There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.

These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...

Linus Torvalds
2016-10-07 00:52:23 +0800

06 Oct, 2016

2 commits

c53171678 proc: switch auxv to use of __mem_open() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2016-10-06 06:43:43 +0800
687ee0ad4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
co. at Google. https://lwn.net/Articles/701165/

2) Do TCP Small Queues for retransmits, from Eric Dumazet.

3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
Starovoitov.

4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

7) Support ndo_poll_controller in mlx5, from Calvin Owens.

8) Move VRF processing to an output hook and allow l3mdev to be
loopback, from David Ahern.

9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

10) Congestion control in RXRPC, from David Howells.

11) Support geneve RX offload in ixgbe, from Emil Tantilov.

12) When hitting pressure for new incoming TCP data SKBs, perform a
partial rathern than a full purge of the OFO queue (which could be
huge). From Eric Dumazet.

13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

14) Support RX network flow classification to igb, from Gangfeng Huang.

15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

16) New skbmod packet action, from Jamal Hadi Salim.

17) Remove some inefficiencies in snmp proc output, from Jia He.

18) Add FIB notifications to properly propagate route changes to
hardware which is doing forwarding offloading. From Jiri Pirko.

19) New dsa driver for qca8xxx chips, from John Crispin.

20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
Żenczykowski.

21) Add L3 mode to ipvlan, from Mahesh Bandewar.

22) Support 802.1ad in mlx4, from Moshe Shemesh.

23) Support hardware LRO in mediatek driver, from Nelson Chang.

24) Add TC offloading to mlx5, from Or Gerlitz.

25) Convert various drivers to ethtool ksettings interfaces, from
Philippe Reynes.

26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

27) NAPI support for ath10k, from Rajkumar Manoharan.

28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

29) UDP replicast support in TIPC, from Richard Alpe.

30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

31) Support BQL in thunderx driver, from Sunil Goutham.

32) TSO support in alx driver, from Tobias Regnery.

33) Add stream parser engine and use it in kcm.

34) Support async DHCP replies in ipconfig module, from Uwe
Kleine-König.

35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
mlxsw: switchx2: Fix misuse of hard_header_len
mlxsw: spectrum: Fix misuse of hard_header_len
net/faraday: Stop NCSI device on shutdown
net/ncsi: Introduce ncsi_stop_dev()
net/ncsi: Rework the channel monitoring
net/ncsi: Allow to extend NCSI request properties
net/ncsi: Rework request index allocation
net/ncsi: Don't probe on the reserved channel ID (0x1f)
net/ncsi: Introduce NCSI_RESERVED_CHANNEL
net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
net: Add netdev all_adj_list refcnt propagation to fix panic
net: phy: Add Edge-rate driver for Microsemi PHYs.
vmxnet3: Wake queue from reset work
i40e: avoid NULL pointer dereference and recursive errors on early PCI error
qed: Add RoCE ll2 & GSI support
qed: Add support for memory registeration verbs
qed: Add support for QP verbs
qed: PD,PKEY and CQ verb support
qed: Add support for RoCE hw init
qede: Add qedr framework
...

Linus Torvalds
2016-10-06 01:11:24 +0800

30 Sep, 2016

1 commit

d7e25c66c Merge branch 'x86/urgent' into x86/asm ... Browse Code »

Get the cr4 fixes so we can apply the final cleanup

Thomas Gleixner
2016-09-30 18:38:28 +0800

28 Sep, 2016

3 commits

078cd8279 fs: Replace CURRENT_TIME with current_time() for inode timestamps ... Browse Code »

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani
Reviewed-by: Arnd Bergmann
Acked-by: Felipe Balbi
Acked-by: Steven Whitehouse
Acked-by: Ryusuke Konishi
Acked-by: David Sterba
Signed-off-by: Al Viro

Deepa Dinamani
2016-09-28 09:06:21 +0800
2554c72ed fs: proc: Delete inode time initializations in proc_alloc_inode() ... Browse Code »

proc uses new_inode_pseudo() to allocate a new inode.
This in turn calls the proc_inode_alloc() callback.
But, at this point, inode is still not initialized
with the super_block pointer which only happens just
before alloc_inode() returns after the call to
inode_init_always().

Also, the inode times are initialized again after the
call to new_inode_pseudo() in proc_inode_alloc().
The assignemet in proc_alloc_inode() is redundant and
also doesn't work after the current_time() api is
changed to take struct inode* instead of
struct *super_block.

This bug was reported after current_time() was used to
assign times in proc_alloc_inode().

Signed-off-by: Deepa Dinamani
Reported-by: Fengguang Wu [0-day test robot]
Reviewed-by: Arnd Bergmann
Signed-off-by: Al Viro

Deepa Dinamani
2016-09-28 09:06:20 +0800
771187d61 proc: unsigned file descriptors ... Browse Code »

Make struct proc_inode::fd unsigned.

This allows better code generation on x86_64 (less sign extensions).

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2016-09-28 06:47:38 +0800

23 Sep, 2016

1 commit

d6989d4bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2016-09-23 18:46:57 +0800

22 Sep, 2016

1 commit

31051c85b fs: Give dentry to inode_change_ok() instead of inode ... Browse Code »

inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara

Jan Kara
2016-09-22 16:56:19 +0800

21 Sep, 2016

2 commits

df04abfd1 fs/proc/kcore.c: Add bounce buffer for ktext data ... Browse Code »

We hit hardened usercopy feature check for kernel text access by reading
kcore file:

usercopy: kernel memory exposure attempt detected from ffffffff8179a01f () (4065 bytes)
kernel BUG at mm/usercopy.c:75!

Bypassing this check for kcore by adding bounce buffer for ktext data.

Reported-by: Steve Best
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Suggested-by: Kees Cook
Signed-off-by: Jiri Olsa
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds

Jiri Olsa
2016-09-21 04:32:49 +0800
f5beeb185 fs/proc/kcore.c: Make bounce buffer global for read ... Browse Code »

Next patch adds bounce buffer for ktext area, so it's
convenient to have single bounce buffer for both
vmalloc/module and ktext cases.

Suggested-by: Linus Torvalds
Signed-off-by: Jiri Olsa
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds

Jiri Olsa
2016-09-21 04:32:49 +0800

15 Sep, 2016

1 commit

d4b80afbb Merge branch 'linus' into x86/asm, to pick up recent fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-09-15 14:24:53 +0800

13 Sep, 2016

1 commit

b20b378d4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller

David S. Miller
2016-09-13 06:52:44 +0800

11 Sep, 2016

1 commit

98ac9a608 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"nvdimm fixes for v4.8, two of them are tagged for -stable:

- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
DAX pmd mappings end up with an uncached pgprot, and unusable
performance for the device-dax interface. The device-dax interface
appeared in 4.7 so this is tagged for -stable.

- Fix a couple VM_BUG_ON() checks in the show_smaps() path to
understand DAX pmd entries. This fix is tagged for -stable.

- Fix a mis-merge of the nfit machine-check handler to flip the
polarity of an if() to match the final version of the patch that
Vishal sent for 4.8-rc1. Without this the nfit machine check
handler never detects / inserts new 'badblocks' entries which
applications use to identify lost portions of files.

- For test purposes, fix the nvdimm_clear_poison() path to operate on
legacy / simulated nvdimm memory ranges. Without this fix a test
can set badblocks, but never clear them on these ranges.

- Fix the range checking done by dax_dev_pmd_fault(). This is not
tagged for -stable since this problem is mitigated by specifying
aligned resources at device-dax setup time.

These patches have appeared in a next release over the past week. The
recent rebase you can see in the timestamps was to drop an invalid fix
as identified by the updated device-dax unit tests [1]. The -mm
touches have an ack from Andrew"

[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: allow legacy (e820) pmem region to clear bad blocks
nfit, mce: Fix SPA matching logic in MCE handler
mm: fix cache mode of dax pmd mappings
mm: fix show_smap() for zone_device-pmd ranges
dax: fix mapping size check

Linus Torvalds
2016-09-11 00:58:52 +0800

10 Sep, 2016

1 commit

ca120cf68 mm: fix show_smap() for zone_device-pmd ranges ... Browse Code »

Attempting to dump /proc//smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:

kernel BUG at mm/huge_memory.c:1105!
task: ffff88045f16b140 task.stack: ffff88045be14000
RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
[..]
Call Trace:
[] smaps_pte_range+0xa0/0x4b0
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

kernel BUG at fs/proc/task_mmu.c:585!
RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
Call Trace:
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.

Cc: Ross Zwisler
Cc: Kirill A. Shutemov
Acked-by: Andrew Morton
Signed-off-by: Dan Williams

Dan Williams
2016-09-10 08:34:45 +0800

02 Sep, 2016

1 commit

511a8cdb6 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit ... Browse Code »

Pull audit fixes from Paul Moore:
"Two small patches to fix some bugs with the audit-by-executable
functionality we introduced back in v4.3 (both patches are marked
for the stable folks)"

* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
audit: fix exe_file access in audit_exe_compare
mm: introduce get_task_exe_file

Linus Torvalds
2016-09-02 06:55:56 +0800

01 Sep, 2016

1 commit

cd81a9170 mm: introduce get_task_exe_file ... Browse Code »

For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik
Acked-by: Konstantin Khlebnikov
Acked-by: Richard Guy Briggs
Cc: # 4.3.x
Signed-off-by: Paul Moore

Mateusz Guzik
2016-09-01 04:11:20 +0800

19 Aug, 2016

1 commit

8b927d734 proc: Fix return address printk conversion specifer in /proc/<pid>/stack ... Browse Code »

When printing call return addresses found on a stack, /proc//stack
can sometimes give a confusing result. If the call instruction was the
last instruction in the function (which can happen when calling a
noreturn function), '%pS' will incorrectly display the name of the
function which happens to be next in the object code, rather than the
name of the actual calling function.

Use '%pB' instead, which was created for this exact purpose.

Signed-off-by: Josh Poimboeuf
Cc: Andy Lutomirski
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Byungchul Park
Cc: Denys Vlasenko
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Nilay Vaish
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/47ad2821e5ebdbed1fbf83fb85424ae4fbdf8b6e.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar

Josh Poimboeuf
2016-08-19 00:41:32 +0800

18 Aug, 2016

1 commit

60747ef4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Minor overlapping changes for both merge conflicts.

Resolution work done by Stephen Rothwell was used
as a reference.

Signed-off-by: David S. Miller

David S. Miller
2016-08-18 13:17:32 +0800

15 Aug, 2016

1 commit

e79c6a4fc net: make net namespace sysctls belong to container's owner ... Browse Code »

If net namespace is attached to a user namespace let's make container's
root owner of sysctls affecting said network namespace instead of global
root.

This also allows us to clean up net_ctl_permissions() because we do not
need to fudge permissions anymore for the container's owner since it now
owns the objects in question.

Acked-by: "Eric W. Biederman"
Signed-off-by: Dmitry Torokhov
Signed-off-by: David S. Miller

Dmitry Torokhov
2016-08-15 12:08:58 +0800