12 Oct, 2016
40 commits
-
Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a
physical address to a virtual one using __va(). However, such physical
addresses may sometimes be located in highmem and using __va() is
incorrect, leading to inconsistent object tracking in kmemleak.

The following functions have been added to the kmemleak API and they take
a physical address as the object pointer. They only perform the
corresponding action if the address has a lowmem mapping:

kmemleak_alloc_phys
kmemleak_free_part_phys
kmemleak_not_leak_phys
kmemleak_ignore_phys

The affected calling places have been updated to use the new kmemleak
API.

Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.com
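
To illustrate the lowmem check described in the kmemleak message above, here is a minimal user-space sketch. The names model_phys_has_lowmem_mapping() and MODEL_PAGE_SHIFT, and the idea of passing the first highmem page frame as a parameter, are assumptions made for this example; the in-tree helpers may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SHIFT 12

/*
 * Hypothetical model of the guard used by the *_phys() callbacks:
 * a physical address may only be converted with __va() when its page
 * frame number lies below the first highmem frame (max_low_pfn here).
 */
static bool model_phys_has_lowmem_mapping(uint64_t phys, uint64_t max_low_pfn)
{
    return (phys >> MODEL_PAGE_SHIFT) < max_low_pfn;
}

/*
 * Model of a *_phys() wrapper: report whether the tracking action
 * would be performed (lowmem) or silently skipped (highmem).
 */
static bool model_would_track_phys(uint64_t phys, uint64_t max_low_pfn)
{
    if (!model_phys_has_lowmem_mapping(phys, max_low_pfn))
        return false;   /* highmem: __va() would be wrong, do nothing */
    return true;        /* lowmem: safe to hand __va(phys) to kmemleak */
}
```

With max_low_pfn set to 0x100 in this model, an address in page frame 1 is tracked, while an address in page frame 0x100 is skipped.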
Signed-off-by: Catalin Marinas
Reported-by: Vignesh R
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
KASLR memory randomization can randomize the base of the physical memory
mapping (PAGE_OFFSET), vmalloc (VMALLOC_START) and vmemmap
(VMEMMAP_START). Add these variables to VMCOREINFO so tools can easily
identify the base of each memory section.

Link: http://lkml.kernel.org/r/1471531632-23003-1-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier
Acked-by: Baoquan He
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H . Peter Anvin"
Cc: Eric Biederman
Cc: Xunlei Pang
Cc: HATAYAMA Daisuke
Cc: Kees Cook
Cc: Eugene Surovegin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In a CONFIG_PREEMPT=n kernel a softlockup was observed in the for loop in
exit_sem. Apparently it's possible for the loop to take quite a long time
and it doesn't have a scheduling point in it. Since the code is
executing under an rcu read section this may also cause rcu stalls, which
in turn block synchronize_rcu operations, which more or less de-stabilises
the whole system.

Fix this by introducing a cond_resched() at the beginning of the loop.
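
As a rough user-space analogy only (not the kernel change itself), the idea of putting a scheduling point at the top of a potentially long loop can be sketched with sched_yield(). The function model_release_all() and its array of slots are hypothetical, and sched_yield() always offers the CPU whereas cond_resched() only reschedules when needed.

```c
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical long-running cleanup loop. Each iteration starts with a
 * scheduling point so other runnable tasks are not starved while a large
 * number of entries is processed.
 */
static size_t model_release_all(int *slots, size_t count)
{
    size_t released = 0;

    for (size_t i = 0; i < count; i++) {
        sched_yield();      /* user-space stand-in for cond_resched() */
        slots[i] = 0;       /* "release" the entry */
        released++;
    }
    return released;
}
```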
So this patch fixes the following:
NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
RIP: 0010:[] [] _raw_spin_lock+0x17/0x30
RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
Call Trace:
? exit_sem+0x7c/0x280
do_exit+0x338/0xb40
do_group_exit+0x43/0xd0
SyS_exit_group+0x14/0x20
entry_SYSCALL_64_fastpath+0x16/0x6e

Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
Signed-off-by: Nikolay Borisov
Cc: Herton R. Krzesinski
Cc: Fabian Frederick
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Blocked tasks queued in q_senders waiting for their message to fit in the
queue are blindly awoken every time we think there's a remote chance this
might happen. This could cause numerous (and expensive -- thundering
herd-ish) bogus wakeups if the queue is still really full. Adding to the
scheduling cost/overhead, there's also the fact that we need to take the
ipc object lock and requeue ourselves in the q_senders list.

By keeping track of the blocked sender's message size, we can know
beforehand whether the wakeup ought to occur or not. Otherwise, to maintain
the current wakeup order we just move it to the tail. This is exactly
what occurs right now if the sender needs to go back to sleep.

The case of EIDRM is left completely untouched, as we need to wakeup all
the tasks, and shouldn't be playing games in the first place.

This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context
switches (avg out of ten runs). Although these tests are really about
functionality (as opposed to performance), it does show the direct
benefits of the optimization.

[akpm@linux-foundation.org: coding-style fixes]
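
The wakeup decision described above can be sketched as follows. This is a simplified model under stated assumptions: struct model_queue and both function names are invented for illustration, and the fit test (bytes and message count both bounded by the queue's byte limit) is an assumption about how "fits in the queue" is evaluated.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a queue's accounting fields. */
struct model_queue {
    size_t cbytes;  /* bytes currently queued */
    size_t qnum;    /* messages currently queued */
    size_t qbytes;  /* maximum number of bytes allowed in the queue */
};

/* True when a message of msgsz bytes would fit in the queue right now. */
static bool model_msg_fits(const struct model_queue *q, size_t msgsz)
{
    return msgsz + q->cbytes <= q->qbytes && 1 + q->qnum <= q->qbytes;
}

/*
 * Decide whether a blocked sender should be woken, using the message
 * size recorded when it went to sleep. A sender whose message still
 * does not fit is left asleep instead of receiving a bogus wakeup.
 */
static bool model_should_wake_sender(const struct model_queue *q,
                                     size_t blocked_msgsz)
{
    return model_msg_fits(q, blocked_msgsz);
}
```

In this model a queue holding 90 of 100 bytes wakes a sender waiting to send 10 bytes, but not one waiting to send 11.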
Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Cc: Manfred Spraul
Cc: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
... 'tis annoying.
Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Cc: Manfred Spraul
Cc: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently the use of wake_qs in sysv msg queues is only for the receiver
tasks that are blocked on the queue. But blocked sender tasks (due to
queue size constraints) still are awoken with the ipc object lock held,
which can be a problem particularly for small sized queues and far from
gracious for -rt (just like it was for the receiver side).

The paths that actually wakeup a sender are obviously related to when we
are either getting rid of the queue or after (some) space is freed-up
after a receiver takes the msg (msgrcv). Furthermore, with the exception
of msgrcv, we can always piggy-back on expunge_all that has its own tasks
lined-up for waking. Finally, upon unlinking the message, it should be no
problem delaying the wakeups a bit until after we've released the lock.

Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Cc: Manfred Spraul
Cc: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch moves the wake_up_process() invocation so it is not done under
the ipc global lock by making use of a lockless wake_q. With this change,
the waiter is woken up once the message has been assigned and it does not
need to loop on SMP if the message points to NULL. In the signal case we
still need to check the pointer under the lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -RT
which avoids a busy-loop which polls for the NULL -> !NULL change if the
waiter has a higher priority compared to the waker.

By making use of wake_qs, the logic of sysv msg queues is greatly
simplified (and very well suited as we can batch lockless wakeups),
particularly around the lockless receive algorithm.

This has been tested with Manfred's pmsg-shared tool on a "AMD A10-7800
Radeon R7, 12 Compute Cores 4C+8G":

test             | before     | after      | diff
-----------------|------------|------------|----------
pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 %

Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operation is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().

Background:

Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
Reported-by:
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Davidlohr Bueso
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc:
Cc: [3.10+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There's no point in collecting coverage from lib/stackdepot.c, as it is
not a function of syscall inputs. Disabling kcov instrumentation for that
file will reduce the coverage noise level.

Link: http://lkml.kernel.org/r/1474640972-104131-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko
Acked-by: Dmitry Vyukov
Cc: Kostya Serebryany
Cc: Andrey Konovalov
Cc: syzkaller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
As of Android N, SECCOMP is required. Without it, we will get
mediaextractor error:

E /system/bin/mediaextractor: libminijail: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER): Invalid argument

Link: http://lkml.kernel.org/r/20160908185934.18098-3-robh@kernel.org
Signed-off-by: Rob Herring
Acked-by: John Stultz
Cc: Amit Pundir
Cc: Dmitry Shmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Android won't boot without SELinux enabled, so make it the default.
Link: http://lkml.kernel.org/r/20160908185934.18098-2-robh@kernel.org
Signed-off-by: Rob Herring
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
CONFIG_MD is in recommended, but other dependent options like DM_CRYPT and
DM_VERITY options are in base. The result is the options in base don't
get enabled when applying both base and recommended fragments. Move all
the options to recommended.

Link: http://lkml.kernel.org/r/20160908185934.18098-1-robh@kernel.org
Signed-off-by: Rob Herring
Acked-by: John Stultz
Cc: Amit Pundir
Cc: Dmitry Shmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Option is long gone, see commit 5d9efa7ee99e ("ipv6: Remove privacy
config option.")

Link: http://lkml.kernel.org/r/20160811170340.9859-1-bp@alien8.de
Signed-off-by: Borislav Petkov
Cc: Rob Herring
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Relay avoids calling wake_up_interruptible() for doing the wakeup of
readers/consumers, waiting for the generation of new data, from the
context of a process which produced the data. This is apparently done to
prevent the possibility of a deadlock in case the scheduler itself is
generating data for the relay, after acquiring rq->lock.

The following patch used a timer (to be scheduled at next jiffy), for
delegating the wakeup to another context.

commit 7c9cb38302e78d24e37f7d8a2ea7eed4ae5f2fa7
Author: Tom Zanussi
Date: Wed May 9 02:34:01 2007 -0700

relay: use plain timer instead of delayed work

relay doesn't need to use schedule_delayed_work() for waking readers
when a simple timer will do.

Scheduling a plain timer, at next jiffies boundary, to do the wakeup
causes a significant wakeup latency for the userspace client, which makes
relay less suitable for the high-frequency low-payload use cases where the
data gets generated at a very high rate, like multiple sub buffers getting
filled within a millisecond. Moreover the timer is re-scheduled on every
newly produced sub buffer so the timer keeps getting pushed out if sub
buffers are filled in very quick succession (less than a jiffy gap
between filling of 2 sub buffers). As a result relay runs out of sub
buffers to store the new data.

By using irq_work it is ensured that wakeup of the userspace client, blocked
in the poll call, is done at the earliest (through self IPI or next timer
tick) enabling it to always consume the data in time. Also this makes
relay consistent with printk & ring buffers (trace), as they too use
irq_work for deferred wake up of readers.

[arnd@arndb.de: select CONFIG_IRQ_WORK]
Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com
Signed-off-by: Peter Zijlstra
Signed-off-by: Akash Goel
Cc: Tom Zanussi
Cc: Chris Wilson
Cc: Tvrtko Ursulin
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
CONFIG_NO_HZ currently only sets the default value of dynticks config so
if PPS kernel consumer needs periodic timer ticks it should depend on
!CONFIG_NO_HZ_COMMON instead of !CONFIG_NO_HZ.

Otherwise it is possible to enable it even on a tickless system which has
CONFIG_NO_HZ not set and CONFIG_NO_HZ_IDLE (or CONFIG_NO_HZ_FULL) set.

Link: http://lkml.kernel.org/r/57E2B769.50202@maciej.szmigiero.name
Signed-off-by: Maciej S. Szmigiero
Acked-by: Rodolfo Giometti
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Daniel Walker reported problems which happen when the
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online. As a result, kdump
routines fail to save other CPUs' registers. Additionally for MIPS
OCTEON, it fails to stop the watchdog timer.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled. crash_smp_send_stop() is a
weak function, and it just calls smp_send_stop(). Architecture
code should override it so that kdump can work appropriately.
This patch provides the MIPS version.

Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080950.11028.28000.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai
Reported-by: Daniel Walker
Cc: Dave Young
Cc: Baoquan He
Cc: Vivek Goyal
Cc: Eric Biederman
Cc: Masami Hiramatsu
Cc: Daniel Walker
Cc: Xunlei Pang
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
Cc: David Vrabel
Cc: Toshi Kani
Cc: Ralf Baechle
Cc: David Daney
Cc: Aaro Koskinen
Cc: "Steven J. Hill"
Cc: Corey Minyard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Daniel Walker reported problems which happen when the
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online. As a result, for x86, kdump
routines fail to save other CPUs' registers and disable virtualization
extensions.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled. crash_smp_send_stop() is a weak
function, and it just calls smp_send_stop(). Architecture code should
override it so that kdump can work appropriately. This patch only
provides the x86-specific version.

For Xen's PV kernel, just keep the current behavior.
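
The weak-default pattern described above can be sketched in plain C using the GCC/Clang `weak` attribute. Everything prefixed with `model_` is a stand-in invented for this example and only counts calls; it is not the kernel's code.

```c
static int model_plain_stop_calls;  /* how often the generic stop ran */

/* Stand-in for the generic smp_send_stop(); it only counts calls here. */
static void model_smp_send_stop(void)
{
    model_plain_stop_calls++;
}

/*
 * Weak default: when no architecture supplies its own version, fall back
 * to the generic stop. A strong definition of the same symbol in another
 * object file would replace this one at link time.
 */
__attribute__((weak)) void model_crash_smp_send_stop(void)
{
    model_smp_send_stop();
}

/*
 * Model of the panic path choosing between the two calls. Returns the
 * running count of generic stop calls so the effect can be observed.
 */
static int model_panic_stop_cpus(int post_notifiers_enabled)
{
    if (post_notifiers_enabled)
        model_crash_smp_send_stop();    /* kdump-friendly variant */
    else
        model_smp_send_stop();
    return model_plain_stop_calls;
}
```

Because no strong override exists in this single file, both paths end up in the generic stop; an architecture-specific file would change only the first path.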
NOTES:

- Right solution would be to place crash_smp_send_stop() before
  __crash_kexec() invocation in all cases and remove smp_send_stop(), but
  we can't do that until all architectures implement own
  crash_smp_send_stop()

- crash_smp_send_stop()-like work is still needed by
  machine_crash_shutdown() because crash_kexec() can be called without
  entering panic()

Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai
Reported-by: Daniel Walker
Cc: Dave Young
Cc: Baoquan He
Cc: Vivek Goyal
Cc: Eric Biederman
Cc: Masami Hiramatsu
Cc: Daniel Walker
Cc: Xunlei Pang
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
Cc: David Vrabel
Cc: Toshi Kani
Cc: Ralf Baechle
Cc: David Daney
Cc: Aaro Koskinen
Cc: "Steven J. Hill"
Cc: Corey Minyard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira
Reviewed-by: Gabriel Krisman Bertazi
Cc: Keith Busch
Cc: Jens Axboe
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code.
Link: http://lkml.kernel.org/r/1470092390-25451-3-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira
Acked-by: Michael Ellerman
Cc: Keith Busch
Cc: Jens Axboe
Cc: Benjamin Herrenschmidt
Cc: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Introduce the DMA_ATTR_NO_WARN attribute, and document it.
Link: http://lkml.kernel.org/r/1470092390-25451-2-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira
Cc: Keith Busch
Cc: Jens Axboe
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
All call sites for randomize_range have been updated to use the much
simpler and more robust randomize_addr(). Remove the now unnecessary
code.

Link: http://lkml.kernel.org/r/20160803233913.32511-8-jason@lakedaemon.net
Signed-off-by: Jason Cooper
Acked-by: Kees Cook
Cc: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/20160803233913.32511-7-jason@lakedaemon.net
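
The messages in this series do not spell out how a (start, range) helper picks its result, so the following is only a guess at the general shape, written as a user-space model. The function name, the 4 KiB page size, the page-aligned start, and passing the random value in as a parameter are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SHIFT 12

/*
 * Hypothetical model: pick a page-aligned address in [start, start + range).
 * 'start' is assumed to be page aligned already, and 'rnd' stands in for
 * a value obtained from a random number source.
 */
static uintptr_t model_randomize_addr(uintptr_t start, uintptr_t range,
                                      uintptr_t rnd)
{
    uintptr_t pages = range >> MODEL_PAGE_SHIFT;    /* whole pages in range */

    if (pages == 0)
        return start;       /* range too small to randomize over */

    return start + ((rnd % pages) << MODEL_PAGE_SHIFT);
}
```

Compared with a (start, end, len) interface, a caller in this model passes the constant range directly instead of computing an end address and checking for a zero return.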
Signed-off-by: Jason Cooper
Acked-by: Kees Cook
Cc: "Theodore Ts'o"
Cc: Guan Xuetao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/20160803233913.32511-6-jason@lakedaemon.net
Signed-off-by: Jason Cooper
Acked-by: Kees Cook
Cc: "Theodore Ts'o"
Cc: Chris Metcalf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/20160803233913.32511-5-jason@lakedaemon.net
Signed-off-by: Jason Cooper
Acked-by: Will Deacon
Acked-by: Kees Cook
Cc: "Russell King - ARM Linux"
Cc: Catalin Marinas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/20160803233913.32511-4-jason@lakedaemon.net
Signed-off-by: Jason Cooper
Acked-by: Kees Cook
Cc: "Russell King - ARM Linux"
Cc: "Theodore Ts'o"
Cc: Catalin Marinas
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/20160803233913.32511-3-jason@lakedaemon.net
Signed-off-by: Jason Cooper
Acked-by: Kees Cook
Cc: "Theodore Ts'o"
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H . Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
To date, all callers of randomize_range() have set the length to 0, and
check for a zero return value. For the current callers, the only way to
get zero returned is if end
Cc: Nick Kralevich
Cc: Jeffrey Vander Stoep
Cc: Daniel Cashman
Cc: Chris Metcalf
Cc: Guan Xuetao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fix coccinelle warning about duplicating existing memdup_user function.
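
For readers unfamiliar with this kind of warning: it flags open-coded "allocate, then copy" sequences that an existing helper already covers. The user-space sketch below shows the shape of such a helper; model_memdup() and model_roundtrip_ok() are hypothetical names, and plain malloc()/memcpy() stand in for the kernel's allocation and copy-from-user steps.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space analogue of a memdup-style helper: allocate a
 * buffer of 'len' bytes and fill it with a copy of 'src' in one call.
 * Returns NULL on allocation failure.
 */
static void *model_memdup(const void *src, size_t len)
{
    void *copy = malloc(len ? len : 1);     /* avoid a zero-size request */

    if (copy == NULL)
        return NULL;
    memcpy(copy, src, len);
    return copy;
}

/* Caller side: one call replaces the separate allocate and copy steps. */
static int model_roundtrip_ok(const void *src, size_t len)
{
    void *copy = model_memdup(src, len);
    int same;

    if (copy == NULL)
        return 0;
    same = memcmp(copy, src, len) == 0;
    free(copy);
    return same;
}
```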
Link: http://lkml.kernel.org/r/20160811151737.20140-1-alexandre.bounine@idt.com
Link: https://lkml.org/lkml/2016/8/11/29
Signed-off-by: Alexandre Bounine
Reported-by: kbuild test robot
Cc: Matt Porter
Cc: Andre van Herk
Cc: Barry Wood
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
On __ptrace_detach(), called from do_exit()->exit_notify()->
forget_original_parent()->exit_ptrace(), the TIF_SYSCALL_TRACE in
thread->flags of the tracee is not cleared up. This results in the
tracehook_report_syscall_* being called (though there's no longer a tracer
listening to that) upon its further syscalls.

Example scenario - attach "strace" to a running process and kill it (the
strace) with SIGKILL. You'll see that the syscall trace hooks are still
being called.

The clearing of this flag should be moved from ptrace_detach() to
__ptrace_detach().

Link: http://lkml.kernel.org/r/1472759493-20554-1-git-send-email-alnovak@suse.cz
Signed-off-by: Ales Novak
Acked-by: Oleg Nesterov
Cc: Jiri Kosina
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is a patch that provides behavior that is more consistent, and
probably less surprising to users. I consider the change optional, and
welcome opinions about whether it should be applied.

By default, pipes are created with a capacity of 64 kiB. However,
/proc/sys/fs/pipe-max-size may be set smaller than this value. In this
scenario, an unprivileged user could thus create a pipe whose initial
capacity exceeds the limit. Therefore, it seems logical to cap the
initial pipe capacity according to the value of pipe-max-size.

The test program shown earlier in this patch series can be used to
demonstrate the effect of the change brought about with this patch:

# cat /proc/sys/fs/pipe-max-size
1048576
# sudo -u mtk ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 65536
# echo 10000 > /proc/sys/fs/pipe-max-size
# cat /proc/sys/fs/pipe-max-size
16384
# sudo -u mtk ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 16384
# ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 65536

The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
caps the initial allocation for a new pipe for unprivileged users, but
not for privileged users.

Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com
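
The capping rule from the pipe-max-size message above can be modelled in a few lines. This is a sketch under assumptions: 4 KiB pages, a default of 16 buffers (64 KiB), and invented names prefixed with `model_`/`MODEL_`; it mirrors the described behaviour rather than quoting the patch.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE   4096UL
#define MODEL_DEF_BUFFERS 16UL  /* 16 * 4096 bytes = 64 KiB default */

/*
 * Hypothetical model: compute the initial capacity of a new pipe.
 * Unprivileged users are capped by pipe_max_size; privileged users
 * always get the default capacity.
 */
static unsigned long model_initial_pipe_capacity(unsigned long pipe_max_size,
                                                 bool privileged)
{
    unsigned long bufs = MODEL_DEF_BUFFERS;

    if (!privileged && bufs * MODEL_PAGE_SIZE > pipe_max_size)
        bufs = pipe_max_size / MODEL_PAGE_SIZE;

    return bufs * MODEL_PAGE_SIZE;
}
```

With pipe_max_size at 16384, the model returns 16384 for an unprivileged user and 65536 for a privileged one, matching the transcript.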
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is an optional patch, to provide a small performance
improvement. Alter account_pipe_buffers() so that it returns the
new value in user->pipe_bufs. This means that we can refactor
too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
avoid the costs of repeated use of atomic_long_read() to get the
value user->pipe_bufs.

Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com
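
A minimal sketch of the refactoring idea, using C11 atomics in user space: the accounting function returns the updated total, and the limit checks take that value as an argument instead of re-reading the counter. The struct, the function names, and the exact comparison operators are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-user accounting record. */
struct model_user {
    atomic_long pipe_bufs;  /* pages currently charged to this user */
};

/* Example instance so the functions below can be exercised directly. */
static struct model_user model_demo_user;

/* Adjust the charge from old_bufs to new_bufs and return the new total. */
static long model_account_pipe_buffers(struct model_user *user,
                                       long old_bufs, long new_bufs)
{
    long delta = new_bufs - old_bufs;

    return atomic_fetch_add(&user->pipe_bufs, delta) + delta;
}

/* The checks work on the value returned above; no extra atomic read. */
static bool model_too_many_soft(long user_bufs, long soft_limit)
{
    return soft_limit != 0 && user_bufs > soft_limit;
}

static bool model_too_many_hard(long user_bufs, long hard_limit)
{
    return hard_limit != 0 && user_bufs > hard_limit;
}
```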
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The limit checking in alloc_pipe_info() (used by pipe(2) and when
opening a FIFO) has the following problems:

(1) When checking capacity required for the new pipe, the checks against
    the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
    against existing consumption, and exclude the memory required for
    the new pipe capacity. As a consequence: (1) the memory allocation
    throttling provided by the soft limit does not kick in quite as
    early as it should, and (2) the user can overrun the hard limit.

(2) As currently implemented, accounting and checking against the limits
    is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racy. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c). The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch addresses the above problems as follows:

* Alter the checks against limits to include the memory required for the
  new pipe.
* Re-order the accounting step so that it precedes the buffer allocation.
  If the accounting step determines that a limit has been reached, revert
  the accounting and cause the operation to fail.

Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com
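
The re-ordered sequence from the alloc_pipe_info() message above (account first, then check, then revert on failure) can be sketched with C11 atomics. This is a user-space model, not the patch: the counter, the function names, and treating a limit of 0 as "no limit" are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Pages charged to one (hypothetical) user. */
static atomic_long model_user_bufs;

/*
 * Charge 'nbufs' pages before allocating anything. The check runs on the
 * total that already includes the new pipe, so concurrent callers cannot
 * all slip past the limit. On failure the charge is reverted.
 */
static bool model_charge_pipe(long nbufs, long hard_limit)
{
    long total = atomic_fetch_add(&model_user_bufs, nbufs) + nbufs;

    if (hard_limit != 0 && total > hard_limit) {
        atomic_fetch_sub(&model_user_bufs, nbufs);  /* revert accounting */
        return false;
    }

    /* The buffer allocation would happen here, after accounting. */
    return true;
}

/* Read back the current charge, for inspection. */
static long model_charged(void)
{
    return atomic_load(&model_user_bufs);
}
```

With a limit of 20 pages in this model, a first request for 16 pages succeeds and a second one fails without leaving any stale charge behind.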
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Replace an 'if' block that covers most of the code in this function
with a 'goto'. This makes the code a little simpler to read, and also
simplifies the next patch (fix limit checking in alloc_pipe_info())

Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:

(1) When increasing the pipe capacity, the checks against the limits in
    /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
    consumption, and exclude the memory required for the increased pipe
    capacity. The new increase in pipe capacity can then push the total
    memory used by the user for pipes (possibly far) over a limit. This
    can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity is
    less than the existing pipe capacity. This can lead to problems if a
    user sets a large pipe capacity, and then the limits are lowered,
    with the result that the user will no longer be able to decrease the
    pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racy. Multiple processes may pass point (a)
    simultaneously, and then allocate pipe buffers that are accounted
    for only in step (c). The race means that the user's pipe buffer
    allocation could be pushed over the limit (by an arbitrary amount,
    depending on how unlucky we were in the race). [Thanks to Vegard
    Nossum for spotting this point, which I had missed.]

This patch addresses the above problems as follows:

* Perform checks against the limits only when increasing a pipe's
  capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
  the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
  allocation. If the accounting step determines that a limit has
  been reached, revert the accounting and cause the operation to fail.

The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)

Problem 1
=========

Using the test program on an unpatched kernel, we first set some
limits:

# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB

Then show that we can set a pipe with capacity (100MB) that is
over the hard limit

# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728

Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):

# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Loop 2: set pipe capacity to 100000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

With a patched kernel, setting a capacity over the limit fails at the
first attempt:

# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and (b) the default soft/hard limits are high/unlimited. Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).

Problem 2
=========

Running the test program on an unpatched kernel, we first set some limits:

# getconf PAGESIZE
4096
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB

Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):

# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 748
# Initial pipe capacity: 65536
Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds

# echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB
# Loop 2: set pipe capacity to 5000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

In this case, the user should be able to lower the limit.

With a kernel that has the patch below, the second fcntl()
succeeds:

# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 3215
# Initial pipe capacity: 65536
# Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds

# echo 1000 > /proc/sys/fs/pipe-user-pages-hard
# Loop 2: set pipe capacity to 5000000 bytes
F_SETPIPE_SZ returned 8388608

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* test_F_SETPIPE_SZ.c

   (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

   Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
   and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int (*pfd)[2];
    int npipes;
    int pcap, rcap;
    int j, p, s, stime, loop;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s num-pipes "
                "[pipe-capacity sleep-time]...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    npipes = atoi(argv[1]);

    pfd = calloc(npipes, sizeof (int [2]));
    if (pfd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    /* Create the requested number of pipes */
    for (j = 0; j < npipes; j++) {
        if (pipe(pfd[j]) == -1) {
            fprintf(stderr, "Loop %d: pipe() failed: ", j);
            perror("pipe");
            exit(EXIT_FAILURE);
        }
    }

    printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

    /* Remaining arguments are (pipe-capacity, sleep-time) pairs */
    for (j = 2; j < argc; j += 2 ) {
        loop = j / 2;
        pcap = atoi(argv[j]);
        printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

        for (p = 0; p < npipes; p++) {
            s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
            if (s == -1) {
                fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                        "failed: ", loop, p);
                perror("fcntl");
                exit(EXIT_FAILURE);
            }

            /* Every pipe should report the same rounded-up capacity */
            if (p == 0) {
                printf(" F_SETPIPE_SZ returned %d\n", s);
                rcap = s;
            } else {
                if (s != rcap) {
                    fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                            "unexpected return: %d\n", loop, p, s);
                    exit(EXIT_FAILURE);
                }
            }

            stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
            if (stime > 0) {
                printf(" Sleeping %d seconds\n", stime);
                sleep(stime);
            }
        }
    }

    exit(EXIT_SUCCESS);
}

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Patch history:
v2
* Switch order of test in 'if' statement to avoid function call
(to capable()) in normal path. [This is a fix to a preexisting
wart in the code. Thanks to Willy Tarreau]
* Perform (size > pipe_max_size) check before calling
account_pipe_buffers(). [Thanks to Vegard Nossum]
Quoting Vegard:

The potential problem happens if the user passes a very large number
which will overflow pipe->user->pipe_bufs.

On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
then round_pipe_size() returns INT_MAX. Although it's true that the
accounting is done in terms of pages and not bytes, so you'd need on
the order of (1 << 13) = 8192 processes hitting the limit at the same
time in order to make it overflow, which seems a bit unlikely.

(See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
limit checking)

Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a preparatory patch for following work. account_pipe_buffers()
performs accounting in the 'user_struct'. There is no need to pass a
pointer to a 'pipe_inode_info' struct (which is then dereferenced to
obtain a pointer to the 'user' field). Instead, pass a pointer directly
to the 'user_struct'. This change is needed in preparation for a
subsequent patch that fixes the limit checking in alloc_pipe_info()
(and the resulting code is a little more logical).

Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a preparatory patch for following work. Move the F_SETPIPE_SZ
limit-checking logic from pipe_fcntl() into pipe_set_size(). This
simplifies the code a little, and allows for reworking required in
a later patch that fixes the limit checking in pipe_set_size().

Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.

While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:

(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.

(3) As currently implemented, accounting and checking against the
limits is done as follows:

(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.

This is racy. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]

This patch series addresses these three problems.
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().

Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk
Reviewed-by: Vegard Nossum
Cc: Willy Tarreau
Cc:
Cc: Tetsuo Handa
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The cmd part of this struct is the same as the entry's own index within
_ioctls[]. In fact this cmd is unused, so we can drop this part.

Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi
Signed-off-by: Ian Kent
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Having this in autofs_i.h gives the illusion that uncommenting this enables
pr_debug(), but it doesn't enable all the pr_debug() in autofs because
inclusion order matters.

XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS
seems to have a rule to include this prior to other XFS headers as well as
kernel headers. This is not the case with autofs, and DEBUG could be
enabled via Makefile, so autofs should just get rid of this comment to
make the code less confusing. It's a comment, so there is literally no
functional difference.

Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi
Signed-off-by: Ian Kent
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since linux/auto_dev-ioctl.h wasn't included in include/linux/Kbuild
it wasn't moved to uapi/linux as part of the uapi series.

Link: http://lkml.kernel.org/r/20160812024901.12352.10984.stgit@pluto.themaw.net
Signed-off-by: Ian Kent
Cc: Tomohiro Kusumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds