27 Jul, 2016
1 commit
-
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.A note regarding anon_pipe_buf_steal implementation. We allow to steal
the page if its ref count equals 1. It looks racy, but it is correct
for anonymous pipe buffer pages, because:- We lock out all other pipe users, because ->steal is called with
pipe_lock held, so the page can't be spliced to another pipe from
under us.- The page is not on LRU and it never was.
- Thus a parallel thread can access it only by PFN. Although this is
quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
this is not dangerous, because all such functions do is increase page
ref count, check if the page is the one they are looking for, and
decrease ref count if it isn't. Since our page is clean except for
PageKmemcg mark, which doesn't conflict with other _mapcount users,
the worst that can happen is we see page_count > 2 due to a transient
ref, in which case we false-positively abort ->steal, which is still
fine, because ->steal is not guaranteed to succeed.Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov
Cc: Alexander Viro
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Eric Dumazet
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Apr, 2016
1 commit
-
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.Let's stop pretending that pages in page cache are special. They are
not.The changes are pretty straight-forward:
- << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;
- >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
20 Jan, 2016
1 commit
-
On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds
Signed-off-by: Willy Tarreau
Signed-off-by: Al Viro
11 Nov, 2015
2 commits
-
pipe_write() would return 0 if it failed to merge the beginning of the
data to write with the last, partially filled pipe buffer. It should
return an error code instead. Userspace programs could be confused by
write() returning 0 when called with a nonzero 'count'.The EFAULT error case was a regression from f0d1bec9d5 ("new helper:
copy_page_from_iter()"), while the ops->confirm() error case was a much
older bug.Test program:
#include
#include
#includeint main(void)
{
int fd[2];
char data[1] = {0};assert(0 == pipe(fd));
assert(1 == write(fd[1], data, 1));/* prior to this patch, write() returned 0 here */
assert(-1 == write(fd[1], NULL, 1));
assert(errno == EFAULT);
}Cc: stable@vger.kernel.org # at least v3.15+
Signed-off-by: Eric Biggers
Signed-off-by: Al Viro -
If sys_pipe() was unable to allocate a 'struct file', it always failed
with ENFILE, which means "The number of simultaneously open files in the
system would exceed a system-imposed limit." However, alloc_file()
actually returns an ERR_PTR value and might fail with other error codes.
Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
when there are few open files in the system. Update sys_pipe() to
preserve this error code.In a prior submission of a similar patch (1) some concern was raised
about introducing a new error code for sys_pipe(). However, for most
system calls, programs cannot assume that new error codes will never be
introduced. In addition, ENOMEM was, in fact, already a possible error
code for sys_pipe(), in the case where the file descriptor table could
not be expanded due to insufficient memory.(1) http://comments.gmane.org/gmane.linux.kernel/1357942
Signed-off-by: Eric Biggers
Signed-off-by: Al Viro
16 Apr, 2015
1 commit
-
Signed-off-by: David Howells
Signed-off-by: Al Viro
12 Apr, 2015
1 commit
-
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.Signed-off-by: Al Viro
26 Mar, 2015
1 commit
-
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
07 May, 2014
3 commits
-
parallel to copy_page_to_iter(). pipe_write() switched to it (and became
->write_iter()).Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.Signed-off-by: Al Viro
02 Apr, 2014
2 commits
-
Signed-off-by: Al Viro
-
all pipe_buffer_operations have the same instances of those...
Signed-off-by: Al Viro
24 Jan, 2014
1 commit
-
Pipe has no data associated with fs so it is not good idea to block
pipe_write() if FS is frozen, but we can not update file's time on such
filesystem. Let's use same idea as we use in touch_time().Addresses https://bugzilla.kernel.org/show_bug.cgi?id=65701
Signed-off-by: Dmitry Monakhov
Reviewed-by: Jan Kara
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Dec, 2013
1 commit
-
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:spin_lock(&inode->i_lock);
if (!--pipe->files) {
inode->i_pipe = NULL;
kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
free_pipe_info(pipe);where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").Reported-by: Simon Kirby
Reported-by: Ian Applegate
Acked-by: Al Viro
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds
08 May, 2013
1 commit
-
Faster kernel compiles by way of fewer unnecessary includes.
[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet
Cc: Zach Brown
Cc: Felipe Balbi
Cc: Greg Kroah-Hartman
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Rusty Russell
Cc: Jens Axboe
Cc: Asai Thambi S P
Cc: Selvan Mani
Cc: Sam Bradshaw
Cc: Jeff Moyer
Cc: Al Viro
Cc: Benjamin LaHaise
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Apr, 2013
11 commits
-
and rename __free_pipe_info() to free_pipe_info()
Signed-off-by: Al Viro
-
not used anymore
Signed-off-by: Al Viro
-
it's used only as a flag to distinguish normal pipes/FIFOs from the
internal per-task one used by file-to-file splice. And pipe->files
would work just as well for that purpose...Signed-off-by: Al Viro
-
fs/pipe.c file_operations methods *know* that pipe is not an internal one;
no need to check pipe->inode for those callers.Signed-off-by: Al Viro
-
simplify get_pipe_info(), while we are at it
Signed-off-by: Al Viro
-
now it can be done - put mutex into pipe_inode_info, use it instead
of ->i_mutexSigned-off-by: Al Viro
-
* new field - pipe->files; number of struct file over that pipe (all
sharing the same inode, of course); protected by inode->i_lock.
* pipe_release() decrements pipe->files, clears inode->i_pipe when
if the counter has reached 0 (all under ->i_lock) and, in that case,
frees pipe after having done pipe_unlock()
* fifo_open() starts with grabbing ->i_lock, and either bumps pipe->files
if ->i_pipe was non-NULL or allocates a new pipe (dropping and regaining
->i_lock) and rechecks ->i_pipe; if it's still NULL, inserts new pipe
there, otherwise bumps ->i_pipe->files and frees the one we'd allocated.
At that point we know that ->i_pipe is non-NULL and won't go away, so
we can do pipe_lock() on it and proceed as we used to. If we end up
failing, decrement pipe->files and if it reaches 0 clear ->i_pipe and
free the sucker after pipe_unlock().Signed-off-by: Al Viro
-
* use the fact that file_inode(file)->i_pipe doesn't change
while the file is opened - no locks needed to access that.
* switch to pipe_lock/pipe_unlock where it's easy to doSigned-off-by: Al Viro
-
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
12 Mar, 2013
1 commit
-
If you open a pipe for neither read nor write, the pipe code will not
add any usage counters to the pipe, causing the 'struct pipe_inode_info"
to be potentially released early.That doesn't normally matter, since you cannot actually use the pipe,
but the pipe release code - particularly fasync handling - still expects
the actual pipe infrastructure to all be there. And rather than adding
NULL pointer checks, let's just disallow this case, the same way we
already do for the named pipe ("fifo") case.This is ancient going back to pre-2.4 days, and until trinity, nobody
naver noticed.Reported-by: Dave Jones
Signed-off-by: Linus Torvalds
23 Feb, 2013
2 commits
-
Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limitCurrently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.Return error through pointer. Change all callers to preserve this error code.
[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]Signed-off-by: Anatol Pomozov
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Al Viro -
Signed-off-by: Al Viro
27 Sep, 2012
1 commit
-
don't mess with sys_close() if copy_to_user() fails; just postpone
fd_install() until we know it hasn't.Signed-off-by: Al Viro
02 Aug, 2012
1 commit
-
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
30 Jul, 2012
1 commit
-
Signed-off-by: Al Viro
24 Jul, 2012
1 commit
-
Signed-off-by: Cong Wang
02 Jun, 2012
1 commit
-
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,Signed-off-by: Josef Bacik
31 May, 2012
1 commit
-
As described in commit 07d106d0a ("vfs: fix up ENOIOCTLCMD error
handling"), drivers should return -ENOIOCTLCMD if they receive an ioctl
command which they don't understand. Doing so will result in -ENOTTY
being returned to userspace, which matches the behaviour of the compat
layer if it fails to translate an ioctl command.This patch fixes the pipe ioctl to return -ENOIOCTLCMD instead of
-EINVAL when passed an unknown ioctl command.Cc: Al Viro
Cc: Andrew Morton
Signed-off-by: Will Deacon
Signed-off-by: Al Viro
30 Apr, 2012
1 commit
-
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.Tested-by: Michael Tokarev
Cc: Alan Cox
Cc: David Miller
Cc: Ian Kent
Cc: Thomas Meyer
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds
24 Mar, 2012
1 commit
-
- Move open-coded filesystem magic numbers into magic.h
- Rearrange magic.h so that the filesystem-related constants are grouped
together.Signed-off-by: Muthukumar R
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Mar, 2012
1 commit
-
Acked-by: Benjamin LaHaise
Signed-off-by: Cong Wang
13 Jan, 2012
1 commit
-
When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
Call Trace:
warn_slowpath_common+0x75/0xb0
warn_slowpath_null+0x15/0x20
__alloc_pages_nodemask+0x5d3/0x7a0
__get_free_pages+0x12/0x50
__kmalloc+0x12b/0x150
pipe_set_size+0x75/0x120
pipe_fcntl+0xf8/0x140
do_fcntl+0x2d4/0x410
sys_fcntl+0x66/0xa0
system_call_fastpath+0x16/0x1b
---[ end trace 432f702e6db7b5ee ]---Instead, make kcalloc() handle the overflow case and fail quietly.
[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin
Cc: Alexander Viro
Acked-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds