06 Jan, 2017

1 commit

  • commit 52bce91165e5f2db422b2b972e83d389e5e4725c upstream.

    Commit 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()")
    caused a regression when there were no more readers left on a pipe that
    was being spliced into: rather than the expected SIGPIPE and -EPIPE
    return value, the writer would end up waiting forever for space to free
    up (which obviously was not going to happen with no readers around).
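
    A minimal userspace sketch of the expected behaviour (my
    illustration, not taken from the commit; the input file is
    arbitrary):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int p[2], fd;
        ssize_t n;

        pipe(p);
        close(p[0]);               /* no readers left on the pipe */
        signal(SIGPIPE, SIG_IGN);  /* observe -EPIPE instead of dying */

        fd = open("/etc/hostname", O_RDONLY);
        n = splice(fd, NULL, p[1], NULL, 4096, 0);
        if (n < 0 && errno == EPIPE)
            printf("got EPIPE as expected\n");
        return 0;
    }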

    Fixes: 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()")
    Reported-and-tested-by: Andreas Schwab
    Debugged-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

27 Nov, 2016

1 commit

  • Botched calculation of the number of pages. As a result, we were
    dropping pieces when splicing to a pipe from e.g. 9p.

    Reported-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: Al Viro

    Al Viro
     

11 Nov, 2016

1 commit

  • The i_size check is a leftover from the horrors that used to play
    with the page cache in that function. With the switch to
    ->read_iter(), it's neither needed nor correct - for gfs2 it ends
    up being buggy, since i_size is not guaranteed to be correct until
    later (inside ->read_iter()).

    Spotted-by: Abhi Das
    Signed-off-by: Al Viro

    Al Viro
     

11 Oct, 2016

1 commit

  • Fixed by making sure we call iov_iter_advance() on the original
    iov_iter even if direct_IO (done on its copy) has returned 0.
    It's a no-op for old iov_iter flavours and does the right thing
    (== truncation of the stuff we'd allocated, but not filled) in
    the ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt
    with by the cleanup in generic_file_read_iter().
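
    A condensed sketch of the pattern (my paraphrase of the
    generic_file_read_iter() direct-IO path, not the literal diff;
    'data' is the copy handed to ->direct_IO()):

    retval = mapping->a_ops->direct_IO(iocb, &data);
    if (retval >= 0) {
        iocb->ki_pos += retval;
        /* advance the ORIGINAL iterator even when retval == 0: a
         * no-op for iovec-backed iterators, but for ITER_PIPE it
         * truncates the pages we allocated and never filled */
        iov_iter_advance(iter, retval);
    }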

    Signed-off-by: Al Viro

    Al Viro
     

06 Oct, 2016

6 commits

  • Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • We only use iov_iter_get_pages_alloc() and iov_iter_advance() -
    pages are filled by kernel_readv() via a kvec array (as we used
    to do all along), so the iov_iter here is used only as a way of
    arranging for those pages to be in the pipe.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and kill the ->splice_read() instances that can be switched to it

    Signed-off-by: Al Viro

    Al Viro
     
  • iov_iter variant for passing data into a pipe. copy_to_iter()
    copies data into page(s) it has allocated and stuffs them into
    the pipe; copy_page_to_iter() stuffs in a reference to the page
    it is given. Both will try to coalesce if possible.
    iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
    and friends will do as copy_to_iter() would have and return the
    pages where the data would've been copied. iov_iter_advance()
    will truncate everything past the spot it has advanced to.

    New primitive: iov_iter_pipe(), used for initializing those.
    The pipe should be locked all along.

    Running out of space acts as fault would for iovec-backed ones;
    in other words, giving it to ->read_iter() may result in short
    read if the pipe overflows, or -EFAULT if it happens with nothing
    copied there.

    In other words, ->read_iter() on those acts pretty much like
    ->splice_read(). Moreover, all generic_file_splice_read() users,
    as well as many other ->splice_read() instances can be switched
    to that scheme - that'll happen in the next commit.
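
    As a sketch, a ->read_iter()-based splice read then looks roughly
    like this (modeled on the 4.9-era generic_file_splice_read(),
    simplified; not the verbatim kernel code):

    ssize_t splice_read_via_read_iter(struct file *in, loff_t *ppos,
                                      struct pipe_inode_info *pipe,
                                      size_t len, unsigned int flags)
    {
        struct iov_iter to;
        struct kiocb kiocb;
        ssize_t ret;

        iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
        init_sync_kiocb(&kiocb, in);
        kiocb.ki_pos = *ppos;
        /* short read if the pipe overflows; -EFAULT if nothing fit */
        ret = in->f_op->read_iter(&kiocb, &to);
        if (ret > 0)
            *ppos = kiocb.ki_pos;
        return ret;
    }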

    Signed-off-by: Al Viro

    Al Viro
     

04 Oct, 2016

4 commits

  • A single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe()
    switched to it, leaving splice_to_pipe() only for ->splice_read()
    instances (and that only until they are converted as well).

    Signed-off-by: Al Viro

    Al Viro
     
  • * splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
    * ->splice_read() instances do the same
    * vmsplice_to_pipe() and do_splice() (ultimate callers of
      splice_to_pipe()) arrange for waiting, looping, etc. themselves.

    That should make pipe_lock the outermost one.

    Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
    and do_splice() are quite ugly _and_ userland code can be easily broken
    by changing those. It's not even "no more than the maximal capacity of
    this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
    leave instead of waiting".

    Considering how poorly these rules are documented, let's try "wait for some
    space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
    and if we run into overflow, we are done".
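
    The userspace-visible rule can be exercised with a sketch like
    this (my example; assumes the default 64K pipe capacity):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[65536];   /* default pipe capacity */
        int p[2], fd;
        ssize_t n;

        pipe2(p, O_NONBLOCK);
        write(p[1], buf, sizeof(buf));   /* fill the pipe */

        fd = open("/etc/hostname", O_RDONLY);
        n = splice(fd, NULL, p[1], NULL, 4096, SPLICE_F_NONBLOCK);
        if (n < 0 && errno == EAGAIN)
            printf("full pipe + SPLICE_F_NONBLOCK -> EAGAIN\n");
        return 0;
    }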

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

11 May, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    That promise never materialized, and likely never will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal
    to PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in the page cache are special.
    They are not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and
    documentation will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Apr, 2016

1 commit


19 Mar, 2016

2 commits

  • Al Viro
     
  • Running the following command:

    busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null

    with any tracing enabled pretty quickly leads to various NULL
    pointer dereferences and VM BUG_ON()s, such as these:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: generic_pipe_buf_release+0xc/0x40
    Call Trace:
      splice_direct_to_actor+0x143/0x1e0
      ? generic_pipe_buf_nosteal+0x10/0x10
      do_splice_direct+0x8f/0xb0
      do_sendfile+0x199/0x380
      SyS_sendfile64+0x90/0xa0
      entry_SYSCALL_64_fastpath+0x12/0x6d

    page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
    kernel BUG at include/linux/mm.h:367!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    RIP: generic_pipe_buf_release+0x3c/0x40
    Call Trace:
      splice_direct_to_actor+0x143/0x1e0
      ? generic_pipe_buf_nosteal+0x10/0x10
      do_splice_direct+0x8f/0xb0
      do_sendfile+0x199/0x380
      SyS_sendfile64+0x90/0xa0
      tracesys_phase2+0x84/0x89

    (busybox's cat uses sendfile(2), unlike the coreutils version)

    This is because tracing_splice_read_pipe() can call splice_to_pipe()
    with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
    we fill the page pointers and the other fields of the pipe_buffers with
    garbage.

    All other callers of splice_to_pipe() avoid calling it when
    nr_pages == 0, and we could make tracing_splice_read_pipe() do
    that too, but it seems reasonable to have splice_to_pipe() handle
    this condition gracefully.

    Cc: stable@vger.kernel.org
    Signed-off-by: Rabin Vincent
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Rabin Vincent
     

05 Mar, 2016

1 commit

  • This way we can set kiocb flags also from the sync read/write path for
    the read_iter/write_iter operations. For now there is no way to pass
    flags to plain read/write operations as there is no real need for that,
    and all flags passed are explicitly rejected for these files.
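
    A hypothetical condensation of the plumbing (the real code is
    spread across the vfs_readv()/vfs_writev() paths; the flag names
    are just examples):

    static ssize_t sync_read_iter(struct file *file,
                                  struct iov_iter *iter,
                                  loff_t *ppos, int flags)
    {
        struct kiocb kiocb;
        ssize_t ret;

        init_sync_kiocb(&kiocb, file);
        kiocb.ki_pos = *ppos;
        if (flags & RWF_HIPRI)             /* example flag */
            kiocb.ki_flags |= IOCB_HIPRI;
        ret = file->f_op->read_iter(&kiocb, iter);
        if (ret > 0)
            *ppos = kiocb.ki_pos;
        return ret;
    }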

    Signed-off-by: Milosz Tanski
    [hch: rebased on top of my kiocb changes]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Stephen Bates
    Tested-by: Stephen Bates
    Acked-by: Jeff Moyer
    Signed-off-by: Al Viro

    Christoph Hellwig
     

09 Jan, 2016

1 commit

  • During testing, I discovered that __generic_file_splice_read() returns
    0 (EOF) when aops->readpage fails with AOP_TRUNCATED_PAGE on the first
    page of a single/multi-page splice read operation. This EOF return code
    causes the userspace test to (correctly) report a zero-length read error
    when it was expecting otherwise.

    The current strategy of returning a partial non-zero read when ->readpage
    returns AOP_TRUNCATED_PAGE works only when the failed page is not the
    first of the lot being processed.

    This patch attempts to retry lookup and call ->readpage again on pages
    that had previously failed with AOP_TRUNCATED_PAGE. With this patch, my
    tests pass and I haven't noticed any unwanted side effects.

    This version removes the thrice-retry loop and instead indefinitely
    retries lookups on AOP_TRUNCATED_PAGE errors from ->readpage. This
    behavior is now similar to do_generic_file_read().
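
    A sketch of the retry (my condensation; the real loop lives in
    __generic_file_splice_read()):

    for (;;) {
        error = mapping->a_ops->readpage(in, page);
        if (error != AOP_TRUNCATED_PAGE)
            break;
        /* the page was truncated under us: drop our reference, redo
         * the page-cache lookup and call ->readpage again - retried
         * indefinitely, as do_generic_file_read() does */
        put_page(page);
        page = find_or_create_page(mapping, index,
                                   mapping_gfp_mask(mapping));
        if (!page) {
            error = -ENOMEM;
            break;
        }
    }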

    Signed-off-by: Abhi Das
    Reviewed-by: Jan Kara
    Cc: Bob Peterson
    Cc: Al Viro
    Signed-off-by: Al Viro

    Abhi Das
     

24 Nov, 2015

2 commits

  • The following test program from Dmitry can cause softlockups or RCU
    stalls as it copies 1GB from tmpfs into eventfd and we don't have any
    scheduling point at that path in sendfile(2) implementation:

    int r1 = eventfd(0, 0);
    int r2 = memfd_create("", 0);
    unsigned long n = 1<<30;
    fallocate(r2, 0, 0, n);
    sendfile(r1, r2, 0, n);

    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Commit 296291cdd162 (mm: make sendfile(2) killable) fixed an issue where
    sendfile(2) was doing a lot of tiny writes into a filesystem and thus
    was unkillable for a long time. However sendfile(2) can be (mis)used to
    issue lots of writes into arbitrary file descriptor such as evenfd or
    similar special file descriptors which never hit the standard filesystem
    write path and thus are still unkillable. E.g. the following example
    from Dmitry burns CPU for ~16s on my test system without possibility to
    be killed:

    int r1 = eventfd(0, 0);
    int r2 = memfd_create("", 0);
    unsigned long n = 1<<30;
    fallocate(r2, 0, 0, n);
    sendfile(r1, r2, 0, n);

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a
    more generic gfp mask that would be used for allocations not
    directly related to the page cache but performed in the same
    context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
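
    The helper itself boils down to a mask (sketch of the pagemap.h
    definition):

    static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                               gfp_t gfp_mask)
    {
        return mapping_gfp_mask(mapping) & gfp_mask;
    }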

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Jun, 2015

2 commits

  • Merge first patchbomb from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - kernel/watchdog.c feature work (took ages to get right)

    - most of MM. A few tricky bits are held up and probably won't make 4.2.

    * emailed patches from Andrew Morton: (91 commits)
    mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
    mm, thp: respect MPOL_PREFERRED policy with non-local node
    tmpfs: truncate prealloc blocks past i_size
    mm/memory hotplug: print the last vmemmap region at the end of hot add memory
    mm/mmap.c: optimization of do_mmap_pgoff function
    mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
    mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
    mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
    mm: kmemleak: fix delete_object_*() race when called on the same memory block
    mm: kmemleak: allow safe memory scanning during kmemleak disabling
    memcg: convert mem_cgroup->under_oom from atomic_t to int
    memcg: remove unused mem_cgroup->oom_wakeups
    frontswap: allow multiple backends
    x86, mirror: x86 enabling - find mirrored memory ranges
    mm/memblock: allocate boot time data structures from mirrored memory
    mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
    mm: do not ignore mapping_gfp_mask in page cache allocation paths
    mm/cma.c: fix typos in comments
    mm/oom_kill.c: print points as unsigned int
    mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
    ...

    Linus Torvalds
     
  • page_cache_read, do_generic_file_read, __generic_file_splice_read and
    __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
    add_to_page_cache_lru which might cause recursion into fs down in the
    direct reclaim path if the mapping really relies on GFP_NOFS semantic.

    This doesn't seem to be the case now because page_cache_read (page fault
    path) doesn't seem to suffer from the reclaim recursion issues and
    do_generic_file_read and __generic_file_splice_read also shouldn't be
    called under fs locks which would deadlock in the reclaim path.
    Anyway, it is better to obey the mapping's gfp mask and prevent
    later breakage.
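
    The resulting call pattern is roughly (sketch, not the literal
    diff):

    error = add_to_page_cache_lru(page, mapping, index,
                                  GFP_KERNEL & mapping_gfp_mask(mapping));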

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Johannes Weiner
    Cc: Al Viro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Cc: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 May, 2015

1 commit


06 May, 2015

1 commit

  • Using sendfile() with the small program below to get MD5 sums of
    some files, it appears that big files (over 64 kbytes on a 4k-page
    system) get a wrong MD5 sum while small files get the correct sum.
    The program uses sendfile() to send a file to an AF_ALG socket
    for hashing.

    /* md5sum2.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/socket.h>
    #include <sys/sendfile.h>
    #include <linux/if_alg.h>

    int main(int argc, char **argv)
    {
        int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
        struct stat st;
        struct sockaddr_alg sa = {
            .salg_family = AF_ALG,
            .salg_type = "hash",
            .salg_name = "md5",
        };
        int n;

        bind(sk, (struct sockaddr *)&sa, sizeof(sa));

        for (n = 1; n < argc; n++) {
            int size;
            off_t offset = 0;          /* sendfile() wants an off_t */
            unsigned char buf[4096];   /* unsigned, for clean %2.2x output */
            int fd;
            int sko;
            int i;

            fd = open(argv[n], O_RDONLY);
            sko = accept(sk, NULL, 0);
            fstat(fd, &st);
            size = st.st_size;
            sendfile(sko, fd, &offset, size);
            size = read(sko, buf, sizeof(buf));
            for (i = 0; i < size; i++)
                printf("%2.2x", buf[i]);
            printf(" %s\n", argv[n]);
            close(fd);
            close(sko);
        }
        exit(0);
    }

    The test below is done using official Linux patch files. The first
    result is from a software-based md5sum; the second is from the
    program above.
    root@vgoip:~# ls -l patch-3.6.*
    -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
    -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz

    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz

    After investigation, it appears that sendfile() sends the files in
    blocks of 64 kbytes (16 times PAGE_SIZE). The problem is that at
    the end of each block the SPLICE_F_MORE flag is missing, so the
    hashing operation is reset as if it were the end of the file.

    This patch adds SPLICE_F_MORE to the flags when more data is pending.
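
    The idea of the fix, as a sketch (condensed from the
    splice_direct_to_actor() loop; 'more' records whether the caller
    itself passed SPLICE_F_MORE):

    if (read_len < len)
        sd->flags |= SPLICE_F_MORE;   /* more data still pending */
    else if (!more)
        sd->flags &= ~SPLICE_F_MORE;  /* last chunk, caller didn't ask */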

    With the patch applied, we get the correct sums:

    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

    Signed-off-by: Christophe Leroy
    Signed-off-by: Jens Axboe

    Christophe Leroy
     

16 Apr, 2015

1 commit

  • The original dax patchset split the ext2/4_file_operations because of the
    two NULL splice_read/splice_write in the dax case.

    In the VFS, if splice_read/splice_write are NULL we call
    default_splice_read/write.

    What we do here is make generic_file_splice_read aware of IS_DAX() so the
    original ext2/4_file_operations can be used as is.

    For write it appears that iter_file_splice_write is just fine. It uses
    the regular f_op->write(file,..) or new_sync_write(file, ...).
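
    A sketch of the dispatch this adds (simplified; based on the
    4.1-era generic_file_splice_read()):

    ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                                     struct pipe_inode_info *pipe,
                                     size_t len, unsigned int flags)
    {
        if (IS_DAX(in->f_mapping->host))
            /* DAX bypasses the page cache: use the default
             * (read-into-pipe) path instead */
            return default_file_splice_read(in, ppos, pipe, len, flags);
        return __generic_file_splice_read(in, ppos, pipe, len, flags);
    }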

    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     

12 Apr, 2015

1 commit


26 Mar, 2015

1 commit


29 Jan, 2015

2 commits


24 Oct, 2014

1 commit


12 Jun, 2014

4 commits


28 May, 2014

1 commit

  • Commit 6130f5315ee8 "switch vmsplice_to_user() to copy_page_to_iter()" in
    v3.15-rc1 broke vmsplice(2).

    This patch fixes two bugs:

    - count is not initialized to a proper value, which resulted in no data
    being copied

    - if rw_copy_check_uvector() returns a negative value then the iov
    might be leaked.

    Tested OK.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

07 May, 2014

1 commit

  • For now, just use the same thing we pass to ->direct_IO() - it's all
    iovec-based at the moment. Pass it explicitly to iov_iter_init() and
    account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
    uses.

    Signed-off-by: Al Viro

    Al Viro