17 Feb, 2015

1 commit

  • All callers of get_xip_mem() are now gone. Remove checks for it,
    initialisers of it, documentation of it and the only implementation of it.
    Also remove mm/filemap_xip.c as it is now empty. Also remove
    documentation of the long-gone get_xip_page().

    Signed-off-by: Matthew Wilcox
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

1 commit

  • A random seek IO benchmark appeared to regress because of a change to
    readahead but the real problem was the benchmark. To ensure the IO
    request accessed disk, it used fadvise(FADV_DONTNEED) on a block boundary
    (512K) but the hint is ignored by the kernel. This is correct but not
    necessarily obvious behaviour. As much as I dislike comment patches, the
    explanation for this behaviour predates current git history. Clarify why
    it behaves like this in case someone "fixes" fadvise or readahead for the
    wrong reasons.
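
    For an application that hits the same surprise, one option is to trim the
    range to the whole pages it contains before issuing the hint, since partial
    pages at either end are preserved. A minimal sketch, assuming a power-of-two
    page size and no overflow in offset + len; whole_pages() and struct
    page_range are hypothetical names, not kernel or libc API:

```c
#include <assert.h>
#include <stdint.h>

/* Byte range [start, start + len) made up of whole pages only. */
struct page_range {
    uint64_t start;
    uint64_t len;   /* 0 means the request covers no complete page */
};

/*
 * Shrink [offset, offset + len) to the whole pages it contains:
 * round the start up and the end down to a page boundary.
 * page_size is assumed to be a power of two.
 */
static struct page_range whole_pages(uint64_t offset, uint64_t len,
                                     uint64_t page_size)
{
    struct page_range r = { 0, 0 };
    uint64_t mask, first, last;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    mask = page_size - 1;
    first = (offset + mask) & ~mask;   /* round the start up */
    last = (offset + len) & ~mask;     /* round the end down */
    if (last > first) {
        r.start = first;
        r.len = last - first;
    }
    return r;
}
```

    With 4096-byte pages, a 512-byte request at offset 512 yields len == 0:
    there is no complete page for FADV_DONTNEED to act on.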

    Signed-off-by: Mel Gorman
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Mar, 2013

1 commit


27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • Rob van der Heij reported the following (paraphrased) on private mail.

    The scenario is that I want to avoid backups to fill up the page
    cache and purge stuff that is more likely to be used again (this is
    with s390x Linux on z/VM, so I don't give it as much memory that
    we don't care anymore). So I have something with LD_PRELOAD that
    intercepts the close() call (from tar, in this case) and issues
    a posix_fadvise() just before closing the file.

    This mostly works, except for small files (less than 14 pages)
    that remain in page cache after the fact.

    Unfortunately Rob has not had a chance to test this exact patch but the
    test program below should be reproducing the problem he described.

    The issue is the per-cpu pagevecs for LRU additions. If the pages are
    added by one CPU but fadvise() is called on another then the pages
    remain resident as the invalidate_mapping_pages() only drains the local
    pagevecs via its call to pagevec_release(). The user-visible effect is
    that a program that uses fadvise() properly is not obeyed.

    A possible fix for this is to put the necessary smarts into
    invalidate_mapping_pages() to globally drain the LRU pagevecs if a
    pagevec page could not be discarded. The downside with this is that an
    inode cache shrink would send a global IPI and memory pressure
    potentially causing global IPI storms is very undesirable.

    Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
    check if invalidate_mapping_pages() discarded all the requested pages.
    If a subset of pages are discarded it drains the LRU pagevecs and tries
    again. If the second attempt fails, it assumes it is due to the pages
    being mapped, locked or dirty and does not care. With this patch, an
    application using fadvise() correctly will be obeyed but there is a
    downside that a malicious application can force the kernel to send
    global IPIs and increase overhead.
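
    The retry logic described above can be modelled in user space. This is a
    toy sketch only: struct toy_page and the toy_* functions are hypothetical
    stand-ins for the page cache, invalidate_mapping_pages() and the global
    LRU pagevec drain, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cached page; not a kernel structure. */
struct toy_page {
    bool resident;        /* still in the page cache */
    bool in_remote_vec;   /* parked in another CPU's pagevec */
};

/* Drop every page not held by a remote pagevec; return how many were dropped. */
static size_t toy_invalidate(struct toy_page *pages, size_t n)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].resident && !pages[i].in_remote_vec) {
            pages[i].resident = false;
            dropped++;
        }
    }
    return dropped;
}

/* Model of draining every CPU's pagevec: nothing stays parked afterwards. */
static void toy_drain_all(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].in_remote_vec = false;
}

/*
 * DONTNEED as described in the commit message: invalidate once and, if only
 * a subset was discarded, drain globally and try again. A second shortfall
 * is tolerated. Returns the number of global drains performed.
 */
static int toy_dontneed(struct toy_page *pages, size_t n)
{
    int drains = 0;

    assert(pages != NULL || n == 0);
    if (toy_invalidate(pages, n) < n) {
        toy_drain_all(pages, n);
        drains++;
        toy_invalidate(pages, n);
    }
    return drains;
}
```

    The drain only happens when the first pass falls short, which is why the
    common case stays cheap while a hostile caller can still force drains.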

    If accepted, I would like this to be considered as a -stable candidate.
    It's not an urgent issue but it's a system call that is not working as
    advertised which is weak.

    The following test program demonstrates the problem. It should never
    report that pages are still resident but will without this patch. It
    assumes that CPU 0 and 1 exist.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Assumed value: the original definition is missing from this log.
     * The report concerns files smaller than 14 pages.
     */
    #define FILESIZE_PAGES 5

    int main(void) {
        int fd, err;
        int pagesize = getpagesize();
        ssize_t written = 0, expected;
        char *buf;
        unsigned char *vec;
        int resident, i;
        cpu_set_t set;

        /* Prepare a buffer for writing */
        expected = FILESIZE_PAGES * pagesize;
        buf = malloc(expected + 1);
        if (buf == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }
        buf[expected] = 0;
        memset(buf, 'a', expected);

        /* Prepare the mincore vec */
        vec = malloc(FILESIZE_PAGES);
        if (vec == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }

        /* Bind ourselves to CPU 0 */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* open file, unlink and write buffer (O_CREAT requires a mode) */
        fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }
        unlink("fadvise-test-file");
        while (written < expected) {
            ssize_t this_write;
            this_write = write(fd, buf + written, expected - written);

            if (this_write == -1) {
                perror("write");
                exit(EXIT_FAILURE);
            }

            written += this_write;
        }
        free(buf);

        /*
         * Force ourselves to another CPU. If fadvise only flushes the local
         * CPUs pagevecs then the fadvise will fail to discard all file pages
         */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* sync and fadvise to discard the page cache */
        fsync(fd);
        /* posix_fadvise() returns the error number; it does not set errno */
        err = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED);
        if (err != 0) {
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
            exit(EXIT_FAILURE);
        }

        /* map the file and use mincore to see which parts of it are resident */
        buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            exit(EXIT_FAILURE);
        }
        if (mincore(buf, expected, vec) == -1) {
            perror("mincore");
            exit(EXIT_FAILURE);
        }

        /* Check residency (bit 0 of each vec entry is the residency flag) */
        for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
            if (vec[i] & 1)
                resident++;
        }
        if (resident != 0) {
            printf("Nr unexpected pages resident: %d\n", resident);
            exit(EXIT_FAILURE);
        }

        munmap(buf, expected);
        close(fd);
        free(vec);
        exit(EXIT_SUCCESS);
    }

    Signed-off-by: Mel Gorman
    Reported-by: Rob van der Heij
    Tested-by: Rob van der Heij
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Feb, 2013

1 commit


27 Sep, 2012

2 commits


01 Aug, 2012

1 commit

  • Eric Wong reported that his test suite failed when /tmp is tmpfs.

    https://lkml.org/lkml/2012/2/24/479

    Currently the input check of POSIX_FADV_WILLNEED has two problems.

    - requires a_ops->readpage. But in fact, force_page_cache_readahead()
    requires that the target filesystem has either ->readpage or ->readpages.

    - returns -EINVAL when the filesystem doesn't have ->readpage. But
    POSIX says that fadvise is merely a hint. Thus fadvise() should return
    0 if the filesystem has no means of implementing fadvise(). The userland
    application should not know nor care which type of filesystem backs the
    TMPDIR directory, as Eric pointed out. There is nothing which userspace
    can do to solve this error.

    So change the return value to 0 when the filesystem doesn't support readahead.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Signed-off-by: Eric Wong
    Tested-by: Eric Wong
    Reviewed-by: Wanlong Gao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Jan, 2012

1 commit

  • Previously POSIX_FADV_DONTNEED would start writeback for the entire file
    when the bdi was not write congested. This negatively impacts performance
    if the file contains dirty pages outside of the requested range. This
    change uses __filemap_fdatawrite_range() to only initiate writeback for
    the requested range.

    Signed-off-by: Shawn Bohrer
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
     

07 Mar, 2010

1 commit

  • This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

    POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
    a 16K read will be carried out in 4 _sync_ 1-page reads.

    In other places, ra_pages==0 means
    - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
    - some IO error happened
    where multi-page read IO won't help or should be avoided.

    POSIX_FADV_RANDOM actually wants different semantics: to disable the
    *heuristic* readahead algorithm, and to use a dumb one which faithfully
    submits read IO for whatever the application requests.

    So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

    Note that the random hint is not likely to help random read performance
    noticeably. And it may be too permissive on huge request size (its IO
    size is not limited by read_ahead_kb).
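
    The "16K read becomes 4 one-page reads" arithmetic can be checked with a
    toy model. toy_read_ios() is hypothetical, and treating the FMODE_RANDOM
    path as exactly one IO per request is a simplifying assumption:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of read IOs a toy model issues for one request.
 * one_io_per_request == 0 models the old ra_pages == 0 behaviour, where the
 * request is split into synchronous single-page reads. A nonzero value
 * models a path that submits the request as the application asked for it.
 */
static size_t toy_read_ios(size_t request_bytes, size_t page_size,
                           int one_io_per_request)
{
    size_t pages;

    assert(page_size > 0);
    pages = (request_bytes + page_size - 1) / page_size;  /* round up */
    if (pages == 0)
        return 0;
    return one_io_per_request ? 1 : pages;
}
```

    With 4096-byte pages, a 16384-byte request costs 4 IOs in the first mode
    and 1 in the second.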

    In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
    (NFS read) performance of the application increased by 313%!

    Tested-by: Quentin Barnes
    Signed-off-by: Wu Fengguang
    Cc: Nick Piggin
    Cc: Andi Kleen
    Cc: Steven Whitehouse
    Cc: David Howells
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Chuck Lever
    Cc: [2.6.33.x]
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

17 Jun, 2009

1 commit


14 Jan, 2009

1 commit

  • System calls with an unsigned long long argument can't be converted with
    the standard wrappers since that would include a cast to long, which in
    turn means that we would lose the upper 32 bit on 32 bit architectures.
    Also semctl can't use the standard wrapper since it has a 'union'
    parameter.

    So we handle them as special case and add some extra wrappers instead.
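
    The loss of the upper bits can be shown without a 32-bit machine by
    masking a 64-bit value down to the width of the narrower type.
    keep_low_bits() is a hypothetical helper for illustration, not part of
    the wrappers themselves:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Keep only the low `bits` bits of value, which is what survives when a
 * 64-bit argument is pushed through a type that is `bits` wide.
 */
static uint64_t keep_low_bits(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return value;
    return value & ((UINT64_C(1) << bits) - 1);
}
```

    For example, 0x100000001 passed through a 32-bit long arrives as 1, so an
    offset just past 4 GiB would silently become an offset near zero.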

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

17 Oct, 2008

1 commit


28 Apr, 2008

1 commit

  • Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
    the user mappings.

    This requires the get_xip_page API to be changed to an address based one.
    Improve the API layering a little bit too, while we're here.

    This is required in order to support XIP filesystems on memory that isn't
    backed with struct page (but memory with struct page is still supported too).

    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Feb, 2008

1 commit

  • I've written some test programs in the LTP project. While writing them I
    ran into a problem which I cannot solve in user land. So I wrote a patch
    for the Linux kernel. Please include this patch if acceptable.

    The test program tests the 4th parameter of fadvise64_64:

    long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);

    My test case calls fadvise64_64 with invalid advice value and checks errno is
    set to EINVAL. About the advice parameter man page says:

    ...
    Permissible values for advice include:

    POSIX_FADV_NORMAL
    ...
    POSIX_FADV_SEQUENTIAL
    ...
    POSIX_FADV_RANDOM
    ...
    POSIX_FADV_NOREUSE
    ...
    POSIX_FADV_WILLNEED
    ...
    POSIX_FADV_DONTNEED
    ...
    ERRORS
    ...
    EINVAL An invalid value was specified for advice.

    However, I got a bug report that the system call invocations
    in my test case returned 0 unexpectedly.

    I've inspected the kernel code:

    asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
    {
        struct file *file = fget(fd);
        struct address_space *mapping;
        struct backing_dev_info *bdi;
        loff_t endbyte; /* inclusive */
        pgoff_t start_index;
        pgoff_t end_index;
        unsigned long nrpages;
        int ret = 0;

        if (!file)
            return -EBADF;

        if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
            ret = -ESPIPE;
            goto out;
        }

        mapping = file->f_mapping;
        if (!mapping || len < 0) {
            ret = -EINVAL;
            goto out;
        }

        if (mapping->a_ops->get_xip_page)
            /* no bad return value, but ignore advice */
            goto out;
        ...
    out:
        fput(file);
        return ret;
    }

    I found the advice parameter is just ignored in the case
    mapping->a_ops->get_xip_page is given. This behavior is different from
    what is written on the man page. Is this o.k.?

    get_xip_page is set if CONFIG_EXT2_FS_XIP is enabled.
    In any case, I cannot find an easy way to detect from user space
    whether the get_xip_page field is set or CONFIG_EXT2_FS_XIP is
    enabled.

    I propose the following patch which checks the advice parameter
    even if get_xip_page is given.

    Signed-off-by: Masatake YAMATO
    Acked-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
     

09 Dec, 2006

1 commit


06 Aug, 2006

1 commit

  • The POSIX_FADV_NOREUSE hint means "the application will use this range of the
    file a single time". It seems to be intended that the implementation will use
    this hint to perform drop-behind of that part of the file when the application
    gets around to reading or writing it.

    However for reasons which aren't obvious (or sane?) I mapped
    POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.

    That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.

    This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
    to perform readahead, they lose. The likelihood is low.

    If/when we later implement POSIX_FADV_NOREUSE things will get interesting - to
    do it fully we'll need to maintain file offset/length ranges and perform all
    sorts of complex tricks, and managing the lifetime of those ranges' data
    structures will be interesting.

    A sensible implementation would probably ignore the file range and would
    simply mark the entire file as needing some form of drop-behind treatment.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Jul, 2006

1 commit


01 Apr, 2006

1 commit

  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is that we can
    move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: it's notable that we can sync an fd which wasn't opened for writing.
    Same with fsync() and fdatasync().

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...
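
    The 16TB figure is consistent with a 32-bit page index and 4096-byte
    pages. Both values are assumptions of this sketch, since the commit
    message does not spell them out, and page_cache_limit_bytes() is a
    hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes addressable when pages are numbered with an index that is
 * index_bits wide. The caller must pick values whose product fits in 64 bits.
 */
static uint64_t page_cache_limit_bytes(unsigned index_bits, uint64_t page_size)
{
    assert(index_bits >= 1 && index_bits < 64);
    return (UINT64_C(1) << index_bits) * page_size;
}
```

    With index_bits = 32 and page_size = 4096 this gives 2^44 bytes, i.e.
    16 TiB.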

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Mar, 2006

1 commit

  • Add two new linux-specific fadvise() extensions:

    LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
    offsets `offset' and `offset+len'. Any pages which are currently under
    writeout are skipped, whether or not they are dirty.

    LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
    offsets `offset' and `offset+len'.

    By combining these two operations the application may do several things:

    LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
    pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
    of the currently dirty pages at the disk, wait until they have been written.

    It should be noted that none of these operations write out the file's
    metadata. So unless the application is strictly performing overwrites of
    already-instantiated disk blocks, there are no guarantees here that the data
    will be available after a crash.

    To complete this suite of operations I guess we should have a "sync file
    metadata only" operation. This gives applications access to all the building
    blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
    well with the fadvise() interface. Probably it should be a new syscall:
    sys_fmetadatasync().

    The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
    It is made to represent the last affected byte in the file (ie: it is
    inclusive). Generally, all these byterange and pagerange functions are
    inclusive so we can easily represent EOF with -1.
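
    An inclusive end byte with -1 standing for EOF can be computed as in the
    following user-space sketch. inclusive_endbyte() is a hypothetical name,
    and treating len == 0 and overflowing ranges as "up to EOF" is an
    assumption of this sketch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Last byte affected by the range [offset, offset + len), or -1 when the
 * range runs to end of file (len == 0, or offset + len would overflow).
 */
static int64_t inclusive_endbyte(int64_t offset, int64_t len)
{
    assert(offset >= 0 && len >= 0);
    if (len == 0 || offset > INT64_MAX - len)
        return -1;
    return offset + len - 1;
}
```

    For offset 0 and len 4096 the result is 4095, so the range covers exactly
    the first 4096-byte page and does not spill into the next one.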

    As Ulrich notes, these two functions are somewhat abusive of the fadvise()
    concept, which appears to be "set the future policy for this fd".

    But these commands are a perfect fit with the fadvise() implementation, and
    several of the existing fadvise() commands are synchronous and don't affect
    future policy either. I think we can live with the slight incongruity.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Jan, 2006

1 commit


24 Jun, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds