15 Sep, 2018

1 commit

  • [ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]

    Signed integer overflow is undefined according to the C standard. The
    overflow in ksys_fadvise64_64() is deliberate, but since it is signed
    overflow, UBSAN complains:

    UBSAN: Undefined behaviour in mm/fadvise.c:76:10
    signed integer overflow:
    4 + 9223372036854775805 cannot be represented in type 'long long int'

    Use unsigned types to do math. Unsigned overflow is defined so UBSAN
    will not complain about it. This patch doesn't change generated code.
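
A minimal userspace sketch of the idea, not the kernel code itself. fadvise_endbyte() is a hypothetical helper, and it assumes negative values were rejected earlier. The sum is formed in uint64_t, where wraparound is defined, and a result beyond the signed 64-bit range is treated as "through end of file".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the last byte covered by [offset, offset+len),
 * or -1 to mean "through end of file". All math is done in unsigned types,
 * so no signed overflow can occur.
 */
static int64_t fadvise_endbyte(int64_t offset, int64_t len)
{
    /* Assumption: the caller has already rejected negative values. */
    assert(offset >= 0 && len >= 0);

    /* Unsigned addition is well defined even when the result is huge. */
    uint64_t end = (uint64_t)offset + (uint64_t)len;

    /* len == 0 means "to EOF"; so does a sum that no longer fits in int64_t. */
    if (len == 0 || end > (uint64_t)INT64_MAX)
        return -1;

    return (int64_t)(end - 1); /* make the end inclusive */
}
```

With the values from the UBSAN report (offset 4, len 9223372036854775805) the helper returns -1 without ever performing a signed addition.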

    [akpm@linux-foundation.org: add comment explaining the casts]
    Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

26 Apr, 2018

1 commit

  • [ Upstream commit a7ab400d6fe73d0119fdc234e9982a6f80faea9f ]

    During our recent testing with fadvise(FADV_DONTNEED), we find that if
    given offset/length is not page-aligned, the last page will not be
    discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/),
    we map a 10KB-sized file into memory and then try to run this tool to
    evict the whole file mapping, but the last single page always remains
    in memory:

    $./vmtouch -e test_10K
    Files: 1
    Directories: 0
    Evicted Pages: 3 (12K)
    Elapsed: 2.1e-05 seconds

    $./vmtouch test_10K
    Files: 1
    Directories: 0
    Resident Pages: 1/3 4K/12K 33.3%
    Elapsed: 5.5e-05 seconds

    However when we test with an older kernel, say 3.10, this problem is
    gone. So we wonder if this is a regression:

    $./vmtouch -e test_10K
    Files: 1
    Directories: 0
    Evicted Pages: 3 (12K)
    Elapsed: 8.2e-05 seconds

    $./vmtouch test_10K
    Files: 1
    Directories: 0
    Resident Pages: 0/3 0/12K 0%
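
    /*
     * Header names were lost from this listing; these are the ones the
     * code below needs.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int i, fd, ret, len;
        struct stat buf;
        void *addr;
        unsigned char *vec;
        char *strbuf;
        ssize_t pagesize = getpagesize();
        ssize_t filesize;

        /* Usage: ./test_fadvise <file> <size-in-bytes> */
        if (argc < 3)
            return -1;

        fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
        if (fd < 0)
            return -1;
        filesize = strtoul(argv[2], NULL, 10);

        /* Fill the file with data and flush it so its pages are clean. */
        strbuf = malloc(filesize);
        memset(strbuf, 42, filesize);
        write(fd, strbuf, filesize);
        free(strbuf);
        fsync(fd);

        /* Number of pages spanned by the file, rounding up. */
        len = (filesize + pagesize - 1) / pagesize;
        printf("length of pages: %d\n", len);

        addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
            return -1;

        /* posix_fadvise() returns an error number, not -1, on failure. */
        ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
        if (ret != 0)
            return -1;

        /* Ask which pages of the mapping are still resident. */
        vec = malloc(len);
        ret = mincore(addr, filesize, (void *)vec);
        if (ret < 0)
            return -1;

        for (i = 0; i < len; i++)
            printf("pages[%d]: %x\n", i, vec[i] & 0x1);

        free(vec);
        close(fd);

        return 0;
    }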
    ==============================

    Test 1: running on kernel with commit 18aba41cbf reverted:

    [root@caspar ~]# uname -r
    4.15.0-rc6.revert+
    [root@caspar ~]# ./test_fadvise file1 1024
    length of pages: 1
    pages[0]: 0 #
    Signed-off-by: Caspar Zhang
    Reviewed-by: Oliver Yang
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    shidao.ytt
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

1 commit

  • The fadvise() manpage is silent on fadvise()'s effect on memory-based
    filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
    sysfs, kernfs). The current implementation of fadvise is mostly a noop
    for such filesystems except for FADV_DONTNEED which will trigger
    expensive remote LRU cache draining. This patch makes the noop of
    fadvise() on such file systems very explicit.

    However this change has two side effects for ramfs and one for tmpfs.
    First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
    pages of ramfs (allocated through read, readahead & read fault) and
    tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could
    create such clean zero'ed pages for ramfs. This change removes those
    possibilities.

    One of our generic libraries does fadvise(FADV_DONTNEED). Recently we
    observed high latency in fadvise() and noticed that the users have
    started using tmpfs files and the latency was due to expensive remote
    LRU cache draining. For normal tmpfs files (have data written on them),
    fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
    draining.

    Link: http://lkml.kernel.org/r/20170818011023.181465-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

21 Dec, 2016

1 commit

  • When FADV_DONTNEED cannot drop all pages in the range, it observes that
    some pages might still be on per-cpu LRU caches after recent
    instantiation and so initiates remote calls to all CPUs to flush their
    local caches. However, in most cases, the fadvise happens from the same
    context that instantiated the pages, and any pre-LRU pages in the
    specified range are most likely sitting on the local CPU's LRU cache,
    and so in many cases this results in unnecessary remote calls, which, in
    a loaded system, can hold up the fadvise() call significantly.

    [ I didn't record it in the extreme case we observed at Facebook,
    unfortunately. We had a slow-to-respond system and noticed
    lru_add_drain_all() leading the profile during fadvise calls. This
    patch came out of thinking about the code and how we commonly call
    FADV_DONTNEED.

    FWIW, I wrote a silly directory tree walker/searcher that recurses
    through /usr to read and FADV_DONTNEED each file it finds. On a 2
    socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With
    the patch, that cost is gone; the local drain cost shows at 0.09%. ]

    Try to avoid the remote call by flushing the local LRU cache before even
    attempting to invalidate anything. It's a cheap operation, and the
    local LRU cache is the most likely to hold any pre-LRU pages in the
    specified fadvise range.

    Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Jun, 2016

1 commit

  • I noticed that the logic in the fadvise64_64 syscall is incorrect for
    partial pages. While the first page of the region is correctly skipped if
    it is partial, the last page of the region is mistakenly discarded.
    This leads to problems for applications that read data in
    non-page-aligned chunks discarding already processed data between the
    reads.

    A somewhat misguided application that does something like write(XX bytes
    (non-page-aligned)); drop the data it just wrote; repeat gets a
    significant penalty in performance as a result.
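
The intended rule (skip a partial first page, and also keep a partial last page) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: whole_pages(), struct page_range and the SKETCH_* constants are hypothetical names, pages are assumed to be 4 KiB, and endbyte is the inclusive last byte of the range.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Inclusive range of page indexes, or empty when no whole page is covered. */
struct page_range {
    uint64_t first;
    uint64_t last;
    int empty;
};

/* Pages that lie entirely inside the byte range [offset, endbyte]. */
static struct page_range whole_pages(uint64_t offset, uint64_t endbyte)
{
    struct page_range r = { 0, 0, 0 };

    /* Round the start up so that a partial first page is skipped. */
    r.first = (offset + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
    r.last = endbyte >> SKETCH_PAGE_SHIFT;

    /* The last page is whole only if endbyte is the final byte of a page. */
    if ((endbyte & (SKETCH_PAGE_SIZE - 1)) != SKETCH_PAGE_SIZE - 1) {
        if (r.last == 0)
            r.empty = 1;
        else
            r.last--;
    }

    if (r.first > r.last)
        r.empty = 1;
    return r;
}
```

For a range that starts at byte 100 and ends 100 bytes into page 3, only pages 1 and 2 qualify; pages 0 and 3 are partial and are left alone.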

    Link: http://lkml.kernel.org/r/1464917140-1506698-1-git-send-email-green@linuxhacker.ru
    Signed-off-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Drokin
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Jun, 2015

1 commit

  • In several places, bdi_congested() and its wrappers are used to
    determine whether more IOs should be issued. With cgroup writeback
    support, this question can't be answered solely based on the bdi
    (backing_dev_info). It's dependent on whether the filesystem and bdi
    support cgroup writeback and the blkcg the inode is associated with.

    This patch implements inode_congested() and its wrappers, which take
    @inode and determine the congestion state considering cgroup
    writeback. The new functions replace bdi_*congested() calls in places
    where the query is about a specific inode and task.

    There are several filesystem users which also fit these criteria but
    they should be updated when each filesystem implements cgroup
    writeback support.

    v2: Now that a given inode is associated with only one wb, congestion
    state can be determined independent from the asking task. Drop
    @task. Spotted by Vivek. Also, converted to take @inode instead
    of @mapping and renamed to inode_congested().

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Feb, 2015

1 commit

  • All callers of get_xip_mem() are now gone. Remove checks for it,
    initialisers of it, documentation of it and the only implementation of it.
    Also remove mm/filemap_xip.c as it is now empty. Also remove
    documentation of the long-gone get_xip_page().

    Signed-off-by: Matthew Wilcox
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

1 commit

  • A random seek IO benchmark appeared to regress because of a change to
    readahead but the real problem was the benchmark. To ensure the IO
    request accessed disk, it used fadvise(FADV_DONTNEED) on a block boundary
    (512K) but the hint is ignored by the kernel. This is correct but not
    necessarily obvious behaviour. As much as I dislike comment patches, the
    explanation for this behaviour predates current git history. Clarify why
    it behaves like this in case someone "fixes" fadvise or readahead for the
    wrong reasons.

    Signed-off-by: Mel Gorman
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Mar, 2013

1 commit


27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • Rob van der Heij reported the following (paraphrased) in private mail.

    The scenario is that I want to avoid backups to fill up the page
    cache and purge stuff that is more likely to be used again (this is
    with s390x Linux on z/VM, so I don't give it as much memory that
    we don't care anymore). So I have something with LD_PRELOAD that
    intercepts the close() call (from tar, in this case) and issues
    a posix_fadvise() just before closing the file.

    This mostly works, except for small files (less than 14 pages)
    that remain in page cache after the fact.

    Unfortunately Rob has not had a chance to test this exact patch but the
    test program below should be reproducing the problem he described.

    The issue is the per-cpu pagevecs for LRU additions. If the pages are
    added by one CPU but fadvise() is called on another then the pages
    remain resident as the invalidate_mapping_pages() only drains the local
    pagevecs via its call to pagevec_release(). The user-visible effect is
    that a program that uses fadvise() properly is not obeyed.

    A possible fix for this is to put the necessary smarts into
    invalidate_mapping_pages() to globally drain the LRU pagevecs if a
    pagevec page could not be discarded. The downside with this is that an
    inode cache shrink would send a global IPI and memory pressure
    potentially causing global IPI storms is very undesirable.

    Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
    check if invalidate_mapping_pages() discarded all the requested pages.
    If a subset of pages are discarded it drains the LRU pagevecs and tries
    again. If the second attempt fails, it assumes it is due to the pages
    being mapped, locked or dirty and does not care. With this patch, an
    application using fadvise() correctly will be obeyed but there is a
    downside that a malicious application can force the kernel to send
    global IPIs and increase overhead.

    If accepted, I would like this to be considered as a -stable candidate.
    It's not an urgent issue but it's a system call that is not working as
    advertised which is weak.

    The following test program demonstrates the problem. It should never
    report that pages are still resident but will without this patch. It
    assumes that CPU 0 and 1 exist.

    /*
     * The preamble of this program is missing from the listing. The
     * headers below are the ones the code needs, and the value of
     * FILESIZE_PAGES is an assumption, not the original definition.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FILESIZE_PAGES 5

    int main(void) {
        int fd;
        int pagesize = getpagesize();
        ssize_t written = 0, expected;
        char *buf;
        unsigned char *vec;
        int resident, i;
        cpu_set_t set;

        /* Prepare a buffer for writing */
        expected = FILESIZE_PAGES * pagesize;
        buf = malloc(expected + 1);
        if (buf == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }
        buf[expected] = 0;
        memset(buf, 'a', expected);

        /* Prepare the mincore vec */
        vec = malloc(FILESIZE_PAGES);
        if (vec == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }

        /* Bind ourselves to CPU 0 */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* open file, unlink and write buffer (O_CREAT requires a mode) */
        fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }
        unlink("fadvise-test-file");
        while (written < expected) {
            ssize_t this_write;
            this_write = write(fd, buf + written, expected - written);

            if (this_write == -1) {
                perror("write");
                exit(EXIT_FAILURE);
            }

            written += this_write;
        }
        free(buf);

        /*
         * Force ourselves to another CPU. If fadvise only flushes the local
         * CPUs pagevecs then the fadvise will fail to discard all file pages
         */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* sync and fadvise to discard the page cache */
        fsync(fd);
        /* posix_fadvise() returns the error number instead of setting errno */
        errno = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED);
        if (errno != 0) {
            perror("posix_fadvise");
            exit(EXIT_FAILURE);
        }

        /* map the file and use mincore to see which parts of it are resident */
        buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            exit(EXIT_FAILURE);
        }
        if (mincore(buf, expected, vec) == -1) {
            perror("mincore");
            exit(EXIT_FAILURE);
        }

        /* Check residency */
        for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
            if (vec[i])
                resident++;
        }
        if (resident != 0) {
            printf("Nr unexpected pages resident: %d\n", resident);
            exit(EXIT_FAILURE);
        }

        munmap(buf, expected);
        close(fd);
        free(vec);
        exit(EXIT_SUCCESS);
    }

    Signed-off-by: Mel Gorman
    Reported-by: Rob van der Heij
    Tested-by: Rob van der Heij
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Feb, 2013

1 commit


27 Sep, 2012

2 commits


01 Aug, 2012

1 commit

  • Eric Wong reported that his test suite fails when /tmp is tmpfs.

    https://lkml.org/lkml/2012/2/24/479

    Currently the input check of POSIX_FADV_WILLNEED has two problems.

    - It requires a_ops->readpage. But in fact, force_page_cache_readahead()
    requires that the target filesystem has either ->readpage or ->readpages.

    - It returns -EINVAL when the filesystem doesn't have ->readpage. But
    POSIX says that fadvise is merely a hint. Thus fadvise() should return
    0 if the filesystem has no means of implementing fadvise(). The userland
    application should neither know nor care which type of filesystem backs the
    TMPDIR directory, as Eric pointed out. There is nothing which userspace
    can do to solve this error.

    So change the return value to 0 when the filesystem doesn't support readahead.
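
From the caller's side, treating the call as a hint might look like the sketch below. prefetch_hint() is a hypothetical helper introduced for illustration. One detail worth keeping in mind is that posix_fadvise() returns an error number directly instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: ask the kernel to start reading the range ahead of
 * use. A failure is only reported, never treated as fatal, because the
 * call is advisory. Returns 0 or the error number from posix_fadvise().
 */
static int prefetch_hint(int fd, off_t offset, off_t len)
{
    int err = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

    if (err != 0)
        fprintf(stderr, "WILLNEED hint not applied: %s\n", strerror(err));
    return err;
}
```

A caller would typically ignore the return value and carry on with its reads whether or not the hint was accepted.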

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Signed-off-by: Eric Wong
    Tested-by: Eric Wong
    Reviewed-by: Wanlong Gao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Jan, 2012

1 commit

  • Previously POSIX_FADV_DONTNEED would start writeback for the entire file
    when the bdi was not write congested. This negatively impacts performance
    if the file contains dirty pages outside of the requested range. This
    change uses __filemap_fdatawrite_range() to only initiate writeback for
    the requested range.

    Signed-off-by: Shawn Bohrer
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
     

07 Mar, 2010

1 commit

  • This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

    POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
    a 16K read will be carried out in 4 _sync_ 1-page reads.

    In other places, ra_pages==0 means
    - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
    - some IO error happened
    where multi-page read IO won't help or should be avoided.

    POSIX_FADV_RANDOM actually wants different semantics: to disable the
    *heuristic* readahead algorithm, and to use a dumb one which faithfully
    submits read IO for whatever the application requests.

    So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

    Note that the random hint is not likely to help random read performance
    noticeably. And it may be too permissive on huge request size (its IO
    size is not limited by read_ahead_kb).

    In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
    (NFS read) performance of the application increased by 313%!

    Tested-by: Quentin Barnes
    Signed-off-by: Wu Fengguang
    Cc: Nick Piggin
    Cc: Andi Kleen
    Cc: Steven Whitehouse
    Cc: David Howells
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Chuck Lever
    Cc: [2.6.33.x]
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

17 Jun, 2009

1 commit


14 Jan, 2009

1 commit

  • System calls with an unsigned long long argument can't be converted with
    the standard wrappers since that would include a cast to long, which in
    turn means that we would lose the upper 32 bits on 32-bit architectures.
    Also semctl can't use the standard wrapper since it has a 'union'
    parameter.

    So we handle them as a special case and add some extra wrappers instead.
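
The truncation problem can be shown in isolation. The sketch below emulates a 32-bit long with uint32_t so that it behaves the same on any host. The helper names are hypothetical, and the split/join pair is only an illustration of carrying a 64-bit value as two 32-bit halves, not a description of what the extra wrappers in this patch do.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate passing a 64-bit argument through a cast to a 32-bit long. */
static uint64_t through_32bit_long(uint64_t arg)
{
    return (uint64_t)(uint32_t)arg; /* the upper 32 bits are lost here */
}

/* Split a 64-bit value into two 32-bit halves. */
static void split64(uint64_t v, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(v >> 32);
    *lo = (uint32_t)(v & 0xffffffffu);
}

/* Rebuild the 64-bit value from its two halves. */
static uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

A file offset of 4 GiB + 1 (0x100000001) comes out of through_32bit_long() as 1, while splitting and rejoining returns the original value.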

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

17 Oct, 2008

1 commit


28 Apr, 2008

1 commit

  • Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
    the user mappings.

    This requires the get_xip_page API to be changed to an address based one.
    Improve the API layering a little bit too, while we're here.

    This is required in order to support XIP filesystems on memory that isn't
    backed with struct page (but memory with struct page is still supported too).

    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Feb, 2008

1 commit

  • I've written some test programs for the LTP project. While writing them I
    ran into a problem which I cannot solve in user land. So I wrote a patch
    for the Linux kernel. Please include this patch if acceptable.

    The test program tests the 4th parameter of fadvise64_64:

    long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);

    My test case calls fadvise64_64 with an invalid advice value and checks that
    errno is set to EINVAL. About the advice parameter, the man page says:

    ...
    Permissible values for advice include:

    POSIX_FADV_NORMAL
    ...
    POSIX_FADV_SEQUENTIAL
    ...
    POSIX_FADV_RANDOM
    ...
    POSIX_FADV_NOREUSE
    ...
    POSIX_FADV_WILLNEED
    ...
    POSIX_FADV_DONTNEED
    ...
    ERRORS
    ...
    EINVAL An invalid value was specified for advice.

    However, I got a bug report that the system call invocations
    in my test case returned 0 unexpectedly.

    I've inspected the kernel code:

    asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
    {
        struct file *file = fget(fd);
        struct address_space *mapping;
        struct backing_dev_info *bdi;
        loff_t endbyte; /* inclusive */
        pgoff_t start_index;
        pgoff_t end_index;
        unsigned long nrpages;
        int ret = 0;

        if (!file)
            return -EBADF;

        if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
            ret = -ESPIPE;
            goto out;
        }

        mapping = file->f_mapping;
        if (!mapping || len < 0) {
            ret = -EINVAL;
            goto out;
        }

        if (mapping->a_ops->get_xip_page)
            /* no bad return value, but ignore advice */
            goto out;
        ...
    out:
        fput(file);
        return ret;
    }

    I found the advice parameter is just ignored in the case
    mapping->a_ops->get_xip_page is given. This behavior is different from
    what is written on the man page. Is this o.k.?

    get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
    However, I cannot find an easy way to detect from user space whether
    the get_xip_page field is given or CONFIG_EXT2_FS_XIP is true.

    I propose the following patch which checks the advice parameter
    even if get_xip_page is given.
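
A userspace check along the lines of the test case described above could be sketched like this. advice_is_rejected() is a hypothetical helper, and the bogus value 0x7fff is an assumption chosen so that it does not match any POSIX_FADV_* constant. Through the C library wrapper, posix_fadvise() returns the error number directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the kernel rejects the given advice
 * value with EINVAL for this file descriptor, 0 otherwise.
 */
static int advice_is_rejected(int fd, int advice)
{
    /* offset 0 and len 0 cover the whole file */
    return posix_fadvise(fd, 0, 0, advice) == EINVAL;
}
```

On a kernel that validates the advice argument on every path, the helper is expected to report 1 for the bogus value and 0 for a valid one such as POSIX_FADV_NORMAL.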

    Signed-off-by: Masatake YAMATO
    Acked-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
     

09 Dec, 2006

1 commit


06 Aug, 2006

1 commit

  • The POSIX_FADV_NOREUSE hint means "the application will use this range of the
    file a single time". It seems to be intended that the implementation will use
    this hint to perform drop-behind of that part of the file when the application
    gets around to reading or writing it.

    However for reasons which aren't obvious (or sane?) I mapped
    POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.

    That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.

    This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
    to perform readahead, they lose. The likelihood is low.

    If/when we later implement POSIX_FADV_NOREUSE things will get interesting - to
    do it fully we'll need to maintain file offset/length ranges and perform all
    sorts of complex tricks, and managing the lifetime of those ranges' data
    structures will be interesting.

    A sensible implementation would probably ignore the file range and would
    simply mark the entire file as needing some form of drop-behind treatment.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Jul, 2006

1 commit


01 Apr, 2006

1 commit

  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is that we can
    move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: it's notable that we can sync an fd which wasn't opened for writing.
    (Same with fsync() and fdatasync().)

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...
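
As a rough sketch of the "single syscall" point, the two wrappers below use the sync_file_range() interface as exposed by glibc on Linux. The wrapper names are hypothetical. The first only starts writeback; the second combines waiting, writing and waiting again in one call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Start writeback of dirty pages in the range without waiting for it. */
static int range_write_start(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Wait for writeback already in flight, write out what is dirty, then
 * wait for that writeback to finish. File metadata is not written.
 */
static int range_write_and_wait(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Both wrappers return 0 on success and -1 with errno set on failure, following the usual syscall convention.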

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Mar, 2006

1 commit

  • Add two new linux-specific fadvise() extensions:

    LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
    offsets `offset' and `offset+len'. Any pages which are currently under
    writeout are skipped, whether or not they are dirty.

    LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
    offsets `offset' and `offset+len'.

    By combining these two operations the application may do several things:

    LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
    pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
    of the currently dirty pages at the disk, wait until they have been written.

    It should be noted that none of these operations write out the file's
    metadata. So unless the application is strictly performing overwrites of
    already-instantiated disk blocks, there are no guarantees here that the data
    will be available after a crash.

    To complete this suite of operations I guess we should have a "sync file
    metadata only" operation. This gives applications access to all the building
    blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
    well with the fadvise() interface. Probably it should be a new syscall:
    sys_fmetadatasync().

    The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
    It is made to represent the last affected byte in the file (ie: it is
    inclusive). Generally, all these byterange and pagerange functions are
    inclusive so we can easily represent EOF with -1.

    As Ulrich notes, these two functions are somewhat abusive of the fadvise()
    concept, which appears to be "set the future policy for this fd".

    But these commands are a perfect fit with the fadvise() implementation, and
    several of the existing fadvise() commands are synchronous and don't affect
    future policy either. I think we can live with the slight incongruity.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Jan, 2006

1 commit


24 Jun, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds