06 Jan, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    inotify: fix type errors in interfaces
    fix breakage in reiserfs_new_inode()
    fix the treatment of jfs special inodes
    vfs: remove duplicate code in get_fs_type()
    add a vfs_fsync helper
    sys_execve and sys_uselib do not call into fsnotify
    zero i_uid/i_gid on inode allocation
    inode->i_op is never NULL
    ntfs: don't NULL i_op
    isofs check for NULL ->i_op in root directory is dead code
    affs: do not zero ->i_op
    kill suid bit only for regular files
    vfs: lseek(fd, 0, SEEK_CUR) race condition

    Linus Torvalds
     
  • We don't have to do it because it is useless for non regular files.
    In fact block device may trigger this path without dentry->d_inode->i_mutex.

    (akpm: concerns were expressed (by me) about S_ISDIR inodes)

    Signed-off-by: Dmitri Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dmitri Monakhov
     

05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in ints write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

31 Oct, 2008

1 commit

  • Nothing uses prepare_write or commit_write. Remove them from the tree
    completely.

    [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
    Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Oct, 2008

3 commits

  • This patch tries to make page->mapping to be NULL before
    mem_cgroup_uncharge_cache_page() is called.

    "page->mapping == NULL" is a good check for "whether the page is still
    radix-tree or not". This patch also adds BUG_ON() to
    mem_cgroup_uncharge_cache_page();

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • trylock_page, unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Oct, 2008

2 commits

  • * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
    softirq, warning fix: correct a format to avoid a warning
    softirqs, debug: preemption check
    x86, pci-hotplug, calgary / rio: fix EBDA ioremap()
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes
    softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description
    dmi scan: warn about too early calls to dmi_check_system()
    generic: redefine resource_size_t as phys_addr_t
    generic: make PFN_PHYS explicitly return phys_addr_t
    generic: add phys_addr_t for holding physical addresses
    softirq: allocate less vectors
    IO resources: fix/remove printk
    printk: robustify printk, update comment
    printk: robustify printk, fix #2
    printk: robustify printk, fix
    printk: robustify printk

    Fixed up conflicts in:
    arch/powerpc/include/asm/types.h
    arch/powerpc/platforms/Kconfig.cputype
    manually.

    Linus Torvalds
     
  • The 'filp' argument to do_generic_file_read() is never NULL.

    Signed-off-by: Krishna Kumar
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krishna Kumar
     

14 Oct, 2008

1 commit

  • If lock_page_killable() fails because the task was killed by SIGKILL or
    any other fatal signal, do_generic_file_read() returns -EIO.

    This seems to be OK, because in fact the userspace won't see this error,
    the task will dequeue SIGKILL and exit.

    However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
    return to the user-space with the bogus -EIO.

    Change the code to return the error code from lock_page_killable(), -EINTR.
    This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
    change the code looks a bit more logical, and the "good" init should handle
    the spurious EINTR or short read.

    Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
    but this can't prevent the short reads.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

03 Sep, 2008

1 commit

  • Dio write returns EIO when try_to_release_page fails because bh is
    still referenced.

    The patch

    commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
    Author: Mingming Cao
    Date: Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction

    was merged into 2.6.27-rc1, but I noticed that this patch is not enough
    to fix the race.

    I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
    sometimes got EIO through this test.

    The patch above fixed race between freeing buffer(dio) and committing
    transaction(jbd) but I discovered that there is another race, freeing
    buffer(dio) and ext3/4_ordered_writepage.

    : background_writeout()
    ->write_cache_pages()
    ->ext3_ordered_writepage()
    walk_page_buffers() -> take a bh ref
    block_write_full_page() -> unlock_page
    : try_to_release_page fails)
    walk_page_buffers() ->release a bh ref

    ext3_ordered_writepage holds bh ref and does unlock_page remaining
    taking a bh ref, so this causes the race and failure of
    try_to_release_page.

    To fix this race, I used the approach of falling back to buffered
    writes if try_to_release_page() fails on a page.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Hisashi Hifumi
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Mingming Cao
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

31 Jul, 2008

1 commit

  • The iov_iter_advance() function would look at the iov->iov_len entry
    even though it might have iterated over the whole array, and iov was
    pointing past the end. This would cause DEBUG_PAGEALLOC to trigger a
    kernel page fault if the allocation was at the end of a page, and the
    next page was unallocated.

    The quick fix is to just change the order of the tests: check that there
    is any iovec data left before we check the iov entry itself.

    Thanks to Alexey Dobriyan for finding this case, and testing the fix.

    Reported-and-tested-by: Alexey Dobriyan
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Jul, 2008

1 commit

  • When we read some part of a file through pagecache, if there is a
    pagecache of corresponding index but this page is not uptodate, read IO
    is issued and this page will be uptodate.

    I think this is good for pagesize == blocksize environment but there is
    room for improvement on pagesize != blocksize environment. Because in
    this case a page can have multiple buffers and even if a page is not
    uptodate, some buffers can be uptodate.

    So I suggest that when all buffers which correspond to a part of a file
    that we want to read are uptodate, use this pagecache and copy data from
    this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    I wrote a benchmark program and got result number with this program.

    This benchmark do:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close a file and umount.

    4: mount and again open a test file.

    5: pwrite randomly 300000 times on a test file. offset is aligned
    by IO size(1024bytes).

    6: measure time of preading randomly 100000 times on a test file.

    The result was:
    2.6.26
    330 sec

    2.6.26-patched
    226 sec

    Arch:i386
    Filesystem:ext3
    Blocksize:1024 bytes
    Memory: 1GB

    On ext3/4, a file is written through buffer/block. So random read/write
    mixed workloads or random read after random write workloads are optimized
    with this patch under pagesize != blocksize environment. This test result
    showed this.

    The benchmark program is as follows:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define LEN 1024
    #define LOOP 1024*512 /* 512MB */

    main(void)
    {
    unsigned long i, offset, filesize;
    int fd;
    char buf[LEN];
    time_t t1, t2;

    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    memset(buf, 0, LEN);
    fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }
    for (i = 0; i < LOOP; i++)
    write(fd, buf, LEN);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    fd = open("/root/test1/testfile", O_RDWR);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }

    filesize = LEN * LOOP;
    for (i = 0; i < 300000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pwrite(fd, buf, LEN, offset);
    }
    printf("start test\n");
    time(&t1);
    for (i = 0; i < 100000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pread(fd, buf, LEN, offset);
    }
    time(&t2);
    printf("%ld sec\n", t2-t1);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

27 Jul, 2008

4 commits

  • All calls to remove_suid() are made with a file pointer, because
    (similarly to file_update_time) it is called when the file is written.

    Clean up callers by passing in a file instead of a dentry.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • mapping->tree_lock has no read lockers. convert the lock from an rwlock
    to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Combine page_cache_get_speculative with lockless radix tree lookups to
    introduce lockless page cache lookups (ie. no mapping->tree_lock on the
    read-side).

    The only atomicity changes this introduces is that the gang pagecache
    lookup functions now behave as if they are implemented with multiple
    find_get_page calls, rather than operating on a snapshot of the pages. In
    practice, this atomicity guarantee is not used anyway, and it is to
    replace individual lookups, so these semantics are natural.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

2 commits

  • memcg: performance improvements

    Patch Description
    1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely ppatch
    5/5 ... remove redundunt check patch (shmem handling is fixed.)

    Unix bench result.

    == 2.6.26-rc2-mm1 + memory resource controller
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 2915.4 678.0
    File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
    File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
    File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
    Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
    =========
    FINAL SCORE 991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 3012.9 700.7
    File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
    File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
    File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
    Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
    =========
    FINAL SCORE 1004.6

    This patch:

    Remove refcnt from page_cgroup().

    After this,

    * A page is charged only when !page_mapped() && no page_cgroup is assigned.
    * Anon page is newly mapped.
    * File page is added to mapping->tree.

    * A page is uncharged only when
    * Anon page is fully unmapped.
    * File page is removed from LRU.

    There is no change in behavior from user's view.

    This patch also removes unnecessary calls in rmap.c which was used only for
    refcnt mangement.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • journal_try_to_free_buffers() could race with jbd commit transaction when
    the later is holding the buffer reference while waiting for the data
    buffer to flush to disk. If the caller of journal_try_to_free_buffers()
    request tries hard to release the buffers, it will treat the failure as
    error and return back to the caller. We have seen the directo IO failed
    due to this race. Some of the caller of releasepage() also expecting the
    buffer to be dropped when passed with GFP_KERNEL mask to the
    releasepage()->journal_try_to_free_buffers().

    With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
    indicating this call could wait, in case of try_to_free_buffers() failed,
    let's waiting for journal_commit_transaction() to finish commit the
    current committing transaction, then try to free those buffers again.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mingming Cao
    Reviewed-by: Badari Pulavarty
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     

25 Jul, 2008

2 commits

  • As akpm points out, there's really no need for generic_file_aio_read to
    make a special case of count 0: just loop through nr_segs doing nothing.
    And as Harvey Harrison points out, there's no need to reset retval to 0
    where it's already 0.

    Setting count (or ocount) to 0 before calling generic_segment_checks is
    unnecessary too; but reluctantly I'll leave that removal to someone with a
    wider range of gcc versions to hand - 4.1.2 and 4.2.1 don't warn about it,
    but perhaps others do - I forget which are the warniest versions.

    Signed-off-by: Hugh Dickins
    Tested-by: Lawrence Greenfield
    Cc: Christoph Rohland
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • generic_file_direct_IO is a common helper around the invocation of
    ->direct_IO. But there's almost nothing shared between the read and write
    side, so we're better off without this helper.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

12 Jul, 2008

1 commit


15 May, 2008

1 commit

  • filemap_fault will go into an infinite loop if ->readpage() fails
    asynchronously.

    AFAICS the bug was introduced by this commit, which removed the wait after the
    final readpage:

    commit d00806b183152af6d24f46f0c33f14162ca1262a
    Author: Nick Piggin
    Date: Thu Jul 19 01:46:57 2007 -0700

    mm: fix fault vs invalidate race for linear mappings

    Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
    the page is up-to-date before jumping back to the beginning of the function.

    I've noticed this while testing nfs exporting on fuse. The patch
    fixes it.

    Signed-off-by: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

07 May, 2008

1 commit

  • generic_file_splice_write() duplicates remove_suid() just because it
    doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
    anyway, so this is rather pointless.

    Move locking to generic_file_splice_write() and call remove_suid() and
    __splice_from_pipe() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

28 Apr, 2008

1 commit

  • Clean up messy conditional calling of test_clear_page_writeback() from both
    rotate_reclaimable_page() and end_page_writeback().

    The only user of rotate_reclaimable_page() is end_page_writeback() so this is
    OK.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

11 Mar, 2008

1 commit

  • iov_iter_advance() skips over zero-length iovecs, however it does not properly
    terminate at the end of the iovec array. Fix this by checking against
    i->count before we skip a zero-length iov.

    The bug was reproduced with a test program that continually randomly creates
    iovs to writev. The fix was also verified with the same program and also it
    could verify that the correct data was contained in the file after each
    writev.

    Signed-off-by: Nick Piggin
    Tested-by: "Kevin Coffman"
    Cc: "Alexey Dobriyan"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Mar, 2008

1 commit


14 Feb, 2008

1 commit


09 Feb, 2008

2 commits

  • do_generic_mapping_read was used by gfs2 for internals reads, but this use
    of the interface was rather suboptimal (as was the whole interface) and has
    been replaced by an internal helper now. This patch kills
    do_generic_mapping_read and surrounding damage in preparation of additional
    cleanups for the buffered read path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Convert variables containing page indexes to pgoff_t.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Feb, 2008

5 commits

  • Need to strip __GFP_HIGHMEM flag while passing to mem_container_cache_charge().

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • Move mem_controller_cache_charge() above radix_tree_preload().
    radix_tree_preload() disables preemption, even though the gfp_mask passed
    contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, thus we
    hit a BUG_ON() in kmem_cache_alloc().

    This patch moves mem_controller_cache_charge() to above radix_tree_preload()
    for cache charging.

    Signed-off-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over it's limit, in which case a suitable error is returned.

    This patch was tested on a Powerpc box. I am still looking at being able
    to test the path, through which allocations happen in non GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Choose if we want cached pages to be accounted or not. By default both are
    accounted for. A new set of tunables are added.

    echo -n 1 > mem_control_type

    switches the accounting to account for only mapped pages

    echo -n 3 > mem_control_type

    switches the behaviour back

    [bunk@kernel.org: mm/memcontrol.c: clenups]
    [akpm@linux-foundation.org: fix sparc32 build]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer, this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

2 commits

  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

03 Feb, 2008

1 commit

  • Frederik Himpe reported an unkillable and un-straceable pan process.

    Zero length iovecs can go into an infinite loop in writev, because the
    iovec iterator does not always advance over them.

    The sequence required to trigger this is not trivial. I think it
    requires that a zero-length iovec be followed by a non-zero-length iovec
    which causes a pagefault in the atomic usercopy. This causes the writev
    code to drop back into single-segment copy mode, which then tries to
    copy the 0 bytes of the zero-length iovec; a zero length copy looks like
    a failure though, so it loops.

    Put a test into iov_iter_advance to catch zero-length iovecs. We could
    just put the test in the fallback path, but I feel it is more robust to
    skip over zero-length iovecs throughout the code (iovec iterator may be
    used in filesystems too, so it should be robust).

    Signed-off-by: Nick Piggin
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Nick Piggin