22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need the dentry. Give it a dentry as
    an argument instead of an inode. Also rename inode_change_ok() to
    setattr_prepare() to better reflect that it also makes some modifications
    in addition to performing checks.
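
    For context, a minimal sketch of what a simple filesystem's ->setattr
    looks like once this lands (foofs_setattr is hypothetical; setattr_copy()
    and mark_inode_dirty() are the usual helpers):

    static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
    {
            struct inode *inode = d_inode(dentry);
            int error;

            /* was: error = inode_change_ok(inode, attr); */
            error = setattr_prepare(dentry, attr);
            if (error)
                    return error;

            setattr_copy(inode, attr);
            mark_inode_dirty(inode);
            return 0;
    }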

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

21 May, 2016

1 commit

  • The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
    return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
    Returning addr in the non-MAP_SHARED case was completely wrong, and only
    happened to work because addr was 0. However, it prevented VM_MAYSHARE
    mappings from sharing backing with the fs cache, and forced such
    mappings (including shareable program text) to be copied whenever the
    number of mappings transitioned from 0 to 1, impacting performance and
    memory usage. Subsequent mappings beyond the first still correctly
    shared memory with the first.

    Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level;
    do_mmap already handles the semantic differences between them.
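
    A rough sketch of the kind of check this implies in a nommu filesystem's
    ->mmap (hypothetical code, not the literal diff): test VM_MAYSHARE as
    well as VM_SHARED and leave the private-vs-shared semantics to do_mmap.

    static int foofs_nommu_mmap(struct file *file, struct vm_area_struct *vma)
    {
            /* accept anything that may share backing, not only MAP_SHARED */
            if (!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)))
                    return -ENOSYS;

            file_accessed(file);
            return 0;
    }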

    Signed-off-by: Rich Felker
    Cc: Michal Hocko
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rich Felker
     

17 Oct, 2015

1 commit

  • Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache
    allocation paths") has caught some users of hardcoded GFP_KERNEL used in
    the page cache allocation paths. This, however, wasn't complete and
    there were others which went unnoticed.

    Dave Chinner has reported the following deadlock for xfs on loop device:
    : With the recent merge of the loop device changes, I'm now seeing
    : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
    :
    : The deadlock is as follows:
    :
    : kloopd1: loop_queue_read_work
    :       xfs_file_iter_read
    :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
    :       page cache read (GFP_KERNEL)
    :       radix tree alloc
    :       memory reclaim
    :       reclaim XFS inodes
    :       log force to unpin inodes
    :
    : xfs-cil/loop1:
    :       xlog_cil_push
    :       xlog_write
    :       xlog_state_get_iclog_space()
    :
    : kloopd1: loop_queue_write_work
    :       xfs_file_write_iter
    :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
    :
    : i.e. kloopd, with its split read and write work queues, has
    : introduced a dependency through memory reclaim: writes need to be
    : able to progress for reads to make progress.
    :
    : The problem, fundamentally, is that mpage_readpages() does a
    : GFP_KERNEL allocation, rather than paying attention to the inode's
    : mapping gfp mask, which is set to GFP_NOFS.
    :
    : This didn't use to happen, because the loop device used to issue
    : reads through the splice path and that does:
    :
    :       error = add_to_page_cache_lru(page, mapping, index,
    :                       GFP_KERNEL & mapping_gfp_mask(mapping));

    This has changed by commit aa4d86163e4 ("block: loop: switch to VFS
    ITER_BVEC").

    This patch changes mpage_readpage{s} to follow the gfp mask set for the
    mapping. There are, however, other places which are doing basically the
    same.

    lustre:ll_dir_filler is doing GFP_KERNEL from a function which apparently
    uses GFP_NOFS for its other allocations, so let's make this consistent.

    cifs:readpages_get_pages is called from cifs_readpages and
    __cifs_readpages_from_fscache called from the same path obeys mapping
    gfp.

    ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well, even
    though it uses mapping_gfp_mask for the page allocation.

    ext4_mpage_readpages is called from the page cache allocation path, the
    same as read_pages and read_cache_pages.

    As I've noted in my previous post, I cannot say I would be happy about
    sprinkling mapping_gfp_mask all over the place, and it sounds like we
    should drop the gfp_mask argument altogether and use it internally in
    __add_to_page_cache_locked. That would require all the filesystems to use
    the mapping gfp consistently, which I am not sure is the case here. From
    a quick glance it seems that some filesystems use it all the time while
    others are selective.
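
    The common shape of the fix, as a sketch rather than the literal diff:
    page cache read paths should allocate with the mapping's gfp mask
    (GFP_NOFS for the XFS-backed loop image above) instead of a hardcoded
    GFP_KERNEL.

    gfp_t gfp = mapping_gfp_mask(mapping);

    page = __page_cache_alloc(gfp);
    if (page)
            error = add_to_page_cache_lru(page, mapping, index, gfp);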

    Signed-off-by: Michal Hocko
    Reported-by: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Ming Lei
    Cc: Andreas Dilger
    Cc: Oleg Drokin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

16 Apr, 2015

1 commit


12 Apr, 2015

1 commit

  • All places outside of core VFS that checked ->read and ->write for being NULL or
    called the methods directly are gone now, so NULL {read,write} with non-NULL
    {read,write}_iter will do the right thing in all cases.
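
    In other words, a file_operations table that only provides the iter
    methods and leaves ->read/->write NULL is now sufficient everywhere;
    a hypothetical example using the standard generic helpers:

    const struct file_operations foofs_file_operations = {
            .llseek         = generic_file_llseek,
            .read_iter      = generic_file_read_iter,
            .write_iter     = generic_file_write_iter,
            .mmap           = generic_file_mmap,
            .fsync          = noop_fsync,
            .splice_read    = generic_file_splice_read,
    };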

    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2015

1 commit

  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to it's original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows to remove lots of backing_dev_info instance that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.
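
    A rough sketch of the new hook, with hypothetical names and flags
    modelled on what a nommu ramfs-like filesystem would report; it is wired
    up through a new .mmap_capabilities member in struct file_operations:

    static unsigned foofs_mmap_capabilities(struct file *file)
    {
            /* direct mapping is possible, plus copy/read/write/exec */
            return NOMMU_MAP_DIRECT | NOMMU_MAP_COPY | NOMMU_MAP_READ |
                   NOMMU_MAP_WRITE | NOMMU_MAP_EXEC;
    }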

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Aug, 2014

1 commit


12 Jun, 2014

1 commit

  • iter_file_splice_write() - a ->splice_write() instance that gathers the
    pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
    it to ->write_iter(). A bunch of simple cases converted to that...
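
    Converting a filesystem is then typically a one-line change in its
    file_operations (sketch):

    /* was: .splice_write = generic_file_splice_write, */
    .splice_write   = iter_file_splice_write,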

    [AV: fixed the braino spotted by Cyrill]

    Signed-off-by: Al Viro

    Al Viro
     

07 May, 2014

2 commits


24 Jan, 2014

2 commits


23 Feb, 2013

1 commit


12 Jul, 2012

1 commit

  • There is a bug in the below scenario for !CONFIG_MMU:

    1. create a new file
    2. mmap the file and write to it
    3. read the file - the correct value can't be read back

    Because

    sys_read() -> generic_file_aio_read() -> simple_readpage() -> clear_page()

    which causes the page to be zeroed.

    Add SetPageUptodate() to ramfs_nommu_expand_for_mapping() so that
    generic_file_aio_read() does not call simple_readpage().
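
    A hypothetical userspace reproducer for the scenario above (assumes a
    !CONFIG_MMU kernel and a ramfs-backed /tmp):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[8] = "";
            int fd = open("/tmp/x", O_RDWR | O_CREAT | O_TRUNC, 0600);

            ftruncate(fd, 4096);            /* allocate contiguous backing */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            memcpy(p, "test", 5);           /* write through the mapping */

            read(fd, buf, 5);               /* read back through read() */
            printf("read back: \"%s\"\n", buf);     /* "" before the fix */
            return 0;
    }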

    Signed-off-by: Bob Liu
    Cc: Hugh Dickins
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Greg Ungerer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     

15 Apr, 2011

1 commit

  • On a no-mmu arch, there is a memory leak during the shmem test. The
    cause of this memleak is that ramfs_nommu_expand_for_mapping() raises the
    page refcount to 2, which prevents iput() from freeing those pages.

    The simple test file is like this:

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
            int i;
            key_t k = ftok("/etc", 42);

            for (i = 0; i < 100; ++i) {
                    int id = shmget(k, 4096, IPC_CREAT | 0666);

                    if (id == -1)
                            printf("shmget error\n");
                    else if (shmctl(id, IPC_RMID, NULL) == -1)
                            printf("shmctl error\n");
            }
            return 0;
    }

    Before this patch, the test result is (used memory keeps growing):

    root:/> free
                 total       used       free     shared    buffers
    Mem:         60320      17912      42408          0          0
    -/+ buffers:             17912      42408
    root:/> shmem
    run ok...
    root:/> free
                 total       used       free     shared    buffers
    Mem:         60320      19096      41224          0          0
    -/+ buffers:             19096      41224
    root:/> shmem
    run ok...
    root:/> free
                 total       used       free     shared    buffers
    Mem:         60320      20296      40024          0          0
    -/+ buffers:             20296      40024
    ...

    After this patch, the test result is (no memleak anymore):

    root:/> free
                 total       used       free     shared    buffers
    Mem:         60320      16668      43652          0          0
    -/+ buffers:             16668      43652
    root:/> shmem
    run ok...
    root:/> free
                 total       used       free     shared    buffers
    Mem:         60320      16668      43652          0          0
    -/+ buffers:             16668      43652

    Signed-off-by: Bob Liu
    Acked-by: Hugh Dickins
    Signed-off-by: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     

10 Aug, 2010

2 commits

  • Make sure we check the truncate constraints early on in ->setattr by adding
    those checks to inode_change_ok. Also clean up and document inode_change_ok
    to make this obvious.

    As a fallout we don't have to call inode_newsize_ok from simple_setsize and
    simplify it down to a truncate_setsize which doesn't return an error. This
    simplifies a lot of setattr implementations and means we use truncate_setsize
    almost everywhere. Get rid of fat_setsize now that it's trivial and mark
    ext2_setsize static to make the calling convention obvious.

    Keep the inode_newsize_ok in vmtruncate for now as all callers need an
    audit for its removal anyway.

    Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
    needs a deeper audit, but that is left for later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite its name it's not a generic implementation of ->setattr, but
    rather a helper to copy attributes from a struct iattr to the inode.
    Rename it to setattr_copy to reflect this fact.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

28 May, 2010

2 commits

  • Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate
    sequence.

    Cc: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems is currently called
    simple_sync_file, which doesn't make too much sense to start with, and
    the generic one for simple filesystems is called simple_fsync, which can
    lead to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync to make it obvious
    what to expect. In addition add some documentation for both methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script was
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.
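
    The per-file change is of this shape (illustrative): a .c file that calls
    kmalloc() but previously got slab.h only implicitly via percpu.h now
    includes it explicitly.

    #include <linux/percpu.h>       /* will no longer pull in slab.h */
    #include <linux/slab.h>         /* added: this file uses kmalloc/kfree */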

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

17 Jan, 2010

2 commits

  • Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
    over the end of a truncation. The problem is that
    ramfs_nommu_check_mappings() checks the reduced file size against the
    VMA tree, but not the vm_region tree.

    The following sequence of events can cause the problem:

    fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
    ftruncate(fd, 32 * 1024);
    a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    munmap(a, 32 * 1024);
    ftruncate(fd, 16 * 1024);
    c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b'
    sees that the vm_region from 'a' is covering the region it wants and so
    shares it, pinning it in memory.

    Mapping 'a' then goes away and the file is truncated to the end of VMA
    'b'. However, the region allocated by 'a' is still in effect, and has
    _not_ been reduced.

    Mapping 'c' is then created, and because there's a vm_region covering the
    desired region, get_unmapped_area() is _not_ called to repeat the check,
    and the mapping is granted, even though the pages from the latter half of
    the mapping have been discarded.

    However:

    d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    Mapping 'd' should work, and should end up sharing the region allocated by
    'a'.

    To deal with this, we shrink the vm_region struct during the truncation,
    lest do_mmap_pgoff() take it as licence to share the full region
    automatically without calling the get_unmapped_area() file op again.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix the race between the truncation of a ramfs file and an attempt to make
    a shared mmap of region of that file.

    The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to
    verify that the file region is made of contiguous pages and to find its
    base address - but there isn't any locking to guarantee this region until
    vma_prio_tree_insert() is called by add_vma_to_mm().

    Note that moving the functionality into f_op->mmap() doesn't help as that
    is also called before vma_prio_tree_insert().

    Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it
    does its checks. This means that this function will wait whilst mmaps
    take place.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

18 Dec, 2009

1 commit


24 Sep, 2009

1 commit

  • Update some fs code to make use of new helper functions introduced
    in the previous patch. Should be no significant change in behaviour
    (except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

    Reviewed-by: Christoph Hellwig
    Acked-by: Miklos Szeredi
    Cc: linux-nfs@vger.kernel.org
    Cc: Trond.Myklebust@netapp.com
    Cc: linux-cifs-client@lists.samba.org
    Cc: sfrench@samba.org
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

30 Jul, 2009

1 commit

  • This file makes use of various macros defined in files like asm/current.h
    or asm-generic/resource.h. All these files can be included via sched.h.
    The building of the !MMU ARM kernel (with additional patches) fails
    without this change.

    Signed-off-by: Catalin Marinas
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

01 Apr, 2009

1 commit

  • Instead of open-coding the lru-list-add pagevec batching when expanding a
    file mapping from zero, defer to the appropriate page cache function that
    also takes care of adding the page to the lru list.

    This is cleaner, saves code and reduces the stack footprint by 16 words
    worth of pagevec.
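
    Sketch of the resulting call, illustrating the pattern rather than the
    literal diff: a single helper both inserts the page into the page cache
    and puts it on the LRU.

    ret = add_to_page_cache_lru(page, inode->i_mapping, index, GFP_KERNEL);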

    Signed-off-by: Johannes Weiner
    Acked-by: David Howells
    Cc: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: MinChan Kim
    Cc: Lee Schermerhorn
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Mar, 2009

1 commit


15 Mar, 2009

2 commits

  • When a ramfs nommu mapping is expanded, contiguous pages are allocated
    and added to the pagecache. The caller's reference is then passed on
    by moving whole pagevecs to the file lru list.

    If adding to the page cache fails, make sure that the error path also
    moves the pagevec contents, which might still contain up to PAGEVEC_SIZE
    successfully added pages whose references we would otherwise leak.

    Signed-off-by: Johannes Weiner
    Cc: David Howells
    Cc: Enrik Berkhan
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The pages attached to a ramfs inode's pagecache by truncation from nothing
    - as done by SYSV SHM for example - may get discarded under memory
    pressure.

    The problem is that the pages are not marked dirty. Anything that creates
    data in an MMU-based ramfs causes the set_page_dirty() aop to be called
    on the pages holding that data.

    For the NOMMU-based mmap, set_page_dirty() may be called by write(), but
    it won't be called by page-writing faults on writable mmaps, and it isn't
    called by ramfs_nommu_expand_for_mapping() when a file is being truncated
    from nothing to allocate a contiguous run.

    The solution is to mark the pages dirty at the point of allocation by the
    truncation code.
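
    Sketch of the fix at the allocation site (illustrative): mark each page
    dirty right after it is added to the page cache so reclaim cannot
    silently drop the data.

    /* for each page added by the nommu expand-for-mapping path */
    SetPageDirty(page);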

    Signed-off-by: Enrik Berkhan
    Signed-off-by: David Howells
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Enrik Berkhan
     

08 Jan, 2009

1 commit

  • Fix cleanup handling in ramfs_nommu_get_unmapped_area() by only freeing
    the number of pages that find_get_pages() said it had returned (nr) rather
    than attempting to free the number of pages we asked for (lpages) - thus
    avoiding the situation whereby put_page() may be handed NULL pointers if
    find_get_pages() returned fewer pages than were requested.

    Also avoid a warning about nr being uninitialised and the need for an
    if-statement in the cleanup path by using appropriate gotos.

    Signed-off-by: David Howells

    David Howells
     

20 Oct, 2008

1 commit

  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

03 Oct, 2008

1 commit

  • The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm:
    tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock
    ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU
    architectures because it was using the expanding truncate to signal ramfs
    to allocate a physically contiguous RAM backing the inode (otherwise it is
    unusable for "memory mapping" it to userspace).

    However do_truncate is what caused the lock ordering error, due to it
    taking i_mutex. In this case, we can actually just call ramfs directly to
    allocate memory for the mapping, rather than go via truncate.
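
    A sketch of the idea (error handling elided; the call site is where
    tiny-shmem sets up the backing file):

    #ifndef CONFIG_MMU
            /* allocate contiguous pages for the inode without taking i_mutex */
            error = ramfs_nommu_expand_for_mapping(inode, size);
    #endif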

    Acked-by: David Howells
    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

04 Jul, 2008

1 commit


17 Oct, 2007

1 commit


01 Aug, 2007

1 commit

  • Fix the SYSV IPC SHM to work with the changes applied by the new fault handler
    patches when CONFIG_MMU=n.

    Signed-off-by: David Howells
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

10 Jul, 2007

1 commit


08 Jun, 2007

1 commit

  • This bug was caught by LTP testcase fchmod06 on Blackfin platform.

    In the manpage of fchmod, "EPERM: The effective UID does not match the
    owner of the file, and the process is not privileged (Linux: it does not
    have the CAP_FOWNER capability)."

    But the ramfs nommu code missed the inode_change_ok POSIX UID/GID
    verification. This patch fixes that.
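
    The expected semantics, as a hypothetical userspace check in the spirit
    of LTP fchmod06: an unprivileged process that does not own the file must
    get EPERM back from fchmod().

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            /* assumes /tmp/other is owned by a different user */
            int fd = open("/tmp/other", O_RDONLY);

            if (fchmod(fd, 0777) == -1 && errno == EPERM)
                    printf("ok: fchmod rejected with EPERM\n");
            else
                    printf("bug: missing owner check\n");
            close(fd);
            return 0;
    }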

    Signed-off-by: Bryan Wu
    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Bryan Wu
     

31 May, 2007

1 commit


09 May, 2007

1 commit


13 Feb, 2007

1 commit

  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes to these shared
    resources at compile time.
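
    The shape of the change, with hypothetical names: adding the qualifier is
    enough to move the table into .rodata.

    static const struct inode_operations foofs_file_inode_operations = {
            .setattr        = foofs_setattr,
            .getattr        = foofs_getattr,
    };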

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven