05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit

  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, add
    inode_foo(inode) helpers equivalent to mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2016

3 commits

  • Page faults can race with fallocate hole punch. If a page fault happens
    between the unmap and remove operations, the page is not removed and
    remains within the hole. This is not the desired behavior. The race is
    difficult to detect in user level code as even in the non-race case, a
    page within the hole could be faulted back in before fallocate returns.
    If userfaultfd is expanded to support hugetlbfs in the future, this race
    will be easier to observe.

    If this race is detected and a page is mapped, the remove operation
    (remove_inode_hugepages) will unmap the page before removing. The unmap
    within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
    so that no other faults will be processed until the page is removed.

    The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
    remove_inode_hugepages to satisfy the new reference.

    [akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
    argument end is of type pgoff_t. It was being converted to a vaddr
    offset and passed to unmap_hugepage_range. However, end was also being
    used as an argument to the vma_interval_tree_foreach controlling loop.
    In addition, the conversion of end to vaddr offset was incorrect.

    hugetlb_vmtruncate_list is called as part of a file truncate or
    fallocate hole punch operation.

    When truncating a hugetlbfs file, this bug could prevent some pages from
    being unmapped. This is possible if there are multiple vmas mapping the
    file, and there is a sufficiently sized hole between the mappings. The
    size of the hole between two vmas (A,B) must be such that the starting
    virtual address of B is greater than (ending virtual address of A <<
    PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
    pages are not properly unmapped during truncate, the following BUG is
    hit:

    kernel BUG at fs/hugetlbfs/inode.c:428!

    In the fallocate hole punch case, this bug could prevent pages from
    being unmapped as in the truncate case. However, for hole punch the
    result is that unmapped pages will not be removed during the operation.
    For hole punch, it is also possible that more pages than desired will be
    unmapped. This unnecessary unmapping will cause page faults to
    reestablish the mappings on subsequent page access.

    Fixes: 1bfad99ab ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")
    Reported-by: Hillf Danton
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Cc: [4.3]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dmitry Vyukov has reported[1] possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(&hugetlbfs_i_mmap_rwsem_key);
                                            lock(&mapping->i_mmap_rwsem);
                                            lock(&hugetlbfs_i_mmap_rwsem_key);
    lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as the source of the problem.
    It doesn't take care about the ordering of hugetlbfs_i_mmap_rwsem_key
    (aka mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
    and the allocator can take i_mmap_rwsem if it hits reclaim. So we need
    to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem
    from the rest of the VMAs.

    The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

3 commits

  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
    bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that when
    reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits, just
    by happenstance of the nature of how I detected these non-modular use
    cases. But that can possibly introduce regressions if the patch merge
    ordering puts the fs part 1st -- as the 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • When running the SPECint_rate gcc on some very large boxes it was
    noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue, since I found its setup
    to be easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks, getting worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - thread_info
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
      most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach
    and keep most workloads within bounds. Malevolent users will be able
    to breach the limit, but this was possible even with the former
    "account everything" approach (simply because it did not, in fact,
    account everything).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go - allowing an
    arbitrary number of kmaps to be held for long is a great way to
    deadlock the system.

    A new helper (inode_nohighmem(inode)) needs to be used for pagecache
    symlink inodes; done for all in-tree cases. page_follow_link_light()
    is instrumented to yell about anything missed.

    Signed-off-by: Al Viro

    Al Viro
     

21 Nov, 2015

1 commit

  • Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
    punch code. These problems are in the routine remove_inode_hugepages and
    mostly occur in the case where there are holes in the range of pages to be
    removed. These holes could be the result of a previous hole punch or
    simply sparse allocation. The current code could access pages outside the
    specified range.

    remove_inode_hugepages handles both hole punch and truncate operations.
    Page index handling was fixed/cleaned up so that the loop index always
    matches the page being processed. The code now only makes a single pass
    through the range of pages as it was determined page faults could not race
    with truncate. A cond_resched() was added after removing up to
    PAGEVEC_SIZE pages.

    Some totally unnecessary code in hugetlbfs_fallocate() that remained from
    early development was also removed.

    Tested with fallocate tests submitted here:
    http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
    And, some ftruncate tests under development

    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: "Hillf Danton"
    Cc: [4.3]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Sep, 2015

3 commits

  • This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to unmap a specific range of pages.
    Modify the existing hugetlb_vmtruncate_list() routine to take a
    start/end range. If end is 0, this indicates all pages after start
    should be unmapped. This is the same as the existing truncate
    functionality. Modify existing callers to add 0 as end of range.

    Since the routine will be used in hole punch as well as truncate
    operations, it is more appropriately renamed to hugetlb_vmdelete_list().

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

07 Aug, 2015

1 commit

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     

25 Jun, 2015

1 commit


27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

4 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Make 'min_size=' an option when mounting a hugetlbfs filesystem. This
    option takes the same value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less that or equal to size else the mount will fail. If min_size is
    specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • that's the bulk of filesystem drivers dealing with inodes of their own

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    Delete_from_page_cache() shouldn't be called for dirty pages, they must
    be handled by caller (either written or truncated). This patch treats
    final dirty accounting fixup at the end of __delete_from_page_cache() as
    a debug check and adds WARN_ON_ONCE() around it. If something removes
    dirty pages without proper handling that might be a bug and unwritten
    data might be lost.

    Hugetlbfs has no dirty page accounting; ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is a
    helper for nfs_invalidate_page() and it's called only in the case of
    complete invalidation.

    The mess started in v2.6.20 after commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()"). The first was reverted in v2.6.20 in commit
    ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), the
    second in v2.6.25 commit a2b345642f53 ("Fix dirty page accounting
    leak with ext3 data=journal").

    Custom fixes were introduced between these points. NFS in v2.6.23, commit
    1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()").
    Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do
    dirty page accounting when removing a page from the page cache"). Since
    v2.6.25 all of them are redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

12 Apr, 2015

2 commits


21 Jan, 2015

2 commits


14 Dec, 2014

2 commits

  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be an rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Jun, 2014

4 commits


07 May, 2014

1 commit

  • Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is probably not correctly setting
    itself up in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

04 Apr, 2014

1 commit

  • Currently, to track reserved and allocated regions, we use two different
    ways, depending on the mapping. For MAP_SHARED, we use the
    address_mapping's private_list, while for MAP_PRIVATE, we use a
    resv_map.

    Now, we are preparing to change the coarse-grained lock which protects
    a region structure to a fine-grained lock, and this difference hinders
    it. So, before changing it, unify region structure handling, consistently
    using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

25 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Dave has reported the following lockdep splat:

    =================================
    [ INFO: inconsistent lock state ]
    3.11.0-rc1+ #9 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x81/0xe7
    lockdep_trace_alloc+0x5e/0xbc
    __alloc_pages_nodemask+0x8b/0x9b6
    __get_free_pages+0x20/0x31
    get_zeroed_page+0x12/0x14
    __pmd_alloc+0x1c/0x6b
    huge_pmd_share+0x265/0x283
    huge_pte_alloc+0x5d/0x71
    hugetlb_fault+0x7c/0x64a
    handle_mm_fault+0x255/0x299
    __do_page_fault+0x142/0x55c
    do_page_fault+0xd/0x16
    error_code+0x6c/0x74
    irq event stamp: 3136917
    hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
    hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
    softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
    softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&mapping->i_mmap_mutex);

    lock(&mapping->i_mmap_mutex);

    *** DEADLOCK ***
    no locks held by kswapd0/49.

    stack backtrace:
    CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
    Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
    Call Trace:
    dump_stack+0x4b/0x79
    print_usage_bug+0x1d9/0x1e3
    mark_lock+0x1e0/0x261
    __lock_acquire+0x623/0x17f2
    lock_acquire+0x7d/0x195
    mutex_lock_nested+0x6c/0x3a7
    page_referenced+0x87/0x5e3
    shrink_page_list+0x3d9/0x947
    shrink_inactive_list+0x155/0x4cb
    shrink_lruvec+0x300/0x5ce
    shrink_zone+0x53/0x14e
    kswapd+0x517/0xa75
    kthread+0xa8/0xaa
    ret_from_kernel_thread+0x1b/0x28

    which is a false positive caused by the hugetlb pmd sharing code,
    which allocates a new pmd from within mapping->i_mmap_mutex. If this
    allocation causes reclaim, then the lockdep detector complains that we
    might self-deadlock.

    This is not correct though, because hugetlb pages are not reclaimable,
    so their mapping will never be touched from the reclaim path.

    The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
    special by assigning it a separate lockdep class, so it won't report
    possible deadlocks on unrelated mappings.

    [peterz@infradead.org: comment for annotation]
    Reported-by: Dave Jones
    Signed-off-by: Michal Hocko
    Cc: Peter Zijlstra
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because sys_mmap_pgoff() passes the
    given length to vm_mmap_pgoff() as-is, without aligning it to the
    hugepage boundary.

    This is a regression introduced by commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), which pushed the alignment code into
    hugetlb_file_setup() without changing the variable len on the caller
    side.

    To fix this, this patch partially reverts that commit and adds the
    alignment code on the caller side. It also introduces hstate_sizelog()
    to get the proper hstate for the specified hugepage size.
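    The alignment itself is the standard power-of-two round-up; here is a
    minimal standalone illustration (the ALIGN macro is reproduced here
    rather than taken from kernel headers, and the 2 MB hugepage size is
    just an example value):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Round x up to the next multiple of the power-of-two boundary a,
     * as the fix does to the mmap length before it is passed down. */
    #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

    int main(void)
    {
        unsigned long huge = 2UL << 20;         /* 2 MB hugepage */

        /* A 1-byte request rounds up to one full hugepage... */
        assert(ALIGN(1UL, huge) == huge);
        /* ...while an already-aligned length is left unchanged. */
        assert(ALIGN(2 * huge, huge) == 2 * huge);

        printf("aligned length: %lu\n", ALIGN(1UL, huge));
        return 0;
    }
    ```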

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

18 Apr, 2013

1 commit

  • Currently we fail to include any data on hugepages into coredump,
    because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was
    recently introduced by commit 314e51b9851b ("mm: kill vma flag
    VM_RESERVED and mm->reserved_vm counter").

    This looks like a serious regression to me, so let's fix it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.
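    The pattern described above can be sketched like this (a kernel-code
    sketch; the MODULE_ALIAS_FS macro and the request string follow the
    description and should be treated as illustrative):

    ```c
    /* Each filesystem that can be built as a module declares an alias in
     * the "fs-" namespace next to its usual module metadata: */
    #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

    MODULE_ALIAS_FS("ext4");        /* in the ext4 module, for example */

    /* ...and get_fs_type() requests only that namespace, so a mount can
     * only ever cause a filesystem module to load: */
    request_module("fs-%.*s", len, name);
    ```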

    A common practice is to build all of the kernel code, leaving code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with an "fs-" prefix limits the pool of
    modules that mount can load to just filesystems, trivially making
    things safer at no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives, allowing simple, safe,
    well-understood workarounds to known problematic software.
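    For example, with the "fs-" alias namespace an administrator could
    stop a particular filesystem from being auto-loaded (a hypothetical
    /etc/modprobe.d fragment; "cifs" is just an example module):

    ```
    # /etc/modprobe.d/local.conf
    # Resolve the "fs-cifs" alias to nothing, so `mount -t cifs`
    # no longer auto-loads the cifs module:
    alias fs-cifs off
    # Or prevent alias-based loading of the module entirely:
    blacklist cifs
    ```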

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases, the most significant being autofs, which lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the
    request_module in get_fs_type() without having any special permissions,
    and people get uncomfortable when a user-specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue, I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general, all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace, filesystems are not mounted unless .fs_flags =
    FS_USERNS_MOUNT, which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Feb, 2013

1 commit