09 Sep, 2015

3 commits

  • This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronication via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to unmap a specific range of pages.
    Modify the existing hugetlb_vmtruncate_list() routine to take a
    start/end range. If end is 0, this indicates all pages after start
    should be unmapped. This is the same as the existing truncate
    functionality. Modify existing callers to add 0 as end of range.

    Since the routine will be used in hole punch as well as truncate
    operations, it is more appropriately renamed to hugetlb_vmdelete_list().

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

07 Aug, 2015

1 commit

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     

25 Jun, 2015

1 commit


27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

4 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Make 'min_size=' be an option when mounting a hugetlbfs. This
    option takes the same value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less that or equal to size else the mount will fail. If min_size is
    specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • that's the bulk of filesystem drivers dealing with inodes of their own

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    Delete_from_page_cache() shouldn't be called for dirty pages, they must
    be handled by caller (either written or truncated). This patch treats
    final dirty accounting fixup at the end of __delete_from_page_cache() as
    a debug check and adds WARN_ON_ONCE() around it. If something removes
    dirty pages without proper handling that might be a bug and unwritten
    data might be lost.

    Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is
    helper for nfs_invalidate_page() and it's called only in case complete
    invalidation.

    The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()") first was reverted right in v2.6.20 in commit
    ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
    v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3
    data=journal").

    Custom fixes were introduced between these points. NFS in v2.6.23, commit
    1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()").
    Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do
    dirty page accounting when removing a page from the page cache"). Since
    v2.6.25 all of them are redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

12 Apr, 2015

2 commits


21 Jan, 2015

2 commits


14 Dec, 2014

2 commits

  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Jun, 2014

4 commits


07 May, 2014

1 commit

  • Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is properly not correctly setting
    itself up in this state?:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

04 Apr, 2014

1 commit

  • Currently, to track reserved and allocated regions, we use two different
    ways, depending on the mapping. For MAP_SHARED, we use
    address_mapping's private_list and, while for MAP_PRIVATE, we use a
    resv_map.

    Now, we are preparing to change a coarse grained lock which protect a
    region structure to fine grained lock, and this difference hinder it.
    So, before changing it, unify region structure handling, consistently
    using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

25 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Dave has reported the following lockdep splat:

    =================================
    [ INFO: inconsistent lock state ]
    3.11.0-rc1+ #9 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x81/0xe7
    lockdep_trace_alloc+0x5e/0xbc
    __alloc_pages_nodemask+0x8b/0x9b6
    __get_free_pages+0x20/0x31
    get_zeroed_page+0x12/0x14
    __pmd_alloc+0x1c/0x6b
    huge_pmd_share+0x265/0x283
    huge_pte_alloc+0x5d/0x71
    hugetlb_fault+0x7c/0x64a
    handle_mm_fault+0x255/0x299
    __do_page_fault+0x142/0x55c
    do_page_fault+0xd/0x16
    error_code+0x6c/0x74
    irq event stamp: 3136917
    hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
    hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
    softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
    softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&mapping->i_mmap_mutex);

    lock(&mapping->i_mmap_mutex);

    *** DEADLOCK ***
    no locks held by kswapd0/49.

    stack backtrace:
    CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
    Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
    Call Trace:
    dump_stack+0x4b/0x79
    print_usage_bug+0x1d9/0x1e3
    mark_lock+0x1e0/0x261
    __lock_acquire+0x623/0x17f2
    lock_acquire+0x7d/0x195
    mutex_lock_nested+0x6c/0x3a7
    page_referenced+0x87/0x5e3
    shrink_page_list+0x3d9/0x947
    shrink_inactive_list+0x155/0x4cb
    shrink_lruvec+0x300/0x5ce
    shrink_zone+0x53/0x14e
    kswapd+0x517/0xa75
    kthread+0xa8/0xaa
    ret_from_kernel_thread+0x1b/0x28

    which is a false positive caused by hugetlb pmd sharing code which
    allocates a new pmd from withing mapping->i_mmap_mutex. If this
    allocation causes reclaim then the lockdep detector complains that we
    might self-deadlock.

    This is not correct though, because hugetlb pages are not reclaimable so
    their mapping will be never touched from the reclaim path.

    The patch tells lockup detector that hugetlb i_mmap_mutex is special by
    assigning it a separate lockdep class so it won't report possible
    deadlocks on unrelated mappings.

    [peterz@infradead.org: comment for annotation]
    Reported-by: Dave Jones
    Signed-off-by: Michal Hocko
    Cc: Peter Zijlstra
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit, and adds
    alignment code in caller side. And it also introduces hstate_sizelog()
    in order to get proper hstate to specified hugepage size.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

18 Apr, 2013

1 commit

  • Currently we fail to include any data on hugepages into coredump,
    because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was
    recently introduced by commit 314e51b9851b ("mm: kill vma flag
    VM_RESERVED and mm->reserved_vm counter").

    This looks to me a serious regression, so let's fix it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as it's module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesytem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Feb, 2013

1 commit


23 Feb, 2013

2 commits

  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we loose the exact
    reason of the error. All callers of get_empty_filp() assume that the function
    can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     
  • Signed-off-by: Al Viro

    Al Viro
     

14 Dec, 2012

1 commit

  • Pull trivial branch from Jiri Kosina:
    "Usual stuff -- comment/printk typo fixes, documentation updates, dead
    code elimination."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    HOWTO: fix double words typo
    x86 mtrr: fix comment typo in mtrr_bp_init
    propagate name change to comments in kernel source
    doc: Update the name of profiling based on sysfs
    treewide: Fix typos in various drivers
    treewide: Fix typos in various Kconfig
    wireless: mwifiex: Fix typo in wireless/mwifiex driver
    messages: i2o: Fix typo in messages/i2o
    scripts/kernel-doc: check that non-void fcts describe their return value
    Kernel-doc: Convention: Use a "Return" section to describe return values
    radeon: Fix typo and copy/paste error in comments
    doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
    various: Fix spelling of "asynchronous" in comments.
    Fix misspellings of "whether" in comments.
    eisa: Fix spelling of "asynchronous".
    various: Fix spelling of "registered" in comments.
    doc: fix quite a few typos within Documentation
    target: iscsi: fix comment typos in target/iscsi drivers
    treewide: fix typo of "suport" in various comments and Kconfig
    treewide: fix typo of "suppport" in various comments
    ...

    Linus Torvalds
     

12 Dec, 2012

3 commits

  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a
    guest, thus imposing performance penalties associated with the reduced
    number of transparent huge pages that could be used by the guest workload.

    This patch-set follows the main idea discussed at 2012 LSFMMS session:
    "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
    to introduce the required changes to the virtio_balloon driver, as well as
    the changes to the core compaction & migration bits, in order to make
    those subsystems aware of ballooned pages and allow memory balloon pages
    become movable within a guest, thus avoiding the aforementioned
    fragmentation issue

    Following are numbers that prove this patch benefits on allowing
    compaction to be more effective at memory ballooned guests.

    Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
    running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
    chunks, at every minute (inflating/deflating), while test was running:

    ===BEGIN stress-highalloc

    STRESS-HIGHALLOC
    highalloc-3.7 highalloc-3.7
    rc4-clean rc4-patch
    Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%)
    Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%)
    while Rested 75.00 ( 0.00%) 80.00 ( 5.00%)

    MMTests Statistics: duration
    3.7 3.7
    rc4-clean rc4-patch
    User 1207.59 1207.46
    System 1300.55 1299.61
    Elapsed 2273.72 2157.06

    MMTests Statistics: vmstat
    3.7 3.7
    rc4-clean rc4-patch
    Page Ins 3581516 2374368
    Page Outs 11148692 10410332
    Swap Ins 80 47
    Swap Outs 3641 476
    Direct pages scanned 37978 33826
    Kswapd pages scanned 1828245 1342869
    Kswapd pages reclaimed 1710236 1304099
    Direct pages reclaimed 32207 31005
    Kswapd efficiency 93% 97%
    Kswapd velocity 804.077 622.546
    Direct efficiency 84% 91%
    Direct velocity 16.703 15.682
    Percentage direct scans 2% 2%
    Page writes by reclaim 79252 9704
    Page writes file 75611 9228
    Page writes anon 3641 476
    Page reclaim immediate 16764 11014
    Page rescued immediate 0 0
    Slabs scanned 2171904 2152448
    Direct inode steals 385 2261
    Kswapd inode steals 659137 609670
    Kswapd skipped wait 1 69
    THP fault alloc 546 631
    THP collapse alloc 361 339
    THP splits 259 263
    THP fault fallback 98 50
    THP collapse fail 20 17
    Compaction stalls 747 499
    Compaction success 244 145
    Compaction failures 503 354
    Compaction pages moved 370888 474837
    Compaction move failure 77378 65259

    ===END stress-highalloc

    This patch:

    Introduce MIGRATEPAGE_SUCCESS as the default return code for
    address_space_operations.migratepage() method and documents the expected
    return code for the same method in failure cases.

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Update the hugetlb_get_unmapped_area function to make use of
    vm_unmapped_area() instead of implementing a brute force search.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, this makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straight forward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now only symbols for x86 and some other
    architecture for 1GB and 2MB are defined. The interface should already
    work for all other architectures though. Only architectures that define
    multiple hugetlb sizes actually need it (that is currently x86, tile,
    powerpc). However tile and powerpc have user configurable hugetlb
    sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropiate log2.

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

06 Dec, 2012

1 commit


09 Oct, 2012

2 commits

  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
    currently it lost original meaning but still has some effects:

    | effect | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes reserved_vm counter from mm_struct. Seems like nobody
    cares about it, it does not exported into userspace directly, it only
    reduces total_vm showed in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov