05 Jun, 2014

4 commits


07 May, 2014

1 commit

  • Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is probably not setting itself up
    correctly in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
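
    The "extract the check to a helper" step described above can be sketched
    in userspace C as follows. This is a minimal illustration, not the kernel
    code: demo_hpage_shift, demo_hugepages_supported() and
    demo_hugetlbfs_init() are hypothetical names, and the -1 return value is
    an assumption standing in for whatever error the real init path reports.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for HPAGE_SHIFT: 0 means the platform reported
 * no usable huge page size at boot. */
static unsigned int demo_hpage_shift = 0;

/* The one place that answers "are huge pages usable here?". */
static bool demo_hugepages_supported(void)
{
	return demo_hpage_shift != 0;
}

/* Illustrative init path: skip registration entirely when unsupported.
 * Returns 0 on success, -1 (assumed error value) when nothing is registered. */
static int demo_hugetlbfs_init(void)
{
	if (!demo_hugepages_supported()) {
		printf("demo: no supported hugepage sizes, not registering\n");
		return -1;
	}
	/* ... filesystem registration would happen here ... */
	return 0;
}
```

    Every caller asks the helper instead of open-coding the shift test, so a
    configuration with no valid huge page size is rejected consistently.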
     

04 Apr, 2014

1 commit

  • Currently, to track reserved and allocated regions, we use two different
    ways, depending on the mapping. For MAP_SHARED, we use the
    address_space's private_list, while for MAP_PRIVATE, we use a
    resv_map.

    Now, we are preparing to change the coarse-grained lock which protects
    a region structure into a fine-grained lock, and this difference
    hinders that. So, before changing it, unify region structure handling
    by consistently using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

25 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Dave has reported the following lockdep splat:

    =================================
    [ INFO: inconsistent lock state ]
    3.11.0-rc1+ #9 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x81/0xe7
    lockdep_trace_alloc+0x5e/0xbc
    __alloc_pages_nodemask+0x8b/0x9b6
    __get_free_pages+0x20/0x31
    get_zeroed_page+0x12/0x14
    __pmd_alloc+0x1c/0x6b
    huge_pmd_share+0x265/0x283
    huge_pte_alloc+0x5d/0x71
    hugetlb_fault+0x7c/0x64a
    handle_mm_fault+0x255/0x299
    __do_page_fault+0x142/0x55c
    do_page_fault+0xd/0x16
    error_code+0x6c/0x74
    irq event stamp: 3136917
    hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
    hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
    softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
    softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
    other info that might help us debug this:
    Possible unsafe locking scenario:
           CPU0
           ----
      lock(&mapping->i_mmap_mutex);
      <Interrupt>
        lock(&mapping->i_mmap_mutex);

    *** DEADLOCK ***
    no locks held by kswapd0/49.

    stack backtrace:
    CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
    Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
    Call Trace:
    dump_stack+0x4b/0x79
    print_usage_bug+0x1d9/0x1e3
    mark_lock+0x1e0/0x261
    __lock_acquire+0x623/0x17f2
    lock_acquire+0x7d/0x195
    mutex_lock_nested+0x6c/0x3a7
    page_referenced+0x87/0x5e3
    shrink_page_list+0x3d9/0x947
    shrink_inactive_list+0x155/0x4cb
    shrink_lruvec+0x300/0x5ce
    shrink_zone+0x53/0x14e
    kswapd+0x517/0xa75
    kthread+0xa8/0xaa
    ret_from_kernel_thread+0x1b/0x28

    which is a false positive caused by the hugetlb pmd sharing code, which
    allocates a new pmd from within mapping->i_mmap_mutex. If this
    allocation causes reclaim, then the lockdep detector complains that we
    might self-deadlock.

    This is not correct though, because hugetlb pages are not reclaimable, so
    their mapping will never be touched from the reclaim path.

    The patch tells lockdep that the hugetlb i_mmap_mutex is special by
    assigning it a separate lockdep class, so it won't report possible
    deadlocks on unrelated mappings.

    [peterz@infradead.org: comment for annotation]
    Reported-by: Dave Jones
    Signed-off-by: Michal Hocko
    Cc: Peter Zijlstra
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit, and adds
    alignment code in caller side. And it also introduces hstate_sizelog()
    in order to get proper hstate to specified hugepage size.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
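
    The caller-side alignment this fix restores amounts to rounding the
    requested length up to the huge page size before it is passed on. A
    minimal sketch, assuming power-of-two page sizes; demo_align_up() and
    demo_hugepage_len() are hypothetical helpers, not the kernel functions:

```c
#include <stddef.h>

/* Round len up to the next multiple of align.
 * align must be a power of two. */
static size_t demo_align_up(size_t len, size_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Length a caller would pass on for a mapping backed by huge pages
 * of size 2^page_shift bytes. */
static size_t demo_hugepage_len(size_t len, unsigned int page_shift)
{
	return demo_align_up(len, (size_t)1 << page_shift);
}
```

    With 2MB pages (page_shift 21), a 3MB request becomes 4MB, while an
    already aligned request is left unchanged.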
     

18 Apr, 2013

1 commit

  • Currently we fail to include any data on hugepages into coredump,
    because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was
    recently introduced by commit 314e51b9851b ("mm: kill vma flag
    VM_RESERVED and mm->reserved_vm counter").

    This looks to me a serious regression, so let's fix it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives, allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
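
    The naming rule above, where a mount of type X asks for the module alias
    "fs-X", can be sketched as a small string helper. This is a userspace
    illustration only; demo_fs_module_alias() is a hypothetical name and the
    -1 return on truncation is an assumption:

```c
#include <stdio.h>
#include <string.h>

/* Build the module alias to request for a filesystem type name.
 * Returns 0 on success, -1 if the alias would not fit in buf. */
static int demo_fs_module_alias(const char *fstype, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "fs-%s", fstype);

	/* snprintf reports the length it wanted to write; reject truncation. */
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

    Because only names carrying the prefix are ever requested, a
    user-supplied type string cannot select a module that is not a
    filesystem.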
     

26 Feb, 2013

1 commit


23 Feb, 2013

2 commits

  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we lose the exact
    reason for the error. All callers of get_empty_filp() assume that the
    function can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
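
    "Return error through pointer" refers to encoding a negative errno in
    the pointer value itself, so the caller can tell the failure reasons
    apart. A minimal userspace sketch of the idea, assuming a conventional
    platform where the top 4095 address values are never valid objects; all
    demo_* names are hypothetical and this is not the kernel implementation:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno in a pointer, and decode it again. */
static inline void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }

/* True when p holds an encoded error rather than a real object. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_file { int flags; };

/* Hypothetical allocator that reports why it failed. */
static struct demo_file *demo_get_empty_file(int nr_open, int limit, int allowed)
{
	struct demo_file *f;

	if (!allowed)
		return demo_err_ptr(-EPERM);   /* operation is not allowed */
	if (nr_open >= limit)
		return demo_err_ptr(-ENFILE);  /* over the file limit */
	f = calloc(1, sizeof(*f));
	if (!f)
		return demo_err_ptr(-ENOMEM);  /* not enough memory */
	return f;
}
```

    A caller checks demo_is_err() and propagates demo_ptr_err() unchanged
    instead of assuming a single error code.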
     
  • Signed-off-by: Al Viro

    Al Viro
     

14 Dec, 2012

1 commit

  • Pull trivial branch from Jiri Kosina:
    "Usual stuff -- comment/printk typo fixes, documentation updates, dead
    code elimination."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    HOWTO: fix double words typo
    x86 mtrr: fix comment typo in mtrr_bp_init
    propagate name change to comments in kernel source
    doc: Update the name of profiling based on sysfs
    treewide: Fix typos in various drivers
    treewide: Fix typos in various Kconfig
    wireless: mwifiex: Fix typo in wireless/mwifiex driver
    messages: i2o: Fix typo in messages/i2o
    scripts/kernel-doc: check that non-void fcts describe their return value
    Kernel-doc: Convention: Use a "Return" section to describe return values
    radeon: Fix typo and copy/paste error in comments
    doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
    various: Fix spelling of "asynchronous" in comments.
    Fix misspellings of "whether" in comments.
    eisa: Fix spelling of "asynchronous".
    various: Fix spelling of "registered" in comments.
    doc: fix quite a few typos within Documentation
    target: iscsi: fix comment typos in target/iscsi drivers
    treewide: fix typo of "suport" in various comments and Kconfig
    treewide: fix typo of "suppport" in various comments
    ...

    Linus Torvalds
     

12 Dec, 2012

3 commits

  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a
    guest, thus imposing performance penalties associated with the reduced
    number of transparent huge pages that could be used by the guest workload.

    This patch-set follows the main idea discussed at 2012 LSFMMS session:
    "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
    to introduce the required changes to the virtio_balloon driver, as well as
    the changes to the core compaction & migration bits, in order to make
    those subsystems aware of ballooned pages and allow memory balloon pages
    to become movable within a guest, thus avoiding the aforementioned
    fragmentation issue.

    The following numbers show how this patch set helps compaction become
    more effective in memory-ballooned guests.

    Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests
    suite, running on a 4GB RAM KVM guest which was ballooning 512MB RAM in
    64MB chunks every minute (inflating/deflating) while the test was running:

    ===BEGIN stress-highalloc

    STRESS-HIGHALLOC
                     highalloc-3.7     highalloc-3.7
                         rc4-clean         rc4-patch
    Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
    Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
    while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)

    MMTests Statistics: duration
                                     3.7         3.7
                               rc4-clean   rc4-patch
    User                         1207.59     1207.46
    System                       1300.55     1299.61
    Elapsed                      2273.72     2157.06

    MMTests Statistics: vmstat
                                     3.7         3.7
                               rc4-clean   rc4-patch
    Page Ins                     3581516     2374368
    Page Outs                   11148692    10410332
    Swap Ins                          80          47
    Swap Outs                       3641         476
    Direct pages scanned           37978       33826
    Kswapd pages scanned         1828245     1342869
    Kswapd pages reclaimed       1710236     1304099
    Direct pages reclaimed         32207       31005
    Kswapd efficiency                93%         97%
    Kswapd velocity              804.077     622.546
    Direct efficiency                84%         91%
    Direct velocity               16.703      15.682
    Percentage direct scans           2%          2%
    Page writes by reclaim         79252        9704
    Page writes file               75611        9228
    Page writes anon                3641         476
    Page reclaim immediate         16764       11014
    Page rescued immediate             0           0
    Slabs scanned                2171904     2152448
    Direct inode steals              385        2261
    Kswapd inode steals           659137      609670
    Kswapd skipped wait                1          69
    THP fault alloc                  546         631
    THP collapse alloc               361         339
    THP splits                       259         263
    THP fault fallback                98          50
    THP collapse fail                 20          17
    Compaction stalls                747         499
    Compaction success               244         145
    Compaction failures              503         354
    Compaction pages moved        370888      474837
    Compaction move failure        77378       65259

    ===END stress-highalloc

    This patch:

    Introduce MIGRATEPAGE_SUCCESS as the default return code for
    address_space_operations.migratepage() method and documents the expected
    return code for the same method in failure cases.

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Update the hugetlb_get_unmapped_area function to make use of
    vm_unmapped_area() instead of implementing a brute force search.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, which makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straightforward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now only symbols for x86 and some other
    architecture for 1GB and 2MB are defined. The interface should already
    work for all other architectures though. Only architectures that define
    multiple hugetlb sizes actually need it (that is currently x86, tile,
    powerpc). However tile and powerpc have user configurable hugetlb
    sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropriate log2.

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
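
    The encoding described above, where the log2 of the page size is stored
    in some upper bits of the flags word and 0 means "default size", can be
    sketched as follows. The shift and mask values and all DEMO_/demo_ names
    are assumptions chosen for illustration, not taken from this commit
    message:

```c
#include <stdint.h>

/* Assumed layout: a 6-bit log2 field starting at bit 26 of the flags. */
#define DEMO_HUGE_SHIFT 26
#define DEMO_HUGE_MASK  0x3fu

/* Add the log2 of the desired huge page size to an existing flags word. */
static uint32_t demo_encode_huge_size(uint32_t flags, unsigned int log2_size)
{
	return flags | ((log2_size & DEMO_HUGE_MASK) << DEMO_HUGE_SHIFT);
}

/* Recover the log2 page size; 0 means "use the default huge page size". */
static unsigned int demo_decode_huge_log2(uint32_t flags)
{
	return (flags >> DEMO_HUGE_SHIFT) & DEMO_HUGE_MASK;
}
```

    A caller wanting 2MB pages would encode 21 and one wanting 1GB pages
    would encode 30; flags written before the extension decode to 0 and so
    keep the default size.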
     

06 Dec, 2012

1 commit


09 Oct, 2012

2 commits

  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
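
    The "template filled in using the C preprocessor" technique can be
    illustrated with a much smaller example than a real interval tree. The
    sketch below only generates an overlap test and a linear search for two
    node types, one storing its endpoints explicitly and one deriving the
    last endpoint from other fields. All names are hypothetical and the
    linear search is a stand-in for the tree walk:

```c
#include <stddef.h>

/* Template: defines NAME_overlaps() and NAME_first_overlap() for TYPE,
 * given accessor macros START(node) and LAST(node). */
#define DEMO_INTERVAL_DEFINE(TYPE, START, LAST, NAME)                        \
static int NAME##_overlaps(const TYPE *n, unsigned long s, unsigned long l)  \
{                                                                            \
	return START(n) <= l && s <= LAST(n);                                \
}                                                                            \
static const TYPE *NAME##_first_overlap(const TYPE *v, size_t cnt,           \
					unsigned long s, unsigned long l)    \
{                                                                            \
	size_t i;                                                            \
	for (i = 0; i < cnt; i++)                                            \
		if (NAME##_overlaps(&v[i], s, l))                            \
			return &v[i];                                        \
	return NULL;                                                         \
}

/* Node type with explicitly stored endpoints. */
struct demo_range { unsigned long start, last; };
#define DEMO_RANGE_START(n) ((n)->start)
#define DEMO_RANGE_LAST(n)  ((n)->last)
DEMO_INTERVAL_DEFINE(struct demo_range, DEMO_RANGE_START, DEMO_RANGE_LAST, demo_range)

/* Node type whose last endpoint is computed, not stored. */
struct demo_vma { unsigned long pgoff, nr_pages; };
#define DEMO_VMA_START(n) ((n)->pgoff)
#define DEMO_VMA_LAST(n)  ((n)->pgoff + (n)->nr_pages - 1)
DEMO_INTERVAL_DEFINE(struct demo_vma, DEMO_VMA_START, DEMO_VMA_LAST, demo_vma)
```

    The shared algorithm is written once, and each instantiation supplies
    only the node type and how to read its endpoints.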
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA.
    It has since lost its original meaning, but it still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Nobody seems
    to care about it: it is not exported to userspace directly, and it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     

21 Sep, 2012

1 commit


01 Aug, 2012

1 commit


14 Jul, 2012

1 commit


29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

06 May, 2012

1 commit

  • After we moved inode_sync_wait() out of end_writeback(), it doesn't make
    sense to call the function end_writeback() anymore. Rename it to
    clear_inode(), which better describes what the function really does:
    set the I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

26 Apr, 2012

1 commit

  • This fixes the below reported false lockdep warning. e096d0c7e2e4
    ("lockdep: Add helper function for dir vs file i_mutex annotation") added
    a similar annotation for every other inode in hugetlbfs but missed the
    root inode because it was allocated by a separate function.

    For HugeTLB fs we allow taking i_mutex in mmap. HugeTLB fs doesn't
    support file write and its file read callback is modified in a05b0855fd
    ("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
    i_mutex. Hence for HugeTLB fs with regular files we really don't take
    i_mutex with mmap_sem held.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.4.0-rc1+ #322 Not tainted
    -------------------------------------------------------
    bash/1572 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x40/0x90

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [] vfs_readdir+0x56/0xa8

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
    [] lock_acquire+0xd5/0xfa
    [] __mutex_lock_common+0x48/0x350
    [] mutex_lock_nested+0x2a/0x31
    [] hugetlbfs_file_mmap+0x7d/0x104
    [] mmap_region+0x272/0x47d
    [] do_mmap_pgoff+0x294/0x2ee
    [] sys_mmap_pgoff+0xd2/0x10e
    [] sys_mmap+0x1d/0x1f
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa81/0xd75
    [] lock_acquire+0xd5/0xfa
    [] might_fault+0x6d/0x90
    [] filldir+0x6a/0xc2
    [] dcache_readdir+0x5c/0x222
    [] vfs_readdir+0x76/0xa8
    [] sys_getdents+0x79/0xc9
    [] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&sb->s_type->i_mutex_key#12);
                                   lock(&mm->mmap_sem);
                                   lock(&sb->s_type->i_mutex_key#12);
      lock(&mm->mmap_sem);

    *** DEADLOCK ***

    1 lock held by bash/1572:
    #0: (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [] vfs_readdir+0x56/0xa8

    stack backtrace:
    Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
    Call Trace:
    [] print_circular_bug+0x1f8/0x209
    [] __lock_acquire+0xa81/0xd75
    [] ? handle_pte_fault+0x5ff/0x614
    [] ? mark_lock+0x2d/0x258
    [] ? might_fault+0x40/0x90
    [] lock_acquire+0xd5/0xfa
    [] ? might_fault+0x40/0x90
    [] ? __mutex_lock_common+0x333/0x350
    [] might_fault+0x6d/0x90
    [] ? might_fault+0x40/0x90
    [] filldir+0x6a/0xc2
    [] dcache_readdir+0x5c/0x222
    [] ? sys_ioctl+0x74/0x74
    [] ? sys_ioctl+0x74/0x74
    [] ? sys_ioctl+0x74/0x74
    [] vfs_readdir+0x76/0xa8
    [] sys_getdents+0x79/0xc9
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Aneesh Kumar K.V
    Cc: Dave Jones
    Cc: Al Viro
    Cc: Josh Boyer
    Cc: Peter Zijlstra
    Cc: Mimi Zohar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

06 Apr, 2012

1 commit

  • The unregister_filesystem() call in module init was introduced by
    d1d5e05ffdc1 ("hugetlbfs: return error code when initializing module")
    but, as Al pointed out, it is a bad idea.

    Quoted comments from Al:
    "Note that unregister_filesystem() in module init is *always* wrong;
    it's not an issue here (it's done too early to care about and
    realistically the box is not going anywhere - it'll panic when attempt
    to exec /sbin/init fails, if not earlier), but it's a damn bad
    example.

    Consider a normal fs module. Somebody loads it and in parallel with
    that we get a mount attempt on that fs type. It comes between
    register and failure exits that causes unregister; at that point we
    are screwed since grabbing a reference to module as done by mount is
    enough to prevent exit, but not to prevent the failure of init. As
    the result, module will get freed when init fails, mounted fs of that
    type be damned."

    So remove it.

    Signed-off-by: Hillf Danton
    Cc: David Rientjes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

7 commits

  • Return an errno upon failure to create inode kmem cache, and unregister
    the FS upon failure to mount.

    [akpm@linux-foundation.org: remove unneeded test of `error']
    Signed-off-by: Hillf Danton
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.

    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers. Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Truelove
     
  • Add the thread name and pid of the application that is allocating shm
    segments with MAP_HUGETLB without being a part of
    /proc/sys/vm/hugetlb_shm_group or having CAP_IPC_LOCK.

    This identifies the application so it may be fixed by avoiding using the
    deprecated exception (see Documentation/feature-removal-schedule.txt).

    Signed-off-by: David Rientjes
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
    general quota handling code, and they don't much resemble its behaviour.
    Rather than being about maintaining limits on on-disk block usage by
    particular users, they are instead about maintaining limits on in-memory
    page usage (including anonymous MAP_PRIVATE copied-on-write pages)
    associated with a particular hugetlbfs filesystem instance.

    Worse, they work by having callbacks to the hugetlbfs filesystem code from
    the low-level page handling code, in particular from free_huge_page().
    This is a layering violation in itself, but more importantly, if the
    kernel does a get_user_pages() on hugepages (which can happen from KVM
    amongst others), then the free_huge_page() can be delayed until after the
    associated inode has already been freed. If an unmount occurs at the
    wrong time, even the hugetlbfs superblock where the "quota" limits are
    stored may have been freed.

    Andrew Barry proposed a patch to fix this by having hugepages store
    pointers directly to the superblock, instead of storing a pointer to
    their address_space and reaching the superblock from there, and
    bumping the reference count as appropriate to avoid it being freed.
    Andrew Morton rejected that version, however, on the grounds that it made
    the existing layering violation worse.

    This is a reworked version of Andrew's patch, which removes the extra, and
    some of the existing, layering violation. It works by introducing the
    concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
    finite logical pool of hugepages to allocate from. hugetlbfs now creates
    a subpool for each filesystem instance with a page limit set, and a
    pointer to the subpool gets added to each allocated hugepage, instead of
    the address_space pointer used now. The subpool has its own lifetime and
    is only freed once all pages in it _and_ all other references to it (i.e.
    superblocks) are gone.

    subpools are optional - a NULL subpool pointer is taken by the code to
    mean that no subpool limits are in effect.

    Previous discussion of this bug can be found in: "Fix refcounting in
    hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
    http://marc.info/?l=linux-mm&m=126928970510627&w=1

    v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
    alloc_huge_page() - since it already takes the vma, it is not necessary.

    Signed-off-by: Andrew Barry
    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
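
    The subpool lifetime rule described in the commit above (freed only
    once all pages _and_ all other references are gone, with NULL meaning
    "no limit") can be modelled roughly as follows. This is a
    single-threaded sketch with no locking; the struct layout and every
    function name here are hypothetical and are not taken from the kernel
    source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative model of a subpool; all names are hypothetical. */
struct subpool {
    long max_pages;  /* page limit for this pool */
    long used_pages; /* pages currently charged to the pool */
    long refs;       /* non-page references, e.g. a mounted superblock */
};

static struct subpool *subpool_new(long max_pages)
{
    struct subpool *sp = malloc(sizeof(*sp));
    if (!sp)
        return NULL;
    sp->max_pages = max_pages;
    sp->used_pages = 0;
    sp->refs = 1; /* held by the creator */
    return sp;
}

/* Free the pool only when no pages and no references remain. */
static bool subpool_try_free(struct subpool *sp)
{
    if (sp->refs == 0 && sp->used_pages == 0) {
        free(sp);
        return true;
    }
    return false;
}

/* Charge n pages; a NULL subpool means no limit is in effect. */
static bool subpool_get_pages(struct subpool *sp, long n)
{
    assert(n >= 0);
    if (!sp)
        return true;
    if (sp->used_pages + n > sp->max_pages)
        return false;
    sp->used_pages += n;
    return true;
}

/* Uncharge n pages; returns true if this released the pool itself. */
static bool subpool_put_pages(struct subpool *sp, long n)
{
    if (!sp)
        return false;
    assert(n >= 0 && n <= sp->used_pages);
    sp->used_pages -= n;
    return subpool_try_free(sp);
}

/* Drop a holder reference; returns true if this released the pool itself. */
static bool subpool_put_ref(struct subpool *sp)
{
    assert(sp && sp->refs > 0);
    sp->refs--;
    return subpool_try_free(sp);
}
```

    In this model, dropping the creator's reference while pages are still
    charged leaves the pool alive, and the final subpool_put_pages() is
    what frees it. That ordering corresponds to the delayed
    free_huge_page() case the commit is concerned with.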
     
  • Make a couple of small cleanups to include/linux/hugetlb.h. The
    set_file_hugepages() function, which was not used anywhere, is removed,
    and the hugetlbfs_config and hugetlbfs_inode_info structures, together
    with the HUGETLBFS_I helper function, are moved into inode.c, the only
    place they were used.

    These structures are really linked to the hugetlbfs filesystem
    specifically, not to hugepage mm handling in general, so they belong in
    the filesystem code, not in a generally available header.

    It would be nice to move the hugetlbfs_sb_info (superblock) structure in
    there as well, but it's currently needed in a number of places via
    hstate_vma() and hstate_inode().

    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Paul Mackerras
    Cc: Andrew Barry
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap, as
    explained below:

    Thread A:
      read() on hugetlbfs
      hugetlbfs_read() called
      i_mutex grabbed
      hugetlbfs_read_actor() called
      __copy_to_user() called
      page fault is triggered
    Thread B, sharing address space with A:
      mmap() the same file
      task_B->mm->mmap_sem is grabbed
      hugetlbfs_file_mmap() is called
      attempt to grab ->i_mutex and block waiting for A to give it up
    Thread A:
      page fault handler blocks on its attempt to grab task_A->mm->mmap_sem,
      which happens to be the same thing as task_B->mm->mmap_sem. Block waiting
      for B to give it up.

    AFAIU the i_mutex locking was added to hugetlbfs_read() as per
    http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take
    care of the race between truncate and read. This patch fixes this by
    looking at page->mapping under lock_page() (find_lock_page()) to ensure
    that the inode didn't get truncated in the range during a parallel read.

    Ideally we would extend the patch to make sure we don't increase i_size
    in mmap. But that would break userspace, because applications would then
    have to use truncate(2) to increase i_size in hugetlbfs.

    Based on the original patch from Hillf Danton.

    Signed-off-by: Aneesh Kumar K.V
    Cc: Hillf Danton
    Cc: KAMEZAWA Hiroyuki
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: [everything after 2007 :)]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
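
    The fix in the commit above relies on checking page->mapping while
    holding the page lock rather than holding i_mutex across the copy to
    userspace. A much-simplified userspace model of that check is sketched
    below. It uses a pthread mutex as a stand-in for the page lock, and the
    struct and function names are assumptions for illustration, not the
    kernel's find_lock_page() implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for an address_space; hypothetical, for illustration only. */
struct mapping {
    int id;
};

/* Stand-in for a page: a lock plus a back-pointer cleared on truncate. */
struct page {
    pthread_mutex_t lock;
    struct mapping *mapping; /* NULL once the page has been truncated */
};

/* Truncate side: detach the page from its mapping under the page lock. */
static void truncate_page(struct page *pg)
{
    assert(pg != NULL);
    pthread_mutex_lock(&pg->lock);
    pg->mapping = NULL;
    pthread_mutex_unlock(&pg->lock);
}

/*
 * Read side: take the page lock and confirm the page still belongs to the
 * mapping the reader looked it up in. No inode-wide lock is held here.
 */
static bool page_still_attached(struct page *pg, struct mapping *expected)
{
    bool attached;

    assert(pg != NULL);
    pthread_mutex_lock(&pg->lock);
    attached = (pg->mapping == expected);
    pthread_mutex_unlock(&pg->lock);
    return attached;
}
```

    A reader that gets false back would treat the range as truncated
    instead of copying stale data. Because no inode-wide lock is held while
    the later copy faults, the lock-order inversion in the scenario above
    does not arise in this model.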
     
  • Use/update cached_hole_size and free_area_cache properly to speed up
    finding of a free region.

    Signed-off-by: Xiao Guangrong
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     

21 Mar, 2012

1 commit


13 Jan, 2012

1 commit

  • This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
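
    The mode selection described in the commit above can be summarised as a
    small mapping. The three MIGRATE_* names come from the commit message;
    the enum caller type and the helper functions are hypothetical and
    exist only to make the mapping concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names as given in the commit message. */
enum migrate_mode {
    MIGRATE_ASYNC,
    MIGRATE_SYNC_LIGHT,
    MIGRATE_SYNC
};

/* Hypothetical classification of migrate_pages callers. */
enum caller {
    CALLER_ASYNC_COMPACTION,
    CALLER_SYNC_COMPACTION,
    CALLER_MEMORY_HOTPLUG
};

/* Map each caller to the mode the commit message assigns to it. */
static enum migrate_mode mode_for_caller(enum caller c)
{
    switch (c) {
    case CALLER_ASYNC_COMPACTION:
        return MIGRATE_ASYNC;
    case CALLER_SYNC_COMPACTION:
        return MIGRATE_SYNC_LIGHT;
    case CALLER_MEMORY_HOTPLUG:
        return MIGRATE_SYNC;
    }
    assert(!"unknown caller");
    return MIGRATE_SYNC;
}

/* In this sketch only full MIGRATE_SYNC may write pages back to storage. */
static bool may_write_back(enum migrate_mode mode)
{
    return mode == MIGRATE_SYNC;
}
```

    With this mapping, sync compaction never takes the writeback path,
    which is how the commit avoids long stalls when many dirty pages sit on
    a slow device.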