13 Oct, 2012

3 commits

  • Pull third pile of VFS updates from Al Viro:
    "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit
    and namei, removing a lot of insanity from audit_inode() mess and
    getting things ready for his ESTALE patchset."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    procfs: don't need a PATH_MAX allocation to hold a string representation of an int
    vfs: embed struct filename inside of names_cache allocation if possible
    audit: make audit_inode take struct filename
    vfs: make path_openat take a struct filename pointer
    vfs: turn do_path_lookup into wrapper around struct filename variant
    audit: allow audit code to satisfy getname requests from its names_list
    vfs: define struct filename and have getname() return it
    vfs: unexport getname and putname symbols
    acct: constify the name arg to acct_on
    vfs: allocate page instead of names_cache buffer in mount_block_root
    audit: overhaul __audit_inode_child to accomodate retrying
    audit: optimize audit_compare_dname_path
    audit: make audit_compare_dname_path use parent_len helper
    audit: remove dirlen argument to audit_compare_dname_path
    audit: set the name_len in audit_inode for parent lookups
    audit: add a new "type" field to audit_names struct
    audit: reverse arguments to audit_inode_child
    audit: no need to walk list in audit_inode if name is NULL
    audit: pass in dentry to audit_copy_inode wherever possible
    audit: remove unnecessary NULL ptr checks from do_path_lookup

    Linus Torvalds
     
  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.
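
    A minimal sketch of the resulting filp_open() wrapper (assuming the
    struct-filename variant is named file_open_name(), as in the final tree;
    details may differ):

    struct file *filp_open(const char *filename, int flags, umode_t mode)
    {
            /* wrap the bare string in an on-stack struct filename */
            struct filename name = {.name = filename};

            return file_open_name(&name, flags, mode);
    }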

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.
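
    For reference, the shape of the structure as introduced here is roughly
    the following (a sketch; the field comments are ours):

    struct filename {
            const char              *name;  /* pointer to the kernel-space copy */
            const char __user       *uptr;  /* original userland pointer */
    };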

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

12 Oct, 2012

3 commits

  • Pull SLAB fix from Pekka Enberg:
    "This contains a lockdep false positive fix from Jiri Kosina I missed
    from the previous pull request."

    * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm, slab: release slab_mutex earlier in kmem_cache_destroy()

    Linus Torvalds
     
  • Pull pile 2 of vfs updates from Al Viro:
    "Stuff in this one - assorted fixes, lglock tidy-up, death to
    lock_super().

    There'll be a VFS pile tomorrow (with patches from Jeff Layton,
    sanitizing getname() and related parts of audit and preparing for
    ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
    of the bugs closed here are quite unpleasant."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: bogus warnings in fs/namei.c
    consitify do_mount() arguments
    lglock: add DEFINE_STATIC_LGLOCK()
    lglock: make the per_cpu locks static
    lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
    MAX_LFS_FILESIZE definition for 64bit needs LL...
    tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
    vfs: drop lock/unlock super
    ufs: drop lock/unlock super
    sysv: drop lock/unlock super
    hpfs: drop lock/unlock super
    fat: drop lock/unlock super
    ext3: drop lock/unlock super
    exofs: drop lock/unlock super
    dup3: Return an error when oldfd == newfd.
    fs: handle failed audit_log_start properly
    fs: prevent use after free in auditing when symlink following was denied

    Linus Torvalds
     
  • Pull writeback fixes from Fengguang Wu:
    "Three trivial writeback fixes"

    * 'writeback-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug
    writeback: correct comment for move_expired_inodes()
    backing-dev: use kstrto* in preference to simple_strtoul

    Linus Torvalds
     

10 Oct, 2012

2 commits

  • Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on
    __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
    through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
    get_online_cpus().

    Lockdep thinks that this might actually result in ABBA deadlock,
    and reports it as below:

    === [ cut here ] ===
    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc5-00004-g0d8ee37 #143 Not tainted
    -------------------------------------------------------
    kworker/u:2/40 is trying to acquire lock:
    (rcu_sched_state.barrier_mutex){+.+...}, at: [] _rcu_barrier+0x26/0x1e0

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x45/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (slab_mutex){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] cpuup_callback+0x2f/0xbe
    [] notifier_call_chain+0x93/0x140
    [] __raw_notifier_call_chain+0x9/0x10
    [] _cpu_up+0xba/0x14e
    [] cpu_up+0xbc/0x117
    [] smp_init+0x6b/0x9f
    [] kernel_init+0x147/0x1dc
    [] kernel_thread_helper+0x4/0x10

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] get_online_cpus+0x37/0x50
    [] _rcu_barrier+0xbb/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] deactivate_locked_super+0x49/0x90
    [] deactivate_super+0x61/0x70
    [] mntput_no_expire+0x127/0x180
    [] sys_umount+0x6e/0xd0
    [] system_call_fastpath+0x16/0x1b

    -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
    [] check_prev_add+0x3de/0x440
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] _rcu_barrier+0x26/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] kmem_cache_destroy+0xd1/0xe0
    [] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
    [] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
    [] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
    [] ops_exit_list+0x39/0x60
    [] cleanup_net+0xfb/0x1b0
    [] process_one_work+0x26b/0x4c0
    [] worker_thread+0x12e/0x320
    [] kthread+0x9e/0xb0
    [] kernel_thread_helper+0x4/0x10

    other info that might help us debug this:

    Chain exists of:
    rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex

    Possible unsafe locking scenario:

           CPU0                           CPU1
           ----                           ----
    lock(slab_mutex);
                                   lock(cpu_hotplug.lock);
                                   lock(slab_mutex);
    lock(rcu_sched_state.barrier_mutex);

    *** DEADLOCK ***
    === [ cut here ] ===

    This is actually a false positive. Lockdep has no way of knowing the fact
    that the ABBA can actually never happen, because of special semantics of
    cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
    exclusion there is not achieved through mutex, but through
    cpu_hotplug.refcount.

    The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
    until everyone who called get_online_cpus() will call put_online_cpus()"
    semantics is totally invisible to lockdep.

    This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
    is called with it unlocked. It has two advantages:

    - it slightly reduces hold time of slab_mutex; as it's used to protect
    the cachep list, it's not necessary to hold it over kmem_cache_free()
    call any more
    - it silences the lockdep false positive warning, as it avoids lockdep ever
    learning about slab_mutex -> cpu_hotplug.lock dependency
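
    A simplified sketch of the reordering (not the literal mm/slab.c code;
    helper steps are elided):

    void kmem_cache_destroy(struct kmem_cache *cachep)
    {
            get_online_cpus();
            mutex_lock(&slab_mutex);
            /* ... drop the last reference and unlink cachep from the list ... */
            mutex_unlock(&slab_mutex);      /* released *before* rcu_barrier() */

            if (cachep->flags & SLAB_DESTROY_BY_RCU)
                    rcu_barrier();          /* no slab_mutex -> cpu_hotplug.lock edge */

            /* ... free the kmem_cache structure itself ... */
            put_online_cpus();
    }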

    Reviewed-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: David Rientjes
    Signed-off-by: Jiri Kosina
    Signed-off-by: Pekka Enberg

    Jiri Kosina
     
  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.
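
    The added guard has roughly this shape (illustrative sketch based on the
    tmpfs handler; the other filesystems get equivalent checks):

    static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
                    struct fid *fid, int fh_len, int fh_type)
    {
            u64 inum;

            if (fh_len < 3)         /* don't touch raw[2] unless the handle has it */
                    return NULL;

            inum = fid->raw[2];
            inum = (inum << 32) | fid->raw[1];
            /* ... inode lookup as before ... */
    }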

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
     

09 Oct, 2012

32 commits

  • Invalidation sequences are handled in various ways on various
    architectures.

    One way, which sparc64 uses, is to let the set_*_at() functions accumulate
    pending flushes into a per-cpu array. Then the flush_tlb_range() et al.
    calls process the pending TLB flushes.

    In this regime, the __tlb_remove_*tlb_entry() implementations are
    essentially NOPs.

    The canonical PTE zap in mm/memory.c is:

    ptent = ptep_get_and_clear_full(mm, addr, pte,
                                    tlb->fullmm);
    tlb_remove_tlb_entry(tlb, pte, addr);

    With a subsequent tlb_flush_mmu() if needed.

    Mirror this in the THP PMD zapping using:

    orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
    page = pmd_page(orig_pmd);
    tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

    This properly accommodates TLB flush mechanisms like the one described
    above.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • The transparent huge page code passes a PMD pointer in as the third
    argument of update_mmu_cache(), which expects a PTE pointer.

    This never got noticed because X86 implements update_mmu_cache() as a
    macro and thus we don't get any type checking, and X86 is the only
    architecture which supports transparent huge pages currently.

    Before other architectures can support transparent huge pages properly we
    need to add a new interface which will take a PMD pointer as the third
    argument rather than a PTE pointer.
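
    On architectures with nothing to do here, the new hook can be a no-op; a
    sketch of what the x86 side looks like (an assumption, mirroring the
    existing update_mmu_cache() macro):

    /* PMD-pointer variant of the cache hook; x86 keeps it a no-op */
    #define update_mmu_cache_pmd(vma, addr, pmd) do { } while (0)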

    [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • …YYYYYYYYYYYYYYYY>" warning

    When our x86 box calls __remove_pages(), release_mem_region() shows many
    warnings, and the box cannot unregister iomem_resource:

    "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"

    release_mem_region() has been changed to be called in PAGES_PER_SECTION
    chunks by commit de7f0cba9678 ("memory hotplug: release memory regions in
    PAGES_PER_SECTION chunks"), because powerpc registers iomem_resource in
    PAGES_PER_SECTION chunks. But when I hot add memory on an x86 box,
    iomem_resource is registered per _CRS, not per PAGES_PER_SECTION chunk, so
    the x86 box fails to unregister iomem_resource.

    The patch fixes the problem.

    Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Jiang Liu <liuj97@gmail.com>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Wen Congyang <wency@cn.fujitsu.com>
    Cc: Dave Hansen <dave@linux.vnet.ibm.com>
    Cc: Nathan Fontenot <nfont@austin.ibm.com>
    Cc: Badari Pulavarty <pbadari@us.ibm.com>
    Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Yasuaki Ishimatsu
     
  • Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
    virtual addresses in /proc/vmallocinfo too.
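
    The usual way to honour kptr_restrict in a seq_file is to print the
    addresses with %pK instead of %p; a sketch of the affected line in
    s_show() (illustrative):

    seq_printf(m, "0x%pK-0x%pK %7ld",
               v->addr, v->addr + v->size, v->size);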

    Signed-off-by: Kees Cook
    Reported-by: Brad Spengler
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • NR_MLOCK is only accounted in single page units: there's no logic to
    handle transparent hugepages. This patch checks the appropriate number of
    pages to adjust the statistics by so that the correct amount of memory is
    reflected.

    Currently:

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    #define MAP_SIZE (4 << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 29844 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    And with this patch:

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    mlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 4213664 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB
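
    The accounting fix itself boils down to adjusting NR_MLOCK by the number
    of base pages a (possibly huge) page covers, rather than by one; a sketch
    of the mlock side (simplified):

    if (!TestSetPageMlocked(page))
            mod_zone_page_state(page_zone(page), NR_MLOCK,
                                hpage_nr_pages(page));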

    Signed-off-by: David Rientjes
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a transparent hugepage is mapped and it is included in an mlock()
    range, follow_page() incorrectly avoids setting the page's mlock bit and
    moving it to the unevictable lru.

    This is evident if you try to mlock(), munlock(), and then mlock() a
    range again. Currently:

    #define MAP_SIZE (4 << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
    Inactive(anon): 6304 kB
    Unevictable: 4213924 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4186252 kB
    Unevictable: 19652 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4198556 kB
    Unevictable: 21684 kB

    Notice that less than 2MB was added to the unevictable list; this is
    because these pages in the range are not transparent hugepages since the
    4GB range was allocated with mmap() and has no specific alignment. If
    posix_memalign() were used instead, unevictable would not have grown at
    all on the second mlock().

    The fix is to call mlock_vma_page() so that the mlock bit is set and the
    page is added to the unevictable list. With this patch:

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4056 kB
    Unevictable: 4213940 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4198268 kB
    Unevictable: 19636 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4008 kB
    Unevictable: 4213940 kB
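
    A sketch of where the fix lands, in the THP branch of follow_page(),
    mirroring what the normal pte path already does (details may differ from
    the final hunk):

    if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
            /* trylock: don't deadlock against concurrent truncation */
            if (page->mapping && trylock_page(page)) {
                    lru_add_drain();        /* push cached pages onto the LRU */
                    if (page->mapping)
                            mlock_vma_page(page);
                    unlock_page(page);
            }
    }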

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    remove_memory() will be called when hot removing a memory device. But
    even when memory is offlined this way, userspace is not notified. So the
    patch updates the memory block's state and sends a notification to
    userspace.

    Additionally, the memory device may contain more than one memory block.
    If a memory block has already been offlined, __offline_pages() will fail,
    so we should try to offline one memory block at a time.

    Thus remove_memory() also checks each memory block's state, so there is
    no need to check a memory block's state before calling remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • remove_memory() is called in two cases:
    1. echo offline >/sys/devices/system/memory/memoryXX/state
    2. hot remove a memory device

    In the 1st case, the memory block's state is changed and a notification
    that the state changed is sent to userland after calling remove_memory(),
    so the user can notice that the memory block changed.

    But in the 2nd case, the memory block's state is not changed and no
    notification is sent to userspace even though remove_memory() is called,
    so the user cannot notice that the memory block changed.

    To prepare for adding the notification at memory hot remove, the patch
    splits the two cases as follows:
    The 1st case uses offline_pages() to offline memory.
    The 2nd case uses remove_memory() to offline memory, change the memory
    block's state and send the notification.

    This patch does not yet implement the notification in remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
    The following section mismatch warning is thrown during the build:

    WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
    The function memblock_type_name() references
    the variable __meminitdata memblock.
    This is often because memblock_type_name lacks a __meminitdata
    annotation or the annotation of memblock is wrong.

    This is because memblock_type_name() references the memblock variable,
    which has the __meminitdata attribute; hence the warning (even though the
    function is inline).

    [akpm@linux-foundation.org: remove inline]
    Signed-off-by: Raghavendra D Prabhu
    Cc: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra D Prabhu
     
    reclaim_clean_pages_from_list() reclaims clean pages before migration, so
    cc.nr_migratepages should be updated. Currently this causes no problem,
    but it could go wrong if we try to use the value in the future.

    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Presently CMA cannot migrate mlocked pages, so it ends up failing to
    allocate contiguous memory space.

    This patch allows mlocked pages to be migrated out. Of course, this can
    affect realtime processes, but in the CMA use case, failing a contiguous
    memory allocation is far worse than variable access latency to an mlocked
    page while CMA is running. If someone wants to make the system realtime,
    he shouldn't enable CMA, because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
    from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
    have been used by any tool, and of course we can restore it easily enough
    if that turns out to be wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    During memory hotplug, I found NR_ISOLATED_[ANON|FILE] increasing, causing
    the kernel to hang: when the system doesn't have enough free pages, it
    enters reclaim but never reclaims any pages, because too_many_isolated()
    stays true, and loops forever.

    The cause is that when we do memory hot-add after memory remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still have values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset() without draining, so we lose some zone stat items.

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Revert commit 0def08e3acc2 because check_range() can't fail in
    migrate_to_node() considering current use cases.

    Quote from Johannes

    : I think it makes sense to revert. Not because of the semantics, but I
    : just don't see how check_range() could even fail for this callsite:
    :
    : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
    : find_vma()
    :
    : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
    : and so can not fail
    :
    : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
    : continue until addr == end, so we never fail with -EIO

    A new VM_BUG_ON is also added to catch any future migrate_to_node() use
    case that might pass MPOL_MF_STRICT.

    Suggested-by: Johannes Weiner
    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vasiliy Kulikov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.
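
    Concretely, a change_pte site now looks roughly like this (sketch):

    mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);
    ...
    set_pte_at_notify(mm, address, page_table, entry); /* may fire ->change_pte */
    ...
    mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);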

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
     
  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their processes address spaces. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages needs to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a spinlock, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
    page from vma") fixed pgoff calculation but it has replaced it by
    vma_hugecache_offset() which is not approapriate for offsets used for
    vma_prio_tree_foreach() because that one expects index in page units
    rather than in huge_page_shift.

    Johannes said:

    : The resulting index may not be too big, but it can be too small: assume
    : hpage size of 2M and the address to unmap to be 0x200000. This is regular
    : page index 512 and hpage index 1. If you have a VMA that maps the file
    : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
    : you ask for offset 1 and miss it even though it does map the page of
    : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
    : cow until the allocation succeeds or the skipped vma(s) go away.

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.
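
    A sketch of the check during zone iteration (the helper and field names
    here follow our recollection of the final patch and should be treated as
    assumptions):

    /* reclaim only from zones whose node is within RECLAIM_DISTANCE of ours */
    static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
    {
            return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
    }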

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Leave UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed in place, but
    comment that they are now always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In fuzzing with trinity, lockdep protested "possible irq lock inversion
    dependency detected" when isolate_lru_page() reenabled interrupts while
    still holding the supposedly irq-safe tree_lock:

    invalidate_inode_pages2
      invalidate_complete_page2
        spin_lock_irq(&mapping->tree_lock)
        clear_page_mlock
          isolate_lru_page
            spin_unlock_irq(&zone->lru_lock)

    isolate_lru_page() is correct to enable interrupts unconditionally:
    invalidate_complete_page2() is incorrect to call clear_page_mlock() while
    holding tree_lock, which is supposed to nest inside lru_lock.

    Both truncate_complete_page() and invalidate_complete_page() call
    clear_page_mlock() before taking tree_lock to remove page from radix_tree.
    I guess invalidate_complete_page2() preferred to test PageDirty (again)
    under tree_lock before committing to the munlock; but since the page has
    already been unmapped, its state is already somewhat inconsistent, and no
    worse if clear_page_mlock() moved up.

    Reported-by: Sasha Levin
    Deciphered-by: Andrew Morton
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • kmem code uses this function and it is better to not use forward
    declarations for static inline functions as some (older) compilers don't
    like it:

    gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)

    mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
    mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here

    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
    the code is not used if !CONFIG_INET so we should rather test for both.
    The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
    let's keep those outside of any ifdefs because it is considered safer wrt.
    future maintainability.

    Tested with
    - CONFIG_INET && CONFIG_MEMCG_KMEM
    - !CONFIG_INET && CONFIG_MEMCG_KMEM
    - CONFIG_INET && !CONFIG_MEMCG_KMEM
    - !CONFIG_INET && !CONFIG_MEMCG_KMEM
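
    The resulting guard looks something like this (sketch):

    #include <net/sock.h>           /* headers stay outside any ifdef, as above */
    #include <net/ip.h>
    #include <net/tcp_memcontrol.h>

    #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
    /* ... TCP kmem accounting code ... */
    #endif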

    Signed-off-by: Sachin Kamat
    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    I think zone->present_pages indicates the pages that the buddy system can
    manage; it should be:

    zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is now:
    zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used in system boot, managed by bootmem allocator.
    memmap pages: pages used by page structs.

    This may cause zone->present_pages to be less than it should be. For
    example, NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE; its memmap and
    other bootmem are allocated from ZONE_MOVABLE, so ZONE_NORMAL's
    present_pages should be spanned pages - absent pages, but currently it
    also subtracts the memmap pages (free_area_init_core), which are actually
    allocated from ZONE_MOVABLE. When all memory of a zone is offlined, this
    causes zone->present_pages to drop below 0; because present_pages is an
    unsigned long, it wraps to a very large integer, which indirectly makes
    zone->watermark[WMARK_MIN] a large integer (setup_per_zone_wmarks()),
    then makes totalreserve_pages a large integer
    (calculate_totalreserve_pages()), and finally causes memory allocation to
    fail when forking a process (__vm_enough_memory()).

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory

    I think the bug described in

    http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch intends to fix-up zone->present_pages when memory are freed to
    buddy system on x86_64 and IA64 platforms.

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reported-by: Petr Tesarik
    Tested-by: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Now that lumpy reclaim has been removed, compaction is the only way left
    to free up contiguous memory areas. It is time to just enable
    CONFIG_COMPACTION by default.

    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    update_mmu_cache() takes a pointer (to pte_t by default) as the last
    argument, but huge_memory.c passes a pmd_t value. The patch changes the
    argument to a pmd_t pointer.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Steve Capper
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Reviewed-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Gerald Schaefer
    Reviewed-by: Andrea Arcangeli
    Cc: Chris Metcalf
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
    If NUMA is enabled, the indicator is not reset if the previous page
    request failed, causing us to trigger the BUG_ON() in
    khugepaged_alloc_page().

    Signed-off-by: Xiao Guangrong
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem
    pages with highmem") mentioned that lowmem pages can be replaced by
    highmem pages during CMA migration. 6a6dccba2fdc fixed that issue.

    Quote from that changelog:

    : The filesystem layer expects pages in the block device's mapping to not
    : be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
    : currently replace lowmem pages with highmem pages, leading to crashes in
    : filesystem code such as the one below:
    :
    : Unable to handle kernel NULL pointer dereference at virtual address 00000400
    : pgd = c0c98000
    : [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
    : Internal error: Oops: 817 [#1] PREEMPT SMP ARM
    : CPU: 0 Not tainted (3.5.0-rc5+ #80)
    : PC is at __memzero+0x24/0x80
    : ...
    : Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
    : Backtrace:
    : [] (ext4_getblk+0x0/0x180) from [] (ext4_bread+0x1c/0x98)
    : [] (ext4_bread+0x0/0x98) from [] (ext4_mkdir+0x160/0x3bc)
    : r4:c15337f0
    : [] (ext4_mkdir+0x0/0x3bc) from [] (vfs_mkdir+0x8c/0x98)
    : [] (vfs_mkdir+0x0/0x98) from [] (sys_mkdirat+0x74/0xac)
    : r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
    : [] (sys_mkdirat+0x0/0xac) from [] (sys_mkdir+0x20/0x24)
    : r6:beccdcf0 r5:00074000 r4:beccdbbc
    : [] (sys_mkdir+0x0/0x24) from [] (ret_fast_syscall+0x0/0x30)

    Memory hotplug has the same problem as CMA, so the same fix can be
    applied to memory hotplug as well; fix it by reusing that approach.

    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
    it out (move + rename as a common name) into page_isolation.c.
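
    The refactored helper presumably ends up looking something like this
    (sketch; the common name alloc_migrate_target is our assumption):

    struct page *alloc_migrate_target(struct page *page, unsigned long private,
                                      int **resultp)
    {
            gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;

            if (PageHighMem(page))
                    gfp_mask |= __GFP_HIGHMEM;      /* don't replace lowmem with highmem */

            return alloc_page(gfp_mask);
    }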

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim