08 Apr, 2014

2 commits

  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().
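
    Conceptually, the fast-path test changes like this (a sketch of the
    CONFIG_SLAB allocation path, not the exact diff):

    /* before: requires PF_MEMPOLICY bookkeeping at fork/exec/exit */
    if (unlikely(current->flags & PF_MEMPOLICY))
            return alternate_node_alloc(cachep, flags);

    /* after: test the pointer directly, still behind unlikely() */
    if (unlikely(current->mempolicy))
            return alternate_node_alloc(cachep, flags);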

    Running TCP_RR with netperf-2.4.5 through localhost on a 16-cpu machine
    with 64GB of memory and without a mempolicy:

    threads     before      after
         16    1249409    1244487
         32    1281786    1246783
         48    1239175    1239138
         64    1244642    1241841
         80    1244346    1248918
         96    1266436    1254316
        112    1307398    1312135
        128    1327607    1326502

    Per-process flags are a scarce resource, so we should free them up
    whenever possible. We'll be using the freed bit shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.
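
    The cleaned-up function then looks roughly like this (a sketch, with
    the per-mode handling elided):

    unsigned int mempolicy_slab_node(void)
    {
            struct mempolicy *policy;
            int node = numa_mem_id();       /* nearest node with memory */

            if (in_interrupt())
                    return node;

            policy = current->mempolicy;
            if (!policy || policy->flags & MPOL_F_LOCAL)
                    return node;

            /* ... per-mode cases (MPOL_PREFERRED, MPOL_BIND, ...) */
            return node;
    }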

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

1 commit

  • Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
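
    The canonical usage pattern becomes (a sketch; any allocation attempt
    can sit in the loop body):

    unsigned int cookie;
    struct page *page;

    do {
            cookie = read_mems_allowed_begin();
            page = alloc_page(gfp);
    } while (!page && read_mems_allowed_retry(cookie));

    Note that read_mems_allowed_retry() returns true when a retry is
    required, the inverse of the old put_mems_allowed() return value.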

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2014

1 commit

  • Pull s390 compat wrapper rework from Heiko Carstens:
    "S390 compat system call wrapper simplification work.

    The intention of this work is to get rid of all hand written assembly
    compat system call wrappers on s390, which perform proper sign or zero
    extension, or pointer conversion of compat system call parameters.
    Instead all of this should be done with C code eg by using Al's
    COMPAT_SYSCALL_DEFINEx() macro.

    Therefore all common code and s390 specific compat system calls have
    been converted to the COMPAT_SYSCALL_DEFINEx() macro.

    In order to generate correct code all compat system calls may only
    have eg compat_ulong_t parameters, but no unsigned long parameters.
    Those patches which change parameter types from unsigned long to
    compat_ulong_t parameters are separate in this series, but shouldn't
    cause any harm.

    The only compat system calls which intentionally have 64 bit
    parameters (preadv64 and pwritev64) in support of the x86/32 ABI
    haven't been changed, but are now only available if an architecture
    defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

    System calls which do not have a compat variant but still need proper
    zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
    get a proper wrapper function with the new s390 specific
    COMPAT_SYSCALL_WRAPx() macro:

    COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

    which generates the following code (simplified):

    asmlinkage long sys_brk(unsigned long brk);
    asmlinkage long compat_sys_brk(long brk)
    {
            return sys_brk((u32)brk);
    }

    Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
    includes both linux/syscall.h and linux/compat.h, it will generate
    build errors, if the declaration of sys_brk() doesn't match, or if
    there exists a non-matching compat_sys_brk() declaration.

    In addition this will intentionally result in a link error if
    somewhere else a compat_sys_brk() function exists, which probably
    should have been used instead. Two more BUILD_BUG_ONs make sure the
    size and type of each compat syscall parameter can be handled
    correctly with the s390 specific macros.

    I converted the compat system calls step by step to verify the
    generated code is correct and matches the previous code. In fact it
    did not always match, however that was always a bug in the hand
    written asm code.

    In result we get less code, less bugs, and much more sanity checking"

    * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
    s390/compat: add copyright statement
    compat: include linux/unistd.h within linux/compat.h
    s390/compat: get rid of compat wrapper assembly code
    s390/compat: build error for large compat syscall args
    mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: convert to COMPAT_SYSCALL_DEFINE
    security/compat: convert to COMPAT_SYSCALL_DEFINE
    mm/compat: convert to COMPAT_SYSCALL_DEFINE
    net/compat: convert to COMPAT_SYSCALL_DEFINE
    kernel/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: optional preadv64/pwrite64 compat system calls
    ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
    s390/compat: partial parameter conversion within syscall wrappers
    s390/compat: automatic zero, sign and pointer conversion of syscalls
    s390/compat: add sync_file_range and fallocate compat syscalls
    ...

    Linus Torvalds
     

06 Mar, 2014

1 commit

  • Convert all compat system call functions, where all parameter types
    have a size of four bytes or less or are pointer types, to
    COMPAT_SYSCALL_DEFINE.
    The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
    zero and sign extension to 64 bit of all parameters if needed.
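
    For mm/mempolicy.c this looks like, for example (a sketch based on the
    in-tree conversion):

    COMPAT_SYSCALL_DEFINE3(set_mempolicy, int, mode,
                           compat_ulong_t __user *, nmask,
                           compat_ulong_t, maxnode)
    {
            /* body unchanged; the macro now zero-extends the 32-bit
             * parameters before the native code runs */
    }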

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

31 Jan, 2014

1 commit

  • As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference
    policy"), /proc/<pid>/numa_maps prints the mempolicy for any <vma> as
    "prefer:N" for the local node, N, of the process reading the file.

    This should only be printed when the mempolicy of <pid> is
    MPOL_PREFERRED for node N.

    If the process is actually only using the default mempolicy for local
    node allocation, make sure "default" is printed as expected.

    Signed-off-by: David Rientjes
    Reported-by: Robert Lippert
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Jan, 2014

2 commits

  • A few printk(KERN_*)s have snuck in there.

    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The command line parsing takes place before jump labels are initialised,
    which generates a warning if numa_balancing= is specified and
    CONFIG_JUMP_LABEL is set.

    On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage
    before jump_label_init was called") the kernel would have crashed. This
    patch enables automatic numa balancing later in the initialisation
    process if numa_balancing= is specified.
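
    A sketch of the approach, with variable and function names assumed from
    the description (parse early, flip the state later):

    static int numabalancing_override;      /* assumed holding variable */

    static int __init setup_numabalancing(char *str)
    {
            /* too early for jump labels; only record the request */
            if (!strcmp(str, "enable"))
                    numabalancing_override = 1;
            else if (!strcmp(str, "disable"))
                    numabalancing_override = -1;
            return 1;
    }
    __setup("numa_balancing=", setup_numabalancing);

    /* called later, once jump_label_init() has run */
    static void __init check_numabalancing_enable(void)
    {
            if (numabalancing_override)
                    set_numabalancing_state(numabalancing_override == 1);
    }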

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: stable
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Jan, 2014

3 commits

  • Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

    In order to maximize performance of workloads that do not fit in one NUMA
    node, we want to satisfy the following criteria:

    1) keep private memory local to each thread

    2) avoid excessive NUMA migration of pages

    3) distribute shared memory across the active nodes, to
    maximize memory bandwidth available to the workload

    This patch accomplishes that by implementing the following policy for
    NUMA migrations:

    1) always migrate on a private fault

    2) never migrate to a node that is not in the set of active nodes
    for the numa_group

    3) always migrate from a node outside of the set of active nodes,
    to a node that is in that set

    4) within the set of active nodes in the numa_group, only migrate
    from a node with more NUMA page faults, to a node with fewer
    NUMA page faults, with a 25% margin to avoid ping-ponging

    This results in most pages of a workload ending up on the actively
    used nodes, with reduced ping-ponging of pages between those nodes.
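
    A sketch of the resulting decision function, modelled on the rules
    above (names assumed from the scheduler's NUMA code):

    bool should_numa_migrate_memory(struct task_struct *p,
                                    struct page *page,
                                    int src_nid, int dst_cpu)
    {
            struct numa_group *ng = p->numa_group;
            int dst_nid = cpu_to_node(dst_cpu);

            /* rule 1: private faults always migrate */
            if (!ng)
                    return true;

            /* rule 2: never migrate to a node outside the active set */
            if (!node_isset(dst_nid, ng->active_nodes))
                    return false;

            /* rule 3: always migrate from outside the active set */
            if (!node_isset(src_nid, ng->active_nodes))
                    return true;

            /* rule 4: within the set, require a 25% fault margin */
            return group_faults(p, dst_nid) <
                   group_faults(p, src_nid) * 3 / 4;
    }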

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Excessive migration of pages can hurt the performance of workloads
    that span multiple NUMA nodes. However, it turns out that the
    p->numa_migrate_deferred knob is a really big hammer, which does
    reduce migration rates, but does not actually help performance.

    Now that the second stage of the automatic numa balancing code
    has stabilized, it is time to replace the simplistic migration
    deferral code with something smarter.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Pull powerpc updates from Ben Herrenschmidt:
    "So here's my next branch for powerpc. A bit late as I was on vacation
    last week. It's mostly the same stuff that was in next already, I
    just added two patches today which are the wiring up of lockref for
    powerpc, which for some reason fell through the cracks last time and
    is trivial.

    The highlights are, in addition to a bunch of bug fixes:

    - Reworked Machine Check handling on kernels running without a
    hypervisor (or acting as a hypervisor). Provides hooks to handle
    some errors in real mode such as TLB errors, handle SLB errors,
    etc...

    - Support for retrieving memory error information from the service
    processor on IBM servers running without a hypervisor and routing
    them to the memory poison infrastructure.

    - _PAGE_NUMA support on server processors

    - 32-bit BookE relocatable kernel support

    - FSL e6500 hardware tablewalk support

    - A bunch of new/revived board support

    - FSL e6500 deeper idle states and altivec powerdown support

    You'll notice a generic mm change here, it has been acked by the
    relevant authorities and is a pre-req for our _PAGE_NUMA support"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
    powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
    powerpc: Add support for the optimised lockref implementation
    powerpc/powernv: Call OPAL sync before kexec'ing
    powerpc/eeh: Escalate error on non-existing PE
    powerpc/eeh: Handle multiple EEH errors
    powerpc: Fix transactional FP/VMX/VSX unavailable handlers
    powerpc: Don't corrupt transactional state when using FP/VMX in kernel
    powerpc: Reclaim two unused thread_info flag bits
    powerpc: Fix races with irq_work
    Move precessing of MCE queued event out from syscall exit path.
    pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
    powerpc: Make add_system_ram_resources() __init
    powerpc: add SATA_MV to ppc64_defconfig
    powerpc/powernv: Increase candidate fw image size
    powerpc: Add debug checks to catch invalid cpu-to-node mappings
    powerpc: Fix the setup of CPU-to-Node mappings during CPU online
    powerpc/iommu: Don't detach device without IOMMU group
    powerpc/eeh: Hotplug improvement
    powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
    powerpc/eeh: Add restore_config operation
    ...

    Linus Torvalds
     

24 Jan, 2014

2 commits

  • Commit 11c731e81bb0 ("mm/mempolicy: fix !vma in new_vma_page()") has
    removed BUG_ON(!vma) from new_vma_page which is partially correct
    because page_address_in_vma will return EFAULT for non-linear mappings
    and at least shared shmem might be mapped this way.

    The patch also tried to prevent NULL ptr for hugetlb pages which is not
    correct AFAICS because hugetlb pages cannot be mapped as VM_NONLINEAR
    and other conditions in page_address_in_vma seem to be legit and catch
    real bugs.

    This patch restores BUG_ON for PageHuge to catch potential issues when
    the to-be-migrated page is not setup properly.

    Signed-off-by: Michal Hocko
    Reviewed-by: Bob Liu
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Add a working sysctl to enable/disable automatic numa memory balancing
    at runtime.

    This allows us to track down performance problems with this feature and
    is generally a good idea.

    This was possible earlier through debugfs, but only with special
    debugging options set. Also fix the boot message.

    [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
    Signed-off-by: Andi Kleen
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

19 Dec, 2013

2 commits

  • The BUG_ON(!vma) assumption was introduced by commit 0bf598d863e3 ("mbind:
    add BUG_ON(!vma) in new_vma_page()"); however, even if

    address = __vma_address(page, vma);

    and

    vma->start < address < vma->end

    page_address_in_vma() may still return -EFAULT because of many other
    conditions in it. As a result the while loop in new_vma_page() may end
    with vma=NULL.

    This patch reverts the commit and also fixes the potential NULL pointer
    dereference reported by Dan.

    http://marc.info/?l=linux-mm&m=137689530323257&w=2

    kernel BUG at mm/mempolicy.c:1204!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 3 PID: 7056 Comm: trinity-child3 Not tainted 3.13.0-rc3+ #2
    task: ffff8801ca5295d0 ti: ffff88005ab20000 task.ti: ffff88005ab20000
    RIP: new_vma_page+0x70/0x90
    RSP: 0000:ffff88005ab21db0 EFLAGS: 00010246
    RAX: fffffffffffffff2 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000008040075 RSI: ffff8801c3d74600 RDI: ffffea00079a8b80
    RBP: ffff88005ab21dc8 R08: 0000000000000004 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff2
    R13: ffffea00079a8b80 R14: 0000000000400000 R15: 0000000000400000

    FS: 00007ff49c6f4740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ff49c68f994 CR3: 000000005a205000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Stack:
    ffffea00079a8b80 ffffea00079a8bc0 ffffea00079a8ba0 ffff88005ab21e50
    ffffffff811adc7a 0000000000000000 ffff8801ca5295d0 0000000464e224f8
    0000000000000000 0000000000000002 0000000000000000 ffff88020ce75c00
    Call Trace:
    migrate_pages+0x12a/0x850
    SYSC_mbind+0x513/0x6a0
    SyS_mbind+0xe/0x10
    ia32_do_call+0x13/0x13
    Code: 85 c0 75 2f 4c 89 e1 48 89 da 31 f6 bf da 00 02 00 65 44 8b 04 25 08 f7 1c 00 e8 ec fd ff ff 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 0b 66 0f 1f 44 00 00 4c 89 e6 48 89 df ba 01 00 00 00 e8 48
    RIP [] new_vma_page+0x70/0x90
    RSP
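
    For reference, the loop in question (simplified) can legitimately fall
    through with vma == NULL, which the allocation path must tolerate:

    while (vma) {
            address = page_address_in_vma(page, vma);
            if (address != -EFAULT)
                    break;
            vma = vma->vm_next;
    }
    /* vma may be NULL here; the allocation falls back to the task or
     * system default policy in that case */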

    Signed-off-by: Wanpeng Li
    Reported-by: Dave Jones
    Reported-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • queue_pages_range() isolates hugetlbfs pages and putback_lru_pages()
    can't handle these. We should change it to putback_movable_pages().

    Naoya said that it is worth going into stable, because it can break
    in-use hugepage list.
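
    The fix is a one-line substitution (sketch):

    /* before: cannot handle isolated hugetlbfs pages */
    putback_lru_pages(&pagelist);
    /* after */
    putback_movable_pages(&pagelist);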

    Signed-off-by: Joonsoo Kim
    Acked-by: Rafael Aquini
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

09 Dec, 2013

1 commit

  • change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
    On archs like ppc64 that don't use _PAGE_PROTNONE and also have
    a separate page table outside the Linux page table, we just need to
    make sure that when calling change_prot_numa we flush the
    hardware page table entry so that the next page access results in a
    numa fault.

    We still need to make sure we use the numa faulting logic only
    when CONFIG_NUMA_BALANCING is set. This implies the migrate-on-fault
    (Lazy migration) via mbind will only work if CONFIG_NUMA_BALANCING
    is set.
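
    The mempolicy side of this is mostly a guard change (a sketch of the
    idea):

    /* before: the lazy-migration path was gated on the arch's use of
     * _PAGE_PROTNONE */
    #if defined(CONFIG_ARCH_USES_NUMA_PROT_NONE)
    /* after: gate it on the feature that consumes the faults */
    #ifdef CONFIG_NUMA_BALANCING
    unsigned long change_prot_numa(struct vm_area_struct *vma,
                                   unsigned long addr, unsigned long end);
    #endif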

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

22 Nov, 2013

1 commit

  • Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:

    mm/mempolicy.c: In function 'mpol_to_str':
    mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments

    Kees says this is because he is using -Wformat-security.

    Silence the warning.
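
    The fix is the usual one for -Wformat-security (a sketch, assuming the
    mode-string lookup in mpol_to_str()):

    /* before: the looked-up string is used directly as the format */
    p += snprintf(p, maxlen, policy_modes[mode]);
    /* after: pass it as an argument to a literal format */
    p += snprintf(p, maxlen, "%s", policy_modes[mode]);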

    Signed-off-by: David Rientjes
    Reported-by: Fengguang Wu
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Nov, 2013

1 commit

  • Hugetlb supports multiple page sizes. We use the split lock only at the
    PMD level, but not at the PUD level.
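
    A sketch of the resulting lock selection (helper name as in this
    series):

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                    struct mm_struct *mm, pte_t *pte)
    {
            /* PMD-sized hugepages can use the split PMD lock */
            if (huge_page_size(h) == PMD_SIZE)
                    return pmd_lockptr(mm, (pmd_t *) pte);
            /* larger (PUD-level) hugepages keep the global lock */
            VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
            return &mm->page_table_lock;
    }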

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

2 commits

  • Use the more appropriate NUMA_NO_NODE instead of -1

    Signed-off-by: Jianguo Wu
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when doing
    *p++ = '=' and *p++ = ':' and maxlen has been reached.
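
    The defensive default case looks roughly like this (a sketch):

    switch (pol->mode) {
    case MPOL_DEFAULT:
    case MPOL_PREFERRED:
    case MPOL_BIND:
    case MPOL_INTERLEAVE:
            break;                  /* handled normally below */
    default:
            WARN_ON_ONCE(1);        /* new mode without a string */
            snprintf(p, maxlen, "unknown");
            return;
    }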

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2013

6 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks to near the memory.

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.
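
    A sketch of the deferral logic, with field and helper names assumed
    from the sysctl's description:

    /* a shared fault that was skipped arms the counter ... */
    static void defer_numa_migrate(struct task_struct *p)
    {
            p->numa_migrate_deferred =
                    sysctl_numa_balancing_migrate_deferred;
    }

    /* ... and migration stays off while it drains */
    static bool numa_migrate_deferred(struct task_struct *p)
    {
            if (p->numa_migrate_deferred) {
                    p->numa_migrate_deferred--;
                    return true;    /* skip this migration */
            }
            return false;
    }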

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With the scan rate code working (at least for multi-instance specjbb),
    the large hammer that is "sched: Do not migrate memory immediately after
    switching node" can be replaced with something smarter. Revert the
    temporary migration disabling and remove all traces of numa_migrate_seq.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Change the per page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.
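
    Illustrative encoding (a sketch; the mask and shift names are assumed):

    /* pack the last cpu and pid into the page-flag field instead of
     * nid and pid; the node stays derivable via cpu_to_node(cpu) */
    last_cpupid = ((cpu & LAST_CPU_MASK) << LAST_PID_BITS) |
                  (pid & LAST_PID_MASK);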

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a 90% regression observed with a large Oracle performance test
    on a 4 node system. Profiles indicated that the overhead was due to
    contention on sp_lock when looking up shared memory policies. These
    policies do not have the appropriate flags to allow them to be
    automatically balanced so trapping faults on them is pointless. This
    patch skips VMAs that do not have MPOL_F_MOF set.
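
    The scanner-side check is roughly (a sketch based on the description):

    /* in the fault-trapping scan: skip VMAs whose policy will never
     * migrate on fault */
    if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
            continue;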

    [riel@redhat.com: Initial patch]

    Signed-off-by: Mel Gorman
    Reported-and-tested-by: Joe Mario
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • The load balancer can move tasks between nodes and does not take NUMA
    locality into account. With automatic NUMA balancing this may result in
    the task's working set being migrated to the new node. However, as the
    fault buffer will still store faults from the old node, the scheduler
    may decide to reset the preferred node and migrate the task back,
    resulting in more migrations.

    The ideal would be that the scheduler did not migrate tasks with a heavy
    memory footprint, but this may result in nodes being overloaded. We
    could also discard the fault information on task migration, but this
    would still cause the task's working set to be migrated. This patch
    simply avoids migrating the memory for a short time after a task is
    migrated.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.
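
    Illustrative classification (a sketch; helper and mask names assumed):

    /* a fault counts as private when the pid bits stored in the page
     * flags match the faulting task */
    priv = (nidpid_to_pid(last_nidpid) == (p->pid & LAST_PID_MASK));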

    Signed-off-by: Mel Gorman
    [ Fixed compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

12 Sep, 2013

6 commits

  • new_vma_page() is called only by page migration called from do_mbind(),
    where pages to be migrated are queued into a pagelist by
    queue_pages_range(). queue_pages_range() confirms that a queued page
    belongs to some vma, so the !vma case is not supposed to happen. This
    patch adds BUG_ON() to catch this unexpected case.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The function check_range() (and its family) is not well-named, because it
    does not only check something, but also moves pages from list to list to
    do page migration for them. So queue_pages_*range is a more desirable
    name.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
    migrate hugepage with mbind(2) after applying the enablement patch which
    comes later in this series.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend check_range() to handle vma with VM_HUGETLB set. We will be able
    to migrate hugepage with migrate_pages(2) after applying the enablement
    patch which comes later in this series.

    Note that for larger hugepages (covered by pud entries, 1GB on x86_64
    for example), we simply skip them for now.

    Note that using pmd_huge/pud_huge assumes that hugepages are pointed to
    by pmd/pud. This is not true on some architectures implementing
    hugepages with other mechanisms, like ia64, but it's OK because
    pmd_huge/pud_huge simply return 0 on such arches and the page walker
    simply ignores such hugepages.
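
    Illustrative shape of the walker change (a sketch; the helper name
    follows this series' queue_pages_* convention):

    /* in the pmd-level walk: isolate hugetlbfs pages for migration
     * instead of trying to split them */
    if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
            queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
                                          flags, private);
            continue;       /* pud-sized hugepages are skipped */
    }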

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If node == NUMA_NO_NODE, pol is NULL, so we should return NULL instead
    of doing the "if (!pol->mode)" check.

    [akpm@linux-foundation.org: reorganise code]
    Signed-off-by: Jianguo Wu
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.
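
    The helper, roughly (a sketch of vma_dup_policy()):

    int vma_dup_policy(struct vm_area_struct *src,
                       struct vm_area_struct *dst)
    {
            struct mempolicy *pol = mpol_dup(vma_policy(src));

            if (IS_ERR(pol))
                    return PTR_ERR(pol);
            dst->vm_policy = pol;
            return 0;
    }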

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Aug, 2013

1 commit

  • vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area;
    area is the vma the caller needs to update and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.
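
    The mbind_range() change is roughly (a sketch):

    vma = vma_merge(mm, prev, vmstart, vmend,
                    /* ... remaining arguments unchanged ... */);
    if (vma) {
            /* previously: continue unconditionally */
            if (mpol_equal(vma_policy(vma), new_pol))
                    continue;
            /* vma_merge() joined vma && vma->vm_next (case 8):
             * the policy must still be replaced */
            goto replace;
    }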

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Mar, 2013

2 commits

  • Currently, n_new is wrongly initialized: the start and end parameters
    are inverted. Let's fix it.
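
    The fix is a one-line argument swap (a sketch, assuming the
    sp_node_init() call in shared_policy_replace()):

    /* before */
    sp_node_init(n_new, end, start, NULL);
    /* after */
    sp_node_init(n_new, start, end, NULL);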

    Signed-off-by: KOSAKI Motohiro
    Cc: Hillf Danton
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • n->end is accessed in sp_insert(). Thus it should be updated
    before calling sp_insert(). This mistake may cause a kernel panic.
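
    The ordering fix (a sketch):

    /* before: sp_insert() saw a stale n->end */
    sp_insert(sp, n);
    n->end = start;

    /* after */
    n->end = start;
    sp_insert(sp, n);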

    Signed-off-by: Hillf Danton
    Signed-off-by: KOSAKI Motohiro
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Feb, 2013

4 commits

  • Make a sweep through mm/ and convert code that uses -1 directly to using
    the more appropriate NUMA_NO_NODE.
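
    A typical conversion (illustrative):

    /* before */
    int nid = -1;
    /* after */
    int nid = NUMA_NO_NODE;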

    Signed-off-by: David Rientjes
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration of KSM pages is now safe: remove the PageKsm restrictions from
    mempolicy.c and migrate.c.

    But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
    are irrelevant to KSM: it looks as if that code was preventing hotremove
    migration of KSM pages, unless they happened to be in swapcache.

    There is some question as to whether enforcing a NUMA mempolicy migration
    ought to migrate KSM pages, mapped into entirely unrelated processes; but
    moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
    and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
    any area where this is a worry.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.
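
    For reference, the renames described above are:

    page_xchg_last_nid()  -> page_nid_xchg_last()
    page_last_nid()       -> page_nid_last()
    reset_page_last_nid() -> page_nid_reset_last()
    reset_page_mapcount() -> page_mapcount_reset()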

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman