13 Nov, 2013

40 commits

  • Only a couple of arches (sh/x86) use fpu_counter in task_struct, so it can
    be moved out into the arch-specific thread_struct, reducing the size of
    task_struct for the other arches.

    Compile tested sh defconfig + sh4-linux-gcc (4.6.3)

    Signed-off-by: Vineet Gupta
    Cc: Paul Mundt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • if (unlikely(x) > 0) doesn't seem to help branch prediction

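    A user-space stand-in for the kernel macro, illustrating the point of the
    change: the hint should wrap the whole condition rather than a single
    operand (the variable name here is made up for the example).

        #include <stdio.h>

        #define unlikely(x)  __builtin_expect(!!(x), 0)

        static int waiters;   /* assume a non-negative counter */

        int main(void)
        {
                /* hint applied to one operand only: the compiler learns
                 * little about the branch itself */
                if (unlikely(waiters) > 0)
                        puts("slow path");

                /* hint applied to the full comparison: this is what actually
                 * guides the branch layout */
                if (unlikely(waiters > 0))
                        puts("slow path");

                return 0;
        }
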
    Signed-off-by: Roel Kluin
    Cc: Raghavendra K T
    Cc: Konrad Rzeszutek Wilk
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • Commit 15d94b82565e ("reboot: move shutdown/reboot related functions to
    kernel/reboot.c") moved all kexec-related functionality to
    kernel/reboot.c, so kernel/sys.c no longer needs to include
    <linux/kexec.h>.

    Signed-off-by: Geert Uytterhoeven
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The wrapper function delayacct_add_tsk() already checks 'tsk->delays',
    and __delayacct_add_tsk() has no other direct callers, so the redundant
    check can be removed.

    The label 'done' is also useless, so remove it too.

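    A hedged sketch of the resulting shape (the exact signature and return
    value are assumptions based on the changelog): the inline wrapper is the
    only place that needs to look at tsk->delays.

        static inline int delayacct_add_tsk(struct taskstats *d,
                                            struct task_struct *tsk)
        {
                if (!tsk->delays)
                        return 0;
                /* no second tsk->delays check and no 'done' label inside */
                return __delayacct_add_tsk(d, tsk);
        }
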
    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • getenv() may return NULL if the given environment variable does not
    exist, which leads to a NULL dereference when calling strncat().

    Besides that, the environment variable name was copied to a temporary
    env_var buffer, but this copying can be avoided by simply using the input
    string.

    Lastly, the whole loop can be greatly simplified by using the snprintf
    function instead of playing with strncat.

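    A minimal user-space sketch of the two fixes described above (a NULL
    check on getenv() and snprintf-based assembly); the variable name and
    the surrounding parsing are illustrative, not the patched code.

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                char expanded[256];
                const char *val = getenv("DEMO_VAR");   /* may be NULL */

                if (!val)
                        val = "";   /* expand missing variables to nothing */

                /* snprintf bounds the copy and appends in one call,
                 * with no strncat juggling */
                snprintf(expanded, sizeof(expanded), "out %s out", val);
                puts(expanded);
                return 0;
        }
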
    By the way, the current implementation allows a recursive variable
    expansion, as in:

    $ echo 'out ${A} out ' | A='a ${B} a' B=b /tmp/a
    out a b a out

    I'm assuming this is just a side effect and not a conscious decision
    (especially as this may lead to an infinite loop), but I didn't want to
    change this behaviour without consulting.

    If the current behaviour is deemed incorrect, I'll be happy to send
    a patch without recursive processing.

    Signed-off-by: Michal Nazarewicz
    Cc: Kees Cook
    Cc: Jiri Kosina
    Cc: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • glibc recently changed the error string for ESTALE to remove "NFS" -

    https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=96945714ec61951cc748da2b4b8a80cf02127ee9

    from: [ERR_REMAP (ESTALE)] = N_("Stale NFS file handle"),
    to: [ERR_REMAP (ESTALE)] = N_("Stale file handle"),

    And some have expressed concern that the kernel's errno.h
    comments still refer to NFS.

    So make that change... note that this is a comment-only change,
    and has no functional difference.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • The name_to_dev_t function has a comment block which lists the supported
    syntaxes for the device name. Add a bullet for the <major>:<minor>
    syntax, which is already supported in the code.

    Signed-off-by: Sebastian Capella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Capella
     
  • For some reason I managed to trick gcc into creating CRC symbols that
    are no longer absolute, but weak.

    Make modpost handle this case.

    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Use standard gcc __attribute__((alias(foo))) to define the syscall aliases
    instead of custom assembler macros.

    This is far cleaner, and also fixes my LTO kernel build.

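    A small GCC-only demonstration of the mechanism (not the kernel's actual
    syscall glue): the alias attribute binds a second symbol to the same
    definition, with no assembler macros involved.

        #include <stdio.h>

        long sys_demo(long arg)
        {
                return arg * 2;
        }

        /* "compat_sys_demo" resolves to the very same code as sys_demo */
        long compat_sys_demo(long arg) __attribute__((alias("sys_demo")));

        int main(void)
        {
                printf("%ld %ld\n", sys_demo(21), compat_sys_demo(21));
                return 0;
        }
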
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Who needs cramfs when you have squashfs? At least, we should warn people
    that cramfs is obsolete.

    Signed-off-by: Michael Opdenacker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Opdenacker
     
  • Tests various percpu operations.

    Enable with CONFIG_PERCPU_TEST=m.

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
    registers to userspace. The Kconfig help points out that in some cases
    this can be a security risk as some systems may erroneously configure the
    map such that additional data is exposed to userspace.

    This is a problem for distributions -- some users want the MMAP
    functionality but it comes with a significant security risk. In an effort
    to mitigate this risk, and due to the low number of users of the MMAP
    functionality, I've introduced a kernel parameter, hpet_mmap_enable, that
    is required in order to actually have the HPET MMAP exposed.

    Signed-off-by: Prarit Bhargava
    Acked-by: Matt Wilson
    Signed-off-by: Clemens Ladisch
    Cc: Randy Dunlap
    Cc: Tomas Winkler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     
  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
                          3.12.0                3.12.0
                         vanilla           acctupdates
    TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
    TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
    TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
    TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
    TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
    TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
    TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
    TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

                  3.12.0      3.12.0
                 vanilla acctupdates
    User          5169.64     5184.14
    System         100.45       80.02
    Elapsed        252.75      251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla acctupdates
    NUMA page range updates     1408326    11043064
    NUMA huge PMD updates             0       21040
    NUMA PTE updates            1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The same calculation is currently done in three different places.
    Factor out that code so future changes have to be made in only one place.

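    A hedged sketch of the factored helper mentioned below; this mirrors the
    usual overcommit formula, but treat the exact body as an assumption
    rather than the patch itself.

        unsigned long vm_commit_limit(void)
        {
                return ((totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100) + total_swap_pages;
        }
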
    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Currently dirty_background_ratio/dirty_ratio contain a percentage of total
    available memory, which consists of free pages and reclaimable pages. The
    number of these pages is not equal to the amount of total system memory,
    yet they are described as a percentage of total system memory in
    Documentation/sysctl/vm.txt. So fix the documentation to avoid
    misunderstanding.

    Signed-off-by: Zheng Liu
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zheng Liu
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • The refcount routine did not fit the kernel get/put semantics exactly:
    there were too many judgement statements on the refcount, and it could go
    negative.

    This patch does the following:

    - move the refcount judgement into zswap_entry_put() to hide the
    resource-freeing function.

    - add a new function zswap_entry_find_get(), so that callers can easily
    use the following pattern:

        zswap_entry_find_get()
        ... /* do something */
        zswap_entry_put()

    - move some function declarations to eliminate a compile error.

    This patch is based on Minchan Kim's idea and suggestion.

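    A hedged sketch of the pattern described above; the helper names follow
    the changelog, while the bodies and locking details are illustrative
    rather than the actual zswap code.

        /* caller holds the tree lock; on success a reference is held */
        static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
                                                        pgoff_t offset)
        {
                struct zswap_entry *entry = zswap_rb_search(root, offset);

                if (entry)
                        zswap_entry_get(entry);
                return entry;
        }

        static void zswap_entry_put(struct zswap_tree *tree,
                                    struct zswap_entry *entry)
        {
                int refcount = --entry->refcount;

                BUG_ON(refcount < 0);
                if (refcount == 0)
                        zswap_free_entry(tree, entry);  /* free hidden behind put() */
        }
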
    Signed-off-by: Weijie Yang
    Cc: Seth Jennings
    Acked-by: Minchan Kim
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Consider the following scenario:

    thread 0: reclaims entry x (gets the refcount, but does not yet call
              zswap_get_swap_cache_page)
    thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x;
              when it finishes, entry x and its zbud are not freed as the
              refcount != 0, and now swap_map[x] = 0
    thread 0: now calls zswap_get_swap_cache_page;
              swapcache_prepare returns -ENOENT because entry x is not used
              any more, zswap_get_swap_cache_page returns
              ZSWAP_SWAPCACHE_NOMEM, and zswap_writeback_entry does nothing
              except put the refcount

    Now the memory of zswap_entry x and its zpage leaks.

    Modify:
    - check the refcount in the fail path and free the memory if it is no
    longer referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be caused not only by nomem but also by invalidate.

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • We can't see the relationship with memcg from the parameters,
    so the name with memcg_idx would be more reasonable.

    Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • This patch fixes a problem where get_unmapped_area() can return an
    illegal address, which results in mmap(2) etc. failing.

    In case that the address higher than PAGE_SIZE is set to
    /proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            void *ret = NULL, *last_map;
            size_t pagesize = sysconf(_SC_PAGESIZE);

            do {
                    last_map = ret;
                    ret = mmap(0, pagesize, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    // printf("ret=%p\n", ret);
            } while (ret != MAP_FAILED);

            if (errno != ENOMEM) {
                    printf("ERR: unexpected errno: %d (last map=%p)\n",
                           errno, last_map);
            }

            return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting back, __rmqueue_fallback()
    always uses start_migratetype if the type is not CMA. However,
    __rmqueue_fallback() is only called when the start_migratetype queue is
    entirely empty. That is, __rmqueue_fallback() always puts memory back on
    the wrong queue, except when try_to_steal_freepages() changed the
    pageblock type (i.e. the requested size is smaller than half of a page
    block). The end result is that the anti-fragmentation framework increases
    fragmentation instead of decreasing it.

    Mel's original anti fragmentation does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should have zero overhead when it is
    disabled. However, trace_mm_page_alloc_extfrag() is one of the
    exceptions: it evaluates "new_type == start_migratetype" even when the
    tracepoint is disabled.

    The code can be moved into the tracepoint's TP_fast_assign(), which
    exists for exactly this purpose. This patch does that.

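    A hedged, simplified TRACE_EVENT sketch (the event name and field set are
    illustrative, not the real mm_page_alloc_extfrag definition): the
    comparison lives inside TP_fast_assign(), so it is only evaluated when
    the tracepoint is enabled and actually fires.

        TRACE_EVENT(extfrag_sketch,

                TP_PROTO(int alloc_migratetype, int fallback_migratetype,
                         int new_migratetype),

                TP_ARGS(alloc_migratetype, fallback_migratetype, new_migratetype),

                TP_STRUCT__entry(
                        __field(int, fallback_migratetype)
                        __field(int, change_ownership)
                ),

                TP_fast_assign(
                        __entry->fallback_migratetype = fallback_migratetype;
                        /* computed here, not at the call site */
                        __entry->change_ownership =
                                (new_migratetype == alloc_migratetype);
                ),

                TP_printk("fallback_migratetype=%d change_ownership=%d",
                          __entry->fallback_migratetype,
                          __entry->change_ownership)
        );
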
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
    the argument to MIGRATE_UNMOVABLE and these attributes are lost.

    The problem was introduced by commit 49255c619fbd ("page allocator: move
    check for disabled anti-fragmentation out of fastpath"). So a 4 year
    old issue may mean that nobody uses page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.

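    A hedged sketch of the fixed check (the guard is the part that changes;
    the final helper call is the existing one): only the ordinary
    migratetypes are collapsed to MIGRATE_UNMOVABLE, while CMA and ISOLATE
    pass through untouched.

        void set_pageblock_migratetype(struct page *page, int migratetype)
        {
                if (unlikely(page_group_by_mobility_disabled &&
                             migratetype < MIGRATE_PCPTYPES))
                        migratetype = MIGRATE_UNMOVABLE;

                set_pageblock_flags_group(page, (unsigned long)migratetype,
                                          PB_migrate, PB_migrate_end);
        }
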
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from
    the storage device (impacting random read average latency).

    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):

    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

    Because ra->prev_pos is a loff_t (long long) while offset is a pgoff_t
    (unsigned long), the operands are implicitly converted. So when the
    previous offset is larger than the current offset (which happens on a
    random read pattern), the if condition is true and the access is wrongly
    interpreted as sequential. An unnecessary data prefetch is triggered,
    impacting the average random read latency.

    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.

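    A stand-alone demonstration of the conversion issue (not kernel code):
    "unsigned" stands in for a 32-bit pgoff_t and "long long" for loff_t, so
    the behaviour matches a 32-bit kernel regardless of the host.

        #include <stdio.h>

        int main(void)
        {
                unsigned offset = 100;     /* current page offset (pgoff_t) */
                long long prev_ll = 5000;  /* previous offset kept as loff_t */
                unsigned prev_u = 5000;    /* previous offset kept as pgoff_t */

                /* mixed types: the subtraction is done as signed long long,
                 * so -4900 <= 1 and the access looks "sequential" */
                printf("mixed types:    %d\n", offset - prev_ll <= 1);

                /* both unsigned: the difference wraps to a huge value and
                 * the access is correctly treated as random */
                printf("unsigned types: %d\n", offset - prev_u <= 1);
                return 0;
        }
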
    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Damien Ramonda
     
  • It has been reported on very large machines that show_mem is taking almost
    5 minutes to display information. This is a serious problem if there is
    an OOM storm. The bulk of the cost is in show_mem doing a very expensive
    PFN walk to give us the following information

    Total RAM: Also available as totalram_pages
    Highmem pages: Also available as totalhigh_pages
    Reserved pages: Can be inferred from the zone structure
    Shared pages: PFN walk required
    Unshared pages: PFN walk required
    Quick pages: Per-cpu walk required

    Only the shared/unshared pages requires a full PFN walk but that
    information is useless. It is also inaccurate as page pins of unshared
    pages would be accounted for as shared. Even if the information was
    accurate, I'm struggling to think how the shared/unshared information
    could be useful for debugging OOM conditions. Maybe it was useful before
    rmap existed when reclaiming shared pages was costly but it is less
    relevant today.

    The PFN walk could be optimised a bit but why bother as the information is
    useless. This patch deletes the PFN walker and infers the total RAM,
    highmem and reserved pages count from struct zone. It omits the
    shared/unshared page usage on the grounds that it is useless. It also
    corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
    has similar problems to HighMem with respect to lowmem/highmem exhaustion.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daeseok Youn
     
  • vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a
    node at CPU online event. However, it does not update this N_CPU info at
    CPU offline event.

    Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
    node goes offline, i.e. when the node no longer has any online CPU.

    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • After a system has booted, N_CPU is not set on any node, as has_cpu shows
    an empty line.

    # cat /sys/devices/system/node/has_cpu
    (show-empty-line)

    setup_vmstat() registers its CPU notifier callback,
    vmstat_cpuup_callback(), which marks N_CPU for a node when a CPU is
    brought online. However, setup_vmstat() is called after all CPUs have
    been launched in the boot sequence.

    Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at
    boot, which is consistent with other operations in
    vmstat_cpuup_callback(), i.e. start_cpu_timer() and
    refresh_zone_stat_thresholds().

    Also added get_online_cpus() to protect the for_each_online_cpu() loop.

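    A hedged sketch of the boot-time marking described above (the placement
    inside setup_vmstat() follows the changelog, the loop body is
    illustrative):

        get_online_cpus();
        for_each_online_cpu(cpu) {
                start_cpu_timer(cpu);
                node_set_state(cpu_to_node(cpu), N_CPU);
        }
        put_online_cpus();
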
    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
    As we mentioned before, if hotpluggable memory is used by the kernel, it
    cannot be hot-removed. So memory hotplug users may want to set all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

    Memory hotplug users may also set a node as movable node, which has
    ZONE_MOVABLE only, so that the whole node can be hot-removed.

    But the kernel cannot use memory in ZONE_MOVABLE, so with this approach
    the kernel cannot use memory in movable nodes. This will hurt NUMA
    performance, and other users may be unhappy.

    So we need a way to allow users to enable and disable this functionality.
    In this patch, we introduce the movable_node boot option to allow users
    to choose not to consume hotpluggable memory at early boot time, so that
    later it can be set as ZONE_MOVABLE.

    To achieve this, the movable_node boot option will control the memblock
    allocation direction. That said, after memblock is ready, before SRAT is
    parsed, we should allocate memory near the kernel image as we explained in
    the previous patches. So if movable_node boot option is set, the kernel
    does the following:

    1. After memblock is ready, make memblock allocate memory bottom up.
    2. After SRAT is parsed, make memblock behave as default, allocate memory
    top down.

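    A hedged sketch of that sequencing (call sites are simplified; the
    helpers named here are the ones this series adds or relies on):

        if (movable_node_is_enabled())
                memblock_set_bottom_up(true);   /* 1. memblock ready: go bottom-up */

        initmem_init();                         /* SRAT is parsed in here */

        memblock_set_bottom_up(false);          /* 2. back to the default top-down */
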
    Users can specify "movable_node" in kernel commandline to enable this
    functionality. For those who don't use memory hotplug or who don't want
    to lose their NUMA performance, just don't specify anything. The kernel
    will work as before.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Suggested-by: Kamezawa Hiroyuki
    Suggested-by: Ingo Molnar
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Memory reserved for crashkernel could be large. So we should not allocate
    this memory bottom up from the end of kernel image.

    When SRAT is parsed, we will be able to know which memory is hotpluggable,
    and we can avoid allocating this memory for the kernel. So reorder
    reserve_crashkernel() after SRAT is parsed.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    In a memory hotplug system, any numa node the kernel resides in should be
    unhotpluggable. And for a modern server, each node could have at least
    16GB memory. So memory around the kernel image is highly likely
    unhotpluggable.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    The direct memory mapping page table setup is such a case:
    init_mem_mapping() is called before SRAT is parsed. To prevent page
    tables from being allocated within hotpluggable memory, we will use the
    bottom-up direction to allocate page tables, from the end of the kernel
    image towards higher memory.

    Note:
    As for allocating page tables in lower memory, TJ said:

    : This is an optional behavior which is triggered by a very specific kernel
    : boot param, which I suspect is gonna need to stick around to support
    : memory hotplug in the current setup unless we add another layer of address
    : translation to support memory hotplug.

    As for the concern that page tables may occupy too much low memory when
    using 4K mappings (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both
    disable the use of >4k pages), TJ said:

    : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
    : the problem in itself either. Ignoring the option if 4k mapping is
    : required and memory consumption would be prohibitive should work, no?
    : Something like that would be necessary if we're gonna worry about cases
    : like this no matter how we implement it, but, frankly, I'm not sure this
    : is something worth worrying about.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Create a new function memory_map_top_down() to factor out the top-down
    direct memory mapping page table setup. This is also a preparation for
    the following patch, which will introduce the bottom-up memory mapping.
    That is, we will put the two ways of page table setup into separate
    functions and choose which way to use in init_mem_mapping(), which makes
    the code clearer.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    In a memory hotplug system, any numa node the kernel resides in should be
    unhotpluggable. And for a modern server, each node could have at least
    16GB memory. So memory around the kernel image is highly likely
    unhotpluggable.

    So the basic idea is: allocate memory from the end of the kernel image
    towards higher memory. Since memory allocated before SRAT is parsed
    won't amount to much, it will most likely be in the same node as the
    kernel image.

    The current memblock can only allocate memory top-down. So this patch
    introduces a new bottom-up allocation mode. Later, when we use this
    allocation direction, we will limit the start address to above the end
    of the kernel image.

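    A hedged sketch of the bottom-up bound described above (simplified from
    the shape of memblock_find_in_range_node(); fallback handling and
    warnings are omitted):

        if (memblock_bottom_up()) {
                /* never hand out anything below the end of the kernel image */
                phys_addr_t bottom_up_start = max(start, __pa_symbol(_end));

                ret = __memblock_find_range_bottom_up(bottom_up_start, end,
                                                      size, align, nid);
                if (ret)
                        return ret;
                /* otherwise fall back to the regular top-down search */
        }
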
    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • [Problem]

    The current Linux cannot migrate pages used by the kernel because of the
    kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
    When the pa is changed, we cannot simply update the pagetable and keep the
    va unmodified. So the kernel pages are not migratable.

    There are also some other issues that make kernel pages non-migratable.
    For example, the physical address may be cached somewhere and used
    later. It is not easy to update all such caches.

    When doing memory hotplug in Linux, we first migrate all the pages in one
    memory device somewhere else, and then remove the device. But if pages
    are used by the kernel, they are not migratable. As a result, memory used
    by the kernel cannot be hot-removed.

    Modifying the kernel direct mapping mechanism is too difficult to do,
    and it may make the kernel slower and less stable. So we use the
    following way to do memory hotplug.

    [What we are doing]

    In Linux, memory in one numa node is divided into several zones. One of
    the zones is ZONE_MOVABLE, which the kernel won't use.

    In order to implement memory hotplug in Linux, we are going to arrange
    all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
    memory. To do this, we need ACPI's help.

    In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The
    memory affinities in SRAT record every memory range in the system, and
    also, flags specifying if the memory range is hotpluggable. (Please refer
    to ACPI spec 5.0 5.2.16)

    With the help of SRAT, we have to do the following two things to achieve our
    goal:

    1. When doing memory hot-add, allow users to arrange hotpluggable memory
    as ZONE_MOVABLE.
    (This has been done by the MOVABLE_NODE functionality in Linux.)

    2. when the system is booting, prevent bootmem allocator from allocating
    hotpluggable memory for the kernel before the memory initialization
    finishes.

    The problem 2 is the key problem we are going to solve. But before solving it,
    we need some preparation. Please see below.

    [Preparation]

    Bootloader has to load the kernel image into memory. And this memory must
    be unhotpluggable. We cannot prevent this anyway. So in a memory hotplug
    system, we can assume any node the kernel resides in is not hotpluggable.

    Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
    But memblock has already started to work. In the current kernel,
    memblock allocates the following memory before SRAT is parsed:

    setup_arch()
    |->memblock_x86_fill() /* memblock is ready */
    |......
    |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
    |->reserve_real_mode() /* allocate memory under 1MB */
    |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
    |->dma_contiguous_reserve() /* specified by user, should be low */
    |->setup_log_buf() /* specified by user, several mega bytes */
    |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
    |->acpi_initrd_override() /* several mega bytes */
    |->reserve_crashkernel() /* could be large, should reorder */
    |......
    |->initmem_init() /* Parse SRAT */

    According to Tejun's advice, before SRAT is parsed, we should try our best
    to allocate memory near the kernel image. Since the whole node the kernel
    resides in won't be hotpluggable, and a node on a modern server may have
    at least 16GB of memory, allocating several megabytes of memory around the
    kernel image won't cross into hotpluggable memory.

    [About this patchset]

    So this patchset is the preparation for the problem 2 that we want to
    solve. It does the following:

    1. Make memblock be able to allocate memory bottom up.
    1) Keep all the memblock APIs' prototype unmodified.
    2) When the direction is bottom up, keep the start address greater than the
    end of kernel image.

    2. Improve init_mem_mapping() to support allocating page tables in the
    bottom-up direction.

    3. Introduce "movable_node" boot option to enable and disable this
    functionality.

    This patch (of 6):

    Create a new function __memblock_find_range_top_down() to factor out the
    top-down allocation from memblock_find_in_range_node(). This is a
    preparation because we will introduce a new bottom-up allocation mode in
    the following patch.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Implement mmap base randomization for the bottom up direction, so ASLR
    works for both mmap layouts on s390. See also commit df54d6fa5427 ("x86
    get_unmapped_area(): use proper mmap base for bottom-up direction").

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively architectures which use an own arch_pick_mmap_layout()
    implementation but call the generic arch_get_unmapped_area() now can
    also randomize their mmap_base.

    All architectures which have an own arch_pick_mmap_layout() and call the
    generic arch_get_unmapped_area() (arm64, s390, tile) currently set
    mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic
    arch_pick_mmap_layout() function. So this change is a no-op currently.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Add SetPageReclaim() before __swap_writepage() so that the page can be
    moved to the tail of the inactive list, which avoids unnecessary page
    scanning since this page was already reclaimed by the swap subsystem.

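    A hedged sketch of the ordering described above (surrounding writeback
    setup omitted): the page is marked before it is handed to the swap
    writeback path.

        /* rotate to the tail of the inactive list once writeback completes */
        SetPageReclaim(page);

        /* start writeback */
        __swap_writepage(page, &wbc, end_swap_bio_write);
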
    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang