04 Jul, 2014

5 commits

  • Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
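
    A rough sketch of the rule as applied in shmem_getpage_gfp()
    (illustrative only, not the actual diff):

        /* page found in cache or brought back from swap: may be on LRU */
        if (sgp == SGP_WRITE)
                mark_page_accessed(page);

        /* freshly allocated page: not yet visible, so safe to use */
        if (sgp == SGP_WRITE)
                init_page_accessed(page);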

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Until now, the kernel has had the same policy for handling victimized
    page frames that belong to kernel space (reserved/slab subsystem) or
    are non-LRU (unknown page state). In other words, the result of
    handling either of these victimized page frames is (IGNORED | FAILED),
    and the return value of memory_failure() is -EBUSY.

    This patch avoids memory_failure() returning early merely because
    (!PageLRU(p)) is true, and it also ensures that action_result() can
    report more precise information ("reserved kernel", "kernel slab", and
    "unknown page state") instead of "non LRU", especially for memory
    errors detected by memory scrubbing.
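
    An illustrative sketch of the intent (not the actual diff; the exact
    call sites and result codes in mm/memory-failure.c may differ):

        /*
         * Instead of bailing out with a generic "non LRU" result as soon
         * as !PageLRU(p) is seen, classify the page first:
         */
        if (PageReserved(p))
                action_result(pfn, "reserved kernel", IGNORED);
        else if (PageSlab(p))
                action_result(pfn, "kernel slab", IGNORED);
        else
                action_result(pfn, "unknown page state", IGNORED);
        return -EBUSY;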

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Fix a regression caused by 7fc34a62ca44 ("mm/msync.c: sync only the
    requested range in msync()").

    The xfstests generic/075 failure occurred on ext4 in data=journal mode
    because the intended range was not synced due to a wrong fstart
    calculation.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Acked-by: Matthew Wilcox
    Reviewed-by: Lukas Czerner
    Tested-by: Lukas Czerner
    Reviewed-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • min_partial means the minimum number of slabs cached on the node
    partial list. So, if nr_partial is less than it, we keep a newly empty
    slab on the node partial list rather than freeing it. But if
    nr_partial is equal to or greater than it, we already have enough
    partial slabs and should free the newly empty slab. The current
    implementation misses the equal case, so if min_partial is set to 0,
    at least one slab can still be cached. This is a critical problem for
    the kmemcg destroying logic, which doesn't work properly if any slabs
    are cached. This patch fixes that problem.
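
    A minimal sketch of the off-by-one fix in __slab_free() (the real
    patch adjusts the same comparison in more than one place):

        /*
         * Free the newly empty slab once the node already caches at
         * least min_partial slabs; the old ">" let one extra slab linger.
         */
        if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
                goto slab_empty;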

    Fixes: 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately")

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.
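
    A sketch of the shape of the fix (illustrative; helper names as used
    elsewhere in mm/page_alloc.c):

        /* init_cma_reserved_pageblock(): never free with order > MAX_ORDER-1 */
        if (pageblock_order >= MAX_ORDER) {
                int i = pageblock_nr_pages;
                struct page *p = page;

                do {
                        set_page_refcounted(p);
                        __free_pages(p, MAX_ORDER - 1);
                        p += MAX_ORDER_NR_PAGES;
                } while (i -= MAX_ORDER_NR_PAGES);
        } else {
                set_page_refcounted(page);
                __free_pages(page, pageblock_order);
        }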

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except for ia64, powerpc and tile at the moment, the
    "pageblock_order > MAX_ORDER" condition will be optimised out since
    both sides of the operator are constants. In cases where pageblock
    size is variable, the performance degradation should not be
    significant anyway, since init_cma_reserved_pageblock() is called only
    at boot time, at most MAX_CMA_AREAS times, which by default is eight.

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     

24 Jun, 2014

9 commits

  • In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). The detector works as follows:

    1. traverse all objects on all the slabs.
    2. determine whether it is active or not.
    3. if active, print who allocate this object.

    But that commit changed how free objects are managed, so the logic for
    determining whether an object is active also had to change.
    Previously, we regarded objects in the cpu caches as inactive; with
    that commit, we mistakenly regard objects in the cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts
    free memory in the slab: it unmaps the page table mapping if an object
    is free and maps it if the object is active. When the slab leak
    detector checks an object in the cpu caches, it mistakenly thinks the
    object is active and tries to access the object's memory to retrieve
    the caller of the allocation. At that point no page table mapping to
    the object exists, so the oops occurs.

    Following is oops message reported from Dave.

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector no longer accesses freed objects and no kernel oops occurs.
    The memory overhead caused by this fix is only imposed on
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so the
    overhead isn't a big problem.

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).
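
    A simplified sketch of the fault-side check (field names follow
    struct shmem_falloc in mm/shmem.c; the real patch also handles the
    wait and retry details):

        /* shmem_fault(): back off while a hole is punched over this index */
        if (unlikely(inode->i_private)) {
                struct shmem_falloc *shmem_falloc = inode->i_private;

                if (shmem_falloc &&
                    vmf->pgoff >= shmem_falloc->start &&
                    vmf->pgoff < shmem_falloc->next)
                        return VM_FAULT_RETRY; /* or sleep on i_mutex when safe */
        }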

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush
    in there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's
    easily handled with an ACCESS_ONCE() before testing both conditions.
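
    Roughly, the check in mm_find_pmd() becomes (sketch):

        pmd_t pmde = ACCESS_ONCE(*pmd);

        /*
         * A racing THP fault may momentarily leave *pmd neither present
         * nor trans_huge; read it once and bail out in either case.
         */
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return NULL;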

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
    in copy_page_rep() called from copy_user_huge_page() called from
    do_huge_pmd_wp_page().

    I believe this is a DEBUG_PAGEALLOC false positive, due to the source
    page being split, and a tail page freed, while copy is in progress; and
    not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
    prevent a miscopy from being made visible.

    Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
    the usual get_page() and put_page() on head page in the usual config;
    but get and put references to all of the tail pages when
    DEBUG_PAGEALLOC.
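
    A sketch of what such a helper could look like (illustrative only;
    HPAGE_PMD_NR is the number of subpages in a THP):

        static void get_user_huge_page(struct page *page)
        {
                if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
                        struct page *endpage = page + HPAGE_PMD_NR;

                        /* pin every subpage so no tail page is freed mid-copy */
                        for (; page < endpage; page++)
                                get_page(page);
                } else {
                        get_page(page);
                }
        }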

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.
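
    A sketch of the handler logic (the constant name is an assumption):

        /* 8 is the documented minimum fraction; 0 restores the default */
        #define MIN_PERCPU_PAGELIST_FRACTION    (8)

        /* Reject 1..7; accept 0 (restore default) and anything >= 8 */
        if (percpu_pagelist_fraction &&
            percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION)
                return -EINVAL;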

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There's a race between fork() and hugepage migration; as a result we
    try to "dereference" a swap entry as a normal pte, causing a kernel
    panic. The cause of the problem is that copy_hugetlb_page_range()
    can't handle the "swap entry" family (migration entries and hwpoisoned
    entries), so let's fix it.
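
    The shape of the fix, roughly (sketch only; the real code also writes
    the downgraded entry back to the source pte):

        entry = huge_ptep_get(src_pte);
        if (unlikely(!huge_pte_none(entry) && !pte_present(entry))) {
                swp_entry_t swp_entry = pte_to_swp_entry(entry);

                if (is_write_migration_entry(swp_entry) && cow) {
                        /* COW mappings require writable entries to be read-only */
                        make_migration_entry_read(&swp_entry);
                        entry = swp_entry_to_pte(swp_entry);
                }
                set_huge_pte_at(dst, addr, dst_pte, entry);
        }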

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
    support being added to fallocate(), but didn't realize until now that
    I had been too stupid to future-proof shmem_fallocate() against new
    additions. Return -EOPNOTSUPP for unsupported modes instead of going
    on to ordinary fallocation.
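
    The fix amounts to a mode check at the top of shmem_fallocate()
    (sketch):

        if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                return -EOPNOTSUPP;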

    Signed-off-by: Hugh Dickins
    Reviewed-by: Lukas Czerner
    Cc: [3.15]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mm can be removed from the current task struct, so use the previous
    vma->vm_mm instead.

    It crashes on blackfin after updating to Linux 3.15. The commit "mm:
    per-thread vma caching" caused the crash: mm can be removed from the
    current task struct before

    mmput()->
    exit_mmap()->
    delete_vma_from_mm()

    the detailed fault information:

    NULL pointer access
    Kernel OOPS in progress
    Deferred Exception context
    CURRENT PROCESS:
    COMM=modprobe PID=278 CPU=0
    invalid mm
    return address: [0x000531de]; contents of:
    0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8
    0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8
    0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090]
    0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8

    CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
    task: 0572b720 ti: 0569e000 task.ti: 0569e000
    Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
    ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
    Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014

    SEQUENCER STATUS: Not tainted
    SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806
    EXCAUSE : 0x27
    physical IVG3 asserted : { _trap + 0x0 }
    physical IVG15 asserted : { _evt_system_call + 0x0 }
    logical irq 6 mapped : { _bfin_coretmr_interrupt + 0x0 }
    logical irq 7 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 11 mapped : { _l2_ecc_err + 0x0 }
    logical irq 13 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 39 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    logical irq 40 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    RETE: /* Maybe null pointer? */
    RETN: /* kernel dynamic memory (maybe user-space) */
    RETX: /* Maybe fixed code section */
    RETS: { _exit_mmap + 0x28 }
    PC : { _delete_vma_from_mm + 0x92 }
    DCPLB_FAULT_ADDR: /* Maybe null pointer? */
    ICPLB_FAULT_ADDR: { _delete_vma_from_mm + 0x92 }
    PROCESSOR STATE:
    R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000
    R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720
    P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0
    P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74
    LB0: 057f523f LT0: 057f523e LC0: 00000000
    LB1: 0005317c LT1: 00053172 LC1: 00000002
    B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000
    B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff
    B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000
    B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000
    A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
    USP : 056ffcf8 ASTAT: 02003024

    Hardware Trace:
    0 Target : { _trap_c + 0x0 }
    Source : { _exception_to_level5 + 0xa0 } JUMP.L
    1 Target : { _exception_to_level5 + 0x0 }
    Source : { _bfin_return_from_exception + 0x6 } RTX
    2 Target : { _bfin_return_from_exception + 0x0 }
    Source : { _ex_trap_c + 0x70 } JUMP.S
    3 Target : { _ex_trap_c + 0x0 }
    Source : { _trap + 0x2a } JUMP (P4)
    4 Target : { _trap + 0x0 }
    FAULT : { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
    Source : { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
    5 Target : { _delete_vma_from_mm + 0x8e }
    Source : { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
    6 Target : { _delete_vma_from_mm + 0x0 }
    Source : { _exit_mmap + 0x24 } JUMP.L
    7 Target : { _exit_mmap + 0x1c }
    Source : { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
    8 Target : { _exit_mmap + 0x34 }
    Source : { __cond_resched + 0x20 } RTS
    9 Target : { __cond_resched + 0x0 }
    Source : { _exit_mmap + 0x30 } JUMP.L
    10 Target : { _exit_mmap + 0x30 }
    Source : { _delete_vma + 0xb2 } RTS
    11 Target : { _delete_vma + 0xac }
    Source : { _kmem_cache_free + 0xba } RTS
    12 Target : { _kmem_cache_free + 0xa8 }
    Source : { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
    13 Target : { _kmem_cache_free + 0x92 }
    Source : { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
    14 Target : { _kmem_cache_free + 0x34 }
    Source : { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
    15 Target : { _kmem_cache_free + 0x0 }
    Source : { _delete_vma + 0xa8 } JUMP.L
    Kernel Stack
    Stack info:
    SP: [0x0569ff24] /* kernel dynamic memory (maybe user-space) */
    Memory from 0x0569ff20 to 056a0000
    0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a
    0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000
    0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000
    0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800
    0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001
    0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001
    Return addresses in stack:
    address : { _show_cpuinfo + 0x2d2 }
    Modules linked in:
    Kernel panic - not syncing: Kernel exception
    [ end Kernel panic - not syncing: Kernel exception

    Signed-off-by: Steven Miao
    Acked-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Miao
     

15 Jun, 2014

1 commit

  • Tetsuo Handa wrote:
    "Commit 62a8067a7f35 ("bio_vec-backed iov_iter") introduced an unnamed
    union inside a struct which gcc-4.4.7 cannot handle. Name the unnamed
    union as u in order to fix build failure"

    Let's do this instead: there is only one place in the entire tree that
    steps into this breakage. Anon structs and unions work in older gcc
    versions; as a matter of fact, we have those in the tree - see e.g.
    struct ieee80211_tx_info in include/net/mac80211.h

    What doesn't work is handling their initializers:

    struct {
            int a;
            union {
                    int b;
                    char c;
            };
    } x[2] = {{.a = 1, .c = 'a'}, {.a = 0, .b = 1}};

    is the obvious syntax for an initializer, perfectly fine for C11 and
    handled correctly by gcc-4.7 or later.

    Earlier versions, though, break on it - declaration is fine and so's
    access to fields (i.e. x[0].c = 'a'; would produce the right code), but
    members of the anon structs and unions are not inserted into the right
    namespace. Tellingly, those older versions will not barf on struct {int
    a; struct {int a;};}; - looks like they just have it hacked up somewhere
    around the handling of . and -> instead of doing the right thing.

    The easiest way to deal with that crap is to turn the initialization
    of those fields (in the only place where we have such an initializer
    of iov_iter) into plain assignment.
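
    A self-contained illustration of the workaround (plain userspace C,
    not the kernel code in question):

        #include <stdio.h>

        int main(void)
        {
                struct {
                        int a;
                        union {
                                int b;
                                char c;
                        };
                } x[2] = {{.a = 1}, {.a = 0}};

                /*
                 * Assigning to the anon-union members works even where
                 * the designated-initializer form breaks on old gcc.
                 */
                x[0].c = 'a';
                x[1].b = 1;

                printf("%c %d\n", x[0].c, x[1].b);
                return 0;
        }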

    Reported-by: Tetsuo Handa
    Reported-by: Russell King
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits


10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

09 Jun, 2014

3 commits

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "Clean ups and miscellaneous bug fixes, in particular for the new
    collapse_range and zero_range fallocate functions. In addition,
    improve the scalability of adding and removing inodes from the orphan
    list"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: handle symlink properly with inline_data
    ext4: fix wrong assert in ext4_mb_normalize_request()
    ext4: fix zeroing of page during writeback
    ext4: remove unused local variable "stored" from ext4_readdir(...)
    ext4: fix ZERO_RANGE test failure in data journalling
    ext4: reduce contention on s_orphan_lock
    ext4: use sbi in ext4_orphan_{add|del}()
    ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
    ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
    ext4: remove unnecessary double parentheses
    ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
    ext4: make local functions static
    ext4: fix block bitmap validation when bigalloc, ^flex_bg
    ext4: fix block bitmap initialization under sparse_super2
    ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
    ext4: avoid unneeded lookup when xattr name is invalid
    ext4: fix data integrity sync in ordered mode
    ext4: remove obsoleted check
    ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
    ext4: fix locking for O_APPEND writes
    ...

    Linus Torvalds
     
  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

07 Jun, 2014

14 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
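
    For reference, a typical file-local pr_fmt definition looks like this
    (it must be defined before the printk headers are pulled in):

        #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

        pr_warn("allocation failed\n"); /* message is prefixed automatically */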

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Signed-off-by: Fabian Frederick
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting nonlinear mappings requires a significant amount of
    non-trivial code in the kernel virtual memory subsystem, including hot
    paths. Also, to make nonlinear mappings work, the kernel needs a way
    to distinguish normal page table entries from entries with a file
    offset (pte_file). The kernel reserves a flag in the PTE for this
    purpose. PTE flags are a scarce resource, especially on some CPU
    architectures, and it would be nice to free up the flag for other
    usage.

    Fortunately, there are not many users of remap_file_pages() in the
    wild. The only known user is one enterprise RDBMS implementation that
    uses the syscall on 32-bit systems to map files bigger than can
    linearly fit into the 32-bit virtual address space. This use-case is
    not critical anymore since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of the emulation (apart from performance) is that the
    user can hit the vm.max_map_count limit more easily due to additional
    VMAs. See the comment for DEFAULT_MAX_MAP_COUNT for more details on
    the limit.
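
    A sketch of how such a deprecation warning might look (the exact
    message and documentation path are assumptions):

        /* SYSCALL_DEFINE5(remap_file_pages, ...) */
        pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
                     "See Documentation/vm/remap_file_pages.txt.\n",
                     current->comm, current->pid);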

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg zoneinfo lookup sites have either the page, the zone, or the node
    id and zone index, but sites that only have the zone have to look up the
    node id and zone index themselves, whereas sites that already have those
    two integers use a function for a simple pointer chase.

    Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer, and let
    sites that already have the node id and zone index - all the
    for-each-node, for-each-zone iterators - use
    &memcg->nodeinfo[nid]->zoneinfo[zid] directly.

    Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match.
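
    A sketch of the new helper, following the lookup described above (the
    return type is assumed from the existing memcg zoneinfo structures):

        static struct mem_cgroup_per_zone *
        mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
        {
                int nid = zone_to_nid(zone);
                int zid = zone_idx(zone);

                return &memcg->nodeinfo[nid]->zoneinfo[zid];
        }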

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Kmemleak could ignore memory blocks allocated via memblock_alloc()
    leading to false positives during scanning. This patch adds the
    corresponding callbacks and removes kmemleak_free_* calls in
    mm/nobootmem.c to avoid duplication.

    The kmemleak_alloc() in mm/nobootmem.c is kept since
    __alloc_memory_core_early() does not use memblock_alloc() directly.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • When mempool_alloc() returns an existing pool object, kmemleak_alloc()
    is no longer called and the stack trace corresponds to the original
    object allocation. This patch updates the kmemleak allocation stack
    trace for such objects to make it more useful for debugging.
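
    Illustratively, the update happens where mempool_alloc() hands out a
    pre-allocated element (sketch):

        element = pool->elements[--pool->curr_nr];
        spin_unlock_irqrestore(&pool->lock, flags);
        /* re-point the recorded kmemleak trace at this allocation site */
        kmemleak_update_trace(element);
        return element;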

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • The memory allocation stack trace is not always useful for debugging a
    memory leak (e.g. radix_tree_preload). This function, when called,
    updates the stack trace for an already allocated object.

    Signed-off-by: Catalin Marinas
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Signed-off-by: Jianpeng Ma
    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianpeng Ma
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for the difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining
    anon vs. file proportional scanning of the LRU, which is memcg
    specific (unlike charges, which are propagated up the hierarchy), so
    it should be applied to the particular memcg's LRU regardless of where
    the memory pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
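
    Roughly, the per-reclaim decision becomes (sketch; global_reclaim() is
    the existing vmscan helper that tests for non-memcg reclaim):

        /* decided once per reclaim and carried in struct scan_control */
        sc->swappiness = global_reclaim(sc) ? vm_swappiness
                                            : mem_cgroup_swappiness(memcg);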

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently, if the allocation is not constrained to a particular node
    (NUMA_NO_NODE), we search for a partial slab on the numa_node_id()
    node. This doesn't work properly on a system with memoryless nodes,
    since such a node has no memory and therefore no partial slabs.

    On a memoryless node, page allocation always falls back to
    numa_mem_id() first. So searching for a partial slab on numa_mem_id()
    is the proper solution for the memoryless node case.
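
    The change in get_partial() is roughly (sketch):

        int searchnode = node;

        if (node == NUMA_NO_NODE)
                searchnode = numa_mem_id();     /* was numa_node_id() */

        object = get_partial_node(s, get_node(s, searchnode), c, flags);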

    Signed-off-by: Joonsoo Kim
    Acked-by: Nishanth Aravamudan
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Wanpeng Li
    Cc: Han Pingtian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
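
    The cleanup at the end of the kswapd thread function amounts to
    (sketch):

        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        lockdep_clear_current_reclaim_state();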

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The page table walker doesn't check non-present hugetlb entries in the
    common path, so hugetlb_entry() callbacks must check them. The reason
    for this behavior is that some callers want to handle them in their
    own way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks to the buggy callbacks.
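
    The callback-side fix looks roughly like this (sketch with a
    hypothetical callback name):

        static int some_hugetlb_entry(pte_t *pte, unsigned long hmask,
                        unsigned long addr, unsigned long end,
                        struct mm_walk *walk)
        {
                /* skip migration/hwpoison entries: not a present huge page */
                if (!pte_present(*pte))
                        return 0;
                /* ... existing accounting of the huge page ... */
                return 0;
        }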

    This bug has existed for years and became visible with the
    introduction of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jun, 2014

1 commit

  • While working on the address sanitizer for the kernel I've discovered
    a use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child
    anon_vma. Later, in anon_vma_free(anon_vma), we reference the already
    freed anon_vma->root to check the rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Cc: # v3.0+
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

05 Jun, 2014

3 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • zswap_dstmem is a percpu block of memory, which should be allocated using
    kmalloc_node(), to get better NUMA locality.

    Without it, all the blocks are allocated from a single node.
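
    The change amounts to (sketch; PAGE_SIZE * 2 is kept as in the
    existing zswap code):

        /* was: dst = kmalloc(PAGE_SIZE * 2, GFP_KERNEL); */
        dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));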

    Signed-off-by: Eric Dumazet
    Acked-by: Seth Jennings
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Now we can build zsmalloc as a module, because unmap_kernel_range was
    exported.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim