06 Nov, 2013

2 commits

  • Currently seqlocks and seqcounts don't support lockdep.

    After running across a seqcount related deadlock in the timekeeping
    code, I used a less-refined and more focused variant of this patch
    to narrow down the cause of the issue.

    This is a first-pass attempt to properly enable lockdep functionality
    on seqlocks and seqcounts.

    Since seqcounts are used in the vdso gettimeofday code, I've provided
    non-lockdep accessors for those needs.

    I've also handled one case where there were nested seqlock writers
    and there may be more edge cases.

    Comments and feedback would be appreciated!
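
    For readers unfamiliar with the primitive, the following is a minimal
    user-space model of the seqcount read/write protocol, written with C11
    atomics. It reuses the kernel's function names only for readability; the
    real <linux/seqlock.h> API, and the lockdep-aware and non-lockdep
    accessors this patch deals with, differ in detail.

    /*
     * Illustrative user-space seqcount model, not the kernel API.
     * Writers bump the sequence to odd on entry and back to even on exit;
     * readers retry if they saw an odd or changed sequence.
     * Build: cc -std=c11 seqcount_demo.c
     */
    #include <stdatomic.h>
    #include <stdio.h>

    struct seqcount { atomic_uint sequence; };

    static void write_seqcount_begin(struct seqcount *s)
    {
            atomic_fetch_add(&s->sequence, 1);      /* sequence becomes odd */
    }

    static void write_seqcount_end(struct seqcount *s)
    {
            atomic_fetch_add(&s->sequence, 1);      /* sequence becomes even */
    }

    static unsigned read_seqcount_begin(struct seqcount *s)
    {
            unsigned seq;
            do {
                    seq = atomic_load(&s->sequence);
            } while (seq & 1);                      /* writer in progress */
            return seq;
    }

    static int read_seqcount_retry(struct seqcount *s, unsigned start)
    {
            return atomic_load(&s->sequence) != start;
    }

    int main(void)
    {
            struct seqcount sc;
            unsigned long long data = 0;
            unsigned seq;

            atomic_init(&sc.sequence, 0);

            write_seqcount_begin(&sc);
            data = 42;                              /* update under the write side */
            write_seqcount_end(&sc);

            do {
                    seq = read_seqcount_begin(&sc);
                    printf("read %llu\n", data);    /* consistent snapshot */
            } while (read_seqcount_retry(&sc, seq));
            return 0;
    }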

    Signed-off-by: John Stultz
    Signed-off-by: Peter Zijlstra
    Cc: Eric Dumazet
    Cc: Li Zefan
    Cc: Mathieu Desnoyers
    Cc: Steven Rostedt
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     
  • Conflicts:
    kernel/Makefile

    There are conflicts in kernel/Makefile due to file moving in the
    scheduler tree - resolve them.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

02 Nov, 2013

1 commit

  • When a memcg is deleted, mem_cgroup_reparent_charges() moves charged
    memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
    cgroup writeback pages accounting" there's a bad pointer read. The goal
    was to check for counter underflow. The counter is a per-cpu counter
    and there are two problems with the code:

    (1) the per-cpu access function isn't used; instead a naked pointer is
    used, which easily causes an oops.
    (2) the check doesn't sum over all CPUs.

    Test:
    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ echo 3 > /proc/sys/vm/drop_caches
    $ (echo $BASHPID >> x/tasks && exec cat) &
    [1] 7154
    $ grep ^mapped x/memory.stat
    mapped_file 53248
    $ echo 7154 > tasks
    $ rmdir x

    The fix is to remove the check. It's currently dangerous, and it isn't
    worth making it use something expensive, such as percpu_counter_sum(),
    for each reparented page. __this_cpu_read() isn't enough to fix this
    because there are no guarantees about the current CPU's count. The only
    guarantee is that the sum of all per-cpu counters is >= nr_pages.
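
    As a toy user-space illustration of that last point (plain arrays stand
    in for the per-cpu slots; this is not memcg code), a single slot of a
    split counter can legitimately go negative even though the logical count
    never does:

    #include <stdio.h>

    int main(void)
    {
            /* two "per-cpu" slots of one logical counter */
            long percpu[2] = { 0, 0 };

            percpu[0] += 3;         /* pages charged while running on cpu0 */
            percpu[1] -= 2;         /* some of them uncharged while on cpu1 */

            long sum = percpu[0] + percpu[1];
            printf("cpu0=%ld cpu1=%ld sum=%ld\n", percpu[0], percpu[1], sum);

            /* cpu1's slot is negative although the logical count (the sum)
             * is 1, so checking a single slot for underflow gives false
             * alarms, which is why the check was removed */
            return 0;
    }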

    Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
    Reported-and-tested-by: Flavio Leitner
    Signed-off-by: Greg Thelen
    Reviewed-by: Sha Zhengju
    Acked-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

01 Nov, 2013

6 commits

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Merge four more fixes from Andrew Morton.

    * emailed patches from Andrew Morton:
    lib/scatterlist.c: don't flush_kernel_dcache_page on slab page
    mm: memcg: fix test for child groups
    mm: memcg: lockdep annotation for memcg OOM lock
    mm: memcg: use proper memcg in limit bypass

    Linus Torvalds
     
  • When memcg code needs to know whether any given memcg has children, it
    uses the cgroup child iteration primitives and returns true/false
    depending on whether the iteration loop is executed at least once or
    not.

    Because a cgroup's list of children is RCU protected, these primitives
    require the RCU read-lock to be held, which is not the case for all
    memcg callers. This results in the following splat when e.g. enabling
    hierarchy mode:

    WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
    CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
    Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
    Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

    In the memcg case, we only care about children when we are attempting to
    modify inheritable attributes interactively. Racing with deletion could
    mean a spurious -EBUSY, no problem. Racing with addition is handled
    just fine as well through the memcg_create_mutex: if the child group is
    not on the list after the mutex is acquired, it won't be initialized
    from the parent's attributes until after the unlock.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM lock is a mutex-type lock that is open-coded due to
    memcg's special needs. Add annotations for lockdep coverage.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
    fail to reclaim enough memory for the charge. But because the main test
    case was on a 3.2-based system, the patch missed the fact that on newer
    kernels the charge function needs to return root_mem_cgroup when
    bypassing the limit, and not NULL. This will corrupt whatever memory is
    at NULL + percpu pointer offset. Fix this quickly before problems are
    reported.
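
    A toy illustration of why the bypass path must hand back a usable group
    object rather than NULL (the names here are made up; this is not the
    memcg code):

    #include <stdio.h>

    struct mem_group { long stat; };

    static struct mem_group root_group;

    /* bypass path: return the root group instead of NULL */
    static struct mem_group *charge_bypass(void)
    {
            return &root_group;
    }

    int main(void)
    {
            struct mem_group *g = charge_bypass();

            /* callers dereference the result unconditionally; had NULL been
             * returned, this would scribble near address zero (plus a percpu
             * pointer offset in the kernel case) */
            g->stat++;
            printf("root stat = %ld\n", g->stat);
            return 0;
    }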

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull NUMA balancing memory corruption fixes from Ingo Molnar:
    "So these fixes are definitely not something I'd like to sit on, but as
    I said to Mel at the KS the timing is quite tight, with Linus planning
    v3.12-final within a week.

    Fedora-19 is affected:

    comet:~> grep NUMA_BALANCING /boot/config-3.11.3-201.fc19.x86_64

    CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
    CONFIG_NUMA_BALANCING=y

    AFAICS Ubuntu will be affected as well, once it updates the kernel:

    hubble:~> grep NUMA_BALANCING /boot/config-3.8.0-32-generic

    CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
    CONFIG_NUMA_BALANCING=y

    These 6 commits are a minimalized set of cherry-picks needed to fix
    the memory corruption bugs. All commits are fixes, except "mm: numa:
    Sanitize task_numa_fault() callsites" which is a cleanup that made two
    followup fixes simpler.

    I've done targeted testing with just this SHA1 to try to make sure
    there are no cherry-picking artifacts. The original non-cherry-picked
    set of fixes were exposed to linux-next for a couple of weeks"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Account for a THP NUMA hinting update as one PTE update
    mm: Close races between THP migration and PMD numa clearing
    mm: numa: Sanitize task_numa_fault() callsites
    mm: Prevent parallel splits during THP migration
    mm: Wait for THP migrations to complete during NUMA hinting faults
    mm: numa: Do not account for a hinting fault if we raced

    Linus Torvalds
     

31 Oct, 2013

4 commits

  • Merge three fixes from Andrew Morton.

    * emailed patches from Andrew Morton:
    memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
    percpu: fix this_cpu_sub() subtrahend casting for unsigneds
    mm/pagewalk.c: fix walk_page_range() access of wrong PTEs

    Linus Torvalds
     
  • As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages
    accounting") memcg counter errors are possible when moving charged
    memory to a different memcg. Charge movement occurs when processing
    writes to memory.force_empty, moving tasks to a memcg with
    memory.move_charge_at_immigrate=1, or memcg deletion.

    An example showing error after memory.force_empty:

    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ rm /data/tmp/file
    $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
    [1] 13600
    $ grep ^mapped x/memory.stat
    mapped_file 1048576
    $ echo 13600 > tasks
    $ echo 1 > x/memory.force_empty
    $ grep ^mapped x/memory.stat
    mapped_file 4503599627370496

    mapped_file should end with 0.
    4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
    1048576 == 0x10,0000 == 0x100 pages

    This issue only affects the source memcg on 64-bit machines; the
    destination memcg counters are correct. So the rmdir case is not too
    important because those counters soon disappear with the entire memcg.
    But the memory.force_empty and memory.move_charge_at_immigrate=1 cases
    are larger problems, as the bogus counters are visible for the
    (possibly long) remaining life of the source memcg.

    The problem is due to memcg's use of __this_cpu_add(.., -nr_pages), which
    is subtly wrong because it negates the unsigned int nr_pages (the
    intended delta is either -1, or -512 for THP) before adding it to a
    signed long percpu counter. When nr_pages is 1, -nr_pages is 0xffffffff.
    On 64-bit machines stat->count[idx] is signed 64-bit, so memcg's attempt
    to simply decrement a count (e.g. from 1 to 0) boils down to:

    long count = 1
    unsigned int nr_pages = 1
    count += -nr_pages /* -nr_pages == 0xffff,ffff */
    count is now 0x1,0000,0000 instead of 0
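
    The promotion above can be reproduced directly in user space, assuming a
    64-bit long as on the 64-bit machines described here:

    #include <stdio.h>

    int main(void)
    {
            long count = 1;             /* stands in for the 64-bit percpu stat */
            unsigned int nr_pages = 1;

            count += -nr_pages;         /* -nr_pages == 0xffffffff, zero-extended */
            printf("add -nr_pages: count = %#lx\n", (unsigned long)count);  /* 0x100000000 */

            count = 1;
            count -= nr_pages;          /* subtract instead of adding a negation */
            printf("sub  nr_pages: count = %#lx\n", (unsigned long)count);  /* 0 */
            return 0;
    }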

    The fix is to subtract the unsigned page count rather than adding its
    negation. This only works once "percpu: fix this_cpu_sub() subtrahend
    casting for unsigneds" is applied to fix this_cpu_sub().

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • When walk_page_range() walks a memory map's page tables, it skips
    VM_PFNMAP areas; the variable 'next' is then assigned vma->vm_end, which
    may be larger than 'end'. In the next loop iteration, 'addr' ends up
    larger than 'next'. Then, in the /proc/XXXX/pagemap read path, 'addr'
    grows forever in pagemap_pte_range() and pte_to_pagemap_entry() accesses
    the wrong pte.

    BUG: Bad page map in process procrank pte:8437526f pmd:785de067
    addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d
    CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1
    Call Trace:
    dump_stack+0x16/0x18
    print_bad_pte+0x114/0x1b0
    vm_normal_page+0x56/0x60
    pagemap_pte_range+0x17a/0x1d0
    walk_page_range+0x19e/0x2c0
    pagemap_read+0x16e/0x200
    vfs_read+0x84/0x150
    SyS_read+0x4a/0x80
    syscall_call+0x7/0xb
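
    A minimal sketch of the clamping this description points at, with made-up
    addresses (it is not necessarily the exact merged fix): 'next' has to be
    limited to the requested 'end' when a VMA is skipped.

    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
            unsigned long end = 0x3000;         /* end of the requested walk   */
            unsigned long vma_end = 0x9000;     /* skipped VM_PFNMAP vma's end */

            unsigned long next_unclamped = vma_end;             /* runs past 'end' */
            unsigned long next_clamped = MIN(end, vma_end);     /* stays in range  */

            printf("end=%#lx unclamped next=%#lx clamped next=%#lx\n",
                   end, next_unclamped, next_clamped);
            return 0;
    }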

    Signed-off-by: Liu ShuoX
    Signed-off-by: Chen LinX
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Cc: [3.10.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen LinX
     
  • I've seen a fair number of issues with kswapd and other processes
    appearing to get stuck in v3.12-rc. Using sysrq-p many times seems to
    indicate that it gets stuck somewhere in list_lru_walk_node(), called
    from prune_icache_sb() and super_cache_scan().

    I never seem to be able to trigger a calltrace for functions above that
    point.

    So I decided to add the following to super_cache_scan():

    @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
    dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
    total_objects = dentries + inodes + fs_objects + 1;
    +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects);

    /* proportion the scan between the caches */
    dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
    inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
    +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes);
    +BUG_ON(dentries == 0);
    +BUG_ON(inodes == 0);

    /*
    * prune the dcache first as the icache is pinned by it, then
    @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    freed += sb->s_op->free_cached_objects(sb, fs_objects,
    sc->nid);
    }
    -
    +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed);
    drop_super(sb);
    return freed;
    }

    and shortly thereafter, having applied some pressure, I got this:

    update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
    update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
    ------------[ cut here ]------------
    Kernel BUG at c0101994 [verbose debug info unavailable]
    Internal error: Oops - BUG: 0 [#3] SMP ARM
    Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
    CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G D 3.12.0-rc7+ #154
    task: daea1200 ti: c3bf8000 task.ti: c3bf8000
    PC is at super_cache_scan+0x1c0/0x278
    LR is at trace_hardirqs_on+0x14/0x18
    Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
    ...
    Backtrace:
    (super_cache_scan) from [] (shrink_slab+0x254/0x4c8)
    (shrink_slab) from [] (try_to_free_pages+0x3a0/0x5e0)
    (try_to_free_pages) from [] (__alloc_pages_nodemask+0x5)
    (__alloc_pages_nodemask) from [] (__pte_alloc+0x2c/0x13)
    (__pte_alloc) from [] (handle_mm_fault+0x84c/0x914)
    (handle_mm_fault) from [] (do_page_fault+0x1f0/0x3bc)
    (do_page_fault) from [] (do_translation_fault+0xac/0xb8)
    (do_translation_fault) from [] (do_DataAbort+0x38/0xa0)
    (do_DataAbort) from [] (__dabt_usr+0x38/0x40)

    Notice that we had a very low number of inodes, which were reduced to
    zero by mult_frac().

    Now, prune_icache_sb() calls list_lru_walk_node(), passing that number
    of inodes (0) in as the number of objects to scan:

    long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
    int nid)
    {
    LIST_HEAD(freeable);
    long freed;

    freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
    &freeable, &nr_to_scan);

    which does:

    unsigned long
    list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
    void *cb_arg, unsigned long *nr_to_walk)
    {

    struct list_lru_node *nlru = &lru->node[nid];
    struct list_head *item, *n;
    unsigned long isolated = 0;

    spin_lock(&nlru->lock);
    restart:
    list_for_each_safe(item, n, &nlru->list) {
    enum lru_status ret;

    /*
    * decrement nr_to_walk first so that we don't livelock if we
    * get stuck on large numbesr of LRU_RETRY items
    */
    if (--(*nr_to_walk) == 0)
    break;

    So, if *nr_to_walk was zero when this function was entered, that means
    we're wanting to operate on (~0UL)+1 objects - which might as well be
    infinite.

    Clearly this is not correct behaviour. If we think about the behaviour
    of this function when *nr_to_walk is 1, then clearly it's wrong - we
    decrement first and then test for zero - which results in us doing
    nothing at all. A post-decrement would give the desired behaviour -
    we'd try to walk one object and one object only if *nr_to_walk were one.

    It also gives the correct behaviour for zero - we exit at this point.
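
    The two behaviours can be compared with a stand-alone analogue of the
    loop. The "checked" variant below tests before decrementing, which also
    avoids the 0 -> ~0UL underflow noted in the bracketed comment further
    down; the "predec" variant mirrors the quoted code.

    #include <stdio.h>

    static unsigned long walk_predec(unsigned long nr_to_walk, unsigned long items)
    {
            unsigned long walked = 0;
            for (unsigned long i = 0; i < items; i++) {
                    if (--nr_to_walk == 0)      /* decrement first, then test */
                            break;
                    walked++;
            }
            return walked;
    }

    static unsigned long walk_checked(unsigned long nr_to_walk, unsigned long items)
    {
            unsigned long walked = 0;
            for (unsigned long i = 0; i < items; i++) {
                    if (!nr_to_walk)            /* test first: no underflow */
                            break;
                    nr_to_walk--;
                    walked++;
            }
            return walked;
    }

    int main(void)
    {
            /* pre-decrement: nr=0 walks everything, nr=1 walks nothing */
            printf("predec : nr=0 -> %lu, nr=1 -> %lu\n",
                   walk_predec(0, 10), walk_predec(1, 10));
            /* checked: nr=0 walks nothing, nr=1 walks exactly one item */
            printf("checked: nr=0 -> %lu, nr=1 -> %lu\n",
                   walk_checked(0, 10), walk_checked(1, 10));
            return 0;
    }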

    Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour")
    Signed-off-by: Russell King
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    [ Modified to make sure we never underflow the count: this function gets
    called in a loop, so the 0 -> ~0ul transition is dangerous - Linus ]
    Signed-off-by: Linus Torvalds

    Russell King
     

29 Oct, 2013

6 commits

  • A THP PMD update is accounted for as 512 pages updated in vmstat. This is
    a large difference when estimating the cost of automatic NUMA balancing
    and can be misleading when comparing results that had collapsed versus
    split THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • THP migration uses the page lock to guard against parallel allocations
    but there are cases like this still open:

    Task A                              Task B
    ---------------------               ---------------------
    do_huge_pmd_numa_page               do_huge_pmd_numa_page
    lock_page
    mpol_misplaced == -1
    unlock_page
    goto clear_pmdnuma
                                        lock_page
                                        mpol_misplaced == 2
                                        migrate_misplaced_transhuge
    pmd = pmd_mknonnuma
    set_pmd_at

    During hours of testing, one crashed with weird errors and while I have
    no direct evidence, I suspect something like the race above happened.
    This patch extends the page lock to being held until the pmd_numa is
    cleared to prevent migration starting in parallel while the pmd_numa is
    being cleared. It also flushes the old pmd entry and orders pagetable
    insertion before rmap insertion.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • There are three callers of task_numa_fault():

    - do_huge_pmd_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_pmd_numa_page():
    Accounts not at all when the page isn't migrated, otherwise
    accounts against the node we migrated towards.

    This seems wrong to me; all three sites should have the same semantics.
    Furthermore, we should account against where the page really is; we
    already know where the task is.

    So modify all three sites to always account; we did after all receive
    the fault; and always account to where the page is after migration,
    regardless of success.

    They all still differ on when they clear the PTE/PMD; ideally that
    would get sorted too.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • THP migrations are serialised by the page lock but on its own that does
    not prevent THP splits. If the page is split during THP migration then
    the pmd_same checks will prevent page table corruption but the unlock page
    and other fix-ups potentially will cause corruption. This patch takes the
    anon_vma lock to prevent parallel splits during migration.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • The locking for migrating THP is unusual. While normal page migration
    prevents parallel accesses using a migration PTE, THP migration relies on
    a combination of the page_table_lock, the page lock and the existence of
    the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.

    If a THP page is currently being migrated and another thread traps a
    fault on the same page it checks if the page is misplaced. If it is not,
    then pmd_numa is cleared. The problem is that it checks if the page is
    misplaced without holding the page lock meaning that the racing thread
    can be migrating the THP when the second thread clears the NUMA bit
    and faults a stale page.

    This patch checks if the page is potentially being migrated and, if so,
    stalls using lock_page before checking whether the page is misplaced.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • If another task handled a hinting fault in parallel then do not double
    account for it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

12 commits

  • Revert commit 1ecfd533f4c5 ("mm/mremap.c: call pud_free() after fail
    calling pmd_alloc()").

    The original code was correct: pud_alloc(), pmd_alloc(), pte_alloc_map()
    ensure that the pud, pmd, pt is already allocated, and seldom do they
    need to allocate; on failure, upper levels are freed if appropriate by
    the subsequent do_munmap(). Whereas commit 1ecfd533f4c5 did an
    unconditional pud_free() of a most-likely still-in-use pud: saved only
    by the near-impossibility of pmd_alloc() failing.

    Signed-off-by: Hugh Dickins
    Cc: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
    __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).

    It's invalid: we don't always have down_write of mmap_sem there: a racing
    do_huge_pmd_wp_page() might have copied-on-write to another huge page
    before our split_huge_page() got the anon_vma lock.

    Forget the BUG_ON, just go back and try again if this happens.

    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix race between swapoff and swapon. Swapoff used old_block_size from
    swap_info outside of swapon_mutex so it could be overwritten by
    concurrent swapon.

    The race has a visible effect only if more than one swap block device
    exists with different block sizes (e.g. /dev/sda1 with block size 4096
    and /dev/sdb1 with 512). In such a case it leads to setting the block
    size of the swapped-off device to the wrong value.

    The bug can be triggered with multiple concurrent swapoff and swapon:
    0. Swap for some device is on.
    1. swapoff:
    First the swapoff is called on this device and "struct swap_info_struct
    *p" is assigned. This is done under swap_lock however this lock is
    released for the call try_to_unuse().

    2. swapon:
    After the assignment above (and before swapoff acquires swapon_mutex &
    swap_lock), swapon is called on the same device.
    p->old_block_size is assigned the block size of the device. This block
    size should be the same as the previous one, but sometimes it is not.
    The swapon ends successfully.

    3. swapoff:
    Swapoff resumes, grabs the locks and mutex, and continues to disable this
    swap device. Now it sets the block size to the value taken from swap_info,
    which was overwritten by swapon in step 2.

    Signed-off-by: Krzysztof Kozlowski
    Reported-by: Weijie Yang
    Cc: Bob Liu
    Cc: Konrad Rzeszutek Wilk
    Cc: Shaohua Li
    Cc: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Toralf runs trinity on UML/i386. After some time it hangs and the last
    message line is

    BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]

    It's found that pages_dirtied becomes very large. More than 1000000000
    pages in this case:

    period = HZ * pages_dirtied / task_ratelimit;
    BUG_ON(pages_dirtied > 2000000000);
    BUG_ON(pages_dirtied > 1000000000);

    With a debugging check added to balance_dirty_pages():

    + if (pause < 0) {
    +         extern int printf(char *, ...);
    +         printf("ick : pause : %li\n", pause);
    +         printf("ick: pages_dirtied : %lu\n", pages_dirtied);
    +         printf("ick: task_ratelimit: %lu\n", task_ratelimit);
    +         BUG_ON(1);
    + }
    trace_balance_dirty_pages(bdi,

    Since pause is bounded by [min_pause, max_pause], where min_pause is also
    bounded by max_pause, it's suspected and demonstrated that the max_pause
    calculation goes wrong:

    ick: pause : -717
    ick: min_pause : -177
    ick: max_pause : -717
    ick: pages_dirtied : 14
    ick: task_ratelimit: 0

    The problem lies in the two "long = unsigned long" assignments in
    bdi_max_pause(), which might go negative if the highest bit is 1, and the
    min_t(long, ...) check failed to protect against it falling under 0. Fix
    all of them by using "unsigned long" throughout the function.
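
    The failure mode is easy to reproduce in user space, with fixed-width
    types standing in for the 32-bit 'long' of the UML/i386 target (the value
    is chosen to match the -717 in the debug output above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t max_pause = 0xFFFFFD33u;   /* unsigned value, high bit set    */
            int32_t pause = max_pause;          /* the "long = unsigned long" case */

            printf("pause  = %d\n", pause);     /* -717, as in the trace           */

            /* keeping the arithmetic unsigned throughout avoids the sign flip */
            uint32_t upause = max_pause;
            printf("upause = %u\n", upause);
            return 0;
    }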

    Signed-off-by: Fengguang Wu
    Reported-by: Toralf Förster
    Tested-by: Toralf Förster
    Reviewed-by: Jan Kara
    Cc: Richard Weinberger
    Cc: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.

    The most immediate problem with this is that the allocation may fail due
    to a memory cgroup limit, where flushers + direct reclaim might not make
    any progress towards resolving the situation at all. Because unlike the
    global case, a memory cgroup may not have any cache at all, only
    anonymous pages but no swap. This situation will lead to a reclaim
    livelock with insane IO from waking the flushers and thrashing unrelated
    filesystem cache in a tight loop.

    Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. This simplifies the code quite a bit for
    added bonus.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 11feeb498086 ("kvm: optimize away THP checks in
    kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
    compound pages.

    That commit depends on the assumption that PG_reserved is identical for
    all head and tail pages of a compound page. So that if get_user_pages
    returns a tail page, we don't need to check the head page in order to
    know if we deal with a reserved page that requires different
    refcounting.

    The assumption that PG_reserved is the same for head and tail pages is
    certainly correct for THP and regular hugepages, but gigantic hugepages
    allocated through bootmem don't clear the PG_reserved on the tail pages
    (the clearing of PG_reserved is done later only if the gigantic hugepage
    is freed).

    This patch corrects the gigantic compound page initialization so that we
    can retain the optimization in 11feeb498086. The cacheline was already
    modified in order to set PG_tail so this won't affect the boot time of
    large memory systems.

    [akpm@linux-foundation.org: tweak comment layout and grammar]
    Signed-off-by: Andrea Arcangeli
    Reported-by: andy123
    Acked-by: Rik van Riel
    Cc: Gleb Natapov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The zswap_tree is not freed on swapoff, and it gets re-kmalloced on
    swapon, so a memory leak occurs.

    Free the memory of zswap_tree in zswap_frontswap_invalidate_area().

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Cc: Minchan Kim
    Reviewed-by: Minchan Kim
    Cc:
    From: Weijie Yang
    Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

    Consider the following scenario:
    thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
    thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
    finished, entry x and its zbud is not freed as its refcount != 0
    now, the swap_map[x] = 0
    thread 0: now call zswap_get_swap_cache_page
    swapcache_prepare return -ENOENT because entry x is not used any more
    zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
    zswap_writeback_entry do nothing except put refcount
    Now, the memory of zswap_entry x and its zpage leak.

    Modify:
    - check the refcount in fail path, free memory if it is not referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be caused not only by nomem but also by invalidate.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Cc: Minchan Kim
    Cc:
    Acked-by: Seth Jennings

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • If page migration is turned on in the config and the page is migrating,
    we may lose the soft dirty bit. If fork and mprotect are called on
    migrating pages (once migration is complete), the pages do not obtain the
    soft dirty bit in the corresponding pte entries. Fix it by adding an
    appropriate test on swap entries.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • We should clear the page's private flag when returning the page to the
    hugepage pool. Otherwise, a marked hugepage can be allocated to a user
    who tries to allocate a non-reserved hugepage. If this user fails to map
    the hugepage, they will try to return the page to the hugepage pool.
    Since this page has the private flag set, resv_huge_pages would
    mistakenly increase. This patch fixes this situation.

    Signed-off-by: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred
    work").

    unreferenced object 0xffff88006ada3bd0 (size 8):
    comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
    hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00 ........
    backtrace:
    [] kmemleak_alloc+0x5e/0xc0
    [] __kmalloc+0x247/0x310
    [] register_shrinker+0x3c/0xa0
    [] sget+0x5ab/0x670
    [] proc_mount+0x54/0x170
    [] mount_fs+0x43/0x1b0
    [] vfs_kern_mount+0x72/0x110
    [] kern_mount_data+0x19/0x30
    [] pid_ns_prepare_proc+0x20/0x40
    [] alloc_pid+0x466/0x4a0
    [] copy_process+0xc6a/0x1860
    [] do_fork+0x8b/0x370
    [] SyS_clone+0x16/0x20
    [] stub_clone+0x69/0x90
    [] 0xffffffffffffffff

    Signed-off-by: Andrew Vagin
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Cc: Chuck Lever
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     
  • for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
    cpu_online_mask doesn't change during the iteration.

    cpu_hotplug.lock is held while a cpu is going down; it's a coarse lock
    that is used kernel-wide to synchronize cpu hotplug activity. Memcg has
    a cpu hotplug notifier, called while there may not be any cpu hotplug
    refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
    to maintain a cumulative event count as cpus disappear. Without
    get_online_cpus() in mem_cgroup_read_events(), it's possible to account
    for the event count on a dying cpu twice, and this value may be
    significantly large.

    In fact, all memcg->pcp_counter_lock use should be nested by
    {get,put}_online_cpus().

    This fixes that issue and ensures the reported statistics are not vastly
    over-reported during cpu hotplug.
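
    A user-space model of the interleaving described above; plain arrays
    stand in for the per-cpu counters and memcg->nocpu_base, so this is an
    illustration rather than the memcg code:

    #include <stdio.h>

    int main(void)
    {
            long percpu[2] = { 100, 40 };   /* per-cpu event counts                */
            long nocpu_base = 0;            /* events folded in from offlined cpus */

            /* an unsynchronized reader sums the cpus it saw as online... */
            long sum = percpu[0] + percpu[1];

            /* ...meanwhile cpu1 goes offline and the notifier drains its count */
            nocpu_base += percpu[1];
            percpu[1] = 0;

            /* ...the reader finishes by adding the base: cpu1 counted twice */
            sum += nocpu_base;

            printf("sum = %ld, expected 140\n", sum);
            return 0;
    }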

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Oct, 2013

1 commit


09 Oct, 2013

8 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks to near the memory.

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With the scan rate code working (at least for multi-instance specjbb),
    the large hammer that is "sched: Do not migrate memory immediately after
    switching node" can be replaced with something smarter. Revert temporarily
    migration disabling and all traces of numa_migrate_seq.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Adjust numa_scan_period in task_numa_placement, depending on how much
    useful work the numa code can do. The more local faults there are in a
    given scan window the longer the period (and hence the slower the scan rate)
    during the next window. If there are excessive shared faults then the scan
    period will decrease, with the amount of scaling depending on the ratio of
    shared/private faults. If the preferred node changes then the
    scan rate is reset to recheck if the task is properly placed.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Due to the way the pid is truncated, and tasks are moved between
    CPUs by the scheduler, it is possible for the current task_numa_fault
    to group together tasks that do not actually share memory.

    This patch adds a few easy sanity checks to task_numa_fault, joining
    tasks together if they share the same tsk->mm, or if the fault was on
    a page with an elevated mapcount, in a shared VMA.
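
    A small stand-alone example of how PID truncation can conflate unrelated
    tasks; the field width here is made up and does not match the kernel's
    actual layout:

    #include <stdio.h>

    #define PID_BITS 8
    #define PID_MASK ((1 << PID_BITS) - 1)

    int main(void)
    {
            int pid_a = 12345;
            int pid_b = 12601;      /* a different task: 12345 + 256 */

            printf("task A -> %d, task B -> %d\n",
                   pid_a & PID_MASK, pid_b & PID_MASK);

            /* both truncate to 57, so faults from unrelated tasks can look
             * like they came from the same task, hence the extra sanity
             * checks (same mm, elevated mapcount in a shared VMA) above */
            return 0;
    }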

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With the THP migration races closed it is still possible to occasionally
    see corruption. The problem is related to handling PMD pages in batch.
    When a page fault is handled it can be assumed that the page being
    faulted will also be flushed from the TLB. The same flushing does not
    happen when handling PMD pages in batch. Fixing this is straightforward,
    but there are a number of reasons not to:

    1. Multiple TLB flushes may have to be sent depending on what pages get
    migrated
    2. The handling of PMDs in batch means that faults get accounted to
    the task that is handling the fault. While care is taken to only
    mark PMDs where the last CPU and PID match it can still have problems
    due to PID truncation when matching PIDs.
    3. Batching on the PMD level may reduce faults but setting pmd_numa
    requires taking a heavy lock that can contend with THP migration
    and handling the fault requires the release/acquisition of the PTL
    for every page migrated. It's still pretty heavy.

    PMD batch handling is not something that people have ever been happy
    with. This patch removes it, and later patches will deal with the
    additional fault overhead using more intelligent migrate rate adaptation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • And here's a little something to make sure not the whole world ends up
    in a single group.

    As while we don't migrate shared executable pages, we do scan/fault on
    them. And since everybody links to libc, everybody ends up in the same
    group.

    Suggested-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • After page migration, the new page has the nidpid unset. This makes
    every fault on a recently migrated page look like a first numa fault,
    leading to another page migration.

    Copying over the nidpid at page migration time should prevent erroneous
    migrations of recently migrated pages.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-46-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • While parallel applications tend to align their data on the cache
    boundary, they tend not to align on the page or THP boundary.
    Consequently tasks that partition their data can still "false-share"
    pages presenting a problem for optimal NUMA placement.

    This patch uses NUMA hinting faults to chain tasks together into
    numa_groups. As well as storing the NID a task was running on when
    accessing a page, a truncated representation of the faulting PID is
    stored. If subsequent faults are from different PIDs it is reasonable
    to assume that those two tasks share a page and are candidates for
    being grouped together. Note that this patch makes no scheduling
    decisions based on the grouping information.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Peter Zijlstra