12 Jan, 2013

8 commits

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations, but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages, leading
    to a poor interaction with swap-over-NFS support that caused packets to
    be dropped. However, I identified a few more problems with the patch,
    including the fact that it can increase contention on zone->lock in some
    cases, which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoid races that way. While the patch was
    shown to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity, and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
    suitable high-order page immediately when it is made available").

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc:
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Zhouping Liu reported the following against 3.8-rc1 when running a mmap
    testcase from LTP.

    mapcount 0 page_mapcount 3
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:1798!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
    CPU 1
    Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
    RIP: __split_huge_page+0x677/0x6d0
    RSP: 0000:ffff88017a03fc08 EFLAGS: 00010293
    RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
    RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
    RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
    R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
    R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
    FS: 00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
    Stack:
    ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
    ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
    00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
    Call Trace:
    split_huge_page+0x68/0xb0
    __split_huge_page_pmd+0x134/0x330
    split_huge_page_pmd_mm+0x51/0x60
    split_huge_page_address+0x3b/0x50
    __vma_adjust_trans_huge+0x9c/0xf0
    vma_adjust+0x684/0x750
    __split_vma.isra.28+0x1fa/0x220
    do_munmap+0xf9/0x420
    vm_munmap+0x4e/0x70
    sys_munmap+0x2b/0x40
    system_call_fastpath+0x16/0x1b

    Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton
    identified that commit 5a505085f043 ("mm/rmap: Convert the struct
    anon_vma::mutex to an rwsem") and commit 4fc3f1d66b1e ("mm/rmap,
    migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable")
    were likely the problem. Reverting these commits was reported to solve
    the problem for Alexander.

    Despite the reason for these commits, NUMA balancing is not the direct
    source of the problem. split_huge_page() expects the anon_vma lock to
    be exclusive to serialise the whole split operation. Ordinarily it is
    expected that the anon_vma lock would only be required when updating the
    avcs but THP also uses the anon_vma rwsem for collapse and split
    operations where the page lock or compound lock cannot be used (as the
    page is changing from base to THP or vice versa) and the page table
    locks are insufficient.

    This patch takes the anon_vma lock for write to serialise against parallel
    split_huge_page operations, as THP expected before the conversion to an
    rwsem.
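
    A minimal sketch of the change described above, inside split_huge_page()
    (the lock helper name follows the rwsem conversion; treat this as
    illustrative, not the verbatim diff):

    anon_vma_lock_write(anon_vma);      /* exclusive: serialises parallel splits */
    __split_huge_page(page, anon_vma);  /* the split itself runs under the write lock */
    /* ... release the anon_vma lock once the split is complete ... */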

    Reported-and-tested-by: Zhouping Liu
    Reported-by: Alexander Beregalov
    Reported-by: Alex Xu
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an
    rwsem") turned anon_vma mutex to rwsem.

    However, the properly annotated nested locking in mm_take_all_locks()
    has been converted from

    mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);

    to

    down_write(&anon_vma->root->rwsem);

    which is incomplete, and causes the false positive report from lockdep
    below.

    Annotate the fact that mmap_sem is used as an outer lock to serialize
    taking of all the anon_vma rwsems at once no matter the order, using the
    down_write_nest_lock() primitive.
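
    Concretely, the annotated form in mm_take_all_locks() becomes something
    like:

    down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem);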

    This patch fixes this lockdep report:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.8.0-rc2-00036-g5f73896 #171 Not tainted
    ---------------------------------------------
    qemu-kvm/2315 is trying to acquire lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    but task is already holding lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&anon_vma->rwsem);
    lock(&anon_vma->rwsem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by qemu-kvm/2315:
    #0: (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
    #1: (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
    #2: (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
    #3: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    stack backtrace:
    Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171
    Call Trace:
    print_deadlock_bug+0xf2/0x100
    validate_chain+0x4f6/0x720
    __lock_acquire+0x359/0x580
    lock_acquire+0x121/0x190
    down_write+0x3f/0x70
    mm_take_all_locks+0x149/0x1b0
    do_mmu_notifier_register+0x68/0x170
    mmu_notifier_register+0xe/0x10
    kvm_create_vm+0x22b/0x330 [kvm]
    kvm_dev_ioctl+0xf8/0x1a0 [kvm]
    do_vfs_ioctl+0x9d/0x350
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Jiri Kosina
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • Currently free_all_bootmem_core ignores that node_min_pfn may not be a
    multiple of BITS_PER_LONG. E.g. commit 6dccdcbe2c3e ("mm: bootmem: fix
    checking the bitmap when finally freeing bootmem") shifts vec by the
    lower bits of start instead of the lower bits of idx. Also

    if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL)

    assumes that vec bit 0 corresponds to the start pfn, which is only true
    when node_min_pfn is a multiple of BITS_PER_LONG. Also, the loop in the
    else clause can double-free pages (e.g. with node_min_pfn == start == 1
    and map[0] == ~0 on a 32-bit machine, page 32 will be double-freed).

    This bug causes the following message during xtensa kernel boot:

    bootmem::free_all_bootmem_core nid=0 start=1 end=8000
    BUG: Bad page state in process swapper pfn:00001
    page:d04bd020 count:0 mapcount:-127 mapping: (null) index:0x2
    page flags: 0x0()
    Call Trace:
    bad_page+0x8c/0x9c
    free_pages_prepare+0x5e/0x88
    free_hot_cold_page+0xc/0xa0
    __free_pages+0x24/0x38
    __free_pages_bootmem+0x54/0x56
    free_all_bootmem_core$part$11+0xeb/0x138
    free_all_bootmem+0x46/0x58
    mem_init+0x25/0xa4
    start_kernel+0x11e/0x25c
    should_never_return+0x0/0x3be7

    The fix is the following:
    - always align vec so that its bit 0 corresponds to start
    - provide BITS_PER_LONG bits in vec, if those bits are available in the
    map
    - don't free pages past next start position in the else clause.
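
    A small userspace sketch of the first two points (the helper and map
    layout are illustrative, not the bootmem code; a set bit means the page
    is reserved, as in the kernel's bitmap):

    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Build a word whose bit 0 corresponds to pfn 'start', from a bitmap
     * whose bit 0 corresponds to 'node_min_pfn' (set = reserved). */
    static unsigned long vec_for(const unsigned long *map,
                                 unsigned long node_min_pfn, unsigned long start)
    {
            unsigned long idx = start - node_min_pfn;
            unsigned long off = idx % BITS_PER_LONG;
            unsigned long vec = ~map[idx / BITS_PER_LONG] >> off;

            /* complete the word from the next map word (the kernel also
             * checks that those extra bits actually exist in the map) */
            if (off)
                    vec |= ~map[idx / BITS_PER_LONG + 1] << (BITS_PER_LONG - off);
            return vec;
    }

    int main(void)
    {
            /* pfns 1..4 reserved, everything after that free */
            unsigned long map[2] = { 0xF, 0 };
            unsigned long vec = vec_for(map, 1, 5);

            printf("pfn 5 free?      %s\n", (vec & 1) ? "yes" : "no");
            printf("whole word free? %s\n", vec == ~0UL ? "yes" : "no");
            return 0;
    }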

    Signed-off-by: Max Filippov
    Cc: Gavin Shan
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yinghai Lu
    Cc: Joonsoo Kim
    Cc: Prasad Koya
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • The current calculation in pfn_to_bitidx assumes that (pfn -
    zone->zone_start_pfn) >> pageblock_order will return the same bit for
    all pfns in a pageblock. If zone_start_pfn is not aligned to
    pageblock_nr_pages, this may not always be correct.

    Consider the following with pageblock order = 10, zone start 2MB:

    pfn | pfn - zone start | (pfn - zone start) >> page block order
    ----------------------------------------------------------------
    0x26000 | 0x25e00 | 0x97
    0x26100 | 0x25f00 | 0x97
    0x26200 | 0x26000 | 0x98
    0x26300 | 0x26100 | 0x98

    This means that calling {get,set}_pageblock_migratetype on a single page
    will not set the migratetype for the full block. Fix this by rounding
    down zone_start_pfn when doing the bitidx calculation.
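
    A small userspace illustration of the rounding, using the values from the
    table above (this computes only the pageblock index; the real
    pfn_to_bitidx() additionally scales by the number of bits per pageblock):

    #include <stdio.h>

    #define PAGEBLOCK_ORDER    10UL
    #define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

    static unsigned long old_bitidx(unsigned long pfn, unsigned long zone_start)
    {
            return (pfn - zone_start) >> PAGEBLOCK_ORDER;
    }

    static unsigned long fixed_bitidx(unsigned long pfn, unsigned long zone_start)
    {
            zone_start &= ~(PAGEBLOCK_NR_PAGES - 1);   /* round down to a pageblock */
            return (pfn - zone_start) >> PAGEBLOCK_ORDER;
    }

    int main(void)
    {
            unsigned long zone_start = 0x200;   /* 2MB with 4KB pages, not aligned */
            unsigned long pfns[] = { 0x26000, 0x26100, 0x26200, 0x26300 };

            for (int i = 0; i < 4; i++)
                    printf("pfn 0x%lx: old block index 0x%lx, fixed block index 0x%lx\n",
                           pfns[i], old_bitidx(pfns[i], zone_start),
                           fixed_bitidx(pfns[i], zone_start));
            return 0;   /* old: 0x97 0x97 0x98 0x98 -- fixed: 0x98 for all four */
    }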

    For our use case, the effects of this bug were mostly tied to the fact
    that CMA allocations would either take a long time or fail to happen.
    Depending on the driver using CMA, this could result in anything from
    visual glitches to application failures.

    Signed-off-by: Laura Abbott
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • When running the following command under a shell, it returns an error:

    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:

    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows that the kernel returned 3 (COMPACT_COMPLETE) from the write
    to compact_memory.

    The fix is to make the system return 0 instead of 3 (COMPACT_COMPLETE)
    from sysctl_compaction_handler after compaction_nodes has finished.
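
    The bogus second write in the strace output is a direct consequence: once
    write() claims to have written more bytes than were requested, the
    caller's remaining byte count underflows. A tiny illustration (a 32-bit
    count wraps to 4294967295, as seen above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t count = 2;   /* echo asked to write "1\n" */
            uint32_t ret   = 3;   /* buggy handler leaked COMPACT_COMPLETE (3) */

            count -= ret;         /* 2 - 3 wraps around */
            printf("bytes the caller still thinks it must write: %u\n", count);
            return 0;             /* prints 4294967295 */
    }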

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Liu
     
  • The memmove span covers from (next+1) to the end of the array, and the
    index of next is (i+1), so the index of (next+1) is (i+2). So the number
    of remaining array elements is (type->cnt - (i + 2)).
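
    A userspace analogue of that index arithmetic, with plain ints standing in
    for memblock regions (illustrative only):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            int regions[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
            size_t cnt = 8, i = 2;
            size_t next = i + 1;            /* element removed by the merge */

            /* Elements (next + 1)..(cnt - 1) slide down by one:
             * that is cnt - (i + 2) elements, not cnt - (i + 1). */
            memmove(&regions[next], &regions[next + 1],
                    (cnt - (i + 2)) * sizeof(regions[0]));
            cnt--;

            for (size_t k = 0; k < cnt; k++)
                    printf("%d ", regions[k]);
            printf("\n");                   /* 0 1 2 4 5 6 7 */
            return 0;
    }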

    Since the remaining elements of the memblock array are moved forward by
    one element and the bug only adds one extra element to the count, there
    won't be any write overflow here, only a read overflow. It may read one
    element past the end of the array if the array happens to be full.
    Commonly that doesn't matter at all, but if the array happens to be
    located at the end of a memory block, it may cause an invalid read of a
    physical address that doesn't exist.

    There are two *happens to be* conditions here, so I think the probability
    is quite low, and I don't know whether anyone has ever been bitten by
    this bug before.

    Mostly I think it's user-invisible.

    Signed-off-by: Lin Feng
    Acked-by: Tejun Heo
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     
  • Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
    not check page_count before migrating like base page migration and
    khugepage. He could not see why this was safe and he is right.

    The potential impact of the bug is avoided due to the limitations of
    NUMA balancing. The page_mapcount() check ensures that only a single
    address space is using this page and as THPs are typically private it
    should not be possible for another address space to fault it in
    parallel. If the address space has one associated task then it's
    difficult to have both a GUP pin and be referencing the page at the same
    time. If there are multiple tasks then a buggy scenario requires that
    another thread be accessing the page while the direct IO is in flight.
    This is dodgy behaviour, as there is a possibility of corruption with or
    without THP migration.

    While we happen to be safe for the most part, it is shoddy to depend on
    such "safety", so this patch checks the page count similar to what is
    done for anonymous pages. Note that this does not mean that the
    page_mapcount() check can go away. If we were to remove the
    page_mapcount() check, the THP would have to be unmapped from all
    referencing PTEs, replaced with migration PTEs and restored properly
    afterwards.

    Signed-off-by: Mel Gorman
    Reported-by: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Jan, 2013

1 commit

  • The check for a pmd being in the process of being split was dropped by
    mistake by commit d10e63f29488 ("mm: numa: Create basic numa page
    hinting infrastructure"). Put it back.

    Reported-by: Dave Jones
    Debugged-by: Hillf Danton
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jan, 2013

2 commits

  • Since commit e303297e6c3a ("mm: extended batches for generic
    mmu_gather") we are batching pages to be freed until either
    tlb_next_batch cannot allocate a new batch or we are done.

    This works just fine most of the time, but we can get into trouble with
    a non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
    on large machines, where overly aggressive batching might lead to soft
    lockups in the process exit path (exit_mmap) because there are no
    scheduling points down the free_pages_and_swap_cache path, so the
    freeing can take long enough to trigger the soft lockup.

    The lockup is harmless except when the system is set up to panic on
    soft lockup, which is not that unusual.

    The simplest way to work around this issue is to limit the maximum
    number of batches in a single mmu_gather. 10k collected pages should be
    safe against soft lockups (that leaves roughly 2ms per page before the
    soft lockup threshold) even if they are all freed without an explicit
    scheduling point.

    This patch doesn't add any new explicit scheduling points because it
    relies on zap_pmd_range, which calls cond_resched once per PMD during
    page table zapping.
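
    A minimal sketch of such a cap in tlb_next_batch() (the field and constant
    names are illustrative of the ~10k-page bound, not the verbatim patch):

    /* refuse to grow the gather list once enough batches for ~10k pages
     * exist; tlb_next_batch() then fails and the caller flushes what it has */
    if (tlb->batch_count >= MAX_GATHER_BATCH_COUNT)
            return 0;
    tlb->batch_count++;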

    The following lockup has been reported for a 3.0 kernel with a huge
    process (on the order of hundreds of GB, but I don't know any more
    details).

    BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
    Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
    Supported: Yes
    CPU 56
    Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
    RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10
    RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206
    RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
    RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
    RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
    R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
    R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
    FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
    Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
    DWARF2 unwinder stuck at int_signal+0x12/0x17

    Signed-off-by: Michal Hocko
    Cc: [3.0+]
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever
    problem") added an isolated pageblocks counter (nr_pageblock_isolate in
    struct zone) and used it to adjust the free pages counter in
    zone_watermark_ok_safe() to prevent the kswapd-looping-forever problem.

    Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages")
    fixed the accounting of isolated pages in the global free pages counter.
    That made the previous zone_watermark_ok_safe() fix unnecessary and
    potentially harmful (because now isolated pages may be accounted twice,
    making the free pages counter incorrect).

    This patch removes the special isolated pageblocks counter altogether,
    which fixes the zone_watermark_ok_safe() free pages check.

    Reported-by: Tomasz Stanislawski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the use of __devinit from the file.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Jan, 2013

3 commits

  • Sasha was fuzzing with trinity and reported the following problem:

    BUG: sleeping function called from invalid context at kernel/mutex.c:269
    in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
    2 locks held by trinity-main/6361:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0
    #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0
    Pid: 6361, comm: trinity-main Tainted: G W
    3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
    Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

    This was triggered by a different version of automatic NUMA balancing
    but in theory the current version is vulnerable to the same problem.

    do_numa_page
    -> numa_migrate_prep
    -> mpol_misplaced
    -> get_vma_policy
    -> shmem_get_policy

    It's very unlikely this will happen as shared pages are not marked
    pte_numa -- see the page_mapcount() check in change_pte_range() -- but
    it is possible.

    To address this, this patch restores sp->lock as originally implemented
    by Kosaki Motohiro. In the path where get_vma_policy() is called, it
    should not be calling sp_alloc() so it is not necessary to treat the PTL
    specially.
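
    In effect, the shared policy lock goes back to being a spinlock, roughly
    (a sketch, not the full diff):

    struct shared_policy {
            struct rb_root root;
            spinlock_t lock;        /* restored: a mutex cannot be taken under the PTL */
    };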

    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused argument (formerly no_context) from mpol_parse_str()
    and from mpol_to_str().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
    mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts
    or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
    in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
    pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
    worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic.

    Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
    when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when
    no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
    which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
    With slab poisoning, you can then rely on mpol_to_str() to set the bit
    for node 0x6b6b, probably in the next page above the caller's stack.

    mpol_parse_str() is only called from shmem_parse_options(): no_context
    is always true, so call it unused for now, and remove !no_context code.
    Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
    expect. Then mpol_to_str() can ignore its no_context argument also,
    the mpol being appropriately initialized whether contextualized or not.
    Rename its no_context unused too, and let subsequent patch remove them
    (that's not needed for stable backporting, which would involve rejects).
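
    A sketch of the initialization rule described above, as it might look at
    the end of mpol_parse_str() (illustrative of the rule, not necessarily the
    exact hunk):

    /* Save nodes for mpol_to_str() and later contextualization. */
    if (mode != MPOL_PREFERRED)
            new->v.nodes = nodes;
    else if (nodelist)
            new->v.preferred_node = first_node(nodes);
    else
            new->flags |= MPOL_F_LOCAL;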

    I don't understand why MPOL_LOCAL is described as a pseudo-policy:
    it's a reasonable policy which suffers from a confusing implementation
    in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be
    much more robust if MPOL_LOCAL were recognized in switch statements
    throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
    empty) nodes mask like everyone else, instead of its preferred_node
    variant (I presume an optimization from the days before MPOL_LOCAL).
    But that would take me too long to get right and fully tested.

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Dec, 2012

1 commit

  • An unintended consequence of commit 4ae0a48b5efc ("mm: modify
    pgdat_balanced() so that it also handles order-0") is that
    wait_iff_congested() can now be called with a NULL 'struct zone *',
    producing a kernel oops like this:

    BUG: unable to handle kernel NULL pointer dereference
    IP: [] wait_iff_congested+0x59/0x140

    This trivial patch fixes it.

    Reported-by: Zhouping Liu
    Reported-and-tested-by: Sedat Dilek
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Zlatko Calusic
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

24 Dec, 2012

1 commit


21 Dec, 2012

9 commits

  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton : (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Commit 648bb56d076b ("cgroup: lock cgroup_mutex in cgroup_init_subsys()")
    made cgroup_init_subsys() grab cgroup_mutex before invoking
    ->css_alloc() for the root css. Because memcg registers hotcpu notifier
    from ->css_alloc() for the root css, this introduced a circular locking
    dependency between cgroup_mutex and CPU hotplug.

    Fix it by moving hotcpu notifier registration to a subsys initcall.
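
    A sketch of the shape of that move (the initcall and callback names here
    are illustrative, not necessarily those in the patch):

    /* register the CPU hotplug callback once at boot, not from ->css_alloc() */
    static int __init memcg_hotcpu_init(void)
    {
            hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
            return 0;
    }
    subsys_initcall(memcg_hotcpu_init);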

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.7.0-rc4-work+ #42 Not tainted
    -------------------------------------------------------
    bash/645 is trying to acquire lock:
    (cgroup_mutex){+.+.+.}, at: [] cgroup_lock+0x17/0x20

    but task is already holding lock:
    (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    get_online_cpus+0x3c/0x60
    rebuild_sched_domains_locked+0x1b/0x70
    cpuset_write_resmask+0x298/0x2c0
    cgroup_file_write+0x1ef/0x300
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b

    -> #0 (cgroup_mutex){+.+.+.}:
    __lock_acquire+0x14ce/0x1d20
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    cgroup_lock+0x17/0x20
    cpuset_handle_hotplug+0x1b/0x560
    cpuset_update_active_cpus+0xe/0x10
    cpuset_cpu_inactive+0x47/0x50
    notifier_call_chain+0x66/0x150
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x20/0x40
    _cpu_down+0x7e/0x2f0
    cpu_down+0x36/0x50
    store_online+0x5d/0xe0
    dev_attr_store+0x18/0x30
    sysfs_write_file+0xe0/0x150
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b
    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(cpu_hotplug.lock);
    lock(cgroup_mutex);
    lock(cpu_hotplug.lock);
    lock(cgroup_mutex);

    *** DEADLOCK ***

    5 locks held by bash/645:
    #0: (&buffer->mutex){+.+.+.}, at: [] sysfs_write_file+0x48/0x150
    #1: (s_active#42){.+.+.+}, at: [] sysfs_write_file+0xc8/0x150
    #2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [] cpu_hotplug_driver_lock+0x17/0x20
    #3: (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20
    #4: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

    stack backtrace:
    Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42
    Call Trace:
    print_circular_bug+0x28e/0x29f
    __lock_acquire+0x14ce/0x1d20
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    cgroup_lock+0x17/0x20
    cpuset_handle_hotplug+0x1b/0x560
    cpuset_update_active_cpus+0xe/0x10
    cpuset_cpu_inactive+0x47/0x50
    notifier_call_chain+0x66/0x150
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x20/0x40
    _cpu_down+0x7e/0x2f0
    cpu_down+0x36/0x50
    store_online+0x5d/0xe0
    dev_attr_store+0x18/0x30
    sysfs_write_file+0xe0/0x150
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Clarify error messages and correct a few typos in the transparent hugepage
    sysfs init code.

    Signed-off-by: Jeremy Eder
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Eder
     
  • Memory returned to free_contig_range() must have no other references.
    Let the kernel complain loudly if the page reference count is not equal to 1.

    [rientjes@google.com: support sparsemem]
    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Acked-by: Michal Nazarewicz
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • The system uses global_dirtyable_memory() to calculate the number of
    dirtyable pages, i.e. pages that can be allocated to the page cache. A
    bug causes an underflow, making the page count look like a huge unsigned
    number. This in turn causes the dirty writeback throttling to
    aggressively write back pages as they become dirty (usually one page at
    a time). This generally only affects systems with highmem because the
    underflowed count gets subtracted from the global count of dirtyable
    memory.

    The problem was introduced with v3.2-4896-gab8fabd

    The fix is to ensure we don't get an underflowed total of either highmem
    or global dirtyable memory.
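
    A tiny illustration of the underflow and the clamp (illustrative numbers;
    in the kernel the subtraction involves highmem pages and the dirtyable
    reserve):

    #include <stdio.h>

    static unsigned long dirtyable(unsigned long pages, unsigned long reserve)
    {
            if (pages <= reserve)
                    return 0;               /* the fix: never let this wrap */
            return pages - reserve;
    }

    int main(void)
    {
            unsigned long pages = 1000, reserve = 1500;

            printf("unclamped: %lu\n", pages - reserve);    /* huge bogus value */
            printf("clamped:   %lu\n", dirtyable(pages, reserve));
            return 0;
    }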

    Signed-off-by: Sonny Rao
    Signed-off-by: Puneet Kumar
    Acked-by: Johannes Weiner
    Tested-by: Damien Wyart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonny Rao
     
  • isolate_freepages_block() and isolate_migratepages_range() are used by
    CMA as well as by compaction, so the build breaks with CONFIG_CMA &&
    !CONFIG_COMPACTION.

    This patch fixes it.

    [akpm@linux-foundation.org: add "do { } while (0)", per Mel]
    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • …/linux-fs into for-linus

    Al Viro
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     
  • Pull virtio update from Rusty Russell:
    "Some nice cleanups, and even a patch my wife did as a "live" demo for
    Latinoware 2012.

    There's a slightly non-trivial merge in virtio-net, as we cleaned up
    the virtio add_buf interface while DaveM accepted the mq virtio-net
    patches."

    * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (27 commits)
    virtio_console: Add support for remoteproc serial
    virtio_console: Merge struct buffer_token into struct port_buffer
    virtio: add drv_to_virtio to make code clearly
    virtio: use dev_to_virtio wrapper in virtio
    virtio-mmio: Fix irq parsing in command line parameter
    virtio_console: Free buffers from out-queue upon close
    virtio: Convert dev_printk(KERN_ to dev_(
    virtio_console: Use kmalloc instead of kzalloc
    virtio_console: Free buffer if splice fails
    virtio: tools: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: scsi: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: rpmsg: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: net: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: console: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: make virtqueue_add_buf() returning 0 on success, not capacity.
    virtio: console: don't rely on virtqueue_add_buf() returning capacity.
    virtio_net: don't rely on virtqueue_add_buf() returning capacity.
    virtio-net: remove unused skb_vnet_hdr->num_sg field
    virtio-net: correct capacity math on ring full
    virtio: move queue_index and num_free fields into core struct virtqueue.
    ...

    Linus Torvalds
     

20 Dec, 2012

2 commits

  • The rmap walks in ksm.c are like those in rmap.c: they can safely be
    done with anon_vma_lock_read().

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • On a 4GB RAM machine, where the Normal zone is much smaller than the
    DMA32 zone, the Normal zone gets fragmented over time. This requires
    relatively more pressure in balance_pgdat to get the zone above the
    required watermark. Unfortunately, the congestion_wait() call in there
    slows it down for a completely wrong reason, expecting that there's a
    lot of writeback/swapout, even when there's none (the much more common
    case). After a few days, when fragmentation progresses, this flawed
    logic translates to very high CPU iowait times, even though there's no
    I/O congestion at all. If THP is enabled, the problem occurs sooner,
    but I was able to see it even on !THP kernels, just by giving it a bit
    more time to occur.

    The proper way to deal with this is to not wait, unless there's
    congestion. Thanks to Mel Gorman, we already have the function that
    perfectly fits the job. The patch was tested on a machine which nicely
    revealed the problem after only 1 day of uptime, and it's been working
    great.
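
    Presumably the change in balance_pgdat() amounts to something like the
    following sketch, assuming the replacement helper is wait_iff_congested(),
    which only sleeps when the zone really is congested:

    /* before: sleep unconditionally, assuming writeback/swapout is in progress */
    congestion_wait(BLK_RW_ASYNC, HZ/10);

    /* after (sketch): only sleep if the zone's backing devices are congested */
    wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);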

    Signed-off-by: Zlatko Calusic
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

19 Dec, 2012

12 commits

  • Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim. If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
    => too_many_isolated() false
    => vmscan and run into dirty pages
    => pageout()
    => take some FS lock
    => fs/block code does GFP_NOIO allocation
    => enter direct reclaim again
    => too_many_isolated() true
    => waiting for others to progress, however the other
    tasks may be circular waiting for the FS lock..

    The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
    priority than normal ones, by lowering the throttle threshold for the
    latter.

    Allowing ~1/8 isolated pages for normal reclaim is large enough. For
    example, for a 1GB LRU list, that's ~128MB of isolated pages, or 1k
    blocked tasks (each isolates 32 4KB pages), or 64 blocked tasks per
    logical CPU (assuming 16 logical CPUs per NUMA node). So it's not likely
    some CPU goes idle waiting (when it could make progress) because of this
    limit: there are many more sleeping reclaim tasks than CPUs, so the task
    may well be blocked on some low-level queue/lock anyway.

    Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
    They will be blocked only when there are too many concurrent !GFP_IOFS
    reclaims; however, that's very unlikely because IO-less direct reclaims
    are able to progress much faster, and they won't deadlock each other.
    The threshold is raised high enough for them, so that there can be
    sufficient parallel progress of !GFP_IOFS reclaims.
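
    A sketch of the lowered threshold in too_many_isolated() (the exact shift
    is an assumption matching the ~1/8 figure above):

    /*
     * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
     * they won't get blocked by normal direct reclaimers and form a
     * circular deadlock. Normal reclaim throttles once ~1/8 is isolated.
     */
    if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
            inactive >>= 3;

    return isolated > inactive;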

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Wu Fengguang
    Cc: Torsten Kaiser
    Tested-by: NeilBrown
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Comment "Why it's doing so" rather than "What it does" as proposed by
    Andrew Morton.

    Signed-off-by: Wu Fengguang
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Replace the obsolete simple_strtoul() with kstrtoul().

    Signed-off-by: Abhijit Pawar
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Abhijit Pawar
     
  • Signed-off-by: Tang Chen
    Cc: Jiang Liu
    Cc: Lai Jiangshan
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Build a kernel with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
    CONFIG_CGROUP_HUGETLB=y, then specify the hugepagesz=xx boot option; the
    system will fail to boot.

    This failure is caused by following code path:

    setup_hugepagesz
    hugetlb_add_hstate
    hugetlb_cgroup_file_init
    cgroup_add_cftypes
    kzalloc

    Signed-off-by: Jiang Liu
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • A few gremlins have recently crept in.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Sasha Levin recently reported a lockdep problem resulting from the new
    attribute propagation introduced by the kmemcg series. In short,
    slab_mutex will be taken from within the sysfs attribute store function.
    This creates a dependency that will later be taken in the reverse order
    when a cache is destroyed, since destruction occurs with slab_mutex held
    and then calls into the sysfs directory removal function.

    In this patch, I propose to adopt a strategy close to what
    __kmem_cache_create does before calling sysfs_slab_add, and release the
    lock before the call to sysfs_slab_remove. This is pretty much the last
    operation in the kmem_cache_shutdown() path, so we could do better by
    splitting this out and moving this call alone to later on. That will fit
    nicely once sysfs handling is consistent between all caches, but would
    look weird now.
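
    The shape of the change in the destruction path is roughly the following
    (a sketch of the described strategy, not the verbatim hunk):

    mutex_unlock(&slab_mutex);      /* sysfs removal must not nest inside slab_mutex */
    sysfs_slab_remove(s);
    mutex_lock(&slab_mutex);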

    Lockdep info:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
    -------------------------------------------------------
    trinity-child13/6961 is trying to acquire lock:
    (s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:
    -> #1 (slab_mutex){+.+.+.}:
    lock_acquire+0x1aa/0x240
    __mutex_lock_common+0x59/0x5a0
    mutex_lock_nested+0x3f/0x50
    slab_attr_store+0xde/0x110
    sysfs_write_file+0xfa/0x150
    vfs_write+0xb0/0x180
    sys_pwrite64+0x60/0xb0
    tracesys+0xe1/0xe6
    -> #0 (s_active#43){++++.+}:
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(slab_mutex);
    lock(s_active#43);
    lock(slab_mutex);
    lock(s_active#43);

    *** DEADLOCK ***

    2 locks held by trinity-child13/6961:
    #0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0
    #1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    stack backtrace:
    Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
    Call Trace:
    print_circular_bug+0x1fb/0x20c
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    Signed-off-by: Glauber Costa
    Reported-by: Sasha Levin
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This patch clarifies two aspects of cache attribute propagation.

    First, the expected context for the for_each_memcg_cache macro in
    memcontrol.h. The usages already in the codebase are safe. In mm/slub.c,
    it is trivially safe because the lock is acquired right before the loop.
    In mm/slab.c, it is less so: the lock is acquired by an outer function a
    few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
    indeed safe.
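
    For the mm/slab.c case, the added assertion would be essentially:

    VM_BUG_ON(!mutex_is_locked(&slab_mutex));  /* the walk relies on slab_mutex */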

    A comment is also added to detail why we are returning the value of the
    parent cache and ignoring the children's when we propagate the attributes.

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • SLUB allows us to tune a particular cache behavior with sysfs-based
    tunables. When creating a new memcg cache copy, we'd like to preserve any
    tunables the parent cache already had.

    This can be done by tapping into the store attribute function provided by
    the allocator. We of course don't need to mess with read-only fields.
    Since the attributes can have multiple types and are stored internally by
    sysfs, the best strategy is to issue a ->show() in the root cache, and
    then ->store() in the memcg cache.

    The drawback of that is that sysfs can allocate up to a page of buffering
    for show(), which we are likely not to need, but also can't guarantee. To
    avoid always allocating a page for that, we can update the caches at store
    time with the maximum attribute size ever stored to the root cache. We
    will then get a buffer big enough to hold it. The corollary to this is
    that if no stores happened, nothing will be propagated.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.
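
    A self-contained illustration of the ->show()/->store() round trip
    described above (the structs are stand-ins, not the SLUB attribute
    machinery):

    #include <stdio.h>

    struct cache { int order; };                   /* stand-in for one tunable */

    struct attr_ops {
            int (*show)(struct cache *c, char *buf);
            int (*store)(struct cache *c, const char *buf, int len);
    };

    static int order_show(struct cache *c, char *buf)
    {
            return sprintf(buf, "%d", c->order);
    }

    static int order_store(struct cache *c, const char *buf, int len)
    {
            (void)len;
            return sscanf(buf, "%d", &c->order);
    }

    int main(void)
    {
            struct cache root = { .order = 3 }, memcg_copy = { .order = 0 };
            struct attr_ops ops = { order_show, order_store };
            char buf[64];

            /* propagate: read the tunable from the root cache, write it to the copy */
            int len = ops.show(&root, buf);
            ops.store(&memcg_copy, buf, len);

            printf("memcg copy order = %d\n", memcg_copy.order);   /* 3 */
            return 0;
    }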

    [akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • SLAB allows us to tune a particular cache behavior with tunables. When
    creating a new memcg cache copy, we'd like to preserve any tunables the
    parent cache already had.

    This could be done by an explicit call to do_tune_cpucache() after the
    cache is created. But this is not very convenient now that the caches are
    created from common code, since this function is SLAB-specific.

    Another method of doing that is taking advantage of the fact that
    do_tune_cpucache() is always called from enable_cpucache(), which is
    called at cache initialization. We can just preset the values, and then
    things work as expected.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    This change requires us to move the assignment of root_cache in
    memcg_params a bit earlier. We need this to be already set - which
    memcg_kmem_register_cache will do - when we reach __kmem_cache_create().

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • When we create caches in memcgs, we need to display their usage
    information somewhere. We'll adopt a scheme similar to /proc/meminfo,
    with aggregate totals shown in the global file, and per-group information
    stored in the group itself.

    For the time being, only reads are allowed in the per-group cache.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This means that when we destroy a memcg cache that happened to be empty,
    it may take a long time to go away: removing the memcg reference won't
    destroy it - because there are pending references - and the empty pages
    will stay there until a shrinker is called upon for any reason.

    In this patch, we will call kmem_cache_shrink() for all dead caches that
    cannot be destroyed because of remaining pages. After shrinking, it is
    possible that the cache can be freed. If this is not the case, we'll
    schedule a lazy worker to keep trying.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa