13 Feb, 2015

40 commits

  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Instead of a custom approach, let's use string_escape_str() to escape a
    given string (task_name in this case).

    Signed-off-by: Andy Shevchenko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • The output of /proc/$pid/numa_maps is in terms of number of pages like
    anon=22 or dirty=54. Here's some output:

    7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
    7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
    7fff8d425000 default stack anon=50 dirty=50 N0=50

    Looks like we have a stack and a couple of anonymous hugetlbfs areas
    which all appear to use the same amount of memory. They don't.

    The 'bigfile' uses 1GB pages and takes up ~50GB of space. The
    anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
    uses normal 4k pages. You can go over to smaps to figure out what the
    page size _really_ is with KernelPageSize or MMUPageSize. But, I think
    this is a pretty nasty and counterintuitive interface as it stands.

    This patch introduces a 'kernelpagesize_kB' line element to the
    /proc/<pid>/numa_maps report file in order to help identify the size of
    pages that are backing memory areas mapped by a given task. This is
    especially useful to help differentiate between HUGE and GIGANTIC page
    backed VMAs.

    This patch is based on Dave Hansen's proposal and reviewers' follow-ups
    taken from the following discussion threads:
    * https://lkml.org/lkml/2011/9/21/454
    * https://lkml.org/lkml/2014/12/20/66
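
    As a quick userspace check, here is a minimal C sketch (assuming a kernel
    that carries this patch) which prints only the numa_maps lines of the
    current process that carry the new element:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/self/numa_maps", "r");
            char line[1024];

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* Print only the lines that carry the new element. */
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "kernelpagesize_kB="))
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }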

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Add a small section to the proc.txt doc in order to document the
    /proc/pid/numa_maps interface. This does not introduce any functional
    changes, just documentation.

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Use the PDE() helper to get proc_dir_entry instead of coding it directly.

    Signed-off-by: Alexander Kuleshov
    Acked-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Peak resident size of a process can be reset back to the process's
    current rss value by writing "5" to /proc/pid/clear_refs. The driving
    use-case for this would be getting the peak RSS value, which can be
    retrieved from the VmHWM field in /proc/pid/status, per benchmark
    iteration or test scenario.
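
    A minimal C sketch of the intended workflow against the calling process
    itself, assuming a kernel with this patch (error handling kept short):

    #include <stdio.h>
    #include <string.h>

    /* Print the VmHWM (peak RSS) line of the calling process. */
    static void print_vmhwm(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/status", "r");

            if (!f)
                    return;
            while (fgets(line, sizeof(line), f))
                    if (!strncmp(line, "VmHWM:", 6))
                            fputs(line, stdout);
            fclose(f);
    }

    int main(void)
    {
            FILE *f;

            print_vmhwm();

            /* Writing "5" resets the peak back to the current RSS
             * (only on kernels that carry this patch). */
            f = fopen("/proc/self/clear_refs", "w");
            if (!f) {
                    perror("clear_refs");
                    return 1;
            }
            fputs("5", f);
            fclose(f);

            print_vmhwm();
            return 0;
    }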

    [akpm@linux-foundation.org: clarify behaviour in documentation]
    Signed-off-by: Petr Cermak
    Cc: Bjorn Helgaas
    Cc: Primiano Tucci
    Cc: Petr Cermak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Cermak
     
  • Remove the function search_one_table() that is not used anywhere.

    This was partially found by using a static code analysis program called
    cppcheck.

    Signed-off-by: Rickard Strandqvist
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rickard Strandqvist
     
  • Keeping zsmalloc fragmentation at a low level is our target. But now we
    still need to add debug code to zsmalloc to get the quantitative data.

    This patch adds a new configuration option, CONFIG_ZSMALLOC_STAT, to
    enable statistics collection for developers. Currently only the object
    statistics in each class are collected. Users can get the information via
    debugfs.

    cat /sys/kernel/debug/zsmalloc/zram0/...

    For example:

    After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
    class size obj_allocated obj_used pages_used
    0 32 0 0 0
    1 48 256 12 3
    2 64 64 14 1
    3 80 51 7 1
    4 96 128 5 3
    5 112 73 5 2
    6 128 32 4 1
    7 144 0 0 0
    8 160 0 0 0
    9 176 0 0 0
    10 192 0 0 0
    11 208 0 0 0
    12 224 0 0 0
    13 240 0 0 0
    14 256 16 1 1
    15 272 15 9 1
    16 288 0 0 0
    17 304 0 0 0
    18 320 0 0 0
    19 336 0 0 0
    20 352 0 0 0
    21 368 0 0 0
    22 384 0 0 0
    23 400 0 0 0
    24 416 0 0 0
    25 432 0 0 0
    26 448 0 0 0
    27 464 0 0 0
    28 480 0 0 0
    29 496 33 1 4
    30 512 0 0 0
    31 528 0 0 0
    32 544 0 0 0
    33 560 0 0 0
    34 576 0 0 0
    35 592 0 0 0
    36 608 0 0 0
    37 624 0 0 0
    38 640 0 0 0
    40 672 0 0 0
    42 704 0 0 0
    43 720 17 1 3
    44 736 0 0 0
    46 768 0 0 0
    49 816 0 0 0
    51 848 0 0 0
    52 864 14 1 3
    54 896 0 0 0
    57 944 13 1 3
    58 960 0 0 0
    62 1024 4 1 1
    66 1088 15 2 4
    67 1104 0 0 0
    71 1168 0 0 0
    74 1216 0 0 0
    76 1248 0 0 0
    83 1360 3 1 1
    91 1488 11 1 4
    94 1536 0 0 0
    100 1632 5 1 2
    107 1744 0 0 0
    111 1808 9 1 4
    126 2048 4 4 2
    144 2336 7 3 4
    151 2448 0 0 0
    168 2720 15 15 10
    190 3072 28 27 21
    202 3264 0 0 0
    254 4096 36209 36209 36209

    Total 37022 36326 36288

    We can calculate the overall fragmentation from the last line:
    Total 37022 36326 36288
    (37022 - 36326) / 37022 = 1.87%
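
    The same arithmetic as a tiny C snippet, using the totals from the table
    above:

    #include <stdio.h>

    int main(void)
    {
            /* obj_allocated and obj_used totals from the "Total" line above. */
            double allocated = 37022, used = 36326;

            /* Unused objects as a share of what was allocated (~1.9%). */
            printf("fragmentation = %.2f%%\n",
                   (allocated - used) * 100.0 / allocated);
            return 0;
    }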

    Also, by analysing the objects allocated in every class we know why we
    got such low fragmentation: most of the allocated objects are in class
    254, and a class 254 zspage consists of only 1 page. So no fragmentation
    will be introduced by allocating objects in class 254.

    In the future, we can collect other zsmalloc statistics as needed and
    analyse them.

    Signed-off-by: Ganesh Mahendran
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Currently the underlying allocators of zpool, zsmalloc/zbud, do not know
    who created them; there is no way for zsmalloc/zbud to find out which
    caller they belong to.

    Now we want to add statistics collection to zsmalloc, so we need to name
    the debugfs directory for each pool created. The way suggested by Minchan
    Kim is to use a name passed by the caller (such as zram) to create the
    zsmalloc pool:

    /sys/kernel/debug/zsmalloc/zram0

    This patch adds an argument `name' to zs_create_pool() and other related
    functions.

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • `struct zram' contains both `struct gendisk' and `struct request_queue'.
    The latter can be deleted, because zram->disk carries the ->queue pointer,
    and ->queue carries the zram pointer:

    create_device()
    zram->queue->queuedata = zram
    zram->disk->queue = zram->queue
    zram->disk->private_data = zram

    so zram->queue is not needed; we can access all the necessary data anyway.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The admin could reset zram while I/O is in progress, so we have used
    zram->init_lock as a read-side lock in the I/O path to prevent the zram
    meta from suddenly being freed.

    However, the init_lock is really troublesome. We can't call
    zram_meta_alloc under init_lock due to a lockdep splat, because
    zram_rw_page is one of the functions called under the reclaim path and
    holds it as a read lock, while other places in process context hold it as
    a write lock. So we have done the allocation outside the lock to avoid
    the lockdep warning, but that's not good for readability and, finally, I
    met another lockdep splat between init_lock and cpu_hotplug from
    kmem_cache_destroy while working on zsmalloc compaction. :(

    Yes, the ideal is to remove the horrible init_lock of zram in the rw
    path. This patch removes it in the rw path and instead adds an atomic
    refcount for meta lifetime management and a completion to free the meta
    in process context. It's important to free the meta in process context
    because some of the resource destruction needs a mutex lock, which could
    already be held if we released the resource in reclaim context, so it
    would deadlock again.

    As a bonus, we can remove the init_done check in the rw path because
    zram_meta_get will take over that role instead.
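
    As an illustration only, the lifetime scheme described above (an atomic
    refcount pinning the meta plus a completion that the last put signals, so
    the final free happens in process context) can be sketched as a userspace
    analogue. The meta_get/meta_put/meta_reset names are made up for this
    sketch, and a condition variable stands in for the kernel completion:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    struct meta {
            atomic_int refcount;
            pthread_mutex_t lock;
            pthread_cond_t released;        /* stands in for a completion */
            bool dead;
    };

    /* Pin the meta; fails once teardown has dropped the last reference. */
    static bool meta_get(struct meta *m)
    {
            int old = atomic_load(&m->refcount);

            do {
                    if (old == 0)
                            return false;
            } while (!atomic_compare_exchange_weak(&m->refcount, &old, old + 1));
            return true;
    }

    /* Drop a reference; the last put signals the waiter. */
    static void meta_put(struct meta *m)
    {
            if (atomic_fetch_sub(&m->refcount, 1) == 1) {
                    pthread_mutex_lock(&m->lock);
                    m->dead = true;
                    pthread_cond_signal(&m->released);
                    pthread_mutex_unlock(&m->lock);
            }
    }

    /* Called from process context: wait until every I/O reference is gone. */
    static void meta_reset(struct meta *m)
    {
            meta_put(m);                    /* drop the initial reference */
            pthread_mutex_lock(&m->lock);
            while (!m->dead)
                    pthread_cond_wait(&m->released, &m->lock);
            pthread_mutex_unlock(&m->lock);
            /* now it is safe to take sleeping locks and free resources */
    }

    int main(void)
    {
            struct meta m = {
                    .refcount = 1,
                    .lock = PTHREAD_MUTEX_INITIALIZER,
                    .released = PTHREAD_COND_INITIALIZER,
            };

            if (meta_get(&m)) {             /* an I/O path pins the meta */
                    /* ... perform the read/write ... */
                    meta_put(&m);
            }
            meta_reset(&m);                 /* waits for the last put */
            return 0;
    }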

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • bd_holders is increased only when a user opens the device file as
    FMODE_EXCL, so if something opens zram0 as !FMODE_EXCL and requests I/O
    while another user resets zram0, we can see the following warning.

    zram0: detected capacity change from 0 to 64424509440
    Buffer I/O error on dev zram0, logical block 180823, lost async page write
    Buffer I/O error on dev zram0, logical block 180824, lost async page write
    Buffer I/O error on dev zram0, logical block 180825, lost async page write
    Buffer I/O error on dev zram0, logical block 180826, lost async page write
    Buffer I/O error on dev zram0, logical block 180827, lost async page write
    Buffer I/O error on dev zram0, logical block 180828, lost async page write
    Buffer I/O error on dev zram0, logical block 180829, lost async page write
    Buffer I/O error on dev zram0, logical block 180830, lost async page write
    Buffer I/O error on dev zram0, logical block 180831, lost async page write
    Buffer I/O error on dev zram0, logical block 180832, lost async page write
    ------------[ cut here ]------------
    WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
    Modules linked in:
    CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x45/0x57
    warn_slowpath_common+0x8a/0xc0
    warn_slowpath_null+0x1a/0x20
    __blkdev_put+0x1d7/0x210
    blkdev_put+0x50/0x130
    blkdev_close+0x25/0x30
    __fput+0xdf/0x1e0
    ____fput+0xe/0x10
    task_work_run+0xa7/0xe0
    do_notify_resume+0x49/0x60
    int_signal+0x12/0x17
    ---[ end trace 274fbbc5664827d2 ]---

    The warning comes from bdev_write_inode in the blkdev_put path.

    static void bdev_write_inode(struct inode *inode)
    {
            spin_lock(&inode->i_lock);
            while (inode->i_state & I_DIRTY) {
                    spin_unlock(&inode->i_lock);
                    WARN_ON_ONCE(write_inode_now(inode, true));
                    spin_lock(&inode->i_lock);
            }
            spin_unlock(&inode->i_lock);
    }

    The reason is that the dd process encounters I/O failures due to the
    sudden block device disappearance, so filemap_check_errors() in
    __writeback_single_inode() returns -EIO.

    If we check bd_openers instead of bd_holders, we can address the problem.
    Looking at brd, it already uses bd_openers rather than bd_holders, so
    although I'm not an expert on the block layer, it seems to be the better
    choice.

    I can reproduce the warning with the simple script below. In addition, I
    added msleep(2000) below set_capacity(zram->disk, 0) after applying the
    patch to make the race window huge (kudos to Ganesh!).

    script:

    echo $((60<<30)) > /sys/block/zram0/disksize
    setsid dd if=/dev/zero of=/dev/zram0 &
    sleep 1
    setsid echo 1 > /sys/block/zram0/reset

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We need to return set_capacity(disk, 0) from reset_store() back to
    zram_reset_device(), a catch by Ganesh Mahendran. Potentially, we can
    race set_capacity() calls from init and reset paths.

    The problem is that zram_reset_device() is also getting called from
    zram_exit(), which performs operations in a misleadingly reversed order --
    we first create_device() and then init it, while zram_exit() performs
    destroy_device() first and then does zram_reset_device(). This is done to
    remove the sysfs group before we reset the device, so we can continue
    with device reset/destruction without being raced by a sysfs attr write
    (e.g. disksize).

    Apart from that, destroy_device() releases zram->disk (but we still have
    the ->disk pointer), so we cannot access zram->disk in a later
    zram_reset_device() call, which may cause additional errors in the future.

    So this patch reworks and cleans up the destroy path:

    1) remove several unneeded goto labels in zram_init()

    2) factor out zram_init() error path and zram_exit() into
    destroy_devices() function, which takes the number of devices to
    destroy as its argument.

    3) remove sysfs group in destroy_devices() first, so we can reorder
    operations -- reset device (as expected) goes before disk destroy and
    queue cleanup. So we can always access ->disk in zram_reset_device().

    4) and, finally, return set_capacity() back under ->init_lock.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
    avoid ->bd_holders race condition:

    CPU0                            CPU1
    umount                          /* zram->init_done is true */
    reset_store()
    bdev->bd_holders == 0           mount
    ...                             zram_make_request()
    zram_reset_device()

    However, his solution required some considerable amount of code movement,
    which we can avoid.

    Apart from using bdev->bd_mutex in reset_store(), this patch also
    simplifies zram_reset_device().

    zram_reset_device() has a bool parameter, reset_capacity, which tells it
    whether the disk capacity and the disk itself should be reset. There are
    two zram_reset_device() callers:

    -- zram_exit() passes reset_capacity=false
    -- reset_store() passes reset_capacity=true

    So we can move reset_capacity-sensitive work out of zram_reset_device()
    and perform it unconditionally in reset_store(). This also lets us drop
    the reset_capacity parameter from zram_reset_device() and pass only the
    zram pointer.

    Signed-off-by: Sergey Senozhatsky
    Reported-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • zram_meta_alloc() and zram_meta_free() are a pair. The meta table is
    allocated in zram_meta_alloc(), so it is better to free it in
    zram_meta_free().

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • A trivial cleanup of zram_meta_alloc() error handling.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The vmstat interfaces are good at hiding negative counts (at least when
    CONFIG_SMP); but if you peer behind the curtain, you find that
    nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
    more negative: so they can absorb larger and larger numbers of isolated
    pages, yet still appear to be zero.

    I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
    but I guess it's there for a good reason, in which case we ought to get
    too_many_isolated() working again.

    The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
    putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
    to call acct_isolated() to increment them.

    It is possible that the bug which this patch fixes could cause OOM kills
    when the system still has a lot of reclaimable page cache.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    Signed-off-by: Hugh Dickins
    Acked-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A race condition starts to be visible in recent mmotm, where a
    PG_hwpoison flag is set on a migration source page *before* it's back in
    the buddy page pool.

    This is problematic because no page flag is supposed to be set when
    freeing (see __free_one_page().) So the user-visible effect of this race
    is that it could trigger the BUG_ON() when soft-offlining is called.

    The root cause is that we call lru_add_drain_all() to make sure that the
    page is in buddy, but that doesn't work because this function just
    schedules a work item and doesn't wait for its completion.
    drain_all_pages() does the draining directly, so simply dropping
    lru_add_drain_all() solves this problem.

    Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: [3.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Add a necessary 'leave'.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • For whatever reason, generic_access_phys() only remaps one page, but
    actually allows access of arbitrary size. It's quite easy to trigger
    large reads, like printing out a large structure with gdb, which leads to
    a crash. Fix it by remapping the correct size.

    Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
    Signed-off-by: Grazvydas Ignotas
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grazvydas Ignotas
     
  • The only caller of cpuset_init_current_mems_allowed is the __init
    annotated build_all_zonelists_init, so we can also make the former __init.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • mminit_loglevel is only referenced from __init and __meminit functions, so
    we can mark it __meminitdata.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The only caller of mminit_verify_zonelist is build_all_zonelists_init,
    which is annotated with __init, so it should be safe to also mark the
    former as __init, saving ~400 bytes of .text.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
    its own helper allows us to shrink .text a little. This relies on
    build_all_zonelists already having a __ref annotation. Add a comment
    explaining why so one doesn't have to track it down through git log.

    The real saving comes in 3/5 ("mm/mm_init.c: Mark mminit_verify_zonelist
    as __init"), where we save about 400 bytes.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • All users of mminit_dprintk pass a compile-time constant as level, so this
    just makes gcc emit a single printk call instead of two.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • mm->nr_pmds doesn't make sense on !MMU configurations

    Signed-off-by: Kirill A. Shutemov
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
    from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
    wrapper functions.

    Although this patch moves static keys decrement from __mem_cgroup_free to
    mem_cgroup_css_free, it does not introduce any functional changes, because
    the keys are incremented on setting the limit (tcp or kmem), which can
    only happen after successful mem_cgroup_css_online.

    Signed-off-by: Vladimir Davydov
    Cc: Glauber Costa
    Cc: KAMEZAWA Hiroyuki
    Cc: Eric W. Biederman
    Cc: David S. Miller
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, freepage isolation in one pageblock doesn't consider how many
    freepages we isolate. When I traced the flow of compaction, compaction
    sometimes isolated more than 256 freepages to migrate just 32 pages.

    With this patch, freepage isolation is stopped at the point where we have
    more isolated freepages than isolated pages for migration. This slows
    down the free page scanner and makes the compaction success rate higher.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    shows an increase in compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    27.13 : 31.82

    pfn where both scanners meet on compaction complete
    (separate test due to enormous tracepoint buffer)
    (zone_start=4096, zone_end=1048576)
    586034 : 654378

    In fact, I didn't fully understand why this patch produces such a good
    result. There was a guess that unused freepages are released to the pcp
    list and on the next compaction trial we won't isolate them again, so the
    compaction success rate would decrease. To prevent this effect, I tested
    with pcp drain code added to release_freepages(), but it had no good
    effect.

    Anyway, this patch reduces the time wasted isolating unneeded freepages,
    so it seems reasonable.

    Vlastimil said:

    : I briefly tried it on top of the pivot-changing series and with order-9
    : allocations it reduced free page scanned counter by almost 10%. No effect
    : on success rates (maybe because pivot changing already took care of the
    : scanners meeting problem) but the scanning reduction is good on its own.
    :
    : It also explains why e14c720efdd7 ("mm, compaction: remember position
    : within pageblock in free pages scanner") had less than expected
    : improvements. It would only actually stop within pageblock in case of
    : async compaction detecting contention. I guess that's also why the
    : infinite loop problem fixed by 1d5bfe1ffb5b affected so relatively few
    : people.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Tested-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • What we want to check here is whether there is a high-order freepage in
    the buddy list of another migratetype, in order to steal it without
    causing fragmentation. But the current code just checks cc->order, which
    is the allocation request order. So this is wrong.

    Without this fix, non-movable synchronous compaction below pageblock
    order would not stop until compaction completes, because the migratetype
    of most pageblocks is movable and high-order freepages made by compaction
    are usually on the movable-type buddy lists.

    There is a report related to this bug; see the link below.

    http://www.spinics.net/lists/linux-mm/msg81666.html

    Although the affected system still has load spikes coming from
    compaction, this change makes the system completely stable and responsive
    according to the reporter.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    doesn't show any notable difference in allocation success rate, but it
    does show a higher compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    18.47 : 28.94

    Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To speed up further allocations SLUB may store empty slabs in per cpu/node
    partial lists instead of freeing them immediately. This prevents
    per-memcg cache destruction, because kmem caches created for a memory
    cgroup are only destroyed after the last page charged to the cgroup is
    freed.

    To fix this issue, this patch resurrects approach first proposed in [1].
    It forbids SLUB to cache empty slabs after the memory cgroup that the
    cache belongs to was destroyed. It is achieved by setting kmem_cache's
    cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
    that it would drop frozen empty slabs immediately if cpu_partial = 0.

    The runtime overhead is minimal. From all the hot functions, we only
    touch relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    for put_cpu_partial() to achieve that though.

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    As for SLAB, there's no problem with it, because it shrinks its
    per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
    longer keep entries for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to bother if a
    per-memcg cache will be shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is supposed to return 0 if the cache has no remaining objects and 1
    otherwise, while currently it always returns 0. Fix it.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but
    also tries to rearrange the partial lists to place slabs filled up most to
    the head to cope with fragmentation. To achieve that, it allocates a
    temporary array of lists used to sort slabs by the number of objects in
    use. If the allocation fails, the whole procedure is aborted.

    This is unacceptable for the kernel memory accounting extension of the
    memory cgroup, where we want to make sure that kmem_cache_shrink()
    successfully discarded empty slabs. Although the allocation failure is
    utterly unlikely with the current page allocator implementation, which
    retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not
    to rely on that. This patch therefore makes __kmem_cache_shrink() sort
    slabs using a fixed-size array allocated on the stack; slabs with more
    than 32 free objects are left at the end of the list without any ordering
    imposed on them.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Now, the only reason to keep kmemcg_id till css free is list_lru, which
    uses it to distribute elements between per-memcg lists. However, it can
    be easily sorted out - we only need to change kmemcg_id of an offline
    cgroup to its parent's id, making further list_lru_add() calls add
    elements to the parent's list, and then move all elements from the
    offline cgroup's list to its parent's list. It will work, because a racing
    list_lru_del() does not need to know the list it is deleting the element
    from. It can decrement the wrong nr_items counter though, but the ongoing
    reparenting will fix it. After list_lru reparenting is done we are free
    to release kmemcg_id saving a valuable slot in a per-memcg array for new
    cgroups.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
    on allocations, so there is no need to have the array entries set until
    css free - we can clear them on css offline. This will allow us to reuse
    array entries more efficiently and avoid costly array relocations.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name
    uniqueness. This is correct, because kmemcg_id is only released on css
    free after destroying all per memcg caches.

    However, I am going to change that and release kmemcg_id on css offline,
    because it is not wise to keep it for so long, wasting valuable entries of
    memcg_cache_params->memcg_caches arrays. Therefore, to preserve cache
    name uniqueness, let us switch to css->id.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we release css->id in css_release_work_fn, right before calling
    css_free callback, so that when css_free is called, the id may have
    already been reused for a new cgroup.

    I am going to use css->id to create unique names for per memcg kmem
    caches. Since kmem caches are destroyed only on css_free, I need css->id
    to be freed after css_free was called to avoid name clashes. This patch
    therefore moves css->id removal to css_free_work_fn. To prevent
    css_from_id from returning a pointer to a stale css, it makes
    css_release_work_fn replace the css ptr at css_idr:css->id with NULL.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it's a bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it will be growing
    beyond any bounds then. I'm going to wipe away dead caches from it to
    save space. To still be able to perform iterations over all memcg caches
    of the same kind, let us link them into a list.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct kmem_cache *memcg_caches[0];
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };

    * Now:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct {
                            struct rcu_head rcu_head;
                            struct kmem_cache *memcg_caches[0];
                    };
                    struct {
                            struct mem_cgroup *memcg;
                            struct kmem_cache *root_cache;
                    };
            };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
    it results in touching one more cache line on kmem alloc/free hot paths.
    Besides, it makes linking kmem caches in a list chained by a field of
    struct memcg_cache_params really painful due to a level of indirection,
    while I want to make them linked in the following patch. That said, let
    us embed it.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • In super_cache_scan() we divide the number of objects of a particular
    type by the total number of objects in order to distribute pressure among
    them. As a result, in some corner cases we can get nr_to_scan=0 even if
    there are some objects to reclaim, e.g. dentries=1, inodes=1,
    fs_objects=1, nr_to_scan=1/3=0.
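
    A tiny worked example of that integer division, using the numbers from
    the sentence above:

    #include <stdio.h>

    int main(void)
    {
            /* One object of each type, as in the example above. */
            long dentries = 1, inodes = 1, fs_objects = 1;
            long total = dentries + inodes + fs_objects;
            long to_scan = 1;       /* objects the shrinker was asked to scan */

            /* The proportional share truncates to zero, so this type
             * would never get scanned. */
            printf("dentry share = %ld\n", to_scan * dentries / total);
            return 0;
    }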

    This is unacceptable for per memcg kmem accounting, because this means
    that some objects may never get reclaimed after memcg death, preventing it
    from being freed.

    This patch therefore assures that super_cache_scan() will scan at least
    one object of each type if any.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Alexander Viro
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov