29 Mar, 2012

10 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton: (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in the page-cache lookup functions with
    the brand-new radix-tree direct iterator. This avoids the double scanning
    and pointer copying.

    The iterator doesn't stop after nr_pages consecutive page-get failures;
    it continues the lookup to the end of the radix tree, so we can safely
    remove those restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0. That
    corner case does not fit into the new code, so the patch adds an extra
    check at the beginning.
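
    As a rough illustration (not the kernel API: gang_lookup(), slots and
    nslots here are hypothetical names over a plain array), the new style is
    a single forward pass that collects up to nr entries, with the
    nr == 0 guard the patch adds:

```c
#include <stddef.h>

/* Single-pass lookup over a slot array: collect up to nr non-NULL entries
 * starting at *indexp, advancing *indexp past the last hit. */
static size_t gang_lookup(void **slots, size_t nslots, size_t *indexp,
                          void **results, size_t nr)
{
    size_t found = 0;

    if (nr == 0)                /* the corner case guarded at entry */
        return 0;

    for (size_t i = *indexp; i < nslots && found < nr; i++) {
        if (slots[i] == NULL)   /* skip empty slots, no restart needed */
            continue;
        results[found++] = slots[i];
        *indexp = i + 1;
    }
    return found;
}
```

    A second call with the same index variable continues where the first
    stopped, which is why no restart logic is needed.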

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Calculate a cpumask of CPUs with per-cpu pages in any zone, and only
    send an IPI requesting that CPUs drain these pages to the buddy
    allocator if they actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs asking to drain per-cpu pages under
    severe memory pressure that leads to OOM. In these cases multiple,
    possibly concurrent, allocation requests end up in the direct reclaim
    code path, so the per-cpu pages are reclaimed on the first allocation
    failure, and for most of the following allocation attempts, until the
    memory pressure is off (possibly via the OOM killer), there are no
    per-cpu pages on most CPUs (and there can easily be hundreds of them).

    This also has the side effect of shortening the average latency of
    direct reclaim by one or more orders of magnitude, since waiting for all
    the CPUs to ACK the IPI takes a long time.

    Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than half of the online CPUs had
    any per-cpu pages, using the vmstat counters introduced in the next
    patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global IPIs
    after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path as a patch from Mel Gorman currently suggests this should be replaced
    with an on_each_cpu_cond invocation.
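
    The selection logic can be sketched in plain C; build_drain_mask() and
    the per-CPU counts below are hypothetical stand-ins for the kernel's
    cpumask machinery:

```c
#include <stddef.h>

/* Build a mask of CPUs that actually hold per-cpu pages; only these
 * would be sent the drain IPI. Returns how many CPUs were selected. */
static unsigned int build_drain_mask(const unsigned int *pcp_count,
                                     unsigned int ncpus, unsigned long *mask)
{
    unsigned int cpu, selected = 0;

    *mask = 0;
    for (cpu = 0; cpu < ncpus; cpu++) {
        if (pcp_count[cpu] == 0)
            continue;           /* nothing to drain: skip the IPI */
        *mask |= 1UL << cpu;
        selected++;
    }
    return selected;
}
```

    The check is racy by nature, but a spurious or missed bit only costs
    one needless (harmless) IPI, which is the trade-off the patch accepts.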

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache being
    destroyed dynamically ends up sending an IPI to each CPU in the system,
    regardless if the cache has ever been used there.

    For example, if you close the InfiniBand ipath driver char device file,
    the close file op calls kmem_cache_destroy(). So running some InfiniBand
    config tool on a single CPU dedicated to system tasks might interrupt
    the rest of the 127 CPUs dedicated to some CPU-intensive or
    latency-sensitive task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch attempts to rectify this issue by sending an IPI to flush the
    per cpu objects back to the free lists only to CPUs that seem to have such
    objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking
    a CPU without per-cpu objects to flush does no damage, and as far as I
    can tell flush_all by itself is racy against allocs on remote CPUs
    anyway; so if you required flush_all to be deterministic, you had to
    arrange for locking regardless.

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The memory-allocation-failure code path for the CPUMASK_OFFSTACK=y
    config was tested using the fault injection framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
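
    A minimal sketch of such a flags check, using the SWAP_FLAG_* values
    from the 3.4-era <linux/swap.h> (check_swap_flags() itself is a
    hypothetical helper, returning -EINVAL as -22):

```c
/* Values as defined in the 3.4-era <linux/swap.h>. */
#define SWAP_FLAG_PREFER    0x8000  /* set if swap priority is specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD   0x10000 /* discard swap cluster after use */

#define SWAP_FLAGS_VALID \
    (SWAP_FLAG_DISCARD | SWAP_FLAG_PREFER | SWAP_FLAG_PRIO_MASK)

/* Reject any flag bit the kernel does not know about, so userspace can
 * probe for new flags by checking for -EINVAL. */
static int check_swap_flags(unsigned int swap_flags)
{
    if (swap_flags & ~SWAP_FLAGS_VALID)
        return -22;             /* -EINVAL */
    return 0;
}
```

    With this in place, a future flag that an older kernel doesn't know
    about fails cleanly instead of being silently ignored.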

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The size of coredump files is limited by RLIMIT_CORE; however, allocating
    large amounts of memory results in three negative consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pmd_trans_unstable() should be called before pmd_offset_map() in the
    locations where the mmap_sem is held for reading.

    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Ulrich Obergfell
    Cc: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.
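
    The awkward argument mismatch is that truncate_inode_pages_range()
    takes an inclusive last byte while unmap_mapping_range() wants a
    page-aligned start and length. A sketch of the conversion arithmetic
    (unmap_args() is a hypothetical helper; PAGE_SIZE assumed to be 4096):

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Convert an inclusive byte range [lstart, lend] into the page-aligned
 * (start, length) pair an unmap pass would take. Partial pages at either
 * end are left to the truncate pass; length is 0 if no whole page lies
 * inside the range. */
static void unmap_args(unsigned long lstart, unsigned long lend,
                       unsigned long *start, unsigned long *len)
{
    /* round lstart up, and (lend + 1) down, to page boundaries */
    unsigned long unmap_start = (lstart + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    unsigned long unmap_end = (lend + 1) & ~(PAGE_SIZE - 1);

    *start = unmap_start;
    *len = unmap_end > unmap_start ? unmap_end - unmap_start : 0;
}
```

    A hole entirely inside one page yields a zero length, so the unmap pass
    is skipped and the truncate pass alone handles the partial page.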

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull SLAB changes from Pekka Enberg:
    "There's the new kmalloc_array() API, minor fixes and performance
    improvements, but quite honestly, nothing terribly exciting."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: SLAB Out-of-memory diagnostics
    slab: introduce kmalloc_array()
    slub: per cpu partial statistics change
    slub: include include for prefetch
    slub: Do not hold slub_lock when calling sysfs_slab_add()
    slub: prefetch next freelist pointer in slab_alloc()
    slab, cleanup: remove unneeded return
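
    The kmalloc_array() mentioned above adds an overflow check to array
    allocation. A userspace sketch of the same idea (alloc_array() is a
    hypothetical stand-in using malloc() instead of kmalloc()):

```c
#include <stdint.h>
#include <stdlib.h>

/* Overflow-checked array allocation: refuse n * size when the product
 * would wrap around, instead of silently allocating a short buffer. */
static void *alloc_array(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;            /* n * size would overflow */
    return malloc(n * size);
}
```

    Callers that previously open-coded kmalloc(n * size, ...) get the wrap
    check for free by switching to this form.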

    Linus Torvalds
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended up dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

25 Mar, 2012

1 commit


24 Mar, 2012

4 commits

  • Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
    for a 'VM_NODUMP' flag. The idea is to add a new madvise() flag,
    MADV_DONTDUMP, which can be set by applications to specifically request
    memory regions which should not dump core.

    The specific application I have in mind is qemu: we can add a flag there
    that wouldn't dump all of guest memory when qemu dumps core. This flag
    might also be useful for security sensitive apps that want to absolutely
    make sure that parts of memory are not dumped. To clear the flag use:
    MADV_DODUMP.

    [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
    [akpm@linux-foundation.org: fix up the architectures which broke]
    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • The motivation for this patchset was that I was looking at a way for a
    qemu-kvm process, to exclude the guest memory from its core dump, which
    can be quite large. There are already a number of filter flags in
    /proc//coredump_filter, however, these allow one to specify 'types'
    of kernel memory, not specific address ranges (which is needed in this
    case).

    Since there are no more vma flags available, the first patch eliminates
    the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
    the kernel to mark vdso and vsyscall pages. However, it is simple
    enough to check if a vma covers a vdso or vsyscall page without the need
    for this flag.

    The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
    'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
    'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
    continue to work the same as before unless 'MADV_DONTDUMP' is set on the
    region.

    The qemu code which implements this feature is at:

    http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

    In my testing, the qemu core dump shrank from 383MB to 13MB with this
    patch.

    I also believe that the 'MADV_DONTDUMP' flag might be useful for
    security sensitive apps, which might want to select which areas are
    dumped.

    This patch:

    The VM_ALWAYSDUMP flag is currently used by the coredump code to
    indicate that a vma is part of a vsyscall or vdso section. However, we
    can determine if a vma is in one of these sections by checking it
    against the gate_vma and checking for a non-NULL return value from
    arch_vma_name(), thus freeing a valuable vma bit.

    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
    of force_sig(SIGKILL). With the recent changes we do not need force_ to
    kill the CLONE_NEWPID tasks.

    And this is more correct: force_sig() can race with the exiting thread
    even if oom_kill_task() checks p->mm != NULL, while
    do_send_sig_info(group => true) kills the whole process.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove code duplication in __unmap_hugepage_range(), such as the
    repeated calls to pte_page() and huge_pte_none().

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

23 Mar, 2012

4 commits

  • Adjusting cc715d99e529 "mm: vmscan: forcibly scan highmem if there are
    too many buffer_heads pinning highmem" for -stable reveals that it was
    slightly wrong once on top of fe2c2a106663 "vmscan: reclaim at order 0
    when compaction is enabled", which specifically adds testorder for the
    zone_watermark_ok_safe() test.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     
  • Pull MCE changes from Ingo Molnar.

    * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
    x86/mce: Convert static array of pointers to per-cpu variables
    x86/mce: Replace hard coded hex constants with symbolic defines
    x86/mce: Recognise machine check bank signature for data path error
    x86/mce: Handle "action required" errors
    x86/mce: Add mechanism to safely save information in MCE handler
    x86/mce: Create helper function to save addr/misc when needed
    HWPOISON: Add code to handle "action required" errors.
    HWPOISON: Clean up memory_failure() vs. __memory_failure()

    Linus Torvalds
     
  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

21 commits

  • Export 'dirty_writeback_interval' to make it visible to file-systems.
    We are going to push superblock management down to file-systems and get
    rid of the 'sync_supers' kernel thread completely.

    Signed-off-by: Artem Bityutskiy
    Cc: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Artem Bityutskiy
     
  • Currently we can't do task migration among memory cgroups without THP
    split, which means processes heavily using THP experience large overhead
    in task migration. This patch introduces the code for moving charge of
    THP and makes THP more valuable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • These macros will be used in a later patch, where all usages are expected
    to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
    detect unexpected usages, we convert the existing BUG() to BUILD_BUG().

    [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • - Replace the lengthy function name is_target_pte_for_mc() with a
    shorter one in order to avoid ugly line breaks.

    - Explicitly use the MC_TARGET_* names instead of bare integers.

    Signed-off-by: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • In the following code:

        if (type == _MEM)
                thresholds = &memcg->thresholds;
        else if (type == _MEMSWAP)
                thresholds = &memcg->memsw_thresholds;
        else
                BUG();

        BUG_ON(!thresholds);

    The BUG_ON() seems redundant.

    Signed-off-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • A grammatical fix.

    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mem_cgroup_begin_update_page_stat() should be very fast because it's
    called very frequently. Right now it needs to look up the page_cgroup
    and its memcg, and this is slow.

    This patch adds a global variable to check whether "any memcg is moving
    or not". With this, the caller doesn't need to visit the page_cgroup
    and memcg.

    Here is a test result. A test program maps a file MAP_SHARED, takes
    page faults so that each page's page_mapcount(page) > 1, frees the
    range by madvise(), and faults the pages in again. This program causes
    26,214,400 page faults onto a 1GB file and shows the cost of
    mem_cgroup_begin_update_page_stat().

    Before this patch for mem_cgroup_begin_update_page_stat()

    [kamezawa@bluextal test]$ time ./mmap 1G

    real 0m21.765s
    user 0m5.999s
    sys 0m15.434s

    27.46% mmap mmap [.] reader
    21.15% mmap [kernel.kallsyms] [k] page_fault
    9.17% mmap [kernel.kallsyms] [k] filemap_fault
    2.96% mmap [kernel.kallsyms] [k] __do_fault
    2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat

    After this patch

    [root@bluextal test]# time ./mmap 1G

    real 0m21.373s
    user 0m6.113s
    sys 0m15.016s

    In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.

    Note: we may be able to remove this optimization in future if
    we can get pointer to memcg directly from struct page.
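
    The global-flag idea can be sketched with C11 atomics; the names below
    (memcg_moving, need_move_lock()) are illustrative stand-ins, not the
    kernel's exact symbols:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Counts charge movers in flight anywhere in the system. */
static atomic_int memcg_moving;

static void move_charge_begin(void) { atomic_fetch_add(&memcg_moving, 1); }
static void move_charge_end(void)   { atomic_fetch_sub(&memcg_moving, 1); }

/* Fast path for page-stat updaters: when no mover is active, skip the
 * page_cgroup/memcg lookup and the heavyweight lock entirely. */
static bool need_move_lock(void)
{
    return atomic_load(&memcg_moving) > 0;
}
```

    Since moving charges is rare, the common case is a single atomic read
    that returns false, which is what makes the begin/end pair cheap.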

    [akpm@linux-foundation.org: don't return a void]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • With the new lock scheme for updating memcg's page stat, we don't need
    the PCG_FILE_MAPPED flag, which duplicated the information of
    page_mapped().

    [hughd@google.com: cosmetic fix]
    [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, per-memcg page stats are recorded in per-page_cgroup flags by
    duplicating the page's status into the flag. The reason is that memcg
    has a feature to move a page from one group to another, and we have a
    race between "move" and "page stat accounting".

    Under the current logic, assume CPU-A does "move" and CPU-B does "page
    stat accounting".

    When CPU-A goes first:

                 CPU-A                          CPU-B
                                        update "struct page" info.
        move_lock_mem_cgroup(memcg)
        see pc->flags
        copy page stat to new group
        overwrite pc->mem_cgroup.
        move_unlock_mem_cgroup(memcg)
                                        move_lock_mem_cgroup(mem)
                                        set pc->flags
                                        update page stat accounting
                                        move_unlock_mem_cgroup(mem)

    Stat accounting is guarded by move_lock_mem_cgroup(), and the "move"
    logic (CPU-A) doesn't see changes in the "struct page" information.

    But it's costly to have the same information both in 'struct page' and
    'struct page_cgroup'. And, there is a potential problem.

    For example, assume we have PG_dirty accounting in memcg.
    PG_* is a flag for struct page.
    PCG_* is a flag for struct page_cgroup.
    (This is just an example; the same problem can be found in any
    kind of page stat accounting.)

                 CPU-A                          CPU-B
        TestSet PG_dirty
        (delay)                         TestClear PG_dirty
                                        if (TestClear(PCG_dirty))
                                            memcg->nr_dirty--
        if (TestSet(PCG_dirty))
            memcg->nr_dirty++

    Here, memcg->nr_dirty = +1; this is wrong. This race was reported by
    Greg Thelen. Now, only FILE_MAPPED is supported, but fortunately it's
    serialized by the page table lock, so this is not a real bug, _now_.

    If this potential problem is caused by having duplicated information in
    struct page and struct page_cgroup, we may be able to fix it by using
    the original 'struct page' information. But then we'll have a problem
    in "move account".

    Assume we use only PG_dirty.

                 CPU-A                          CPU-B
        TestSet PG_dirty
        (delay)                         move_lock_mem_cgroup()
                                        if (PageDirty(page))
                                            new_memcg->nr_dirty++
                                        pc->mem_cgroup = new_memcg;
                                        move_unlock_mem_cgroup()
        move_lock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty++

    The accounting information may be double-counted. This was the original
    reason to have the PCG_xxx flags, but it seems PCG_xxx has another
    problem.

    I think we need a bigger lock as

    move_lock_mem_cgroup(page)
    TestSetPageDirty(page)
    update page stats (without any checks)
    move_unlock_mem_cgroup(page)

    This fixes both problems, and we don't have to duplicate the page flag
    in page_cgroup. Please note: move_lock_mem_cgroup() is held only when
    there is a possibility of "account move" in the system, so on most
    paths the status update will go without atomic locks.

    This patch introduces mem_cgroup_begin_update_page_stat() and
    mem_cgroup_end_update_page_stat(); both should be called when modifying
    'struct page' information that memcg takes care of, as:

        mem_cgroup_begin_update_page_stat()
            modify page information
            mem_cgroup_update_page_stat()
                => never check any 'struct page' info, just update counters.
        mem_cgroup_end_update_page_stat()

    This patch is slow because we need to call begin_update_page_stat()/
    end_update_page_stat() regardless of whether the accounting will change.
    A following patch adds an easy optimization and reduces the cost.

    [akpm@linux-foundation.org: s/lock/locked/]
    [hughd@google.com: fix deadlock by avoiding stat lock when anon]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • PCG_MOVE_LOCK is used as a bit spinlock to avoid races between
    overwriting pc->mem_cgroup and the per-memcg page statistics
    accounting. This lock avoids the race, but the race is very rare
    because moving tasks between cgroups is not a usual job, so spending
    one bit per page on it seems too costly.

    This patch changes the lock to a per-memcg spinlock and removes
    PCG_MOVE_LOCK.

    If a finer-grained lock is required, we'll be able to add some hashing,
    but I'd like to start with this.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In memcg, to avoid taking a lock with irqs off when accessing
    page_cgroup, a "flag + rcu_read_lock()" scheme is used. It works as
    follows:

                 CPU-A                          CPU-B
                                        rcu_read_lock()
        set flag
                                        if (flag is set)
                                            take heavy lock
                                        do job.
        synchronize_rcu()               rcu_read_unlock()
        take heavy lock.

    In recent discussion, it was argued that using a per-cpu value for this
    flag just complicates the code because "set flag" is very rare.

    This patch changes the 'flag' implementation from per-cpu to atomic_t,
    which is much simpler.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Cc: "Paul E. McKenney"
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • As described in the log, I guess the EXPORT was in preparation for
    dirty accounting. But we don't need to export this _now_, so remove it
    for now.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • We record "the page is cache" with the PCG_CACHE bit in page_cgroup.
    Here, "CACHE" means page-cache pages (and SwapCache); it doesn't
    include shmem.

    Considering the callers: at charge/uncharge time the caller already
    knows what the page is, so we don't need to record it using one bit per
    page.

    This patch removes the PCG_CACHE bit and makes callers of
    mem_cgroup_charge_statistics() specify what the page is.

    About page migration: the mapping of the used page is not touched
    during migration (see page_remove_rmap), so we can rely on it and push
    the correct charge type down to __mem_cgroup_uncharge_common() from
    end_migration() for the unused page. The force flag was misleadingly
    abused for skipping the needless page_mapped() / PageCgroupMigration()
    check, as we know the unused page is no longer mapped and cleared the
    migration flag just a few lines up. But doing the checks is no biggie,
    and it's not worth adding another flag just to skip them.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hughd@google.com: fix PageAnon uncharging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit e94c8a9cbce1 ("memcg: make mem_cgroup_split_huge_fixup() more
    efficient") removed move_lock_page_cgroup(), so we do not have to check
    PageTransHuge in mem_cgroup_update_page_stat() and fall back into the
    locked accounting, because both move_account() and thp split are done
    with compound_lock and so cannot race.

    The race between update vs. move is protected by mem_cgroup_stealed.

    PageTransHuge pages shouldn't appear in this code path currently because
    we are tracking only file pages at the moment but later we are planning
    to track also other pages (e.g. mlocked ones).

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remove redundant returns from ends of functions, and one blank line.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to
    obscure the LRU counts. For easier searching? So call it lru_size
    rather than bare count (lru_length sounds better, but would be wrong,
    since each huge page raises lru_size hugely).

    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace mem and mem_cont stragglers in memcontrol.c by memcg.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When swapon() is not passed the SWAP_FLAG_DISCARD option, sys_swapon()
    will still perform a discard operation. This can cause problems if
    discard is slow or buggy.

    Reverse the order of the check so that a discard operation is performed
    only if the sys_swapon() caller is attempting to enable discard.
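
    The fix is an opt-in check; sketched with a hypothetical helper where a
    counter stands in for the real discard pass:

```c
#define SWAP_FLAG_DISCARD 0x10000   /* value from <linux/swap.h> */

/* Perform the potentially slow discard only when the caller opted in. */
static int maybe_discard(unsigned int swap_flags,
                         unsigned int *discards_done)
{
    if (!(swap_flags & SWAP_FLAG_DISCARD))
        return 0;               /* no flag: never touch the device */
    (*discards_done)++;         /* stand-in for the real discard pass */
    return 0;
}
```

    Without the flag the device is never touched, which is exactly the
    behavior change the patch makes.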

    Signed-off-by: Shaohua Li
    Reported-by: Holger Kiehl
    Tested-by: Holger Kiehl
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
    RECLAIM_MODE_ASYNC. (The sync/async flag is used only in
    shrink_page_list() and does not affect shrink_active_list().)

    Currently shrink_active_list() sometimes works in lumpy-reclaim mode if
    RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
    shrink_inactive_list(). Meanwhile, in age_active_anon(),
    sc->reclaim_mode is totally zero. So the current behavior is too
    complex and confusing, and this looks like a bug.

    In general, shrink_active_list() populates the inactive list for the
    next shrink_inactive_list(). Lumpy shrink_inactive_list() isolates
    pages around the chosen one from both the active and inactive lists.
    So there is no reason for lumpy isolation in shrink_active_list().

    See also: https://lkml.org/lkml/2012/3/15/583

    Signed-off-by: Konstantin Khlebnikov
    Proposed-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov