15 Nov, 2013

1 commit

  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new API is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are
    different.
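
    A minimal sketch of what a caller looks like after the change,
    assuming the 3.13-era prototype int pmd_trans_huge_lock(pmd_t *pmd,
    struct vm_area_struct *vma, spinlock_t **ptl); the wrapper name is
    hypothetical:

    static void example_huge_pmd_op(struct vm_area_struct *vma, pmd_t *pmd)
    {
            spinlock_t *ptl;

            if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
                    /* The huge pmd is stable here; ptl is whichever lock
                     * pmd_trans_huge_lock() actually took (the split
                     * per-table lock or mm->page_table_lock). */
                    /* ... inspect or modify the huge pmd ... */
                    spin_unlock(ptl);
            }
    }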

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

8 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packets or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various data structures as
    fundamental operations. For example, sets are supported, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize the rules before handing them to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfilter, amongst other things.

    In particular the taus88 generator is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets, the main goals being the ability
    to use RCU lookups even on request sockets and the elimination of
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in its fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stacking of segmentation offloads so that, in particular,
    segmentation offloading over tunnels is possible. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer now takes care to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    clear it explicitly. Many drivers have been cleaned up in this way,
    from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.
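
    To make items 7 and 17 above concrete, here is a hedged userspace
    sketch; the #ifndef fallbacks carry the option values from the
    3.13-era uapi headers (only needed when building against older
    headers), and the pacing rate is an arbitrary example value:

    #include <stdio.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef SO_MAX_PACING_RATE
    #define SO_MAX_PACING_RATE 47         /* asm-generic/socket.h, 3.13+ */
    #endif
    #ifndef IP_PMTUDISC_INTERFACE
    #define IP_PMTUDISC_INTERFACE 4       /* linux/in.h, 3.13+ */
    #endif

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            unsigned int rate = 1 << 20;        /* item 7: cap pacing at ~1 MB/s */
            int pmtu = IP_PMTUDISC_INTERFACE;   /* item 17: always use the interface MTU */

            if (fd < 0) {
                    perror("socket");
                    return 1;
            }
            if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                           &rate, sizeof(rate)) < 0)
                    perror("SO_MAX_PACING_RATE");
            if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                           &pmtu, sizeof(pmtu)) < 0)
                    perror("IP_MTU_DISCOVER");
            return 0;
    }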

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Not too much activity this time around. css_id is finally killed and
    a minor update to device_cgroup"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    device_cgroup: remove can_attach
    cgroup: kill css_id
    memcg: stop using css id
    memcg: fail to create cgroup if the cgroup id is too big
    memcg: convert to use cgroup id
    memcg: convert to use cgroup_is_descendant()

    Linus Torvalds
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • The memory.numa_stat file was not hierarchical. Memory charged to the
    children was not shown in parent's numa_stat.

    This change adds the "hierarchical_" stats to the existing stats. The
    new hierarchical stats include the sum of all children's values in
    addition to the value of the memcg.

    Tested: Create cgroup a, a/b and run workload under b. The values of
    b are included in the "hierarchical_*" under a.

    $ cd /sys/fs/cgroup
    $ echo 1 > memory.use_hierarchy
    $ mkdir a a/b

    Run workload in a/b:
    $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

    The hierarchical_ fields in parent (a) show use of workload in a/b:
    $ cat a/memory.numa_stat
    total=0 N0=0 N1=0 N2=0 N3=0
    file=0 N0=0 N1=0 N2=0 N3=0
    anon=0 N0=0 N1=0 N2=0 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    $ cat a/b/memory.numa_stat
    total=908 N0=552 N1=317 N2=39 N3=0
    file=850 N0=549 N1=301 N2=0 N3=0
    anon=58 N0=3 N1=16 N2=39 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    Signed-off-by: Ying Han
    Signed-off-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Refactor mem_control_numa_stat_show() to use a new stats structure for
    smaller and simpler code. This consolidates nearly identical code.

         text       data        bss         dec     hex  filename
    8,137,679  1,703,496  1,896,448  11,737,623  b31a17  vmlinux.before
    8,136,911  1,703,496  1,896,448  11,736,855  b31717  vmlinux.after
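
    A sketch of the kind of descriptor table such a consolidation
    implies (names follow mm/memcontrol.c conventions but are meant as
    illustration, not the exact patch):

    struct numa_stat {
            const char *name;
            unsigned int lru_mask;
    };

    static const struct numa_stat stats[] = {
            { "total",       LRU_ALL },
            { "file",        LRU_ALL_FILE },
            { "anon",        LRU_ALL_ANON },
            { "unevictable", BIT(LRU_UNEVICTABLE) },
    };

    /* The show function then loops over stats[] instead of open-coding
     * one nearly identical block per row. */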

    Signed-off-by: Greg Thelen
    Signed-off-by: Ying Han
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Use helper function to check if we need to deal with oom condition.

    Signed-off-by: Qiang Huang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     

05 Nov, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be.h
    drivers/net/netconsole.c
    net/bridge/br_private.h

    Three mostly trivial conflicts.

    The net/bridge/br_private.h conflict was a function signature (argument
    addition) change overlapping with the extern removals from Joe Perches.

    In drivers/net/netconsole.c we had one change adjusting a printk message
    whilst another changed "printk(KERN_INFO" into "pr_info(".

    Lastly, the emulex change was a new inline function addition overlapping
    with Joe Perches's extern removals.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Nov, 2013

1 commit

  • When a memcg is deleted mem_cgroup_reparent_charges() moves charged
    memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
    cgroup writeback pages accounting" there's a bad pointer read. The
    goal was to check for counter underflow. The counter is a per cpu
    counter and there are two problems with the code:

    (1) a per cpu access function isn't used; instead a naked pointer is
    used, which easily causes an oops.
    (2) the check doesn't sum over all cpus.

    Test:
    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ echo 3 > /proc/sys/vm/drop_caches
    $ (echo $BASHPID >> x/tasks && exec cat) &
    [1] 7154
    $ grep ^mapped x/memory.stat
    mapped_file 53248
    $ echo 7154 > tasks
    $ rmdir x

    The fix is to remove the check. It's currently dangerous and isn't
    worth fixing to use something expensive, such as
    percpu_counter_sum(), for each reparented page. __this_cpu_read()
    isn't enough to fix this because there are no guarantees about any
    single cpu's count. The only guarantee is that the sum of all
    per-cpu counters is >= nr_pages.
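
    For illustration, a correct underflow check would have to sum every
    cpu's slot, roughly as below (the per-cpu layout names are
    assumptions based on the memcg code of that era); this is exactly
    the per-reparented-page cost the fix avoids by dropping the check:

    long sum = 0;
    int cpu;

    for_each_possible_cpu(cpu)
            sum += per_cpu_ptr(memcg->stat, cpu)->count[idx];
    /* Only this sum is guaranteed to be >= nr_pages; no single cpu's
     * slot is, so neither a naked pointer nor __this_cpu_read() works. */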

    Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
    Reported-and-tested-by: Flavio Leitner
    Signed-off-by: Greg Thelen
    Reviewed-by: Sha Zhengju
    Acked-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

01 Nov, 2013

3 commits

  • When memcg code needs to know whether any given memcg has children, it
    uses the cgroup child iteration primitives and returns true/false
    depending on whether the iteration loop is executed at least once or
    not.

    Because a cgroup's list of children is RCU protected, these primitives
    require the RCU read-lock to be held, which is not the case for all
    memcg callers. This results in the following splat when e.g. enabling
    hierarchy mode:

    WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
    CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
    Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
    Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

    In the memcg case, we only care about children when we are attempting to
    modify inheritable attributes interactively. Racing with deletion could
    mean a spurious -EBUSY, no problem. Racing with addition is handled
    just fine as well through the memcg_create_mutex: if the child group is
    not on the list after the mutex is acquired, it won't be initialized
    from the parent's attributes until after the unlock.
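
    Given that races don't matter here, one way to satisfy the
    iterator's locking requirement is simply to take the RCU read lock
    around the emptiness check; a hedged sketch (the helper name is
    illustrative):

    static bool memcg_has_children(struct mem_cgroup *memcg)
    {
            bool ret;

            rcu_read_lock();
            ret = css_next_child(NULL, &memcg->css) != NULL;
            rcu_read_unlock();
            return ret;
    }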

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM lock is a mutex-type lock that is open-coded due to
    memcg's special needs. Add annotations for lockdep coverage.
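
    A hedged sketch of what such annotations look like for an open-coded
    mutex-type lock; the macro arguments follow the 3.12-era lockdep API
    and should be treated as illustrative:

    #ifdef CONFIG_LOCKDEP
    static struct lockdep_map memcg_oom_lock_dep_map = {
            .name = "memcg_oom_lock",
    };
    #endif

    /* after successfully taking the open-coded OOM lock */
    mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);

    /* just before releasing it */
    mutex_release(&memcg_oom_lock_dep_map, 1, _RET_IP_);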

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
    fail to reclaim enough memory for the charge. But because the main test
    case was on a 3.2-based system, the patch missed the fact that on newer
    kernels the charge function needs to return root_mem_cgroup when
    bypassing the limit, and not NULL. This will corrupt whatever memory is
    at NULL + percpu pointer offset. Fix this quickly before problems are
    reported.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Oct, 2013

1 commit

  • As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages
    accounting") memcg counter errors are possible when moving charged
    memory to a different memcg. Charge movement occurs when processing
    writes to memory.force_empty, moving tasks to a memcg with
    memcg.move_charge_at_immigrate=1, or memcg deletion.

    An example showing error after memory.force_empty:

    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ rm /data/tmp/file
    $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
    [1] 13600
    $ grep ^mapped x/memory.stat
    mapped_file 1048576
    $ echo 13600 > tasks
    $ echo 1 > x/memory.force_empty
    $ grep ^mapped x/memory.stat
    mapped_file 4503599627370496

    mapped_file should end with 0.
    4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
    1048576 == 0x10,0000 == 0x100 pages

    This issue only affects the source memcg on 64 bit machines; the
    destination memcg counters are correct. So the rmdir case is not too
    important because such counters are soon disappearing with the entire
    memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1
    cases are larger problems as the bogus counters are visible for the
    (possibly long) remaining life of the source memcg.

    The problem is due to memcg's use of __this_cpu_add(.., -nr_pages), which
    is subtly wrong because it subtracts the unsigned int nr_pages (either
    -1 or -512 for THP) from a signed long percpu counter. When
    nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx]
    is signed 64 bit. So memcg's attempt to simply decrement a count (e.g.
    from 1 to 0) boils down to:

    long count = 1
    unsigned int nr_pages = 1
    count += -nr_pages /* -nr_pages == 0xffff,ffff */
    count is now 0x1,0000,0000 instead of 0

    The fix is to subtract the unsigned page count rather than adding its
    negation. This only works once "percpu: fix this_cpu_sub() subtrahend
    casting for unsigneds" is applied to fix this_cpu_sub().

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

17 Oct, 2013

3 commits

  • Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.

    The most immediate problem with this is that the allocation may fail due
    to a memory cgroup limit, where flushers + direct reclaim might not make
    any progress towards resolving the situation at all. Because unlike the
    global case, a memory cgroup may not have any cache at all, only
    anonymous pages but no swap. This situation will lead to a reclaim
    livelock with insane IO from waking the flushers and thrashing unrelated
    filesystem cache in a tight loop.

    Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.
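
    The gist, sketched against the buffer-page allocation site (the
    exact flag plumbing in the real patch may differ):

    /* Instead of looping around a failing GFP_NOFS allocation and
     * kicking the flushers from fs/buffer.c, tell the page allocator
     * that this caller cannot handle failure and let it do the
     * looping. */
    gfp_t gfp = (mapping_gfp_mask(mapping) & ~__GFP_FS) | __GFP_NOFAIL;
    page = find_or_create_page(mapping, index, gfp);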

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. This simplifies the code quite a bit for
    added bonus.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
    cpu_online_mask doesn't change during the iteration.

    cpu_hotplug.lock is held while a cpu is going down; it's a coarse lock
    that is used kernel-wide to synchronize cpu hotplug activity. Memcg has
    a cpu hotplug notifier, called while there may not be any cpu hotplug
    refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
    to maintain a cumulative event count as cpus disappear. Without
    get_online_cpus() in mem_cgroup_read_events(), it's possible to account
    for the event count on a dying cpu twice, and this value may be
    significantly large.

    In fact, all memcg->pcp_counter_lock use should be nested by
    {get,put}_online_cpus().

    This fixes that issue and ensures the reported statistics are not vastly
    over-reported during cpu hotplug.
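
    The fix pattern, sketched (field names follow the memcg per-cpu
    statistics of that era):

    unsigned long val = 0;
    int cpu;

    get_online_cpus();
    for_each_online_cpu(cpu)
            val += per_cpu(memcg->stat->events[idx], cpu);
    #ifdef CONFIG_HOTPLUG_CPU
    spin_lock(&memcg->pcp_counter_lock);
    val += memcg->nocpu_base.events[idx];
    spin_unlock(&memcg->pcp_counter_lock);
    #endif
    put_online_cpus();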

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Sep, 2013

9 commits

  • Add memcg routines to count writeback pages; later, dirty pages will
    also be accounted.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use the 'struct page' flag to test page state
    instead of a per-page_cgroup flag. But memcg has a feature to move a
    page from one cgroup to another and may have a race between "move" and
    "page stat accounting". So in order to avoid the race we have designed
    a new lock:

    mem_cgroup_begin_update_page_stat()
    modify page information -->(a)
    mem_cgroup_update_page_stat() -->(b)
    mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (writeback pages accounting) to be
    protected by mem_cgroup_{begin/end}_update_page_stat(). It's a full
    no-op for !CONFIG_MEMCG, almost a no-op if memcg is disabled (but
    compiled in), an rcu read lock in most cases (no task is moving), and
    spin_lock_irqsave on top in the slow path.

    There are two writeback interfaces to modify:
    test_{clear/set}_page_writeback(). The lock order is:
      memcg->move_lock
        --> mapping->tree_lock
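
    A hedged sketch of the protocol applied to the writeback bit; in the
    real patch the accounting sits inside test_{clear/set}_page_writeback()
    itself, and the helper and stat names here are illustrative:

    void example_set_page_writeback(struct page *page)
    {
            bool locked;
            unsigned long flags;

            mem_cgroup_begin_update_page_stat(page, &locked, &flags);
            if (!test_set_page_writeback(page))          /* (a) modify page info */
                    mem_cgroup_inc_page_stat(page,       /* (b) account the change */
                                             MEM_CGROUP_STAT_WRITEBACK);
            mem_cgroup_end_update_page_stat(page, &locked, &flags);
    }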

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • We should call mem_cgroup_begin_update_page_stat() before
    mem_cgroup_update_page_stat() to get proper locks; however, the latter
    doesn't do any checking that we use proper locking, which would be
    hard. As suggested by Michal Hocko, we could at least test for
    rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled().

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • While accounting memcg page stats, it's not worth using
    MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
    complexity and presumed performance overhead. We can use
    MEM_CGROUP_STAT_FILE_MAPPED directly.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Fengguang Wu
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.
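
    A simplified sketch of the control flow described in points 1 and 2;
    function and field names are illustrative rather than the exact
    patch:

    /* charge path: do not loop or sleep, just note who OOMed */
    if (charge_failed && oom_may_be_invoked) {
            current->memcg_oom.memcg = memcg;   /* remembered in task_struct */
            current->memcg_oom.gfp_mask = gfp_mask;
            return -ENOMEM;                     /* unwind the fault stack */
    }

    /* fault epilogue, with no locks held any more */
    void pagefault_out_of_memory(void)
    {
            if (mem_cgroup_oom_synchronize())   /* kill or sleep, as appropriate */
                    return;                     /* the fault is then restarted */
            /* otherwise fall back to the global OOM killer */
    }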

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM handler open-codes a sleeping lock for OOM serialization
    (trylock, wait, repeat) because the required locking is so specific to
    memcg hierarchies. However, it would be nice if this construct would be
    clearly recognizable and not be as obfuscated as it is right now. Clean
    up as follows:

    1. Remove the return value of mem_cgroup_oom_unlock()

    2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

    3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
    makes it more obvious that the task has to be on the waitqueue
    before attempting to OOM-trylock the hierarchy, to not miss any
    wakeups before going to sleep. It just didn't matter until now
    because it was all lumped together into the global memcg_oom_lock
    spinlock section.

    4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
    It is protected by the hierarchical OOM lock.

    5. The memcg_oom_lock spinlock is only required to propagate the OOM
    lock in any given hierarchy atomically. Restrict its scope to
    mem_cgroup_oom_(trylock|unlock).

    6. Do not wake up the waitqueue unconditionally at the end of the
    function. Only the lockholder has to wake up the next in line
    after releasing the lock.

    Note that the lockholder kicks off the OOM-killer, which in turn
    leads to wakeups from the uncharges of the exiting task. But a
    contender is not guaranteed to see them if it enters the OOM path
    after the OOM kills but before the lockholder releases the lock.
    Thus there has to be an explicit wakeup after releasing the lock.

    7. Put the OOM task on the waitqueue before marking the hierarchy as
    under OOM as that is the point where we start to receive wakeups.
    No point in listening before being on the waitqueue.

    8. Likewise, unmark the hierarchy before finishing the sleep, for
    symmetry.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • System calls and kernel faults (uaccess, gup) can handle an out of memory
    situation gracefully and just return -ENOMEM.

    Enable the memcg OOM killer only for user faults, where it's really the
    only option available.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Children in soft limit excess are currently tracked up the hierarchy in
    memcg->children_in_excess. Nevertheless there still might exist tons of
    groups that are not in hierarchy relation to the root cgroup (e.g. all
    first level groups if root_mem_cgroup->use_hierarchy == false).

    As the whole tree walk has to be done when the iteration starts at
    root_mem_cgroup the iterator should be able to skip the walk if there is
    no child above the limit without iterating them. This can be done
    easily if the root tracks all children rather than only hierarchical
    children. This is done by this patch which updates root_mem_cgroup
    children_in_excess if root_mem_cgroup->use_hierarchy == false so the
    root knows about all children in excess.

    Please note that this is not an issue for inner memcgs which have
    use_hierarchy == false because then only the single group is visited so
    no special optimization is necessary.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko