13 Jan, 2012

24 commits

  • If DEBUG_VM is set, mem_cgroup_print_bad_page() is called whenever
    bad_page() shows a "Bad page state" message, removes the page from
    circulation, adds a taint and continues. This happens at a very low
    level, often while a spinlock is held (sometimes the page table lock,
    for example).

    We want to recover from this badness, not make it worse: we must not
    kmalloc memory here, we must not do a cgroup path lookup via dubious
    pointers. No doubt that code was useful to debug a particular case at one
    time, and may be again, but take it out of the mainline kernel.

    Signed-off-by: Hugh Dickins
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch started off as a cleanup: __split_huge_page_refcount() has to
    cope with two scenarios, when the hugepage being split is already on LRU,
    and when it is not; but why does it have to split that accounting across
    three different sites? Consolidate it in lru_add_page_tail(), handling
    evictable and unevictable alike, and use standard add_page_to_lru_list()
    when accounting is needed (when the head is not yet on LRU).

    But a recent regression in -next (I guess the removal of the
    PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup()) now makes this
    a necessary fix: under load, the MEM_CGROUP_ZSTAT count was wrapping to a
    huge number, messing up reclaim calculations and causing a freeze at
    rmdir of the cgroup.

    Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
    count - this has not been the only such incident. Document that
    lru_add_page_tail() is for Transparent HugePages by wrapping it in an
    #ifdef.

    Signed-off-by: Hugh Dickins
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We already have for_each_node(node) defined in nodemask.h, so better to
    use it (a sketch of the conversion follows this entry).

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
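
    A minimal sketch of the kind of conversion this implies; the callee
    free_per_node_info() and the memcg argument are placeholders, not taken
    from the patch:

        int nid;

        /* open-coded walk over every possible node id */
        for (nid = 0; nid < MAX_NUMNODES; nid++)
                free_per_node_info(memcg, nid);

        /* the same walk, using the helper already defined in nodemask.h */
        for_each_node(nid)
                free_per_node_info(memcg, nid);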
     
  • At LRU handling time, the memory cgroup currently has to do complicated
    work to see a valid pc->mem_cgroup, which may be overwritten.

    This patch relaxes the protocol. It guarantees that:
    - when pc->mem_cgroup is overwritten, the page must not be on the LRU.

    With this, the LRU routines can trust pc->mem_cgroup and do not need to
    check bits in pc->flags. The new rule may add a small overhead to swapin,
    but in most cases LRU handling gets faster.

    After this patch, the PCG_ACCT_LRU bit is obsolete and is removed.

    [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
    [akpm@linux-foundation.org: clean up code comment]
    [hughd@google.com: fix NULL mem_cgroup_try_charge]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is preparation for removing the PCG_ACCT_LRU flag from page_cgroup
    and reducing the atomic ops and complexity in memcg LRU handling.

    In some cases, pages are added to the LRU before being charged to a
    memcg, so they are not yet classified to a memory cgroup at LRU
    addition. Currently, the LRU a page should be added to is determined by
    a bit in page_cgroup->flags together with pc->mem_cgroup. I'd like to
    remove the flag check.

    In that case pc->mem_cgroup may contain a stale pointer, because the
    page was added to the LRU before classification. To handle this, the
    patch resets pc->mem_cgroup to root_mem_cgroup before LRU additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch simplifies LRU handling of the racy case (memcg + SwapCache).
    At charging, a SwapCache page tends to be on the LRU already, so before
    overwriting pc->mem_cgroup the page must be removed from the LRU and
    added back later (a sketch follows this entry).

    This patch does:

        spin_lock(zone->lru_lock);
        if (PageLRU(page))
                remove from LRU
        overwrite pc->mem_cgroup
        if (PageLRU(page))
                add to new LRU
        spin_unlock(zone->lru_lock);

    and guarantees that no page is on the LRU while pc->mem_cgroup is being
    modified. This patch also unifies the LRU handling of
    replace_page_cache() and swapin.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
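
    A sketch of the locking pattern described above, in the style of the
    memcg code of that era (the helper name and details are illustrative,
    not the literal patch):

        static void commit_charge_lrucare(struct page *page,
                                          struct mem_cgroup *memcg)
        {
                struct zone *zone = page_zone(page);
                struct page_cgroup *pc = lookup_page_cgroup(page);
                bool was_on_lru = false;
                unsigned long flags;

                spin_lock_irqsave(&zone->lru_lock, flags);
                if (PageLRU(page)) {
                        del_page_from_lru_list(zone, page, page_lru(page));
                        ClearPageLRU(page);
                        was_on_lru = true;
                }

                pc->mem_cgroup = memcg;        /* page is off the LRU here */

                if (was_on_lru) {
                        SetPageLRU(page);
                        add_page_to_lru_list(zone, page, page_lru(page));
                }
                spin_unlock_irqrestore(&zone->lru_lock, flags);
        }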
     
  • This patch is a clean-up with no functional or logical changes.

    Since commit ef6a3c6311 ("mm: add replace_page_cache_page() function"),
    FUSE uses replace_page_cache() instead of add_to_page_cache(), so
    mem_cgroup_cache_charge() is no longer called against FUSE's pages from
    splice.

    So now mem_cgroup_cache_charge() only gets pages that are not on the
    LRU, with the exception of PageSwapCache pages. A
    WARN_ON_ONCE(PageLRU(page)) check is added.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The oom killer relies on logic that identifies threads that have already
    been oom killed when scanning the tasklist and, if found, deferring
    until such threads have exited. This is done by checking for any
    candidate threads that have the TIF_MEMDIE bit set.

    For memcg ooms, candidate threads are first found by calling
    task_in_mem_cgroup() since the oom killer should not defer if there's an
    oom killed thread in another memcg.

    Unfortunately, task_in_mem_cgroup() excludes threads if they have
    detached their mm in the process of exiting so TIF_MEMDIE is never
    detected for such conditions. This is different for global, mempolicy,
    and cpuset oom conditions where a detached mm is only excluded after
    checking for TIF_MEMDIE and deferring, if necessary, in
    select_bad_process().

    The fix is to return true if a task has a detached mm but is still in
    the memcg or its hierarchy that is currently oom. This will allow the
    oom killer to appropriately defer rather than kill unnecessarily or, in
    the worst case, panic the machine if nothing else is available to kill.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If we are not able to allocate tree nodes for all NUMA nodes, then we
    should release those that were already allocated (a sketch of the
    error-path cleanup follows this entry).

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
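
    A schematic of the error-path cleanup being described; the array and
    type names here are illustrative, not the exact memcg code:

        struct per_node_tree {
                struct rb_root rb_root;
                spinlock_t lock;
        };

        static struct per_node_tree *soft_limit_tree[MAX_NUMNODES];

        static int __init alloc_soft_limit_tree(void)
        {
                int node;

                for_each_node(node) {
                        soft_limit_tree[node] =
                                kzalloc(sizeof(*soft_limit_tree[node]), GFP_KERNEL);
                        if (!soft_limit_tree[node])
                                goto fail;
                }
                return 0;

        fail:
                /* release the per-node entries that were already allocated */
                for_each_node(node) {
                        if (!soft_limit_tree[node])
                                break;
                        kfree(soft_limit_tree[node]);
                        soft_limit_tree[node] = NULL;
                }
                return -ENOMEM;
        }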
     
  • There are multiple places which need to get the swap_cgroup address, so
    add a helper function:

        static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                        struct swap_cgroup_ctrl **ctrl);

    to simplify the code (a sketch of such a helper follows this entry).

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
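
    A sketch of how such a helper can resolve a swap entry to its
    swap_cgroup slot, based on the per-swap-type control structure in
    mm/page_cgroup.c (details may differ from the actual patch):

        static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                        struct swap_cgroup_ctrl **ctrlp)
        {
                pgoff_t offset = swp_offset(ent);
                struct swap_cgroup_ctrl *ctrl;
                struct page *mappage;
                struct swap_cgroup *sc;

                ctrl = &swap_cgroup_ctrl[swp_type(ent)];
                if (ctrlp)
                        *ctrlp = ctrl;

                /* each control page holds SC_PER_PAGE swap_cgroup entries */
                mappage = ctrl->map[offset / SC_PER_PAGE];
                sc = page_address(mappage);
                return sc + offset % SC_PER_PAGE;
        }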
     
  • mem_cgroup_uncharge_page() is only called on either freshly allocated
    pages without page->mapping or on rmapped PageAnon() pages. There is no
    need to check for a page->mapping that is not an anon_vma.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All callsites pass in freshly allocated pages and a valid mm. As a
    result, all checks pertaining to the page's mapcount, page->mapping or the
    fallback to init_mm are unneeded.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pages have their corresponding page_cgroup descriptors set up before
    they are used in userspace, and thus managed by a memory cgroup.

    The only time where lookup_page_cgroup() can return NULL is in the
    CONFIG_DEBUG_VM-only page sanity checking code that executes while
    feeding pages into the page allocator for the first time.

    Remove the NULL checks against lookup_page_cgroup() results from all
    callsites where we know that corresponding page_cgroup descriptors must
    be allocated, and add a comment to the callsite that actually does have
    to check the return value.

    [hughd@google.com: stop oops in mem_cgroup_update_page_stat()]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The fault accounting functions have a single, memcg-internal user, so they
    don't need to be global. In fact, their one-line bodies can be directly
    folded into the caller. And since faults happen one at a time, use
    this_cpu_inc() directly instead of this_cpu_add(foo, 1), as sketched
    after this entry.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
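
    Roughly, the change folds a one-line wrapper into its single caller
    (the helper and event names follow the memcg statistics code of that
    era, but this is only an illustration):

        /* before: a separate helper with one memcg-internal caller */
        void mem_cgroup_pgfault(struct mem_cgroup *memcg, int val)
        {
                this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT], val);
        }

        /* after: the body is folded into that caller, and since exactly
         * one fault is being counted, this_cpu_inc() is used directly */
        this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);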
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only the ratelimit checks themselves have to run with preemption
    disabled; the resulting actions - checking for usage thresholds,
    updating the soft limit tree - can and should run with preemption
    enabled (see the sketch after this entry).

    Signed-off-by: Johannes Weiner
    Reported-by: Yong Zhang
    Tested-by: Yong Zhang
    Reported-by: Luis Henriques
    Tested-by: Luis Henriques
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
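
    A sketch of the resulting shape; the helper names follow the memcg
    event code of that era, but treat this as illustrative rather than the
    literal patch:

        bool do_thresh, do_softlimit;

        /* only the per-cpu event ratelimiting needs preemption disabled */
        preempt_disable();
        do_thresh = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH);
        do_softlimit = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_SOFTLIMIT);
        preempt_enable();

        /* the follow-up work may take locks and should run preemptible */
        if (do_thresh)
                mem_cgroup_threshold(memcg);
        if (do_softlimit)
                mem_cgroup_update_tree(memcg, page);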
     
  • In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies
    page_cgroup and does the LRU accounting, and is called once per tail
    page.

    But thinking again:
    - compound_lock() is held at move_account(), so it's not necessary
      to take move_lock_page_cgroup().
    - The LRU is locked, and all tail pages will go onto the same LRU the
      head is now on.
    - page_cgroup is contiguous across the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() so it is called once
    per hugepage, reducing the cost of splitting (a sketch follows this
    entry).

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
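
    A sketch of the resulting shape: one call per hugepage, copying the
    head's charge to the contiguous tail page_cgroups under the locks that
    are already held (flag handling elided; illustrative, not the literal
    patch):

        void mem_cgroup_split_huge_fixup(struct page *head)
        {
                struct page_cgroup *head_pc = lookup_page_cgroup(head);
                int i, nr = 1 << compound_order(head);

                if (mem_cgroup_disabled())
                        return;
                /* page_cgroup descriptors for the hugepage are contiguous */
                for (i = 1; i < nr; i++) {
                        struct page_cgroup *pc = head_pc + i;

                        pc->mem_cgroup = head_pc->mem_cgroup;
                }
        }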
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system (sketched
    after this entry).

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
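
    Roughly, the descriptor shrinks as follows (layout from the code of
    that era, shown for illustration):

        /* before: every page_cgroup carried its own LRU linkage */
        struct page_cgroup {
                unsigned long flags;
                struct mem_cgroup *mem_cgroup;
                struct list_head lru;   /* per-memcg LRU list linkage */
        };

        /* after: page->lru links straight into the per-memcg LRU lists,
         * so the two list pointers per page frame go away */
        struct page_cgroup {
                unsigned long flags;
                struct mem_cgroup *mem_cgroup;
        };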
     
  • Having a unified structure with a LRU list set for both global zones and
    per-memcg zones allows to keep that code simple which deals with LRU
    lists and does not care about the container itself.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • root_mem_cgroup, lacking a configurable limit, was never subject to
    limit reclaim, so the pages charged to it could be kept off its LRU
    lists. They would be found on the global per-zone LRU lists upon
    physical memory pressure and it made sense to avoid uselessly linking
    them to both lists.

    The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, with all pages being exclusively linked to their respective
    per-memcg LRU lists. As a result, pages of the root_mem_cgroup must
    also be linked to its LRU lists again. This is purely about the LRU
    list, root_mem_cgroup is still not charged.

    The overhead is temporary until the double-LRU scheme is going away
    completely.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim currently picks one memory cgroup out of the
    target hierarchy, remembers it as the last scanned child, and reclaims
    all zones in it with decreasing priority levels.

    The new hierarchy reclaim code will pick memory cgroups from the same
    hierarchy concurrently from different zones and priority levels, so it
    becomes necessary that hierarchy roots not only remember the last
    scanned child, but do so for each zone and priority level.

    Until now, we reclaimed memcgs like this:

        mem = mem_cgroup_iter(root)
        for each priority level:
                for each zone in zonelist:
                        reclaim(mem, zone)

    But subsequent patches will move the memcg iteration inside the loop
    over the zones:

        for each priority level:
                for each zone in zonelist:
                        mem = mem_cgroup_iter(root)
                        reclaim(mem, zone)

    And to keep with the original scan order - memcg -> priority -> zone -
    the last scanned memcg has to be remembered per zone and per priority
    level.

    Furthermore, global reclaim will be switched to the hierarchy walk as
    well. Different from limit reclaim, which can just recheck the limit
    after some reclaim progress, its target is to scan all memcgs for the
    desired zone pages, proportional to the memcg size, and so reliably
    detecting a full hierarchy round-trip will become crucial.

    Currently, the code relies on one reclaimer encountering the same memcg
    twice, but that is error-prone with concurrent reclaimers. Instead, use
    a generation counter that is increased every time the child with the
    highest ID has been visited, so that reclaimers can stop when the
    generation changes (the iterator state is sketched after this entry).

    Signed-off-by: Johannes Weiner
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
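
    A sketch of the per-zone, per-priority iterator state this implies
    (field and array names are approximate):

        /* kept at the hierarchy root, one instance per zone and priority */
        struct mem_cgroup_reclaim_iter {
                int position;            /* css id of the last scanned child */
                unsigned int generation; /* bumped on every full round-trip */
        };

        struct mem_cgroup_per_zone {
                /* one iterator per reclaim priority level */
                struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
                /* other per-zone memcg state elided */
        };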
     
  • The memcg naturalization series:

    Memory control groups are currently bolted onto the side of
    traditional memory management in places where better integration would
    be preferable. To reclaim memory, for example, memory control groups
    maintain their own LRU list and reclaim strategy aside from the global
    per-zone LRU list reclaim. But an extra list head for each existing
    page frame is expensive and maintaining it requires additional code.

    This patchset disables the global per-zone LRU lists on memory cgroup
    configurations and converts all its users to operate on the per-memory
    cgroup lists instead. As LRU pages are then exclusively on one list,
    this saves two list pointers for each page frame in the system:

    page_cgroup array size with 4G physical memory

    vanilla: allocated 31457280 bytes of page_cgroup
    patched: allocated 15728640 bytes of page_cgroup

    At the same time, system performance for various workloads is
    unaffected:

    100G sparse file cat, 4G physical memory, 10 runs, to test for code
    bloat in the traditional LRU handling and kswapd & direct reclaim
    paths, without/with the memory controller configured in

    vanilla: 71.603(0.207) seconds
    patched: 71.640(0.156) seconds

    vanilla: 79.558(0.288) seconds
    patched: 77.233(0.147) seconds

    100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
    bloat in the traditional memory cgroup LRU handling and reclaim path

    vanilla: 96.844(0.281) seconds
    patched: 94.454(0.311) seconds

    4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
    swap on SSD, 10 runs, to test for regressions in kswapd & direct
    reclaim using per-memcg LRU lists with multiple memcgs and multiple
    allocators within each memcg

    vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
    patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]

    16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
    swap on SSD, 10 runs, to test for regressions in hierarchical memcg
    setups

    vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
    patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

    This patch:

    There are currently two different implementations of iterating over a
    memory cgroup hierarchy tree.

    Consolidate them into one worker function and base the convenience
    looping macros on top of it (the resulting macros are sketched after
    this entry).

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
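
    A sketch of what the convenience macros look like when layered on a
    single worker (the worker's exact signature changed over the series, so
    treat this as approximate):

        #define for_each_mem_cgroup_tree(iter, root)            \
                for (iter = mem_cgroup_iter(root, NULL, NULL);  \
                     iter != NULL;                              \
                     iter = mem_cgroup_iter(root, iter, NULL))

        /* iterate every memcg in the system: hierarchy root of NULL */
        #define for_each_mem_cgroup(iter)                       \
                for_each_mem_cgroup_tree(iter, NULL)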
     
  • Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in
    the radix-tree with a new page. When doing this, the memory cgroup
    needs to fix up the accounting information: memcg needs to check the
    PCG_USED bit etc.

    In some (many?) cases, 'newpage' is already on the LRU before
    replace_page_cache() is called, so memcg's LRU accounting information
    must be fixed up as well.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, the old page is unaccounted without touching
    the res_counter, and the new page is accounted to the memcg of the old
    page. When overwriting pc->mem_cgroup of the new page, zone->lru_lock
    is taken to avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice()
    handling. Here, 'newpage' replaces 'oldpage', but this new page is not
    freshly allocated and may already be on the LRU. LRU mis-accounting is
    critical for the memory cgroup because rmdir() checks that the whole
    LRU is empty and that there is no accounting leak. If a page is on a
    different LRU than it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report yet.
    I guess there are not many people who use memcg and FUSE at the same
    time with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg because
    of the accounting leak. So there is no panic and no deadlock, and even
    if an active cgroup exists, umount can succeed, so there is no problem
    at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 Jan, 2012

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    igmp: Avoid zero delay when receiving odd mixture of IGMP queries
    netdev: make net_device_ops const
    bcm63xx: make ethtool_ops const
    usbnet: make ethtool_ops const
    net: Fix build with INET disabled.
    net: introduce netif_addr_lock_nested() and call if when appropriate
    net: correct lock name in dev_[uc/mc]_sync documentations.
    net: sk_update_clone is only used in net/core/sock.c
    8139cp: fix missing napi_gro_flush.
    pktgen: set correct max and min in pktgen_setup_inject()
    smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
    asix: fix infinite loop in rx_fixup()
    net: Default UDP and UNIX diag to 'n'.
    r6040: fix typo in use of MCR0 register bits
    net: fix sock_clone reference mismatch with tcp memcontrol

    Linus Torvalds
     
  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

08 Jan, 2012

1 commit

  • Sockets can also be created through sock_clone. Because it copies
    all data in the sock structure, it also copies the memcg-related pointer,
    and all should be fine. However, since we now use reference counts in
    socket creation, we are left with some sockets that have no reference
    counts. It matters when we destroy them, since it leads to a mismatch.

    Signed-off-by: Glauber Costa
    CC: David S. Miller
    CC: Greg Thelen
    CC: Hiroyouki Kamezawa
    CC: Laurent Chavey
    Signed-off-by: David S. Miller

    Glauber Costa
     

24 Dec, 2011

1 commit


23 Dec, 2011

1 commit

  • This reverts commit e5671dfae59b165e2adfd4dfbdeab11ac8db5bda.

    After a follow up discussion with Michal, it was agreed it would
    be better to leave the kmem controller with just the tcp files,
    deferring the behavior of the other general memory.kmem.* files
    for a later time, when more caches are controlled. This is because
    generic kmem files are not used by tcp accounting and it is
    not clear how other slab caches would fit into the scheme.

    We are reverting the original commit so we can track the reference.
    Part of the patch is kept, because it was used by the later tcp
    code. Conflicts are shown at the bottom. init/Kconfig is removed from
    the revert entirely.

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    CC: Kirill A. Shutemov
    CC: Paul Menage
    CC: Greg Thelen
    CC: Johannes Weiner
    CC: David S. Miller

    Conflicts:

    Documentation/cgroups/memory.txt
    mm/memcontrol.c
    Signed-off-by: David S. Miller

    Glauber Costa
     

21 Dec, 2011

1 commit

  • If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

13 Dec, 2011

4 commits

  • Currently, there's no way to pass multiple tasks to cgroup_subsys
    methods, necessitating separate per-process and per-task methods.
    This patch introduces cgroup_taskset, which can be used to
    pass multiple tasks and their associated cgroups to cgroup_subsys
    methods.

    Three methods - can_attach(), cancel_attach() and attach() - are
    converted to use cgroup_taskset. This unifies passed parameters so
    that all methods have access to all information. Conversions in this
    patchset are identical and don't introduce any behavior change.

    -v2: documentation updated as per Paul Menage's suggestion.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris

    Tejun Heo
     
  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • The goal of this work is to move the memory pressure tcp
    controls to a cgroup, instead of just relying on global
    conditions.

    To avoid excessive overhead in the network fast paths, the code that
    accounts allocated memory to a cgroup is hidden inside a
    static_branch(). This branch is patched out until the first non-root
    cgroup is created, so when nobody is using cgroups, even if the
    controller is mounted, no significant performance penalty should be
    seen (see the sketch after this entry).

    This patch handles the generic part of the code, and has nothing
    tcp-specific.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Kirill A. Shutemov
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
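
    A sketch of the pattern described above; the wrapper names here are
    illustrative, only static_branch() and the jump-label key reflect the
    mechanism being used:

        /* stays patched out until a non-root memcg enables accounting */
        struct jump_label_key memcg_socket_limit_enabled;

        static inline bool memcg_socket_accounting_enabled(void)
        {
                return static_branch(&memcg_socket_limit_enabled);
        }

        /* socket allocation fast path: a single no-op branch when unused */
        void on_socket_alloc(struct sock *sk)
        {
                if (memcg_socket_accounting_enabled())
                        sock_update_memcg(sk);  /* cgroup bookkeeping */
        }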
     
  • This patch lays down the foundation for the kernel memory component
    of the Memory Controller.

    As of today, I am only laying down the following files:

    * memory.independent_kmem_limit
    * memory.kmem.limit_in_bytes (currently ignored)
    * memory.kmem.usage_in_bytes (always zero)

    Signed-off-by: Glauber Costa
    CC: Kirill A. Shutemov
    CC: Paul Menage
    CC: Greg Thelen
    CC: Johannes Weiner
    CC: Michal Hocko
    Signed-off-by: David S. Miller

    Glauber Costa
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

5 commits

  • Various code in memcontrol.c calls this_cpu_read() for calculations that
    combine two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the writes go to
    the correct places (a sketch follows this entry).

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
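
    A sketch of the kind of section being protected; the percpu counter
    name is illustrative, and val, nr and idx come from the surrounding
    context:

        /* an open-coded per-cpu read-modify-write must not migrate between
         * the read and the write, so keep preemption off throughout */
        preempt_disable();
        val = __this_cpu_read(memcg->stat->events[idx]);
        val += nr;
        __this_cpu_write(memcg->stat->events[idx], val);
        preempt_enable();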
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

        charge:                          putback:
        SetPageCgroupUsed                SetPageLRU
        PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial, otherwise
    the charge may observe !PageLRU while the putback observes !PageCgroupUsed
    and the page is not linked to the memcg LRU at all.

    Global memory pressure may fix this by trying to isolate and putback the
    page for reclaim, where that putback would link it to the memcg LRU again.
    Without that, the memory cgroup is undeletable due to a charge whose
    physical page can not be found and moved out.

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim decides to skip scanning an active list when the corresponding
    inactive list is above a certain size in comparison to leave the assumed
    working set alone while there are still enough reclaim candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is [un|partly]initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
  • Before calling schedule_timeout(), task state should be changed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki