13 Jan, 2012

40 commits

  • There are multiple places which need to get the swap_cgroup address, so
    add a helper function:

    static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                                 struct swap_cgroup_ctrl **ctrl);

    to simplify the code.
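
    For illustration, a hedged sketch of the caller pattern such a helper
    enables; everything beyond the quoted prototype (the lock, the id
    field, the wrapper name) is an assumption, not the actual patch:

    static unsigned short swap_cgroup_id_of(swp_entry_t ent)
    {
            struct swap_cgroup_ctrl *ctrl;
            struct swap_cgroup *sc;
            unsigned long flags;
            unsigned short id;

            /* one call replaces the open-coded type/offset lookup */
            sc = swap_cgroup_getsc(ent, &ctrl);
            spin_lock_irqsave(&ctrl->lock, flags);
            id = sc->id;
            spin_unlock_irqrestore(&ctrl->lock, flags);
            return id;
    }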

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • mem_cgroup_uncharge_page() is only called on either freshly allocated
    pages without page->mapping or on rmapped PageAnon() pages. There is no
    need to check for a page->mapping that is not an anon_vma.
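
    As a hedged sketch of the shape of such a simplification (not the
    literal diff), the mapping test can become a debugging assertion
    instead of a runtime branch:

    /* before: runtime check for a non-anon page->mapping */
    if (page_mapped(page) || (page->mapping && !PageAnon(page)))
            return;

    /* after: only mapped pages bail out; the rest is an assertion */
    if (page_mapped(page))
            return;
    VM_BUG_ON(page->mapping && !PageAnon(page));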

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All callsites pass in freshly allocated pages and a valid mm. As a
    result, all checks pertaining to the page's mapcount, page->mapping or the
    fallback to init_mm are unneeded.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • lookup_page_cgroup() is usually used only against pages that are used in
    userspace.

    The exception is the CONFIG_DEBUG_VM-only memcg check from the page
    allocator: it can run on pages without page_cgroup descriptors allocated
    when the pages are fed into the page allocator for the first time during
    boot or memory hotplug.

    Include the array check only when CONFIG_DEBUG_VM is set and save the
    unnecessary check in production kernels.
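
    A hedged sketch of the shape of that change; the two lookup helpers
    below are hypothetical stand-ins for the real array arithmetic:

    struct page_cgroup *lookup_page_cgroup_sketch(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            struct page_cgroup *base = pc_base_for(pfn);    /* hypothetical */

    #ifdef CONFIG_DEBUG_VM
            /*
             * Only the DEBUG_VM sanity check in the page allocator can see
             * pages whose descriptors are not allocated yet (boot, hotplug).
             */
            if (unlikely(!base))
                    return NULL;
    #endif
            return base + pc_index_for(pfn);                /* hypothetical */
    }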

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pages have their corresponding page_cgroup descriptors set up before
    they are used in userspace, and thus managed by a memory cgroup.

    The only time where lookup_page_cgroup() can return NULL is in the
    CONFIG_DEBUG_VM-only page sanity checking code that executes while
    feeding pages into the page allocator for the first time.

    Remove the NULL checks against lookup_page_cgroup() results from all
    callsites where we know that corresponding page_cgroup descriptors must
    be allocated, and add a comment to the callsite that actually does have
    to check the return value.

    [hughd@google.com: stop oops in mem_cgroup_update_page_stat()]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The fault accounting functions have a single, memcg-internal user, so they
    don't need to be global. In fact, their one-line bodies can be directly
    folded into the caller. And since faults happen one at a time, use
    this_cpu_inc() directly instead of this_cpu_add(foo, 1).
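
    A minimal sketch of the this_cpu_add(foo, 1) -> this_cpu_inc(foo)
    part, using a hypothetical per-CPU counter rather than the real memcg
    statistics structure:

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(unsigned long, example_pgfault_events);

    static inline void count_fault_old(void)
    {
            this_cpu_add(example_pgfault_events, 1);  /* old one-line helper */
    }

    static inline void count_fault_new(void)
    {
            this_cpu_inc(example_pgfault_events);     /* folded into the caller */
    }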

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg argument of oom_kill_task() hasn't been used since 341aea2
    'oom-kill: remove boost_dying_task_prio()'. Kill it.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The two memcg stats pgpgin/pgpgout have a different meaning than the
    ones in vmstat, which indicates that we picked a bad name for them.

    It might be late to change the stat name, but better documentation is
    always helpful.

    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • It should be memsw.max_usage_in_bytes. This typo has been there for
    a really long time.

    Signed-off-by: Zhu Yanhai
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhu Yanhai
     
  • Only the ratelimit checks themselves have to run with preemption
    disabled; the resulting actions - checking for usage thresholds,
    updating the soft limit tree - can and should run with preemption
    enabled.
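
    A hedged sketch of that split, with a hypothetical per-CPU budget
    standing in for the real per-memcg event counters:

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(int, example_event_budget) = 64;

    static void memcg_check_events_sketch(void)
    {
            bool fire;

            preempt_disable();
            fire = __this_cpu_dec_return(example_event_budget) <= 0;
            if (fire)
                    __this_cpu_write(example_event_budget, 64);
            preempt_enable();

            if (fire) {
                    /* threshold checks and soft limit tree updates may take
                     * sleeping locks, so run them with preemption enabled */
            }
    }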

    Signed-off-by: Johannes Weiner
    Reported-by: Yong Zhang
    Tested-by: Yong Zhang
    Reported-by: Luis Henriques
    Tested-by: Luis Henriques
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies
    page_cgroup and LRU accounting, and is called HPAGE_PMD_SIZE - 1 times.

    But thinking again,
    - compound_lock() is held at move_account, so it's not necessary
      to take move_lock_page_cgroup().
    - The LRU is locked and all tail pages will go onto the same LRU as
      the head is now on.
    - page_cgroup is contiguous in the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() to be called once per
    hugepage, reducing the cost of splitting.

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To find the page corresponding to a certain page_cgroup, pc->flags
    encoded the node or section ID, which identified the base array to
    compare the pc pointer against.

    Now that the per-memory cgroup LRU lists link page descriptors directly,
    there is no longer any code that knows the struct page_cgroup of a PFN
    but not the struct page.

    [hughd@google.com: remove unused node/section info from pc->flags fix]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.
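
    Roughly, the descriptor change looks like this (member layout
    reproduced from memory and hedged; the point is the dropped list_head,
    i.e. two pointers per page frame):

    /* before */
    struct page_cgroup {
            unsigned long flags;
            struct mem_cgroup *mem_cgroup;
            struct list_head lru;           /* per-memcg LRU linkage */
    };

    /* after: page->lru is linked to the per-memcg LRU lists directly */
    struct page_cgroup {
            unsigned long flags;
            struct mem_cgroup *mem_cgroup;
    };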

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Having a unified structure with an LRU list set for both global zones
    and per-memcg zones allows the code that deals with LRU lists, and does
    not care about the container itself, to stay simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, so global reclaim must be able to find its pages on the
    per-memcg LRU lists.

    Since the LRU pages of a zone are distributed over all existing memory
    cgroups, a scan target for a zone is complete when all memory cgroups
    are scanned for their proportional share of a zone's memory.

    The forced scanning of small scan targets from kswapd is limited to
    zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
    force-scanning the LRU lists of multiple memory cgroups.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • root_mem_cgroup, lacking a configurable limit, was never subject to
    limit reclaim, so the pages charged to it could be kept off its LRU
    lists. They would be found on the global per-zone LRU lists upon
    physical memory pressure and it made sense to avoid uselessly linking
    them to both lists.

    The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, with all pages being exclusively linked to their respective
    per-memcg LRU lists. As a result, pages of the root_mem_cgroup must
    also be linked to its LRU lists again. This is purely about the LRU
    list, root_mem_cgroup is still not charged.

    The overhead is temporary until the double-LRU scheme goes away
    completely.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation for this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim currently picks one memory cgroup out of the
    target hierarchy, remembers it as the last scanned child, and reclaims
    all zones in it with decreasing priority levels.

    The new hierarchy reclaim code will pick memory cgroups from the same
    hierarchy concurrently from different zones and priority levels, so it
    becomes necessary that hierarchy roots not only remember the last
    scanned child, but do so for each zone and priority level.

    Until now, we reclaimed memcgs like this:

        mem = mem_cgroup_iter(root)
        for each priority level:
            for each zone in zonelist:
                reclaim(mem, zone)

    But subsequent patches will move the memcg iteration inside the loop
    over the zones:

        for each priority level:
            for each zone in zonelist:
                mem = mem_cgroup_iter(root)
                reclaim(mem, zone)

    And to keep with the original scan order - memcg -> priority -> zone -
    the last scanned memcg has to be remembered per zone and per priority
    level.

    Furthermore, global reclaim will be switched to the hierarchy walk as
    well. Unlike limit reclaim, which can just recheck the limit after some
    reclaim progress, its target is to scan all memcgs for the desired zone
    pages, proportional to the memcg size, so reliably detecting a full
    hierarchy round-trip will become crucial.

    Currently, the code relies on one reclaimer encountering the same memcg
    twice, but that is error-prone with concurrent reclaimers. Instead, use
    a generation counter that is increased every time the child with the
    highest ID has been visited, so that reclaimers can stop when the
    generation changes.
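
    A hedged sketch of the bookkeeping this implies (field names are
    assumptions): one iterator per zone and per priority level, plus a
    generation count that is bumped when the round-trip completes:

    struct mem_cgroup_reclaim_iter {
            int position;                   /* last scanned child (css id) */
            unsigned int generation;        /* bumped after a full round-trip */
    };

    struct mem_cgroup_per_zone {
            /* ... LRU lists, counts ... */
            struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
    };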

    Signed-off-by: Johannes Weiner
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup hierarchies are currently handled completely outside of
    the traditional reclaim code, which is invoked with a single memory
    cgroup as an argument for the whole call stack.

    Subsequent patches will switch this code to do hierarchical reclaim, so
    there needs to be a distinction between a) the memory cgroup that is
    triggering reclaim due to hitting its limit and b) the memory cgroup
    that is being scanned as a child of a).

    This patch introduces a struct mem_cgroup_zone that contains the
    combination of the memory cgroup and the zone being scanned, which is
    then passed down the stack instead of the zone argument.
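
    The pairing itself is small; a sketch along the lines described above:

    struct mem_cgroup_zone {
            struct mem_cgroup *mem_cgroup;
            struct zone *zone;
    };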

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The traditional zone reclaim code is scanning the per-zone LRU lists
    during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
    lists when reclaiming on behalf of a memory cgroup limit.

    Subsequent patches will convert the traditional reclaim code to reclaim
    exclusively from the per-memory cgroup LRU lists. As a result, checking
    which LRU list is being scanned will no longer be an appropriate way to
    tell global reclaim from limit reclaim.

    This patch adds a global_reclaim() predicate to tell direct/kswapd
    reclaim from memory cgroup limit reclaim and substitutes it in all
    places where currently scanning_global_lru() is used for that.
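
    A hedged sketch of such a predicate (the scan_control field name is an
    assumption): limit reclaim carries a target memcg, global reclaim does
    not:

    static bool global_reclaim(struct scan_control *sc)
    {
    #ifdef CONFIG_CGROUP_MEM_RES_CTLR
            return !sc->target_mem_cgroup;
    #else
            return true;
    #endif
    }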

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg naturalization series:

    Memory control groups are currently bolted onto the side of
    traditional memory management in places where better integration would
    be preferable. To reclaim memory, for example, memory control groups
    maintain their own LRU list and reclaim strategy aside from the global
    per-zone LRU list reclaim. But an extra list head for each existing
    page frame is expensive and maintaining it requires additional code.

    This patchset disables the global per-zone LRU lists on memory cgroup
    configurations and converts all its users to operate on the per-memory
    cgroup lists instead. As LRU pages are then exclusively on one list,
    this saves two list pointers for each page frame in the system:

    page_cgroup array size with 4G physical memory

    vanilla: allocated 31457280 bytes of page_cgroup
    patched: allocated 15728640 bytes of page_cgroup

    At the same time, system performance for various workloads is
    unaffected:

    100G sparse file cat, 4G physical memory, 10 runs, to test for code
    bloat in the traditional LRU handling and kswapd & direct reclaim
    paths, without/with the memory controller configured in

    vanilla: 71.603(0.207) seconds
    patched: 71.640(0.156) seconds

    vanilla: 79.558(0.288) seconds
    patched: 77.233(0.147) seconds

    100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
    bloat in the traditional memory cgroup LRU handling and reclaim path

    vanilla: 96.844(0.281) seconds
    patched: 94.454(0.311) seconds

    4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
    swap on SSD, 10 runs, to test for regressions in kswapd & direct
    reclaim using per-memcg LRU lists with multiple memcgs and multiple
    allocators within each memcg

    vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
    patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]

    16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
    swap on SSD, 10 runs, to test for regressions in hierarchical memcg
    setups

    vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
    patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

    This patch:

    There are currently two different implementations of iterating over a
    memory cgroup hierarchy tree.

    Consolidate them into one worker function and base the convenience
    looping-macros on top of it.
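
    For illustration, hedged convenience macros layered on a single
    iterator worker; mem_cgroup_iter() is the worker named elsewhere in the
    series, but the exact signature and macro names here are assumptions:

    #define for_each_mem_cgroup_tree(iter, root)             \
            for (iter = mem_cgroup_iter(root, NULL, NULL);   \
                 iter != NULL;                               \
                 iter = mem_cgroup_iter(root, iter, NULL))

    #define for_each_mem_cgroup(iter)                        \
            for_each_mem_cgroup_tree(iter, NULL)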

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added
    replace_page_cache_page(), which replaces a page in the radix-tree with
    a new page. When doing this, the memory cgroup needs to fix up the
    accounting information: memcg needs to check the PCG_USED bit etc.

    In some (many?) cases, 'newpage' is already on the LRU before
    replace_page_cache() is called, so memcg's LRU accounting information
    needs to be fixed up, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, the old page is unaccounted without touching
    res_counter and the new page is accounted to the memcg (of the old
    page). When overwriting pc->mem_cgroup of the new page, zone->lru_lock
    is taken to avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice() handling.
    Here, 'newpage' is replacing oldpage, but this newpage is not a newly
    allocated page and may already be on an LRU. LRU mis-accounting is
    critical for memory cgroups because rmdir() checks that the whole LRU
    is empty and that there is no account leak. If a page is on a different
    LRU than it should be, rmdir() will fail.

    This bug was introduced in March 2011, but there has been no bug report
    yet. I guess there are not many people who use memcg and FUSE at the
    same time with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg because
    of the account leak. So, no panic, no deadlock. And even if an active
    cgroup exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The current epoll code can be tickled to run basically indefinitely in
    both the loop detection path check (on ep_insert()) and in the wakeup
    paths. The programs that tickle this behavior set up deeply linked
    networks of epoll file descriptors that cause the epoll algorithms to
    traverse them indefinitely. A couple of these sample programs have been
    previously posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop
    detection becomes proportional to the number of epoll file descriptors
    and links. This dramatically decreases the run-time of the loop check
    algorithm. In one diabolical case I tried, it reduced the run-time from
    15 minutes (all in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar
    manner by keeping track of nodes that have already been visited, but
    the complexity is greater, since there can be multiple wakeups on
    different cpus... Thus, I've opted to limit the number of possible
    wakeup paths when the paths are created.

    This is accomplished by noting that the end file descriptor points that
    are found during the loop detection pass (from the newly added link)
    are actually the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of these paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.

    In terms of the path limit selection, I think it's first worth noting
    that the most common case for epoll is probably the model where you
    have 1 epoll file descriptor that is monitoring n 'source file
    descriptors'. In this case, each 'source file descriptor' has 1 path of
    length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. I'm hoping that the
    proposed limits will not cause any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a
    subset of the add paths. I need to hold the epmutex so that we can
    correctly traverse a coherent graph to check the number of paths. I
    believe that this additional locking is probably ok, since it's in the
    setup/teardown paths and doesn't affect the running paths, but it
    certainly is going to add some extra overhead. Also worth noting is
    that the epmutex was recently added to the epoll_ctl add operations in
    the initial path loop detection code, using the argument that it was
    not on a critical path.

    Another thing to note here is the length of epoll chains that is
    allowed. Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to
    a length of 5, regardless of the order in which ep's are linked
    together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.
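
    As a runnable userspace illustration of the kind of linkage being
    limited (one epoll fd watching another; error handling kept short, and
    this is not one of the diabolical test programs referenced above):

    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void)
    {
            int inner = epoll_create1(0);
            int outer = epoll_create1(0);
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };

            if (inner < 0 || outer < 0) {
                    perror("epoll_create1");
                    return 1;
            }
            /* outer now watches inner; long or dense chains of such links
             * are what the new path-count checks reject with -EINVAL */
            if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
                    perror("epoll_ctl");
                    return 1;
            }
            close(outer);
            close(inner);
            return 0;
    }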

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
    with a size bigger than kmalloc() can allocate, it spits out an ugly
    warning:

    ------------[ cut here ]------------
    WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
    Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
    Call Trace:
    warn_slowpath_common+0x75/0xb0
    warn_slowpath_null+0x15/0x20
    __alloc_pages_nodemask+0x5d3/0x7a0
    __get_free_pages+0x12/0x50
    __kmalloc+0x12b/0x150
    pipe_set_size+0x75/0x120
    pipe_fcntl+0xf8/0x140
    do_fcntl+0x2d4/0x410
    sys_fcntl+0x66/0xa0
    system_call_fastpath+0x16/0x1b
    ---[ end trace 432f702e6db7b5ee ]---

    Instead, make kcalloc() handle the overflow case and fail quietly.
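
    Hedged before/after sketch of the allocation change being described
    (context trimmed, not the full pipe_set_size()):

    struct pipe_buffer *bufs;

    /* before: unchecked multiplication; a huge request reaches the page
     * allocator and trips the warning above */
    bufs = kmalloc(nr_pages * sizeof(*bufs), GFP_KERNEL);

    /* after: kcalloc() checks nr_pages * sizeof(*bufs) for overflow and
     * returns NULL, so the caller can fail quietly with -ENOMEM */
    bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL);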

    [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
    Signed-off-by: Sasha Levin
    Cc: Alexander Viro
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Signed-off-by: Mathias Krause
    Cc: Kyle McMartin
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Fenghua Yu
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • Fix the int/bool confusion in there.

    drivers/video/nvidia/nvidia.c:1602: warning: return from incompatible pointer type

    Cc: Florian Tobias Schandinat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However setting that option will increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
    make sense that a present cpu feature should increase the size of struct
    page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong
    and that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in larger struct pages if the SLUB allocator is used. Instead,
    introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can be
    selected if a double-word-aligned struct page is required. Also update
    the x86 Kconfig so that it should work as before.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • The uses have been renamed, so delete the unused macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Use the more commonly used __noreturn instead of ATTRIB_NORETURN.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • It's a very old and now unused prototype marking so just delete it.

    Neaten panic pointer argument style to keep checkpatch quiet.

    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The only use in kernel.h is gone so remove the macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Use __printf macro.
    Convert NORET_AND to ATTRIB_NORET.
    Use the normal kernel style for pointer arguments.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:

    In file included from arch/x86/include/asm/uaccess.h:573,
    from kernel/kprobes.c:55:
    In function 'copy_from_user',
    inlined from 'write_enabled_file_bool' at
    kernel/kprobes.c:2191:
    arch/x86/include/asm/uaccess_64.h:65:
    warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct

    presumably because buf_size is signed, which causes GCC to fail to see
    that buf_size can't become negative.
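
    A hedged sketch of the shape of the fix (not the exact kprobes code):
    make the count unsigned and clamp it to the destination buffer, so the
    copy size is provably bounded:

    char buf[32];
    size_t buf_size;                /* was a signed int */

    buf_size = min(count, sizeof(buf) - 1);
    if (copy_from_user(buf, user_buf, buf_size))
            return -EFAULT;
    buf[buf_size] = '\0';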

    Signed-off-by: Stephen Boyd
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: David S. Miller
    Acked-by: Masami Hiramatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     
  • get_proc_task() can fail to find the task and return NULL;
    put_task_struct() will then bomb the kernel with the following oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] proc_pid_permission+0x64/0xe0
    PGD 112075067 PUD 112814067 PMD 0
    Oops: 0002 [#1] PREEMPT SMP

    This is a regression introduced by commit 0499680a ("procfs: add
    hidepid= and gid= mount options"). The kernel should return -ESRCH if
    get_proc_task() fails.
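
    A hedged sketch of the fix described (function body abridged): bail out
    with -ESRCH when the task lookup fails instead of dereferencing NULL:

    static int proc_pid_permission(struct inode *inode, int mask)
    {
            struct task_struct *task = get_proc_task(inode);

            if (!task)
                    return -ESRCH;
            /* ... hidepid/gid checks against 'task' ... */
            put_task_struct(task);
            return generic_permission(inode, mask);
    }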

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: Vasiliy Kulikov
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • This very noisy sparse warning appears on almost every file in the
    kernel:

    CHECK init/main.c
    arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
    arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield

    This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
    type and thus fixes the warning.
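
    For illustration, the warning and the fix in miniature: a one-bit
    signed bitfield can only hold 0 and -1, hence sparse's complaint;
    making the fields unsigned silences it:

    struct thread_info_example {
            int             sig_on_uaccess_error: 1;  /* dubious: signed 1-bit */
            int             uaccess_err: 1;           /* dubious: signed 1-bit */
    };

    struct thread_info_fixed {
            unsigned int    sig_on_uaccess_error: 1;  /* ok */
            unsigned int    uaccess_err: 1;           /* ok */
    };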

    Signed-off-by: Anton Vorontsov
    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Anton Vorontsov