18 Apr, 2015

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Numerous fixes, the overdue removal of the i2o docs, some new Chinese
    translations, and, hopefully, the README fix that will end the flow of
    identical patches to that file"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
    Documentation/memcg: update memcg/kmem status
    Documentation: blackfin: Makefile: Typo building issue
    Documentation/vm/pagemap.txt: correct location of page-types tool
    Documentation/memory-barriers.txt: typo fix
    doc: Add guest_nice column to example output of `cat /proc/stat'
    Documentation/kernel-parameters: Move "eagerfpu" to its right place
    Documentation: gpio: Update ACPI part of the document to mention _DSD
    docs/completion.txt: Various tweaks and corrections
    doc: completion: context, scope and language fixes
    Documentation:Update Documentation/zh_CN/arm64/memory.txt
    Documentation:Update Documentation/zh_CN/arm64/booting.txt
    Documentation: Chinese translation of arm64/legacy_instructions.txt
    DocBook media: fix broken EIA hyperlink
    Documentation: tweak the maintainers entry
    README: Change gzip/bzip2 to xz compression format
    README: Update version number reference
    doc:pci: Fix typo in Documentation/PCI
    Documentation: drm: Use '->' when describing access through pointers.
    Documentation: Remove mentioning of block barriers
    Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
    ...

    Linus Torvalds
     

14 Apr, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Rik made cpuset cooperate better with
    isolcpus and there are several other cleanup patches"

    * 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset, isolcpus: document relationship between cpusets & isolcpus
    cpusets, isolcpus: exclude isolcpus from load balancing in cpusets
    sched, isolcpu: make cpu_isolated_map visible outside scheduler
    cpuset: initialize cpuset a bit early
    cgroup: Use kvfree in pidlist_free()
    cgroup: call cgroup_subsys->bind on cgroup subsys initialization

    Linus Torvalds
     

11 Apr, 2015

1 commit


20 Mar, 2015

1 commit


01 Mar, 2015

1 commit

  • The memcg control knobs indicate the highest possible value using the
    symbolic name "infinity", which is long and awkward to type.

    Switch to the string "max", which is just as descriptive but shorter and
    sweeter.

    This changes a user interface, so do it before the release and before
    the development flag is dropped from the default hierarchy.
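
    A minimal Python sketch of how a userspace tool might read and write such a knob, assuming the file holds either the string "max" or a decimal byte count. The helper names parse_limit() and format_limit() are hypothetical and not part of any kernel or library interface.

```python
import math


def parse_limit(text):
    """Parse the contents of a limit file such as memory.high."""
    value = text.strip()
    if value == "max":
        return math.inf       # highest possible value, i.e. no limit set
    return int(value)         # configured limit in bytes


def format_limit(limit):
    """Produce the string to write back into a limit file."""
    return "max" if limit == math.inf else str(int(limit))
```

    Representing "max" as math.inf keeps comparisons such as usage > limit valid without special-casing an unset limit.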

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

12 Feb, 2015

1 commit

  • Introduce the basic control files to account, partition, and limit
    memory using cgroups in default hierarchy mode.

    This interface versioning allows us to address fundamental design
    issues in the existing memory cgroup interface, further explained
    below. The old interface will be maintained indefinitely, but a
    clearer model and improved workload performance should encourage
    existing users to switch over to the new one eventually.

    The control files are thus:

    - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

    - memory.low configures the lower end of the cgroup's expected
    memory consumption range. The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

    - memory.high configures the upper end of the cgroup's expected
    memory consumption range. A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

    - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

    - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max. This
    allows users to identify configuration problems when observing a
    degradation in workload performance. An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.
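
    As a rough illustration of the last point, the sketch below compares two snapshots of memory.events. It assumes the file contains one "name value" pair per line with the names low, high, max and oom; that layout is an assumption, since the message above names only the events. parse_events() and diagnose() are hypothetical helpers.

```python
def parse_events(text):
    """Turn the assumed "name value" lines into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split()
        counters[name] = int(value)
    return counters


def diagnose(before, after):
    """Apply the rule of thumb from the text to two snapshots."""
    findings = []
    if after["low"] > before["low"]:
        # Reclaim went below the reserve: the system as a whole is short.
        findings.append("system overcommitted")
    if any(after[k] > before[k] for k in ("high", "max", "oom")):
        # The cgroup pushed against its own boundaries.
        findings.append("cgroup internally overcommitted")
    return findings
```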

    For existing users of memory cgroups, the following deviations from
    the current interface are worth pointing out and explaining:

    - The original lower boundary, the soft limit, is defined as a limit
    that is per default unset. As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out. The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior. First off, the soft limit has no
    hierarchical meaning. All configured groups are organized in a
    global rbtree and treated like equal peers, regardless where they
    are located in the hierarchy. This makes subtree delegation
    impossible. Second, the soft limit reclaim pass is so aggressive
    that it not only introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve. A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible. Secondly, new cgroups have no reserve per
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass. This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes. Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

    - The original high boundary, the hard limit, is defined as a strict
    limit that cannot budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory. The memory consumption of workloads varies
    during runtime, and that requires users to overcommit. But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively. When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer. As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation. The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded. But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than killing the group. Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

    - The original control file names are unwieldy and inconsistent in
    many different ways. For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

    - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files. But that very high number is implementation
    and architecture dependent and not very descriptive. And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not consistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.

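    The boundary semantics described above can be sketched as a toy Python model. It is a simplification under stated assumptions: the root group is treated as having no reserve, charge_outcome() looks at a single group rather than the whole hierarchy, and the class and function names are hypothetical.

```python
import math


class Cgroup:
    """Toy model of the low/high/max boundaries; all values in bytes."""

    def __init__(self, usage=0, low=0, high=math.inf, maximum=math.inf,
                 parent=None):
        self.usage = usage
        self.low = low            # new groups have no reserve by default
        self.high = high
        self.maximum = maximum
        self.parent = parent


def is_protected(cg):
    """True if cg and all its non-root ancestors are below memory.low."""
    if cg.parent is None:
        return False              # assumption: the root has no reserve
    while cg.parent is not None:
        if cg.usage >= cg.low:
            return False
        cg = cg.parent
    return True


def charge_outcome(cg, nr_bytes):
    """Classify what a new allocation of nr_bytes would trigger."""
    new_usage = cg.usage + nr_bytes
    if new_usage > cg.maximum:
        return "max"              # reclaim, OOM killer if that fails
    if new_usage > cg.high:
        return "high"             # throttled in direct reclaim, no OOM kill
    return "ok"
```
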
    [akpm@linux-foundation.org: use seq_puts() for basic strings]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jan, 2015

1 commit

  • unified-hierarchy.txt was added by 65731578 (cgroup: add documentation
    about unified hierarchy)

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Jonathan Corbet
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Henrik Austad
    Signed-off-by: Tejun Heo

    Henrik Austad
     

14 Dec, 2014

2 commits

  • Merge second patchbomb from Andrew Morton:
    - the rest of MM
    - misc fs fixes
    - add execveat() syscall
    - new ratelimit feature for fault-injection
    - decompressor updates
    - ipc/ updates
    - fallocate feature creep
    - fsnotify cleanups
    - a few other misc things

    * emailed patches from Andrew Morton: (99 commits)
    cgroups: Documentation: fix trivial typos and wrong paragraph numberings
    parisc: percpu: update comments referring to __get_cpu_var
    percpu: update local_ops.txt to reflect this_cpu operations
    percpu: remove __get_cpu_var and __raw_get_cpu_var macros
    fsnotify: remove destroy_list from fsnotify_mark
    fsnotify: unify inode and mount marks handling
    fallocate: create FAN_MODIFY and IN_MODIFY events
    mm/cma: make kmemleak ignore CMA regions
    slub: fix cpuset check in get_any_partial
    slab: fix cpuset check in fallback_alloc
    shmdt: use i_size_read() instead of ->i_size
    ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
    ipc/msg: increase MSGMNI, remove scaling
    ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
    ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
    lib/decompress.c: consistency of compress formats for kernel image
    decompress_bunzip2: off by one in get_next_block()
    usr/Kconfig: make initrd compression algorithm selection not expert
    fault-inject: add ratelimit option
    ratelimit: add initialization macro
    ...

    Linus Torvalds
     
  • Signed-off-by: SeongJae Park
    Cc: Jonathan Corbet
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     

13 Dec, 2014

1 commit

  • Pull documentation update from Jonathan Corbet:
    "Here's my set of accumulated documentation changes for 3.19.

    It includes a couple of additions to the coding style document, some
    fixes for minor build problems within the documentation tree, the
    relocation of the kselftest docs, and various tweaks and additions.

    A couple of changes reach outside of Documentation/; they only make
    trivial comment changes and I did my best to get the required acks.

    Complete with a shiny signed tag this time around"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6:
    kobject: grammar fix
    Input: xpad - update docs to reflect current state
    Documentation: Build mic/mpssd only for x86_64
    cgroups: Documentation: fix wrong cgroupfs paths
    Documentation/email-clients.txt: add info about Claws Mail
    CodingStyle: add some more error handling guidelines
    kselftest: Move the docs to the Documentation dir
    Documentation: fix formatting to make 's' happy
    Documentation: power: Fix typo in Documentation/power
    Documentation: vm: Add 1GB large page support information
    ipv4: add kernel parameter tcpmhash_entries
    Documentation: Fix a typo in mailbox.txt
    treewide: Fix typo in Documentation/DocBook/device-drivers
    CodingStyle: Add a chapter on conditional compilation

    Linus Torvalds
     

11 Dec, 2014

4 commits

  • Memory cgroups used to have 5 per-page pointers. To allow users to
    disable that amount of overhead during runtime, those pointers were
    allocated in a separate array, with a translation layer between them and
    struct page.

    There is now only one page pointer remaining: the memcg pointer, that
    indicates which cgroup the page is associated with when charged. The
    complexity of runtime allocation and the runtime translation overhead is
    no longer justified to save that *potential* 0.19% of memory. With
    CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
    after the page->private member and doesn't even increase struct page,
    and then this patch actually saves space. Remaining users that care can
    still compile their kernels without CONFIG_MEMCG.

       text    data     bss      dec    hex filename
    8828345 1725264  983040 11536649 b00909 vmlinux.old
    8827425 1725264  966656 11519345 afc571 vmlinux.new
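
    The figures above can be checked with a few lines of Python. The 8-byte pointer and 4096-byte page are assumptions (a typical 64-bit configuration); the size numbers are copied from the table.

```python
POINTER_SIZE = 8       # bytes per memcg pointer, assumed 64-bit kernel
PAGE_SIZE = 4096       # bytes per page, assumed 4 KiB pages

# One pointer per page of memory, as a fraction of that page.
overhead = POINTER_SIZE / PAGE_SIZE

old = {"text": 8828345, "data": 1725264, "bss": 983040}
new = {"text": 8827425, "data": 1725264, "bss": 966656}

# Per-section savings and their sum, in bytes.
saved = {section: old[section] - new[section] for section in old}
total_saved = sum(saved.values())

print(f"per-page overhead: {overhead:.3%}")
print(f"saved: {saved}, total {total_saved} bytes")
```

    Under these assumptions the overhead comes out just above 0.19%, and the section totals match the "dec" column of the table.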

    [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All memory accounting and limiting has been switched over to the
    lockless page counters. Bye, res_counter!

    [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
    [mhocko@suse.cz: ditch the last remainings of res_counter]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Paul Bolle
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Abandon the spinlock-protected byte counters in favor of the unlocked
    page counters in the hugetlb controller as well.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory is internally accounted in bytes, using spinlock-protected 64-bit
    counters, even though the smallest accounting delta is a page. The
    counter interface is also convoluted and does too many things.

    Introduce a new lockless word-sized page counter API, then change all
    memory accounting over to it. The translation from and to bytes then only
    happens when interfacing with userspace.

    The removed locking overhead is noticeable when scaling beyond the per-cpu
    charge caches - on a 4-socket machine with 144 threads, the following test
    shows the performance differences of 288 memcgs concurrently running a
    page fault benchmark:

    vanilla:

    18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
    1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
    24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
    1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
    50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
    1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
    1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

    132.474343877 seconds time elapsed ( +- 0.21% )

    lockless:

    12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
    832,850 context-switches # 0.068 K/sec ( +- 0.54% )
    15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
    1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
    32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
    2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
    1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

    91.369330729 seconds time elapsed ( +- 0.45% )
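
    To put the two runs side by side, the short Python sketch below computes the relative reduction for a few of the metrics listed above. The numbers are copied from the perf output; reduction() is a hypothetical helper.

```python
vanilla = {
    "elapsed_s": 132.474343877,
    "task_clock_ms": 18631648.500498,
    "cycles": 50_134_994_088_218,
}
lockless = {
    "elapsed_s": 91.369330729,
    "task_clock_ms": 12195979.037525,
    "cycles": 32_811_216_801_141,
}


def reduction(before, after):
    """Fraction by which a metric dropped from before to after."""
    return 1 - after / before


for key in vanilla:
    print(f"{key}: {reduction(vanilla[key], lockless[key]):.1%} lower")

speedup = vanilla["elapsed_s"] / lockless["elapsed_s"]
print(f"wall-clock speedup: {speedup:.2f}x")
```

    Elapsed time drops by roughly 31%, while task-clock and cycles drop by roughly 35%.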

    On top of improved scalability, this also gets rid of the icky long long
    types in the very heart of memcg, which is great for 32-bit and also makes
    the code a lot more readable.

    Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become
    page_counter_try_charge() and page_counter_charge() resp. to match
    the more common kernel naming scheme of try_do()/do()

    - res_counter_uncharge_until() is only ever used to cancel a local
    counter and never to uncharge bigger segments of a hierarchy, so
    it's replaced by the simpler page_counter_cancel()

    - res_counter_set_limit() is replaced by page_counter_limit(), which
    expects its callers to serialize against themselves

    - res_counter_memparse_write_strategy() is replaced by
    page_counter_limit(), which rounds down to the nearest page size -
    rather than up. This is more reasonable for explicitly requested
    hard upper limits.

    - to keep charging light-weight, page_counter_try_charge() charges
    speculatively, only to roll back if the result exceeds the limit.
    Because of this, a failing bigger charge can temporarily lock out
    smaller charges that would otherwise succeed. The error is bounded
    to the difference between the smallest and the biggest possible
    charge size, so for memcg, this means that a failing THP charge can
    send base page charges into reclaim up to 2MB (4MB) before the limit
    would have been reached. This should be acceptable.
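
    The speculative charge in the last item can be modelled in a few lines of Python. This is a single-threaded sketch of the idea only, not the kernel code: the kernel uses atomic operations on the counters, whereas plain integers are used here, and the class below is a hypothetical stand-in for the real API.

```python
class PageCounter:
    """Hierarchical counter of pages with a per-level limit."""

    def __init__(self, limit, parent=None):
        self.count = 0
        self.limit = limit
        self.parent = parent

    def cancel(self, nr_pages):
        """Undo a charge on this counter only."""
        self.count -= nr_pages

    def try_charge(self, nr_pages):
        """Charge this counter and all ancestors, or nothing at all."""
        charged = []
        counter = self
        while counter is not None:
            counter.count += nr_pages        # charge first, check after
            if counter.count > counter.limit:
                counter.count -= nr_pages    # roll back the failing level
                for done in charged:         # and every level below it
                    done.cancel(nr_pages)
                return False
            charged.append(counter)
            counter = counter.parent
        return True
```

    Between the speculative add and the rollback the count is briefly too high, which is the window in which a concurrent smaller charge could fail. With 4 KiB base pages a 512-page charge corresponds to the 2MB figure quoted above.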

    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Dec, 2014

1 commit


25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently we don't,
    which is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
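
    The failure mode is a classic lost update on a shared word. The Python sketch below replays one bad interleaving deterministically instead of using real threads; the bit values are illustrative and the function names are hypothetical.

```python
PF_USED_MATH = 0x00002000      # illustrative bit values
PF_SPREAD_PAGE = 0x01000000


def racy_interleaving(flags):
    """Two non-atomic read-modify-write sequences on the same word."""
    a_loaded = flags                       # thread A loads, wants to clear
    b_loaded = flags                       # thread B loads, wants to set
    flags = b_loaded | PF_SPREAD_PAGE      # B stores its result
    flags = a_loaded & ~PF_USED_MATH       # A stores, overwriting B's bit
    return flags


def atomic_steps(flags):
    """Each update happens as one indivisible step, in either order."""
    flags |= PF_SPREAD_PAGE
    flags &= ~PF_USED_MATH
    return flags


start = PF_USED_MATH
print(hex(racy_interleaving(start)))   # spread bit is lost
print(hex(atomic_steps(start)))        # spread bit survives
```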

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

     text   data    bss    dec    hex filename
    37970   9892    400  48262   bc86 mm/memcontrol.o.old
    35239   9892    400  45531   b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.
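
    The try/commit/cancel pattern can be sketched in Python as follows. This is a toy model under simplifying assumptions: reclaim is not modelled, pages are plain strings, and Memcg and fault_in() are hypothetical names rather than the kernel functions themselves.

```python
class Memcg:
    """Toy memory cgroup with a page limit."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.reserved = 0      # pages charged, committed or in flight
        self.pages = set()     # pages committed to this group

    def try_charge(self, nr_pages=1):
        """Reserve room for the page; the kernel would reclaim first."""
        if self.reserved + nr_pages > self.limit:
            return False
        self.reserved += nr_pages
        return True

    def commit_charge(self, page):
        """Bind the now fully set up page to the reserved charge."""
        self.pages.add(page)

    def cancel_charge(self, nr_pages=1):
        """Abort the transaction and give the reservation back."""
        self.reserved -= nr_pages


def fault_in(memcg, page, map_page):
    """Charge, set up the mapping, then commit or cancel."""
    if not memcg.try_charge():
        return False
    if not map_page(page):     # stands in for establishing the rmap
        memcg.cancel_charge()
        return False
    memcg.commit_charge(page)
    return True
```

    Every callsite follows the same three steps, so no type-specific charge function is needed.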

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2014

1 commit

  • Until now, cftype arrays carried files for both the default and legacy
    hierarchies and the files which needed to be used on only one of them
    were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
    gets confusing very quickly and we may end up exposing interface files
    to the default hierarchy without thinking it through.

    This patch makes cgroup core provide separate sets of interfaces for
    cftype handling so that the cftypes for the default and legacy
    hierarchies are clearly distinguished. The previous two patches
    renamed the existing ones so that they clearly indicate that they're
    for the legacy hierarchies. This patch adds the interface for the
    default hierarchy and applies them selectively depending on the
    hierarchy type.

    * cftypes added through cgroup_subsys->dfl_cftypes and
    cgroup_add_dfl_cftypes() only show up on the default hierarchy.

    * cftypes added through cgroup_subsys->legacy_cftypes and
    cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

    * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
    same array for the cases where the interface files are identical on
    both types of hierarchies.

    * This makes all the existing subsystem interface files legacy-only by
    default and all subsystems will have no interface file created when
    enabled on the default hierarchy. Each subsystem should explicitly
    review and compose the interface for the default hierarchy.

    * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
    makes subsystems that haven't yet decided on their interface files
    for the default hierarchy present the legacy files there, so that
    their behavior on the default hierarchy can be tested. As the
    awkward name suggests, this is for development only.

    * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
    array isn't used on the default hierarchy. The flag is removed.
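
    The selection rules in the list above can be summarised in a small Python sketch. It models only which file set is shown; Subsys and visible_files() are hypothetical names, and the file names in the usage example are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Subsys:
    name: str
    dfl_cftypes: list = field(default_factory=list)
    legacy_cftypes: list = field(default_factory=list)


def visible_files(ss, on_default, legacy_files_on_dfl=False):
    """Return the interface files a subsystem exposes on a hierarchy."""
    if not on_default:
        return list(ss.legacy_cftypes)     # legacy hierarchies
    if ss.dfl_cftypes:
        return list(ss.dfl_cftypes)        # reviewed default interface
    if legacy_files_on_dfl:
        return list(ss.legacy_cftypes)     # development boot param
    return []                              # nothing decided yet


cpu = Subsys("cpu", legacy_cftypes=["cpu.shares"])
print(visible_files(cpu, on_default=True))
print(visible_files(cpu, on_default=True, legacy_files_on_dfl=True))
```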

    v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

    v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
    as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

09 Jul, 2014

2 commits

  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    blkio piggybacking on memory is an implementation detail which
    preferably should be handled automatically without requiring explicit
    userland action. To achieve that, this patch implements
    cgroup_subsys->depends_on which contains the mask of subsystems which
    should be enabled together when the subsystem is enabled.

    The previous patches already implemented the support for enabled but
    invisible subsystems and cgroup_subsys->depends_on can be easily
    implemented by updating cgroup_refresh_child_subsys_mask() so that it
    calculates cgroup->child_subsys_mask considering
    cgroup_subsys->depends_on of the explicitly enabled subsystems.
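
    A minimal Python sketch of such a mask calculation, assuming subsystems are identified by small integer ids and that dependencies are followed until the mask stops changing. The ids and the fixed-point loop are assumptions for illustration, not taken from the patch.

```python
MEMORY, BLKIO = 0, 1   # hypothetical subsystem ids


def refresh_child_subsys_mask(explicit_mask, depends_on):
    """Extend the explicitly enabled mask by all dependencies.

    depends_on maps a subsystem id to the bitmask of subsystems that
    must be enabled together with it.
    """
    mask = explicit_mask
    while True:
        new_mask = mask
        for ssid, deps in depends_on.items():
            if mask & (1 << ssid):
                new_mask |= deps
        if new_mask == mask:       # nothing left to pull in
            return mask
        mask = new_mask


deps = {BLKIO: 1 << MEMORY}
print(bin(refresh_child_subsys_mask(1 << BLKIO, deps)))
```

    Enabling only blkio yields a mask with both bits set, while enabling only memory leaves the mask unchanged.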

    Documentation/cgroups/unified-hierarchy.txt is updated to explain that
    subsystems may not become immediately available after being unused
    from userland and that dependency could be a factor in it. As
    subsystems may already keep residual references, this doesn't
    significantly change how subsystem rebinding can be used.
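
    As a rough illustration of the mask calculation described above, the
    following is a minimal sketch, not the kernel code. The struct, the
    NR_SUBSYS constant and the effective_mask() helper are hypothetical
    names; the sketch also assumes that dependencies are followed
    transitively until the mask stops changing.

```c
#include <stdint.h>

#define NR_SUBSYS 4  /* hypothetical number of controllers */

/* Hypothetical stand-in for the per-controller dependency mask. */
struct subsys {
    uint32_t depends_on;   /* bit i set: controller i must be enabled too */
};

/*
 * Expand a mask of explicitly enabled controllers with everything they
 * depend on. Iterates to a fixed point so that chains of dependencies
 * are picked up as well (an assumption of this sketch).
 */
static uint32_t effective_mask(const struct subsys *ss, uint32_t enabled)
{
    uint32_t cur = enabled, next;

    for (;;) {
        next = cur;
        for (int i = 0; i < NR_SUBSYS; i++)
            if (cur & (1u << i))
                next |= ss[i].depends_on;
        if (next == cur)        /* nothing new was pulled in */
            return cur;
        cur = next;
    }
}
```

    With controller 1 depending on controller 0 (think blkio on memory),
    enabling only bit 1 yields a mask with bits 0 and 1 set, while
    userland still sees only controller 1 as explicitly enabled.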

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     
  • cgroup is implementing support for subsystem dependency which would
    require a way to enable a subsystem even when it's not directly
    configured through "cgroup.subtree_control".

    The previous patches added support for explicitly and implicitly
    enabled subsystems and showing/hiding their interface files. An
    explicitly enabled subsystem may become implicitly enabled if it's
    turned off through "cgroup.subtree_control" but there are subsystems
    depending on it. In such cases, the subsystem, as it's turned off
    when seen from userland, shouldn't enforce any resource control.
    Also, the subsystem may be explicitly turned on later again and its
    interface files should be as close to the initial state as possible.

    This patch adds cgroup_subsys->css_reset() which is invoked when a css
    is hidden. The callback should disable resource control and reset the
    state to the vanilla state.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

07 Jun, 2014

1 commit

  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view swappiness is an attribute defining anon
    vs. file proportional scanning of LRU which is memcg specific (unlike
    charges which are propagated up the hierarchy) so it should be applied
    to the particular memcg's LRU regardless where the memory pressure comes
    from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
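
    The resulting selection rule can be sketched as follows. This is a
    simplified illustration under stated assumptions, not the patch
    itself: struct memcg, memcg_swappiness() and prepare_scan() are
    hypothetical stand-ins, and only the swappiness field of
    scan_control is modelled.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct memcg {
    struct memcg *parent;   /* NULL for the root memcg */
    int swappiness;         /* per-group memory.swappiness */
};

struct scan_control {
    int swappiness;         /* value used while scanning one LRU */
};

static int vm_swappiness = 60;  /* global knob */

/* The root memcg (or no memcg at all) falls back to the global knob. */
static int memcg_swappiness(const struct memcg *memcg)
{
    if (memcg == NULL || memcg->parent == NULL)
        return vm_swappiness;
    return memcg->swappiness;
}

/*
 * Set up the scan for the memcg whose LRU is about to be scanned.
 * The reclaim target (the origin of the memory pressure) is
 * deliberately not consulted.
 */
static void prepare_scan(struct scan_control *sc, const struct memcg *scanned)
{
    sc->swappiness = memcg_swappiness(scanned);
}
```

    In this model a child configured with swappiness 0 keeps that value
    even when the pressure originates from a parent configured
    differently.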

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jun, 2014

2 commits

  • Kmemcg is currently under development and lacks some important features.
    In particular, it does not have support of kmem reclaim on memory pressure
    inside cgroup, which practically makes it unusable in real life. Let's
    warn about it in both Kconfig and Documentation to prevent complaints
    arising.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Per-memcg swappiness and oom killing can currently not be tweaked on a
    memcg that is part of a hierarchy, but not the root of that hierarchy.
    Users have complained that they can't configure this when they turned on
    hierarchy mode. In fact, with hierarchy mode becoming the default, this
    restriction disables the tunables entirely.

    But there is no good reason for this restriction. The settings for
    swappiness and OOM killing are taken from whatever memcg whose limit
    triggered reclaim and OOM invocation, regardless of its position in the
    hierarchy tree.

    Allow setting swappiness on any group. The knob on the root memcg
    already reads the global VM swappiness, make it writable as well.

    Allow disabling the OOM killer on any non-root memcg.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 May, 2014

1 commit

  • Tejun has correctly pointed out that tasks/children test in
    mem_cgroup_force_empty is not correct because there is no other locking
    which preserves this state throughout the rest of the function, so
    new tasks can join the group or new children groups can be added while
    somebody is writing to memory.force_empty. A new task would break
    mem_cgroup_reparent_charges expectation that all failures as described
    by mem_cgroup_force_empty_list are temporary and there is no way out.

    The main use case for the knob as described by
    Documentation/cgroups/memory.txt is to:
    "
    The typical use case for this interface is before calling rmdir().
    Because rmdir() moves all pages to parent, some out-of-use page caches can be
    moved to the parent. If you want to avoid that, force_empty will be useful.
    "

    This means that reparenting is not really required as rmdir will
    reparent pages implicitly from the safe context. If we remove it from
    mem_cgroup_force_empty then we are safe even with existing tasks because
    the number of reclaim attempts is bounded. Moreover the knob still does
    what the documentation claims (modulo reparenting which doesn't make any
    difference) and users might expect. Longterm we want to deprecate the
    whole knob and put the reparented pages to the tail of parent LRU during
    cgroup removal.

    tj: Removed unused variable @cgrp from mem_cgroup_force_empty()

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Michal Hocko
     

26 Apr, 2014

1 commit

  • Unified hierarchy will be the new version of cgroup interface. This
    patch adds Documentation/cgroups/unified-hierarchy.txt which describes
    the design and rationales of unified hierarchy.

    v2: Grammatical updates as per Randy Dunlap's review.

    Signed-off-by: Tejun Heo
    Cc: Randy Dunlap

    Tejun Heo
     

08 Apr, 2014

2 commits

  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The res_counter_{charge,uncharge}_locked() variants are not used in the
    kernel outside of the resource counter code itself, so remove the
    interface.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

01 Feb, 2014

1 commit

  • Pull media updates from Mauro Carvalho Chehab:
    - a new jpeg codec driver for Samsung Exynos (jpeg-hw-exynos4)
    - a new dvb frontend for ds2103 chipset (m88ds2103)
    - a new sensor driver for Samsung S5K5BAF UXGA (s5k5baf)
    - new drivers for R-Car VSP1
    - a new radio driver: radio-raremono
    - a new tuner driver for ts2022 chipset (m88ts2022)
    - the analog part of em28xx is now a separate module that only
    loads/runs if the device is not a pure digital TV device
    - added a staging driver for bcm2048 radio devices
    - the omap 2 video driver (omap24xx) was moved to staging. This driver
    is for old hardware and uses a deprecated kernel-internal API. If
    nobody cares enough to fix it, it will be removed in a couple of
    kernel releases
    - the sn9c102 driver was moved to staging. This driver was replaced by
    gspca, and disabled on some distros, as almost all devices are known
    to work properly with gspca. It should be removed from the kernel in
    a couple of kernel releases
    - lots of driver fixes, improvements and cleanups

    * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (421 commits)
    [media] media: v4l2-dev: fix video device index assignment
    [media] rc-core: reuse device numbers
    [media] em28xx-cards: properly initialize the device bitmap
    [media] Staging: media: Fix line length exceeding 80 characters in as102_drv.c
    [media] Staging: media: Fix line length exceeding 80 characters in as102_fe.c
    [media] Staging: media: Fix quoted string split across line in as102_fe.c
    [media] media: st-rc: Add reset support
    [media] m2m-deinterlace: fix allocated struct type
    [media] radio-usb-si4713: fix sparse non static symbol warnings
    [media] em28xx-audio: remove needless check before usb_free_coherent()
    [media] au0828: Fix sparse non static symbol warning
    Revert "[media] go7007-usb: only use go->dev after allocated"
    [media] em28xx-audio: provide an error code when URB submit fails
    [media] em28xx: fix check for audio only usb interfaces when changing the usb alternate setting
    [media] em28xx: fix usb alternate setting for analog and digital video endpoints > 0
    [media] em28xx: make 'em28xx_ctrl_ops' static
    em28xx-alsa: Fix error patch for init/fini
    [media] em28xx-audio: flush work at .fini
    [media] drxk: remove the option to load firmware asynchronously
    [media] em28xx: adjust period size at runtime
    ...

    Linus Torvalds
     

26 Jan, 2014

1 commit

  • Pull networking updates from David Miller:

    1) BPF debugger and asm tool by Daniel Borkmann.

    2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

    3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

    4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match. From Ben Hutchings.

    5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

    6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

    7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

    8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

    9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

    10) Fix ipv6 router reachability probing, from Jiri Benc.

    11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

    12) Support tunneling in GRO layer, from Jerry Chu.

    13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

    14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI. From Atzm Watanabe.

    15) New "Heavy Hitter" qdisc, from Terry Lam.

    16) Significantly improve the IPSEC support in pktgen, from Fan Du.

    17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
    Herbert.

    18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

    19) Allow openvswitch to use mmap'd netlink, from Thomas Graf.

    20) Key TCP metrics blobs also by source address, not just destination
    address. From Christoph Paasch.

    21) Support 10G in generic phylib. From Andy Fleming.

    22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided. From Tom Herbert.

    The wireless and netfilter folks have been busy little bees too.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
    net/cxgb4: Fix referencing freed adapter
    ipv6: reallocate addrconf router for ipv6 address when lo device up
    fib_frontend: fix possible NULL pointer dereference
    rtnetlink: remove IFLA_BOND_SLAVE definition
    rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
    qlcnic: update version to 5.3.55
    qlcnic: Enhance logic to calculate msix vectors.
    qlcnic: Refactor interrupt coalescing code for all adapters.
    qlcnic: Update poll controller code path
    qlcnic: Interrupt code cleanup
    qlcnic: Enhance Tx timeout debugging.
    qlcnic: Use bool for rx_mac_learn.
    bonding: fix u64 division
    rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
    sfc: Use the correct maximum TX DMA ring size for SFC9100
    Add Shradha Shah as the sfc driver maintainer.
    net/vxlan: Share RX skb de-marking and checksum checks with ovs
    tulip: cleanup by using ARRAY_SIZE()
    ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
    net/cxgb4: Don't retrieve stats during recovery
    ...

    Linus Torvalds
     

04 Jan, 2014

1 commit

  • It would be useful e.g. in a server or desktop environment to have
    a facility in the notion of fine-grained "per application" or "per
    application group" firewall policies. Probably, users in the mobile,
    embedded area (e.g. Android based) with different security policy
    requirements for application groups could have great benefit from
    that as well. For example, with a little bit of configuration effort,
    an admin could whitelist well-known applications, and thus block
    otherwise unwanted "hard-to-track" applications like [1] from a
    user's machine. Blocking is just one example, but it is not limited
    to that, meaning we can have much different scenarios/policies that
    netfilter allows us than just blocking, e.g. fine grained settings
    where applications are allowed to connect/send traffic to, application
    traffic marking/conntracking, application-specific packet mangling,
    and so on.

    Implementation of PID-based matching would not be appropriate
    as they frequently change, and child tracking would make that
    even more complex and ugly. Cgroups would be a perfect candidate
    for accomplishing that as they associate a set of tasks with a
    set of parameters for one or more subsystems, in our case the
    netfilter subsystem, which, of course, can be combined with other
    cgroup subsystems into something more complex if needed.

    As mentioned, to overcome this constraint, such processes could
    be placed into one or multiple cgroups where different fine-grained
    rules can be defined depending on the application scenario, while
    e.g. everything else that is not part of that could be dropped (or
    vice versa), thus making life harder for unwanted processes to
    communicate to the outside world. So, we make use of cgroups here
    to track jobs and limit their resources in terms of iptables
    policies; in other words, limiting, tracking, etc what they are
    allowed to communicate.

    In our case we're working on outgoing traffic based on which local
    socket it originated from. Also, one doesn't even need to have
    a priori knowledge of the application internals regarding their
    particular use of ports or protocols. Matching is *extremely*
    lightweight as we just test for the sk_classid marker of sockets,
    originating from net_cls. net_cls and netfilter do not contradict
    each other; in fact, each construct can live as standalone or they
    can be used in combination with each other, which is perfectly fine,
    plus it serves Tejun's requirement to not introduce a new cgroups
    subsystem. Through this, we result in a very minimal and efficient
    module, and don't add anything except netfilter code.

    One possible, minimal usage example (many other iptables options
    can be applied obviously):

    1) Configuring cgroups if not already done, e.g.:

    mkdir /sys/fs/cgroup/net_cls
    mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
    mkdir /sys/fs/cgroup/net_cls/0
    echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
    (resp. a real flow handle id for tc)

    2) Configuring netfilter (iptables-nftables), e.g.:

    iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP

    3) Running applications, e.g.:

    ping 208.67.222.222
    echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
    64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
    [...]
    ping 208.67.220.220
    ping: sendmsg: Operation not permitted
    [...]
    echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
    64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
    [...]

    Of course, real-world deployments would make use of cgroups user
    space toolsuite, or own custom policy daemons dynamically moving
    applications from/to various cgroups.
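
    The match itself can be pictured with the following minimal sketch.
    It is an illustration only: struct sock_stub, struct cgroup_match
    and cgroup_rule_matches() are hypothetical names, not the kernel's
    definitions, and treating a packet without a local socket as
    non-matching is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's real definitions. */
struct sock_stub {
    uint32_t sk_classid;    /* class id inherited from the net_cls cgroup */
};

struct cgroup_match {
    uint32_t id;            /* value given to --cgroup */
    bool invert;            /* true when the rule is written "! --cgroup" */
};

/* Returns true when the rule matches the packet's originating socket. */
static bool cgroup_rule_matches(const struct cgroup_match *m,
                                const struct sock_stub *sk)
{
    if (sk == NULL)         /* no local socket: treat as non-matching */
        return false;
    /* A single integer comparison, flipped when the rule is inverted. */
    return (sk->sk_classid == m->id) != m->invert;
}
```

    For the rule from step 2 ("! --cgroup 1 -j DROP"), a socket with
    classid 1 does not match and is left alone, while a socket with any
    other classid matches and is dropped.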

    [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf

    Signed-off-by: Daniel Borkmann
    Cc: Tejun Heo
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

31 Dec, 2013

1 commit


11 Dec, 2013

1 commit


23 Nov, 2013

2 commits

  • Merge v3.12 based patch series to move cgroup_event implementation to
    memcg into for-3.14. The following two commits cause a conflict in
    kernel/cgroup.c

    2ff2a7d03bbe4 ("cgroup: kill css_id")
    79bd9814e5ec9 ("cgroup, memcg: move cgroup_event implementation to memcg")

    Each patch removes a struct definition from kernel/cgroup.c. As the
    two are adjacent, they cause a context conflict. Easily resolved by
    removing both structs.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_event is only available in memcg now. Let's brand it that way.
    While at it, add a comment encouraging deprecation of the feature and
    remove the respective section from cgroup documentation.

    This patch is cosmetic.

    v3: Typo update as per Li Zefan.

    v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     

13 Nov, 2013

1 commit

  • The memory.numa_stat file was not hierarchical. Memory charged to the
    children was not shown in parent's numa_stat.

    This change adds the "hierarchical_" stats to the existing stats. The
    new hierarchical stats include the sum of all children's values in
    addition to the value of the memcg.

    Tested: Create cgroup a, a/b and run workload under b. The values of
    b are included in the "hierarchical_*" under a.

    $ cd /sys/fs/cgroup
    $ echo 1 > memory.use_hierarchy
    $ mkdir a a/b

    Run workload in a/b:
    $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

    The hierarchical_ fields in parent (a) show use of workload in a/b:
    $ cat a/memory.numa_stat
    total=0 N0=0 N1=0 N2=0 N3=0
    file=0 N0=0 N1=0 N2=0 N3=0
    anon=0 N0=0 N1=0 N2=0 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    $ cat a/b/memory.numa_stat
    total=908 N0=552 N1=317 N2=39 N3=0
    file=850 N0=549 N1=301 N2=0 N3=0
    anon=58 N0=3 N1=16 N2=39 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
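
    The relationship between the plain and the "hierarchical_" values
    shown above can be sketched as a recursive sum. This is a simplified
    illustration, not the memcg implementation: struct memcg_node, the
    MAX_NODES constant and hierarchical_pages() are hypothetical names.

```c
#include <stddef.h>

#define MAX_NODES 4     /* matches the N0..N3 columns in the sample output */

/* Hypothetical, simplified memcg: per-node page counts plus a child list. */
struct memcg_node {
    unsigned long pages[MAX_NODES];     /* pages charged directly to this group */
    struct memcg_node *first_child;
    struct memcg_node *next_sibling;
};

/*
 * Add the group's own per-node counts and those of all its descendants
 * to out[]. The caller is expected to zero out[] beforehand.
 */
static void hierarchical_pages(const struct memcg_node *cg,
                               unsigned long out[MAX_NODES])
{
    for (int n = 0; n < MAX_NODES; n++)
        out[n] += cg->pages[n];
    for (const struct memcg_node *c = cg->first_child; c; c = c->next_sibling)
        hierarchical_pages(c, out);
}
```

    Feeding in the sample data (a with no pages of its own, b with
    552/317/39/0 pages on N0..N3) gives the per-node values reported as
    hierarchical_total for a.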

    Signed-off-by: Ying Han
    Signed-off-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

13 Sep, 2013

1 commit


12 Jul, 2013

1 commit

  • Pull core block IO updates from Jens Axboe:
    "Here are the core IO block bits for 3.11. It contains:

    - A tweak to the reserved tag logic from Jan, for weirdo devices with
    just 3 free tags. But for those it improves things substantially
    for random writes.

    - Periodic writeback fix from Jan. Marked for stable as well.

    - Fix for a race condition in IO scheduler switching from Jianpeng.

    - The hierarchical blk-cgroup support from Tejun. This is the grunt
    of the series.

    - blk-throttle fix from Vivek.

    Just a note that I'm in the middle of a relocation, whole family is
    flying out tomorrow. Hence I will be AWOL the remainder of this week,
    but back at work again on Monday the 15th. CC'ing Tejun, since any
    potential "surprises" will most likely be from the blk-cgroup work.
    But it's been brewing for a while and sitting in my tree and
    linux-next for a long time, so should be solid."

    * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
    elevator: Fix a race in elevator switching
    block: Reserve only one queue tag for sync IO if only 3 tags are available
    writeback: Fix periodic writeback after fs mount
    blk-throttle: implement proper hierarchy support
    blk-throttle: implement throtl_grp->has_rules[]
    blk-throttle: Account for child group's start time in parent while bio climbs up
    blk-throttle: add throtl_qnode for dispatch fairness
    blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
    blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
    blk-throttle: make blk_throtl_bio() ready for hierarchy
    blk-throttle: make blk_throtl_drain() ready for hierarchy
    blk-throttle: dispatch from throtl_pending_timer_fn()
    blk-throttle: implement dispatch looping
    blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
    blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
    blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
    blk-throttle: add throtl_service_queue->parent_sq
    blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
    blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
    blk-throttle: move bio_lists[] and friends to throtl_service_queue
    ...

    Linus Torvalds
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds