14 Dec, 2014

2 commits

  • Merge second patchbomb from Andrew Morton:
    - the rest of MM
    - misc fs fixes
    - add execveat() syscall
    - new ratelimit feature for fault-injection
    - decompressor updates
    - ipc/ updates
    - fallocate feature creep
    - fsnotify cleanups
    - a few other misc things

    * emailed patches from Andrew Morton: (99 commits)
    cgroups: Documentation: fix trivial typos and wrong paragraph numberings
    parisc: percpu: update comments referring to __get_cpu_var
    percpu: update local_ops.txt to reflect this_cpu operations
    percpu: remove __get_cpu_var and __raw_get_cpu_var macros
    fsnotify: remove destroy_list from fsnotify_mark
    fsnotify: unify inode and mount marks handling
    fallocate: create FAN_MODIFY and IN_MODIFY events
    mm/cma: make kmemleak ignore CMA regions
    slub: fix cpuset check in get_any_partial
    slab: fix cpuset check in fallback_alloc
    shmdt: use i_size_read() instead of ->i_size
    ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
    ipc/msg: increase MSGMNI, remove scaling
    ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
    ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
    lib/decompress.c: consistency of compress formats for kernel image
    decompress_bunzip2: off by one in get_next_block()
    usr/Kconfig: make initrd compression algorithm selection not expert
    fault-inject: add ratelimit option
    ratelimit: add initialization macro
    ...

    Linus Torvalds
     
  • Signed-off-by: SeongJae Park
    Cc: Jonathan Corbet
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     

13 Dec, 2014

1 commit

  • Pull documentation update from Jonathan Corbet:
    "Here's my set of accumulated documentation changes for 3.19.

    It includes a couple of additions to the coding style document, some
    fixes for minor build problems within the documentation tree, the
    relocation of the kselftest docs, and various tweaks and additions.

    A couple of changes reach outside of Documentation/; they only make
    trivial comment changes and I did my best to get the required acks.

    Complete with a shiny signed tag this time around"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6:
    kobject: grammar fix
    Input: xpad - update docs to reflect current state
    Documentation: Build mic/mpssd only for x86_64
    cgroups: Documentation: fix wrong cgroupfs paths
    Documentation/email-clients.txt: add info about Claws Mail
    CodingStyle: add some more error handling guidelines
    kselftest: Move the docs to the Documentation dir
    Documentation: fix formatting to make 's' happy
    Documentation: power: Fix typo in Documentation/power
    Documentation: vm: Add 1GB large page support information
    ipv4: add kernel parameter tcpmhash_entries
    Documentation: Fix a typo in mailbox.txt
    treewide: Fix typo in Documentation/DocBook/device-drivers
    CodingStyle: Add a chapter on conditional compilation

    Linus Torvalds
     

11 Dec, 2014

4 commits

  • Memory cgroups used to have 5 per-page pointers. To allow users to
    disable that amount of overhead during runtime, those pointers were
    allocated in a separate array, with a translation layer between them and
    struct page.

    There is now only one page pointer remaining: the memcg pointer, that
    indicates which cgroup the page is associated with when charged. The
    complexity of runtime allocation and the runtime translation overhead is
    no longer justified to save that *potential* 0.19% of memory. With
    CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
    after the page->private member and doesn't even increase struct page,
    and then this patch actually saves space. Remaining users that care can
    still compile their kernels without CONFIG_MEMCG.

    text data bss dec hex filename
    8828345 1725264 983040 11536649 b00909 vmlinux.old
    8827425 1725264 966656 11519345 afc571 vmlinux.new
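
    The size figures above can be checked with a little arithmetic. The
    sketch below assumes 4096-byte pages and 8-byte pointers (neither is
    stated in the commit) to reproduce the quoted 0.19% estimate, and
    recomputes the per-section difference between the two vmlinux builds
    from the columns shown.

```python
# Assumed values (not stated in the commit): 64-bit pointers, 4 KiB pages.
POINTER_BYTES = 8
PAGE_BYTES = 4096


def per_page_overhead_percent(pointer_bytes=POINTER_BYTES, page_bytes=PAGE_BYTES):
    """Memory cost of one extra pointer per page, as a percentage."""
    return 100.0 * pointer_bytes / page_bytes


# text, data and bss columns copied from the size output above.
OLD = {"text": 8828345, "data": 1725264, "bss": 983040}
NEW = {"text": 8827425, "data": 1725264, "bss": 966656}


def section_savings(old=OLD, new=NEW):
    """Bytes saved per section going from vmlinux.old to vmlinux.new."""
    return {name: old[name] - new[name] for name in old}


if __name__ == "__main__":
    print(f"per-page pointer overhead: {per_page_overhead_percent():.4f}%")
    savings = section_savings()
    print(savings, "total bytes saved:", sum(savings.values()))
```

    Under these assumptions the per-page cost comes out just under 0.2%,
    and the section differences add up to the gap between the two "dec"
    columns.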

    [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All memory accounting and limiting has been switched over to the
    lockless page counters. Bye, res_counter!

    [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
    [mhocko@suse.cz: ditch the last remainings of res_counter]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Paul Bolle
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Abandon the spinlock-protected byte counters in favor of the unlocked
    page counters in the hugetlb controller as well.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory is internally accounted in bytes, using spinlock-protected 64-bit
    counters, even though the smallest accounting delta is a page. The
    counter interface is also convoluted and does too many things.

    Introduce a new lockless word-sized page counter API, then change all
    memory accounting over to it. The translation from and to bytes then only
    happens when interfacing with userspace.

    The removed locking overhead is noticeable when scaling beyond the per-cpu
    charge caches - on a 4-socket machine with 144-threads, the following test
    shows the performance differences of 288 memcgs concurrently running a
    page fault benchmark:

    vanilla:

    18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
    1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
    24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
    1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
    50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
    stalled-cycles-frontend
    stalled-cycles-backend
    8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
    1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
    1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

    132.474343877 seconds time elapsed ( +- 0.21% )

    lockless:

    12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
    832,850 context-switches # 0.068 K/sec ( +- 0.54% )
    15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
    1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
    32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
    stalled-cycles-frontend
    stalled-cycles-backend
    9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
    2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
    1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

    91.369330729 seconds time elapsed ( +- 0.45% )
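
    To put the two perf summaries side by side, the sketch below computes
    the relative reduction in elapsed time and cycles from the figures
    quoted above. The helper name is made up for illustration; only the
    numbers come from the output.

```python
# Figures copied from the two perf stat summaries above.
VANILLA = {"seconds": 132.474343877, "cycles": 50_134_994_088_218}
LOCKLESS = {"seconds": 91.369330729, "cycles": 32_811_216_801_141}


def reduction_percent(before, after):
    """Relative reduction going from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for key in ("seconds", "cycles"):
        pct = reduction_percent(VANILLA[key], LOCKLESS[key])
        print(f"{key}: {pct:.1f}% lower with lockless counters")
```

    On these numbers the lockless run takes roughly 31% less wall-clock
    time and roughly 35% fewer cycles for the same number of page faults.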

    On top of improved scalability, this also gets rid of the icky long long
    types in the very heart of memcg, which is great for 32 bit and also makes
    the code a lot more readable.

    Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become
    page_counter_try_charge() and page_counter_charge() resp. to match
    the more common kernel naming scheme of try_do()/do()

    - res_counter_uncharge_until() is only ever used to cancel a local
    counter and never to uncharge bigger segments of a hierarchy, so
    it's replaced by the simpler page_counter_cancel()

    - res_counter_set_limit() is replaced by page_counter_limit(), which
    expects its callers to serialize against themselves

    - res_counter_memparse_write_strategy() is replaced by
    page_counter_limit(), which rounds down to the nearest page size -
    rather than up. This is more reasonable for explicitly requested
    hard upper limits.

    - to keep charging light-weight, page_counter_try_charge() charges
    speculatively, only to roll back if the result exceeds the limit.
    Because of this, a failing bigger charge can temporarily lock out
    smaller charges that would otherwise succeed. The error is bounded
    to the difference between the smallest and the biggest possible
    charge size, so for memcg, this means that a failing THP charge can
    send base page charges into reclaim up to 2MB (4MB) before the limit
    would have been reached. This should be acceptable.
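
    The speculative charge and rollback described in the last point can
    be modelled in a few lines. This is a single-threaded Python sketch,
    not the kernel implementation: the class and method names only
    mirror the API names mentioned above, the real counters use atomic
    operations, and the 4096-byte page size is an assumption.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageCounter:
    """Toy hierarchical page counter (no atomics, no locking)."""

    def __init__(self, limit, parent=None):
        self.count = 0        # pages currently charged
        self.limit = limit    # maximum pages allowed
        self.parent = parent

    def _self_and_ancestors(self):
        counter = self
        while counter is not None:
            yield counter
            counter = counter.parent

    def cancel(self, nr_pages):
        """Undo a charge on this counter only."""
        self.count -= nr_pages

    def charge(self, nr_pages):
        """Charge unconditionally, all the way up the hierarchy."""
        for counter in self._self_and_ancestors():
            counter.count += nr_pages

    def try_charge(self, nr_pages):
        """Charge speculatively; roll back if any level exceeds its limit."""
        charged = []
        for counter in self._self_and_ancestors():
            counter.count += nr_pages          # speculative add
            if counter.count > counter.limit:
                counter.count -= nr_pages      # undo the failing level
                for done in charged:           # undo the levels below it
                    done.cancel(nr_pages)
                return False
            charged.append(counter)
        return True


def bytes_to_page_limit(nbytes):
    """Convert a byte limit to pages, rounding down as described above."""
    return nbytes // PAGE_SIZE
```

    In the real, concurrent code the window between the speculative add
    and the rollback is where a failing large charge can briefly push a
    smaller charge over the limit; the single-threaded model only shows
    that the counters end up unchanged after a failed attempt.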

    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Dec, 2014

1 commit


25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently we don't,
    which is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
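
    The failure mode is a lost update: two non-atomic read-modify-write
    sequences on the same word. The sketch below replays one bad
    interleaving deterministically in Python. The bit values are
    illustrative, not the kernel's, and keeping the spread bit in a
    separate word is only a stand-in for the atomic flag handling the
    patch introduces.

```python
# Illustrative bit values only; these are not the kernel's definitions.
PF_USED_MATH = 0x1
PF_SPREAD_PAGE = 0x2


def racy_shared_word():
    """Two load/modify/store sequences interleaved on one flags word."""
    flags = PF_USED_MATH
    a = flags                      # thread A loads, wants to clear PF_USED_MATH
    b = flags                      # thread B loads, wants to set PF_SPREAD_PAGE
    flags = a & ~PF_USED_MATH      # A stores its result
    flags = b | PF_SPREAD_PAGE     # B stores a stale copy: A's clear is lost
    return flags


def separate_words():
    """Same interleaving, but the spread bit lives in its own word."""
    flags = PF_USED_MATH
    atomic_flags = 0
    a = flags
    b = atomic_flags
    flags = a & ~PF_USED_MATH
    atomic_flags = b | PF_SPREAD_PAGE
    return flags, atomic_flags


if __name__ == "__main__":
    print("shared word:", bin(racy_shared_word()))
    print("separate words:", separate_words())
```

    In the shared-word case PF_USED_MATH is still set at the end even
    though one thread cleared it, which matches the kind of corruption
    the report describes.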

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.
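
    The try/commit/cancel protocol above can be sketched as a small
    state machine. This Python model is an illustration under
    simplifying assumptions: one page per transaction, no reclaim on
    failure, and the Memcg class and fault_in() helper are hypothetical
    names rather than kernel interfaces.

```python
class Memcg:
    """Toy memcg that tracks reserved and committed single-page charges."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.reserved = 0      # reserved by try_charge, not yet committed
        self.committed = {}    # page id -> "anon" or "file"

    def try_charge(self):
        """Reserve one page; fail if the limit would be exceeded."""
        if self.reserved + len(self.committed) + 1 > self.limit:
            return False       # the real code would try reclaim first
        self.reserved += 1
        return True

    def commit_charge(self, page_id, is_anon):
        """Bind the reservation to a page once its type is known."""
        self.reserved -= 1
        self.committed[page_id] = "anon" if is_anon else "file"

    def cancel_charge(self):
        """Abort the transaction and give the reservation back."""
        self.reserved -= 1


def fault_in(memcg, page_id, is_anon, map_page):
    """Charge a new page: try, set up the mapping, then commit or cancel."""
    if not memcg.try_charge():
        return False
    if not map_page():         # e.g. lost a race while installing the page
        memcg.cancel_charge()
        return False
    memcg.commit_charge(page_id, is_anon)
    return True
```

    The point of the split is that the page type is only recorded at
    commit time, after the mapping exists, so a single set of functions
    can serve every call site.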

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2014

1 commit

  • Until now, cftype arrays carried files for both the default and legacy
    hierarchies and the files which needed to be used on only one of them
    were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
    gets confusing very quickly and we may end up exposing interface files
    to the default hierarchy without thinking it through.

    This patch makes cgroup core provide separate sets of interfaces for
    cftype handling so that the cftypes for the default and legacy
    hierarchies are clearly distinguished. The previous two patches
    renamed the existing ones so that they clearly indicate that they're
    for the legacy hierarchies. This patch adds the interface for the
    default hierarchy and applies them selectively depending on the
    hierarchy type.

    * cftypes added through cgroup_subsys->dfl_cftypes and
    cgroup_add_dfl_cftypes() only show up on the default hierarchy.

    * cftypes added through cgroup_subsys->legacy_cftypes and
    cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

    * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
    same array for the cases where the interface files are identical on
    both types of hierarchies.

    * This makes all the existing subsystem interface files legacy-only by
    default and all subsystems will have no interface file created when
    enabled on the default hierarchy. Each subsystem should explicitly
    review and compose the interface for the default hierarchy.

    * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
    makes subsystems which haven't decided the interface files for the
    default hierarchy to present the legacy files on the default
    hierarchy so that its behavior on the default hierarchy can be
    tested. As the awkward name suggests, this is for development only.

    * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
    array isn't used on the default hierarchy. The flag is removed.
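
    The selection rules listed above can be summarised as one function.
    The sketch below is a Python model, not the cgroup core code: the
    Subsys class, the file names, and the treatment of an empty
    dfl_cftypes list as "interface not decided yet" are assumptions made
    for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subsys:
    """Toy subsystem with separate default and legacy file lists."""
    name: str
    dfl_cftypes: list = field(default_factory=list)
    legacy_cftypes: list = field(default_factory=list)


def interface_files(subsys, on_default_hierarchy, legacy_files_on_dfl=False):
    """Return the files that would show up for `subsys` on a hierarchy."""
    if not on_default_hierarchy:
        return list(subsys.legacy_cftypes)
    if subsys.dfl_cftypes:
        return list(subsys.dfl_cftypes)
    # Nothing composed for the default hierarchy yet: show nothing,
    # unless the development-only boot parameter is set.
    if legacy_files_on_dfl:
        return list(subsys.legacy_cftypes)
    return []


if __name__ == "__main__":
    demo = Subsys("demo", legacy_cftypes=["demo.old_knob"])
    print(interface_files(demo, on_default_hierarchy=False))
    print(interface_files(demo, on_default_hierarchy=True))
    print(interface_files(demo, on_default_hierarchy=True,
                          legacy_files_on_dfl=True))
```

    A subsystem whose files are identical on both hierarchy types would
    simply point both lists at the same entries.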

    v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

    v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
    as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

09 Jul, 2014

2 commits

  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    blkio piggybacking on memory is an implementation detail which
    preferably should be handled automatically without requiring explicit
    userland action. To achieve that, this patch implements
    cgroup_subsys->depends_on which contains the mask of subsystems which
    should be enabled together when the subsystem is enabled.

    The previous patches already implemented the support for enabled but
    invisible subsystems and cgroup_subsys->depends_on can be easily
    implemented by updating cgroup_refresh_child_subsys_mask() so that it
    calculates cgroup->child_subsys_mask considering
    cgroup_subsys->depends_on of the explicitly enabled subsystems.
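
    Computing the effective mask amounts to taking the closure of the
    explicitly enabled subsystems under depends_on. The sketch below
    shows one way to do that in Python; the function name and the
    subsystem IDs are hypothetical, and the fixed-point loop is an
    assumption about the approach rather than a copy of
    cgroup_refresh_child_subsys_mask().

```python
# Hypothetical subsystem IDs, used as bit positions in the masks.
MEMORY, BLKIO, OTHER = 0, 1, 2


def effective_subsys_mask(enabled_mask, depends_on):
    """Expand `enabled_mask` with everything its subsystems depend on.

    `depends_on` maps a subsystem ID to the mask of subsystems that must
    be enabled together with it.
    """
    mask = enabled_mask
    while True:
        new_mask = mask
        for ssid, deps in depends_on.items():
            if mask & (1 << ssid):
                new_mask |= deps
        if new_mask == mask:   # nothing new was pulled in: done
            return mask
        mask = new_mask


if __name__ == "__main__":
    deps = {BLKIO: 1 << MEMORY}
    print(bin(effective_subsys_mask(1 << BLKIO, deps)))
```

    Enabling only the blkio bit here yields a mask with both blkio and
    memory set, while the memory bit remains absent from what userland
    explicitly asked for.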

    Documentation/cgroups/unified-hierarchy.txt is updated to explain that
    subsystems may not become immediately available after being unused
    from userland and that dependency could be a factor in it. As
    subsystems may already keep residual references, this doesn't
    significantly change how subsystem rebinding can be used.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     
  • cgroup is implementing support for subsystem dependency which would
    require a way to enable a subsystem even when it's not directly
    configured through "cgroup.subtree_control".

    The previous patches added support for explicitly and implicitly
    enabled subsystems and showing/hiding their interface files. An
    explicitly enabled subsystem may become implicitly enabled if it's
    turned off through "cgroup.subtree_control" but there are subsystems
    depending on it. In such cases, the subsystem, as it's turned off
    when seen from userland, shouldn't enforce any resource control.
    Also, the subsystem may be explicitly turned on later again and its
    interface files should be as close to the initial state as possible.

    This patch adds cgroup_subsys->css_reset() which is invoked when a css
    is hidden. The callback should disable resource control and reset the
    state to the vanilla state.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

07 Jun, 2014

1 commit

  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view swappiness is an attribute defining anon
    vs. file proportional scanning of LRU which is memcg specific (unlike
    charges which are propagated up the hierarchy) so it should be applied
    to the particular memcg's LRU regardless where the memory pressure comes
    from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
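
    The resulting behaviour can be modelled as picking a swappiness
    value per memcg while walking the hierarchy under the reclaim
    target. This is a Python sketch, not the vmscan code: the Memcg
    class and swappiness_per_lruvec() are hypothetical, and only the
    rule "own value, or the global one for the root" comes from the
    text above.

```python
class Memcg:
    """Toy memcg node with its own swappiness setting."""

    def __init__(self, name, swappiness, parent=None):
        self.name = name
        self.swappiness = swappiness
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def mem_cgroup_swappiness(memcg, vm_swappiness):
    """The root memcg uses the global value; others use their own."""
    return vm_swappiness if memcg.parent is None else memcg.swappiness


def swappiness_per_lruvec(target, vm_swappiness=60):
    """Swappiness applied to each memcg when reclaiming from `target`."""
    plan = {}
    stack = [target]
    while stack:
        memcg = stack.pop()
        plan[memcg.name] = mem_cgroup_swappiness(memcg, vm_swappiness)
        stack.extend(memcg.children)
    return plan


if __name__ == "__main__":
    root = Memcg("root", 10)
    parent = Memcg("parent", 60, root)
    child = Memcg("child", 0, parent)
    # Pressure originates at the parent; the child keeps swappiness 0.
    print(swappiness_per_lruvec(parent))
```

    With this rule, a group configured with memory.swappiness=0 keeps
    that setting even when the pressure originates in its parent.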

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jun, 2014

2 commits

  • Kmemcg is currently under development and lacks some important features.
    In particular, it does not have support of kmem reclaim on memory pressure
    inside cgroup, which practically makes it unusable in real life. Let's
    warn about it in both Kconfig and Documentation to prevent complaints
    arising.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Per-memcg swappiness and oom killing can currently not be tweaked on a
    memcg that is part of a hierarchy, but not the root of that hierarchy.
    Users have complained that they can't configure this when they turned on
    hierarchy mode. In fact, with hierarchy mode becoming the default, this
    restriction disables the tunables entirely.

    But there is no good reason for this restriction. The settings for
    swappiness and OOM killing are taken from whatever memcg whose limit
    triggered reclaim and OOM invocation, regardless of its position in the
    hierarchy tree.

    Allow setting swappiness on any group. The knob on the root memcg
    already reads the global VM swappiness, make it writable as well.

    Allow disabling the OOM killer on any non-root memcg.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 May, 2014

1 commit

  • Tejun has correctly pointed out that the tasks/children test in
    mem_cgroup_force_empty is not correct because no other locking
    preserves this state throughout the rest of the function, so new
    tasks can join the group or new child groups can be added while
    somebody is writing to memory.force_empty. A new task would break the
    mem_cgroup_reparent_charges expectation that all failures as described
    by mem_cgroup_force_empty_list are temporary and there is no way out.

    The main use case for the knob as described by
    Documentation/cgroups/memory.txt is to:
    "
    The typical use case for this interface is before calling rmdir().
    Because rmdir() moves all pages to parent, some out-of-use page caches can be
    moved to the parent. If you want to avoid that, force_empty will be useful.
    "

    This means that reparenting is not really required as rmdir will
    reparent pages implicitly from the safe context. If we remove it from
    mem_cgroup_force_empty then we are safe even with existing tasks because
    the number of reclaim attempts is bounded. Moreover the knob still does
    what the documentation claims (modulo reparenting which doesn't make any
    difference) and users might expect. Longterm we want to deprecate the
    whole knob and put the reparented pages to the tail of parent LRU during
    cgroup removal.

    tj: Removed unused variable @cgrp from mem_cgroup_force_empty()

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Michal Hocko
     

26 Apr, 2014

1 commit

  • Unified hierarchy will be the new version of cgroup interface. This
    patch adds Documentation/cgroups/unified-hierarchy.txt which describes
    the design and rationales of unified hierarchy.

    v2: Grammatical updates as per Randy Dunlap's review.

    Signed-off-by: Tejun Heo
    Cc: Randy Dunlap

    Tejun Heo
     

08 Apr, 2014

2 commits

  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The res_counter_{charge,uncharge}_locked() variants are not used in the
    kernel outside of the resource counter code itself, so remove the
    interface.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

01 Feb, 2014

1 commit

  • Pull media updates from Mauro Carvalho Chehab:
    - a new jpeg codec driver for Samsung Exynos (jpeg-hw-exynos4)
    - a new dvb frontend for ds2103 chipset (m88ds2103)
    - a new sensor driver for Samsung S5K5BAF UXGA (s5k5baf)
    - new drivers for R-Car VSP1
    - a new radio driver: radio-raremono
    - a new tuner driver for ts2022 chipset (m88ts2022)
    - the analog part of em28xx is now a separate module that only
    loads/runs if the device is not a pure digital TV device
    - added a staging driver for bcm2048 radio devices
    - the omap 2 video driver (omap24xx) was moved to staging. This driver
    is for old hardware and uses a deprecated kernel-internal API. If
    nobody cares enough to fix it, it will be removed in a couple of
    kernel releases
    - the sn9c102 driver was moved to staging. This driver was replaced by
    gspca, and disabled on some distros, as almost all devices are known
    to work properly with gspca. It should be removed from the kernel in a
    couple of kernel releases
    - lots of driver fixes, improvements and cleanups

    * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (421 commits)
    [media] media: v4l2-dev: fix video device index assignment
    [media] rc-core: reuse device numbers
    [media] em28xx-cards: properly initialize the device bitmap
    [media] Staging: media: Fix line length exceeding 80 characters in as102_drv.c
    [media] Staging: media: Fix line length exceeding 80 characters in as102_fe.c
    [media] Staging: media: Fix quoted string split across line in as102_fe.c
    [media] media: st-rc: Add reset support
    [media] m2m-deinterlace: fix allocated struct type
    [media] radio-usb-si4713: fix sparse non static symbol warnings
    [media] em28xx-audio: remove needless check before usb_free_coherent()
    [media] au0828: Fix sparse non static symbol warning
    Revert "[media] go7007-usb: only use go->dev after allocated"
    [media] em28xx-audio: provide an error code when URB submit fails
    [media] em28xx: fix check for audio only usb interfaces when changing the usb alternate setting
    [media] em28xx: fix usb alternate setting for analog and digital video endpoints > 0
    [media] em28xx: make 'em28xx_ctrl_ops' static
    em28xx-alsa: Fix error patch for init/fini
    [media] em28xx-audio: flush work at .fini
    [media] drxk: remove the option to load firmware asynchronously
    [media] em28xx: adjust period size at runtime
    ...

    Linus Torvalds
     

26 Jan, 2014

1 commit

  • Pull networking updates from David Miller:

    1) BPF debugger and asm tool by Daniel Borkmann.

    2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

    3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

    4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match. From Ben Hutchings.

    5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

    6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

    7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

    8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

    9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

    10) Fix ipv6 router reachability probing, from Jiri Benc.

    11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

    12) Support tunneling in GRO layer, from Jerry Chu.

    13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

    14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI. From Atzm Watanabe.

    15) New "Heavy Hitter" qdisc, from Terry Lam.

    16) Significantly improve the IPSEC support in pktgen, from Fan Du.

    17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
    Herbert.

    18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

    19) Allow openvswitch to use mmap'd netlink, from Thomas Graf.

    20) Key TCP metrics blobs also by source address, not just destination
    address. From Christoph Paasch.

    21) Support 10G in generic phylib. From Andy Fleming.

    22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided. From Tom Herbert.

    The wireless and netfilter folks have been busy little bees too.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
    net/cxgb4: Fix referencing freed adapter
    ipv6: reallocate addrconf router for ipv6 address when lo device up
    fib_frontend: fix possible NULL pointer dereference
    rtnetlink: remove IFLA_BOND_SLAVE definition
    rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
    qlcnic: update version to 5.3.55
    qlcnic: Enhance logic to calculate msix vectors.
    qlcnic: Refactor interrupt coalescing code for all adapters.
    qlcnic: Update poll controller code path
    qlcnic: Interrupt code cleanup
    qlcnic: Enhance Tx timeout debugging.
    qlcnic: Use bool for rx_mac_learn.
    bonding: fix u64 division
    rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
    sfc: Use the correct maximum TX DMA ring size for SFC9100
    Add Shradha Shah as the sfc driver maintainer.
    net/vxlan: Share RX skb de-marking and checksum checks with ovs
    tulip: cleanup by using ARRAY_SIZE()
    ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
    net/cxgb4: Don't retrieve stats during recovery
    ...

    Linus Torvalds
     

04 Jan, 2014

1 commit

  • It would be useful e.g. in a server or desktop environment to have
    a facility for fine-grained "per application" or "per
    application group" firewall policies. Probably, users in the mobile,
    embedded area (e.g. Android based) with different security policy
    requirements for application groups could have great benefit from
    that as well. For example, with a little bit of configuration effort,
    an admin could whitelist well-known applications, and thus block
    otherwise unwanted "hard-to-track" applications like [1] from a
    user's machine. Blocking is just one example; netfilter allows many
    other scenarios/policies, e.g. fine-grained settings for where
    applications are allowed to connect/send traffic to, application
    traffic marking/conntracking, application-specific packet mangling,
    and so on.

    PID-based matching would not be appropriate, as PIDs change
    frequently, and child tracking would make that
    even more complex and ugly. Cgroups would be a perfect candidate
    for accomplishing that as they associate a set of tasks with a
    set of parameters for one or more subsystems, in our case the
    netfilter subsystem, which, of course, can be combined with other
    cgroup subsystems into something more complex if needed.

    As mentioned, to overcome this constraint, such processes could
    be placed into one or multiple cgroups where different fine-grained
    rules can be defined depending on the application scenario, while
    e.g. everything else that is not part of that could be dropped (or
    vice versa), thus making life harder for unwanted processes to
    communicate to the outside world. So, we make use of cgroups here
    to track jobs and limit their resources in terms of iptables
    policies; in other words, limiting, tracking, etc what they are
    allowed to communicate.

    In our case we're working on outgoing traffic based on which local
    socket it originated from. Also, one doesn't even need to have
    a priori knowledge of the application internals regarding their
    particular use of ports or protocols. Matching is *extremely*
    lightweight as we just test for the sk_classid marker of sockets,
    originating from net_cls. net_cls and netfilter do not contradict
    each other; in fact, each construct can live as standalone or they
    can be used in combination with each other, which is perfectly fine,
    plus it serves Tejun's requirement to not introduce a new cgroups
    subsystem. The result is a very minimal and efficient
    module that adds nothing except netfilter code.

    One possible, minimal usage example (many other iptables options
    can be applied obviously):

    1) Configuring cgroups if not already done, e.g.:

    mkdir /sys/fs/cgroup/net_cls
    mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
    mkdir /sys/fs/cgroup/net_cls/0
    echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
    (resp. a real flow handle id for tc)

    2) Configuring netfilter (iptables-nftables), e.g.:

    iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP

    3) Running applications, e.g.:

    ping 208.67.222.222
    echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
    64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
    [...]
    ping 208.67.220.220
    ping: sendmsg: Operation not permitted
    [...]
    echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
    64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
    [...]

    Of course, real-world deployments would make use of cgroups user
    space toolsuite, or own custom policy daemons dynamically moving
    applications from/to various cgroups.
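
The matching rule described above can be modelled in a few lines. This is a toy sketch in Python, not the xt_cgroup kernel code: the names `CgroupRule`, `rule_matches` and `output_verdict` are hypothetical, and it assumes that a socket whose task sits in no net_cls cgroup carries classid 0. It only shows how the inverted match in `iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP` decides between DROP and the chain policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CgroupRule:
    """Hypothetical stand-in for one '-m cgroup [!] --cgroup N -j VERDICT' rule."""
    classid: int
    invert: bool
    verdict: str


def rule_matches(rule, sk_classid):
    # Compare the socket's classid marker with the rule, then apply '!'.
    return (sk_classid == rule.classid) != rule.invert


def output_verdict(rules, sk_classid, policy="ACCEPT"):
    # First matching rule wins; otherwise fall back to the chain policy.
    for rule in rules:
        if rule_matches(rule, sk_classid):
            return rule.verdict
    return policy


# Model of: iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP
rules = [CgroupRule(classid=1, invert=True, verdict="DROP")]

print(output_verdict(rules, sk_classid=0))  # task not yet in the cgroup
print(output_verdict(rules, sk_classid=1))  # task moved into net_cls/0
```

In this model the first call corresponds to the ping that fails with "Operation not permitted", and the second to the same ping after its PID was written to the `tasks` file.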

    [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf

    Signed-off-by: Daniel Borkmann
    Cc: Tejun Heo
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

31 Dec, 2013

1 commit


11 Dec, 2013

1 commit


23 Nov, 2013

2 commits

  • Merge v3.12 based patch series to move cgroup_event implementation to
    memcg into for-3.14. The following two commits cause a conflict in
    kernel/cgroup.c

    2ff2a7d03bbe4 ("cgroup: kill css_id")
    79bd9814e5ec9 ("cgroup, memcg: move cgroup_event implementation to memcg")

    Each patch removes a struct definition from kernel/cgroup.c. As the
    two are adjacent, they cause a context conflict. Easily resolved by
    removing both structs.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_event is only available in memcg now. Let's brand it that way.
    While at it, add a comment encouraging deprecation of the feature and
    remove the respective section from cgroup documentation.

    This patch is cosmetic.

    v3: Typo update as per Li Zefan.

    v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     

13 Nov, 2013

1 commit

  • The memory.numa_stat file was not hierarchical. Memory charged to the
    children was not shown in parent's numa_stat.

    This change adds the "hierarchical_" stats to the existing stats. The
    new hierarchical stats include the sum of all children's values in
    addition to the value of the memcg.

    Tested: Create cgroup a, a/b and run workload under b. The values of
    b are included in the "hierarchical_*" under a.

    $ cd /sys/fs/cgroup
    $ echo 1 > memory.use_hierarchy
    $ mkdir a a/b

    Run workload in a/b:
    $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

    The hierarchical_ fields in parent (a) show use of workload in a/b:
    $ cat a/memory.numa_stat
    total=0 N0=0 N1=0 N2=0 N3=0
    file=0 N0=0 N1=0 N2=0 N3=0
    anon=0 N0=0 N1=0 N2=0 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    $ cat a/b/memory.numa_stat
    total=908 N0=552 N1=317 N2=39 N3=0
    file=850 N0=549 N1=301 N2=0 N3=0
    anon=58 N0=3 N1=16 N2=39 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
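
The relationship between the plain and "hierarchical_" fields can be checked mechanically. The following is a minimal sketch in Python that reuses the two outputs shown above as sample data; the helper names `parse_numa_stat` and `parent_covers_child` are illustrative assumptions, not part of any kernel or cgroup tooling, and the check assumes cgroup a has b as its only child.

```python
# Sample data copied from the test output above.
PARENT_A = """\
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
"""

CHILD_B = """\
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
"""


def parse_numa_stat(text):
    """Return {stat_name: (total, {node: value})} for numa_stat-style text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, total = fields[0].split("=")
        per_node = {}
        for field in fields[1:]:
            node, value = field.split("=")
            per_node[node] = int(value)
        stats[name] = (int(total), per_node)
    return stats


def parent_covers_child(parent, child):
    """Check parent's hierarchical_* == parent's own value + child's hierarchical_*."""
    for name in ("total", "file", "anon", "unevictable"):
        own_total, own_nodes = parent[name]
        sub_total, sub_nodes = child["hierarchical_" + name]
        hier_total, hier_nodes = parent["hierarchical_" + name]
        if hier_total != own_total + sub_total:
            return False
        for node, value in hier_nodes.items():
            if value != own_nodes[node] + sub_nodes[node]:
                return False
    return True


a = parse_numa_stat(PARENT_A)
b = parse_numa_stat(CHILD_B)
print(parent_covers_child(a, b))
```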

    Signed-off-by: Ying Han
    Signed-off-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

13 Sep, 2013

1 commit


12 Jul, 2013

1 commit

  • Pull core block IO updates from Jens Axboe:
    "Here are the core IO block bits for 3.11. It contains:

    - A tweak to the reserved tag logic from Jan, for weirdo devices with
    just 3 free tags. But for those it improves things substantially
    for random writes.

    - Periodic writeback fix from Jan. Marked for stable as well.

    - Fix for a race condition in IO scheduler switching from Jianpeng.

    - The hierarchical blk-cgroup support from Tejun. This is the bulk
    of the series.

    - blk-throttle fix from Vivek.

    Just a note that I'm in the middle of a relocation, whole family is
    flying out tomorrow. Hence I will be away for the remainder of this week,
    but back at work again on Monday the 15th. CC'ing Tejun, since any
    potential "surprises" will most likely be from the blk-cgroup work.
    But it's been brewing for a while and sitting in my tree and
    linux-next for a long time, so should be solid."

    * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
    elevator: Fix a race in elevator switching
    block: Reserve only one queue tag for sync IO if only 3 tags are available
    writeback: Fix periodic writeback after fs mount
    blk-throttle: implement proper hierarchy support
    blk-throttle: implement throtl_grp->has_rules[]
    blk-throttle: Account for child group's start time in parent while bio climbs up
    blk-throttle: add throtl_qnode for dispatch fairness
    blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
    blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
    blk-throttle: make blk_throtl_bio() ready for hierarchy
    blk-throttle: make blk_throtl_drain() ready for hierarchy
    blk-throttle: dispatch from throtl_pending_timer_fn()
    blk-throttle: implement dispatch looping
    blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
    blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
    blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
    blk-throttle: add throtl_service_queue->parent_sq
    blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
    blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
    blk-throttle: move bio_lists[] and friends to throtl_service_queue
    ...

    Linus Torvalds
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

1 commit


24 Jun, 2013

1 commit


19 Jun, 2013

1 commit

  • Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c a long
    time ago and the comments/Documentation never got updated.

    I figured it out when I was going through sched-domains.txt and so thought of
    fixing it globally.

    I haven't cross-checked whether the stuff that is referenced in sched/core.c by
    all these files is still present and hasn't changed, as that wasn't the motive
    behind this patch.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

28 May, 2013

1 commit


15 May, 2013

1 commit

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     

08 May, 2013

1 commit

  • This exports the amount of anonymous transparent hugepages for each
    memcg via the new "rss_huge" stat in memory.stat. The units are in
    bytes.

    This is helpful to determine the hugepage utilization for individual
    jobs on the system in comparison to rss and opportunities where
    MADV_HUGEPAGE may be helpful.

    The amount of anonymous transparent hugepages is also included in "rss"
    for backwards compatibility.
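
As a rough illustration of how the new field might be consumed, the sketch below computes the share of "rss" that is backed by transparent hugepages. It is written in Python under stated assumptions: the sample text and its numbers are hypothetical, the helper names are made up for this example, and the only facts taken from the description above are that "rss_huge" is reported in bytes and is already included in "rss".

```python
# Hypothetical excerpt of a memory.stat file ("name value" per line, bytes).
SAMPLE_MEMORY_STAT = """\
cache 1048576
rss 8388608
rss_huge 4194304
mapped_file 0
"""


def parse_memory_stat(text):
    """Parse 'name value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def huge_share_of_rss(stats):
    """Fraction of rss made up of anonymous transparent hugepages."""
    rss = stats.get("rss", 0)
    if rss == 0:
        return 0.0
    # rss_huge is a subset of rss, so the ratio lies between 0 and 1.
    return stats.get("rss_huge", 0) / rss


stats = parse_memory_stat(SAMPLE_MEMORY_STAT)
print(huge_share_of_rss(stats))
```

A low ratio for a job with a large rss is the kind of case where MADV_HUGEPAGE may be worth evaluating.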

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

02 May, 2013

1 commit

  • Pull networking updates from David Miller:
    "Highlights (1721 non-merge commits, this has to be a record of some
    sort):

    1) Add 'random' mode to team driver, from Jiri Pirko and Eric
    Dumazet.

    2) Make it so that any driver that supports configuration of multiple
    MAC addresses can provide the forwarding database add and del
    calls by providing a default implementation and hooking that up if
    the driver doesn't have an explicit set of handlers. From Vlad
    Yasevich.

    3) Support GSO segmentation over tunnels and other encapsulating
    devices such as VXLAN, from Pravin B Shelar.

    4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

    5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
    Dukkipati.

    6) In the PHY layer, allow supporting wake-on-lan in situations where
    the PHY registers have to be written for it to be configured.

    Use it to support wake-on-lan in mv643xx_eth.

    From Michael Stapelberg.

    7) Significantly improve firewire IPV6 support, from YOSHIFUJI
    Hideaki.

    8) Allow multiple packets to be sent in a single transmission using
    network coding in batman-adv, from Martin Hundebøll.

    9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

    10) Generalize the VXLAN forwarding tables so that there is more
    flexibility in configuring various aspects of the endpoints.
    From David Stevens.

    11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver,
    from Dmitry Kravkov.

    12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
    Neira Ayuso.

    13) Start adding networking selftests.

    14) In situations of overload on the same AF_PACKET fanout socket, or
    per-cpu packet receive queue, minimize drop by distributing the
    load to other cpus/fanouts. From Willem de Bruijn and Eric
    Dumazet.

    15) Add support for new payload offset BPF instruction, from Daniel
    Borkmann.

    16) Convert several drivers over to module_platform_driver(), from
    Sachin Kamat.

    17) Provide a minimal BPF JIT image disassembler userspace tool, from
    Daniel Borkmann.

    18) Rewrite F-RTO implementation in TCP to match the final
    specification of it in RFC4138 and RFC5682. From Yuchung Cheng.

    19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
    you like netlink, so I implemented netlink dumping of netlink
    sockets.") From Andrey Vagin.

    20) Remove ugly passing of rtnetlink attributes into rtnl_doit
    functions, from Thomas Graf.

    21) Allow userspace to be able to see if a configuration change occurs
    in the middle of an address or device list dump, from Nicolas
    Dichtel.

    22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
    Frederic Sowa.

    23) Increase accuracy of packet length used by packet scheduler, from
    Jason Wang.

    24) Beginning set of changes to make ipv4/ipv6 fragment handling more
    scalable and less susceptible to overload and locking contention,
    from Jesper Dangaard Brouer.

    25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
    instead. From Hong Zhiguo.

    26) Optimize route usage in IPVS by avoiding reference counting where
    possible, from Julian Anastasov.

    27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

    28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
    Eitzenberger.

    29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
    nfnetlink_log, and nfnetlink_queue. From Gao feng.

    30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

    31) Support several new r8169 chips, from Hayes Wang.

    32) Support tokenized interface identifiers in ipv6, from Daniel
    Borkmann.

    33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

    34) Add 802.1ad vlan offload support, from Patrick McHardy.

    35) Support mmap() based netlink communication, also from Patrick
    McHardy.

    36) Support HW timestamping in mlx4 driver, from Amir Vadai.

    37) Rationalize AF_PACKET packet timestamping when transmitting, from
    Willem de Bruijn and Daniel Borkmann.

    38) Bring parity to what's provided by /proc/net/packet socket dumping
    and the info provided by netlink socket dumping of AF_PACKET
    sockets. From Nicolas Dichtel.

    39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
    Poirier"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    filter: fix va_list build error
    af_unix: fix a fatal race with bit fields
    bnx2x: Prevent memory leak when cnic is absent
    bnx2x: correct reading of speed capabilities
    net: sctp: attribute printl with __printf for gcc fmt checks
    netlink: kconfig: move mmap i/o into netlink kconfig
    netpoll: convert mutex into a semaphore
    netlink: Fix skb ref counting.
    net_sched: act_ipt forward compat with xtables
    mlx4_en: fix a build error on 32bit arches
    Revert "bnx2x: allow nvram test to run when device is down"
    bridge: avoid OOPS if root port not found
    drivers: net: cpsw: fix kernel warn on cpsw irq enable
    sh_eth: use random MAC address if no valid one supplied
    3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
    tg3: fix to append hardware time stamping flags
    unix/stream: fix peeking with an offset larger than data in queue
    unix/dgram: fix peeking with an offset larger than data in queue
    unix/dgram: peek beyond 0-sized skbs
    openvswitch: Remove unneeded ovs_netdev_get_ifindex()
    ...

    Linus Torvalds