27 May, 2011

40 commits

  • The kernel automatically evaluates partition tables of storage devices.
    The code for evaluating GUID partitions (in fs/partitions/efi.c) contains
    a bug that causes a kernel oops on certain corrupted GUID partition
    tables.

    This bug has security implications: it allows, for example, preparing
    a storage device that crashes a kernel subsystem as soon as the
    device is connected (a "USB Stick of (Partial) Death").

    crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));

    computes a CRC32 checksum over gpt, covering (*gpt)->header_size
    bytes. The value of (*gpt)->header_size is not validated before the
    efi_crc32() call.

    A corrupted partition table may carry an arbitrarily large value in
    (*gpt)->header_size. In that case the CRC32 computation reads memory
    beyond what was allocated for gpt, an out-of-bounds heap access that
    can oops the kernel.

    Validate the value of the GUID partition table header size before
    using it.
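
    A minimal sketch of the added validation, assuming the gpt_header
    layout from fs/partitions/efi.h and that bdev is the block device
    being scanned; the exact bounds in the upstream fix may differ:

    /* Reject implausible header sizes before computing the CRC: a valid
     * header is at least sizeof(gpt_header) bytes and fits within one
     * logical block of the device. */
    if (le32_to_cpu((*gpt)->header_size) > bdev_logical_block_size(bdev) ||
        le32_to_cpu((*gpt)->header_size) < sizeof(gpt_header))
            goto fail;

    crc = efi_crc32((const unsigned char *) (*gpt),
                    le32_to_cpu((*gpt)->header_size));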

    [akpm@linux-foundation.org: fix layout and indenting]
    Signed-off-by: Timo Warns
    Cc: Matt Domsch
    Cc: Eugene Teo
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timo Warns
     
  • Let the memory allocator zero the allocated memory, and remove the
    now-redundant memset().
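
    A sketch of the pattern, using the standard kzalloc() API (the
    actual call site is not quoted in this entry):

    /* Before: allocate, then zero by hand. */
    ptr = kmalloc(size, GFP_KERNEL);
    if (ptr)
            memset(ptr, 0, size);

    /* After: the allocator zeroes the memory for us. */
    ptr = kzalloc(size, GFP_KERNEL);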

    Signed-off-by: Rakib Mullick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rakib Mullick
     
  • The ->read_proc interface is going away, convert to seq_file.
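
    A sketch of the usual conversion; the foo_* names are placeholders
    rather than the driver's actual identifiers:

    static int foo_proc_show(struct seq_file *m, void *v)
    {
            /* emit the same text the old read_proc hook produced */
            seq_printf(m, "%d\n", foo_value);
            return 0;
    }

    static int foo_proc_open(struct inode *inode, struct file *file)
    {
            return single_open(file, foo_proc_show, NULL);
    }

    static const struct file_operations foo_proc_fops = {
            .owner   = THIS_MODULE,
            .open    = foo_proc_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = single_release,
    };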

    Signed-off-by: Alexey Dobriyan
    Cc: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The balloon driver in a Xen guest frees guest pages and marks them as
    mmio. When the kernel crashes and the crash kernel attempts to read
    the oldmem via /proc/vmcore, a read from ballooned pages generates
    100% load in dom0 because Xen asks qemu-dm for the page content.
    Since the reads come in as 8-byte requests, each ballooned page is
    tried 512 times.

    With this change a hook can be registered which checks whether a
    given pfn is really RAM. The hook has to return a value > 0 for RAM
    pages, a value < 0 on error (because the hypercall is not known),
    and 0 for non-RAM pages.

    This reduces the time needed to read /proc/vmcore: without this
    change a 512M guest with a 128M crashkernel region needs 200 seconds
    to read it; with it, just 2 seconds.
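
    A sketch of how a guest could use the hook, assuming the
    registration function added here is named
    register_oldmem_pfn_is_ram(), and with query_hv_for_pfn() as a
    hypothetical stand-in for the real hypercall:

    /* > 0: ram page, 0: non-ram page, < 0: hypercall not known */
    static int xen_oldmem_pfn_is_ram(unsigned long pfn)
    {
            return query_hv_for_pfn(pfn);   /* hypothetical helper */
    }

    static int __init setup_vmcore_hook(void)
    {
            return register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram);
    }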

    Signed-off-by: Olaf Hering
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
     
  • Currently, pagemap_read() has three error/corner-case handling
    mistakes:

    (1) If the ppos parameter is wrong, the mm refcount leaks.
    (2) If the count parameter is 0, the mm refcount leaks too.
    (3) If the current task sleeps in kmalloc() while the system is out
    of memory and the OOM killer kills the task the mm belongs to, the
    held mm refcount prevents that task from freeing its memory, and the
    system may hang. (A sketch of the corrected ordering follows.)
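
    A sketch of the corrected ordering, with constants as in the era's
    fs/proc/task_mmu.c; treat the details as assumptions, not the
    literal patch:

    /* Validate and allocate before pinning the mm, so no error path can
     * leak the reference and kmalloc() never sleeps while holding it. */
    if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
            return -EINVAL;                         /* (1) bad ppos */
    if (!count)
            return 0;                               /* (2) nothing to do */
    buf = kmalloc(PAGEMAP_WALK_SIZE, GFP_TEMPORARY);
    if (!buf)
            return -ENOMEM;                         /* (3) may sleep here */
    mm = get_task_mm(task);                         /* only now take the ref */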

    Cc: Hugh Dickins
    Cc: Jovi Zhang
    Acked-by: Hugh Dickins
    Cc: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It would be better to put check_mem_permission() after
    __get_free_page() in mem_write(), to match mem_read().

    Hugh Dickins explained the reason:

    check_mem_permission gets a reference to the mm. If we __get_free_page
    after check_mem_permission, imagine what happens if the system is out
    of memory, and the mm we're looking at is selected for killing by the
    OOM killer: while we wait in __get_free_page for more memory, no memory
    is freed from the selected mm because it cannot reach exit_mmap while
    we hold that reference.
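
    A sketch of the reordered mem_write() prologue (the 2.6.39-era
    check_mem_permission() returns the mm; treat the details as
    assumptions):

    page = (char *)__get_free_page(GFP_TEMPORARY); /* may sleep; no mm held */
    if (!page)
            return -ENOMEM;

    mm = check_mem_permission(task);   /* only now take the mm reference */
    if (IS_ERR(mm)) {
            free_page((unsigned long)page);
            return PTR_ERR(mm);
    }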

    Reported-by: Jovi Zhang
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Reviewed-by: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • There is a macro for the max size kmalloc can allocate, so use it instead
    of a hardcoded number.
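
    The pattern, sketched (KMALLOC_MAX_SIZE comes from linux/slab.h; the
    magic number shown is illustrative, not the one in the patch):

    /* Before: hardcoded limit. */
    if (count > 4 * 1024 * 1024)
            count = 4 * 1024 * 1024;

    /* After: the allocator's own limit. */
    if (count > KMALLOC_MAX_SIZE)
            count = KMALLOC_MAX_SIZE;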

    Signed-off-by: Yuanhan Liu
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuanhan Liu
     
  • No need for this local array to be writable, so mark it const.

    Signed-off-by: Mike Frysinger
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Convert fs/proc/ from strict_strto*() to kstrto*() functions.
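
    A sketch of the conversion pattern: kstrtoint() and friends check
    that the parsed value fits the result type, avoiding the long-only
    strict_strto*() API plus a narrowing cast.

    int val, err;

    /* buffer holds the user-supplied string, as in the old code */
    err = kstrtoint(buffer, 10, &val);  /* was: strict_strtol() + a cast */
    if (err)
            return err;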

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Now that exe_file is not proc-FS dependent, we can use it to name
    the core file. Add a %E pattern for core file name creation which
    extracts the path from mm_struct->exe_file, converts slashes to
    exclamation marks, and pastes the result into the core file name.

    This is useful in environments where binary names are longer than 16
    characters (the current->comm limit), where binaries with the same
    name live in different paths, or where the binary itself changes its
    current->comm after exec.

    So by doing:

    $ sysctl kernel.core_pattern='core.%p.%e.%E'
    $ ln /bin/cat cat45678901234567890
    $ ./cat45678901234567890
    ^Z
    $ rm cat45678901234567890
    $ fg
    ^\Quit (core dumped)
    $ ls core*

    we now get:

    core.2434.cat456789012345.!root!cat45678901234567890 (deleted)

    Signed-off-by: Jiri Slaby
    Cc: Al Viro
    Cc: Alan Cox
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Setup and cleanup of mm_struct->exe_file is currently done in
    fs/proc/. That was because exe_file was needed only for
    /proc/<pid>/exe. Since we will need the exe_file functionality for
    core dumps as well (so the core name can contain the full binary
    path), build this functionality into the kernel unconditionally.

    To achieve that, move the code out of proc FS into kernel/, where it
    in fact belongs. This lets us make dup_mm_exe_file() static and drop
    the linux/proc_fs.h inclusion from fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The Blackfin arch, like the x86 arch, needs to adjust the PC manually
    after a breakpoint is hit as normally this is handled by the remote gdb.
    However, rather than starting another arch ifdef mess, create a common
    GDB_ADJUSTS_BREAK_OFFSET define for any arch to opt-in via their kgdb.h.

    Signed-off-by: Mike Frysinger
    Cc: Oleg Nesterov
    Cc: Jason Wessel
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Acked-by: Paul Mundt
    Acked-by: Dongdong Deng
    Cc: Sergei Shtylyov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Signed-off-by: Mike Frysinger
    Cc: Oleg Nesterov
    Cc: Jason Wessel
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Paul Mundt
    Cc: Sergei Shtylyov
    Cc: Dongdong Deng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Signed-off-by: Mike Frysinger
    Cc: Oleg Nesterov
    Cc: Jason Wessel
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Paul Mundt
    Cc: Sergei Shtylyov
    Cc: Dongdong Deng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Signed-off-by: Mike Frysinger
    Cc: Oleg Nesterov
    Cc: Jason Wessel
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Paul Mundt
    Cc: Sergei Shtylyov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • This is a series of low level ptrace unification steps to make it easier
    for common code (like KGDB) to poke at register state. This also avoids
    having to duplicate higher level operations for most ports which don't
    have special needs for accessing things.

    This patch:

    This implements a bunch of helper funcs for poking the registers of a
    ptrace structure. Now common code should be able to portably update
    specific registers (like kgdb updating the PC).
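
    A sketch of what this enables; instruction_pointer() predates the
    series, instruction_pointer_set() is in the style the series adds
    via asm-generic/ptrace.h, and BREAK_INSTR_SIZE is kgdb's arch
    constant, shown for flavor:

    /* Common code can now fix up the PC without arch ifdefs. */
    unsigned long pc = instruction_pointer(regs);
    instruction_pointer_set(regs, pc + BREAK_INSTR_SIZE);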

    Signed-off-by: Mike Frysinger
    Cc: Oleg Nesterov
    Cc: Jason Wessel
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Paul Mundt
    Cc: Sergei Shtylyov
    Cc: Dongdong Deng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Two new stats in per-memcg memory.stat track the number of page
    faults and the number of major page faults:

    "pgfault"
    "pgmajfault"

    They are different from "pgpgin"/"pgpgout" stat which count number of
    pages charged/discharged to the cgroup and have no meaning of reading/
    writing page to disk.

    It is valuable to track the two stats for both measuring application's
    performance as well as the efficiency of the kernel page reclaim path.
    Counting pagefaults per process is useful, but we also need the aggregated
    value since processes are monitored and controlled in cgroup basis in
    memcg.

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads,
    faulting in 15G of anon pages in a 16G container. No regression was
    noticed in the "flt/cpu/s" metric.

    Sample output from pft:

    TAG pft:anon-sys-default:
      Gb  Thr  CLine  User   System   Wall    flt/cpu/s  fault/wsec
      15  16   1      0.67s  233.41s  14.76s  16798.546  266356.260

        N  Min        Max        Median     Avg        Stddev
    x  10  16682.962  17344.027  16913.524  16928.812  166.5362
    +  10  16695.568  16923.896  16820.604  16824.652  84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The new API exports numa_maps on a per-memcg basis, exposing the
    per-memcg page distribution across real NUMA nodes.

    One use case is evaluating application performance by combining this
    information with the cpu allocation of the application.

    The output of memory.numa_stat follows a format similar to
    numa_maps:

    total= N0= N1= ...
    file= N0= N1= ...
    anon= N0= N1= ...
    unevictable= N0= N1= ...

    And we have per-node:

    total = file + anon + unevictable

    $ cat /dev/cgroup/memory/memory.numa_stat
    total=250020 N0=87620 N1=52367 N2=45298 N3=64735
    file=225232 N0=83402 N1=46160 N2=40522 N3=55148
    anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
    unevictable=3735 N0=794 N1=0 N2=0 N3=2941

    Signed-off-by: Ying Han
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The caller of this function has been renamed to zone_nr_lru_pages();
    this just fixes up the memcg code to match. The old name was easily
    misread as the zone's total number of pages.

    Signed-off-by: Ying Han
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • If the memcg reclaim code detects the target memcg below its limit it
    exits and returns a guaranteed non-zero value so that the charge is
    retried.

    Nowadays, the charge side checks the memcg limit itself and does not rely
    on this non-zero return value trick.

    This patch removes it. The reclaim code will now always return the true
    number of pages it reclaimed on its own.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • During memory reclaim we determine the number of pages to be scanned
    per zone as

    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan is skipped for this round and
    the priority is raised. This has several problems:

    1. The priority rises by 1 without any scan having happened. To scan
    at DEF_PRIORITY, a zone needs more than 512MB of pages (with 4KB
    pages, DEF_PRIORITY is 12 and SWAP_CLUSTER_MAX is 32, so
    (pages >> 12) >= 32 requires 131072 pages = 512MB). If
    pages >> priority < SWAP_CLUSTER_MAX, the residue is recorded and
    the scan batched later, but one priority level is lost. If the
    memory size is below 16MB, pages >> priority is 0 and no scan ever
    happens at DEF_PRIORITY.

    2. If zone->all_unreclaimable == true, the zone is scanned only when
    priority == 0, so x86's ZONE_DMA will never be recovered until the
    user of its pages frees memory by itself.

    3. With memcg, the memory limit can be small. A small memcg drops
    below priority DEF_PRIORITY-2 very easily and then has to call
    wait_iff_congested(); scanning before priority 9 would require 64MB
    of memory.

    This patch therefore forces a scan of SWAP_CLUSTER_MAX pages when

    1. the target is small enough, and
    2. the reclaimer is kswapd or memcg reclaim.

    This avoids rapid priority drops and may allow recovering
    all_unreclaimable in small zones. The patch also removes
    nr_saved_scan: scanning can now proceed at the current priority even
    when pages >> priority is very small. (A sketch of the resulting
    logic follows.)
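
    A sketch of the resulting get_scan_count() logic, with names
    following the 2.6.39-era vmscan code; the actual commit's details
    may differ:

    unsigned long scan;
    bool force_scan = false;

    /* kswapd and memcg reclaim may scan even tiny targets */
    if (current_is_kswapd() || !scanning_global_lru(sc))
            force_scan = true;

    scan = zone_nr_lru_pages(zone, sc, lru) >> priority;
    if (!scan && force_scan)
            scan = SWAP_CLUSTER_MAX;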

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, a memory cgroup's direct reclaim frees memory from the
    current node only. This causes trouble: a set of threads working
    cooperatively tends to operate on the same node, so when they hit
    their memcg limit they reclaim memory from themselves, damaging
    their own active working set.

    For example, assume a 2-node system with Node 0 and Node 1 and a
    memcg with a 1G limit. After some work, file cache remains with the
    usages

    Node 0: 1M
    Node 1: 998M

    and an application running on Node 0 will then eat into its own
    working set before the unnecessary file cache on Node 1 is freed.

    This patch adds round-robin node selection for NUMA, applying equal
    pressure to each node. Combined with cpuset's memory-spread feature,
    this works very well.

    But yes, a better algorithm is needed. (A sketch of the node
    selection follows.)
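
    A sketch of round-robin victim-node selection; the field and helper
    names here are assumptions based on this description:

    /* Advance to the next node that has memory, wrapping at the end. */
    int nid = next_node(memcg->last_scanned_node, node_states[N_HIGH_MEMORY]);
    if (nid == MAX_NUMNODES)
            nid = first_node(node_states[N_HIGH_MEMORY]);
    memcg->last_scanned_node = nid;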

    [akpm@linux-foundation.org: comment editing]
    [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
    Signed-off-by: Ying Han
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • AFAICS mm/page_cgroup.c belongs to the memcg subsystem, but
    MAINTAINERS directed it only to the generic cgroup maintainers. Fix
    it.

    Signed-off-by: Namhyung Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Move the page-freeing code outside of swap_cgroup_mutex, in the hope
    of reducing some of the theoretical contention between swapons
    and/or swapoffs.

    This is just a cleanup; no functional changes.

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • The swap cgroup map setup allocated one more page than necessary if
    @max_pages was a multiple of SC_PER_PAGE.
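
    The off-by-one, sketched (whether the fix used DIV_ROUND_UP() or an
    explicit remainder check is an assumption):

    /* Before: an extra page whenever max_pages % SC_PER_PAGE == 0. */
    length = max_pages / SC_PER_PAGE + 1;

    /* After: round up only when there is a remainder. */
    length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);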

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Commit ca371c0d7e23 ("memcg: fix page_cgroup fatal error in FLATMEM")
    removed the call to alloc_bootmem() from this function, so it can
    now be marked __meminit, reducing memory usage when
    MEMORY_HOTPLUG=n.

    Also, since the new helper alloc_page_cgroup() is called only from
    this function, it should be marked the same way.

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Michal Hocko
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • next_mz is assigned NULL if __mem_cgroup_largest_soft_limit_node()
    selects the same mz. This doesn't make much sense, as we assign to
    the variable right at the start of the next loop iteration anyway.

    The compiler will probably optimize this out, but it makes the code
    a little confusing to read.

    Signed-off-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We recently added a change to global background reclaim that counts
    the return value of soft_limit reclaim. This patch adds similar
    logic to global direct reclaim.

    We should skip scanning the global LRU in shrink_zone() if
    soft_limit reclaim does enough work. This is the first step: count
    the nr_scanned and nr_reclaimed from soft_limit reclaim into the
    global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The global kswapd scans per-zone LRU and reclaims pages regardless of the
    cgroup. It breaks memory isolation since one cgroup can end up reclaiming
    pages from another cgroup. Instead we should rely on memcg-aware target
    reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
    memory pressure.

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored. This patch is the first
    step to skip shrink_zone() if soft_limit reclaim does enough work.

    This is part of the effort to reduce reclaiming pages from the
    global LRU under memcg. The per-memcg background reclaim patchset
    further enhances per-cgroup targeted reclaim; I should have V4 of it
    posted shortly.

    Try running multiple memory-intensive workloads within separate
    memcgs and watch the soft_steal counters in memory.stat:

    $ cat /dev/cgroup/A/memory.stat | grep 'soft'
    soft_steal 240000
    soft_scan 240000
    total_soft_steal 240000
    total_soft_scan 240000

    This patch:

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored.

    We would like to skip shrink_zone() if soft_limit reclaim does
    enough work. We also need to balance memory pressure across
    per-memcg zones, like the logic in the VM core. This patch is the
    first step: count the nr_scanned and nr_reclaimed from soft_limit
    reclaim into the global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • enums are problematic because they cannot be forward-declared:

    akpm2:/home/akpm> cat t.c

    enum foo;

    static inline void bar(enum foo f)
    {
    }
    akpm2:/home/akpm> gcc -c t.c
    t.c:4: error: parameter 1 ('f') has incomplete type

    So move the enum's definition into a standalone header file which can be used
    wherever its definition is needed.

    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The ns_cgroup is an annoying cgroup at the namespace / cgroup
    frontier and leads to several problems:

    * cgroup creation is out of control
    * cgroup names can conflict when pids wrap around
    * a single process cannot handle a large number of namespaces
    without incurring exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by the compatibility flag
    'clone_children': a newly created cgroup copies the parent cgroup's
    values, and userspace has to manually create the cgroup and add a
    task to its 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup:
    notify ns_cgroup deprecated") (merged into 2.6.27) made the kernel
    emit a printk warning users that the feature is planned for removal.
    Since that time we have heard from XXX users who were affected by
    this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Convert cgroup_attach_proc to use flex_array.

    The cgroup_attach_proc implementation requires a pre-allocated array to
    store task pointers to atomically move a thread-group, but asking for a
    monolithic array with kmalloc() may be unreliable for very large groups.
    Using flex_array provides the same functionality with less risk of
    failure.

    This is a post-patch for cgroup-procs-write.patch.
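
    A sketch of the flex_array pattern (API per lib/flex_array.c of the
    era; the cgroup-specific bookkeeping is simplified away):

    struct flex_array *group;
    struct task_struct *tsk = current;
    int retval;

    /* One pointer per thread, backed by single pages rather than one
     * large, possibly high-order, kmalloc() buffer. */
    group = flex_array_alloc(sizeof(tsk), group_size, GFP_KERNEL);
    if (!group)
            return -ENOMEM;
    /* Pre-allocate the backing pages so later puts cannot fail. */
    retval = flex_array_prealloc(group, 0, group_size, GFP_KERNEL);
    if (retval) {
            flex_array_free(group);
            return retval;
    }

    flex_array_put(group, 0, &tsk, GFP_KERNEL);             /* store */
    tsk = *(struct task_struct **)flex_array_get(group, 0); /* fetch */
    flex_array_free(group);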

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Make procs file writable to move all threads by tgid at once.

    Add functionality that enables users to move all threads in a threadgroup
    at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
    current implementation makes use of a per-threadgroup rwsem that's taken
    for reading in the fork() path to prevent newly forking threads within the
    threadgroup from "escaping" while the move is in progress.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as replaced by
    this. All subsystems are modified for the new interface - of note is
    cpuset, which requires from/to nodemasks for attach to be globally scoped
    (though per-cpuset would work too) to persist from its pre_attach to
    attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
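
    A sketch of a subsystem wiring up the new hooks (field names per
    this change; the my_* handlers are placeholders):

    struct cgroup_subsys my_subsys = {
            .name            = "my_subsys",
            .can_attach      = my_can_attach,       /* once per group */
            .can_attach_task = my_can_attach_task,  /* once per thread */
            .pre_attach      = my_pre_attach,
            .attach_task     = my_attach_task,      /* once per thread */
            .attach          = my_attach,           /* once per group */
    };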

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

    Add an rwsem that lives in a threadgroup's signal_struct that's taken for
    reading in the fork path, under CONFIG_CGROUPS. If another part of the
    kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
    ifdefs should be changed to a higher-up flag that CGROUPS and the other
    system would both depend on.
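
    A sketch of the locking scheme (the helper names follow the series'
    style and should be treated as assumptions):

    /* fork path, adding a thread to the group: */
    threadgroup_fork_read_lock(current);
    /* ... copy_process() work that must not race with a group move ... */
    threadgroup_fork_read_unlock(current);

    /* writer side, moving the whole group between cgroups: */
    threadgroup_fork_write_lock(leader);
    /* ... move every thread, with no new threads escaping ... */
    threadgroup_fork_write_unlock(leader);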

    This is a pre-patch for cgroup-procs-write.patch.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • When configfs_register_subsystem() fails, we unregister too many
    subsystems in configfs_example_init. Decrement i by one so that we
    do not unregister a subsystem that was never registered.
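
    The corrected unwind pattern, sketched (example_subsys stands in for
    the sample's array of subsystems):

    for (i = 0; i < ARRAY_SIZE(example_subsys); i++) {
            ret = configfs_register_subsystem(example_subsys[i]);
            if (ret)
                    goto out_unregister;
    }
    return 0;

    out_unregister:
    while (--i >= 0)        /* step back past the entry that failed */
            configfs_unregister_subsystem(example_subsys[i]);
    return ret;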

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jiri Slaby
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • I find it very handy to show the average delays in milliseconds.

    Example output (on 100 concurrent dd reading sparse files):

    CPU      count  real total  virtual total  delay total  delay average
               986  3223509952     3207643301  38863410579       39.415ms
    IO       count  delay total  delay average
                 0            0            0ms
    SWAP     count  delay total  delay average
                 0            0            0ms
    RECLAIM  count  delay total  delay average
              1059   5131834899            4ms
    dd: read=0, write=0, cancelled_write=0

    Signed-off-by: Wu Fengguang
    Cc: Mel Gorman
    Cc: Balbir Singh
    Reviewed-by: Satoru Moriya
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Fixes

    Documentation/accounting/getdelays.c: In function `get_family_id':
    Documentation/accounting/getdelays.c:172:14: warning: variable `rc' set but not used [-Wunused-but-set-variable]

    Reported-by: "Justin P. Mattock"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fixes

    Documentation/accounting/getdelays.c: In function `main':
    Documentation/accounting/getdelays.c:436:7: warning: variable `i' set but not used [-Wunused-but-set-variable]

    Signed-off-by: Justin P. Mattock
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Justin P. Mattock
     
  • As declaring a counter volatile is discouraged, it is best not to do
    so in sample code either.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan