07 Apr, 2009

4 commits

  • Add /proc entries to give the admin the ability to control the minimum and
    maximum number of pdflush threads. This allows finer control of pdflush
    on both large and small machines.

    The rationale is simply that one size does not fit all. Admins on large and/or
    small systems may want to tune the min/max pdflush thread count to best
    suit their needs. Right now the min/max is hardcoded to 2/8. While
    probably a fair estimate for smaller machines, large machines with large
    numbers of CPUs and large numbers of filesystems/block devices may benefit
    from larger numbers of threads working on different block devices.

    Even if the background flushing algorithm is radically changed, it is
    still likely that multiple threads will be involved, and admins would still
    want finer control over the min/max without having to recompile the
    kernel.

    The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
    '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

    The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
    the current value of nr_pdflush_threads_max. This minimum is required
    since additional thread creation is performed in a pdflush thread itself.

    The minimum value for nr_pdflush_threads_max is the current value of
    nr_pdflush_threads_min and the maximum value can be 1000.

    Documentation/sysctl/vm.txt is also updated.
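
    As a sketch of how such knobs are typically wired up (illustrative, not
    the patch verbatim; the extern names are assumptions), each entry boils
    down to a proc_dointvec_minmax() sysctl whose bounds point at the other
    knob:

        static int one = 1;
        static int one_thousand = 1000;

        static struct ctl_table pdflush_table[] = {
                {
                        .procname     = "nr_pdflush_threads_min",
                        .data         = &nr_pdflush_threads_min,
                        .maxlen       = sizeof(int),
                        .mode         = 0644,
                        .proc_handler = &proc_dointvec_minmax,
                        .extra1       = &one,                    /* floor: 1 */
                        .extra2       = &nr_pdflush_threads_max, /* ceiling: the other knob */
                },
                {
                        .procname     = "nr_pdflush_threads_max",
                        .data         = &nr_pdflush_threads_max,
                        .maxlen       = sizeof(int),
                        .mode         = 0644,
                        .proc_handler = &proc_dointvec_minmax,
                        .extra1       = &nr_pdflush_threads_min, /* floor: the other knob */
                        .extra2       = &one_thousand,           /* ceiling: 1000 */
                },
                {}
        };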

    [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
    Signed-off-by: Peter W Morreale
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     
  • Fix a race on creating pdflush threads. Without the patch, it is possible
    to create more than MAX_PDFLUSH_THREADS threads, and this has been
    observed in practice on IO loaded SMP machines.

    The fix involves moving the lock around to protect the check against the
    thread count and correctly dealing with thread creation failure.

    This fix also _mostly_ repairs a race condition on how quickly the threads
    are created. The original intent was to create a pdflush thread (up to
    the max allowed) every second. Without this patch it is possible to
    create NCPUS pdflush threads concurrently. The 'mostly' caveat is because
    an assumption is made that thread creation will be successful. If we fail
    to create the thread, the miss is not considered fatal. (we will try
    again in 1 second)
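
    A hedged sketch of the corrected pattern (a fragment, not the patch
    verbatim): reserve the slot under the lock before spawning, and give it
    back if creation fails:

        spin_lock_irqsave(&pdflush_lock, flags);
        if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
                /* reserve the slot before dropping the lock, so a second
                 * CPU cannot also decide to spawn thread N+1 */
                nr_pdflush_threads++;
                spin_unlock_irqrestore(&pdflush_lock, flags);
                if (IS_ERR(kthread_run(pdflush, NULL, "pdflush"))) {
                        /* creation failed: release the slot; the next
                         * one-second tick will simply try again */
                        spin_lock_irqsave(&pdflush_lock, flags);
                        nr_pdflush_threads--;
                        spin_unlock_irqrestore(&pdflush_lock, flags);
                }
        } else {
                spin_unlock_irqrestore(&pdflush_lock, flags);
        }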

    Signed-off-by: Peter W Morreale
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     
  • This eliminates a compiler warning:

    mm/allocpercpu.c: In function 'free_percpu':
    mm/allocpercpu.c:146: warning: passing argument 2 of '__percpu_depopulate_mask' discards qualifiers from pointer target type

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • …git/tip/linux-2.6-tip

    * 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    kmemtrace: trace kfree() calls with NULL or zero-length objects
    kmemtrace: small cleanups
    kmemtrace: restore original tracing data binary format, improve ABI
    kmemtrace: kmemtrace_alloc() must fill type_id
    kmemtrace: use tracepoints
    kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints
    kmemtrace, rcu: fix rcupreempt.c data structure dependencies
    kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
    kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
    kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c
    kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c
    kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c
    kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c
    kmemtrace, squashfs: fix slab.h dependency problem in squasfs
    kmemtrace, befs: fix slab.h dependency problem
    kmemtrace, security: fix linux/key.h header file dependencies
    kmemtrace, fs: fix linux/fdtable.h header file dependencies
    kmemtrace, fs: uninline simple_transaction_set()
    kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h

    Linus Torvalds
     

06 Apr, 2009

4 commits

  • This makes sure that we never wait on async IO for sync requests, instead
    of doing the split on writes vs reads.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (714 commits)
    Staging: sxg: slicoss: Specify the license for Sahara SXG and Slicoss drivers
    Staging: serqt_usb: fix build due to proc tty changes
    Staging: serqt_usb: fix checkpatch errors
    Staging: serqt_usb: add TODO file
    Staging: serqt_usb: Lindent the code
    Staging: add USB serial Quatech driver
    staging: document that the wifi staging drivers a bit better
    Staging: echo cleanup
    Staging: BUG to BUG_ON changes
    Staging: remove some pointless conditionals before kfree_skb()
    Staging: line6: fix build error, select SND_RAWMIDI
    Staging: line6: fix checkpatch errors in variax.c
    Staging: line6: fix checkpatch errors in toneport.c
    Staging: line6: fix checkpatch errors in pcm.c
    Staging: line6: fix checkpatch errors in midibuf.c
    Staging: line6: fix checkpatch errors in midi.c
    Staging: line6: fix checkpatch errors in dumprequest.c
    Staging: line6: fix checkpatch errors in driver.c
    Staging: line6: fix checkpatch errors in audio.c
    Staging: line6: fix checkpatch errors in pod.c
    ...

    Linus Torvalds
     
  • * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
    tracing, net: fix net tree and tracing tree merge interaction
    tracing, powerpc: fix powerpc tree and tracing tree interaction
    ring-buffer: do not remove reader page from list on ring buffer free
    function-graph: allow unregistering twice
    trace: make argument 'mem' of trace_seq_putmem() const
    tracing: add missing 'extern' keywords to trace_output.h
    tracing: provide trace_seq_reserve()
    blktrace: print out BLK_TN_MESSAGE properly
    blktrace: extract duplidate code
    blktrace: fix memory leak when freeing struct blk_io_trace
    blktrace: fix blk_probes_ref chaos
    blktrace: make classic output more classic
    blktrace: fix off-by-one bug
    blktrace: fix the original blktrace
    blktrace: fix a race when creating blk_tree_root in debugfs
    blktrace: fix timestamp in binary output
    tracing, Text Edit Lock: cleanup
    tracing: filter fix for TRACE_EVENT_FORMAT events
    ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
    x86: kretprobe-booster interrupt emulation code fix
    ...

    Fix up trivial conflicts in
    arch/parisc/include/asm/ftrace.h
    include/linux/memory.h
    kernel/extable.c
    kernel/module.c

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
    cpumask: remove cpumask allocation from idle_balance, fix
    numa, cpumask: move numa_node_id default implementation to topology.h, fix
    cpumask: remove cpumask allocation from idle_balance
    x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
    x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
    x86: microcode: cleanup
    x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
    cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
    numa, cpumask: move numa_node_id default implementation to topology.h
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    cpumask: remove x86 cpumask_t uses.
    cpumask: use cpumask_var_t in uv_flush_tlb_others.
    cpumask: remove cpumask_t assignment from vector_allocation_domain()
    cpumask: make Xen use the new operators.
    cpumask: clean up summit's send_IPI functions
    cpumask: use new cpumask functions throughout x86
    x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
    cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    x86: unify 32 and 64-bit node_to_cpumask_map
    ...

    Linus Torvalds
     

04 Apr, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
    trivial: Update my email address
    trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
    trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
    trivial: Fix misspelling of "Celsius".
    trivial: remove unused variable 'path' in alloc_file()
    trivial: fix a pdlfush -> pdflush typo in comment
    trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
    trivial: wusb: Storage class should be before const qualifier
    trivial: drivers/char/bsr.c: Storage class should be before const qualifier
    trivial: h8300: Storage class should be before const qualifier
    trivial: fix where cgroup documentation is not correctly referred to
    trivial: Give the right path in Documentation example
    trivial: MTD: remove EOL from MODULE_DESCRIPTION
    trivial: Fix typo in bio_split()'s documentation
    trivial: PWM: fix of #endif comment
    trivial: fix typos/grammar errors in Kconfig texts
    trivial: Fix misspelling of firmware
    trivial: cgroups: documentation typo and spelling corrections
    trivial: Update contact info for Jochen Hein
    trivial: fix typo "resgister" -> "register"
    ...

    Linus Torvalds
     
    This patch adds Kconfig and Makefile entries and exports the VFS
    functions to be used by POHMELFS.

    Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Greg Kroah-Hartman

    Evgeniy Polyakov
     

03 Apr, 2009

22 commits

  • Add a function to install a monitor on the page lock waitqueue for a particular
    page, thus allowing the page being unlocked to be detected.

    This is used by CacheFiles to detect read completion on a page in the backing
    filesystem so that it can then copy the data to the waiting netfs page.
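
    The new helper amounts to little more than taking the page's hashed
    waitqueue lock and appending the caller's waiter; a sketch along these
    lines (assuming the wait_queue_t API of this era):

        void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
        {
                wait_queue_head_t *q = page_waitqueue(page);
                unsigned long flags;

                /* the waiter's wake function fires when the page is
                 * unlocked and wake_up_page() walks this queue */
                spin_lock_irqsave(&q->lock, flags);
                __add_wait_queue(q, waiter);
                spin_unlock_irqrestore(&q->lock, flags);
        }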

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() has been added to check for both PG_private
    and PG_private_2 at the same time.
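
    In other words, a sketch of the combined test (assuming PG_private_2
    aliases PG_fscache as described above):

        #define PAGE_FLAGS_PRIVATE \
                (1 << PG_private | 1 << PG_private_2)

        /* true if the page has fs-private data OR is pinning cache resources */
        static inline int page_has_private(struct page *page)
        {
                return !!(page->flags & PAGE_FLAGS_PRIVATE);
        }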

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • The attached patch causes read_cache_pages() to release page-private data on a
    page for which add_to_page_cache() fails. If the filler function fails, then
    the problematic page is left attached to the pagecache (with appropriate flags
    set, one presumes) and the remaining to-be-attached pages are invalidated and
    discarded. This permits pages with caching references associated with them to
    be cleaned up.

    The invalidatepage() address space op is called (indirectly) to do the honours.
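
    A hedged sketch of the cleanup path (the helper name is assumed for
    illustration): before dropping a page that could not be added to the
    pagecache, give the owner a chance to detach its private data:

        static void read_cache_pages_invalidate_page(struct address_space *mapping,
                                                     struct page *page)
        {
                if (page_has_private(page)) {
                        if (!trylock_page(page))
                                BUG();  /* nobody else can hold this page */
                        page->mapping = mapping;
                        do_invalidatepage(page, 0);  /* let the fs drop its refs */
                        page->mapping = NULL;
                        unlock_page(page);
                }
                page_cache_release(page);
        }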

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Impact: also output kfree(NULL) entries

    This patch moves the trace_kfree() calls before the ZERO_OR_NULL_PTR
    check so that we can trace call-sites that call kfree() with NULL many
    times, which might be an indication of a bug.
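
    The change is purely a matter of ordering; a sketch of kfree() after the
    move (body abbreviated, not the patch verbatim):

        void kfree(const void *x)
        {
                trace_kfree(_RET_IP_, x);  /* trace first: kfree(NULL) is now visible */

                if (unlikely(ZERO_OR_NULL_PTR(x)))
                        return;

                /* ... normal free path ... */
        }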

    Signed-off-by: Pekka Enberg
    Cc: Eduard - Gabriel Munteanu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pekka Enberg
     
  • kmemtrace now uses tracepoints instead of markers. We no longer need to
    use format specifiers to pass arguments.

    Signed-off-by: Eduard - Gabriel Munteanu
    [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ]
    [ folded: fix build when CONFIG_KMEMTRACE is disabled. ]
    [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ]
    Signed-off-by: Pekka Enberg
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eduard - Gabriel Munteanu
     
  • Impact: cleanup

    mm/failslab.c depends on slab.h without including it:

    CC mm/failslab.o
    mm/failslab.c: In function ‘should_failslab’:
    mm/failslab.c:16: error: ‘__GFP_NOFAIL’ undeclared (first use in this function)
    mm/failslab.c:16: error: (Each undeclared identifier is reported only once
    mm/failslab.c:16: error: for each function it appears in.)
    mm/failslab.c:19: error: ‘__GFP_WAIT’ undeclared (first use in this function)
    make[1]: *** [mm/failslab.o] Error 1
    make: *** [mm] Error 2

    It gets included implicitly currently - but this will not be the
    case with upcoming kmemtrace changes.

    Signed-off-by: Pekka Enberg
    Cc: Eduard - Gabriel Munteanu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pekka Enberg
     
    The current mem_cgroup_cache_charge() is a bit complicated, especially
    in the case of shmem's swap-in.

    This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
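
    A hedged sketch of the two-phase pattern the cleanup converges on
    (error handling elided):

        struct mem_cgroup *mem = NULL;
        int ret;

        /* phase 1: reserve the charge; may reclaim or fail */
        ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
        if (ret)
                return ret;

        /* ... insert the page into the page cache / page tables ... */

        /* phase 2: commit the reserved charge to the page */
        mem_cgroup_commit_charge_swapin(page, mem);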

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    It's been pointed out that swap_cgroup's message at swapon() is nonsense,
    because:

    * It can be calculated very easily if all necessary information is
    written in Kconfig.

    * It's not necessary to annoy people at every swapon().

    From another point of view, memory usage per swp_entry is now reduced to
    2 bytes from 8 bytes (64-bit), which I think is reasonably small.

    Reported-by: Hugh Dickins
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Try to use CSS IDs for the records in swap_cgroup. With this, on a
    64-bit machine, the size of a swap_cgroup record goes down to 2 bytes
    from 8 bytes.

    This means that when 2GB of swap is equipped (assuming a 4096-byte page
    size):

    before: size of swap_cgroup = 2G/4k * 8 = 4MB
    after:  size of swap_cgroup = 2G/4k * 2 = 1MB

    Reduction is large. Of course, there are trade-offs. This CSS ID will
    add overhead to swap-in/swap-out/swap-free.

    But in general:
    - swap is a resource which the user tends to avoid using.
    - If swap is never used, the swap_cgroup area is not used.
    - Going by traditional manuals, the size of swap should be proportional
    to the size of memory, and machine memory sizes keep increasing.

    I think reducing size of swap_cgroup makes sense.
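
    Concretely, the record shrinks to a bare CSS ID; a sketch (the pre-patch
    layout is described in the comment):

        /* before the patch, each record held a struct mem_cgroup pointer
         * (8 bytes on 64-bit); now only the css ID of the charged memcg
         * is kept, and the memcg is recovered via a css ID lookup */
        struct swap_cgroup {
                unsigned short id;      /* 2 bytes per swp_entry */
        };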

    Note:
    - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
    - memcg can be obsolete at rmdir() but not freed while refcnt from
    swap_cgroup is available.

    Changelog v4->v5:
    - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
    Changelog ->v4:
    - fixed not configured case.
    - deleted unnecessary comments.
    - fixed NULL pointer bug.
    - fixed message in dmesg.

    [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_test.txt says at 4.1:

    This swap-in is one of the most complicated work. In do_swap_page(),
    following events occur when pte is unchanged.

    (1) the page (SwapCache) is looked up.
    (2) lock_page()
    (3) try_charge_swapin()
    (4) reuse_swap_page() (may call delete_swap_cache())
    (5) commit_charge_swapin()
    (6) swap_free().

    Consider the following situations, for example.

    (A) The page has not been charged before (2) and reuse_swap_page()
    doesn't call delete_from_swap_cache().
    (B) The page has not been charged before (2) and reuse_swap_page()
    calls delete_from_swap_cache().
    (C) The page has been charged before (2) and reuse_swap_page() doesn't
    call delete_from_swap_cache().
    (D) The page has been charged before (2) and reuse_swap_page() calls
    delete_from_swap_cache().

    memory.usage/memsw.usage changes to this page/swp_entry will be:

    Event        Case (A)    (B)      (C)      (D)
    ===============================================
    Before (2)     0/ 1     0/ 1     1/ 1     1/ 1
    (3)           +1/+1    +1/+1    +1/+1    +1/+1
    (4)             -       0/ 0      -      -1/ 0
    (5)            0/-1     0/ 0    -1/-1     0/ 0
    (6)             -       0/-1      -       0/-1
    ===============================================
    Result         1/ 1     1/ 1     1/ 1     1/ 1

    In any case, charges to this page should be 1/ 1.

    In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
    (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
    charges to the memcg("foo") to which the "current" belongs.
    OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
    to which the page has been charged.

    So, if the "foo" and "baa" is different(for example because of task move),
    this charge will be moved from "baa" to "foo".

    I think this is an unexpected behavior.

    This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
    to return the memcg to which the swapcache has been charged if PCG_USED bit
    is set.
    IIUC, checking PCG_USED bit of swapcache is safe under page lock.
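
    A condensed, hedged sketch of the fixed lookup (locking and error paths
    trimmed; helper names follow the surrounding series and should be
    treated as approximate):

        /* if the swapcache page is already charged (PCG_USED), return that
         * memcg; only fall back to the swap_cgroup record otherwise */
        pc = lookup_page_cgroup(page);
        if (PageCgroupUsed(pc)) {
                mem = pc->mem_cgroup;
                if (mem && !css_tryget(&mem->css))
                        mem = NULL;
        } else {
                id = lookup_swap_cgroup(ent);
                rcu_read_lock();
                mem = mem_cgroup_lookup(id);
                if (mem && !css_tryget(&mem->css))
                        mem = NULL;
                rcu_read_unlock();
        }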

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it can
    be removed; KAMEZAWA-san suggested this.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch tries to fix OOM Killer problems caused by hierarchy.
    Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
    kill a task in memcg.

    But when hierarchy is used, it's broken and the correct task cannot
    be killed. For example, in the following cgroup tree:

    /groupA/ hierarchy=1, limit=1G
        01 nolimit
        02 nolimit

    All tasks' memory usage under /groupA, /groupA/01, and /groupA/02 is
    limited to groupA's 1GB, but the OOM Killer just kills tasks in groupA.

    This patch makes the bad process be selected from all tasks under the
    hierarchy (see the sketch below). BTW, currently, oom_jiffies is updated
    against groupA in the above case; oom_jiffies of the whole tree should
    be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.
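
    A hedged sketch of the hierarchy-aware membership test, following the
    css_is_ancestor() approach (details may differ from the patch):

        int task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *mem)
        {
                struct mem_cgroup *curr;
                int ret;

                task_lock(task);
                curr = try_get_mem_cgroup_from_mm(task->mm);
                task_unlock(task);
                if (!curr)
                        return 0;

                if (mem->use_hierarchy)
                        /* accept any task charged anywhere under mem's subtree */
                        ret = css_is_ancestor(&curr->css, &mem->css);
                else
                        ret = (curr == mem);

                css_put(&curr->css);
                return ret;
        }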

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    As pointed out, shrinking memcg's limit should return -EBUSY after
    reasonable retries. This patch tries to fix the current behavior of
    shrink_usage.

    Before looking into the "shrink should return -EBUSY" problem, we should
    fix the hierarchical reclaim code. It compares current usage and current
    limit, but that only makes sense when the kernel reclaims memory because
    it hit the limit. This is also a problem.

    What this patch does:

    1. Add a new argument "shrink" to hierarchical reclaim. If "shrink==true",
    hierarchical reclaim returns immediately and the caller checks whether the
    kernel should shrink more or not.
    (When shrinking memory, usage is always smaller than limit, so the
    usage < limit check is useless.)

    2. To adjust to the above change, two changes in "shrink"'s retry path
    (see the sketch below):
    2-a. retry_count depends on the number of children, because the kernel
    visits the children under the hierarchy one by one.
    2-b. rather than checking the return value of hierarchical_reclaim's
    progress, compare usage-before-shrink and usage-after-shrink.
    If usage-before-shrink <= usage-after-shrink, retry_count is decremented.
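
    A hedged sketch of the resulting retry loop (names approximate, error
    handling trimmed):

        while (retry_count) {
                if (signal_pending(current))
                        return -EINTR;
                if (!res_counter_set_limit(&memcg->res, val))
                        break;  /* usage now fits under the new limit */

                oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
                mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
                                                false, true /* shrink */);
                curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
                /* 2-b: judge progress by usage, not by reclaim's return value */
                if (curusage >= oldusage)
                        retry_count--;
                else
                        oldusage = curusage;
        }
        return retry_count ? 0 : -EBUSY;
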
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Clean up memory.stat file routine and show "total" hierarchical stat.

    This patch does the following:
    - renames get_all_zonestat to get_local_zonestat.
    - removes the old mem_cgroup_stat_desc, which was only for per-cpu stats.
    - adds mcs_stat to cover both per-cpu and per-lru stats.
    - adds a "total" stat for the hierarchy (*)
    - adds a callback system to scan all memcgs under a root.
    == "total" is added:
    [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
    cache 0
    rss 0
    pgpgin 0
    pgpgout 0
    inactive_anon 0
    active_anon 0
    inactive_file 0
    active_file 0
    unevictable 0
    hierarchical_memory_limit 50331648
    hierarchical_memsw_limit 9223372036854775807
    total_cache 65536
    total_rss 192512
    total_pgpgin 218
    total_pgpgout 155
    total_inactive_anon 0
    total_active_anon 135168
    total_inactive_file 61440
    total_active_file 4096
    total_unevictable 0
    ==
    (*) Maybe the user could calculate hierarchical stats with his own
    userland program, but if it can be presented in a clean way it's worth
    showing, I think.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Assign a CSS ID to each memcg and use css_get_next() for scanning the
    hierarchy.

    Assume the following tree:

    group_A (ID=3)
        /01 (ID=4)
            /0A (ID=7)
        /02 (ID=10)
    group_B (ID=5)

    When a task in group_A/01/0A hits the limit at group_A, reclaim will be
    done in the following order (round-robin):

    group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
    -> group_A -> .....

    Round-robin by ID: the last visited cgroup is recorded, and the scan
    restarts from it when reclaim starts again.
    (A smarter algorithm could be implemented...)

    No cgroup_mutex or hierarchy_mutex is required.
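
    A hedged sketch of the victim selection built on css_get_next()
    (reference counting and corner cases trimmed):

        static struct mem_cgroup *mem_cgroup_select_victim(struct mem_cgroup *root_mem)
        {
                struct cgroup_subsys_state *css;
                int nextid, found;

                /* resume the round-robin after the last visited child */
                nextid = root_mem->last_scanned_child + 1;
                css = css_get_next(&mem_cgroup_subsys, nextid,
                                   &root_mem->css, &found);
                if (!css) {
                        /* wrapped around: start over from the root next time */
                        root_mem->last_scanned_child = 0;
                        return root_mem;
                }
                root_mem->last_scanned_child = found;
                return container_of(css, struct mem_cgroup, css);
        }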

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    In the following situation, with the memory subsystem:

    /groupA use_hierarchy==1
        /01 some tasks
        /02 some tasks
        /03 some tasks
        /04 empty

    when tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 fails frequently with -EBUSY because of temporary
    refcounts taken by the kernel.

    In general, a cgroup can be rmdir'd if there are no child groups and no
    tasks. Frequent failures of rmdir() are not useful to users.
    (And the reason for -EBUSY is unknown to users... in most cases.)

    This patch tries to modify the above behavior by:
    - retrying if the css refcount is held by someone.
    - adding a "return value" to pre_destroy(), allowing a subsystem to
    say "we're really busy!"

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • It is a fairly common operation to have a pointer to a work and to need a
    pointer to the delayed work it is contained in. In particular, all
    delayed works which want to rearm themselves will have to do that. So it
    would seem fair to offer a helper function for this operation.
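
    The helper is essentially a container_of() wrapper along these lines:

        static inline struct delayed_work *to_delayed_work(struct work_struct *work)
        {
                return container_of(work, struct delayed_work, work);
        }

    so a work function handed a struct work_struct * can rearm itself with
    schedule_delayed_work(to_delayed_work(work), delay) instead of
    open-coding the container_of().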

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jean Delvare
    Acked-by: Ingo Molnar
    Cc: "David S. Miller"
    Cc: Herbert Xu
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
    The calculation of the value nr in do_xip_mapping_read() is incorrect.
    If the copy required more than one iteration of the do-while loop, the
    copied variable will be non-zero. The maximum length that may be passed
    to the call to copy_to_user(buf+copied, xip_mem+offset, nr) is
    len-copied, but the check only compares against (nr > len).
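
    The fix therefore clamps against the remaining buffer space rather than
    the whole buffer; schematically:

        /* before: ignores what has already been copied */
        if (nr > len)
                nr = len;

        /* after: never copy past buf + len */
        if (nr > len - copied)
                nr = len - copied;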

    This bug is the cause for the heap corruption Carsten has been chasing
    for so long:

    *** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x200000b9b44]
    /lib64/libc.so.6(cfree+0x8e)[0x200000bdade]
    /bin/bash(free_buffered_stream+0x32)[0x80050e4e]
    /bin/bash(close_buffered_stream+0x1c)[0x80050ea4]
    /bin/bash(unset_bash_input+0x2a)[0x8001c366]
    /bin/bash(make_child+0x1d4)[0x8004115c]
    /bin/bash[0x8002fc3c]
    /bin/bash(execute_command_internal+0x656)[0x8003048e]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(execute_command_internal+0x79a)[0x800305d2]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(reader_loop+0x270)[0x8001efe0]
    /bin/bash(main+0x1328)[0x8001e960]
    /lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8]
    /bin/bash(clearerr+0x5e)[0x8001c092]

    With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2
    "ext2/xip: refuse to change xip flag during remount with busy inodes" can
    be removed again.

    Cc: Carsten Otte
    Cc: Nick Piggin
    Cc: Jared Hulbert
    Cc:
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Even though vmstat_work is marked deferrable, there are still benefits to
    aligning it. For certain applications we want to keep OS jitter as low as
    possible and aligning timers and work so they occur together can reduce
    their overall impact.
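
    The mechanics are just a matter of rounding the timer expiry so the
    per-cpu instances fire together; a sketch (assuming the round_jiffies
    API):

        /* instead of a bare schedule_delayed_work(work, sysctl_stat_interval),
         * round the expiry so all CPUs' vmstat updates coincide */
        schedule_delayed_work(&__get_cpu_var(vmstat_work),
                              round_jiffies_relative(sysctl_stat_interval));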

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back for fork.c.

    (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This fixes a build failure with generic debug pagealloc:

    mm/debug-pagealloc.c: In function 'set_page_poison':
    mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'clear_page_poison':
    mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'page_poison':
    mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: At top level:
    mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
    include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
    mm/debug-pagealloc.c: In function 'kernel_map_pages':
    mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

    by fixing

    - debug_flags should be in struct page
    - define DEBUG_PAGEALLOC config option for all architectures

    Signed-off-by: Akinobu Mita
    Reported-by: Alexander Beregalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

02 Apr, 2009

1 commit


01 Apr, 2009

7 commits

  • Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
    swap loads benefit, and a catastrophic interaction between SLUB and some
    flash storage is avoided.

    shmem_writepage() has always been peculiar in making no attempt to write:
    it has just transferred a shmem page from file cache to swap cache, then
    let that page make its way around the LRU again before being written and
    freed.

    The idea was that people use tmpfs because they want those pages to stay
    in RAM; so although we give it an overflow to swap, we should resist
    writing too soon, giving those pages a second chance before they can be
    reclaimed.

    That was always questionable, and I've toyed with this patch for years;
    but never had a clear justification to depart from the original design.

    It became more questionable in 2.6.28, when the split LRU patches classed
    shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
    itself gives them more resistance to reclaim than normal file pages. I
    prepared this patch for 2.6.29, but the merge window arrived before I'd
    completed gathering statistics to justify sending it in.

    Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
    habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
    tests five times slower than SLAB or SLQB - other machines slower too, but
    nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

    slub_max_order=0 brings sanity to all, but heavy swapping is too far from
    normal to justify such a tuning. The crucial factor on that laptop turns
    out to be that I'm using an SD card for swap. What happens is this:

    By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
    fs inodes), so creating tmpfs files under memory pressure brings lumpy
    reclaim into play. One subpage of the order is chosen from the bottom of
    the LRU as usual, then the other three picked out from their random
    positions on the LRUs.

    In a tmpfs load, many of these pages will be ones which already passed
    through shmem_writepage, so already have swap allocated. And though their
    offsets on swap were probably allocated sequentially, now that the pages
    are picked off at random, their swap offsets are scattered.

    But the flash storage on the SD card is very sensitive to having its
    writes merged: once swap is written at scattered offsets, performance
    falls apart. Rotating disk seeks increase too, but less disastrously.

    So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
    out to swap as soon as their swap has been allocated.

    It's surely possible to devise an artificial load which runs faster the
    old way, one whose sizing is such that the tmpfs pages on their second
    pass are the ones that are wanted again, and other pages not.

    But I've not yet found such a load: on all machines, under the loads I've
    tried, immediate swap_writepage speeds up shmem swapping: especially when
    using the SLUB allocator (and more effectively than slub_max_order=0), but
    also with the others; and it also reduces the variance between runs. How
    much faster varies widely: a factor of five is rare, 5% is common.

    One load which might have suffered: imagine a swapping shmem load in a
    limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
    swapcache was not charged, and such a load would have run quickest with
    the shmem swapcache never written to swap. But now swapcache is charged,
    so even this load benefits from shmem_writepage directly to swap.

    Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
    it's silly because that will never get called; but refactoring shmem.c
    sensibly according to CONFIG_SWAP will be a separate task.

    Signed-off-by: Hugh Dickins
    Acked-by: Pekka Enberg
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • try_to_free_pages() is used for the direct reclaim of up to
    SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
    alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
    be used, but this is not passed to try_to_free_pages(). This can lead to
    unnecessary reclaim of pages that are unusable by the caller, and in the
    worst case to allocation failure because progress has not been made where
    it is needed.

    This patch passes the nodemask used for alloc_pages_nodemask() to
    try_to_free_pages().
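
    The visible change is just the extra parameter threaded through from the
    allocator; a sketch of the signatures:

        /* before */
        unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                        gfp_t gfp_mask);

        /* after: reclaim sees the allocator's nodemask */
        unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                        gfp_t gfp_mask, nodemask_t *nodemask);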

    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When a shrinker has a negative number of objects to delete, the symbol
    name of the shrinker should be printed, not shrink_slab. This also makes
    the error message slightly more informative.
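
    A hedged sketch of the message (using printk's %pF to resolve the
    shrinker callback's symbol name; the exact wording is assumed):

        printk(KERN_ERR "shrink_slab: %pF negative objects to delete "
               "nr=%ld\n", shrinker->shrink, nr);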

    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical
    reason it shouldn't be available, and it can be used for ramfs.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The mlock() facility does not exist for NOMMU since all mappings are
    effectively locked anyway, so we don't make the bits available when
    they're not useful.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
    x86 has debug_kmap_atomic_prot(), which is an error-checking function
    for kmap_atomic. It is useful for the other architectures, although it
    needs CONFIG_TRACE_IRQFLAGS_SUPPORT.

    This patch exposes it to the other architectures.

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and can also provide more information, e.g. virtual_address, to
    the driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
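
    A sketch of the prototype change in struct vm_operations_struct:

        /* before */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);

        /* after: the fault descriptor carries the page plus extra context
         * (e.g. virtual_address), and VM_FAULT_* flags come back */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);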

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin