08 Oct, 2009

3 commits

  • When a vmalloc'd area is mmap'd into userspace, some kind of
    co-ordination is necessary for this to work on platforms with cpu
    D-caches which can have aliases.

    Otherwise kernel side writes won't be seen properly in userspace
    and vice versa.

    If the kernel side mapping and the user side one have the same
    alignment, modulo SHMLBA, this can work as long as VM_SHARED is
    set on the VMA, and for all current users this is true. VM_SHARED
    will force SHMLBA alignment of the user side mmap on platforms
    where D-cache aliasing matters.

    The bulk of this patch is just making it so that a specific
    alignment can be passed down into __get_vm_area_node(). All
    existing callers pass in '1' which preserves existing behavior.
    vmalloc_user() gives SHMLBA for the alignment.

    As a side effect this should get the video media drivers and other
    vmalloc_user() users into more working shape on such systems.

    Signed-off-by: David S. Miller
    Acked-by: Peter Zijlstra
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    David Miller
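
    A standalone sketch of why matching SHMLBA alignment resolves the
    aliasing (illustrative only; the SHMLBA value and the addresses below
    are invented, not taken from the patch):

        #include <assert.h>
        #include <stdint.h>

        #define SHMLBA 0x4000  /* example aliasing span; arch-specific */

        /* On a virtually-indexed D-cache, two virtual mappings of one
         * physical page share cache lines only if their addresses are
         * congruent modulo SHMLBA ("same cache color"). */
        static int same_color(uintptr_t a, uintptr_t b)
        {
            return (a % SHMLBA) == (b % SHMLBA);
        }

        int main(void)
        {
            uintptr_t kern = 0xc0008000; /* SHMLBA-aligned vmalloc_user() area */
            uintptr_t user = 0x40004000; /* SHMLBA-aligned VM_SHARED mmap */
            assert(same_color(kern, user)); /* writes visible both ways */
            return 0;
        }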
     
  • fix the following 'make includecheck' warning:

    mm/vmalloc.c: linux/highmem.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default the run mode to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
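
    The commit's lowmem arithmetic as a checkable sketch (the 32-byte
    rmap_item size is an assumption for illustration):

        #include <stdio.h>

        int main(void)
        {
            unsigned long long ksm_bytes = 4ULL << 30; /* 4GB of KSM pages */
            unsigned long long page_size = 4096;
            unsigned long long item_size = 32;         /* assumed rmap_item */
            unsigned long long pages  = ksm_bytes / page_size;
            unsigned long long lowmem = pages * item_size;
            /* ~1M pages * 32 bytes = ~32MB of pinned lowmem */
            printf("%lluMB lowmem pinned\n", lowmem >> 20);
            return 0;
        }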
     

05 Oct, 2009

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits)
    Revert "Seperate read and write statistics of in_flight requests"
    cfq-iosched: don't delay async queue if it hasn't dispatched at all
    block: Topology ioctls
    cfq-iosched: use assigned slice sync value, not default
    cfq-iosched: rename 'desktop' sysfs entry to 'low_latency'
    cfq-iosched: implement slower async initiate and queue ramp up
    cfq-iosched: delay async IO dispatch, if sync IO was just done
    cfq-iosched: add a knob for desktop interactiveness
    Add a tracepoint for block request remapping
    block: allow large discard requests
    block: use normal I/O path for discard requests
    swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL
    fs/bio.c: move EXPORT* macros to line after function
    Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
    cciss: fix build when !PROC_FS
    block: Do not clamp max_hw_sectors for stacking devices
    block: Set max_sectors correctly for stacking devices
    cciss: cciss_host_attr_groups should be const
    cciss: Dynamically allocate the drive_info_struct for each logical drive.
    cciss: Add usage_count attribute to each logical drive in /sys
    ...

    Linus Torvalds
     

02 Oct, 2009

5 commits

  • In the charge/uncharge/reclaim paths, usage_in_excess is calculated
    repeatedly, and each calculation takes res_counter's spin_lock.

    This patch removes unnecessary calls to res_counter_soft_limit_excess().

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
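
    A sketch of the optimization's shape (names and locking are stand-ins,
    not the actual memcg code):

        #include <pthread.h>

        struct res_counter {
            pthread_mutex_t lock;     /* stands in for the spinlock */
            unsigned long usage, soft_limit;
        };

        /* One locked read instead of one per check site. */
        static unsigned long soft_limit_excess(struct res_counter *c)
        {
            unsigned long excess;
            pthread_mutex_lock(&c->lock);
            excess = c->usage > c->soft_limit ? c->usage - c->soft_limit : 0;
            pthread_mutex_unlock(&c->lock);
            return excess;
        }

        void reclaim_path(struct res_counter *c)
        {
            unsigned long excess = soft_limit_excess(c); /* compute once */
            (void)excess; /* ...reuse the cached value throughout... */
        }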
     
  • This patch cleans up and fixes memcg's uncharge soft limit path.

    Current behaviour:
    res_counter_charge()/uncharge() handles softlimit information at
    charge/uncharge, and the softlimit check is done when the per-memcg
    event counter goes over the limit. But the per-memcg event counter is
    updated only when memory usage is over the soft limit, and considering
    hierarchical memcg management, ancestors should be taken care of too.

    Currently, ancestors (the hierarchy) are handled in charge() but not
    in uncharge(). This is not good.

    Problems:
    1. memcg's event counter is incremented only when the softlimit hits.
    That's bad: it makes the event counter hard to reuse for other purposes.

    2. At uncharge, only the lowest-level res_counter is handled. This is
    a bug: because an ancestor's event counter is not incremented, the
    children have to take care of them.

    3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
    Operations under res_counter->lock should be small, and avoiding the
    extra "if" is better.

    Fixes:
    * Removed the soft_limit_xx pointer and checks in charge and uncharge.
    The check-only-when-necessary scheme works well enough without them.

    * Make memcg's event counter incremented at every charge/uncharge
    (the per-cpu area will be accessed soon anyway).

    * All ancestors are checked at the soft-limit check. This is necessary
    because an ancestor's event counter may never be modified on its own;
    they should all be checked at the same time.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
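
    A minimal sketch of the fix's shape (structure names and the event
    threshold are invented for illustration):

        struct memcg {
            struct memcg *parent;
            unsigned long events, usage, soft_limit;
        };

        static void check_soft_limit(struct memcg *m)
        {
            /* check self *and* every ancestor: an ancestor's own event
             * counter may never fire by itself */
            for (; m; m = m->parent) {
                if (m->usage > m->soft_limit) {
                    /* queue m for soft-limit reclaim */
                }
            }
        }

        void commit_charge(struct memcg *m)
        {
            if (++m->events % 1024 == 0)  /* bumped on every (un)charge */
                check_soft_limit(m);
        }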
     
  • __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone
    "mz" with an incremented mz->mem->css refcnt. The caller of this
    function then has to call css_put(mz->mem->css).

    But mz can be non-NULL even in the "not found" case, i.e. without a
    css_get(). Because of this, css->refcnt can go negative.

    This can cause various problems; one result is an infinite loop in
    css_tryget(), as below.

    INFO: RCU detected CPU 0 stall (t=10000 jiffies)
    sending NMI to all CPUs:
    NMI backtrace for cpu 0
    CPU 0:

    <> [] trace_hardirqs_off+0xd/0x10
    [] flat_send_IPI_mask+0x90/0xb0
    [] flat_send_IPI_all+0x69/0x70
    [] arch_trigger_all_cpu_backtrace+0x62/0xa0
    [] __rcu_pending+0x7e/0x370
    [] rcu_check_callbacks+0x47/0x130
    [] update_process_times+0x46/0x70
    [] tick_sched_timer+0x60/0x160
    [] ? tick_sched_timer+0x0/0x160
    [] __run_hrtimer+0xba/0x150
    [] hrtimer_interrupt+0xd5/0x1b0
    [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [] smp_apic_timer_interrupt+0x6d/0x9b
    [] apic_timer_interrupt+0x13/0x20
    [] ? mem_cgroup_walk_tree+0x156/0x180
    [] ? mem_cgroup_walk_tree+0x73/0x180
    [] ? mem_cgroup_walk_tree+0x32/0x180
    [] ? mem_cgroup_get_local_stat+0x0/0x110
    [] ? mem_control_stat_show+0x14b/0x330
    [] ? cgroup_seqfile_show+0x3d/0x60

    The trace above shows CPU0 caught in css_tryget()'s infinite loop
    because of the bad refcnt.

    The fix is to set mz = NULL at the top of the retry path.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
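
    The bug class in miniature, as a standalone sketch with invented types
    (not the memcg code): a conditionally assigned pointer in a retry loop
    must be re-initialized on every pass.

        #include <stddef.h>

        struct node { int refcnt; int dying; };

        struct node *pick_victim(struct node **tree, int n)
        {
            struct node *mz;
            int i;
        retry:
            mz = NULL;  /* the fix: reset before every re-scan */
            for (i = 0; i < n; i++) {
                if (tree[i] && !tree[i]->dying) {
                    mz = tree[i];
                    mz->refcnt++;         /* css_get() analogue */
                    break;
                }
            }
            if (mz && mz->dying) {        /* lost a race: undo and retry */
                mz->refcnt--;             /* css_put() analogue */
                goto retry;
            }
            /* Without the reset, a re-scan that finds nothing returns the
             * stale mz whose reference was already dropped, and the
             * caller's css_put() drives the refcount negative. */
            return mz;
        }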
     
  • page_address_in_vma() is not used only in unuse_vma().

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • While testing the Swap over NFS patchset, I noticed an oops triggered
    during swapon. Investigating further, the NULL pointer dereference is
    due to the SSD device check/optimization in the swapon code, which
    assumes s_bdev can never be NULL.

    inode->i_sb->s_bdev can be NULL in a few cases; for example, a
    loopback NFS mount, and there could be others as well. Fix this by
    ensuring s_bdev is not NULL before we try to dereference it.

    Signed-off-by: Suresh Jayaraman
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
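
    The shape of the fix, with illustrative stand-ins for the kernel types
    (the real code checks s_bdev before running the SSD test):

        #include <stdbool.h>
        #include <stddef.h>

        struct request_queue { bool nonrot; };
        struct block_device  { struct request_queue *queue; };
        struct super_block   { struct block_device *s_bdev; };

        /* before: returned sb->s_bdev->queue->nonrot and oopsed when
         * s_bdev was NULL (e.g. an NFS superblock) */
        static bool swap_on_ssd(struct super_block *sb)
        {
            return sb->s_bdev && sb->s_bdev->queue->nonrot;
        }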
     

29 Sep, 2009

5 commits

  • Warn and dump the stack when a percpu allocation fails. The percpu
    allocator is still young, and unchecked NULL percpu pointer usage can
    result in random memory corruption when combined with the pointer
    shifting in the access macros. Allocation failures should be rare, and
    the warning message is disabled after it has been printed a certain
    number of times.

    Signed-off-by: Tejun Heo

    Tejun Heo
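
    A sketch of the failure-path warning (names and the warning cap are
    assumptions):

        #include <stdio.h>
        #include <stddef.h>

        #define PCPU_ALLOC_WARNS 10   /* assumed cap on warnings */

        void *pcpu_alloc_sketch(size_t size)
        {
            static int warns = PCPU_ALLOC_WARNS;
            void *p = NULL;           /* pretend the allocation failed */
            if (p == NULL && warns > 0) {
                warns--;
                fprintf(stderr,
                        "percpu: allocation of %zu bytes failed\n", size);
                /* the kernel version also dumps the stack here */
            }
            return p;
        }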
     
  • The parameters to pcpu_setup_first_chunk() come from different sources
    depending on the architecture and can be quite complex. The function
    runs various sanity checks on the parameters and triggers BUG() if
    something isn't right. However, this happens very early during boot,
    and not reporting exactly what the problem is makes debugging even
    harder.

    Add a PCPU_SETUP_BUG() macro which prints out enough information about
    the parameters. As the macro still issues a separate BUG() for each
    check, no information is lost even in situations where only the
    program counter can be retrieved.

    While at it, also bump pcpu_dump_alloc_info() message to KERN_INFO so
    that it's visible on the console if boot fails to complete.

    Signed-off-by: Tejun Heo

    Tejun Heo
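
    A sketch of the macro's shape (the fields and checks here are
    invented; only the print-then-BUG() pattern matches the patch):

        #include <stdio.h>
        #include <stdlib.h>
        #include <stddef.h>

        struct alloc_info { size_t unit_size; int nr_units; };

        /* Prints the failed condition and the parameters before dying,
         * so the failure is diagnosable from the console alone. */
        #define PCPU_SETUP_BUG_ON(cond, ai) do {                          \
            if (cond) {                                                   \
                fprintf(stderr, "PERCPU: failed check \"%s\"\n", #cond);  \
                fprintf(stderr, "PERCPU: unit_size=%zu nr_units=%d\n",    \
                        (ai)->unit_size, (ai)->nr_units);                 \
                abort();              /* BUG() stand-in */                \
            }                                                             \
        } while (0)

        void setup_first_chunk(struct alloc_info *ai)
        {
            PCPU_SETUP_BUG_ON(ai->nr_units <= 0, ai);
            PCPU_SETUP_BUG_ON(ai->unit_size == 0, ai);
        }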
     
  • The embedding first chunk allocator maintains the distances between
    units in the vmalloc area and thus needs the vmalloc space to be
    larger than the maximum distance between units; otherwise, it wouldn't
    be able to create any dynamic chunks. This patch makes the embedding
    first chunk allocator check the vmalloc space size; if the maximum
    distance between units is larger than 75% of it, it prints a warning
    and, if the page mapping allocator is available, fails initialization
    so that the system falls back onto it.

    This should work around percpu allocation failure problems on certain
    sparc64 configurations where distances between NUMA nodes are larger
    than the vmalloc area, and makes the percpu allocator more robust for
    future configurations.

    Signed-off-by: Tejun Heo

    Tejun Heo
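
    The new sanity check in miniature (the vmalloc size and function name
    are assumptions):

        #include <stdio.h>

        #define VMALLOC_SIZE (128UL << 20)  /* example: 128MB vmalloc area */

        int embed_distance_check(unsigned long max_distance)
        {
            if (max_distance > VMALLOC_SIZE * 3 / 4) {
                fprintf(stderr,
                        "PERCPU: max unit distance %lu > 75%% of vmalloc space\n",
                        max_distance);
                return -1;  /* caller falls back to the page allocator */
            }
            return 0;
        }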
     
  • pcpu_build_alloc_info() may be called multiple times while percpu is
    falling back to a different first chunk allocator. Make it clear its
    static buffers so that they don't contain values from previous runs.

    Signed-off-by: Tejun Heo

    Tejun Heo
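
    The pattern of the fix as a standalone sketch (buffer names assumed):

        #include <string.h>

        #define NR_GROUPS 64

        const int *build_alloc_info_sketch(void)
        {
            static int group_map[NR_GROUPS];
            static int group_cnt[NR_GROUPS];
            /* reset first: an earlier, failed first-chunk attempt may
             * have left its results behind in these static buffers */
            memset(group_map, 0, sizeof(group_map));
            memset(group_cnt, 0, sizeof(group_cnt));
            /* ...populate the maps for this attempt... */
            return group_map;
        }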
     
  • pcpu_setup_first_chunk() incorrectly used NR_CPUS as the impossible
    unit number, but with a sparse unit map a unit number can equal or
    exceed NR_CPUS. This triggers BUG_ON() spuriously on machines with a
    non-power-of-two number of cpus. Use UINT_MAX instead.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Tony Vroon

    Tejun Heo
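
    The sentinel problem in miniature (names invented):

        #include <limits.h>

        #define NR_CPUS 6   /* e.g. a non-power-of-two cpu count */

        static unsigned int unit_of_cpu[NR_CPUS];

        void init_unit_map(void)
        {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                unit_of_cpu[cpu] = UINT_MAX; /* was NR_CPUS, which a real
                                              * sparse unit number can
                                              * legitimately equal */
        }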
     

27 Sep, 2009

1 commit

  • This build failure triggers:

    In file included from include/linux/suspend.h:8,
    from arch/x86/kernel/asm-offsets_32.c:11,
    from arch/x86/kernel/asm-offsets.c:2:
    include/linux/mm.h:503:2: error: #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS

    This is because, with the addition of the hwpoison page flag, we ran
    out of page flags on 32-bit.

    Don't turn on hwpoison on 32-bit NUMA (it's rare in any case).

    Also clean up the Kconfig dependencies in the generic MM
    code by introducing ARCH_SUPPORTS_MEMORY_FAILURE.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     

26 Sep, 2009

5 commits

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().

    Signed-off-by: Jens Axboe

    Jens Axboe
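
    A sketch of the described change (types are stand-ins): pass an
    optional super_block down and skip inodes that belong elsewhere.

        #include <stddef.h>

        struct super_block;
        struct inode { struct super_block *i_sb; struct inode *i_next; };

        /* NULL only_sb means "write back every superblock on the bdi". */
        void writeback_inodes_sketch(struct inode *list,
                                     struct super_block *only_sb)
        {
            for (struct inode *in = list; in; in = in->i_next) {
                if (only_sb && in->i_sb != only_sb)
                    continue;  /* skip inodes from other superblocks */
                /* ...write back this inode's dirty pages... */
            }
        }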
     
  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid to incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Treat bdi_start_writeback(0) as a special request to do background write,
    and stop such work when we are below the background dirty threshold.

    Also simplify the (nr_pages <= 0) checks.
    CC: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
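
    A sketch of the special case (names assumed): nr_pages == 0 marks a
    background request, which runs only while above the threshold.

        #include <stdbool.h>

        /* stub: dirty pages still above the background threshold? */
        static bool over_bground_thresh(void) { return false; }

        long wb_writeback_sketch(long nr_pages)
        {
            bool for_background = (nr_pages == 0); /* the special request */
            long written = 0;

            for (;;) {
                if (for_background && !over_bground_thresh())
                    break;           /* stop once below the threshold */
                if (!for_background && written >= nr_pages)
                    break;           /* ordinary bounded request */
                written++;           /* ...write one chunk... */
            }
            return written;
        }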
     
  • Some filesystems may choose to write much more than ratelimit_pages
    before calling balance_dirty_pages_ratelimited_nr(), so it is safer to
    determine the number of pages to write based on the real number of
    dirtied pages.

    Otherwise it is possible that

    loop {
        btrfs_file_write(): dirty 1024 pages
        balance_dirty_pages(): write up to 48 pages (= ratelimit_pages * 1.5)
    }

    in which the writeback rate cannot keep up with the dirty rate, and
    the dirty pages go all the way beyond dirty_thresh.

    The increased write_chunk may make the dirtier more bumpy, so
    filesystems should take care not to dirty too much at a time
    (e.g. > 4MB) without checking the ratelimit.

    Signed-off-by: Wu Fengguang
    Acked-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Wu Fengguang
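
    The arithmetic in miniature (ratelimit_pages assumed to be 32,
    matching the 48 = ratelimit_pages * 1.5 example above):

        #include <stdio.h>

        #define RATELIMIT_PAGES 32   /* assumed value for illustration */

        long write_chunk(long nr_dirtied)
        {
            long chunk = RATELIMIT_PAGES + RATELIMIT_PAGES / 2; /* old: 48 */
            if (nr_dirtied > chunk)
                chunk = nr_dirtied;  /* new: keep up with bulk dirtiers */
            return chunk;
        }

        int main(void)
        {
            /* btrfs-style burst: 1024 dirtied -> write back 1024, not 48 */
            printf("%ld\n", write_chunk(1024));
            return 0;
        }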
     

25 Sep, 2009

3 commits

  • Ignore the address parameter given to NOMMU mmap() as it is a hint, rather
    than giving an error if it's non-zero. MAP_FIXED still gets an error.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
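
    The shape of the change as a sketch (function name and flag value are
    illustrative):

        #include <errno.h>

        #define MAP_FIXED 0x10   /* illustrative flag value */

        int validate_addr_sketch(unsigned long addr, unsigned long flags)
        {
            if (flags & MAP_FIXED)
                return -EINVAL;  /* still rejected: NOMMU can't place maps */
            (void)addr;          /* previously: non-zero addr was an error;
                                  * now it's ignored as the hint it is */
            return 0;
        }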
     
  • Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
    might be mapped directly. Use the BDI_CAP_MAP_DIRECT capability flag to govern
    whether or not we should be trying to map a file directly. This can be used to
    determine whether or not a region has been filled in at the point where we call
    do_mmap_shared() or do_mmap_private().

    The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
    there's any reason we can't use it. It's also cleared in do_mmap_pgoff() if
    f_op->get_unmapped_area() fails.

    Without this fix, attempting to run a program from a RomFS image on a
    non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
    this can be caught in gdb:

    Program received signal SIGABRT, Aborted.
    0xc005dce8 in add_nommu_region (region=) at mm/nommu.c:547
    (gdb) bt
    #0 0xc005dce8 in add_nommu_region (region=) at mm/nommu.c:547
    #1 0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
    #2 0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
    #3 0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=) at fs/binfmt_elf_fdpic.c:343
    #4 0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
    #5 0xc006c648 in do_execve (filename=, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
    #6 0xc0008cf0 in sys_execve (name=, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
    #7 0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897

    Note that this fix does the following commit differently:

    commit a190887b58c32d19c2eee007c5eb8faa970a69ba
    Author: David Howells
    Date: Sat Sep 5 11:17:07 2009 -0700
    nommu: fix error handling in do_mmap_pgoff()

    Reported-by: Graff Yang
    Signed-off-by: David Howells
    Acked-by: Pekka Enberg
    Cc: Paul Mundt
    Cc: Mel Gorman
    Cc: Greg Ungerer
    Signed-off-by: Linus Torvalds

    David Howells
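
    A sketch of the capability-flag flow (bit value and field names are
    assumptions):

        #include <stdbool.h>

        #define BDI_CAP_MAP_DIRECT 0x1   /* illustrative bit value */

        struct mmap_req {
            unsigned caps;               /* starts as the bdi's capabilities */
            bool driver_can_map;         /* did get_unmapped_area() succeed? */
        };

        bool region_mapped_directly(struct mmap_req *req)
        {
            if (!req->driver_can_map)
                req->caps &= ~BDI_CAP_MAP_DIRECT; /* rule direct mapping out */
            /* ...validate_mmap_request()-style checks clear it too... */
            return req->caps & BDI_CAP_MAP_DIRECT;
        }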
     
  • It needs walk_page_range().

    Reported-by: Michal Simek
    Tested-by: Michal Simek
    Cc: Stefani Seibold
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Sep, 2009

13 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It
    would be useful to show swap usage in the memory.stat file, because
    then users don't need to calculate memsw.usage - res.usage to know
    swap usage.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Reduce the resource counter overhead (mostly spinlock) associated with the
    root cgroup. This is a part of the several patches to reduce mem cgroup
    overhead. I had posted other approaches earlier (including using percpu
    counters). Those patches will be a natural addition and will be added
    iteratively on top of these.

    The patch stops resource counter accounting for the root cgroup. The
    data for display is derived from the statistics we maintain via
    mem_cgroup_charge_statistics() (which is more scalable). What happens
    today is that we do double accounting: once using res_counter_charge()
    and once using mem_cgroup_charge_statistics(). For the root, since we
    don't implement limits any more, we don't need to track every charge
    via res_counter_charge() and check for the limit being exceeded and
    reclaim.

    The main mem->res usage_in_bytes can be derived by summing the cache and
    rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
    additional data about swapped out memory. This patch adds a
    MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
    recursively when hierarchy is enabled.

    The test results I see on a 24-way system show that

    1. The lock contention disappears from /proc/lock_stats
    2. The results of the test are comparable to running with
    cgroup_disable=memory.

    Here is a sample of my program runs

    Without Patch

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7192804.124144 task-clock-msecs # 23.937 CPUs
    424691 context-switches # 0.000 M/sec
    267 CPU-migrations # 0.000 M/sec
    28498113 page-faults # 0.004 M/sec
    5826093739340 cycles # 809.989 M/sec
    408883496292 instructions # 0.070 IPC
    7057079452 cache-references # 0.981 M/sec
    3036086243 cache-misses # 0.422 M/sec

    300.485365680 seconds time elapsed

    With cgroup_disable=memory

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7182183.546587 task-clock-msecs # 23.915 CPUs
    425458 context-switches # 0.000 M/sec
    203 CPU-migrations # 0.000 M/sec
    92545093 page-faults # 0.013 M/sec
    6034363609986 cycles # 840.185 M/sec
    437204346785 instructions # 0.072 IPC
    6636073192 cache-references # 0.924 M/sec
    2358117732 cache-misses # 0.328 M/sec

    300.320905827 seconds time elapsed

    With this patch applied

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7191619.223977 task-clock-msecs # 23.955 CPUs
    422579 context-switches # 0.000 M/sec
    88 CPU-migrations # 0.000 M/sec
    91946060 page-faults # 0.013 M/sec
    5957054385619 cycles # 828.333 M/sec
    1058117350365 instructions # 0.178 IPC
    9161776218 cache-references # 1.274 M/sec
    1920494280 cache-misses # 0.267 M/sec

    300.218764862 seconds time elapsed

    Data from Prarit (kernel compile with make -j64 on a 64
    CPU/32G machine)

    For a single run

    Without patch

    real 27m8.988s
    user 87m24.916s
    sys 382m6.037s

    With patch

    real 4m18.607s
    user 84m58.943s
    sys 50m52.682s

    With config turned off

    real 4m54.972s
    user 90m13.456s
    sys 50m19.711s

    NOTE: The data looks counterintuitive due to the increased performance
    with the patch, even over the config being turned off. We probably need
    more runs, but so far all testing has shown that the patches definitely
    help.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Balbir Singh
    Cc: Prarit Bhargava
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
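
    The bypass in miniature (a standalone sketch with assumed fields; the
    real code uses res_counter and per-cpu statistics):

        #include <stdbool.h>

        struct memcg {
            bool is_root;
            unsigned long res_usage;             /* lock-protected counter */
            unsigned long stat_rss, stat_cache;  /* cheap statistics */
        };

        void charge_sketch(struct memcg *m, unsigned long pages, bool cache)
        {
            if (!m->is_root)
                m->res_usage += pages;  /* root has no limits: skip the
                                         * contended counter entirely */
            if (cache)
                m->stat_cache += pages; /* stats are kept for everyone */
            else
                m->stat_rss += pages;
        }

        unsigned long usage_sketch(struct memcg *m)
        {
            /* root usage is derived from RSS + cache instead */
            return m->is_root ? m->stat_rss + m->stat_cache : m->res_usage;
        }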
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages and reclaims pages from it and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targeted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the size by
    which the soft limit is exceeded. The proportion has been empirically
    determined.

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
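
    A sketch of the "reclaim proportional to excess" loop bound (the
    constants and helper names are assumptions):

        /* stubs for illustration */
        static unsigned long excess_pages(void) { return 1024; }
        static unsigned long shrink_one_pass(void) { return 64; }

        unsigned long soft_limit_reclaim_sketch(void)
        {
            unsigned long freed = 0;
            for (int loop = 0; ; loop++) {
                freed += shrink_one_pass();
                if (loop < 2)
                    continue;               /* old rule: just two passes */
                if (freed >= excess_pages() / 2)
                    break;                  /* proportional to the excess */
                if (loop > 10)
                    break;                  /* long-loop guard (no swap) */
            }
            return freed;
        }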
     
  • Refactor mem_cgroup_hierarchical_reclaim()

    Refactor the arguments passed to mem_cgroup_hierarchical_reclaim()
    into flags, so that new parameters don't have to be passed as we make
    the reclaim routine more flexible.

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
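
    The refactor's shape (flag names modeled on the description; the
    exact set is illustrative):

        enum reclaim_flags {
            MEM_CGROUP_RECLAIM_NOSWAP = 1 << 0,
            MEM_CGROUP_RECLAIM_SHRINK = 1 << 1,
            MEM_CGROUP_RECLAIM_SOFT   = 1 << 2,  /* room for new options */
        };

        /* before: (..., bool noswap, bool shrink) -- each new option
         * changed the signature at every call site */
        long hierarchical_reclaim_sketch(unsigned flags)
        {
            if (flags & MEM_CGROUP_RECLAIM_SOFT) {
                /* targeted soft-limit reclaim behaviour */
            }
            return 0;
        }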
     
  • Organize cgroups over soft limit in a RB-Tree

    Introduce an RB-Tree for storing memory cgroups that are over their soft
    limit. The overall goal is to

    1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
    We are careful about updates: updates take place only after a
    particular time interval has passed.
    2. We remove the node from the RB-Tree when the usage goes below the
    soft limit.

    The next set of patches will exploit the RB-Tree to get the group that is
    over its soft limit by the largest amount and reclaim from it, when we
    face memory contention.

    [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
    Signed-off-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
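
    The ordering rule in miniature (field names assumed):

        struct mz_node {
            unsigned long usage_in_excess;  /* usage - soft_limit: sort key */
            unsigned long last_update;      /* throttles re-insertion */
        };

        /* Worst offender ends up rightmost; reclaim picks it first. */
        int mz_cmp(const struct mz_node *a, const struct mz_node *b)
        {
            if (a->usage_in_excess < b->usage_in_excess)
                return -1;
            return a->usage_in_excess > b->usage_in_excess;
        }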
     
  • Add an interface to allow get/set of soft limits. Soft limits for memory
    plus swap controller (memsw) is currently not supported. Resource
    counters have been enhanced to support soft limits and new type
    RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
    directly set and do not need any reclaim or checks before setting them to
    a newer value.

    Kamezawa-San raised a question as to whether soft limit should belong to
    res_counter. Since all resources understand the basic concepts of hard
    and soft limits, it is justified to add soft limits here. Soft limits are
    a generic resource usage feature, even file system quotas support soft
    limits.

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add comments explaining the reason for the smp_wmb() in
    mem_cgroup_commit_charge().

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Change the memory cgroup to remove the overhead associated with accounting
    all pages in the root cgroup. As a side-effect, we can no longer set a
    memory hard limit in the root cgroup.

    A new flag to track whether the page has been accounted or not has
    been added as well. Flags are now set atomically for page_cgroup;
    pcg_default_flags is now obsolete and has been removed.

    [akpm@linux-foundation.org: fix a few documentation glitches]
    Signed-off-by: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the
    subsystem about the old cgroup of the threadgroup leader. No subsystem
    currently needs that information for each thread that's being moved,
    but if one were to be added (for example, one that counts tasks within
    a group), this would need to be reworked to tell the subsystem the
    right information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Now that ksm is in mainline, it is better to change the default values
    to better fit most users.

    This patch changes the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is that it disables ksm by default
    and sets the number of kernel pages that can be allocated to a
    reasonable number.

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus