27 May, 2011

18 commits

  • …x/kernel/git/jeremy/xen

    * 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
    xen: fix compile without CONFIG_XEN_DEBUG_FS
    Use arbitrary_virt_to_machine() to deal with ioremapped pud updates.
    Use arbitrary_virt_to_machine() to deal with ioremapped pmd updates.
    xen/mmu: remove all ad-hoc stats stuff
    xen: use normal virt_to_machine for ptes
    xen: make a pile of mmu pvop functions static
    vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
    xen: condense everything onto xen_set_pte
    xen: use mmu_update for xen_set_pte_at()
    xen: drop all the special iomap pte paths.

    Linus Torvalds
     
  • Two new stats in per-memcg memory.stat which track the number of page
    faults and the number of major page faults.

    "pgfault"
    "pgmajfault"

    They are different from the "pgpgin"/"pgpgout" stats, which count the
    number of pages charged/discharged to the cgroup and carry no meaning of
    reading/writing pages to disk.

    It is valuable to track these two stats both for measuring application
    performance and for gauging the efficiency of the kernel page reclaim
    path. Counting page faults per process is useful, but we also need the
    aggregated value, since processes are monitored and controlled on a
    per-cgroup basis in memcg.

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads, faulting
    in 15G of anon pages in a 16G container. There is no regression noticed
    in "flt/cpu/s".

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb Thr CLine User System Wall flt/cpu/s fault/wsec
    15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260

    +-------------------------------------------------------------------------+
    N Min Max Median Avg Stddev
    x 10 16682.962 17344.027 16913.524 16928.812 166.5362
    + 10 16695.568 16923.896 16820.604 16824.652 84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The new API exports numa_maps on a per-memcg basis. This is useful
    information, as it shows the per-memcg page distribution across real NUMA
    nodes.

    One of the use cases is evaluating application performance by combining
    this information with the cpu allocation to the application.

    The output of memory.numa_stat tries to follow a format similar to
    numa_maps:

    total= N0= N1= ...
    file= N0= N1= ...
    anon= N0= N1= ...
    unevictable= N0= N1= ...

    And we have per-node:

    total = file + anon + unevictable

    $ cat /dev/cgroup/memory/memory.numa_stat
    total=250020 N0=87620 N1=52367 N2=45298 N3=64735
    file=225232 N0=83402 N1=46160 N2=40522 N3=55148
    anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
    unevictable=3735 N0=794 N1=0 N2=0 N3=2941

    Signed-off-by: Ying Han
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The caller of the function has been renamed to zone_nr_lru_pages(), and
    this is just the matching fixup in the memcg code. The current name is
    easily mis-read as the zone's total number of pages.

    Signed-off-by: Ying Han
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • If the memcg reclaim code detects that the target memcg is below its
    limit, it exits and returns a guaranteed non-zero value so that the
    charge is retried.

    Nowadays, the charge side checks the memcg limit itself and does not rely
    on this non-zero return value trick.

    This patch removes it. The reclaim code will now always return the true
    number of pages it reclaimed on its own.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • During memory reclaim we determine the number of pages to be scanned per
    zone as

    (anon + file) >> priority.
    Assume
    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
    priority gets higher. This has some problems.

    1. This increases priority by 1 without doing any scan.
    To scan at this priority, the amount of pages should be larger than 512M.
    If pages >> priority < SWAP_CLUSTER_MAX, the count is recorded and the
    scan will be batched later. (But we lose 1 priority.)
    If memory size is below 16M, pages >> priority is 0 and there is no scan
    at DEF_PRIORITY, ever.

    2. If zone->all_unreclaimable==true, it's scanned only when priority==0,
    so x86's ZONE_DMA will never be recovered until the user of its pages
    frees memory by itself.

    3. With memcg, the memory limit can be small. With a small memcg, it
    reaches priority < DEF_PRIORITY-2 very easily and needs to call
    wait_iff_congested(). To do a scan before priority=9, 64MB of memory
    would be needed.

    Then, this patch forces a scan of SWAP_CLUSTER_MAX pages when

    1. the target is small enough, and
    2. it's kswapd or memcg reclaim.

    Then we can avoid a rapid priority drop and may be able to recover
    all_unreclaimable in small zones. This patch also removes nr_saved_scan.
    It allows scanning at this priority even when pages >> priority is very
    small.
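
    A minimal sketch of the forced-scan rule described above; the helper
    names and exact conditions are assumptions, not the literal vmscan.c
    code:

    /* sketch: force a full batch for small targets under kswapd/memcg reclaim */
    unsigned long scan = (anon + file) >> priority;

    if (scan < SWAP_CLUSTER_MAX &&
        (current_is_kswapd() || !scanning_global_lru(sc))) {  /* assumed helpers */
            /* scan a batch instead of skipping and dropping priority */
            scan = min(anon + file, (unsigned long)SWAP_CLUSTER_MAX);
    }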

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, memory cgroup's direct reclaim frees memory from the current
    node. But this has some troubles. Usually when a set of threads works in
    a cooperative way, they tend to operate on the same node. So if they hit
    limits under memcg they will reclaim memory from themselves, damaging the
    active working set.

    For example, assume a 2-node system with Node 0 and Node 1, and a memcg
    with a 1G limit. After some work, file cache remains and the usages are

    Node 0: 1M
    Node 1: 998M.

    If an application then runs on Node 0, it will eat into its own working
    set before freeing the unnecessary file cache on Node 1.

    This patch adds round-robin for NUMA and adds equal pressure to each node.
    When using cpuset's spread memory feature, this will work very well.

    But yes, a better algorithm is needed.
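
    A minimal sketch of the round-robin idea, assuming a per-memcg
    last-scanned-node field (names are my assumption of the shape of the
    change, not quoted from the patch):

    /* sketch: pick the next node with memory each time reclaim runs */
    static int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
    {
            int node = memcg->last_scanned_node;        /* assumed field */

            node = next_node(node, node_states[N_HIGH_MEMORY]);
            if (node == MAX_NUMNODES)
                    node = first_node(node_states[N_HIGH_MEMORY]);

            memcg->last_scanned_node = node;
            return node;
    }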

    [akpm@linux-foundation.org: comment editing]
    [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
    Signed-off-by: Ying Han
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Move the page-freeing code out of swap_cgroup_mutex in the hope that it
    reduces some of the theoretical contention between swapons and/or
    swapoffs.

    This is just a cleanup, no functional changes.

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • It allocated one more page than necessary if @max_pages was a multiple of
    SC_PER_PAGE.
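
    The fix is the usual round-up idiom; a minimal illustration with an
    assumed helper name:

    /* sketch: pages needed for the swap_cgroup map.  The old form
     * "max_pages / SC_PER_PAGE + 1" over-allocates by one page whenever
     * max_pages is an exact multiple of SC_PER_PAGE. */
    static unsigned long sc_map_length(unsigned long max_pages)
    {
            return DIV_ROUND_UP(max_pages, SC_PER_PAGE);
    }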

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Commit ca371c0d7e23 ("memcg: fix page_cgroup fatal error in FLATMEM")
    removed the call to alloc_bootmem() from the function, so it can be
    marked __meminit to reduce memory usage when MEMORY_HOTPLUG=n.

    Also, as the new helper function alloc_page_cgroup() is called only from
    that function, it should be marked __meminit too.

    Signed-off-by: Namhyung Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Michal Hocko
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • next_mz is assigned NULL if __mem_cgroup_largest_soft_limit_node selects
    the same mz. This doesn't make much sense, as we assign to the variable
    right at the start of the next loop iteration anyway.

    The compiler will probably optimize this out, but it is a little
    confusing when reading the code.

    Signed-off-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We recently added a change to global background reclaim which counts the
    return value of soft_limit reclaim. Now this patch adds similar logic to
    global direct reclaim.

    We should skip scanning the global LRU in shrink_zone() if soft_limit
    reclaim does enough work. This is the first step, where we start by
    counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the
    global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The global kswapd scans per-zone LRU and reclaims pages regardless of the
    cgroup. It breaks memory isolation since one cgroup can end up reclaiming
    pages from another cgroup. Instead we should rely on memcg-aware target
    reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
    memory pressure.

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored. This patch is the first
    step to skip shrink_zone() if soft_limit reclaim does enough work.

    This is part of the effort to reduce reclaiming pages from the global LRU
    in memcg. The per-memcg background reclaim patchset further enhances
    per-cgroup targeted reclaim; I should have V4 of it posted shortly.

    Try running multiple memory-intensive workloads within separate memcgs.
    Watch the counters of soft_steal in memory.stat.

    $ cat /dev/cgroup/A/memory.stat | grep 'soft'
    soft_steal 240000
    soft_scan 240000
    total_soft_steal 240000
    total_soft_scan 240000

    This patch:

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored.

    We would like to skip shrink_zone() if soft_limit reclaim does enough
    work. Also, we need to keep the memory pressure balanced across per-memcg
    zones, like the logic in the VM core. This patch is the first step, where
    we start by counting the nr_scanned and nr_reclaimed from soft_limit
    reclaim into the global scan_control.
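
    A rough sketch of the accounting step in the reclaim path; the extra
    nr_scanned out-parameter on mem_cgroup_soft_limit_reclaim() is how I read
    the change, so treat the exact signature as an assumption:

    /* sketch: fold soft-limit reclaim results into the global scan_control */
    unsigned long nr_soft_scanned = 0;
    unsigned long nr_soft_reclaimed;

    nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc->order,
                                                      sc->gfp_mask,
                                                      &nr_soft_scanned);
    sc->nr_reclaimed += nr_soft_reclaimed;
    sc->nr_scanned   += nr_soft_scanned;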

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as it is replaced
    by this. All subsystems are modified for the new interface - of note is
    cpuset, which requires the from/to nodemasks for attach to be globally
    scoped (though per-cpuset would work too) so that they persist from its
    pre_attach to attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
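
    A minimal sketch of how the extended subsystem interface might look; the
    exact signatures in struct cgroup_subsys are inferred from the callback
    names above and should be checked against include/linux/cgroup.h:

    struct cgroup_subsys {
            /* ... existing fields ... */

            /* whole-threadgroup checks/operations, as before */
            int  (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                               struct task_struct *task);
            void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                           struct cgroup *old_cgrp, struct task_struct *task);

            /* new per-thread callbacks: may be called once per thread of a
             * threadgroup, potentially in atomic context */
            int  (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
            void (*pre_attach)(struct cgroup *cgrp);
            void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);

            /* ... */
    };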

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • The type of vma->vm_flags is 'unsigned long', not 'int' or 'unsigned
    int'. This patch fixes such misuse.
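
    The typedef mentioned in Linus's note below would look something like
    this (the type name is my assumption):

    /* sketch: a dedicated type so the width can be changed in one place later */
    typedef unsigned long vm_flags_t;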

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt
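
    A simplified sketch of where such a hook sits in a filesystem read path;
    the surrounding function and helper are hypothetical, and the
    return-value convention of cleancache_get_page() is as I understand it:

    /* sketch: try cleancache before issuing real I/O for a page-cache miss */
    static int my_readpage(struct file *file, struct page *page)
    {
            if (cleancache_get_page(page) == 0) {
                    /* page was filled from cleancache; no disk read needed */
                    SetPageUptodate(page);
                    unlock_page(page);
                    return 0;
            }
            return issue_disk_read(page);   /* hypothetical helper */
    }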

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     
  • This third patch of eight in this cleancache series provides
    the core code for cleancache that interfaces between the hooks in
    VFS and individual filesystems and a cleancache backend. It also
    includes build and config patches.

    Two new files are added: mm/cleancache.c and include/linux/cleancache.h.

    Note that CONFIG_CLEANCACHE can default to on; in systems that do
    not provide a cleancache backend, all hooks devolve to a simple
    check of a global enable flag, so performance impact should
    be negligible but can be reduced to zero impact if config'ed off.
    However for this first commit, it defaults to off.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    Credits: Cleancache_ops design derived from Jeremy Fitzhardinge
    design for tmem
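
    For orientation, the backend-facing ops structure looks roughly like the
    following; member names and signatures are from memory and should be
    checked against include/linux/cleancache.h:

    struct cleancache_ops {
            int (*init_fs)(size_t pagesize);
            int (*init_shared_fs)(char *uuid, size_t pagesize);
            int (*get_page)(int pool_id, struct cleancache_filekey key,
                            pgoff_t index, struct page *page);
            void (*put_page)(int pool_id, struct cleancache_filekey key,
                             pgoff_t index, struct page *page);
            void (*flush_page)(int pool_id, struct cleancache_filekey key,
                               pgoff_t index);
            void (*flush_inode)(int pool_id, struct cleancache_filekey key);
            void (*flush_fs)(int pool_id);
    };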

    [v8: dan.magenheimer@oracle.com: fix exportfs call affecting btrfs]
    [v8: akpm@linux-foundation.org: use static inline function, not macro]
    [v7: dan.magenheimer@oracle.com: cleanup sysfs and remove cleancache prefix]
    [v6: JBeulich@novell.com: robustly handle buggy fs encode_fh actor definition]
    [v5: jeremy@goop.org: clean up global usage and static var names]
    [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
    [v5: hch@infradead.org: cleaner non-global interface for ops registration]
    [v4: adilger@sun.com: interface must support exportfs FS's]
    [v4: hch@infradead.org: interface must support 64-bit FS on 32-bit kernel]
    [v3: akpm@linux-foundation.org: use one ops struct to avoid pointer hops]
    [v3: akpm@linux-foundation.org: document and ensure PageLocked reqts are met]
    [v3: ngupta@vflare.org: fix success/fail codes, change funcs to void]
    [v2: viro@ZenIV.linux.org.uk: use sane types]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Al Viro
    Acked-by: Andrew Morton
    Acked-by: Nitin Gupta
    Acked-by: Minchan Kim
    Acked-by: Andreas Dilger
    Acked-by: Jan Beulich
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Chris Mason
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker

    Dan Magenheimer
     

26 May, 2011

2 commits

  • Commit a71ae47a2cbf ("slub: Fix double bit unlock in debug mode")
    removed the only goto to this label, resulting in

    mm/slub.c: In function '__slab_alloc':
    mm/slub.c:1834: warning: label 'unlock_out' defined but not used

    fixed trivially by the removal of the label itself too.

    Reported-by: Stephen Rothwell
    Cc: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
    cfq-iosched: free cic_index if cfqd allocation fails
    cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
    cfq-iosched: reduce bit operations in cfq_choose_req()
    cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
    blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
    block: move bd_set_size() above rescan_partitions() in __blkdev_get()
    block: call elv_bio_merged() when merged
    cfq-iosched: Make IO merge related stats per cpu
    cfq-iosched: Fix a memory leak of per cpu stats for root group
    backing-dev: Kill set but not used var in bdi_debug_stats_show()
    block: get rid of on-stack plugging debug checks
    blk-throttle: Make no throttling rule group processing lockless
    blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
    blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
    blk-throttle: Make dispatch stats per cpu
    blk-throttle: Free up a group only after one rcu grace period
    blk-throttle: Use helper function to add root throtl group to lists
    blk-throttle: Introduce a helper function to fill in device details
    blk-throttle: Dynamically allocate root group
    blk-cgroup: Allow sleeping while dynamically allocating a group
    ...

    Linus Torvalds
     

25 May, 2011

20 commits

  • Currently, on nommu architectures mmap(), mremap() and munmap() don't do
    page alignment, which is inconsistent with MMU architectures and causes
    some issues.

    First, some drivers' mmap() functions depend on vma->vm_end - vma->vm_start
    being page aligned, which is true on MMU architectures but not on nommu,
    e.g. the uvc camera driver.

    Second, munmap() may return an -EINVAL (split file) error when the end
    passed in from userspace is not page aligned but vma->vm_end is aligned
    due to a split or the driver's mmap() op.

    Add page alignment to fix those issues.
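
    A minimal sketch of the kind of alignment being added (the exact call
    sites in mm/nommu.c are not shown and the placement is an assumption):

    /* sketch: round the request up to whole pages, as the MMU path does */
    len = PAGE_ALIGN(len);
    if (!len)
            return -EINVAL;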

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Bob Liu
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • The zone->lru_lock is heavily contended in workloads where
    activate_page() is used frequently. We can batch activate_page() calls to
    reduce the lock contention. The batched pages are added to the zone's LRU
    list when the pool is full or when page reclaim tries to drain them.

    For example, in a 4-socket, 64-CPU system, create a sparse file and 64
    processes that share a mapping of the file. Each process reads the whole
    file and then exits. The process exit does unmap_vmas() and causes a lot
    of activate_page() calls. In such a workload, we saw about a 58% total
    time reduction with the patch below. Other workloads with a lot of
    activate_page() calls also benefit.

    Andrew Morton suggested activate_page() and putback_lru_pages() should
    follow the same path to activate pages, but this is hard to implement
    (see commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce
    lock contention")). On the other hand, do we really need
    putback_lru_pages() to follow the same path? I tested several FIO/FFSB
    benchmarks (about 20 scripts per benchmark) on 3 machines here, from 2
    sockets to 4 sockets. My tests don't show anything significant
    with/without the patch below (there is a slight difference, but mostly
    noise that we saw even before this patch). The patch below basically
    returns to the same code as my first post.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page() a lot. The others are basically within
    run-to-run variation, as each run differs slightly.

    In UP case, 'size mm/swap.o'
    before the two patches:
    text data bss dec hex filename
    6466 896 4 7366 1cc6 mm/swap.o
    after the two patches:
    text data bss dec hex filename
    6343 896 4 7243 1c4b mm/swap.o
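
    The batching itself is the standard per-cpu pagevec pattern; the sketch
    below is close to, but not necessarily identical to, the actual mm/swap.c
    code:

    static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);

    /* sketch: queue pages on a per-cpu pagevec and only take zone->lru_lock
     * once the pagevec fills up (or when reclaim drains it) */
    void activate_page(struct page *page)
    {
            if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
                    struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

                    page_cache_get(page);
                    if (!pagevec_add(pvec, page))
                            pagevec_lru_move_fn(pvec, __activate_page, NULL);
                    put_cpu_var(activate_page_pvecs);
            }
    }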

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Cc: Hiroyuki Kamezawa
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I believe I found a problem in __alloc_pages_slowpath, which allows a
    process to get stuck endlessly looping, even when lots of memory is
    available.

    Running an I/O and memory intensive stress-test I see a 0-order page
    allocation with __GFP_IO and __GFP_WAIT, running on a system with very
    little free memory. Right about the same time that the stress-test gets
    killed by the OOM-killer, the utility trying to allocate memory gets stuck
    in __alloc_pages_slowpath even though most of the system's memory was freed
    by the oom-kill of the stress-test.

    The utility ends up looping from the rebalance label down through
    wait_iff_congested() continuously. Because order=0,
    __alloc_pages_direct_compact skips the call to get_page_from_freelist.
    Because all of the reclaimable memory on the system has already been
    reclaimed, __alloc_pages_direct_reclaim skips the call to
    get_page_from_freelist. Since there is no __GFP_FS flag, the block with
    __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested,
    then jumps back to rebalance without ever trying to
    get_page_from_freelist. This loop repeats infinitely.

    The test case is pretty pathological. Running a mix of I/O stress-tests
    that do a lot of fork() and consume all of the system memory, I can pretty
    reliably hit this on 600 nodes, in about 12 hours. 32GB/node.

    Signed-off-by: Andrew Barry
    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Barry
     
  • The noswapaccount parameter has been deprecated since 2.6.38 without any
    complaints from users so we can remove it. swapaccount=0|1 can be used
    instead.

    As we are removing the parameter, we can also clean up swapaccount: it no
    longer has to accept an empty string (to match noswapaccount), so we can
    push '=' into the __setup macro rather than checking for "=1" and "=0"
    strings.

    Signed-off-by: Michal Hocko
    Cc: Hiroyuki Kamezawa
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
    issues.

    - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

    - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

    - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • This function has been superseded by gather_hugetbl_stats() and is no
    longer needed.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Improve the prototype of gather_stats() to take a struct numa_maps as
    argument instead of a generic void *. Update all callers to make the
    required type explicit.

    Since gather_stats() is not needed before its definition and is scheduled
    to be moved out of mempolicy.c the declaration is removed as well.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Mapping statistics in a NUMA environment is now computed using the generic
    walk_page_range() logic. Remove the old/equivalent functionality.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Converting show_numa_map() to use the generic routine decouples the
    function from mempolicy.c, allowing it to be moved out of the mm subsystem
    and into fs/proc.

    Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk
    logic implemented by check_pte_range() failed to account for such pages as
    they were not applicable to the page migration case.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
    get_vma_policy() was marked static as all clients were local to
    mempolicy.c.

    However, the decision to generate /proc/pid/numa_maps in the numa memory
    policy code and outside the procfs subsystem introduces an artificial
    interdependency between the two systems. Exporting get_vma_policy() once
    again is the first step to clean up this interdependency.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Implement generic xattrs for tmpfs filesystems. The Fedora project, while
    trying to replace suid apps with file capabilities, realized that tmpfs,
    which is used on the build systems, does not support file capabilities and
    thus cannot be used to build packages which use file capabilities. Xattrs
    are also needed for overlayfs.

    The xattr interface is a bit odd. If a filesystem does not implement any
    {get,set,list}xattr functions the VFS will call into some random LSM hooks
    and the running LSM can then implement some method for handling xattrs.
    SELinux for example provides a method to support security.selinux but no
    other security.* xattrs.

    As it stands today, when one enables CONFIG_TMPFS_POSIX_ACL, tmpfs has
    xattr handler routines specifically to handle ACLs. Because of this,
    tmpfs would lose the VFS/LSM helpers that support the running LSM. To
    make up for that, tmpfs had stub functions that did nothing but call into
    the LSM hooks which implement the helpers.

    This new patch does not use the LSM fallback functions and instead just
    implements a native get/set/list xattr feature for the full security.* and
    trusted.* namespace like a normal filesystem. This means that tmpfs can
    now support both security.selinux and security.capability, which was not
    previously possible.

    The basic implementation is that I attach a:

    struct shmem_xattr {
            struct list_head list;  /* anchored by shmem_inode_info->xattr_list */
            char *name;
            size_t size;
            char value[0];
    };

    Into the struct shmem_inode_info for each xattr that is set. This
    implementation could easily support the user.* namespace as well, except
    some care needs to be taken to prevent large amounts of unswappable memory
    being allocated for unprivileged users.

    [mszeredi@suse.cz: new config option, support trusted.*, support symlinks]
    Signed-off-by: Eric Paris
    Signed-off-by: Miklos Szeredi
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Cc: Kyle McMartin
    Acked-by: Hugh Dickins
    Tested-by: Jordi Pujol
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • The bootmem wrapper with memblock supports top-down now, so we no longer
    need this trick.

    Signed-off-by: Yinghai LU
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Olaf Hering
    Cc: Tejun Heo
    Cc: Lucas De Marchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • The page allocator will improperly return a page from ZONE_NORMAL even
    when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller
    expects DMA memory, perhaps for ISA devices with 16-bit address registers,
    and may get higher memory resulting in undefined behavior.

    This patch causes the page allocator to return NULL in such circumstances
    with a warning emitted to the kernel log on the first occurrence.
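
    A minimal sketch of the kind of check this adds early in the allocator;
    the exact placement and form are assumptions:

    #ifndef CONFIG_ZONE_DMA
            /* sketch: no DMA zone exists to satisfy this request, so fail
             * loudly once rather than silently handing back higher memory */
            if (WARN_ON_ONCE(gfp_mask & __GFP_DMA))
                    return NULL;
    #endif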

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • online_pages() is only compiled for CONFIG_MEMORY_HOTPLUG_SPARSE, so there
    is no need to support CONFIG_FLATMEM code within it.

    This patch removes code that is never used.

    Signed-off-by: Daniel Kiper
    Acked-by: Dave Hansen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • It is pointless for deactivate_page() to operate on unevictable pages.
    This patch removes that unnecessary overhead, which might be a problem
    when there are many unevictable pages in the system (e.g. an mprotect
    workload).

    [akpm@linux-foundation.org: tidy up comment]
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Previously, mmap sequential readahead was triggered by updating
    ra->prev_pos on each page fault and comparing it with the current page
    offset.

    That costs dirtying the cache line on each _minor_ page fault. So remove
    the ra->prev_pos recording, and instead tag PG_readahead to trigger the
    possible sequential readahead. It's not only simpler, but will also work
    more reliably and reduce cache line bouncing on concurrent page faults on
    a shared struct file.

    In the mosbench exim benchmark which does multi-threaded page faults on
    shared struct file, the ra->mmap_miss and ra->prev_pos updates are found
    to cause excessive cache line bouncing on tmpfs, which actually disabled
    readahead totally (shmem_backing_dev_info.ra_pages == 0).

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The original INT_MAX is too large; reduce it to

    - avoid unnecessarily dirtying/bouncing the cache line

    - restore mmap read-around faster on changed access pattern

    Background: in the mosbench exim benchmark, which does multi-threaded page
    faults on a shared struct file, the ra->mmap_miss updates are found to
    cause excessive cache line bouncing on tmpfs. The ra state updates are
    needless for tmpfs because it has readahead disabled entirely
    (shmem_backing_dev_info.ra_pages == 0).

    Tested-by: Tim Chen
    Signed-off-by: Andi Kleen
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Reduce readahead overheads by returning early in do_sync_mmap_readahead().

    tmpfs has ra_pages=0 and it can page fault really fast (not constrained
    by IO if not swapping).
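
    The change amounts to an early bail-out when readahead is disabled; a
    minimal sketch of the check as I read the patch:

    /* sketch: readahead disabled (e.g. tmpfs, ra_pages == 0), nothing to do */
    if (!ra->ra_pages)
            return;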

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Change each shrinker's API by consolidating the existing parameters into
    a shrink_control struct. This will simplify adding further features
    without having to touch every shrinker's file.
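
    For reference, a sketch of the consolidated interface; field and callback
    names are from memory and should be checked against the actual headers:

    struct shrink_control {
            gfp_t gfp_mask;

            /* how many slab objects shrink() should scan and try to reclaim */
            unsigned long nr_to_scan;
    };

    /* old: int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask); */
    /* new: all parameters travel in one struct */
    int (*shrink)(struct shrinker *shrinker, struct shrink_control *sc);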

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: fix warning]
    [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
    [akpm@linux-foundation.org: fix xfs warning]
    [akpm@linux-foundation.org: update gfs2]
    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Consolidate the existing parameters to shrink_slab() into a new
    shrink_control struct. This is needed later to pass the same struct to
    shrinkers.
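
    A sketch of what a shrink_slab() call site might look like after the
    change; the argument order is an assumption:

    struct shrink_control shrink = {
            .gfp_mask = sc->gfp_mask,
    };

    /* pages scanned and LRU size are still passed alongside the control struct */
    nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);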

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han