24 Sep, 2009

4 commits

  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the subsystem
    about the old cgroup of the threadgroup leader. No subsystem currently
    needs that information for each thread being moved, but if one were added
    (for example, one that counts tasks within a group) this would need to be
    reworked to pass the right information to the subsystem for every thread.
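
    A rough sketch of what the extended hooks look like after this change
    (hedged: exact parameter names may differ), assuming a boolean
    threadgroup flag is appended to both callbacks:

        struct cgroup_subsys {
                ...
                int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                  struct task_struct *tsk, bool threadgroup);
                void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                               struct cgroup *old_cgrp, struct task_struct *tsk,
                               bool threadgroup);
                ...
        };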

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Now that ksm is in mainline, it is better to change the default values to
    better fit most users.

    This patch changes the ksm default values to:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspects of this patch are that it disables ksm by default
    and sets the number of kernel pages that can be allocated to a reasonable
    value.
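
    With ksm now disabled by default, users have to opt in via sysfs. A
    minimal user-space sketch (assuming the corresponding /sys/kernel/mm/ksm/
    knobs: run, pages_to_scan, sleep_millisecs) that re-enables merging with
    these defaults:

        /* enable_ksm.c - build with: gcc -o enable_ksm enable_ksm.c; run as root */
        #include <stdio.h>

        static int write_sysfs(const char *path, const char *val)
        {
                FILE *f = fopen(path, "w");

                if (!f) {
                        perror(path);
                        return -1;
                }
                fputs(val, f);
                return fclose(f);
        }

        int main(void)
        {
                write_sysfs("/sys/kernel/mm/ksm/pages_to_scan", "100");
                write_sysfs("/sys/kernel/mm/ksm/sleep_millisecs", "20");
                /* 1 == KSM_RUN_MERGE; the new default is 0 (KSM_RUN_STOP) */
                return write_sysfs("/sys/kernel/mm/ksm/run", "1") ? 1 : 0;
        }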

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
     
  • This slipped past the previous sweeps.

    Signed-off-by: Rusty Russell
    Acked-by: Christoph Lameter

    Rusty Russell
     
  • My 58fa879e1e640a1856f736b418984ebeccee1c95 "mm: FOLL flags for GUP flags"
    broke CONFIG_NOMMU build by forgetting to update nommu.c foll_flags type:

    mm/nommu.c:171: error: conflicting types for `__get_user_pages'
    mm/internal.h:254: error: previous declaration of `__get_user_pages' was here
    make[1]: *** [mm/nommu.o] Error 1

    My 03f6462a3ae78f36eb1f0ee8b4d5ae2f7859c1d5 "mm: move highest_memmap_pfn"
    broke CONFIG_NOMMU build by forgetting to add a nommu.c highest_memmap_pfn:

    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x326): undefined reference to `highest_memmap_pfn'
    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x32d): undefined reference to `highest_memmap_pfn'

    Fix both breakages, and give myself 30 lashes (ouch!)

    Reported-by: Michal Simek
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Sep, 2009

3 commits

  • Some archs define MODULES_VADDR/MODULES_END outside the VMALLOC area.
    This is currently handled only on x86-64. This patch makes the handling
    more generic, so that vread/vwrite can be used to access the area.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Originally, walk_memory_resource() was introduced to traverse all
    "System RAM" memory for detecting memory hotplug/unplug ranges. For that,
    the flags IORESOURCE_MEM|IORESOURCE_BUSY were used, and this was enough
    for memory hotplug.

    But for another purpose, /proc/kcore, this may include some firmware
    areas marked as IORESOURCE_BUSY | IORESOURCE_MEM. This patch makes the
    check strict, so that only busy "System RAM" is found.

    Note: PPC64 keeps its own walk_memory_resource(), which walks through
    ppc64's lmb information. Because the old kclist_add() is called per lmb,
    this patch ultimately makes no difference in behavior there.

    This patch also removes the CONFIG_MEMORY_HOTPLUG check from this
    function. Because pfn_valid() only shows "there is a memmap or not" and
    cannot be used for "there is physical memory or not", this function is
    generically useful for scanning physical memory ranges.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Américo Wang
    Cc: David Rientjes
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A patch to give a better overview of userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to see the main process/thread stack usage,
    which is shown in /proc/<pid>/status as the "VmStk" value, but you get no
    information about the stack memory consumed by the other threads.

    There is an enhancement to /proc/<pid>/{task/*,}/maps which marks the vm
    mapping where the thread stack pointer resides with "[thread stack:
    xxxxxxxx]", where xxxxxxxx is the maximum size of the stack. This is
    valuable information, because libpthread doesn't necessarily set the
    start of the stack at the top of the mapped area; it depends on the
    pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    There is also a new entry "Stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be
    the base address of the associated thread stack rather than that of the
    main process. This makes more sense.
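
    For illustration, a small user-space program (hypothetical, not part of
    the patch) that prints the new field for the calling task:

        /* stackusage.c - print the "Stack usage:" line from /proc/self/status */
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen("/proc/self/status", "r");

                if (!f) {
                        perror("/proc/self/status");
                        return 1;
                }
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "Stack usage:", 12))
                                fputs(line, stdout);
                fclose(f);
                return 0;
        }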

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

33 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    trivial: fix typo in aic7xxx comment
    trivial: fix comment typo in drivers/ata/pata_hpt37x.c
    trivial: typo in kernel-parameters.txt
    trivial: fix typo in tracing documentation
    trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
    trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
    trivial: remove unnecessary semicolons
    trivial: Fix duplicated word "options" in comment
    trivial: kbuild: remove extraneous blank line after declaration of usage()
    trivial: improve help text for mm debug config options
    trivial: doc: hpfall: accept disk device to unload as argument
    trivial: doc: hpfall: reduce risk that hpfall can do harm
    trivial: SubmittingPatches: Fix reference to renumbered step
    trivial: fix typos "man[ae]g?ment" -> "management"
    trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
    trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
    trivial: fix missing printk space in amd_k7_smp_check
    trivial: fix typo s/ketymap/keymap/ in comment
    trivial: fix typo "to to" in multiple files
    trivial: fix typos in comments s/DGBU/DBGU/
    ...

    Linus Torvalds
     
  • Some architectures (like the Blackfin arch) implement some of the
    "simpler" features that one would expect out of a MMU such as memory
    protection.

    In our case, we actually get read/write/exec protection down to the page
    boundary so processes can't stomp on each other let alone the kernel.

    There is, however, a performance decrease (which depends greatly on the
    workload), as the hardware/software interaction was not optimized at
    design time.

    Signed-off-by: Bernd Schmidt
    Signed-off-by: Bryan Wu
    Signed-off-by: Mike Frysinger
    Acked-by: David Howells
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schmidt
     
  • When the mm being switched to matches the active mm, we don't need to
    increment and then drop the mm count. In a simple benchmark this happens
    in about 50% of the time. Making that conditional reduces contention on that
    cacheline on SMP systems.
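
    Roughly the shape of the change in use_mm() (a hedged sketch, not the
    verbatim diff): the extra reference is only taken and dropped when the
    incoming mm differs from the currently active one:

        /* in use_mm(mm): */
        active_mm = tsk->active_mm;
        if (active_mm != mm)
                atomic_inc(&mm->mm_count);
        tsk->mm = mm;
        tsk->active_mm = mm;
        switch_mm(active_mm, mm, tsk);

        if (active_mm != mm)
                mmdrop(active_mm);      /* drop the old active_mm only if it changed */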

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Anyone who wants to copy to/from user space from a kernel thread needs
    use_mm (like fs/aio has). Move it into mm/ to make reusing and exporting
    it easier down the line, and make aio use it. The next intended user,
    besides aio, will be vhost-net.
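
    The usage pattern is roughly this (a hedged sketch, error handling and
    includes trimmed): a kernel thread temporarily adopts the user's mm, does
    the copy, and lets go of it again, as fs/aio does:

        #include <linux/mmu_context.h>  /* assumed new home of use_mm()/unuse_mm() */

        static void kthread_copy(struct mm_struct *mm, void __user *dst,
                                 const void *src, size_t len)
        {
                use_mm(mm);                     /* adopt the user mm */
                if (copy_to_user(dst, src, len))
                        printk(KERN_WARNING "copy_to_user failed\n");
                unuse_mm(mm);                   /* and drop it again */
        }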

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Fixes the following kmemcheck false positive (the compiler is using
    a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):

    [ 0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
    [ 0.352000] CPU0 attaching NULL sched-domain.
    [ 0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized
    memory (9f8020fc)
    [ 0.361000]
    a44240820000000041f6998100000000000000000000000000000000ff030000
    [ 0.368000] i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u u
    [ 0.375000] ^
    [ 0.376000]
    [ 0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
    [ 0.378000] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 0.379000] EIP is at shmem_fill_super+0xb5/0x120
    [ 0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
    [ 0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
    [ 0.382000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
    [ 0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.385000] DR6: ffff4ff0 DR7: 00000400
    [ 0.386000] [] get_sb_nodev+0x3c/0x80
    [ 0.388000] [] shmem_get_sb+0x14/0x20
    [ 0.390000] [] vfs_kern_mount+0x4f/0x120
    [ 0.392000] [] init_tmpfs+0x7e/0xb0
    [ 0.394000] [] do_basic_setup+0x17/0x30
    [ 0.396000] [] kernel_init+0x57/0xa0
    [ 0.398000] [] kernel_thread_helper+0x7/0x10
    [ 0.400000] [] 0xffffffff
    [ 0.402000] khelper used greatest stack depth: 2820 bytes left
    [ 0.407000] calling init_mmap_min_addr+0x0/0x10 @ 1
    [ 0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs

    Reported-by: Ingo Molnar
    Analysed-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.
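
    A minimal user-space example of the new flag (the MAP_HUGETLB fallback
    value below is the asm-generic one and may be arch-specific, so check
    your headers; huge pages must also be reserved via
    /proc/sys/vm/nr_hugepages first):

        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MAP_HUGETLB
        #define MAP_HUGETLB 0x40000     /* asm-generic value; arch-specific */
        #endif

        int main(void)
        {
                size_t len = 2 * 1024 * 1024;   /* one 2MB huge page on x86 */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap(MAP_ANONYMOUS|MAP_HUGETLB)");
                        return 1;
                }
                p[0] = 1;               /* behaves like ordinary anonymous memory */
                munmap(p, len);
                return 0;
        }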

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
    drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more proper place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worthwhile to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worthwhile--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock).
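
    The resulting locking condition looks roughly like this (a hedged sketch
    of vma_adjust(), not the verbatim patch):

        /*
         * Only take the anon_vma lock when it is actually needed: when a vma
         * is being inserted, when the range overlaps the next vma (non-NULL
         * importer), or when vm_start is being adjusted.
         */
        if (vma->anon_vma && (insert || importer || start != vma->vm_start))
                anon_vma = vma->anon_vma;
        if (anon_vma)
                spin_lock(&anon_vma->lock);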

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
    CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
    shmem_acl.o being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
    header files is mm.h shmem_lock() - give that a shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If (flags & MAP_LOCKED) is true, vm_flags already contains the VM_LOCKED
    bit, which is set by calc_vm_flag_bits().

    So there is no need to set it again; just remove the redundant assignment.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
    those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

    Contrary to how I'd imagined it, there's nothing ugly about this, just a
    zero_pfn test built into one or another block of vm_normal_page().

    But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
    my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
    ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
    that: it would have to take vm_flags to weed out some cases.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I'm still reluctant to clutter __get_user_pages() with another flag, just
    to avoid touching ZERO_PAGE count in mlock(); though we can add that later
    if it shows up as an issue in practice.

    But when mlocking, we can test page->mapping slightly earlier, to avoid
    the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
    didn't lock_page in olden ZERO_PAGE days, so we might have regressed.

    And when munlocking, it turns out that FOLL_DUMP coincidentally does
    what's needed to avoid all updates to ZERO_PAGE, so use that here also.
    Plus add comment suggested by KAMEZAWA Hiroyuki.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
    advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
    using in 2.6.24. And there were a couple of regression reports on LKML.

    Following suggestions from Linus, reinstate do_anonymous_page() use of
    the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
    with (map)count updates - let vm_normal_page() regard it as abnormal.

    Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
    most powerpc): that's not essential, but minimizes additional branches
    (keeping them in the unlikely pte_special case); and incidentally
    excludes mips (some models of which needed eight colours of ZERO_PAGE
    to avoid costly exceptions).

    Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
    callers won't want to make exceptions for it, so increment its count
    there. Changes to mlock and migration? happily seems not needed.

    In most places it's quicker to check pfn than struct page address:
    prepare a __read_mostly zero_pfn for that. Does get_dump_page()
    still need its ZERO_PAGE check? probably not, but keep it anyway.
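
    A hedged sketch of the pfn-based check (names as described above; the
    final code may differ in detail):

        unsigned long zero_pfn __read_mostly;   /* initialized from ZERO_PAGE(0) */

        static inline int is_zero_pfn(unsigned long pfn)
        {
                return pfn == zero_pfn;
        }

        /* vm_normal_page() then treats the zero pfn as "abnormal": */
        if (is_zero_pfn(pfn))
                return NULL;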

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page() has been wrong to dirty the pte regardless.
    If it's not going to mark the pte writable, then it won't help
    to mark it dirty here, and clogs up memory with pages which will
    need swap instead of being thrown away. Especially wrong if no
    overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
    we could exceed the limit and OOM despite no overcommit.

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The "FOLL_ANON optimization" and its use_zero_page() test have caused
    confusion and bugs: why does it test VM_SHARED? for the very good but
    unsatisfying reason that VMware crashed without. As we look to maybe
    reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

    Easily done: it's silly for __get_user_pages() and follow_page() to
    be guessing whether it's safe to assume that they're being used for
    a coredump (which can take a shortcut snapshot where other uses must
    handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

    get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In preparation for the next patch, add a simple get_dump_page(addr)
    interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
    get_user_pages() directly. They're not interested in errors: they
    just want to use holes as much as possible, to save space and make
    sure that the data is aligned where the headers said it would be.

    Oh, and don't use that horrid DUMP_SEEK(off) macro!
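
    Usage in the dumpers then becomes roughly (a hedged sketch, assuming the
    dump_write()/dump_seek() helpers already used by binfmt_elf):

        struct page *page = get_dump_page(addr);

        if (page) {
                void *kaddr = kmap(page);

                stop = !dump_write(file, kaddr, PAGE_SIZE);
                kunmap(page);
                page_cache_release(page);
        } else {
                stop = !dump_seek(file, PAGE_SIZE);     /* a hole: just skip it */
        }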

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
    flags added solely to prevent __get_user_pages() from doing some of
    what it usually does, in the munlock case: we can now remove them.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hiroaki Wakabayashi points out that when mlock() has been interrupted
    by SIGKILL, the subsequent munlock() takes unnecessarily long because
    its use of __get_user_pages() insists on faulting in all the pages
    which mlock() never reached.

    It's worse than slowness if mlock() is terminated by Out Of Memory kill:
    the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
    pages which mlock() could not find memory for; so innocent bystanders are
    killed too, and perhaps the system hangs.

    __get_user_pages() does a lot that's silly for munlock(): so remove the
    munlock option from __mlock_vma_pages_range(), and use a simple loop of
    follow_page()s in munlock_vma_pages_range() instead; ignoring absent
    pages, and not marking present pages as accessed or dirty.

    (Change munlock() to only go so far as mlock() reached? That does not
    work out, given the convention that mlock() claims complete success even
    when it has to give up early - in part so that an underlying file can be
    extended later, and those pages locked which earlier would give SIGBUS.)
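
    A hedged sketch of the new munlock loop (combining this change with the
    FOLL_DUMP tweak described in the mlock/munlock entry above; not the exact
    patch):

        for (addr = start; addr < end; addr += PAGE_SIZE) {
                /* FOLL_DUMP skips absent pages and the ZERO_PAGE for us */
                struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

                if (page && !IS_ERR(page)) {
                        lock_page(page);
                        munlock_vma_page(page);
                        unlock_page(page);
                        put_page(page);
                }
                cond_resched();
        }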

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Reviewed-by: Hiroaki Wakabayashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When round-robin freeing pages from the PCP lists, empty lists may be
    encountered. In the event one of the lists has more pages than another,
    there may be numerous checks for list_empty() which is undesirable. This
    patch maintains a count of pages to free which is incremented when empty
    lists are encountered. The intention is that more pages will then be
    freed from fuller lists than the empty ones reducing the number of empty
    list checks in the free path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times.
    The patch reduces the number of checks by maintaining a count of pages to
    free that is bumped when empty lists are encountered. Lists containing
    pages will then free multiple pages in batch.

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any CPU in particular and
    were run such that the results should be 99% confidence that the reported
    results are within 1% of the estimated mean. sysbench was run with a
    postgres background and read-only tests. Similar to netperf, it was run
    multiple times so that its results are within 1% with 99% confidence. The
    patches were tested on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
                  4096  +15.83%
                  8192  + 0.34% (not significant)
                  16384 + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type, to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast-path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu
    structure, and this was potentially a major problem at the time the
    search was introduced. That problem has since been mitigated, as only the
    necessary number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is when bulk
    freeing pages. We round-robin free pages based on migrate type which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly in the event the majority of PCP pages are of one
    type.
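
    The per-cpu structure then looks roughly like this (a hedged sketch; the
    MIGRATE_PCPTYPES name and the field comments are assumptions):

        struct per_cpu_pages {
                int count;              /* number of pages on the lists */
                int high;               /* high watermark, emptying needed */
                int batch;              /* chunk size for buddy add/remove */

                /* One list per migrate type stored on the pcp lists */
                struct list_head lists[MIGRATE_PCPTYPES];
        };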

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently oom_kill doesn't only kill the victim process, but also kills
    all tasks that share the same mm. That means a vfork parent will be
    killed too.

    This is definitely incorrect: the other processes have their own oom_adj,
    and we shouldn't ignore it (it might be set to OOM_DISABLE).

    The following caller hits the minefield:

    ===============================
        switch (constraint) {
        case CONSTRAINT_MEMORY_POLICY:
                oom_kill_process(current, gfp_mask, order, 0, NULL,
                                "No available memory (MPOL_BIND)");
                break;

    Note: force_sig(SIGKILL) sends SIGKILL to all threads in the process, so
    we don't need to care about multi-threading here.

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The oom-killer kills a process, not a task, so oom_score should be
    calculated per-process too. This is more consistent and speeds up
    select_bad_process().

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, the OOM logic call flow is as follows:

    __out_of_memory()
        select_bad_process()        for each task
            badness()               calculate badness of one task
        oom_kill_process()          search child
            oom_kill_task()         kill target task and mm shared tasks with it

    For example, process-A has two threads, thread-A and thread-B; it has a
    very large amount of memory, and each thread has the following oom_adj
    and oom_score:

    thread-A: oom_adj = OOM_DISABLE, oom_score = 0
    thread-B: oom_adj = 0, oom_score = very-high

    Then select_bad_process() selects thread-B, but oom_kill_task() refuses
    to kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
    calls select_bad_process() again, but select_bad_process() selects the
    same task. That means the kernel falls into a livelock.

    The fact is, select_bad_process() must select a killable task, otherwise
    the OOM logic goes into a livelock.

    And the root cause is that oom_adj shouldn't be a per-thread value; it
    should be per-process, because the OOM-killer kills a process, not a
    thread. Thus this patch moves oomkilladj (now more appropriately named
    oom_adj) from struct task_struct to struct signal_struct, which naturally
    prevents select_bad_process() from choosing the wrong task.
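
    From user space the knob is still /proc/<pid>/oom_adj; after this change,
    a write through any thread's entry applies to the whole thread group. A
    small illustrative program (hypothetical, not part of the patch; lowering
    the value needs CAP_SYS_RESOURCE):

        /* oom_protect.c - mark the calling process unkillable by the OOM killer */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/self/oom_adj", "w");

                if (!f) {
                        perror("/proc/self/oom_adj");
                        return 1;
                }
                fputs("-17\n", f);      /* -17 == OOM_DISABLE */
                return fclose(f) ? 1 : 0;
        }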

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 084f71ae5c ("kill page_queue_congested()") removed
    page_queue_congested(). Remove the stale page_queue_congested() comment
    in vmscan's pageout() too.

    Signed-off-by: Vincent Li
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in each of the
    active/inactive lists, 4 pages of the active list will be scanned, while
    the inactive list will in effect be (over)scanned by SWAP_CLUSTER_MAX=32
    pages. And that could break the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalance in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The name `zone_nr_pages' can be misread as the zone's (total) number of
    pages, but it actually returns the number of pages on one of the zone's
    LRU lists.

    Signed-off-by: Vincent Li
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Sizing of memory allocations shouldn't depend on the number of physical
    pages found in a system, as that generally includes (perhaps a huge amount
    of) non-RAM pages. The amount of what actually is usable as storage
    should instead be used as a basis here.

    Some of the calculations (i.e. those not intending to use high memory)
    should likely even use (totalram_pages - totalhigh_pages).

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Dave Airlie
    Cc: Kyle McMartin
    Cc: Jeremy Fitzhardinge
    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Sizing of memory allocations shouldn't depend on the number of physical
    pages found in a system, as that generally includes (perhaps a huge amount
    of) non-RAM pages. The amount of what actually is usable as storage
    should instead be used as a basis here.

    In line with that, the memory hotplug code should update num_physpages in
    a way that it retains its original (post-boot) meaning; in particular,
    decreasing the value should at best be done with great care - this patch
    doesn't try to ever decrease this value at all as it doesn't really seem
    meaningful to do so.

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Cc: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich