24 Sep, 2009

16 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would
    be useful for users to show swap usage in the memory.stat file, because then
    they don't need to calculate memsw.usage - res.usage to know the swap usage.
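
    A minimal sketch of how such a memory.stat line could be emitted (illustrative
    only: mem_cgroup_read_stat() and the cgroup map callback are used roughly as in
    the memcg code of this era, but the exact call sites may differ):

    /* inside the memory.stat show routine, with 'mem' the cgroup being read */
    s64 swap_pages = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
    cb->fill(cb, "swap", (u64)swap_pages * PAGE_SIZE);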

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Reduce the resource counter overhead (mostly spinlock) associated with the
    root cgroup. This is a part of the several patches to reduce mem cgroup
    overhead. I had posted other approaches earlier (including using percpu
    counters). Those patches will be a natural addition and will be added
    iteratively on top of these.

    The patch stops resource counter accounting for the root cgroup. The data
    for display is derived from the statistics we maintain via
    mem_cgroup_charge_statistics (which is more scalable). What happens today
    is that, we do double accounting, once using res_counter_charge() and once
    using memory_cgroup_charge_statistics(). For the root, since we don't
    implement limits any more, we don't need to track every charge via
    res_counter_charge() and check for limit being exceeded and reclaim.

    The main mem->res usage_in_bytes can be derived by summing the cache and
    rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
    additional data about swapped out memory. This patch adds a
    MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
    recursively when hierarchy is enabled.
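
    A minimal sketch of the derivation described above (the helper is hypothetical;
    the stat indices are the ones named in the text, though exact accessor names in
    the tree may differ):

    static u64 example_root_usage(struct mem_cgroup *mem, bool swap)
    {
            s64 pages;

            pages  = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
            pages += mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
            if (swap)       /* memsw: also count what has been swapped out */
                    pages += mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
            return (u64)pages * PAGE_SIZE;
    }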

    The test results I see on a 24-way system show that

    1. The lock contention disappears from /proc/lock_stats
    2. The results of the test are comparable to running with
    cgroup_disable=memory.

    Here is a sample of my program runs

    Without Patch

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7192804.124144 task-clock-msecs # 23.937 CPUs
    424691 context-switches # 0.000 M/sec
    267 CPU-migrations # 0.000 M/sec
    28498113 page-faults # 0.004 M/sec
    5826093739340 cycles # 809.989 M/sec
    408883496292 instructions # 0.070 IPC
    7057079452 cache-references # 0.981 M/sec
    3036086243 cache-misses # 0.422 M/sec

    300.485365680 seconds time elapsed

    With cgroup_disable=memory

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7182183.546587 task-clock-msecs # 23.915 CPUs
    425458 context-switches # 0.000 M/sec
    203 CPU-migrations # 0.000 M/sec
    92545093 page-faults # 0.013 M/sec
    6034363609986 cycles # 840.185 M/sec
    437204346785 instructions # 0.072 IPC
    6636073192 cache-references # 0.924 M/sec
    2358117732 cache-misses # 0.328 M/sec

    300.320905827 seconds time elapsed

    With this patch applied

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7191619.223977 task-clock-msecs # 23.955 CPUs
    422579 context-switches # 0.000 M/sec
    88 CPU-migrations # 0.000 M/sec
    91946060 page-faults # 0.013 M/sec
    5957054385619 cycles # 828.333 M/sec
    1058117350365 instructions # 0.178 IPC
    9161776218 cache-references # 1.274 M/sec
    1920494280 cache-misses # 0.267 M/sec

    300.218764862 seconds time elapsed

    Data from Prarit (kernel compile with make -j64 on a 64
    CPU/32G machine)

    For a single run

    Without patch

    real 27m8.988s
    user 87m24.916s
    sys 382m6.037s

    With patch

    real 4m18.607s
    user 84m58.943s
    sys 50m52.682s

    With config turned off

    real 4m54.972s
    user 90m13.456s
    sys 50m19.711s

    NOTE: The data looks counterintuitive due to the increased performance
    with the patch, even over the config being turned off. We probably need
    more runs, but so far all testing has shown that the patches definitely
    help.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Balbir Singh
    Cc: Prarit Bhargava
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages and reclaims pages from it and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targeted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the size by
    which the soft limit is exceeded. The proportion has been empirically
    determined.
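
    A minimal sketch of the proportional target described (the helper is
    hypothetical; an accessor like res_counter_soft_limit_excess() is what this
    series adds to the resource counters):

    static unsigned long example_soft_limit_target(struct mem_cgroup *mem)
    {
            /* pages by which the group exceeds its soft limit: reclaim roughly
             * this much instead of always bailing out after two loops */
            return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
    }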

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Refactor mem_cgroup_hierarchical_reclaim()

    Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
    flags, so that new parameters don't have to be passed as we make the
    reclaim routine more flexible

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Organize cgroups over their soft limit in an RB-Tree

    Introduce an RB-Tree for storing memory cgroups that are over their soft
    limit. The overall goal is to

    1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
    We are careful about updates; they take place only after a particular
    time interval has passed.
    2. Remove the node from the RB-Tree when the usage goes back below the
    soft limit.

    The next set of patches will exploit the RB-Tree to get the group that is
    over its soft limit by the largest amount and reclaim from it, when we
    face memory contention.
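
    A minimal sketch of the tree insertion described above, keyed by how far each
    group exceeds its soft limit (type and field names are illustrative, not
    necessarily those used in the tree):

    static void example_tree_insert(struct mem_cgroup_per_zone *mz,
                                    struct rb_root *root)
    {
            struct rb_node **p = &root->rb_node;
            struct rb_node *parent = NULL;
            struct mem_cgroup_per_zone *entry;

            while (*p) {
                    parent = *p;
                    entry = rb_entry(parent, struct mem_cgroup_per_zone, tree_node);
                    if (mz->usage_in_excess < entry->usage_in_excess)
                            p = &(*p)->rb_left;     /* rightmost node = largest excess */
                    else
                            p = &(*p)->rb_right;
            }
            rb_link_node(&mz->tree_node, parent, p);
            rb_insert_color(&mz->tree_node, root);
    }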

    [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
    Signed-off-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add an interface to allow get/set of soft limits. Soft limits for memory
    plus swap controller (memsw) are currently not supported. Resource
    counters have been enhanced to support soft limits and new type
    RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
    directly set and do not need any reclaim or checks before setting them to
    a newer value.

    Kamezawa-San raised a question as to whether soft limit should belong to
    res_counter. Since all resources understand the basic concepts of hard
    and soft limits, it is justified to add soft limits here. Soft limits are
    a generic resource usage feature, even file system quotas support soft
    limits.
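
    A minimal sketch of the enhancement described (field placement is illustrative;
    the other members are the usual res_counter fields):

    struct res_counter {
            unsigned long long usage;       /* current usage */
            unsigned long long max_usage;   /* high-water mark */
            unsigned long long limit;       /* hard limit: enforced */
            unsigned long long soft_limit;  /* new: advisory, set directly, no reclaim */
            unsigned long long failcnt;
            spinlock_t lock;
            struct res_counter *parent;
    };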

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
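
    A minimal sketch of the ordering the new comment documents (simplified from the
    commit path; SetPageCgroupUsed() is the flag setter referred to elsewhere in
    this series):

    pc->mem_cgroup = mem;
    /* make pc->mem_cgroup visible before the USED bit, so that anyone who
     * tests USED without lock_page_cgroup() also sees a valid mem_cgroup */
    smp_wmb();
    SetPageCgroupUsed(pc);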

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Change the memory cgroup to remove the overhead associated with accounting
    all pages in the root cgroup. As a side-effect, we can no longer set a
    memory hard limit in the root cgroup.

    A new flag to track whether the page has been accounted or not has been
    added as well. Flags are now set atomically for page_cgroup,
    pcg_default_flags is now obsolete and removed.

    [akpm@linux-foundation.org: fix a few documentation glitches]
    Signed-off-by: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the subsystem
    about the old cgroup of the threadgroup leader. No subsystem currently
    needs that information for each thread that's being moved, but if one were
    to be added (for example, one that counts tasks within a group), this
    would need to be reworked to tell the subsystem the right
    information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Now that ksm is in mainline it is better to change the default values to
    better fit most users.

    This patch changes the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is that it disables ksm by default and
    sets the number of kernel pages that can be allocated to a reasonable
    number.
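
    The defaults listed above, roughly as they would appear in mm/ksm.c (a hedged
    sketch; the last value is computed at init time from nr_free_buffer_pages()
    rather than being a compile-time constant):

    static unsigned int ksm_thread_pages_to_scan = 100;
    static unsigned int ksm_thread_sleep_millisecs = 20;
    static unsigned int ksm_run = KSM_RUN_STOP;    /* KSM is now disabled by default */
    static unsigned long ksm_max_kernel_pages;     /* set to nr_free_buffer_pages() / 4 */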

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.
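
    A minimal sketch of how a filesystem's setattr path might use the new helpers
    (the filesystem and function are hypothetical; the helper signatures follow
    this series to the best of my knowledge):

    static int examplefs_setsize(struct inode *inode, loff_t newsize)
    {
            loff_t oldsize = inode->i_size;
            int error;

            error = inode_newsize_ok(inode, newsize);    /* rlimit and s_maxbytes checks */
            if (error)
                    return error;

            i_size_write(inode, newsize);
            truncate_pagecache(inode, oldsize, newsize); /* drop now out-of-range pages */
            return 0;
    }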

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • This slipped past the previous sweeps.

    Signed-off-by: Rusty Russell
    Acked-by: Christoph Lameter

    Rusty Russell
     
  • My 58fa879e1e640a1856f736b418984ebeccee1c95 "mm: FOLL flags for GUP flags"
    broke CONFIG_NOMMU build by forgetting to update nommu.c foll_flags type:

    mm/nommu.c:171: error: conflicting types for `__get_user_pages'
    mm/internal.h:254: error: previous declaration of `__get_user_pages' was here
    make[1]: *** [mm/nommu.o] Error 1

    My 03f6462a3ae78f36eb1f0ee8b4d5ae2f7859c1d5 "mm: move highest_memmap_pfn"
    broke CONFIG_NOMMU build by forgetting to add a nommu.c highest_memmap_pfn:

    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x326): undefined reference to `highest_memmap_pfn'
    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x32d): undefined reference to `highest_memmap_pfn'

    Fix both breakages, and give myself 30 lashes (ouch!)

    Reported-by: Michal Simek
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Sep, 2009

3 commits

  • Some archs define MODULES_VADDR/MODULES_END which is not in the VMALLOC
    area. This is handled only on x86-64. This patch makes it more generic,
    so that we can use vread/vwrite to access the area.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Originally, walk_memory_resource() was introduced to traverse all memory
    of "System RAM" for detecting memory hotplug/unplug ranges. For doing so,
    the flags IORESOURCE_MEM|IORESOURCE_BUSY were used, and this was enough for
    memory hotplug.

    But when used for another purpose, /proc/kcore, this may include some
    firmware areas marked as IORESOURCE_BUSY | IORESOURCE_MEM. This patch makes
    the check strict, to find out busy "System RAM".

    Note: PPC64 keeps its own walk_memory_resource(), which walks through
    ppc64's lmb information. Because the old kclist_add() is called per lmb,
    this patch makes no difference in behavior, finally.

    And this patch removes the CONFIG_MEMORY_HOTPLUG check from this function,
    because pfn_valid() just shows "there is a memmap or not" and cannot be used
    for "there is physical memory or not"; this function is useful generically
    for scanning physical memory ranges.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Américo Wang
    Cc: David Rientjes
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A patch to give a better overview of the userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/<pid>/status by the "VmStk" value. But you get no
    information about the stack memory consumed by the threads.

    There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
    base address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

21 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    trivial: fix typo in aic7xxx comment
    trivial: fix comment typo in drivers/ata/pata_hpt37x.c
    trivial: typo in kernel-parameters.txt
    trivial: fix typo in tracing documentation
    trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
    trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
    trivial: remove unnecessary semicolons
    trivial: Fix duplicated word "options" in comment
    trivial: kbuild: remove extraneous blank line after declaration of usage()
    trivial: improve help text for mm debug config options
    trivial: doc: hpfall: accept disk device to unload as argument
    trivial: doc: hpfall: reduce risk that hpfall can do harm
    trivial: SubmittingPatches: Fix reference to renumbered step
    trivial: fix typos "man[ae]g?ment" -> "management"
    trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
    trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
    trivial: fix missing printk space in amd_k7_smp_check
    trivial: fix typo s/ketymap/keymap/ in comment
    trivial: fix typo "to to" in multiple files
    trivial: fix typos in comments s/DGBU/DBGU/
    ...

    Linus Torvalds
     
  • Some architectures (like the Blackfin arch) implement some of the
    "simpler" features that one would expect out of a MMU such as memory
    protection.

    In our case, we actually get read/write/exec protection down to the page
    boundary so processes can't stomp on each other, let alone the kernel.

    There is a performance decrease (which depends greatly on the workload),
    however, as the hardware/software interaction was not optimized at design
    time.

    Signed-off-by: Bernd Schmidt
    Signed-off-by: Bryan Wu
    Signed-off-by: Mike Frysinger
    Acked-by: David Howells
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schmidt
     
  • When the mm being switched to matches the active mm, we don't need to
    increment and then drop the mm count. In a simple benchmark this happens
    in about 50% of time. Making that conditional reduces contention on that
    cacheline on SMP systems.
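
    A minimal sketch of the fast path described above (simplified from use_mm();
    task-lock handling omitted):

    active_mm = tsk->active_mm;
    if (active_mm != mm)
            atomic_inc(&mm->mm_count);      /* only pin the new mm if it differs */
    tsk->mm = mm;
    tsk->active_mm = mm;
    switch_mm(active_mm, mm, tsk);
    if (active_mm != mm)
            mmdrop(active_mm);              /* and only drop the old one if replaced */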

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Anyone who wants to do copy to/from user from a kernel thread, needs
    use_mm (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. Next intended user,
    besides aio, will be vhost-net.
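
    A minimal sketch of the pattern this move enables for kernel threads (the
    worker function and its arguments are hypothetical; use_mm()/unuse_mm() are the
    helpers being moved):

    static void example_worker(struct mm_struct *mm, void __user *uaddr,
                               const void *buf, size_t len)
    {
            use_mm(mm);                     /* adopt the user mm in this kernel thread */
            if (copy_to_user(uaddr, buf, len))
                    printk(KERN_WARNING "example: copy_to_user failed\n");
            unuse_mm(mm);                   /* detach from it again */
    }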

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Fixes the following kmemcheck false positive (the compiler is using
    a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):

    [ 0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
    [ 0.352000] CPU0 attaching NULL sched-domain.
    [ 0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized
    memory (9f8020fc)
    [ 0.361000]
    a44240820000000041f6998100000000000000000000000000000000ff030000
    [ 0.368000] i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u
    u
    [ 0.375000] ^
    [ 0.376000]
    [ 0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
    [ 0.378000] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 0.379000] EIP is at shmem_fill_super+0xb5/0x120
    [ 0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
    [ 0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
    [ 0.382000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
    [ 0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.385000] DR6: ffff4ff0 DR7: 00000400
    [ 0.386000] [] get_sb_nodev+0x3c/0x80
    [ 0.388000] [] shmem_get_sb+0x14/0x20
    [ 0.390000] [] vfs_kern_mount+0x4f/0x120
    [ 0.392000] [] init_tmpfs+0x7e/0xb0
    [ 0.394000] [] do_basic_setup+0x17/0x30
    [ 0.396000] [] kernel_init+0x57/0xa0
    [ 0.398000] [] kernel_thread_helper+0x7/0x10
    [ 0.400000] [] 0xffffffff
    [ 0.402000] khelper used greatest stack depth: 2820 bytes left
    [ 0.407000] calling init_mmap_min_addr+0x0/0x10 @ 1
    [ 0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs

    Reported-by: Ingo Molnar
    Analysed-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.
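
    A minimal userspace sketch of the new flag (assumes huge pages have been
    reserved, e.g. via /proc/sys/vm/nr_hugepages, and a 2 MB huge page size; the
    length must be a multiple of the huge page size):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 2UL << 20;         /* one 2 MB huge page */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }
            munmap(p, len);
            return 0;
    }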

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
    drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more proper place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worth while to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worth while--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock).
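
    A minimal sketch of the resulting condition (simplified; see vma_adjust() in
    mm/mmap.c for the real code):

    struct anon_vma *anon_vma = NULL;

    if (vma->anon_vma && (insert || importer || start != vma->vm_start))
            anon_vma = vma->anon_vma;       /* brk() only moving 'end' takes none of these */
    if (anon_vma)
            spin_lock(&anon_vma->lock);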

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
    CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
    shmem_acl.o from being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
    header files is mm.h shmem_lock() - give that a shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If (flags & MAP_LOCKED) is true, it means vm_flags already contains
    the VM_LOCKED bit, which is set by calc_vm_flag_bits().

    So there is no need to set it again; just remove the redundant code.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
    those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

    Contrary to how I'd imagined it, there's nothing ugly about this, just a
    zero_pfn test built into one or another block of vm_normal_page().

    But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
    my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
    ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
    that: it would have to take vm_flags to weed out some cases.
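
    A minimal sketch of the generic inlines described (MIPS supplies its own
    versions to cope with its multiple zero-page colours):

    extern unsigned long zero_pfn;

    static inline int is_zero_pfn(unsigned long pfn)
    {
            return pfn == zero_pfn;
    }

    static inline unsigned long my_zero_pfn(unsigned long addr)
    {
            return zero_pfn;
    }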

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I'm still reluctant to clutter __get_user_pages() with another flag, just
    to avoid touching ZERO_PAGE count in mlock(); though we can add that later
    if it shows up as an issue in practice.

    But when mlocking, we can test page->mapping slightly earlier, to avoid
    the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
    didn't lock_page in olden ZERO_PAGE days, so we might have regressed.

    And when munlocking, it turns out that FOLL_DUMP coincidentally does
    what's needed to avoid all updates to ZERO_PAGE, so use that here also.
    Plus add comment suggested by KAMEZAWA Hiroyuki.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
    advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
    using in 2.6.24. And there were a couple of regression reports on LKML.

    Following suggestions from Linus, reinstate do_anonymous_page() use of
    the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
    with (map)count updates - let vm_normal_page() regard it as abnormal.

    Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
    most powerpc): that's not essential, but minimizes additional branches
    (keeping them in the unlikely pte_special case); and incidentally
    excludes mips (some models of which needed eight colours of ZERO_PAGE
    to avoid costly exceptions).

    Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
    callers won't want to make exceptions for it, so increment its count
    there. Changes to mlock and migration? happily seems not needed.

    In most places it's quicker to check pfn than struct page address:
    prepare a __read_mostly zero_pfn for that. Does get_dump_page()
    still need its ZERO_PAGE check? probably not, but keep it anyway.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page() has been wrong to dirty the pte regardless.
    If it's not going to mark the pte writable, then it won't help
    to mark it dirty here, and clogs up memory with pages which will
    need swap instead of being thrown away. Especially wrong if no
    overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
    we could exceed the limit and OOM despite no overcommit.
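
    A minimal sketch of the corrected behaviour (simplified from
    do_anonymous_page(): only a writable mapping gets a writable, dirty pte):

    entry = mk_pte(page, vma->vm_page_prot);
    if (vma->vm_flags & VM_WRITE)
            entry = pte_mkwrite(pte_mkdirty(entry));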

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The "FOLL_ANON optimization" and its use_zero_page() test have caused
    confusion and bugs: why does it test VM_SHARED? for the very good but
    unsatisfying reason that VMware crashed without. As we look to maybe
    reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

    Easily done: it's silly for __get_user_pages() and follow_page() to
    be guessing whether it's safe to assume that they're being used for
    a coredump (which can take a shortcut snapshot where other uses must
    handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

    get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In preparation for the next patch, add a simple get_dump_page(addr)
    interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
    get_user_pages() directly. They're not interested in errors: they
    just want to use holes as much as possible, to save space and make
    sure that the data is aligned where the headers said it would be.

    Oh, and don't use that horrid DUMP_SEEK(off) macro!
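
    A minimal sketch of the dumper-side usage described (simplified; dump_write()
    and dump_seek() stand for the ELF coredump helpers, and error handling is
    omitted):

    struct page *page = get_dump_page(addr);

    if (page) {
            void *kaddr = kmap(page);
            dump_write(file, kaddr, PAGE_SIZE);
            kunmap(page);
            page_cache_release(page);
    } else {
            dump_seek(file, PAGE_SIZE);     /* a hole: just advance the file position */
    }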

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
    flags added solely to prevent __get_user_pages() from doing some of
    what it usually does, in the munlock case: we can now remove them.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins