11 Dec, 2008

5 commits

  • Miles Lane, tailing /sys files, hit a BUG which Pekka Enberg has tracked
    down to my commit 966c8c12dc9e77f931e2281ba25d2f0244b06949
    ("sprint_symbol(): use less stack"), which exposed a bug in slub's
    list_locations(): kallsyms_lookup() writes a 0 to
    namebuf[KSYM_NAME_LEN-1], but that was beyond the end of the page
    provided.

    The 100 bytes of slop which list_locations() allows at the end of the
    page look roughly enough for all the other stuff it might print after
    the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier
    than before.

    Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need
    KSYM_SYMBOL_LEN buffers, and vmallocinfo uses a 2*KSYM_NAME_LEN buffer
    where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
    them.
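
    A minimal sketch of the sizing rule: kallsyms_lookup() fills a bare
    symbol name of up to KSYM_NAME_LEN bytes, but sprint_symbol() appends
    module name and offset, so its callers need the larger KSYM_SYMBOL_LEN:

        #include <linux/kallsyms.h>

        static void show_symbol(unsigned long addr)
        {
                char buf[KSYM_SYMBOL_LEN];      /* not KSYM_NAME_LEN */

                sprint_symbol(buf, addr);
                printk(KERN_INFO "%s\n", buf);
        }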

    [akpm@linux-foundation.org: ftrace.h needs module.h]
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Miles Lane
    Acked-by: Pekka Enberg
    Acked-by: Steven Rostedt
    Acked-by: Frederic Weisbecker
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
    gets the page addresses from user-space and puts the corresponding
    statuses back while holding the mmap_sem for read. There is no need to
    hold mmap_sem across those user-space accesses, which may page-fault.

    This patch adds a temporary address and status buffer so as to only
    hold mmap_sem while working on these kernel buffers. This is
    implemented by extracting do_pages_stat_array() out of do_pages_stat().
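
    A minimal sketch of the resulting chunking pattern (the chunk size and
    the extracted helper follow the patch description):

        #define DO_PAGES_STAT_CHUNK_NR 16

        static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
                                 const void __user * __user *pages,
                                 int __user *status)
        {
                const void __user *chunk_pages[DO_PAGES_STAT_CHUNK_NR];
                int chunk_status[DO_PAGES_STAT_CHUNK_NR];
                unsigned long i, chunk_nr;

                for (i = 0; i < nr_pages; i += chunk_nr) {
                        chunk_nr = min(nr_pages - i,
                                       (unsigned long)DO_PAGES_STAT_CHUNK_NR);

                        /* user copies may fault: do them without mmap_sem */
                        if (copy_from_user(chunk_pages, &pages[i],
                                           chunk_nr * sizeof(*chunk_pages)))
                                return -EFAULT;

                        down_read(&mm->mmap_sem);
                        do_pages_stat_array(mm, chunk_nr, chunk_pages,
                                            chunk_status);
                        up_read(&mm->mmap_sem);

                        if (copy_to_user(&status[i], chunk_status,
                                         chunk_nr * sizeof(*chunk_status)))
                                return -EFAULT;
                }
                return 0;
        }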

    Signed-off-by: Brice Goglin
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Fix a total bootup freeze on ia64.

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Li Zefan
    Reported-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, lru_add_drain_all() has two versions:
    (1) use schedule_on_each_cpu()
    (2) don't use schedule_on_each_cpu()

    Gerald Schaefer reported that it doesn't work well on an SMP (non-NUMA)
    S390 machine.

    offline_pages() calls lru_add_drain_all() followed by drain_all_pages().
    While drain_all_pages() works on each cpu, lru_add_drain_all() only runs
    on the current cpu for architectures w/o CONFIG_NUMA. This let us run
    into the BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during
    memory hotplug stress test on s390. The page in question was still on the
    pcp list, because of a race with lru_add_drain_all() and drain_all_pages()
    on different cpus.

    In practice, almost every machine has CONFIG_UNEVICTABLE_LRU=y, and so
    uses version (1) of lru_add_drain_all() even when the machine is UP.

    The ifdef therefore has little value; simply removing it is better.
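
    A minimal sketch of the unified version that remains after removing the
    ifdef:

        static void lru_add_drain_per_cpu(struct work_struct *dummy)
        {
                lru_add_drain();
        }

        /* Drain the per-cpu pagevecs on every online cpu.
         * Returns 0 on success. */
        int lru_add_drain_all(void)
        {
                return schedule_on_each_cpu(lru_add_drain_per_cpu);
        }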

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Acked-by: Gerald Schaefer
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • On second thoughts, this is just going to disturb people while telling us
    things which we already knew.

    Cc: Peter Korsgaard
    Cc: Peter Zijlstra
    Cc: Kay Sievers
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

03 Dec, 2008

2 commits

  • Count the insertion of new pages in the statistics used to drive the
    pageout scanning code. This should help the kernel quickly evict
    streaming file IO.

    We count on the fact that new file pages start on the inactive file LRU
    and new anonymous pages start on the active anon list. This means
    streaming file IO will increment the recent scanned file statistic, while
    leaving the recent rotated file statistic alone, driving pageout scanning
    to the file LRUs.

    Pageout activity does its own list manipulation.
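
    A minimal sketch of the accounting idea (field names as in this era's
    struct zone; simplified):

        /* A page joining an LRU counts as "recently scanned" for that
         * list, while recent_rotated is deliberately left alone, so the
         * scan ratio tilts pageout toward the list receiving new pages. */
        static void count_lru_insertion(struct zone *zone, int file)
        {
                zone->recent_scanned[file]++;
        }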

    Signed-off-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Tested-by: Gene Heskett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Devices which share the same queue, like floppies and mtd devices, get
    registered multiple times in the bdi interface, but bdi accounts only the
    last registered device of the devices sharing one queue.

    On remove, all earlier registered devices leak, stay around in sysfs, and
    cause "duplicate filename" errors if the devices are re-created.

    This patch prevents the creation of multiple bdi interfaces per queue;
    the bdi device carries the dev_t name of the first block device
    registered from the pool of devices sharing the same queue.
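
    A minimal sketch of the guard (the helper name here is hypothetical;
    the point is that registration becomes a no-op once the queue's bdi
    already has a device):

        static int bdi_register_once(struct backing_dev_info *bdi, dev_t devt)
        {
                if (bdi->dev) {
                        WARN_ON(1);     /* flag the misbehaving driver */
                        return 0;       /* already registered: no-op */
                }
                return bdi_register_dev(bdi, devt);
        }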

    [akpm@linux-foundation.org: add a WARN_ON so we know which drivers are misbehaving]
    Tested-by: Peter Korsgaard
    Acked-by: Peter Zijlstra
    Signed-off-by: Kay Sievers
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kay Sievers
     

02 Dec, 2008

2 commits

  • Fixes for memcg/memory hotplug.

    While memory hotplug allocates/frees the memmap, page_cgroup doesn't
    free page_cgroup at OFFLINE when page_cgroup was allocated via bootmem
    (because freeing bootmem requires special care).

    Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
    by memory hotplug, page_cgroup->page == page is no longer true.

    But the current MEM_ONLINE handler doesn't check for this, and doesn't
    update page_cgroup->page when it's not necessary to allocate
    page_cgroup. (This went unnoticed because the memmap is not freed when
    SPARSEMEM_VMEMMAP is y.)

    I also noticed that MEM_ONLINE can be called against "part of a
    section", so freeing page_cgroup at CANCEL_ONLINE would cause trouble
    (freeing page_cgroup still in use). Don't roll back at CANCEL.

    One more thing: the current memory hotplug notifier chain is stopped by
    slub, because it sets NOTIFY_STOP_MASK in its return value, so
    page_cgroup's callback is never called (it now has lower priority than
    slub's). I think this slub behavior is unintentional (a BUG), and fix
    it.

    Another way to handle page_cgroup allocation could be considered:
    - free page_cgroup at OFFLINE even if it came from bootmem,
      and remove the special handler. But that requires more changes.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041
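
    A minimal sketch of the slub fix (simplified): only fold an error into
    NOTIFY_STOP_MASK via notifier_from_errno(), so that on success later
    callbacks such as page_cgroup's still run:

        static int slab_memory_callback(struct notifier_block *self,
                                        unsigned long action, void *arg)
        {
                int ret = 0;

                switch (action) {
                case MEM_GOING_ONLINE:
                case MEM_GOING_OFFLINE:
                        /* ret = ...; set up / tear down per-node data */
                        break;
                default:
                        break;
                }
                if (ret)
                        ret = notifier_from_errno(ret);
                else
                        ret = NOTIFY_OK;
                return ret;
        }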

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Tested-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Jim Radford has reported that the vmap subsystem rewrite was sometimes
    causing his VIVT ARM system to behave strangely (seemed like going into
    infinite loops trying to fault in pages to userspace).

    We determined that the problem was most likely due to a cache aliasing
    issue. flush_cache_vunmap was only being called at the moment the page
    tables were to be taken down, however with lazy unmapping, this can happen
    after the page has subsequently been freed and allocated for something
    else. The dangling alias may still have dirty data attached to it.

    The fix for this problem is to do the cache flushing when the caller has
    called vunmap -- it would be a bug for them to write anything else to the
    mapping at that point.

    That appeared to solve Jim's problems.
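
    A minimal sketch of the fix: flush while the mapping still exists, at
    vunmap time, rather than when the lazy unmap finally tears down the
    page tables:

        static void free_unmap_vmap_area(struct vmap_area *va)
        {
                flush_cache_vunmap(va->va_start, va->va_end);
                free_unmap_vmap_area_noflush(va);       /* lazy unmap later */
        }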

    Reported-by: Jim Radford
    Signed-off-by: Nick Piggin
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Dec, 2008

2 commits


20 Nov, 2008

7 commits

  • Fix the old comment on the scan ratio calculations.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • In the past, GFP_NOFS (but of course not GFP_NOIO) was allowed to reclaim
    by writing to swap. That got partially broken in 2.6.23, when may_enter_fs
    initialization was moved up before the allocation of swap, so its
    PageSwapCache test was failing the first time around.

    Fix it by setting may_enter_fs when add_to_swap() succeeds with
    __GFP_IO. In fact, check __GFP_IO before calling add_to_swap():
    allocating swap we're not ready to use just increases disk seeking.
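
    A minimal sketch of the reordered logic in shrink_page_list() (a
    fragment; keep_locked and activate_locked are the function's existing
    labels):

        if (PageAnon(page) && !PageSwapCache(page)) {
                if (!(sc->gfp_mask & __GFP_IO))
                        goto keep_locked;       /* don't even allocate swap */
                if (!add_to_swap(page, GFP_ATOMIC))
                        goto activate_locked;
                may_enter_fs = 1;       /* swap writeout may enter the fs */
        }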

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Page migration's writeout() has got understandably confused by the nasty
    AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has
    unlocked the page, so writeout() then needs to relock it.
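
    A minimal sketch of the relock rule in writeout() (a fragment):

        rc = mapping->a_ops->writepage(page, &wbc);

        if (rc != AOP_WRITEPAGE_ACTIVATE)
                /* unlocked by writepage(), on success and on error alike */
                lock_page(page);

        return (rc < 0) ? -EIO : -EAGAIN;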

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently, vmalloc restarts its search for a free area when it fails to
    find one. The reason is that there are areas which are lazily freed and
    could possibly be freed by now. However, the current implementation
    restarts the search from the last failing address, which is pretty much
    by definition at the end of the address space. So, we fail again.

    This patch instead restarts the search from the beginning of the
    requested vstart address. This fixes the regression in running KVM
    virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349,
    caused by commit db64fe02258f1507e13fe5212a989922323685ce.
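
    A minimal sketch of the retry path in alloc_vmap_area() (simplified;
    'found' stands in for the real search-success condition):

        retry:
                addr = ALIGN(vstart, align);
                /* ... walk the vmap_area tree upward from addr, looking
                 * for a hole of the right size ... */
                if (!found) {
                        if (!purged) {
                                purge_vmap_area_lazy();
                                purged = 1;
                                goto retry;     /* restart from vstart */
                        }
                        return ERR_PTR(-EBUSY);
                }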

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • An initial vmalloc failure should start off a synchronous flush of lazy
    areas, in case someone is in progress flushing them already, which could
    cause us to return an allocation failure even if there is plenty of KVA
    free.
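
    A minimal sketch, with the sync flag forcing the purge to wait for a
    purge already in flight instead of returning early:

        static void purge_vmap_area_lazy(void)
        {
                unsigned long start = ULONG_MAX, end = 0;

                __purge_vmap_area_lazy(&start, &end, 1, 0);     /* sync=1 */
        }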

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix off by one bug in the KVA allocator that can leave gaps in the address
    space.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changed. That is wrong, because
    N_ONLINE just means the node has a pgdat; when a node has (or gains)
    memory, we use N_HIGH_MEMORY. So we should invoke the update function
    after node_states[N_HIGH_MEMORY] changes, just as its introducing
    commit says.

    This patch fixes it, and uses a memory hotplug notifier instead of
    calling cpuset_track_online_nodes() directly.
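
    A minimal sketch of the notifier-based update (the priority value is
    illustrative):

        static int cpuset_track_online_nodes(struct notifier_block *self,
                                             unsigned long action, void *arg)
        {
                cgroup_lock();
                switch (action) {
                case MEM_ONLINE:
                case MEM_OFFLINE:
                        top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
                        break;
                default:
                        break;
                }
                cgroup_unlock();
                return NOTIFY_OK;
        }

        /* registered at boot instead of the old direct call: */
        hotplug_memory_notifier(cpuset_track_online_nodes, 10);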

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     

17 Nov, 2008

1 commit

  • Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y):
    mm/mlock.c: In function `__mlock_vma_pages_range':
    mm/mlock.c:165: warning: `ret' might be used uninitialized in this function

    Signed-off-by: Helge Deller
    [ It isn't ever really used uninitialized, since no caller should ever
    call this function with an empty range. But the compiler is correct
    that from a local analysis standpoint that is impossible to see, and
    fixing the warning is appropriate. ]
    Signed-off-by: Linus Torvalds

    Helge Deller
     

16 Nov, 2008

1 commit

  • Hugh Dickins reported that show_page_path() is buggy and unsafe because:

    - it lacks a dput() to balance d_find_alias()
    - it doesn't handle vma->vm_mm->owner == NULL
    - it lacks lock_page()

    It was only for debugging, so rather than trying to fix it, just remove
    it now.

    Reported-by: Hugh Dickins
    Signed-off-by: Hugh Dickins
    Signed-off-by: KOSAKI Motohiro
    CC: Lee Schermerhorn
    CC: Rik van Riel
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Nov, 2008

5 commits

  • The start pfn calculation in page_cgroup's memory hotplug notifier chain
    is wrong.

    Tested-by: Badari Pulavarty
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • lockdep warns with the message below at boot time on one of my test
    machines: schedule_on_each_cpu() shouldn't be called while the task
    holds mmap_sem.

    Actually, lru_add_drain_all() exists to prevent unevictable pages from
    staying on a reclaimable LRU list. But the current unevictable code can
    rescue unevictable pages even when they stay on a reclaimable list, so
    removing the call is better.

    In addition, this patch adds lru_add_drain_all() to sys_mlock() and
    sys_mlockall(). It isn't required, but it reduces the chance of failing
    to move pages to the unevictable list; such failures can be rescued by
    vmscan later, but reducing them is better.

    Note: if the above rescuing happens, the Mlocked and Unevictable fields
    in /proc/meminfo can mismatch, but that doesn't cause any real trouble.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.28-rc2-mm1 #2
    -------------------------------------------------------
    lvm/1103 is trying to acquire lock:
    (&cpu_hotplug.lock){--..}, at: [] get_online_cpus+0x29/0x50

    but task is already holding lock:
    (&mm->mmap_sem){----}, at: [] sys_mlockall+0x4e/0xb0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){----}:
    [] check_noncircular+0x82/0x110
    [] might_fault+0x4a/0xa0
    [] validate_chain+0xb11/0x1070
    [] might_fault+0x4a/0xa0
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0 (*) grab mmap_sem
    [] might_fault+0x4a/0xa0
    [] might_fault+0x7b/0xa0
    [] might_fault+0x4a/0xa0
    [] copy_to_user+0x30/0x60
    [] filldir+0x7c/0xd0
    [] sysfs_readdir+0x11a/0x1f0 (*) grab sysfs_mutex
    [] filldir+0x0/0xd0
    [] filldir+0x0/0xd0
    [] vfs_readdir+0x86/0xa0 (*) grab i_mutex
    [] sys_getdents+0x6b/0xc0
    [] syscall_call+0x7/0xb
    [] 0xffffffff

    -> #2 (sysfs_mutex){--..}:
    [] check_noncircular+0x82/0x110
    [] sysfs_addrm_start+0x2c/0xc0
    [] validate_chain+0xb11/0x1070
    [] sysfs_addrm_start+0x2c/0xc0
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0 (*) grab sysfs_mutex
    [] sysfs_addrm_start+0x2c/0xc0
    [] mutex_lock_nested+0xa5/0x2f0
    [] sysfs_addrm_start+0x2c/0xc0
    [] sysfs_addrm_start+0x2c/0xc0
    [] sysfs_addrm_start+0x2c/0xc0
    [] create_dir+0x3f/0x90
    [] sysfs_create_dir+0x29/0x50
    [] _spin_unlock+0x25/0x40
    [] kobject_add_internal+0xcd/0x1a0
    [] kobject_set_name_vargs+0x3a/0x50
    [] kobject_init_and_add+0x2d/0x40
    [] sysfs_slab_add+0xd2/0x180
    [] sysfs_add_func+0x0/0x70
    [] sysfs_add_func+0x5c/0x70 (*) grab slub_lock
    [] run_workqueue+0x172/0x200
    [] run_workqueue+0x10f/0x200
    [] worker_thread+0x0/0xf0
    [] worker_thread+0x9c/0xf0
    [] autoremove_wake_function+0x0/0x50
    [] worker_thread+0x0/0xf0
    [] kthread+0x42/0x70
    [] kthread+0x0/0x70
    [] kernel_thread_helper+0x7/0x1c
    [] 0xffffffff

    -> #1 (slub_lock){----}:
    [] check_noncircular+0xd/0x110
    [] slab_cpuup_callback+0x11f/0x1d0
    [] validate_chain+0xb11/0x1070
    [] slab_cpuup_callback+0x11f/0x1d0
    [] mark_lock+0x35d/0xd00
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] slab_cpuup_callback+0x11f/0x1d0
    [] down_read+0x43/0x80
    [] slab_cpuup_callback+0x11f/0x1d0 (*) grab slub_lock
    [] slab_cpuup_callback+0x11f/0x1d0
    [] notifier_call_chain+0x3c/0x70
    [] _cpu_up+0x84/0x110
    [] cpu_up+0x4b/0x70 (*) grab cpu_hotplug.lock
    [] kernel_init+0x0/0x170
    [] kernel_init+0xb5/0x170
    [] kernel_init+0x0/0x170
    [] kernel_thread_helper+0x7/0x1c
    [] 0xffffffff

    -> #0 (&cpu_hotplug.lock){--..}:
    [] validate_chain+0x5af/0x1070
    [] dev_status+0x0/0x50
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] get_online_cpus+0x29/0x50
    [] mutex_lock_nested+0xa5/0x2f0
    [] get_online_cpus+0x29/0x50
    [] get_online_cpus+0x29/0x50
    [] lru_add_drain_per_cpu+0x0/0x10
    [] get_online_cpus+0x29/0x50 (*) grab cpu_hotplug.lock
    [] schedule_on_each_cpu+0x32/0xe0
    [] __mlock_vma_pages_range+0x85/0x2c0
    [] __lock_acquire+0x285/0xa10
    [] vma_merge+0xa9/0x1d0
    [] mlock_fixup+0x180/0x200
    [] do_mlockall+0x78/0x90 (*) grab mmap_sem
    [] sys_mlockall+0x81/0xb0
    [] syscall_call+0x7/0xb
    [] 0xffffffff

    other info that might help us debug this:

    1 lock held by lvm/1103:
    #0: (&mm->mmap_sem){----}, at: [] sys_mlockall+0x4e/0xb0

    stack backtrace:
    Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
    Call Trace:
    [] print_circular_bug_tail+0x7c/0xd0
    [] validate_chain+0x5af/0x1070
    [] dev_status+0x0/0x50
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] get_online_cpus+0x29/0x50
    [] mutex_lock_nested+0xa5/0x2f0
    [] get_online_cpus+0x29/0x50
    [] get_online_cpus+0x29/0x50
    [] lru_add_drain_per_cpu+0x0/0x10
    [] get_online_cpus+0x29/0x50
    [] schedule_on_each_cpu+0x32/0xe0
    [] __mlock_vma_pages_range+0x85/0x2c0
    [] __lock_acquire+0x285/0xa10
    [] vma_merge+0xa9/0x1d0
    [] mlock_fixup+0x180/0x200
    [] do_mlockall+0x78/0x90
    [] sys_mlockall+0x81/0xb0
    [] syscall_call+0x7/0xb

    Signed-off-by: KOSAKI Motohiro
    Tested-by: Kamalesh Babulal
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If all allowable memory is unreclaimable, it is possible to loop forever
    in the page allocator for allocations without __GFP_NORETRY.

    During this time, it is also possible for a task's cpuset to expand its
    set of allowable nodes so that it now includes free memory. The cached
    copy of this set, current->mems_allowed, is stale, however, since there
    has not been a subsequent call to cpuset_update_task_memory_state().

    The cached copy of the set of allowable nodes is now updated in the page
    allocator's slow path so the additional memory is available to
    get_page_from_freelist().
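
    A minimal sketch of where the refresh lands (a fragment of the
    allocator's slow path, simplified):

        /* The fast path failed; before retrying, re-read the possibly
         * expanded cpuset into current->mems_allowed. */
        cpuset_update_task_memory_state();

        /* ... wake kswapd and retry get_page_from_freelist() with the
         * fresh nodemask ... */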

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: David Rientjes
    Cc: Paul Menage
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Oops. Part of the hugetlb private reservation code was not fully
    converted to use hstates.

    When a huge page must be unmapped from VMAs due to a failed COW,
    HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of
    the page size being used. This works if the VMA is using the default
    huge page size. Otherwise we might unmap too much, too little, or
    trigger a BUG_ON. Rare but serious -- fix it.
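
    A minimal sketch of the fix, deriving the range from the VMA's hstate
    rather than the compile-time HPAGE_SIZE:

        struct hstate *h = hstate_vma(vma);

        unmap_hugepage_range(vma, address & huge_page_mask(h),
                             (address & huge_page_mask(h)) + huge_page_size(h),
                             page);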

    Signed-off-by: Adam Litke
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The STACK_GROWSUP case of stack expansion was missing a test for 'prev',
    which got removed by commit cb8f488c33539f096580e202f5438a809195008f
    ("mmap.c: deinline a few functions") by mistake.

    I found my original email in my "sent" folder. The patch in that mail
    does NOT remove !prev; that change was added by someone else.

    OK, I think we are not much interested in who did it; let's
    fix it for good.

    [ "It looks like this was caused by me fixing rejects. That was the
    fancy include-lots-of-context-so-it-wont-apply patch." - akpm ]
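
    A minimal sketch of the restored test in the STACK_GROWSUP variant of
    find_extend_vma():

        vma = find_vma_prev(mm, addr, &prev);
        if (vma && (vma->vm_start <= addr))
                return vma;
        /* prev may be NULL when nothing is mapped below addr: test it */
        if (!prev || expand_stack(prev, addr))
                return NULL;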

    Reported-and-bisected-by: Helge Deller
    Signed-off-by: Denys Vlasenko
    Cc: Andrew Morton
    Cc: Jiri Kosina
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     

07 Nov, 2008

10 commits

  • Xen can end up calling vm_unmap_aliases() before vmalloc_init() has
    been called. In this case it's safe to make it a simple no-op.
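
    A minimal sketch of the early-out (the vmap_initialized flag is set at
    the end of vmalloc_init()):

        void vm_unmap_aliases(void)
        {
                if (unlikely(!vmap_initialized))
                        return;         /* nothing mapped yet: safe no-op */

                /* ... normal flush of lazily unmapped aliases ... */
        }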

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Linux Memory Management List
    Cc: Nick Piggin
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • * master.kernel.org:/home/rmk/linux-2.6-arm:
    [ARM] xsc3: fix xsc3_l2_inv_range
    [ARM] mm: fix page table initialization
    [ARM] fix naming of MODULE_START / MODULE_END
    ARM: OMAP: Fix define for twl4030 irqs
    ARM: OMAP: Fix get_irqnr_and_base to clear spurious interrupt bits
    ARM: OMAP: Fix debugfs_create_*'s error checking method for arm/plat-omap
    ARM: OMAP: Fix compiler warnings in gpmc.c
    [ARM] fix VFP+softfloat binaries

    Linus Torvalds
     
  • My last bugfix here (adding zone->lock) introduced a new problem: Using
    page_zone(pfn_to_page(pfn)) to get the zone after the for() loop is wrong.
    pfn will then be >= end_pfn, which may be in a different zone or not
    present at all. This may lead to an addressing exception in page_zone()
    or spin_lock_irqsave().

    Now I use __first_valid_page() again after the loop to find a valid page
    for page_zone().
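
    A minimal sketch of the corrected lookup (a fragment, with error
    handling simplified):

        /* pfn == end_pfn after the loop, so pfn_to_page(pfn) may point
         * into a different zone, or at nothing at all. Use a page known
         * to be valid within the range instead. */
        page = __first_valid_page(start_pfn, end_pfn - start_pfn);
        if (!page)
                return -EBUSY;
        zone = page_zone(page);
        spin_lock_irqsave(&zone->lock, flags);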

    Signed-off-by: Gerald Schaefer
    Acked-by: Nathan Fontenot
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Parameter @mem was removed in v2.6.26; now delete its comment.

    Signed-off-by: Qinghuang Feng
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qinghuang Feng
     
  • It's insufficient to simply compare node ids when warning about offnode
    page_structs since it's possible to still have local affinity.

    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Move the migrate_prep() call outside the mmap_sem for the following
    system calls:

    1. sys_move_pages
    2. sys_migrate_pages
    3. sys_mbind()

    It really does not matter when we flush the lru. The system is free to
    add pages onto the lru even during migration which will make the page
    migration either skip the page (mbind, migrate_pages) or return a busy
    state (move_pages).

    Fixes this lockdep warning (and potential deadlock):

    Some VM place has
    mmap_sem -> kevent_wq via lru_add_drain_all()

    net/core/dev.c::dev_ioctl() has
    rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault.

    linkwatch_event has
    kevent_wq -> rtnl_lock
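
    A minimal sketch of the reordering, as it would look in one of these
    paths (simplified):

        migrate_prep();                 /* calls lru_add_drain_all() */

        down_read(&mm->mmap_sem);
        /* ... gather pages and migrate; pages added to the LRU meanwhile
         * are simply skipped or reported busy ... */
        up_read(&mm->mmap_sem);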

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Reported-by: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump
    task state information for thread group leaders. The kernel log gets
    quickly overwhelmed on machines with a massive number of threads by
    dumping non-thread group leaders.
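
    A minimal sketch of the tightened loop (simplified; the printed fields
    are elided):

        struct task_struct *p;

        /* for_each_process() visits one task per thread group, unlike
         * do_each_thread(), which visits every thread. */
        for_each_process(p) {
                if (!p->mm)             /* skip kernel threads */
                        continue;
                /* print one line of state for this thread group leader */
        }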

    Reviewed-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • As we can determine exactly when a gigantic page is in use we can optimise
    the common regular page cases by pulling out gigantic page initialisation
    into its own function. As gigantic pages are never released to buddy we
    do not need a destructor. This effectively reverts the previous change to
    the main buddy allocator. It also adds a paranoid check to ensure we
    never release gigantic pages from hugetlbfs to the main buddy.
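
    A minimal sketch of the dispatch helper added by this patch (shape per
    the patch description):

        static void prep_compound_huge_page(struct page *page, int order)
        {
                if (unlikely(order > (MAX_ORDER - 1)))
                        prep_compound_gigantic_page(page, order);
                else
                        prep_compound_page(page, order);
        }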

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • When working with hugepages, hugetlbfs assumes that those hugepages are
    smaller than MAX_ORDER. Specifically it assumes that the mem_map is
    contiguous and uses that to optimise access to the elements of the mem_map
    that represent the hugepage. Gigantic pages (such as 16GB pages on
    powerpc) by definition are of greater order than MAX_ORDER (larger than
    MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of
    the buddy allocator guarantees for the contiguity of the mem_map, which
    ensures that the mem_map is at least contiguous for maximally aligned
    areas of MAX_ORDER_NR_PAGES pages.

    This patch adds new mem_map accessors and iterator helpers which handle
    any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these
    to implement gigantic page versions of copy_huge_page and
    clear_huge_page, and to allow follow_hugetlb_page to handle gigantic
    pages.
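
    A minimal sketch of one such iterator helper: step by pointer
    arithmetic within a MAX_ORDER block, but re-validate the pfn at every
    MAX_ORDER_NR_PAGES boundary, where the mem_map may be discontiguous:

        static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base, int offset)
        {
                if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
                        unsigned long pfn = page_to_pfn(base) + offset;

                        if (!pfn_valid(pfn))
                                return NULL;
                        return pfn_to_page(pfn);
                }
                return iter + 1;
        }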

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • As of 73bdf0a60e607f4b8ecc5aec597105976565a84f, the kernel needs
    to know where modules are located in the virtual address space.
    On ARM, we located this region between MODULE_START and MODULE_END.
    Unfortunately, everyone else calls it MODULES_VADDR and MODULES_END.
    Update ARM to use the same naming, so is_vmalloc_or_module_addr()
    can work properly. Also update the comment on mm/vmalloc.c to
    reflect that ARM also places modules in a separate region from the
    vmalloc space.

    Signed-off-by: Russell King

    Russell King
     

31 Oct, 2008

3 commits

  • Junjiro R. Okajima reported a problem where knfsd crashes if you are
    using it to export shmemfs objects and run strict overcommit. In this
    situation the current->mm based modifier to the overcommit goes through a
    NULL pointer.

    We could simply check for NULL and skip the modifier but we've caught
    other real bugs in the past from mm being NULL here - cases where we did
    need a valid mm set up (eg the exec bug about a year ago).

    To preserve the checks and get the logic we want, shuffle the checking
    around and add a new helper to the vm_ security wrappers.

    Also fix a current->mm reference in nommu that should use the passed mm.
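
    A minimal sketch of the reshuffled modifier (a fragment; the
    surrounding overcommit policy computation is elided):

        /* __vm_enough_memory() now takes the mm explicitly; kernel-internal
         * callers like knfsd legitimately pass a NULL mm. */
        if (mm)
                allowed -= mm->total_vm / 32;   /* per-process modifier */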

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix build]
    Reported-by: Junjiro R. Okajima
    Acked-by: James Morris
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Delete excess kernel-doc notation in mm/ subdirectory.
    Actually this is a kernel-doc notation fix.

    Warning(/var/linsrc/linux-2.6.27-git10//mm/vmalloc.c:902): Excess function parameter or struct member 'returns' description in 'vm_map_ram'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Nothing uses prepare_write or commit_write. Remove them from the tree
    completely.

    [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
    Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

24 Oct, 2008

1 commit

  • * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
    proc: remove fs/proc/proc_misc.c
    proc: move /proc/vmcore creation to fs/proc/vmcore.c
    proc: move pagecount stuff to fs/proc/page.c
    proc: move all /proc/kcore stuff to fs/proc/kcore.c
    proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
    proc: move /proc/modules boilerplate to kernel/module.c
    proc: move /proc/diskstats boilerplate to block/genhd.c
    proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmstat boilerplate to mm/vmstat.c
    proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
    proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmallocinfo to mm/vmalloc.c
    proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
    proc: move /proc/slab_allocators boilerplate to mm/slab.c
    proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
    proc: move /proc/stat to fs/proc/stat.c
    proc: move rest of /proc/partitions code to block/genhd.c
    proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
    proc: move /proc/devices code to fs/proc/devices.c
    proc: move rest of /proc/locks to fs/locks.c
    ...

    Linus Torvalds
     

23 Oct, 2008

1 commit

  • page_cgroup_init() is called from mem_cgroup_init(). But at that point,
    we cannot call alloc_bootmem()
    (and this caused a panic at boot).

    This patch moves page_cgroup_init() to init/main.c.

    The timetable is as follows:
    ==
    parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
    ....
    cgroup_init_early() # "early" init of cgroup.
    ....
    setup_arch() # memmap is allocated.
    ...
    page_cgroup_init();
    mem_init(); # we cannot call alloc_bootmem after this.
    ....
    cgroup_init() # mem_cgroup is initialized.
    ==

    Before page_cgroup_init(), mem_map must be initialized. So,
    I added page_cgroup_init() to init/main.c directly.

    (*) Maybe this is not very clean, but:
    - cgroup_init_early() is too early
    - in cgroup_init(), we would have to use vmalloc instead of
      alloc_bootmem(); the vmalloc area on x86-32 is precious, and very
      large vmalloc() calls should be avoided there.
    So, we want to use alloc_bootmem(), and add page_cgroup_init() directly
    to init/main.c.

    [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
    [akpm@linux-foundation.org: fix build]
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki