16 Aug, 2010

1 commit

  • This commit makes the stack guard page somewhat less visible to user
    space. It does this by:

    - not showing the guard page in /proc/<pid>/maps

    It looks like lvm-tools will actually read /proc/self/maps to figure
    out where all its mappings are, and effectively do a specialized
    "mlockall()" in user space. By not showing the guard page as part of
    the mapping (by just adding PAGE_SIZE to the start for grows-up
    pages), lvm-tools ends up not being aware of it.

    - by also teaching the _real_ mlock() functionality not to try to lock
    the guard page.

    That would just expand the mapping down to create a new guard page,
    so there really is no point in trying to lock it in place.

    It would perhaps be nice to show the guard page specially in
    /proc/<pid>/maps (or at least mark grow-down segments in some way), but
    let's not open ourselves up to more breakage by user space from programs
    that depend on the exact details of the 'maps' file.

    Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
    source code to see what was going on with the whole new warning.

    Reported-and-tested-by: François Valenduc
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
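
    The lvm-tools behaviour described above boils down to a hand-rolled
    mlockall(): parse /proc/self/maps and mlock() every range it lists. A
    minimal user-space sketch of that pattern (illustrative only, not
    lvm-tools' actual code):

      #include <stdio.h>
      #include <sys/mman.h>

      /* Lock every mapping listed in /proc/self/maps.  If the kernel showed
       * the guard page here, this loop would try to mlock() it as well --
       * which is exactly what the change above avoids. */
      int main(void)
      {
          FILE *f = fopen("/proc/self/maps", "r");
          char line[512];
          unsigned long start, end;

          if (!f)
              return 1;
          while (fgets(line, sizeof(line), f)) {
              if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                  continue;
              /* Ignore failures (e.g. RLIMIT_MEMLOCK) in this sketch. */
              mlock((void *)start, end - start);
          }
          fclose(f);
          return 0;
      }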
     

15 Aug, 2010

2 commits

  • Remove leading /** from non-kernel-doc function comments to prevent
    kernel-doc warnings.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
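
    In other words, comments that merely start with /** but are not kernel-doc
    get downgraded to plain /* comments. A hypothetical before/after (the
    function name is made up for illustration):

      /* Before: scripts/kernel-doc treats this as a (malformed) kernel-doc block. */
      /**
       * helper that clips a value to the allowed range
       */
      static int clip_to_range(int v);

      /* After: an ordinary comment, ignored by kernel-doc. */
      /*
       * helper that clips a value to the allowed range
       */
      static int clip_to_range(int v);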
     
  • We do in fact need to unmap the page table _before_ doing the whole
    stack guard page logic, because if it is needed (mainly 32-bit x86 with
    PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
    will do a kmap_atomic/kunmap_atomic.

    And those kmaps will create an atomic region that we cannot do
    allocations in. However, the whole stack expand code will need to do
    anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
    atomic region.

    Now, a better model might actually be to do the anon_vma_prepare() when
    _creating_ a VM_GROWSDOWN segment, and not have to worry about any of
    this at page fault time. But in the meantime, this is the
    straightforward fix for the issue.

    See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.

    Reported-by: Wylda
    Reported-by: Sedat Dilek
    Reported-by: Mike Pagano
    Reported-by: François Valenduc
    Tested-by: Ed Tomlinson
    Cc: Pekka Enberg
    Cc: Greg KH
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Aug, 2010

2 commits

  • Remove an extraneous no_printk() in mm/nommu.c that got missed when the
    function got generalised from several things that used it in commit
    12fdff3fc248 ("Add a dummy printk function for the maintenance of unused
    printks").

    Without this, the following error is observed:

    mm/nommu.c:41: error: conflicting types for 'no_printk'
    include/linux/kernel.h:314: error: previous definition of 'no_printk' was here

    Reported-by: Michal Simek
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • .. which didn't show up in my tests because it's a no-op on x86-64 and
    most other architectures. But we enter the function with the last-level
    page table mapped, and should unmap it at exit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Aug, 2010

3 commits

  • This is a rather minimally invasive patch to solve the problem of the
    user stack growing into a memory mapped area below it. Whenever we fill
    the first page of the stack segment, expand the segment down by one
    page.

    Now, admittedly some odd application might _want_ the stack to grow down
    into the preceding memory mapping, and so we may at some point need to
    make this a process tunable (some people might also want to have more
    than a single page of guarding), but let's try the minimal approach
    first.

    Tested with a trivial application that maps a single page just below the
    stack, and then starts recursing. Without this, we will get a SIGSEGV
    _after_ the stack has smashed the mapping. With this patch, we'll get a
    nice SIGBUS just as the stack touches the page just above the mapping.

    Requested-by: Keith Packard
    Signed-off-by: Linus Torvalds

    Linus Torvalds
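
    A rough reconstruction of that kind of test (a sketch, assuming the page
    below the stack can be found by parsing the [stack] line of
    /proc/self/maps; not the exact program used above):

      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Find the low end of the [stack] mapping. */
      static unsigned long stack_bottom(void)
      {
          FILE *f = fopen("/proc/self/maps", "r");
          char line[512];
          unsigned long start = 0, end = 0;

          while (f && fgets(line, sizeof(line), f))
              if (strstr(line, "[stack]") &&
                  sscanf(line, "%lx-%lx", &start, &end) == 2)
                  break;
          if (f)
              fclose(f);
          return start;
      }

      /* Burn roughly a page of stack per call; the "+ pad[0]" keeps the
       * compiler from turning this into a tail call. */
      static int recurse(int depth)
      {
          volatile char pad[4096];

          pad[0] = (char)depth;
          return recurse(depth + 1) + pad[0];
      }

      int main(void)
      {
          long page = sysconf(_SC_PAGESIZE);
          void *just_below = (void *)(stack_bottom() - page);

          /* Map a single page just below the stack ... */
          if (mmap(just_below, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
              perror("mmap");
              return 1;
          }
          /* ... then recurse: unpatched we SIGSEGV after smashing the mapping,
           * patched we get SIGBUS as the stack reaches the guard page. */
          return recurse(0);
      }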
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     
  • * 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    x86: Detect whether we should use Xen SWIOTLB.
    pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
    swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
    xen/mmu: inhibit vmap aliases rather than trying to clear them out
    vmap: add flag to allow lazy unmap to be disabled at runtime
    xen: Add xen_create_contiguous_region
    xen: Rename the balloon lock
    xen: Allow unprivileged Xen domains to create iomap pages
    xen: use _PAGE_IOMAP in ioremap to do machine mappings

    Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
    driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
    include/xen/xen-ops.h

    Linus Torvalds
     

12 Aug, 2010

4 commits

  • Document global_dirty_limits() and bdi_dirty_limit().

    Signed-off-by: Wu Fengguang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under global dirty background
    threshold (which is the normal state for most systems).

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Reducing the number of times balance_dirty_pages calls global_page_state
    reduces the cache references and so improves write performance on a
    variety of workloads.

    'perf stats' of simple fio write tests shows the reduction in cache
    accesses. The test is fio 'write,mmap,600Mb,pre_read' on an AMD AthlonX2
    with 3Gb memory (dirty_threshold approx 600 Mb), running each test 10
    times, dropping the fastest & slowest values, then taking the average &
    standard deviation:

                   average (s.d.) in millions (10^6)
    2.6.31-rc8     648.6 (14.6)
    +patch         620.1 (16.5)

    This reduction is achieved by dropping clip_bdi_dirty_limit(), as it
    rereads the counters to apply the dirty_threshold, and moving this check
    up into balance_dirty_pages(), where the counters have already been read.

    Also, rearranging the for loop to contain only one copy of the limit tests
    allows the pdflush test after the loop to use the local copies of the
    counters rather than rereading them.

    In the common case with no throttling it now calls global_page_state 5
    fewer times and bdi_stat 2 fewer.

    Fengguang:

    This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
    with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
    avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
    accurate, we don't need to clip routinely; a simple dirty limit check is
    enough.

    The check is necessary because, in principle, we should throttle
    everything calling balance_dirty_pages() when we're over the total limit,
    as Peter has said.

    We now set and clear dirty_exceeded not only based on the bdi dirty
    limits, but also on the global dirty limit. The global limit check is
    added in place of clip_bdi_dirty_limit() for safety and is not intended as
    a behavior change. The bdi limits should be tight enough to keep all dirty
    pages under the global limit most of the time; occasionally exceeding it
    slightly should be OK though. The change makes the logic more obvious: the
    global limit is the ultimate goal and shall always be imposed.

    We may now start background writeback work based on outdated conditions.
    That's safe because the bdi flush thread will (and has to) double check
    the state. It reduces overall overhead because the test based on old
    state still has a good chance of being right.

    [akpm@linux-foundation.org: fix uninitialized dirty_exceeded]
    Signed-off-by: Richard Kennedy
    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Acked-by: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
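
    Reduced to its core, the limit test described above looks roughly like the
    sketch below (illustrative names, not the actual mm/page-writeback.c
    code): throttle when either the global dirty threshold or the per-BDI
    threshold is exceeded, using counters already read in
    balance_dirty_pages().

      /* Sketch only: does the caller need to be throttled? */
      static int over_dirty_limits(unsigned long nr_reclaimable,
                                   unsigned long nr_writeback,
                                   unsigned long bdi_nr_reclaimable,
                                   unsigned long bdi_nr_writeback,
                                   unsigned long dirty_thresh,
                                   unsigned long bdi_thresh)
      {
          /* The global limit is the ultimate goal and is always imposed. */
          if (nr_reclaimable + nr_writeback >= dirty_thresh)
              return 1;
          /* The per-BDI limit keeps one slow device from hoarding the quota. */
          return bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh;
      }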
     
  • Fix a fatal kernel-doc error due to a #define coming between a function's
    kernel-doc notation and the function signature. (kernel-doc cannot handle
    this)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
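
    The pattern in question, shown with a made-up function for illustration:
    kernel-doc requires the documented prototype to follow the /** block
    immediately, so an intervening #define breaks the parse.

      /* Broken: a #define sits between the kernel-doc block and the function. */
      /**
       * widget_adjust - adjust the widget state (hypothetical example)
       * @w: widget id
       */
      #define WIDGET_MAX 16
      int widget_adjust(int w);

      /* Fixed: nothing between the kernel-doc comment and its prototype. */
      #define WIDGET2_MAX 16
      /**
       * widget2_adjust - adjust the widget state (hypothetical example)
       * @w: widget id
       */
      int widget2_adjust(int w);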
     

11 Aug, 2010

23 commits

    We have zone_to_nid(). This patch converts all existing users of
    zone->zone_pgdat->node_id.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    mem_cgroup_soft_limit_reclaim() takes zone, nid and zid arguments, but nid
    and zid can be calculated from zone. So remove them.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Balbir Singh
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Currently mem_cgroup_shrink_node_zone() calls shrink_zone() directly, so
    it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
    use it at all.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Balbir Singh
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    sc.nr_reclaimed and sc.nr_scanned have already been initialized a few
    lines above by the "struct scan_control sc = {}" statement.

    So this patch removes the unnecessary code.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Currently, mem_cgroup_shrink_node_zone() initializes sc.nr_to_reclaim as
    0. That means shrink_zone() only scans 32 pages and immediately returns
    even if it doesn't reclaim any pages.

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Balbir Singh
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Currently, the memory cgroup increments the css (cgroup subsys state)
    reference count for each charged page, and the reference count is kept
    until the page is uncharged. But this has bad effects:

    1. Because css_get()/css_put() call atomic_inc()/atomic_dec(), heavy use
    of them on large SMP systems will not scale well.
    2. Because the css's refcnt can never be in a "ready-to-release" state,
    cgroup's notify_on_release handler can't work with memcg.
    3. The css's refcnt is an atomic_t, i.e. limited to 32 bits. Maybe too
    small.

    This has been a problem since the 1st merge of memcg.

    This is an attempt to remove the per-page css refcnt. Even if we remove
    the refcnt, pre_destroy() does enough synchronization:
    - check res->usage == 0.
    - check no pages on the LRU.

    This patch removes the per-page css refcnt. Even after this patch, at
    first look, it seems css_get() is still called in try_charge().

    But the logic is:

    - If the memcg of mm->owner is the cached one, consume_stock() will work.
    On success, return immediately.
    - If consume_stock() returns false, css_get() is called and we go to the
    slow path, which may block. At the end of the slow path, css_put() is
    called and we restart from the beginning if necessary.

    So, in the fast path, we don't call css_get() and can avoid accessing the
    shared counter. This makes the most common case fast.

    Here is a result of multi-threaded page fault benchmark.

    [Before]
    25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c
    9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    When the OOM killer scans tasks and is called from memcg's context, it
    checks whether a task is under that memcg.

    But, as Oleg pointed out, a thread group leader may have a NULL ->mm, and
    task_in_mem_cgroup() may then make the wrong decision. We have to use
    find_lock_task_mm() in memcg as the generic OOM killer does.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    mem_cgroup_charge_common() is always called with @mem = NULL, so the
    argument is meaningless. This patch removes it.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • - try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
    don't have to call them in task_in_mem_cgroup().
    - *mz is not used in __mem_cgroup_uncharge_common().
    - we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
    after we've cleared PCG_MIGRATION of @oldpage.
    - remove empty comment.
    - remove redundant empty line in mem_cgroup_cache_charge().

    Signed-off-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    Now, to check whether a memcg is under task-account-moving, we do
    css_tryget() against mc.to and mc.from. But this just complicates things.
    This patch makes the check easier.

    This patch adds a spinlock to move_charge_struct to guard modification of
    mc.to and mc.from. With this, we don't have to think about complicated
    races around this non-critical path.

    [balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    mem_cgroup_try_charge() has a big loop in it and is hard to read. Most of
    the routines are for the slow path. This patch moves code out of the loop
    and makes it clear what's done.

    Summary:
    - refactor a function to detect whether a memcg is under account move.
    - refactor a function to wait for the end of moving task accounting.
    - refactor the main loop('s slow path) as a function and make it clear,
    via the return code, why we retry or quit.
    - add a fatal_signal_pending() check to bypass the charge loops.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    This patch fixes a possible deadlock in hugepage lock_page() by adding a
    missing unlock_page().

    The libhugetlbfs test will hit this bug when the next patch in this
    patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • CONFIG_HUGETLBFS controls hugetlbfs interface code.
    OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code.
    So we should use CONFIG_HUGETLB_PAGE here.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch enables hwpoison injection through debug/hwpoison interfaces,
    with which we can test memory error handling for free or reserved
    hugepages (which cannot be tested by madvise() injector).

    [AK: Export PageHuge too for the injection module]
    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
    This patch enables blocking access to a hwpoisoned hugepage and also
    blocking its unmapping.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
    If the error hugepage is not in use, we can fully recover from the error
    by dequeuing it from the freelist, so return RECOVERY. Otherwise, whether
    we can recover depends on the user processes, so return DELAYED.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
    For now all pages in the error hugepage are considered hwpoisoned, so
    count all of them in mce_bad_pages.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
    To avoid a race condition between concurrent memory errors on the
    identified hugepage, we atomically test and set the PG_hwpoison bit on the
    head page. All pages in the error hugepage are considered hwpoisoned for
    now, so set and clear all PG_hwpoison bits in the hugepage with the page
    lock of the head page held.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
    This patch just enables the handling path. The real containment and
    recovery operations will be implemented in the following patches.

    Dependency:
    "hugetlb, rmap: add reverse mapping for hugepage."

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be useful
    for memory error handler.

    Without this patch, the memory error handler can neither identify the
    processes using the bad hugepage nor unmap it from them. That is:
    - for a shared hugepage:
    we can collect the processes using the hugepage through the pagecache,
    but cannot unmap the hugepage because of the lack of a mapcount.
    - for a privately mapped hugepage:
    we can neither collect the processes nor unmap the hugepage.
    This patch solves these problems.

    This patch includes the bug fix given by commit 23be7468e8, so it reverts
    that commit.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24:
    - create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
    - move the functions setting up anon_vma for hugepages into mm/rmap.c.

    ChangeLog since May 13:
    - rebased to 2.6.34
    - fix a logic error (in the case where a private mapping and a shared
    mapping coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
    from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy the exclusive switch of __set_page_anon_rmap() into the hugepage
    counterpart.
    - revert commit 23be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
    kmemleak: Fix typo in the comment
    lib/scatterlist: Hook sg_kmalloc into kmemleak (v2)
    kmemleak: Add DocBook style comments to kmemleak.c
    kmemleak: Introduce a default off mode for kmemleak
    kmemleak: Show more information for objects found by alias

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

5 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: fix build with make 3.82
    Revert "Input: appletouch - fix integer overflow issue"
    memblock: Fix memblock_is_region_reserved() to return a boolean
    powerpc: Trim defconfigs
    powerpc: fix i8042 module build error
    sound/soc: mpc5200_psc_ac97: Use gpio pins for cold reset
    powerpc/5200: add mpc5200_psc_ac97_gpio_reset

    Linus Torvalds
     
    When taking a memory snapshot in hibernate_snapshot(), all (directly
    called) memory allocations use GFP_ATOMIC. Hence swap misusage during
    hibernation never occurs.

    But from a pessimistic point of view, there is no guarantee that no page
    allocation has __GFP_WAIT. It is better to have a global indication "we
    are entering hibernation, don't use swap!".

    This patch tries to freeze new swap allocation during hibernation. (All
    user processes are frozen, so swapin is not a concern.)

    This way, no updates will happen to swap_map[] between
    hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
    is called. We can be assured that swap corruption will not occur.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Ondrej Zary
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Since 2.6.31, swap_map[]'s refcounting was changed to show that a used
    swap entry that is only for swap-cache can be reused. Then, while scanning
    for a free entry in swap_map[], a swap entry may be reclaimed and reused.
    This was introduced by commit c9e444103b5e7a5 ("mm: reuse unused swap
    entry if necessary").

    But this caused data corruption at resume. The scenario is:

    - Assume a clean swap cache, but mapped.

    - At hibernation_snapshot(), the clean swap cache is saved as a
    clean swap cache and swap_map[] is marked as SWAP_HAS_CACHE.

    - Then, save_image() is called. It reuses the SWAP_HAS_CACHE entry to save
    the image, and breaks its contents.

    After resume:

    - The memory reclaim runs, finds the clean, not-referenced swap cache and
    discards it because it's marked as clean. But here, the contents on
    disk and in the swap cache are inconsistent.

    Hence memory is corrupted.

    This patch avoids the bug by not reclaiming swap entries during
    hibernation. This is a quick fix suitable for backporting.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rafael J. Wysocki
    Reported-by: Ondreg Zary
    Tested-by: Ondreg Zary
    Tested-by: Andrea Gelmini
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Use compile-time allocated memory instead of dynamically allocated memory
    for mm_slots_hash.

    Use hash_ptr() instead of divisions for bucket calculation.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Izik Eidus
    Cc: Avi Kivity
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
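
    A small user-space analogy of the two changes (the kernel side uses
    hash_ptr() from <linux/hash.h> and a static bucket array; the
    multiplicative hash below just stands in for hash_ptr()):

      #include <stdint.h>
      #include <stdio.h>

      #define SLOTS_HASH_BITS  10
      #define SLOTS_HASH_HEADS (1u << SLOTS_HASH_BITS)

      /* Compile-time allocation: no kmalloc()/malloc() at init time. */
      static void *slots_hash[SLOTS_HASH_HEADS];

      /* Multiplicative ("Fibonacci") hash of a pointer: multiply by a large
       * odd constant and keep the top SLOTS_HASH_BITS bits -- no division. */
      static unsigned int slot_bucket(const void *p)
      {
          return (unsigned int)(((uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull)
                                >> (64 - SLOTS_HASH_BITS));
      }

      int main(void)
      {
          int x;

          slots_hash[slot_bucket(&x)] = &x;
          printf("&x lands in bucket %u of %u\n", slot_bucket(&x), SLOTS_HASH_HEADS);
          return 0;
      }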
     
  • Fix "system goes unresponsive under memory pressure and lots of
    dirty/writeback pages" bug.

    http://lkml.org/lkml/2010/4/4/86

    In the above thread, Andreas Mohr described that

    Invoking any command locked up for minutes (note that I'm
    talking about attempted additional I/O to the _other_,
    _unaffected_ main system HDD - such as loading some shell
    binaries -, NOT the external SSD18M!!).

    This happens when two conditions are both met:
    - under memory pressure
    - writing heavily to a slow device

    OOM also happens in Andreas' system. The OOM trace shows that 3 processes
    are stuck in wait_on_page_writeback() in the direct reclaim path. One in
    do_fork() and the other two in unix_stream_sendmsg(). They are blocked on
    this condition:

    (sc->order && priority < DEF_PRIORITY - 2)

    which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
    also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
    permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
    for the order-1 fork() allocation runs into a range of 512KB
    hard-to-reclaim LRU pages, it will be stalled.

    It's a severe problem in three ways.

    Firstly, it can easily happen in daily desktop usage. vmscan priority can
    easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if
    the system has 50% globally reclaimable pages, it still has a good
    chance of having 0.1%-sized hard-to-reclaim ranges. For example, a
    simple dd can easily create a big range (up to 20%) of dirty pages in the
    LRU lists. And order-1 to order-3 allocations are more than common with
    SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
    slab caches. For example, the order-1 radix_tree_node slab cache may
    stall applications at swap-in time; the order-3 inode cache on most
    filesystems may stall applications when trying to read some file; the
    order-2 proc_inode_cache may stall applications when trying to open a
    /proc file.

    Secondly, once triggered, it will stall unrelated processes (not doing IO
    at all) in the system. This "one slow USB device stalls the whole system"
    avalanching effect is very bad.

    Thirdly, once stalled, the stall time could be intolerably long for the
    users. When there are 20MB of queued writeback pages and USB 1.1 is
    writing them at 1MB/s, wait_on_page_writeback() will be stuck for up to 20
    seconds. Not to mention it may be called multiple times.

    So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
    DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is
    20%, it will hardly be triggered by pure dirty pages. We'd better treat
    PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
    uncomfortably long (easily goes beyond 1s).

    The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
    which are easy to satisfy in 1TB memory boxes. So, although 6.25% of
    memory could be an awful lot of pages to scan on a system with 1TB of
    memory, it won't really have to busy scan that much.

    Andreas tested an older version of this patch and reported that it mostly
    fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will
    fix it further in the next patch.

    Reported-by: Andreas Mohr
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Wu Fengguang
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
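
    To put numbers on the old and new bars (a worked example, assuming
    DEF_PRIORITY is 12 as in kernels of this period, and using the commit's
    own convention that a bar of p corresponds to scanning 1/2^p of the LRU):

      #include <stdio.h>

      #define DEF_PRIORITY 12   /* assumed, matching mainline of this era */

      int main(void)
      {
          unsigned int old_bar = DEF_PRIORITY - 2;  /* old: priority < 10 */
          unsigned int new_bar = DEF_PRIORITY / 3;  /* new: priority < 4  */

          /* At a bar of p, PAGEOUT_IO_SYNC kicks in once reclaim is down to
           * scanning about 1/2^p of the LRU per pass. */
          printf("old bar: 1/%u of the LRU (%.3f%%)\n",
                 1u << old_bar, 100.0 / (1u << old_bar));
          printf("new bar: 1/%u of the LRU (%.3f%%)\n",
                 1u << new_bar, 100.0 / (1u << new_bar));
          return 0;
      }

    This prints roughly 0.098% for the old bar and 6.250% for the new one,
    matching the ~0.1% and 6.25% figures quoted above.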