04 Nov, 2009

2 commits


03 Nov, 2009

1 commit

  • In try_to_unuse(), swcount is a local copy of *swap_map, including the
    SWAP_HAS_CACHE bit; but a wrong comparison against swap_count(*swap_map),
    which masks off the SWAP_HAS_CACHE bit, succeeded where it should fail.

    That had the effect of resetting the mm from which to start searching
    for the next swap page, to an irrelevant mm instead of to an mm in which
    this swap page had been found: which may increase search time by ~20%.
    But we're used to swapoff being slow, so never noticed the slowdown.

    Remove that one spurious use of swap_count(): Bo Liu thought it merely
    redundant, Hugh rewrote the description since it was measurably wrong.

    Signed-off-by: Bo Liu
    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Bo Liu
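
    The masking mismatch described above can be sketched in user space. This
    is only an illustration: the bit value, the model_* names and the use of a
    less-than comparison are assumptions, not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout, for illustration only: one flag bit above the count. */
#define MODEL_HAS_CACHE  0x8000u
#define MODEL_COUNT_MASK 0x7fffu

/* Plays the role of swap_count(): strip the cache flag, keep the count. */
static unsigned model_swap_count(unsigned ent)
{
    return ent & MODEL_COUNT_MASK;
}

/* Buggy shape: the masked map entry is compared with an unmasked copy. */
static bool count_dropped_buggy(unsigned map_entry, unsigned swcount)
{
    return model_swap_count(map_entry) < swcount;
}

/* Fixed shape: both sides still carry the flag bit, so like meets like. */
static bool count_dropped_fixed(unsigned map_entry, unsigned swcount)
{
    return map_entry < swcount;
}

/* With the flag set and the entry unchanged, only the buggy test fires. */
static void check_swap_count_model(void)
{
    unsigned entry = MODEL_HAS_CACHE | 2u;

    assert(count_dropped_buggy(entry, entry));
    assert(!count_dropped_fixed(entry, entry));
}
```

    Calling check_swap_count_model() passes silently: the buggy comparison
    succeeds on an unchanged entry, while the fixed one correctly fails.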
     

01 Nov, 2009

1 commit

  • Don't pass NULL pointers to fput() in the error handling paths of the NOMMU
    do_mmap_pgoff(), as fput() can't handle them.

    The following can be used as a test program:

    int main(void) { static long long a[1024 * 1024 * 20] = { 0 }; return (int)a[0]; }

    Without the patch, the code oopses in atomic_long_dec_and_test() as called by
    fput() after the kernel complains that it can't allocate that big a chunk of
    memory. With the patch, the kernel just complains about the allocation size
    and then the program segfaults during execve() as execve() can't complete the
    allocation of all the new ELF program segments.

    Reported-by: Robin Getz
    Signed-off-by: David Howells
    Acked-by: Robin Getz
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     

30 Oct, 2009

2 commits


29 Oct, 2009

12 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
    powerpc: Minor cleanup to lib/Kconfig.debug
    powerpc: Minor cleanup to sound/ppc/Kconfig
    powerpc: Minor cleanup to init/Kconfig
    powerpc: Limit memory hotplug support to PPC64 Book-3S machines
    powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
    powerpc: Fix compile errors found by new ppc64e_defconfig
    powerpc: Add a Book-3E 64-bit defconfig
    powerpc/booke: Fix xmon single step on PowerPC Book-E
    powerpc: Align vDSO base address
    powerpc: Fix segment mapping in vdso32
    powerpc/iseries: Remove compiler version dependent hack
    powerpc/perf_events: Fix priority of MSR HV vs PR bits
    powerpc/5200: Update defconfigs
    drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
    powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
    of: Remove nested function
    mpc5200: support for the MAN mpc5200 based board mucmc52
    mpc5200: support for the MAN mpc5200 based board uc101

    Linus Torvalds
     
  • * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: fix invalid page count in printk output
    HWPOISON: Allow schedule_on_each_cpu() from keventd
    HWPOISON: fix /proc/meminfo alignment
    HWPOISON: fix oops on ksm pages
    HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
    HWPOISON: return early on non-LRU pages
    HWPOISON: Add brief hwpoison description to Documentation
    HWPOISON: Clean up PR_MCE_KILL interface

    Linus Torvalds
     
  • There are some places where we do something like:

        pte = pte_map();
        do {
            (do break in some conditions)
        } while (pte++, ...);
        pte_unmap(pte - 1);

    But if the loop breaks on its first iteration, pte_unmap() unmaps an
    invalid pte.

    This patch is a fix for this problem.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
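
    The off-by-one in the loop above can be reproduced with plain indices. This
    is a hedged sketch: walk_buggy() and walk_fixed() are hypothetical names,
    an index stands in for the pte pointer, and remembering the original
    position is one possible fix, not necessarily the one the patch uses.

```c
#include <assert.h>

/* Buggy shape: returns the index that would be handed to the unmap call. */
static long walk_buggy(long stop_at, long n)
{
    long pte = 0;                 /* index of the first mapped entry */

    assert(n > 0);
    do {
        if (pte == stop_at)
            break;                /* may fire before pte ever advances */
    } while (pte++, pte < n);
    return pte - 1;               /* -1 when the break hit at index 0 */
}

/* One possible fix: remember where the mapping started and unmap that. */
static long walk_fixed(long stop_at, long n)
{
    long orig_pte = 0;
    long pte = orig_pte;

    assert(n > 0);
    do {
        if (pte == stop_at)
            break;
    } while (pte++, pte < n);
    return orig_pte;              /* always inside the mapped range */
}
```

    With four entries, walk_buggy(0, 4) yields -1, an index before the mapped
    range, while a walk that never breaks yields 3 as intended.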
     
  • Currently, sparsemem is only available if EXPERIMENTAL is enabled.
    However, it hasn't ever been marked experimental.

    It's been about four years since sparsemem was merged, and we have
    platforms which depend on it; allow architectures to decide whether
    sparsemem should be the default memory model.

    Signed-off-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • Isolators putting a page back to the LRU do not hold the page lock, and if
    the page is mlocked, another thread might munlock it concurrently.

    Expecting this, the putback code re-checks the evictability of a page when
    it just moved it to the unevictable list in order to correct its decision.

    The problem, however, is that ordering is not guaranteed between setting
    PG_lru when moving the page to the list and checking PG_mlocked
    afterwards:

    #0:                             #1

    spin_lock()
                                    if (TestClearPageMlocked())
                                      if (PageLRU())
                                        move to evictable list
    SetPageLRU()
    spin_unlock()
    if (!PageMlocked())
      move to evictable list

    The PageMlocked() check may get reordered before SetPageLRU() in #0,
    resulting in #0 not moving the still mlocked page, and in #1 failing to
    isolate and move the page as well. The page is now stranded on the
    unevictable list.

    The race condition is very unlikely. The consequence currently is one
    page falling off the reclaim grid and eventually getting freed with
    PG_unevictable set, which triggers a warning in the page allocator.

    TestClearPageMlocked() in #1 already provides full memory barrier
    semantics.

    This patch adds an explicit full barrier to force ordering between
    SetPageLRU() and PageMlocked() so that either one of the competitors
    rescues the page.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
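
    The ordering requirement above is the classic store-then-load pattern on
    two flags. A minimal user-space sketch with C11 atomics follows; struct
    page_model and both function names are hypothetical, and the
    sequentially consistent operations merely stand in for the full-barrier
    semantics the message attributes to the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct page_model {
    atomic_bool lru;      /* stands in for PG_lru */
    atomic_bool mlocked;  /* stands in for PG_mlocked */
};

/* Side #0: publish the page, then re-check mlocked.
 * Returns true if this side must move the page to the evictable list. */
static bool putback_side(struct page_model *p)
{
    atomic_store_explicit(&p->lru, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* the added full barrier */
    return !atomic_load_explicit(&p->mlocked, memory_order_relaxed);
}

/* Side #1: test-and-clear mlocked, then look at lru.
 * Returns true if this side must move the page. */
static bool munlock_side(struct page_model *p)
{
    bool was_mlocked = atomic_exchange(&p->mlocked, false);  /* seq_cst */

    return was_mlocked && atomic_load(&p->lru);               /* seq_cst */
}

/* Sequential sanity check: whichever side runs second rescues the page. */
static void check_one_side_rescues(void)
{
    struct page_model p;

    atomic_init(&p.lru, false);
    atomic_init(&p.mlocked, true);
    assert(!putback_side(&p));
    assert(munlock_side(&p));
}
```

    Without the fence in putback_side(), the load of mlocked could be
    satisfied before the store to lru becomes visible, which is the window in
    which neither side moves the page.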
     
  • If migrate_prep() fails, the 'new' variable is leaked. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If mbind() receives an invalid address, do_mbind leaks a page. The
    following test program detects this leak.

    This patch fixes it.

    migrate_efault.c
    =======================================
    #include <numaif.h>
    #include <numa.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <string.h>

    static unsigned long pagesize;

    /* Map three pages, touch them, then punch a hole in the middle one. */
    static void *make_hole_mapping(void)
    {
        char *addr;

        addr = mmap(NULL, pagesize * 3, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE, -1, 0);
        if (addr == MAP_FAILED)
            return NULL;

        /* populate the pages */
        memset(addr, 0, pagesize * 3);

        /* make a memory hole */
        munmap(addr + pagesize, pagesize);

        return addr;
    }

    int main(int argc, char **argv)
    {
        void *addr;
        int ch;
        int node;
        struct bitmask *nmask = numa_allocate_nodemask();
        int err;
        int node_set = 0;

        /* -n <node> selects the target node; default is node 0 */
        while ((ch = getopt(argc, argv, "n:")) != -1) {
            switch (ch) {
            case 'n':
                node = strtol(optarg, NULL, 0);
                numa_bitmask_setbit(nmask, node);
                node_set = 1;
                break;
            default:
                ;
            }
        }
        argc -= optind;
        argv += optind;

        if (!node_set)
            numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        addr = make_hole_mapping();
        if (!addr) {
            perror("mmap ");
            return 1;
        }

        /* the range covers the hole, so mbind() is expected to fail */
        err = mbind(addr, pagesize * 3, MPOL_BIND, nmask->maskp,
                    nmask->size, MPOL_MF_MOVE_ALL);
        if (err)
            perror("mbind ");

        return 0;
    }
    =======================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It is possible to have !Anon but SwapBacked pages, and some apps could
    create huge number of such pages with MAP_SHARED|MAP_ANONYMOUS. These
    pages go into the ANON lru list, and hence shall not be protected: we only
    care about mapped executable files. Failing to do so may trigger OOM.

    Tested-by: Christian Borntraeger
    Reviewed-by: Rik van Riel
    Signed-off-by: Wu Fengguang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Revert

    commit 71de1ccbe1fb40203edd3beb473f8580d917d2ca
    Author: KOSAKI Motohiro
    AuthorDate: Mon Sep 21 17:01:31 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Tue Sep 22 07:17:27 2009 -0700

    mm: oom analysis: add buffer cache information to show_free_areas()

    show_free_areas() is called during page allocation failures, and page
    allocation failures can occur in any calling context.

    But nr_blockdev_pages() takes VFS locks which should not be taken from
    hard IRQ context (at least). The result is lockdep warnings (and
    deadlockability) during page allocation failures.

    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Commit 8aa7e847d (Fix congestion_wait() sync/async vs read/write
    confusion) replaced WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm
    development accidentally left one place unchanged.

    This patch fixes that place too.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Jens Axboe
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
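
    Why a leftover WRITE matters can be shown with the constants alone. The
    numeric values below are assumptions about the headers of that era, and
    the MODEL_* names and queue_waited_on() are hypothetical stand-ins.

```c
#include <assert.h>
#include <string.h>

/* Assumed values: direction flags and sync flags share the same 0/1 range. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };
enum { MODEL_BLK_RW_ASYNC = 0, MODEL_BLK_RW_SYNC = 1 };

/* Names the queue a congestion wait would sleep on for a given argument. */
static const char *queue_waited_on(int sync)
{
    return sync == MODEL_BLK_RW_SYNC ? "sync" : "async";
}

/* A stale WRITE compiles cleanly but silently selects the sync queue. */
static void check_stale_write_argument(void)
{
    assert(strcmp(queue_waited_on(MODEL_WRITE), "sync") == 0);
    assert(strcmp(queue_waited_on(MODEL_BLK_RW_ASYNC), "async") == 0);
}
```

    Because both arguments are plain integers, the compiler gives no warning
    for the stale call site, which is how one place could be missed.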
     
  • Memory failure on a KSM page currently oopses on its NULL anon_vma in
    page_lock_anon_vma(): that may not be much worse than the consequence of
    ignoring it, but it is better to be consistent with how ZERO_PAGE and
    hugetlb pages and other awkward cases are treated. Just skip it.

    We could fix it for 2.6.32 at the KSM end, by putting a dummy anon_vma
    pointer in there; but that would get harder next time, when KSM will put a
    pointer to something else there (and I'm not currently planning to do any
    work to open that up to memory_failure). So I would prefer this simple
    PageKsm test, until the other exceptions are handled.

    Signed-off-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When the bdi is being removed, we have to ensure that no super_blocks
    currently have that cached in sb->s_bdi. Normally this is ensured by
    the sb having a longer life span than the bdi, but if the device is
    suddenly yanked, we have to kill this reference: otherwise sb->s_bdi
    points to freed memory at that point.

    This fixes a problem with sync(1) hanging when a USB stick is pulled
    without cleanly umounting it first.

    Reported-by: Pavel Machek
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Oct, 2009

1 commit

  • pcpu_alloc() and pcpu_extend_area_map() perform a series of
    spin_lock_irq()/spin_unlock_irq() calls, which make them unsafe
    with respect to being called from contexts which have IRQs off.

    This patch converts the code to save and restore the flags instead,
    allowing pcpu_alloc() (and thus __alloc_percpu()) to be called
    from the early kernel startup stage, where IRQs are off.

    This is needed for proper initialization of per-cpu rq_weight data from
    sched_init().

    tj: added comment explaining why irqsave/restore is used in alloc path.

    Signed-off-by: Jiri Kosina
    Acked-by: Ingo Molnar
    Signed-off-by: Tejun Heo

    Jiri Kosina
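
    The difference between the two locking styles can be modelled with a
    single flag. This is a user-space sketch only: the model_* helpers are
    hypothetical and imitate just the interrupt-state side effect, not the
    locking itself.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true;   /* models the CPU's interrupt state */

/* _irq style: disables on lock and unconditionally enables on unlock. */
static void model_lock_irq(void)   { irqs_enabled = false; }
static void model_unlock_irq(void) { irqs_enabled = true; }

/* _irqsave style: remembers the previous state and puts it back. */
static bool model_lock_irqsave(void)
{
    bool flags = irqs_enabled;

    irqs_enabled = false;
    return flags;
}

static void model_unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;
}

/* Called with IRQs off, only the save/restore pair leaves them off. */
static void check_early_boot_caller(void)
{
    irqs_enabled = false;
    model_lock_irq();
    model_unlock_irq();
    assert(irqs_enabled);              /* wrongly re-enabled */

    irqs_enabled = false;
    model_unlock_irqrestore(model_lock_irqsave());
    assert(!irqs_enabled);             /* caller's state preserved */
}
```

    The first half of check_early_boot_caller() shows the hazard for a caller
    that entered with IRQs off; the second half shows the state surviving.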
     

27 Oct, 2009

1 commit


19 Oct, 2009

4 commits

  • The madvise injector already holds a reference when passing in a page
    to the memory-failure code. The code corrects for this additional reference
    for its checks, but the final printk output didn't. Fix that.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Memory failure on a KSM page currently oopses on its NULL anon_vma in
    page_lock_anon_vma(): that may not be much worse than the consequence
    of ignoring it, but it is better to be consistent with how ZERO_PAGE
    and hugetlb pages and other awkward cases are treated. Just skip it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andi Kleen

    Hugh Dickins
     
  • When returning due to a poisoned page, drop the page count.

    It wasn't a fatal problem because no one cares about the page count
    on a poisoned page (except when it wraps), but it's cleaner to fix it.

    Pointed out by Linus.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Right now we have some trouble with non-atomic access
    to page flags when locking the page. To plug this hole
    for now, limit error recovery to LRU pages.

    This could be better fixed by defining a suitable protocol,
    but let's go this simple way for now.

    This avoids unnecessary races with __set_page_locked() and
    __SetPageSlab*() and maybe more non-atomic page flag operations.

    This loses isolated pages which are currently in page reclaim, but these
    are relatively limited compared to the total memory.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen
    [AK: new description, bug fixes, cleanups]

    Wu Fengguang
     

14 Oct, 2009

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cciss: Add cciss_allow_hpsa module parameter
    cciss: Fix multiple calls to pci_release_regions
    blk-settings: fix function parameter kernel-doc notation
    writeback: kill space in debugfs item name
    writeback: account IO throttling wait as iowait
    elv_iosched_store(): fix strstrip() misuse
    cfq-iosched: avoid probable slice overrun when idling
    cfq-iosched: apply bool value where we return 0/1
    cfq-iosched: fix think time allowed for seekers
    cfq-iosched: fix the slice residual sign
    cfq-iosched: abstract out the 'may this cfqq dispatch' logic
    block: use proper BLK_RW_ASYNC in blk_queue_start_tag()
    block: Seperate read and write statistics of in_flight requests v2
    block: get rid of kblock_schedule_delayed_work()
    cfq-iosched: fix possible problem with jiffies wraparound
    cfq-iosched: fix issue with rq-rq merging and fifo list ordering

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix compile warnings

    Linus Torvalds
     

12 Oct, 2009

2 commits

  • Fix the following two compile warnings which show up on i386.

    mm/percpu.c:1873: warning: comparison of distinct pointer types lacks a cast
    mm/percpu.c:1879: warning: format '%lx' expects type 'long unsigned int', but argument 2 has type 'size_t'

    Signed-off-by: Tejun Heo
    Reported-by: Randy Dunlap

    Tejun Heo
     
  • Now that m68k's task_thread_info() doesn't refer to current,
    it's possible to remove sched.h from interrupt.h and not break m68k!
    Many thanks to Heiko Carstens for allowing this.

    Signed-off-by: Alexey Dobriyan

    Alexey Dobriyan
     

10 Oct, 2009

2 commits


09 Oct, 2009

3 commits


08 Oct, 2009

3 commits

  • When a vmalloc'd area is mmap'd into userspace, some kind of
    co-ordination is necessary for this to work on platforms with cpu
    D-caches which can have aliases.

    Otherwise kernel side writes won't be seen properly in userspace
    and vice versa.

    If the kernel side mapping and the user side one have the same
    alignment, modulo SHMLBA, this can work as long as VM_SHARED is
    set on the VMA, and for all current users this is true. VM_SHARED
    will force SHMLBA alignment of the user side mmap on platforms where
    D-cache aliasing matters.

    The bulk of this patch is just making it so that a specific
    alignment can be passed down into __get_vm_area_node(). All
    existing callers pass in '1' which preserves existing behavior.
    vmalloc_user() gives SHMLBA for the alignment.

    As a side effect this should get the video media drivers and other
    vmalloc_user() users into more working shape on such systems.

    Signed-off-by: David S. Miller
    Acked-by: Peter Zijlstra
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    David Miller
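
    The "same alignment, modulo SHMLBA" condition can be written as a small
    check. This is a sketch under assumptions: MODEL_SHMLBA is an illustrative
    value (the real constant is architecture-specific) and both helper names
    are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SHMLBA ((uintptr_t)0x4000)   /* assumed value, power of two */

/* True when two virtual addresses share the same offset modulo SHMLBA. */
static bool same_offset_mod_shmlba(uintptr_t kaddr, uintptr_t uaddr)
{
    return ((kaddr ^ uaddr) & (MODEL_SHMLBA - 1)) == 0;
}

/* Round an address up to a power-of-two alignment. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}
```

    Aligning the kernel-side area with align_up(addr, MODEL_SHMLBA) gives it
    offset zero modulo SHMLBA, so any user mapping that is also
    SHMLBA-aligned passes same_offset_mod_shmlba().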
     
  • Fix the following 'make includecheck' warning:

    mm/vmalloc.c: linux/highmem.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default run to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
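
    The figures in the first paragraph can be checked with a few lines of
    arithmetic. The 4KB page size and the 32 bytes of lowmem per KSM page are
    assumptions inferred from the numbers quoted (4GB of pages pinning 32MB),
    and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Default described above: a quarter of all RAM pages. */
static uint64_t ksm_pages_default(uint64_t totalram_pages)
{
    return totalram_pages / 4;
}

/* Lowmem pinned by bookkeeping, given an assumed per-page cost in bytes. */
static uint64_t lowmem_pinned_bytes(uint64_t ksm_pages, uint64_t bytes_per_page)
{
    return ksm_pages * bytes_per_page;
}

/* 16GB machine, 4KB pages, 32 bytes per KSM page (all assumed inputs). */
static void check_16gb_example(void)
{
    uint64_t total_pages = (16ULL << 30) / 4096;
    uint64_t ksm_pages = ksm_pages_default(total_pages);

    assert(ksm_pages * 4096 == (4ULL << 30));                       /* 4GB  */
    assert(lowmem_pinned_bytes(ksm_pages, 32) == (32ULL << 20));    /* 32MB */
}
```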
     

05 Oct, 2009

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits)
    Revert "Seperate read and write statistics of in_flight requests"
    cfq-iosched: don't delay async queue if it hasn't dispatched at all
    block: Topology ioctls
    cfq-iosched: use assigned slice sync value, not default
    cfq-iosched: rename 'desktop' sysfs entry to 'low_latency'
    cfq-iosched: implement slower async initiate and queue ramp up
    cfq-iosched: delay async IO dispatch, if sync IO was just done
    cfq-iosched: add a knob for desktop interactiveness
    Add a tracepoint for block request remapping
    block: allow large discard requests
    block: use normal I/O path for discard requests
    swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL
    fs/bio.c: move EXPORT* macros to line after function
    Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
    cciss: fix build when !PROC_FS
    block: Do not clamp max_hw_sectors for stacking devices
    block: Set max_sectors correctly for stacking devices
    cciss: cciss_host_attr_groups should be const
    cciss: Dynamically allocate the drive_info_struct for each logical drive.
    cciss: Add usage_count attribute to each logical drive in /sys
    ...

    Linus Torvalds
     

02 Oct, 2009

3 commits

  • In the charge/uncharge/reclaim paths, usage_in_excess is calculated
    repeatedly, and each calculation takes res_counter's spin_lock.

    This patch removes unnecessary calls to res_counter_soft_limit_excess().

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch cleans up and fixes memcg's uncharge soft limit path.

    Problems:
    Now, res_counter_charge()/uncharge() handles softlimit information at
    charge/uncharge, and the softlimit check is done when the per-memcg event
    counter goes over its limit. The per-memcg event counter is updated only
    when memory usage is over the soft limit. Considering hierarchical memcg
    management, ancestors should be taken care of as well.

    Currently, ancestors (the hierarchy) are handled in charge() but not in
    uncharge(). This is not good.

    Problems:
    1. memcg's event counter is incremented only when the softlimit is hit.
    That's bad: it makes the event counter hard to reuse for other purposes.

    2. At uncharge, only the lowest-level res_counter is handled. This is a
    bug. Because the ancestors' event counters are not incremented, children
    should take care of them.

    3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
    Operations under res_counter->lock should be small; no "if" statement is
    better.

    Fixes:
    * Removed the soft_limit_xx pointer and checks in charge and uncharge.
    The do-check-only-when-necessary scheme works well enough without them.

    * Make memcg's event counter increment at every charge/uncharge.
    (the per-cpu area will be accessed soon anyway)

    * All ancestors are checked at soft-limit-check. This is necessary because
    an ancestor's event counter may never be modified, so ancestors should be
    checked at the same time.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
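
    Checking all ancestors at soft-limit-check time amounts to a walk up the
    parent chain. A minimal sketch follows; struct model_cgroup and both
    functions are hypothetical and only illustrate the shape of the walk.

```c
#include <assert.h>
#include <stddef.h>

struct model_cgroup {
    struct model_cgroup *parent;   /* NULL at the root */
    unsigned long usage;
    unsigned long soft_limit;
};

/* How far a single group is above its soft limit (0 if within it). */
static unsigned long soft_limit_excess(const struct model_cgroup *cg)
{
    return cg->usage > cg->soft_limit ? cg->usage - cg->soft_limit : 0;
}

/* Walk from a group up to the root and count every level in excess. */
static int count_levels_in_excess(const struct model_cgroup *cg)
{
    int n = 0;

    for (; cg != NULL; cg = cg->parent)
        if (soft_limit_excess(cg) > 0)
            n++;
    return n;
}

/* A child within its limit can still have an ancestor over its own. */
static void check_ancestor_is_seen(void)
{
    struct model_cgroup root = { NULL, 100, 50 };
    struct model_cgroup child = { &root, 10, 50 };

    assert(soft_limit_excess(&child) == 0);
    assert(count_levels_in_excess(&child) == 1);
}
```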
     
  • __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
    with mz->mem->css's refcnt incremented. The caller of this function then
    has to call css_put(mz->mem->css).

    But mz can be !NULL even in the "not found" case, i.e. without css_get().
    Because of this, css->refcnt can go negative.

    This may cause various problems; one result is an
    infinite loop in css_tryget(), as shown here.

    INFO: RCU detected CPU 0 stall (t=10000 jiffies)
    sending NMI to all CPUs:
    NMI backtrace for cpu 0
    CPU 0:

    trace_hardirqs_off+0xd/0x10
    flat_send_IPI_mask+0x90/0xb0
    flat_send_IPI_all+0x69/0x70
    arch_trigger_all_cpu_backtrace+0x62/0xa0
    __rcu_pending+0x7e/0x370
    rcu_check_callbacks+0x47/0x130
    update_process_times+0x46/0x70
    tick_sched_timer+0x60/0x160
    ? tick_sched_timer+0x0/0x160
    __run_hrtimer+0xba/0x150
    hrtimer_interrupt+0xd5/0x1b0
    ? trace_hardirqs_off_thunk+0x3a/0x3c
    smp_apic_timer_interrupt+0x6d/0x9b
    apic_timer_interrupt+0x13/0x20
    ? mem_cgroup_walk_tree+0x156/0x180
    ? mem_cgroup_walk_tree+0x73/0x180
    ? mem_cgroup_walk_tree+0x32/0x180
    ? mem_cgroup_get_local_stat+0x0/0x110
    ? mem_control_stat_show+0x14b/0x330
    ? cgroup_seqfile_show+0x3d/0x60

    The above shows CPU0 caught in css_tryget()'s infinite loop because
    of a bad refcnt.

    The fix is to set mz=NULL at the top of the retry path.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
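
    The stale-pointer problem described above can be reproduced in a small
    model. This is a sketch only: struct model_node, largest_node() and
    put_node() are hypothetical, an array stands in for the tree, and the
    'fixed' flag switches the reset at the top of the retry path on or off.

```c
#include <assert.h>
#include <stddef.h>

struct model_node {
    int refcnt;       /* references handed out to callers */
    int over_limit;   /* non-zero if the node is worth returning */
};

/* Returns the last acceptable node with a reference taken, or NULL. */
static struct model_node *largest_node(struct model_node **tree, size_t *len,
                                       int fixed)
{
    struct model_node *mz = NULL;

retry:
    if (fixed)
        mz = NULL;              /* the fix: every retry starts as "not found" */
    if (*len == 0)
        goto done;              /* nothing left: buggy path keeps the old mz */
    mz = tree[--*len];          /* take and remove the rightmost entry */
    if (!mz->over_limit)
        goto retry;             /* rejected, and no reference was taken */
    mz->refcnt++;
done:
    return mz;
}

/* Callers drop the reference they believe they were given. */
static void put_node(struct model_node *mz)
{
    if (mz)
        mz->refcnt--;
}

/* One rejected node, then an empty tree: only the buggy path underflows. */
static void check_refcnt_balance(void)
{
    struct model_node a = { 0, 0 };
    struct model_node *tree[1] = { &a };
    size_t len = 1;

    put_node(largest_node(tree, &len, 0));
    assert(a.refcnt == -1);

    a.refcnt = 0;
    len = 1;
    put_node(largest_node(tree, &len, 1));
    assert(a.refcnt == 0);
}
```

    In the buggy run the caller receives a pointer for which no reference
    was taken, so its put drives the count below zero, matching the negative
    refcnt described in the message.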