10 May, 2011

2 commits

  • Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you had
    downward stack expansion. But there was another case where IA64 and
    PA-RISC expand mappings: upward expansion.

    This fixes that case too.

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The Linux kernel excludes the guard page when performing mlock on a VMA
    with a down-growing stack. However, some architectures have an up-growing
    stack, and the guard page should be excluded from locking in this case too.

    This patch fixes lvm2 on PA-RISC (and possibly other architectures with
    an up-growing stack). lvm2 calculates the number of used pages when
    locking and when unlocking and reports an internal error if the numbers
    mismatch.

    [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
    grows-up case, and to move things around a bit to clean it all up and
    share the infrastructure with the /proc bits.

    Tested on ia64 that has both grow-up and grow-down segments - Linus ]

    Signed-off-by: Mikulas Patocka
    Tested-by: Tony Luck
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
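
    A minimal sketch of the guard-page exclusion described above, assuming a
    4 KiB page and a simplified model of a stack VMA. The struct and function
    names are hypothetical and are not the kernel's; the real check also
    looks at the neighbouring VMA, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL   /* assumed page size for this sketch */

/* Hypothetical, simplified stand-in for a stack VMA. */
struct vma_model {
    unsigned long start;   /* inclusive */
    unsigned long end;     /* exclusive */
    bool grows_down;
    bool grows_up;
};

/* Narrow [*start, *end) so that it skips the guard page, if any. */
void skip_guard_page(const struct vma_model *vma,
                     unsigned long *start, unsigned long *end)
{
    assert(*start < *end);
    /* Down-growing stack: the guard page is the first page. */
    if (vma->grows_down && *start == vma->start)
        *start += PAGE_SIZE;
    /* Up-growing stack: the guard page is the last page. */
    if (vma->grows_up && *end == vma->end)
        *end -= PAGE_SIZE;
}
```

    Using one helper for both directions is what lets the lock and unlock
    paths (and a /proc-style listing) agree on the same page count.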
     

05 May, 2011

2 commits

  • The logic in __get_user_pages() used to skip the stack guard page lookup
    whenever the caller wasn't interested in seeing what the actual page
    was. But Michel Lespinasse points out that there are cases where we
    don't care about the physical page itself (so 'pages' may be NULL), but
    do want to make sure a page is mapped into the virtual address space.

    So using the existence of the "pages" array as an indication of whether
    to look up the guard page or not isn't actually so great, and we really
    should just use the FOLL_MLOCK bit. But because that bit was only set
    for the VM_LOCKED case (and not all vma's necessarily have it, even for
    mlock()), we couldn't do that originally.

    Fix that by moving the VM_LOCKED check deeper into the call-chain, which
    actually simplifies many things. Now mlock() gets simpler, and we can
    also check for FOLL_MLOCK in __get_user_pages() and the code ends up
    much more straightforward.

    Reported-and-reviewed-by: Michel Lespinasse
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The SLUB allocator use of the cmpxchg_double logic was wrong: it
    actually needs the irq-safe one.

    That happens automatically when we use the native unlocked 'cmpxchg8b'
    instruction, but when compiling the kernel for older x86 CPUs that do
    not support that instruction, we fall back to the generic emulation
    code.

    And if you don't specify that you want the irq-safe version, the generic
    code ends up just open-coding the cmpxchg8b equivalent without any
    protection against interrupts or preemption. Which definitely doesn't
    work for SLUB.

    This was reported by Werner Landgraf, who saw
    instability with his distro-kernel that was compiled to support pretty
    much everything under the sun. Most big Linux distributions tend to
    compile for PPro and later, and would never have noticed this problem.

    This also fixes the prototypes for the irqsafe cmpxchg_double functions
    to use 'bool' like they should.

    [ Btw, that whole "generic code defaults to no protection" design just
    sounds stupid - if the code needs no protection, there is no reason to
    use "cmpxchg_double" to begin with. So we should probably just remove
    the unprotected version entirely as pointless. - Linus ]

    Signed-off-by: Thomas Gleixner
    Reported-and-tested-by: werner
    Acked-and-tested-by: Ingo Molnar
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Jens Axboe
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
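
    A rough userspace model of why the emulated path needs protection: the
    compare and the two stores are separate steps, so nothing may run in
    between them. The model_irq_save()/model_irq_restore() helpers are
    hypothetical stand-ins for disabling and restoring interrupts; this is
    not the kernel's generic cmpxchg_double code.

```c
#include <assert.h>
#include <stdbool.h>

struct word_pair {
    unsigned long first;
    unsigned long second;
};

/* Hypothetical stand-ins for local_irq_save()/local_irq_restore(). */
int model_irqs_off;
void model_irq_save(void)    { model_irqs_off++; }
void model_irq_restore(void) { model_irqs_off--; }

/*
 * Emulated double-word compare-and-exchange. Returns true when both
 * words matched and were replaced. The compare and the stores are only
 * safe as a unit because they sit inside the protected section.
 */
bool cmpxchg_double_model(struct word_pair *p,
                          unsigned long old1, unsigned long old2,
                          unsigned long new1, unsigned long new2)
{
    bool swapped = false;

    assert(p);
    model_irq_save();
    if (p->first == old1 && p->second == old2) {
        p->first = new1;
        p->second = new2;
        swapped = true;
    }
    model_irq_restore();
    return swapped;
}
```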
     

29 Apr, 2011

3 commits

  • With transparent hugepage support, handle_mm_fault() has to be careful
    that a normal PMD has been established before handling a PTE fault. To
    achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
    pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is
    called once it is known the PMD is safe.

    pte_alloc_map() is smart enough to check if a PTE is already present
    before calling __pte_alloc but this check was lost. As a consequence,
    PTEs may be allocated unnecessarily and the page table lock taken. This
    useless PTE does get cleaned up but it's a performance hit which is
    visible in page_test from aim9.

    This patch simply re-adds the check normally done by pte_alloc_map to
    check if the PTE needs to be allocated before taking the page table lock.
    The effect is noticeable in page_test from aim9.

    AIM9
                    2.6.38-vanilla     2.6.38-checkptenone
    creat-clo       446.10 ( 0.00%)    424.47 (-5.10%)
    page_test        38.10 ( 0.00%)     42.04 ( 9.37%)
    brk_test         52.45 ( 0.00%)     51.57 (-1.71%)
    exec_test       382.00 ( 0.00%)    456.90 (16.39%)
    fork_test        60.11 ( 0.00%)     67.79 (11.34%)

    MMTests Statistics: duration
    Total Elapsed Time (seconds)    611.90    612.22

    (While this affects 2.6.38, it is a performance rather than a functional
    bug and normally outside the rules for -stable. While the big performance
    differences are to a microbench, the difference in fork and exec
    performance may be significant enough that -stable wants to consider the
    patch)

    Reported-by: Raz Ben Yehuda
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
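
    The re-added check can be sketched as a cheap test in front of the
    expensive path. This is a simplified model with hypothetical names; a
    counter stands in for the cost of allocating a PTE page and taking the
    page table lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a PMD entry that may point at a PTE table. */
struct pmd_model {
    void *pte_table;
};

static char fake_pte_table[64];
int slow_path_calls;   /* how often the expensive path was entered */

/* Expensive path: stands in for allocating and taking the lock. */
int pte_alloc_slow_model(struct pmd_model *pmd)
{
    slow_path_calls++;
    if (pmd->pte_table == NULL)
        pmd->pte_table = fake_pte_table;
    return 0;
}

/* Only enter the slow path when no PTE table is present yet. */
int pte_alloc_checked_model(struct pmd_model *pmd)
{
    assert(pmd);
    if (pmd->pte_table != NULL)
        return 0;
    return pte_alloc_slow_model(pmd);
}
```

    Repeated faults on an already populated entry then never reach the slow
    path, which is the saving the page_test numbers reflect.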
     
  • PTE pages eat up memory just like anything else, but we do not account for
    them in any way in the OOM scores. They are also _guaranteed_ to get
    freed up when a process is OOM killed, while RSS is not.

    Reported-by: Dave Hansen
    Signed-off-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
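
    As a worked example of the accounting change, a hedged sketch of a score
    that counts page table pages alongside RSS and swap entries. The
    function name and the scaling to a 0..1000 range are assumptions made
    for illustration; the point is only that PTE pages enter the sum.

```c
#include <assert.h>

/*
 * Hypothetical badness model: all inputs are in pages. PTE pages are
 * added to the sum because they are freed when the task is killed.
 */
unsigned long oom_points_model(unsigned long rss_pages,
                               unsigned long pte_pages,
                               unsigned long swap_entries,
                               unsigned long total_pages)
{
    unsigned long points;

    assert(total_pages > 0);
    points = rss_pages + pte_pages + swap_entries;
    return points * 1000 / total_pages;   /* assumed normalization */
}
```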
     
  • The huge_memory.c THP page fault was allowed to run if vm_ops was null
    (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
    setup a special vma->vm_ops and it would fallback to regular anonymous
    memory) but other THP logics weren't fully activated for vmas with vm_file
    not NULL (/dev/zero has a not NULL vma->vm_file).

    So this removes the vm_file checks so that /dev/zero also can safely use
    THP (the other albeit safer approach to fix this bug would have been to
    prevent the THP initial page fault to run if vm_file was set).

    After removing the vm_file checks, this also makes huge_memory.c stricter
    in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
    check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
    under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
    only be allowed to exist before the first page fault, and in turn when
    vma->anon_vma is null (so preventing khugepaged registration). So I tend
    to think the previous comment saying if vm_file was set, VM_PFNMAP might
    have been set and we could still be registered in khugepaged (despite
    anon_vma was not NULL to be registered in khugepaged) was too paranoid.
    The is_linear_pfn_mapping check is also I think superfluous (as described
    by comment) but under DEBUG_VM it is safe to stay.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682

    Signed-off-by: Andrea Arcangeli
    Reported-by: Caspar Zhang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2011

10 commits

  • The conventional format for boolean attributes in sysfs is numeric ("0" or
    "1" followed by new-line). Any boolean attribute can then be read and
    written using a generic function. Using the strings "yes [no]", "[yes]
    no" (read), "yes" and "no" (write) will frustrate this.

    [akpm@linux-foundation.org: use kstrtoul()]
    [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
    Signed-off-by: Ben Hutchings
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Tested-by: David Rientjes
    Cc: NeilBrown
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
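
    A minimal userspace sketch of the numeric convention, assuming
    hypothetical show/store helpers built on snprintf() and strtoul(). It
    is not the sysfs attribute code itself; it only shows the "0\n"/"1\n"
    format and why a flag test result is normalized with !! before printing.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Write "0\n" or "1\n"; any non-zero flag value is shown as 1. */
int bool_attr_show(char *buf, size_t size, int flag)
{
    assert(buf);
    return snprintf(buf, size, "%d\n", !!flag);
}

/* Accept "0" or "1", optionally followed by a single newline. */
int bool_attr_store(const char *buf, int *flag)
{
    char *end;
    unsigned long value;

    assert(buf && flag);
    errno = 0;
    value = strtoul(buf, &end, 10);
    if (errno || end == buf)
        return -EINVAL;
    if (*end == '\n')
        end++;
    if (*end != '\0' || value > 1)
        return -EINVAL;
    *flag = (int)value;
    return 0;
}
```

    With this format one generic reader and writer can serve every boolean
    attribute, which the "yes [no]" style strings would prevent.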
     
  • This is an almost-revert of commit 93b43fa ("oom: give the dying task a
    higher priority").

    That commit dramatically improved oom killer logic when a fork-bomb
    occurs. But I've found that it has a nasty corner case. The cpu cgroup
    has a strange default RT runtime: it's 0! That means that if a process
    under a cpu cgroup is promoted to the RT scheduling class, the process
    never runs at all.

    If an admin inserts a !RT process into a cpu cgroup by setting
    rtruntime=0, usually it runs perfectly because a !RT task isn't affected
    by the rtruntime knob. But if it is promoted to an RT task via an explicit
    setscheduler() syscall or an OOM, the task can't run at all. In short,
    the oom killer doesn't work at all if admins are using cpu cgroup and don't
    touch the rtruntime knob.

    Eventually, the kernel may hang when an oom kill occurs. The original
    author Luis and I agreed to disable this logic.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Luis Claudio R. Goncalves
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The all_unreclaimable check in direct reclaim was introduced in 2.6.19
    by the following commit.

    2006 Sep 25; commit 408d8544; oom: use unreclaimable info

    It then went through a strange history. First, the following commit
    broke the logic unintentionally.

    2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
    costly-order allocations

    Two years later, I found an obviously meaningless code fragment and
    restored the original intention with the following commit.

    2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
    return value when priority==0

    But the logic didn't work when a 32-bit highmem system went into
    hibernation, so Minchan slightly changed the algorithm and fixed it.

    2010 Sep 22; commit d1908362; vmscan: check all_unreclaimable
    in direct reclaim path

    But recently, Andrey Vagin found a new corner case. Look:

    struct zone {
    ..
    int all_unreclaimable;
    ..
    unsigned long pages_scanned;
    ..
    }

    zone->all_unreclaimable and zone->pages_scanned are neither atomic
    variables nor protected by a lock. Therefore a zone can end up in a state
    where zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
    the current all_unreclaimable() returns false even though
    zone->all_unreclaimable=1.

    This resulted in the kernel hanging up when executing a loop of the form

    1. fork
    2. mmap
    3. touch memory
    4. read memory
    5. munmap

    as described in
    http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725

    Is this a minor issue that can be ignored? No. Unfortunately, x86 has a
    very small dma zone, which easily reaches zone->all_unreclaimable=1. And
    once it reaches all_unreclaimable=1, it is never restored to
    all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
    DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY always yields 0.
    That means no pages are scanned at all!

    As a result, the oom-killer never works on such systems. In other words,
    we can't use zone->pages_scanned for this purpose. This patch restores
    all_unreclaimable() to use zone->all_unreclaimable as before and, in
    addition, adds an oom_killer_disabled check to avoid reintroducing the
    issue fixed by commit d1908362 ("vmscan: check all_unreclaimable in
    direct reclaim path").

    Reported-by: Andrey Vagin
    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
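
    The two effects described above can be reproduced with a small model.
    The struct and helpers are hypothetical, and the "scanned less than six
    times the reclaimable pages" test is an assumption about how the
    pages_scanned based check worked; DEF_PRIORITY is taken as 12.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12   /* assumed value for kernels of this era */

/* Hypothetical, simplified zone: only the fields discussed above. */
struct zone_model {
    int all_unreclaimable;
    unsigned long pages_scanned;
    unsigned long reclaimable_pages;
};

/* pages_scanned based test: says "reclaimable" whenever little was scanned. */
bool zone_reclaimable_model(const struct zone_model *z)
{
    assert(z);
    return z->pages_scanned < z->reclaimable_pages * 6;
}

/* Number of pages one reclaim pass at the given priority would scan. */
unsigned long scan_target_model(unsigned long lru_pages, int priority)
{
    return lru_pages >> priority;
}
```

    A zone with all_unreclaimable=1 and pages_scanned=0 still passes the
    first test, and a zone with only a few thousand LRU pages gets a scan
    target of zero at DEF_PRIORITY, so pages_scanned never moves.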
     
  • In __access_remote_vm() we need to check that we have found the right
    vma, not the following vma before we try to access it. Otherwise we
    might call the vma's access routine with an address which does not fall
    inside the vma.

    It was discovered on a current kernel but with an unreleased driver,
    from memory it was strace leading to a kernel bad access, but it
    obviously depends on what the access implementation does.

    Looking at other access implementations I only see:

    $ git grep -A 5 vm_operations|grep access
    arch/powerpc/platforms/cell/spufs/file.c- .access = spufs_mem_mmap_access,
    arch/x86/pci/i386.c- .access = generic_access_phys,
    drivers/char/mem.c- .access = generic_access_phys
    fs/sysfs/bin.c- .access = bin_access,

    The spufs one looks like it might behave badly given the wrong vma, it
    assumes vma->vm_file->private_data is a spu_context, and looks like it
    would probably blow up pretty quickly if it wasn't.

    generic_access_phys() only uses the vma to check vm_flags and get the
    mm, and then walks page tables using the address. So it should bail on
    the vm_flags check, or at worst let you access some other VM_IO mapping.

    And bin_access() just proxies to another access implementation.

    Signed-off-by: Michael Ellerman
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
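
    A sketch of the missing check, assuming a lookup that behaves like
    find_vma(): it returns the first region that ends above the address,
    which may start above it too. The array based lookup and all names are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct vma_model {
    unsigned long start;   /* inclusive */
    unsigned long end;     /* exclusive */
};

/* First region with end > addr, or NULL. Regions are sorted by address. */
const struct vma_model *find_vma_model(const struct vma_model *v, size_t n,
                                       unsigned long addr)
{
    assert(v || n == 0);
    for (size_t i = 0; i < n; i++)
        if (addr < v[i].end)
            return &v[i];
    return NULL;
}

/* Same lookup, but rejects a region that only follows the address. */
const struct vma_model *find_vma_containing(const struct vma_model *v,
                                            size_t n, unsigned long addr)
{
    const struct vma_model *vma = find_vma_model(v, n, addr);

    if (vma == NULL || vma->start > addr)
        return NULL;
    return vma;
}
```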
     
  • 5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
    tried to get the whole logic of brk randomization for legacy
    (libc5-based) applications finally right.

    It turns out that the way to detect whether brk has actually been
    randomized in the end or not introduced by that patch still doesn't work
    for those binaries, as reported by Geert:

    : /sbin/init from my old m68k ramdisk exits prematurely.
    :
    : Before the patch:
    :
    : | brk(0x80005c8e) = 0x80006000
    :
    : After the patch:
    :
    : | brk(0x80005c8e) = 0x80005c8e
    :
    : Old libc5 considers brk() to have failed if the return value is not
    : identical to the requested value.

    I don't like it, but currently see no better option than a bit flag in
    task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
    case.

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Reported-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • If you fill up a tmpfs, df was showing

    tmpfs 460800 - - - /tmp

    because of an off-by-one in the max_blocks checks. Fix it so df shows

    tmpfs 460800 460800 0 100% /tmp

    Signed-off-by: Hugh Dickins
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I found it difficult to make sense of transparent huge pages without
    having any counters for its actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The memory hotplug case involves calling build_all_zonelists(), which
    in turn calls into setup_zone_pageset(). The latter is marked
    __meminit while build_all_zonelists() itself has no particular
    annotation. build_all_zonelists() is only handed a non-NULL pointer in
    the case of memory hotplug through an existing __meminit path, so the
    setup_zone_pageset() reference is always safe.

    The options as such are either to flag build_all_zonelists() as __ref (as
    per __build_all_zonelists()), or to simply discard the __meminit
    annotation from setup_zone_pageset().

    Signed-off-by: Paul Mundt
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • If CONFIG_FLATMEM is enabled, pfn is calculated in online_page() more than
    once. It is possible to optimize that and use the value established at
    the beginning of that function.

    Signed-off-by: Daniel Kiper
    Acked-by: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     

13 Apr, 2011

2 commits

  • Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you used
    mremap. But there was another case where we expand mappings hiding in
    plain sight: the automatic stack expansion.

    This fixes that case too.

    This one also found by Robert Święcki, using his nasty system call
    fuzzer tool. Good job.

    Reported-and-tested-by: Robert Święcki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
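
    A hedged sketch of the downward case: when the stack grows down, the
    start address and the page offset both shrink, so the growth must not
    exceed the current offset. The struct is a hypothetical model, a 4 KiB
    page is assumed, and the address is assumed to be page aligned.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

struct vma_model {
    unsigned long start;
    unsigned long end;
    unsigned long pgoff;   /* offset of start, in pages */
};

/* Returns 0 on success, -1 if pgoff would wrap below zero. */
int expand_down_model(struct vma_model *vma, unsigned long address)
{
    unsigned long grow;

    assert(address < vma->start);
    grow = (vma->start - address) >> PAGE_SHIFT;
    if (grow > vma->pgoff)
        return -1;          /* refuse: pgoff - grow would underflow */
    vma->start = address;
    vma->pgoff -= grow;
    return 0;
}
```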
     
  • Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods
    of time") changed mlock() to care about the exact number of pages that
    __get_user_pages() had brought it. Before, it would only care about
    errors.

    And that doesn't work, because we also handled one page specially in
    __mlock_vma_pages_range(), namely the stack guard page. So when that
    case was handled, the number of pages that the function returned was off
    by one. In particular, it could be zero, and then the caller would end
    up not making any progress at all.

    Rather than try to fix up that off-by-one error for the mlock case
    specially, this just moves the logic to handle the stack guard page
    into __get_user_pages() itself, thus making all the counts come out
    right automatically.

    Reported-by: Robert Święcki
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Apr, 2011

1 commit


07 Apr, 2011

1 commit

  • The normal mmap paths all avoid creating a mapping where the pgoff
    inside the mapping could wrap around due to overflow. However, an
    expanding mremap() can take such a non-wrapping mapping and make it
    bigger and cause a wrapping condition.

    Noticed by Robert Swiecki when running a system call fuzzer, where it
    caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A
    vma dumping patch by Hugh then pinpointed the crazy wrapped case.

    Reported-and-tested-by: Robert Swiecki
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
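
    The wrapping condition itself is a plain unsigned overflow test. A
    minimal sketch, assuming 4 KiB pages and a hypothetical helper name; it
    only shows the arithmetic, not where the check sits in the mremap path.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

/*
 * True if a mapping that starts at page offset pgoff and spans
 * len_bytes would run past the largest representable page offset.
 * Unsigned overflow is well defined in C, so the sum simply wraps.
 */
bool pgoff_would_wrap(unsigned long pgoff, unsigned long len_bytes)
{
    unsigned long pages;

    assert(len_bytes > 0);
    pages = len_bytes >> PAGE_SHIFT;
    return pgoff + pages < pgoff;
}
```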
     

31 Mar, 2011

1 commit


30 Mar, 2011

1 commit

  • * 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv:
    FRV: Use generic show_interrupts()
    FRV: Convert genirq namespace
    frv: Select GENERIC_HARDIRQS_NO_DEPRECATED
    frv: Convert cpu irq_chip to new functions
    frv: Convert mb93493 irq_chip to new functions
    frv: Convert mb93093 irq_chip to new function
    frv: Convert mb93091 irq_chip to new functions
    frv: Fix typo from __do_IRQ overhaul
    frv: Remove stale irq_chip.end
    FRV: Do some cleanups
    FRV: Missing node arg in alloc_thread_info_node() macro
    NOMMU: implement access_remote_vm
    NOMMU: support SMP dynamic percpu_alloc
    NOMMU: percpu should use is_vmalloc_addr().

    Linus Torvalds
     

29 Mar, 2011

1 commit

  • Recent vm changes brought in a new function which the core procfs code
    utilizes. So implement it for nommu systems too to avoid link failures.

    Signed-off-by: Mike Frysinger
    Signed-off-by: David Howells
    Tested-by: Simon Horman
    Tested-by: Ithamar Adema
    Acked-by: Greg Ungerer

    Mike Frysinger
     

28 Mar, 2011

2 commits

  • per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an
    address is in the vmalloc() region or not. This is incorrect on NOMMU as
    there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()).

    The correct way to do this is to use is_vmalloc_addr(). This encapsulates the
    vmalloc() region test in MMU mode and just returns 0 in NOMMU mode.

    On FRV in NOMMU mode, the percpu compilation fails without this patch:

    mm/percpu.c: In function 'per_cpu_ptr_to_phys':
    mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function)
    mm/percpu.c:1011: error: (Each undeclared identifier is reported only once
    mm/percpu.c:1011: error: for each function it appears in.)
    mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function)
    mm/percpu.c:1018: warning: control reaches end of non-void function

    Signed-off-by: David Howells

    David Howells
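
    A sketch of the encapsulation described above. MODEL_NOMMU and the
    MODEL_VMALLOC_* range are hypothetical stand-ins chosen for this
    example; real ranges are architecture specific. Callers never name the
    range, so a NOMMU build has nothing undeclared to trip over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Define MODEL_NOMMU to build the variant without a vmalloc region. */
#ifndef MODEL_NOMMU
#define MODEL_VMALLOC_START 0xC0000000UL   /* assumed, for illustration */
#define MODEL_VMALLOC_END   0xE0000000UL   /* assumed, for illustration */
#endif

bool is_vmalloc_addr_model(uintptr_t addr)
{
#ifdef MODEL_NOMMU
    (void)addr;
    return false;   /* no vmalloc region exists in this mode */
#else
    return addr >= MODEL_VMALLOC_START && addr < MODEL_VMALLOC_END;
#endif
}
```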
     
  • Fix mm/memory.c incorrect kernel-doc function notation:

    Warning(mm/memory.c:3718): Cannot understand * @access_remote_vm - access another process' address space
    on line 3718 - I thought it was a doc line

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

25 Mar, 2011

8 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    SLUB: Write to per cpu data when allocating it
    slub: Fix debugobjects with lockless fastpath

    Linus Torvalds
     
  • Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from
    meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
    resulted in build warnings on all architectures that implement their own
    versions of show_mem():

    lib/lib.a(show_mem.o): In function `show_mem':
    show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
    arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

    The fix is to remove __show_mem() and add its argument to show_mem() in
    all implementations to prevent this breakage.

    Architectures that implement their own show_mem() actually don't do
    anything with the argument yet, but they could be made to filter nodes
    that aren't allowed in the current context in the future just like the
    generic implementation.

    Reported-by: Stephen Rothwell
    Reported-by: James Bottomley
    Suggested-by: Andrew Morton
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It turns out that the cmpxchg16b emulation has to access vmalloced
    percpu memory with interrupts disabled. If the memory has never
    been touched before then the fault necessary to establish the
    mapping will not occur and the kernel will fail on boot.

    Fix that by reusing the CONFIG_PREEMPT code that writes the
    cpu number into a field on every cpu. Writing to the per cpu
    area beforehand causes the mapping to be established before we get
    to a cmpxchg16b emulation.

    Tested-by: Ingo Molnar
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • On Thu, 24 Mar 2011, Ingo Molnar wrote:
    > RIP: 0010:[] [] get_next_timer_interrupt+0x119/0x260

    That's a typical timer crash, but you were unable to debug it with
    debugobjects because commit d3f661d6 broke those.

    Cc: Christoph Lameter
    Tested-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Pekka Enberg

    Thomas Gleixner
     
  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

6 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    deal with races in /proc/*/{syscall,stack,personality}
    proc: enable writing to /proc/pid/mem
    proc: make check_mem_permission() return an mm_struct on success
    proc: hold cred_guard_mutex in check_mem_permission()
    proc: disable mem_write after exec
    mm: implement access_remote_vm
    mm: factor out main logic of access_process_vm
    mm: use mm_struct to resolve gate vma's in __get_user_pages
    mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
    mm: arch: make in_gate_area take an mm_struct instead of a task_struct
    mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
    x86: mark associated mm when running a task in 32 bit compatibility mode
    x86: add context tag to mark mm when running a task in 32-bit compatibility mode
    auxv: require the target to be tracable (or yourself)
    close race in /proc/*/environ
    report errors in /proc/*/*map* sanely
    pagemap: close races with suid execve
    make sessionid permissions in /proc/*/task/* match those in /proc/*
    fix leaks in path_lookupat()

    Fix up trivial conflicts in fs/proc/base.c

    Linus Torvalds
     
  • …p_elfcorehdr and saved_max_pfn

    The Xen PV drivers in a crashed HVM guest can not connect to the dom0
    backend drivers because both frontend and backend drivers are still in
    connected state. To run the connection reset function only in case of a
    crashdump, the is_kdump_kernel() function needs to be available for the PV
    driver modules.

    Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into
    kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel()
    usable for modules.

    Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup()
    to early_param(). It adds an address range check from x86 also on ia64
    and powerpc.

    [akpm@linux-foundation.org: additional #includes]
    [akpm@linux-foundation.org: remove elfcorehdr_addr export]
    [akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes]
    Signed-off-by: Olaf Hering <olaf@aepfle.de>
    Cc: Russell King <rmk@arm.linux.org.uk>
    Cc: "Luck, Tony" <tony.luck@intel.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mundt <lethal@linux-sh.org>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Olaf Hering
     
  • When a memcg is oom and current has already received a SIGKILL, then give
    it access to memory reserves with a higher scheduling priority so that it
    may quickly exit and free its memory.

    This is identical to the global oom killer and is done even before
    checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
    selected is guaranteed to have come from userspace; the thread only needs
    access to memory reserves to exit and thus we don't unnecessarily panic
    the machine until the kernel has no last resort to free memory.

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • fs/fuse/dev.c::fuse_try_move_page() does

    (1) remove a page by ->steal()
    (2) re-add the page to page cache
    (3) link the page to LRU if it was not on LRU at (1)

    This implies the page is _on_ LRU when it's added to radix-tree. So, the
    page is added to memory cgroup while it's on LRU, because LRU is lazy and
    no one flushes it.

    This is the same behavior as SwapCache and needs special care:
    - remove the page from LRU before overwriting pc->mem_cgroup.
    - add the page to LRU after overwriting pc->mem_cgroup.

    And we need to take care of pagevec.

    If PageLRU(page) is set before we add PCG_USED bit, the page will not be
    added to memcg's LRU (for a short period). So, regardless of the
    PageLRU(page) value before commit_charge(), we need to check
    PageLRU(page) after commit_charge().

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Miklos Szeredi
    Cc: Balbir Singh
    Reported-by: Daniel Poelzleithner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
    PageReserved because we never store the array on reserved pages (neither
    alloc_pages_exact nor vmalloc use those pages).

    So we can replace the check by a BUG_ON.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we are allocating a single page_cgroup array per memory section
    (stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is
    a correct but memory-inefficient solution because the allocated memory
    (unless we fall back to vmalloc) is not kmalloc friendly:

    - 32b - 16384 entries (20B per entry) fit into 327680B so the
    524288B slab cache is used
    - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
    - 64b - 32768 entries (40B per entry) fit into 2097152 cache

    This is ~37% wasted space per memory section and it adds up for the whole
    memory. On an x86_64 machine it is something like 6MB per 1GB of RAM.

    We can reduce the internal fragmentation by using alloc_pages_exact which
    allocates PAGE_SIZE aligned blocks so we will get down to
    Cc: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
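
    The waste figures above can be checked with a little arithmetic. This
    sketch assumes the slab cache sizes are powers of two (which matches
    the 524288B, 4194304B and 2097152B sizes quoted) and a 4 KiB page; the
    helper names are made up for the example.

```c
#include <assert.h>

/* Round n up to the next power of two. */
unsigned long roundup_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Round n up to a multiple of align. */
unsigned long roundup_to(unsigned long n, unsigned long align)
{
    assert(align > 0);
    return (n + align - 1) / align * align;
}

/* Bytes wasted when the array comes from a power-of-two sized cache. */
unsigned long waste_pow2(unsigned long entries, unsigned long entry_size)
{
    unsigned long bytes = entries * entry_size;

    return roundup_pow2(bytes) - bytes;
}

/* Bytes wasted when the array is rounded up to whole pages only. */
unsigned long waste_exact(unsigned long entries, unsigned long entry_size,
                          unsigned long page_size)
{
    unsigned long bytes = entries * entry_size;

    return roundup_to(bytes, page_size) - bytes;
}
```

    For the 64b case, 32768 entries of 40B waste 786432B per section, which
    is 37.5% of the 2097152B cache. Each section covers 32768 pages of 4 KiB,
    so there are 8 sections per 1GB and 8 * 786432B gives the 6MB quoted.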