25 Dec, 2010

1 commit


23 Dec, 2010

3 commits

  • GCC complained about update_mmu_cache() not being defined in migrate.c.
    Including the appropriate header seems to solve the problem.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong. If it's
    going to do that then it must break out if signal_pending(), otherwise
    it's pretty much guaranteed to degenerate into a busywait loop. Plus we
    *do* want these processes to appear in D state and to contribute to load
    average.

    So it should be TASK_UNINTERRUPTIBLE. -- Andrew Morton

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • del_page_from_lru_list() already called mem_cgroup_del_lru(). So we must
    not call it again. It adds unnecessary overhead.

    It was not a runtime bug because the TestClearPageCgroupAcctLRU() early in
    mem_cgroup_del_lru_list() will prevent any double-deletion, etc.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
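The double-deletion guard this message refers to follows the common test-and-clear idiom: the flag is atomically cleared on the first call, so a second call becomes a no-op rather than a double deletion. A minimal userspace sketch of that idiom (illustrative names, not the memcg code):

```c
#include <assert.h>
#include <stdbool.h>

struct item {
    bool on_lru;   /* stands in for PageCgroupAcctLRU */
    int lru_count; /* stands in for the per-LRU statistics */
};

/* Returns the previous value and clears the flag, like the kernel's
 * test_and_clear_bit() (minus the atomicity, which this demo doesn't need). */
static bool test_and_clear_on_lru(struct item *it)
{
    bool was = it->on_lru;
    it->on_lru = false;
    return was;
}

void del_from_lru(struct item *it)
{
    if (!test_and_clear_on_lru(it))
        return; /* already removed: a second call is safe, just wasted work */
    it->lru_count--;
}
```

The guard makes the redundant call harmless, which is why this was an overhead fix rather than a correctness fix.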
     

22 Dec, 2010

1 commit


16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to setup the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     

15 Dec, 2010

1 commit

  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: Fix panic after nfs_umount()
    nfs: remove extraneous and problematic calls to nfs_clear_request
    nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
    NFS: Fix fcntl F_GETLK not reporting some conflicts
    nfs: Discard ACL cache on mode update
    NFS: Readdir cleanups
    NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
    NFS: Fix a memory leak in nfs_readdir
    Call the filesystem back whenever a page is removed from the page cache
    NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler

    Linus Torvalds
     

07 Dec, 2010

2 commits

  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM / Hibernate: Fix memory corruption related to swap
    PM / Hibernate: Use async I/O when reading compressed hibernation image

    Linus Torvalds
     
  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know of that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
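The save/clear/restore pattern described above can be sketched in a few lines. The helper names are modeled on the pm_restrict_gfp_mask()/pm_restore_gfp_mask() pair this rework introduces; the constants and mask values below are stand-ins, not the kernel's definitions:

```c
#include <assert.h>

#define GFP_IO   0x40u
#define GFP_FS   0x80u
#define GFP_IOFS (GFP_IO | GFP_FS)

static unsigned int gfp_allowed_mask = 0xffu; /* everything allowed normally */
static unsigned int saved_gfp;

void pm_restrict_gfp_mask(void)
{
    saved_gfp = gfp_allowed_mask;
    gfp_allowed_mask &= ~GFP_IOFS; /* no I/O or FS allocations from here on,
                                    * so no swap pages can be recycled */
}

void pm_restore_gfp_mask(void)
{
    gfp_allowed_mask = saved_gfp;  /* undo only once hibernation is over */
}
```

The point of the rework is that the restriction must span the whole hibernation, image saving included, rather than being dropped early.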
     

06 Dec, 2010

1 commit


04 Dec, 2010

1 commit

  • Commit f7cb1933621bce66a77f690776a16fe3ebbc4d58 ("SLUB: Pass active
    and inactive redzone flags instead of boolean to debug functions")
    missed two instances of check_object(). This caused a lot of warnings
    during 'slabinfo -v' finally leading to a crash:

    BUG ext4_xattr: Freepointer corrupt
    ...
    BUG buffer_head: Freepointer corrupt
    ...
    BUG ext4_alloc_context: Freepointer corrupt
    ...
    ...
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] file_sb_list_del+0x1c/0x35
    PGD 79d78067 PUD 79e67067 PMD 0
    Oops: 0002 [#1] SMP
    last sysfs file: /sys/kernel/slab/:t-0000192/validate

    This patch fixes the problem by converting the two missed instances.

    Acked-by: Christoph Lameter
    Signed-off-by: Tero Roponen
    Signed-off-by: Pekka Enberg

    Tero Roponen
     

03 Dec, 2010

6 commits

  • commit 62b61f611e ("ksm: memory hotremove migration only") caused the
    following new lockdep warning.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    -------------------------------------------------------
    bash/1621 is trying to acquire lock:
    ((memory_chain).rwsem){.+.+.+}, at: []
    __blocking_notifier_call_chain+0x69/0xc0

    but task is already holding lock:
    (ksm_thread_mutex){+.+.+.}, at: []
    ksm_memory_callback+0x3a/0xc0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (ksm_thread_mutex){+.+.+.}:
    [] lock_acquire+0xaa/0x140
    [] __mutex_lock_common+0x44/0x3f0
    [] mutex_lock_nested+0x48/0x60
    [] ksm_memory_callback+0x3a/0xc0
    [] notifier_call_chain+0x8c/0xe0
    [] __blocking_notifier_call_chain+0x7e/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x1cc/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    -> #0 ((memory_chain).rwsem){.+.+.+}:
    [] __lock_acquire+0x155a/0x1600
    [] lock_acquire+0xaa/0x140
    [] down_read+0x51/0xa0
    [] __blocking_notifier_call_chain+0x69/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x56e/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex
    have an outer lock (mem_hotplug_mutex). So they cannot deadlock.

    Thus, this patch annotates ksm_thread_mutex as not being a deadlock
    source.

    [akpm@linux-foundation.org: update comment, from Hugh]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently hwpoison is using lock_system_sleep() to prevent a race with
    memory hotplug. However lock_system_sleep() is a no-op if
    CONFIG_HIBERNATION=n. Therefore we need a new lock.

    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Kamezawa Hiroyuki
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • On stock 2.6.37-rc4, running:

    # mount lilith:/export /mnt/lilith
    # find /mnt/lilith/ -type f -print0 | xargs -0 file

    crashes the machine fairly quickly under Xen. Often it results in oops
    messages, but the couple of times I tried just now, it just hung quietly
    and made Xen print some rude messages:

    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
    3000000000000000) for mfn 1d7058 (pfn 18fa7)
    (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
    1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
    (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04

    Which means the domain tried to map a pagetable page RW, which would
    allow it to map arbitrary memory, so Xen stopped it. This is because
    vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
    finished with them, and those pages got recycled as pagetable pages
    while still having these RW aliases.

    Removing those mappings immediately removes the Xen-visible aliases, and
    so it has no problem with those pages being reused as pagetable pages.
    Deferring the TLB flush doesn't upset Xen because it can flush the TLB
    itself as needed to maintain its invariants.

    When unmapping a region in the vmalloc space, clear the ptes
    immediately. There's no point in deferring this because there's no
    amortization benefit.

    The TLBs are left dirty, and they are flushed lazily to amortize the
    cost of the IPIs.

    The specific motivation for this patch is an oops-causing regression
    since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
    of vm_map_ram() introduced in 56e4ebf877b60 ("NFS: readdir with vmapped
    pages"). XFS also uses vm_map_ram() and could cause similar problems.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Nick Piggin
    Cc: Bryan Schumaker
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • The nr_dirty_[background_]threshold fields are misplaced before the
    numa_* fields, and users will read strange values.

    The listing below shows the right order. Before this patch,
    nr_dirty_background_threshold read as 0 (the value from numa_miss).

    numa_hit 128501
    numa_miss 0
    numa_foreign 0
    numa_interleave 7388
    numa_local 128501
    numa_other 0
    nr_dirty_threshold 144291
    nr_dirty_background_threshold 72145

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
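The bug class here is a pair of parallel tables falling out of sync: the name table and the value table must stay in the same order, or readers pair names with the wrong numbers. A toy illustration (made-up figures, not the vmstat code):

```c
#include <assert.h>
#include <string.h>

static const char *names[] = {
    "numa_hit", "numa_miss", "nr_dirty_threshold",
};
static const long values_wrong[] = {   /* threshold value misplaced first */
    144291, 128501, 0,
};
static const long values_right[] = {   /* same order as names[] */
    128501, 0, 144291,
};

/* Pair each name with the value in the same slot, as /proc readers do. */
long lookup(const long *values, const char *name)
{
    for (unsigned i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (strcmp(names[i], name) == 0)
            return values[i];
    return -1;
}
```

With the misordered table, the dirty-threshold name pairs with a numa slot and reads as 0, which is exactly the symptom the message describes.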
     
  • find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
    free_pid() reclaiming pid.

    Signed-off-by: Zeng Zhaoming
    Cc: "Paul E. McKenney"
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zeng Zhaoming
     
  • Have hugetlb_fault() call unlock_page(page) only if it had previously
    called lock_page(page).

    Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite,
    resulted in the tripping of VM_BUG_ON(!PageLocked(page)) in
    unlock_page() having been called by hugetlb_fault() when page ==
    pagecache_page. This patch remedied the problem.

    Signed-off-by: Dean Nelson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
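The shape of the fix is the classic "remember whether we took the lock, and only drop it in that case" pattern. A single-threaded sketch (illustrative names, not the hugetlb code):

```c
#include <assert.h>
#include <stdbool.h>

struct page { bool locked; };

static void lock_page(struct page *p)   { assert(!p->locked); p->locked = true; }
static void unlock_page(struct page *p) { assert(p->locked);  p->locked = false; }

void fault_path(struct page *page, struct page *pagecache_page)
{
    bool locked_here = false;

    if (page != pagecache_page) {   /* pagecache_page is locked by the caller */
        lock_page(page);
        locked_here = true;
    }
    /* ... fault handling ... */
    if (locked_here)                /* never unlock a page we didn't lock */
        unlock_page(page);
}
```

Without the locked_here bookkeeping, the page == pagecache_page case unlocks a page this function never locked, which is what tripped the VM_BUG_ON.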
     

02 Dec, 2010

1 commit

  • NFS needs to be able to release objects that are stored in the page
    cache once the page itself is no longer visible from the page cache.

    This patch adds a callback to the address space operations that allows
    filesystems to perform page cleanups once the page has been removed
    from the page cache.

    Original patch by: Linus Torvalds
    [trondmy: cover the cases of invalidate_inode_pages2() and
    truncate_inode_pages()]
    Signed-off-by: Trond Myklebust

    Linus Torvalds
     

25 Nov, 2010

6 commits

  • Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range()") introduced a check for whether a vma is a hugetlbfs
    one; later, in 5dc37642 ("mm hugetlb: add hugepage support to pagemap"),
    the check was moved under #ifdef CONFIG_HUGETLB_PAGE, but a needless
    find_vma() call was left behind and its result is not used anywhere else
    in the function.

    The side-effect of caching vma for @addr inside walk->mm is neither
    utilized in walk_page_range() nor in called functions.

    Signed-off-by: David Sterba
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Matt Mackall
    Acked-by: Mel Gorman
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Sterba
     
  • … under stop_machine_run()

    During memory hotplug, build_all_zonelists() may be called under
    stop_machine_run(). In this function, setup_zone_pageset() is called,
    but this is a bug because it does page allocation under
    stop_machine_run().

    Here is a report from Alok Kataria.

    BUG: sleeping function called from invalid context at kernel/mutex.c:94
    in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
    Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
    Call Trace:
    [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
    [<ffffffff81468245>] mutex_lock+0x24/0x50
    [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
    [<ffffffff81048888>] ? load_balance+0xbe/0x60e
    [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
    [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
    [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
    [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
    [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
    [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
    [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
    [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
    [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
    [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
    [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
    [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
    [<ffffffff81065f29>] kthread+0x7f/0x87
    [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
    [<ffffffff81065eaa>] ? kthread+0x0/0x87
    [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
    Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
    Policy zone: Normal

    This patch tries to fix the issue by moving setup_zone_pageset() out of
    stop_machine_run(). It obviously does not need to be called under
    stop_machine_run().

    [akpm@linux-foundation.org: remove unneeded local]
    Reported-by: Alok Kataria <akataria@vmware.com>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Petr Vandrovec <petr@vmware.com>
    Cc: Pekka Enberg <penberg@cs.helsinki.fi>
    Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Swap accounting can be enabled via the CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    configuration option, and it is then turned on by default. There is a
    boot option (noswapaccount) which can disable the feature.

    This makes it hard for distributors to enable the configuration option,
    as this feature leads to bigger memory consumption, which is a no-go for
    a general-purpose distribution kernel. On the other hand, swap accounting
    may be very useful for some workloads.

    This patch adds a new configuration option which controls the default
    behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected
    then the feature is turned on by default.

    It also adds a new boot parameter, swapaccount[=1|0], which extends the
    original noswapaccount parameter's semantics with enable/disable logic
    (defaulting to 1 if no value is provided, to stay consistent with
    noswapaccount).

    The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
    enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as
    well).

    Signed-off-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
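The swapaccount[=1|0] semantics described above are simple to state in code: a bare "swapaccount" or "swapaccount=1" enables the feature, "swapaccount=0" disables it. A userspace sketch of the value handling (not the kernel's actual __setup handler, and treating any value other than "1" as disable, which is an assumption of this demo):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* arg is the text after "swapaccount=", or NULL/"" if no value was given. */
bool parse_swapaccount(const char *arg)
{
    if (arg == NULL || *arg == '\0')
        return true;               /* no value: default to enabled */
    return strcmp(arg, "1") == 0;  /* "1" enables, "0" (or other) disables */
}
```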
     
  • __mem_cgroup_try_charge() can be called under down_write(&mmap_sem)
    (e.g. mlock does it). This means it can cause a deadlock if it races
    with move charge:

    Ex.1)
    move charge | try charge
    --------------------------------------+------------------------------
    mem_cgroup_can_attach() | down_write(&mmap_sem)
    mc.moving_task = current | ..
    mem_cgroup_precharge_mc() | __mem_cgroup_try_charge()
    mem_cgroup_count_precharge() | prepare_to_wait()
    down_read(&mmap_sem) | if (mc.moving_task)
    -> cannot acquire the lock | -> true
    | schedule()

    Ex.2)
    move charge | try charge
    --------------------------------------+------------------------------
    mem_cgroup_can_attach() |
    mc.moving_task = current |
    mem_cgroup_precharge_mc() |
    mem_cgroup_count_precharge() |
    down_read(&mmap_sem) |
    .. |
    up_read(&mmap_sem) |
    | down_write(&mmap_sem)
    mem_cgroup_move_task() | ..
    mem_cgroup_move_charge() | __mem_cgroup_try_charge()
    down_read(&mmap_sem) | prepare_to_wait()
    -> cannot acquire the lock | if (mc.moving_task)
    | -> true
    | schedule()

    To avoid this deadlock, we do all the move charge work (both
    can_attach() and attach()) under one mmap_sem section. Also, after this
    patch, we set/clear mc.moving_task outside mc.lock, because we use the
    lock only to check mc.from/to.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Fix this:

    kernel BUG at mm/memcontrol.c:2155!
    invalid opcode: 0000 [#1]
    last sysfs file:

    Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
    EIP: 0060:[] EFLAGS: 00000246 CPU: 0
    EIP is at mem_cgroup_move_account+0xe2/0xf0
    EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
    ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
    DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
    Stack:
    00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
    08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
    c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
    Call Trace:
    [] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? walk_page_range+0xee/0x1d0
    [] ? mem_cgroup_move_task+0x66/0x90
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? mem_cgroup_move_task+0x0/0x90
    [] ? cgroup_attach_task+0x136/0x200
    [] ? cgroup_tasks_write+0x48/0xc0
    [] ? cgroup_file_write+0xde/0x220
    [] ? do_page_fault+0x17d/0x3f0
    [] ? alloc_fd+0x2d/0xd0
    [] ? cgroup_file_write+0x0/0x220
    [] ? vfs_write+0x92/0xc0
    [] ? sys_write+0x41/0x70
    [] ? syscall_call+0x7/0xb
    Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
    EIP: [] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
    ---[ end trace 7daa1582159b6532 ]---

    lock_page_cgroup and unlock_page_cgroup are implemented using
    bit_spinlock. bit_spinlock doesn't touch the bit if we are on non-SMP
    machine, so we can't use the bit to check whether the lock was taken.

    Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
    of PageCgroupLocked to fix it.

    [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
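The failure mode is worth seeing in isolation: on a !SMP kernel, bit_spin_lock() only disables preemption and never touches the bit, so a raw PageCgroupLocked-style bit test reports "unlocked" even while the lock is held. A sketch of why the bit_spin_is_locked()-based check works where the raw test fails (stand-in helpers, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

#define CONFIG_SMP 0   /* pretend we are on a UP kernel */

static unsigned long flags;
static int preempt_count;

void bit_spin_lock_sketch(int bit)
{
    preempt_count++;               /* UP: disabling preemption is the lock */
#if CONFIG_SMP
    flags |= 1ul << bit;           /* only SMP actually sets the bit */
#else
    (void)bit;
#endif
}

bool bit_spin_is_locked_sketch(int bit)
{
#if CONFIG_SMP
    return flags & (1ul << bit);
#else
    (void)bit;
    return preempt_count > 0;      /* the UP fallback the fix relies on */
#endif
}

bool raw_bit_test(int bit)         /* the buggy check */
{
    return flags & (1ul << bit);
}
```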
     
  • Depending on processor speed, page size, and the amount of memory a
    process is allowed to amass, cleanup of a large VM may freeze the system
    for many seconds. This can result in a watchdog timeout.

    Make sure other tasks receive some service when cleaning up large VMs.

    Signed-off-by: Steven J. Magnani
    Cc: Greg Ungerer
    Reviewed-by: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Magnani
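The usual remedy for this kind of long teardown loop is to yield the CPU periodically (cond_resched() in the kernel) so other tasks make progress. A sketch with the yield stubbed out so it is observable; the batch size of 1024 is an arbitrary choice for the demo, not taken from the patch:

```c
#include <assert.h>

static long yields;
static void cond_resched_stub(void) { yields++; }

void free_many_pages(long npages)
{
    for (long i = 0; i < npages; i++) {
        /* ... free one page ... */
        if ((i & 1023) == 1023)    /* every 1024 pages, give others a turn */
            cond_resched_stub();
    }
}
```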
     

15 Nov, 2010

1 commit


14 Nov, 2010

1 commit

  • There are two places that do not release the slub_lock.

    The respective bugs were introduced by the sysfs changes ab4d5ed5 (slub:
    Enable sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 (slub: Allow
    removal of slab caches during boot).

    Acked-by: Christoph Lameter
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Pekka Enberg

    Pavel Emelyanov
     

12 Nov, 2010

4 commits

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0. The radix tree contains two items, one has the index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other item is
    moved to the root of the tree. The place where it used to be is queued for
    deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot, it remains in
    the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has 0 ref
    count
    5. The reader looks at it again, hoping that the item will either be freed or
    the ref count will increase. This never happens, as the slot it is looking
    at will never be updated. Also, this slot can never be reclaimed because
    the reader is holding rcu_read_lock and is in an infinite loop.

    The fix is to generalize the existing "indirect" pointer case, which
    already requires a slot lookup retry, into a general "retry the lookup"
    bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
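The control flow of a "retry the lookup" bit reduces to: a slot may hold a sentinel meaning "this entry moved, restart the walk" rather than a real item, and readers must restart instead of re-reading the stale slot forever. A simplified, single-threaded sketch of just that shape (names and the bounded retry count are illustrative; the real reader restarts the whole tree walk):

```c
#include <assert.h>
#include <stddef.h>

#define RETRY_SENTINEL ((void *)0x1)  /* tagged value, never a valid pointer */

static void *lookup_slot_once(void **slot)
{
    return *slot;
}

void *lookup_with_retry(void **slot, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        void *entry = lookup_slot_once(slot);
        if (entry != RETRY_SENTINEL)
            return entry;          /* a real entry, or NULL for "absent" */
        /* writer is relocating the entry: restart the lookup rather than
         * spinning on this slot, which will never be updated */
    }
    return NULL;
}
```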
     
  • nr_dirty and nr_congested are increased only when the page is dirty. So
    if all pages are clean, both of them will be zero. In this case, we
    should not mark the zone congested.

    Signed-off-by: Shaohua Li
    Reviewed-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
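The logic change is a one-liner: before marking the zone congested, require that some dirty pages were actually encountered. In isolation (illustrative names, not the reclaim code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mark congested only when every dirty page seen was congested,
 * AND there was at least one dirty page to begin with. */
bool should_mark_congested(unsigned long nr_dirty, unsigned long nr_congested)
{
    return nr_dirty != 0 && nr_dirty == nr_congested;
}
```

Without the nr_dirty != 0 term, the all-clean case (0 == 0) would spuriously mark the zone congested.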
     
  • 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
    ran into a NULL dereference in here:

    int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    unsigned long from)
    {
    ----> struct inode *inode = page->mapping->host;

    It looks like page->mapping was the culprit. (xmon trace is below).
    After closer examination, I realized that do_generic_file_read() does a
    find_get_page(), and eventually locks the page before calling
    block_is_partially_uptodate(). However, it doesn't revalidate the
    page->mapping after the page is locked. So, there's a small window
    between the find_get_page() and ->is_partially_uptodate() where the page
    could get truncated and page->mapping cleared.

    We _have_ a reference, so it can't get reclaimed, but it certainly
    can be truncated.

    I think the correct thing is to check page->mapping after the
    trylock_page(), and jump out if it got truncated. This patch has been
    running in the test environment for a month or so now, and we have not
    seen this bug pop up again.

    xmon info:

    1f:mon> e
    cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
    pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
    lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
    sp: c0000002ae36f9f0
    msr: 8000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000378f99e30
    paca = 0xc000000000f66300
    pid = 21946, comm = bash
    1f:mon> r
    R00 = 0025c0500000006d R16 = 0000000000000000
    R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
    R02 = c000000000e8cd80 R18 = ffffffffffffffff
    R03 = c0000000031d0f88 R19 = 0000000000000001
    R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
    R05 = 0000000000000000 R21 = c0000002ae36fa68
    R06 = 0000000000000000 R22 = 0000000000000000
    R07 = 0000000000000001 R23 = c0000002ae36fbb0
    R08 = 0000000000000002 R24 = 0000000000000000
    R09 = 0000000000000000 R25 = c000000362cd3a80
    R10 = 0000000000000000 R26 = 0000000000000002
    R11 = c0000000001e7b60 R27 = 0000000000000000
    R12 = 0000000042000484 R28 = 0000000000000001
    R13 = c000000000f66300 R29 = c0000003bb97b9b8
    R14 = 0000000000000001 R30 = c000000000e28a08
    R15 = 000000000000ffff R31 = c0000000031d0f88
    pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
    lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
    msr = 8000000000009032 cr = 22000488
    ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
    dar = 0000000000000000 dsisr = 40000000
    1f:mon> t
    [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
    [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
    [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
    [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
    [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
    [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
    --- Exception: c00 (System Call) at 00000080a840bc54
    SP (fffca15df30) is in userspace
    1f:mon> di c0000000001e7a6c
    c0000000001e7a6c e9290000 ld r9,0(r9)
    c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
    c0000000001e7a74 e9440008 ld r10,8(r4)
    c0000000001e7a78 78a80020 clrldi r8,r5,32
    c0000000001e7a7c 3c000001 lis r0,1
    c0000000001e7a80 812900a8 lwz r9,168(r9)
    c0000000001e7a84 39600001 li r11,1
    c0000000001e7a88 7c080050 subf r0,r8,r0
    c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
    c0000000001e7a90 7d6b4830 slw r11,r11,r9
    c0000000001e7a94 796b0020 clrldi r11,r11,32
    c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
    c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
    c0000000001e7aa0 7d004214 add r8,r0,r8
    c0000000001e7aa4 79080020 clrldi r8,r8,32
    c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

    Signed-off-by: Dave Hansen
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
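The fix's shape is the standard "re-validate after locking" pattern: the pointer was checked before the lock, so it must be checked again after, because truncation can clear page->mapping in the window between the two. A simplified, single-threaded sketch (not the actual filemap code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page { void *mapping; bool locked; };

bool partially_uptodate(struct page *page, void *expected_mapping)
{
    page->locked = true;                     /* trylock_page() succeeded */
    if (page->mapping != expected_mapping) { /* truncated meanwhile? */
        page->locked = false;
        return false;                        /* bail out; caller falls back */
    }
    /* safe: page->mapping (and mapping->host) may be dereferenced here */
    page->locked = false;
    return true;
}
```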
     
  • The original code had a null dereference if alloc_percpu() failed. This
    was introduced in commit 711d3d2c9bc3 ("memcg: cpu hotplug aware percpu
    count updates")

    Signed-off-by: Dan Carpenter
    Reviewed-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
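The bug class in miniature: an allocation result used without a NULL check. The fix is simply to bail out on failure before anything dereferences the pointer. Illustrative userspace code, not the memcg allocator:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct stats { long counters[4]; };

static struct stats *alloc_stats(size_t n)   /* may return NULL */
{
    return calloc(n, sizeof(struct stats));
}

int init_stats(struct stats **out, size_t n)
{
    struct stats *s = alloc_stats(n);
    if (s == NULL)
        return -1;   /* without this check, callers dereference NULL */
    *out = s;
    return 0;
}
```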
     

10 Nov, 2010

1 commit

  • As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event
    hooks to mprotect()") is fundamentally wrong, as mprotect_fixup() can
    free 'vma' due to merging. Fix the problem by moving the
    perf_event_mmap() hook into mprotect_fixup().

    Note: there's another successful return path from mprotect_fixup() if
    the old flags equal the new flags. We don't, however, need to call
    perf_event_mmap() there because 'perf' already knows the VMA is
    executable.

    Reported-by: Dave Jones
    Analyzed-by: Linus Torvalds
    Cc: Ingo Molnar
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

04 Nov, 2010

1 commit

  • Fix regression introduced by commit 79da826aee6 ("writeback: report
    dirty thresholds in /proc/vmstat").

    The incorrect pointer arithmetic can result in problems like this:

    BUG: unable to handle kernel paging request at 07c06d16
    IP: [] strnlen+0x6/0x20
    Call Trace:
    [] ? string+0x39/0xe0
    [] ? __wake_up_common+0x4b/0x80
    [] ? vsnprintf+0x1ec/0x380
    [] ? seq_printf+0x2e/0x60
    [] ? vmstat_show+0x26/0x30
    [] ? seq_read+0xa6/0x380
    [] ? seq_read+0x0/0x380
    [] ? proc_reg_read+0x5f/0x90
    [] ? vfs_read+0xa1/0x140
    [] ? proc_reg_read+0x0/0x90
    [] ? sys_read+0x41/0x70
    [] ? sysenter_do_call+0x12/0x26

    Reported-by: Tetsuo Handa
    Cc: Michael Rubin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

03 Nov, 2010

1 commit


30 Oct, 2010

1 commit

  • Normal syscall audit doesn't catch the 5th argument of a syscall. It
    also doesn't catch the contents of userland structures pointed to by a
    syscall argument, so for both the old and new mmap(2) ABIs it doesn't
    record the descriptor we are mapping. For the old one it also misses
    the flags.

    Signed-off-by: Al Viro

    Al Viro
     

29 Oct, 2010

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • When a node contains only HighMem memory, slab_node(MPOL_BIND)
    dereferences a NULL pointer.

    [ This code seems to go back all the way to commit 19770b32609b: "mm:
    filter based on a nodemask as well as a gfp_mask". Which was back in
    April 2008, and it got merged into 2.6.26. - Linus ]

    Signed-off-by: Eric Dumazet
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

28 Oct, 2010

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits)
    MN10300: Save frame pointer in thread_info struct rather than global var
    MN10300: Change "Matsushita" to "Panasonic".
    MN10300: Create a defconfig for the ASB2364 board
    MN10300: Update the ASB2303 defconfig
    MN10300: ASB2364: Add support for SMSC911X and SMC911X
    MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA
    MN10300: Generic time support
    MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support
    MN10300: Map userspace atomic op regs as a vmalloc page
    MN10300: And Panasonic AM34 subarch and implement SMP
    MN10300: Delete idle_timestamp from irq_cpustat_t
    MN10300: Make various interrupt priority settings configurable
    MN10300: Optimise do_csum()
    MN10300: Implement atomic ops using atomic ops unit
    MN10300: Make the FPU operate in non-lazy mode under SMP
    MN10300: SMP TLB flushing
    MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control
    MN10300: Make the use of PIDR to mark TLB entries controllable
    MN10300: Rename __flush_tlb*() to local_flush_tlb*()
    MN10300: AM34 erratum requires MMUCTR read and write on exception entry
    ...

    Linus Torvalds
     
  • Replace iterated page_cache_release() with release_pages(), which is
    faster and shorter.

    Needs release_pages() to be exported to modules.

    Suggested-by: Andrew Morton
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch extracts the core logic from mem_cgroup_update_file_mapped() as
    mem_cgroup_update_file_stat() and adds a wrapper.

    As a planned future update, the memory cgroup has to count dirty pages
    to implement dirty_ratio/limit. Moreover, the number of dirty pages is
    required to kick the flusher thread to start writeback. (Currently,
    there is no kick.)

    This patch is preparation for that and makes the implementation of
    other statistics clearer. Just a cleanup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Reviewed-by: Greg Thelen
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • An event counter, MEM_CGROUP_ON_MOVE, is used for a quick check of
    whether a file stat update can be done in an asynchronous manner or
    not. Currently, it uses a percpu counter and for_each_possible_cpu to
    update.

    This patch replaces for_each_possible_cpu with for_each_online_cpu and
    adds the necessary synchronization logic at CPU hotplug.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki