22 Dec, 2011

1 commit


21 Dec, 2011

3 commits

  • Static storage is not required for the struct vmap_area in
    __get_vm_area_node.

    Removing "static" to store this variable on the stack instead.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • An integer overflow will happen on 64bit archs if task's sum of rss,
    swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
    commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and it's no
    longer computed as one expression in unsigned long(rss, swapents, nr_pte
    are unsigned long), where the result value assigned to points(int) is in
    range(1..1000). So there could be an int overflow while computing

    176 points *= 1000;

    and points may have negative value. Meaning the oom score for a mem hog task
    will be one.

    196 if (points
    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frantisek Hrbata
     
  • If the request is to create non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

16 Dec, 2011

1 commit

  • per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
    case to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - sysfs handler
    for per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Cc: stable@kernel.org

    Eugene Surovegin
     

14 Dec, 2011

1 commit


09 Dec, 2011

7 commits

  • Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
    it is fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to freed
    memory is inserted into the vmlist leading to a a crash later in
    get_vmalloc_info().

    This patch adds a check for ____vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the previous
    error path as a warning was already displayed by __vmalloc_area_node()
    before it called vfree in its failure path.

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • setup_zone_migrate_reserve() expects that zone->start_pfn starts at
    pageblock_nr_pages aligned pfn otherwise we could access beyond an
    existing memblock resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid s always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjnnevg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Avoid unlocking and unlocked page if we failed to lock it.

    Signed-off-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times. But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized. So when an
    IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
    tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups. A
    try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
    tight loop too in case of the allocation failing repeatedly, and
    wait_event_freezable_timeout will provide it too.

    khugepaged would still freeze just fine by trying again the next minute
    but it's better if it freezes immediately.

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Use atomic-long operations instead of looping around cmpxchg().

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A shrinker function can return -1, means that it cannot do anything
    without a risk of deadlock. For example prune_super() does this if it
    cannot grab a superblock refrence, even if nr_to_scan=0. Currently we
    interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
    according to this. So the next time around this shrinker can cause
    really big pressure. Let's skip such shrinkers instead.

    Also make total_scan signed, otherwise the check (total_scan < 0) below
    never works.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Dec, 2011

3 commits

  • Some trace shows lots of bdi_dirty=0 lines where it's actually some
    small value if w/o the accounting errors in the per-cpu bdi stats.

    In this case the max pause time should really be set to the smallest
    (non-zero) value to avoid IO queue underrun and improve throughput.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • On a system with 1 local mount and 1 NFS mount, if the NFS server
    becomes not responding when dd to the NFS mount, the NFS dirty pages may
    exceed the global dirty limit and _every_ task involving writing will be
    blocked. The whole system appears unresponsive.

    The workaround is to permit through the bdi's that only has a small
    number of dirty pages. The number chosen (bdi_stat_error pages) is not
    enough to enable the local disk to run in optimal throughput, however is
    enough to make the system responsive on a broken NFS mount. The user can
    then kill the dirtiers on the NFS mount and increase the global dirty
    limit to bring up the local disk's throughput.

    It risks allowing dirty pages to grow much larger than the global dirty
    limit when there are 1000+ mounts, however that's very unlikely to happen,
    especially in low memory profiles.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • We do "floating proportions" to let active devices to grow its target
    share of dirty pages and stalled/inactive devices to decrease its target
    share over time.

    It works well except in the case of "an inactive disk suddenly goes
    busy", where the initial target share may be too small. To mitigate
    this, bdi_position_ratio() has the below line to raise a small
    bdi_thresh when it's safe to do so, so that the disk be feed with enough
    dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

    bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

    balance_dirty_pages() normally does negative feedback control which
    adjusts ratelimit to balance the bdi dirty pages around the target.
    In some extreme cases when that is not enough, it will have to block
    the tasks completely until the bdi dirty pages drop below bdi_thresh.

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

05 Dec, 2011

1 commit

  • Commit 30765b92 ("slab, lockdep: Annotate the locks before using
    them") moves the init_lock_keys() call from after g_cpucache_up =
    FULL, to before it. And overlooks the fact that init_node_lock_keys()
    tests for it and ignores everything !FULL.

    Introduce a LATE stage and change the lockdep test to be
    Cc: Pekka Enberg
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly to avoid it quickly dirtying lots more pages
    at unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow write
    to be interruptible by any other signal because that has larger
    potential of screwing some badly written applications.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

30 Nov, 2011

1 commit


29 Nov, 2011

1 commit


24 Nov, 2011

3 commits

  • show_slab_objects() can trigger NULL dereferences or memory corruption.

    Another cpu can change its c->page to NULL or c->node to NUMA_NO_NODE
    while we use them.

    Use ACCESS_ONCE(c->page) and ACCESS_ONCE(c->node) to make sure this
    cannot happen.

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pekka Enberg

    Eric Dumazet
     
  • The cmpxchg must be irq safe. The fallback for this_cpu_cmpxchg only
    disables preemption which results in per cpu partial page operation
    potentially failing on non x86 platforms.

    This patch fixes the following problem reported by Christian Kujau:

    I seem to hit it with heavy disk & cpu IO is in progress on this
    PowerBook
    G4. Full dmesg & .config: http://nerdbynature.de/bits/3.2.0-rc1/oops/

    I've enabled some debug options and now it really points to slub.c:2166

    http://nerdbynature.de/bits/3.2.0-rc1/oops/oops4m.jpg

    With debug options enabled I'm currently in the xmon debugger, not sure
    what to make of it yet, I'll try to get something useful out of it :)

    Reported-by: Christian Kujau
    Tested-by: Christian Kujau
    Acked-by: Eric Dumazet
    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Add comments about current per_cpu_ptr_to_phys implementation to
    explain why the logic is more complicated than necessary.

    -tj: relocated comment into kerneldoc comment

    Signed-off-by: Dave Young
    Signed-off-by: Tejun Heo

    Dave Young
     

23 Nov, 2011

3 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: remove vm_dirties and task->dirties
    writeback: hard throttle 1000+ dd on a slow USB stick
    mm: Make task in balance_dirty_pages() killable

    Linus Torvalds
     
  • Percpu allocator recorded the cpus which map to the first and last
    units in pcpu_first/last_unit_cpu respectively and used them to
    determine the address range of a chunk - e.g. it assumed that the
    first unit has the lowest address in a chunk while the last unit has
    the highest address.

    This simply isn't true. Groups in a chunk can have arbitrary positive
    or negative offsets from the previous one and there is no guarantee
    that the first unit occupies the lowest offset while the last one the
    highest.

    Fix it by actually comparing unit offsets to determine cpus occupying
    the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu
    to pcpu_low/high_unit_cpu to avoid confusion.

    The chunk address range is used to flush cache on vmalloc area
    map/unmap and decide whether a given address is in the first chunk by
    per_cpu_ptr_to_phys() and the bug was discovered by invalid
    per_cpu_ptr_to_phys() translation for crash_note.

    Kudos to Dave Young for tracking down the problem.

    Signed-off-by: Tejun Heo
    Reported-by: WANG Cong
    Reported-by: Dave Young
    Tested-by: Dave Young
    LKML-Reference:
    Cc: stable @kernel.org

    Tejun Heo
     
  • Currently pcpu_mem_alloc() is implemented always return zeroed memory.
    So rename it to make user like pcpu_get_pages_and_bitmap() know don't
    reinit it.

    Signed-off-by: Bob Liu
    Reviewed-by: Pekka Enberg
    Reviewed-by: Michal Hocko
    Signed-off-by: Tejun Heo

    Bob Liu
     

18 Nov, 2011

2 commits

  • …kernel/git/konrad/xen

    * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-gntalloc: signedness bug in add_grefs()
    xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
    xen-gntdev: integer overflow in gntdev_alloc_map()
    xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.
    xen/balloon: Avoid OOM when requesting highmem
    xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI
    xen: map foreign pages for shared rings by updating the PTEs directly

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add missed trace_block_plug
    paride: fix potential information leak in pg_read()
    bio: change some signed vars to unsigned
    block: avoid unnecessary plug list flush
    cciss: auto engage SCSI mid layer at driver load time
    loop: cleanup set_status interface
    include/linux/bio.h: use a static inline function for bio_integrity_clone()
    loop: prevent information leak after failed read
    block: Always check length of all iov entries in blk_rq_map_user_iov()
    The Windows driver .inf disables ASPM on all cciss devices. Do the same.
    backing-dev: ensure wakeup_timer is deleted
    block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"

    Linus Torvalds
     

17 Nov, 2011

3 commits

  • They are not used any more.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
    on every 1 4KB-page, which means it cannot throttle a task under
    4KB/200ms=20KB/s. So when there are more than 512 dd writing to a
    10MB/s USB stick, its bdi dirty pages could grow out of control.

    Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1)
    means a limit of 4KB/s.

    They can eventually be safeguarded by the global limit check
    (nr_dirty < dirty_thresh). However if someone is also writing to an
    HDD at the same time, it'll get poor HDD write performance.

    We at least want to maintain good write performance for other devices
    when one device is attacked by some "massive parallel" workload, or
    suffers from slow write bandwidth, or somehow get stalled due to some
    error condition (eg. NFS server not responding).

    For a stalled device, we need to completely block its dirtiers, too,
    before its bdi dirty pages grow all the way up to the global limit and
    leave no space for the other functional devices.

    So change the loop exit condition to

    /*
    * Always enforce global dirty limit; also enforce bdi dirty limit
    * if the normal max_pause sleeps cannot keep things under control.
    */
    if (nr_dirty < dirty_thresh &&
    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
    break;

    which can be further simplified to

    if (task_ratelimit)
    break;

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

16 Nov, 2011

5 commits

  • There is no reason why task in balance_dirty_pages() shouldn't be killable
    and it helps in recovering from some error conditions (like when filesystem
    goes in error state and cannot accept writeback anymore but we still want to
    kill processes using it to be able to unmount it).

    There will be follow up patches to further abort the generic_perform_write()
    and other filesystem write loops, to avoid large write + SIGKILL combination
    exceeding the dirty limit and possibly strange OOM.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     
  • If we fail to prepare an anon_vma, the {new, old}_page should be released,
    or they will leak.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit c9f01245 ("oom: remove oom_disable_count") has removed the
    oom_disable_count counter which has been used for early break out from
    oom_badness so we could never select a task with oom_score_adj set to
    OOM_SCORE_ADJ_MIN (oom disabled).

    Now that the counter is gone we are always going through heuristics
    calculation and we always return a non zero positive value. This means
    that we can end up killing a task with OOM disabled because it is
    indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks
    with 3% usage of memory or tasks with oom_score_adj set but OOM enabled.

    Let's break out early if the task should have OOM disabled.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep reports there is potential deadlock for slub node list_lock.
    discard_slab() is called with the lock hold in unfreeze_partials(),
    which could trigger a slab allocation, which could hold the lock again.

    discard_slab() doesn't need hold the lock actually, if the slab is
    already removed from partial list.

    Acked-by: Christoph Lameter
    Reported-and-tested-by: Yong Zhang
    Reported-and-tested-by: Julie Sullivan
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • unfreeze_partials() needs add the page to partial list tail, since such page
    hasn't too many free objects. We now explictly use DEACTIVATE_TO_TAIL for this,
    while DEACTIVATE_TO_TAIL != 1. This will cause performance regression (eg, more
    lock contention in node->list_lock) without below fix.

    Signed-off-by: Shaohua Li
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Shaohua Li
     

11 Nov, 2011

1 commit

  • bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the case of the following dump after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destory().

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---

    Cc: stable@kernel.org
    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe

    Rabin Vincent
     

07 Nov, 2011

3 commits

  • In balance_dirty_pages() task_ratelimit may be not initialized
    (initialization skiped by goto pause), and then used when calling
    tracing hook.

    Fix it by moving the task_ratelimit assignment before goto pause.

    Reported-by: Witold Baryluk
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds