26 Sep, 2010

1 commit

  • Thomas Pollet noticed that the remap_file_pages() system call in
    fremap.c has a potential overflow in the first part of its sanity-check
    if statement, which could cause it to process bogus input parameters.
    Specifically, the pgoff + size arguments could wrap, thereby preventing
    the system call from failing when it should (see the sketch below).

    Reported-by: Thomas Pollet
    Signed-off-by: Larry Woodman
    Signed-off-by: Linus Torvalds

    Larry Woodman
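
    A minimal sketch of the kind of wrap-around check being described,
    assuming pgoff is in pages and size in bytes as in remap_file_pages()
    (illustrative only, not the literal patch):

        /* reject user input whose offset arithmetic wraps around */
        if (pgoff + (size >> PAGE_SHIFT) < pgoff)
            return err;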
     

25 Sep, 2010

1 commit

  • Thomas Pollet points out that the 'end' variable is broken. It was
    computed based on start/size before they were page-aligned, and as such
    doesn't actually match any of the other actions we take. The overflow
    test on end was also redundant, since we had already tested it with the
    properly aligned version.

    So just get rid of it entirely. The one remaining use for that broken
    variable can just use 'start+size' like all the other cases already did
    (a sketch follows below).

    Reported-by: Thomas Pollet
    Signed-off-by: Linus Torvalds

    Linus Torvalds
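
    An illustrative sketch of the resulting shape of the checks: align start
    and size first, then derive the end from the aligned values everywhere,
    with no separate 'end' variable (simplified, not the literal patch):

        start = start & PAGE_MASK;
        size  = size  & PAGE_MASK;

        if (start + size <= start)              /* overflow of aligned values */
            return err;
        if (start < vma->vm_start || start + size > vma->vm_end)
            return err;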
     

24 Sep, 2010

4 commits


23 Sep, 2010

6 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix pcpu_last_unit_cpu

    Linus Torvalds
     
  • If __split_vma fails because of an out of memory condition, the
    anon_vma_chain isn't torn down and freed, potentially leading to rmap
    walks accessing freed vma information; it is also a memory leak.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
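
    A sketch of the shape of the fix: on the __split_vma() error path, undo
    the anon_vma_chain copy before freeing the new vma (structure simplified;
    the exact labels and ordering in the patch may differ):

        if (err) {
            if (new->anon_vma)
                unlink_anon_vmas(new);  /* was missing: undo anon_vma_clone() */
            mpol_put(pol);
            kmem_cache_free(vm_area_cachep, new);
            return err;
        }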
     
  • /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to
    limit the amount of information it emits as much as possible.

    The tasklist dump should be filtered to only those tasks that are eligible
    for oom kill. This is already done for memcg ooms, but this patch extends
    it to both cpuset and mempolicy ooms as well as init.

    In addition to suppressing irrelevant information, this also reduces
    confusion since users currently don't know which tasks in the tasklist
    aren't eligible for kill (such as those attached to cpusets or bound to
    mempolicies with a disjoint set of mems or nodes, respectively) since that
    information is not shown.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
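
    A hedged sketch of the idea in the tasklist dump: skip tasks that are
    not kill candidates for this particular oom. task_is_oom_eligible() and
    dump_one_task() are hypothetical stand-ins for the cpuset/mempolicy/memcg
    checks and the existing printing code:

        struct task_struct *p;

        for_each_process(p) {
            if (!task_is_oom_eligible(p))   /* hypothetical eligibility check */
                continue;                   /* not killable here: keep it out */
            dump_one_task(p);               /* hypothetical: print pid, mm
                                               stats, oom score */
        }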
     
  • M. Vefa Bicakci reported that the 2.6.35 kernel hangs up when hibernating
    on his 32-bit, 3GB-memory machine
    (https://bugzilla.kernel.org/show_bug.cgi?id=16771). He also bisected the
    regression to

    commit bb21c7ce18eff8e6e7877ca1d06c6db719376e3c
    Author: KOSAKI Motohiro
    Date: Fri Jun 4 14:15:05 2010 -0700

    vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

    At first impression, this seemed very strange because the above commit
    only changed a function return value and hibernate_preallocate_memory()
    ignores the return value of shrink_all_memory(). But it is related.

    Now, page allocation from the hibernation code may enter an infinite loop
    if the system has highmem. The reason is that vmscan doesn't handle the
    OOM case carefully enough when oom_killer_disabled is set.

    The problem sequence is as follows:

    1. hibernation
    2. oom_disable
    3. alloc_pages
    4. do_try_to_free_pages
    if (scanning_global_lru(sc) && !all_unreclaimable)
    return 1;

    If kswapd were not frozen, it would set zone->all_unreclaimable to 1,
    shrink_zones() could then return true (i.e. all_unreclaimable is true),
    and alloc_pages could eventually reach the _nopage_ path; in that case
    there would be no problem.

    This patch adds an all_unreclaimable check to protect the direct reclaim
    path, too. It covers the hibernation OOM case and also helps bail out of
    the all_unreclaimable case slightly sooner.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Minchan Kim
    Reported-by: M. Vefa Bicakci
    Reported-by:
    Reviewed-by: Johannes Weiner
    Tested-by:
    Acked-by: Rafael J. Wysocki
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
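
    A rough sketch of the added protection in the direct-reclaim path. The
    helper below is a simplified stand-in for whatever the patch actually
    introduces; it only illustrates the logic of "don't claim progress when
    every allowed zone is already unreclaimable":

        static bool all_unreclaimable(struct zonelist *zonelist,
                                      struct scan_control *sc)
        {
            struct zoneref *z;
            struct zone *zone;

            for_each_zone_zonelist_nodemask(zone, z, zonelist,
                            gfp_zone(sc->gfp_mask), sc->nodemask) {
                if (!populated_zone(zone))
                    continue;
                if (!zone->all_unreclaimable)
                    return false;       /* something is still reclaimable */
            }
            return true;
        }

        /* in do_try_to_free_pages(), instead of the unconditional return 1: */
        if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
            return 1;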
     
  • A task's badness score is roughly a proportion of its rss and swap
    compared to the system's capacity. The scale ranges from 0 to 1000 with
    the highest score chosen for kill. Thus, this scale operates on a
    resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus,
    so the badness score of an admin task using 3% of memory, for example,
    would still be 0.

    It's possible that an exceptionally large number of tasks will combine to
    exhaust all resources but never have a single task that uses more than
    0.1% of RAM and swap (or 3.0% for admin tasks).

    This patch ensures that the badness score of any eligible task is never 0
    so the machine doesn't unnecessarily panic because it cannot find a task
    to kill.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Cc: Nitin Gupta
    Cc: Pekka Enberg
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
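
    A minimal sketch of the guarantee, at the very end of the badness
    calculation (illustrative of the idea rather than a quote of the patch):

        /*
         * Never hand back 0 for an eligible task: a task whose rss+swap
         * rounds down to zero points (or is wiped out by the 3% bonus)
         * must still be selectable, or the OOM killer may find no task
         * to kill and needlessly panic the machine.
         */
        return points ? points : 1;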
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends
    char: Mark /dev/zero and /dev/kmem as not capable of writeback
    bdi: Initialize noop_backing_dev_info properly
    cfq-iosched: fix a kernel OOPs when usb key is inserted
    block: fix blk_rq_map_kern bio direction flag
    cciss: freeing uninitialized data on error path

    Linus Torvalds
     

22 Sep, 2010

1 commit


21 Sep, 2010

2 commits

  • pcpu_first/last_unit_cpu are used to track which cpu has the first and
    last units assigned. This in turn is used to determine the span of a
    chunk for map/unmap cache flushes and whether an address belongs to
    the first chunk or not in per_cpu_ptr_to_phys().

    When the number of possible CPUs isn't a power of two, a chunk may
    contain unassigned units towards the end of a chunk. The logic to
    determine pcpu_last_unit_cpu was incorrect when there was an unused
    unit at the end of a chunk. It failed to ignore the unused unit and
    assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.

    This was discovered by CAI Qian through a kdump failure caused by a
    malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible
    CPUs.

    Signed-off-by: Tejun Heo
    Reported-by: CAI Qian
    Cc: stable@kernel.org

    Tejun Heo
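
    An illustrative sketch of the corrected scan (variable names simplified;
    the real code lives in the first-chunk setup path): record the last unit
    that actually has a CPU assigned, instead of whatever the final map slot
    holds.

        unsigned int cpu, last_cpu = NR_CPUS;
        int unit;

        for (unit = 0; unit < nr_units; unit++) {
            cpu = unit_map[unit];       /* NR_CPUS marks an unused trailing unit */
            if (cpu == NR_CPUS)
                continue;
            if (last_cpu == NR_CPUS)
                pcpu_first_unit_cpu = cpu;
            last_cpu = cpu;
        }
        pcpu_last_unit_cpu = last_cpu;  /* never the NR_CPUS placeholder */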
     
  • Commit 4969c1192d15 ("mm: fix swapin race condition") is now agreed to
    be incomplete. There's a race, not very much less likely than the
    original race envisaged, in which it is further necessary to check that
    the swapcache page's swap has not changed.

    Here's the reasoning: cast in terms of reuse_swap_page(), but probably
    could be reformulated to rely on try_to_free_swap() instead, or on
    swapoff+swapon.

    A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
    comes through the lock_page(page1).

    B, a racing thread of the same process, faults on the same address: does
    page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
    for whatever reason is unlucky not to get the lock any time soon.

    A carries on through do_swap_page(), a write fault, but cannot reuse the
    swap page1 (another reference to swap1). Unlocks the page1 (but B
    doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.

    C, perhaps the parent of A+B, comes in and write faults the same swap
    page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.

    kswapd comes in after some time (B still unlucky) and swaps out some
    pages from A+B and C: it allocates the original swap1 to page2 in A+B,
    and some other swap2 to the original page1 now in C. But does not
    immediately free page1 (actually it couldn't: B holds a reference),
    leaving it in swap cache for now.

    B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
    Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
    given the swap1 which page1 used to have. So B proceeds to insert page1
    into A+B's page_table, though its content now belongs to C, quite
    different from what A wrote there.

    B ought to have checked that page1's swap was still swap1.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
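
    A sketch of the extra check in do_swap_page() once B finally gets the
    page lock, per the reasoning above (close to, though not necessarily
    identical to, the committed change):

        lock_page(page);

        /*
         * The swap entry may have been reused while we slept on the lock
         * (the B-vs-kswapd window described above), so pte_same() alone is
         * not enough: verify this swapcache page still belongs to the swap
         * entry we faulted on.
         */
        if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
            goto out_page;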
     

10 Sep, 2010

15 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Range check cpu in blk_cpu_to_group
    scatterlist: prevent invalid free when alloc fails
    writeback: Fix lost wake-up shutting down writeback thread
    writeback: do not lose wakeup events when forking bdi threads
    cciss: fix reporting of max queue depth since init
    block: switch s390 tape_block and mg_disk to elevator_change()
    block: add function call to switch the IO scheduler from a driver
    fs/bio-integrity.c: return -ENOMEM on kmalloc failure
    bio-integrity.c: remove dependency on __GFP_NOFAIL
    BLOCK: fix bio.bi_rw handling
    block: put dev->kobj in blk_register_queue fail path
    cciss: handle allocation failure
    cfq-iosched: Documentation help for new tunables
    cfq-iosched: blktrace print per slice sector stats
    cfq-iosched: Implement tunable group_idle
    cfq-iosched: Do group share accounting in IOPS when slice_idle=0
    cfq-iosched: Do not idle if slice_idle=0
    cciss: disable doorbell reset on reset_devices
    blkio: Fix return code for mkdir calls

    Linus Torvalds
     
  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
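
    A sketch of the retry described above, in the allocator slow path after
    direct reclaim (surrounding locals assumed from context; illustrative
    only):

        bool drained = false;

        if (unlikely(!did_some_progress))
            return NULL;
    retry:
        page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
                                      high_zoneidx, alloc_flags,
                                      preferred_zone, migratetype);
        /*
         * Reclaim made progress but the allocation still failed: the freed
         * pages may be stuck on remote per-cpu lists. Drain them once and
         * try again before falling through to the OOM path.
         */
        if (!page && !drained) {
            drain_all_pages();
            drained = true;
            goto retry;
        }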
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
    the number of real free pages in the buddy lists, the VM can allocate
    pages below the min watermark, at worst reducing the real number of free
    pages to zero. Even if the OOM killer kills some victim to free memory,
    it may not free memory if the exit path requires a new page, resulting
    in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
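
    A sketch of the snapshot helper described above: fold the not-yet-drained
    per-cpu deltas into the global counter for a more accurate (if costlier)
    reading of a vmstat item.

        static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                enum zone_stat_item item)
        {
            long x = atomic_long_read(&zone->vm_stat[item]);
        #ifdef CONFIG_SMP
            int cpu;

            /* add the per-cpu deltas that have not been folded back yet */
            for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

            if (x < 0)
                x = 0;
        #endif
            return x;
        }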
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
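
    An illustrative, simplified sketch of the reordering in the bulk-free
    path: put the pages on the free lists first, and only then credit
    NR_FREE_PAGES, so a racing watermark check never counts pages that are
    not yet on any list.

        spin_lock(&zone->lock);
        for (i = 0; i < count; i++) {
            struct page *page = list_entry(list->prev, struct page, lru);

            list_del(&page->lru);
            __free_one_page(page, zone, 0, page_private(page));
        }
        /* credit the counter only after the pages really are on the lists */
        __mod_zone_page_state(zone, NR_FREE_PAGES, count);
        spin_unlock(&zone->lock);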
     
  • refresh_zone_stat_thresholds() calculates its parameters based on the
    number of online cpus. It's called at cpu offlining but needs to be
    called at onlining, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
    show a shift since I last tested in December: in part because of firmware
    updates, in part because of the necessary move from barriers to awaiting
    completion at the block layer. While discard at swapon still shows as
    slightly beneficial on both, discarding a 1MB swap cluster when allocating
    is now disadvantageous: it adds 25% overhead on Intel and 230% on OCZ (YMMV).

    Surrender: discard as presently implemented is more hindrance than help
    for swap; but might prove useful on other devices, or with improvements.
    So continue to do the discard at swapon, but make discard while swapping
    conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
    only the lower 16 bits of int flags).

    We can add a --discard or -d to swapon(8), and a "discard" to swap in
    /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Nigel Cunningham
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
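
    A sketch of the opt-in plumbing described above. The flag value and the
    capability check are illustrative (the patch reserves a bit above the
    low 16 already in use; 'discard_swap_works' stands in for the existing
    queue-supports-discard test):

        #define SWAP_FLAG_DISCARD   0x10000 /* discard swap cluster after use */

            /* in sys_swapon(): only enable runtime discard when asked for */
            if (discard_swap_works && (swap_flags & SWAP_FLAG_DISCARD))
                p->flags |= SWP_DISCARDABLE;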
     
  • The swap code already uses synchronous discards, no need to add I/O
    barriers.

    This fixes the worst of the terrible slowdown in swap allocation for
    hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
    eliminate that regression.

    [tj@kernel.org: superfluous newlines removed]
    Signed-off-by: Christoph Hellwig
    Tested-by: Nigel Cunningham
    Signed-off-by: Tejun Heo
    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Move the hibernation check from scan_swap_map() into try_to_free_swap():
    to catch not only the common case when hibernation's allocation itself
    triggers swap reuse, but also the less likely case when concurrent page
    reclaim (shrink_page_list) happens to call try_to_free_swap on a page.

    Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
    reclaim from going to swap: check that to prevent swap reuse too.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
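
    A minimal sketch of try_to_free_swap() with the relocated check, relying
    on hibernation having cleared __GFP_IO from gfp_allowed_mask (close to
    the committed logic, but treat it as an illustration):

        int try_to_free_swap(struct page *page)
        {
            VM_BUG_ON(!PageLocked(page));

            if (!PageSwapCache(page))
                return 0;
            if (PageWriteback(page))
                return 0;
            if (page_swapcount(page))
                return 0;

            /*
             * Hibernation masks __GFP_IO out of gfp_allowed_mask to keep
             * reclaim away from swap; use the same signal here so swap that
             * may already be part of the image is never freed and reused.
             */
            if (!(gfp_allowed_mask & __GFP_IO))
                return 0;

            delete_from_swap_cache(page);
            SetPageDirty(page);
            return 1;
        }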
     
  • Please revert 2.6.36-rc commit d2997b1042ec150616c1963b5e5e919ffd0b0ebf
    "hibernation: freeze swap at hibernation". It complicated matters by
    adding a second swap allocation path, just for hibernation; without in any
    way fixing the issue that it was intended to address - page reclaim after
    fixing the hibernation image might free swap from a page already imaged as
    swapcache, letting its swap be reallocated to store a different page of
    the image: resulting in data corruption if the imaged page were freed as
    clean then swapped back in. Pages freed to si->swap_map were still in
    danger of being reallocated by the alternative allocation path.

    I guess it inadvertently fixed slow SSD swap allocation for hibernation,
    as reported by Nigel Cunningham: by missing out the discards that occur on
    the usual swap allocation path; but that was unintentional, and needs a
    separate fix.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM
    enabled on 2.6.35 (plus some patches targeted at 2.6.36 to perform cache
    maintenance lazily), and the root cause appears to be that the mm bouncing
    code is calling flush_dcache_page before it copies the bounce buffer into
    the bio.

    The bounced page needs to be flushed after data is copied into it, to
    ensure that architecture implementations can synchronize instruction and
    data caches if necessary.

    Signed-off-by: Gary King
    Cc: Tejun Heo
    Cc: Russell King
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary King
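
    A simplified sketch of the ordering being described: flush the
    destination page's dcache only after the bounce data has been copied
    into it, so architectures with aliasing caches see the new contents
    (bio_vec handling condensed; not the literal patch):

        /* copy bounced data back into the original bio page ... */
        vto = kmap_atomic(to->bv_page, KM_BOUNCE_READ);
        memcpy(vto + to->bv_offset, vfrom + from->bv_offset, to->bv_len);
        kunmap_atomic(vto, KM_BOUNCE_READ);

        /* ... and flush only now that the data is actually in the page */
        flush_dcache_page(to->bv_page);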
     
  • next_active_pageblock() is for finding the next _used_ pageblock. It
    skips several blocks when it finds a chunk of free pages larger than a
    pageblock. But it has two bugs:

    1. We have no lock, so page_order(page) - pageblock_order can be negative.
    2. The pageblocks_stride += is wrong; it should skip page_order(p) pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Iram reported that compaction's too_many_isolated() loops forever.
    (http://www.spinics.net/lists/linux-mm/msg08123.html)

    The meminfo when the situation happened showed that inactive anon was
    zero, because the system had seen no memory pressure until then. With all
    anon pages on the active lru, compaction may isolate from the active lru
    as well as from the inactive lru. That is different from vmscan's
    isolation, which is why there are two too_many_isolated() functions.

    While compaction can isolate pages from both the active and inactive
    lists, the current implementation of too_many_isolated() only considers
    the inactive list, and that caused Iram's problem.

    This patch treats active and inactive fairly, because we cannot predict
    which lists compaction will isolate pages from, or how many.

    It changes (nr_isolated > nr_inactive) to
    nr_isolated > (nr_active + nr_inactive) / 2.

    Signed-off-by: Minchan Kim
    Reported-by: Iram Shahzad
    Acked-by: Mel Gorman
    Acked-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
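
    A sketch of the adjusted test in compaction's too_many_isolated()
    (essentially the logic described above):

        static bool too_many_isolated(struct zone *zone)
        {
            unsigned long active, inactive, isolated;

            inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                       zone_page_state(zone, NR_INACTIVE_ANON);
            active   = zone_page_state(zone, NR_ACTIVE_FILE) +
                       zone_page_state(zone, NR_ACTIVE_ANON);
            isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                       zone_page_state(zone, NR_ISOLATED_ANON);

            /* compaction isolates from both LRUs, so weigh against both */
            return isolated > (inactive + active) / 2;
        }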
     
  • COMPACTION enables MIGRATION, but MIGRATION spawns a warning if NUMA or
    memory hotplug aren't selected. However MIGRATION doesn't depend on them.
    I guess it's just trying to be strict, double-checking who is enabling
    it, but it doesn't know that compaction also enables MIGRATION.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The pte_same check is reliable only if the swap entry remains pinned (by
    the page lock on swapcache). We also have to ensure the swapcache isn't
    removed before we take the lock, as try_to_free_swap won't care about
    the page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • So it can be used by all that need to check for that.

    Signed-off-by: Stefan Bader
    Signed-off-by: Linus Torvalds

    Stefan Bader
     

08 Sep, 2010

1 commit


29 Aug, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix get_ticket_handler() error handling
    ceph: don't BUG on ENOMEM during mds reconnect
    ceph: ceph_mdsc_build_path() returns an ERR_PTR
    ceph: Fix warnings
    ceph: ceph_get_inode() returns an ERR_PTR
    ceph: initialize fields on new dentry_infos
    ceph: maintain i_head_snapc when any caps are dirty, not just for data
    ceph: fix osd request lru adjustment when sending request
    ceph: don't improperly set dir complete when holding EXCL cap
    mm: exporting account_page_dirty
    ceph: direct requests in snapped namespace based on nonsnap parent
    ceph: queue cap snap writeback for realm children on snap update
    ceph: include dirty xattrs state in snapped caps
    ceph: fix xattr cap writeback
    ceph: fix multiple mds session shutdown

    Linus Torvalds
     
  • After several hours, kbuild tests hang with anon_vma_prepare() spinning on
    a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
    (which makes this very much more likely, but it could happen without).

    The ever-subtle page_lock_anon_vma() now needs a further twist: since
    anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
    of a reused anon_vma structure at any moment, page_lock_anon_vma()
    needs to check page_mapped() again before succeeding, otherwise
    page_unlock_anon_vma() might address a different root->lock.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Linus Torvalds

    Hugh Dickins
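
    A rough sketch of the extra twist in page_lock_anon_vma() (simplified;
    'root_anon_vma' is a local name used here for illustration): re-check
    page_mapped() after taking the root lock, and back out if the page was
    unmapped in the meantime.

        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = ACCESS_ONCE(anon_vma->root);
        spin_lock(&root_anon_vma->lock);

        /*
         * The anon_vma (and its ->root) may have been reused while we
         * waited: if the page is no longer mapped, this lock may not be
         * the one page_unlock_anon_vma() would later drop. Back out.
         */
        if (!page_mapped(page)) {
            spin_unlock(&root_anon_vma->lock);
            anon_vma = NULL;
        }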
     

27 Aug, 2010

3 commits

  • When pcpu_build_alloc_info() searches for the best_upa value, it ignores
    the current value if the number of wasted units exceeds 1/3 of the total
    number of cpus. But the comment on the code says that it will be ignored
    if wastage is over 25%. Modify the comment to match the code.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Tejun Heo

    Namhyung Kim
     
  • The original code did not free the old map. This patch fixes it.

    tj: use @old as memcpy source instead of @chunk->map, and indentation
    and description update

    Signed-off-by: Huang Shijie
    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org

    Huang Shijie
     
  • This patch fixes the following issue:

    INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
    ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
    ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
    00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
    Call Trace:
    [] schedule_timeout+0x34/0xf1
    [] ? wait_for_common+0x3f/0x130
    [] ? trace_hardirqs_on+0xd/0xf
    [] wait_for_common+0xd2/0x130
    [] ? default_wake_function+0x0/0xf
    [] ? _raw_spin_unlock+0x26/0x2a
    [] wait_for_completion+0x18/0x1a
    [] sync_inodes_sb+0xca/0x1bc
    [] __sync_filesystem+0x47/0x7e
    [] sync_filesystem+0x47/0x4b
    [] generic_shutdown_super+0x22/0xd2
    [] kill_anon_super+0x11/0x4f
    [] nfs4_kill_super+0x3f/0x72 [nfs]
    [] deactivate_locked_super+0x21/0x41
    [] deactivate_super+0x40/0x45
    [] mntput_no_expire+0xb8/0xed
    [] release_mounts+0x9a/0xb0
    [] put_mnt_ns+0x6a/0x7b
    [] nfs_follow_remote_path+0x19a/0x296 [nfs]
    [] nfs4_try_mount+0x75/0xaf [nfs]
    [] nfs4_get_sb+0x276/0x2ff [nfs]
    [] vfs_kern_mount+0xb8/0x196
    [] do_kern_mount+0x48/0xe8
    [] do_mount+0x771/0x7e8
    [] sys_mount+0x83/0xbd
    [] system_call_fastpath+0x16/0x1b

    The reason for this hang was a race condition: when the flusher thread is
    forking a bdi thread, we use 'kthread_run()', so we run it _before_ we
    make it visible in 'bdi->wb.task'. The bdi thread runs, does all the
    work, and goes to sleep. 'bdi->wb.task' is still NULL. And this is a
    dangerous time window.

    If during this window someone queues work for this bdi, they do not see
    the bdi thread and wake up the forker thread instead! But the forker has
    already forked this bdi thread; it just has not made it visible yet!

    The result is that we lose the wake-up event for this bdi thread and the
    NFS4 code waits forever.

    To fix the problem, we should use 'kthread_create()' for creating bdi
    threads, then make them visible in 'bdi->wb.task', and only after this
    wake them up. This is exactly what this patch does.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
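
    A sketch of the ordering the fix establishes: create the thread without
    running it, publish it in bdi->wb.task, and only then wake it (error
    handling and locking simplified):

        struct task_struct *task;

        task = kthread_create(bdi_writeback_thread, &bdi->wb,
                              "flush-%s", dev_name(bdi->dev));
        if (!IS_ERR(task)) {
            /* make the thread visible before anyone can look for it */
            spin_lock_bh(&bdi->wb_lock);
            bdi->wb.task = task;
            spin_unlock_bh(&bdi->wb_lock);

            wake_up_process(task);
        }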
     

25 Aug, 2010

2 commits

  • * '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
    xfs: do not discard page cache data on EAGAIN
    xfs: don't do memory allocation under the CIL context lock
    xfs: Reduce log force overhead for delayed logging
    xfs: dummy transactions should not dirty VFS state
    xfs: ensure f_ffree returned by statfs() is non-negative
    xfs: handle negative wbc->nr_to_write during sync writeback
    writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
    xfs: fix untrusted inode number lookup
    xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
    xfs: unlock items before allowing the CIL to commit

    Linus Torvalds
     
  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page().

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
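
    A sketch of the define trick mentioned above: let VM_GROWSUP expand to 0
    on architectures whose stacks never grow upwards, so tests against the
    flag compile away and check_stack_guard_page() needs no #ifdefs (flag
    value as in the then-current mm.h; treat as illustrative):

        #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
        #define VM_GROWSUP  0x00000200
        #else
        #define VM_GROWSUP  0x00000000  /* never set: the tests optimize out */
        #endif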
     

24 Aug, 2010

1 commit

  • I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
    been. Enabling writeback tracing showed:

    flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0

    Writeback is not terminating in background writeback if ->writepage is
    returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
    writeback on XFS.

    Fix the write_cache_pages loop to terminate correctly when this situation
    occurs and so prevent this sub-optimal background writeback pattern. This
    improves sustained sequential buffered write performance from around
    250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.

    Cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Wu Fengguang
    Reviewed-by: Christoph Hellwig

    Dave Chinner
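
    A sketch of the corrected exit condition in the write_cache_pages() loop
    (simplified): for non-integrity (background) writeback, stop as soon as
    the nr_to_write budget is used up instead of continuing with a negative
    budget.

        if (wbc->nr_to_write > 0) {
            if (--wbc->nr_to_write == 0 &&
                wbc->sync_mode == WB_SYNC_NONE) {
                /* budget exhausted: background writeback is done */
                done = 1;
                break;
            }
        }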
     

23 Aug, 2010

1 commit

  • This allows code outside of the mm core to safely manipulate page state
    and not worry about the other accounting. Not using these routines means
    that some code will lose track of the accounting and we get bugs. This
    has happened once already.

    Signed-off-by: Michael Rubin
    Signed-off-by: Sage Weil

    Michael Rubin