28 Jan, 2011

1 commit

  • In the current rtmutex, the pending owner may be boosted by the tasks
    in the rtmutex's waitlist when the pending owner is deboosted
    or a task in the waitlist is boosted. This boosting is unwarranted,
    because the pending owner has not actually taken the rtmutex yet.

    Example:

    time1:
    A (high prio) owns the rtmutex.
    B (mid prio) and C (low prio) are in the waitlist.

    time2:
    A releases the lock and B becomes the pending owner.
    A (or another high prio task) continues to run. B's prio is lower
    than A's, so B is just queued on the runqueue.

    time3:
    A or another high prio task sleeps, but some time has passed.
    The prios of B and C changed in the period (time2 ~ time3)
    due to boosting or deboosting, and C now has a higher priority
    than B. Is it reasonable that C has to boost B and help B to
    get the rtmutex?

    No. I think boosting B before it really owns the rtmutex is
    unrelated and unneeded. We should give C a chance to beat B and
    win the rtmutex.

    This is the motivation for this patch. This patch *ensures* that
    only the top waiter or a higher priority task can take the lock.

    How?
    1) We don't dequeue the top waiter on unlock. If the top waiter
    changes, the old top waiter will fail and go to sleep again.
    2) A task acquiring the lock gets it when the lock is not taken and
    there is no waiter, OR it has a higher priority than the waiters, OR it is the top waiter.
    3) Whenever the top waiter changes, the new top waiter is woken up.

    The algorithm is much simpler than before, no pending owner, no
    boosting for pending owner.
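
The acquisition rule in step 2 can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel implementation: `struct task`, `struct lock_model` and `can_take_lock()` are hypothetical names, and a lower `prio` value is assumed to mean a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; a lower 'prio' value means higher priority. */
struct task { int prio; };

struct lock_model {
    const struct task *owner;      /* NULL when the lock is not taken */
    const struct task *top_waiter; /* NULL when the waitlist is empty */
};

/* Returns true if 'task' is allowed to take the lock right now. */
bool can_take_lock(const struct lock_model *lock, const struct task *task)
{
    if (lock->owner != NULL)
        return false;                           /* lock is already taken */
    if (lock->top_waiter == NULL)
        return true;                            /* there is no waiter */
    if (task == lock->top_waiter)
        return true;                            /* task is the top waiter */
    return task->prio < lock->top_waiter->prio; /* higher priority than waiters */
}
```

In this model a lower priority task that is not the top waiter is always refused, which is the property the patch description says it ensures.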

    Other advantages of this patch:
    1) The number of rtmutex states is halved, so the code is easier to read.
    2) The code becomes shorter.
    3) The top waiter is not dequeued until it really takes the lock:
    waiters retain FIFO order when the lock is stolen.

    Neither advantages nor disadvantages:
    1) Even though we may wake up multiple waiters (any time the top waiter changes),
    we hardly cause a "thundering herd":
    the number of woken-up tasks is likely 1 or very small.
    2) Two APIs are changed.
    rt_mutex_owner() will not return the pending owner; it will return NULL when
    the top waiter is going to take the lock.
    rt_mutex_next_owner() always returns the top waiter.
    It will not return NULL if we have waiters,
    because the top waiter is not dequeued.

    I have fixed the code that uses these APIs.

    Needs updating after this patch is accepted:
    1) Document/*
    2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst

    Signed-off-by: Lai Jiangshan
    LKML-Reference:
    Reviewed-by: Steven Rostedt
    Signed-off-by: Steven Rostedt

    Lai Jiangshan
     

26 Jan, 2011

38 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: wacom - pass touch resolution to clients through input_absinfo
    Input: wacom - add 2 Bamboo Pen and touch models
    Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent
    Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup
    Input: tegra-kbc - add tegra keyboard driver
    Input: gpio_keys - switch to using request_any_context_irq
    Input: serio - allow registered drivers to get status flag
    Input: ct82710c - return proper error code for ct82c710_open
    Input: bu21013_ts - added regulator support
    Input: bu21013_ts - remove duplicate resolution parameters
    Input: tnetv107x-ts - don't treat NULL clk as an error
    Input: tnetv107x-keypad - don't treat NULL clk as an error

    Fix up trivial conflicts in drivers/input/keyboard/Makefile due to
    additions of tc3589x/Tegra drivers

    Linus Torvalds
     
  • Also remove fake ABS_RX/ABS_RY "axes" that were used to report physical
    dimensions now that we have a better way.

    Signed-off-by: Ping Cheng
    Reviewed-by: Henrik Rydberg
    Signed-off-by: Dmitry Torokhov

    Ping Cheng
     
  • The -rt patches change the console_semaphore to console_mutex. As a
    result, quite a large chunk of the patches changes all
    acquire/release_console_sem() to acquire/release_console_mutex().

    This commit switches to more neutral function names which don't imply
    anything about the underlying lock.

    The only real change is the return value of console_trylock(), which is
    inverted from try_acquire_console_sem().

    This patch also paves the way to switching console_sem from a semaphore to
    a mutex.
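
The inverted return convention can be illustrated with a small userspace sketch. Everything here is a hypothetical stand-in (the `_model` suffix marks names that are not kernel functions); it only assumes what the message states, namely that the old helper reported success with 0 and the new one reports success with 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the console lock; not kernel code. */
static bool console_locked_model;

/* Old convention: 0 on success, non-zero on failure. */
int try_acquire_console_sem_model(void)
{
    if (console_locked_model)
        return 1;
    console_locked_model = true;
    return 0;
}

/* New convention: 1 on success, 0 on failure. */
int console_trylock_model(void)
{
    return !try_acquire_console_sem_model();
}

void console_unlock_model(void)
{
    console_locked_model = false;
}
```

A caller converted from the old name therefore has to flip its test: `if (!try_acquire...)` becomes `if (console_trylock...)`.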

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert]
    Signed-off-by: Torben Hohn
    Cc: Thomas Gleixner
    Cc: Greg KH
    Cc: Ingo Molnar
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Torben Hohn
     
  • Fix potential use of uninitialised variable caused by recent
    decompressor code optimisations.

    In zlib_uncompress (zlib_wrapper.c) we have

        int zlib_err, zlib_init = 0;
        ...
        do {
            ...
            if (avail == 0) {
                offset = 0;
                put_bh(bh[k++]);
                continue;
            }
            ...
            zlib_err = zlib_inflate(stream, Z_SYNC_FLUSH);
            ...
        } while (zlib_err == Z_OK);

    If continue is executed (avail == 0) then the while condition will be
    evaluated testing zlib_err, which is uninitialised the first time around
    the loop.

    Fix this by getting rid of the 'if (avail == 0)' condition test. This
    edge condition should not be handled in the decompressor code;
    instead, handle it generically in the caller code.

    Similarly for xz_wrapper.c.

    Incidentally, on most architectures (bar Mips and Parisc), no
    uninitialised variable warning is generated by gcc, this is because the
    while condition test on continue is optimised out and not performed
    (when executing continue zlib_err has not been changed since entering
    the loop, and logically if the while condition was true previously, then
    it's still true).
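
The underlying C rule is that `continue` inside a do/while jumps to the loop condition rather than to the top of the body. The sketch below counts how often the condition is evaluated; `count_condition_checks()` is a hypothetical helper, and `status` (standing in for zlib_err) is deliberately initialised so the demonstration itself is well defined.

```c
#include <assert.h>

/*
 * Counts how many times the loop condition is evaluated when the first
 * 'iterations_with_continue' passes through the body execute 'continue'.
 */
int count_condition_checks(int iterations_with_continue)
{
    int status = 0;   /* stands in for zlib_err, initialised here on purpose */
    int checks = 0;
    int i = 0;

    do {
        if (i++ < iterations_with_continue)
            continue;   /* jumps to the while condition below */
        status = 1;     /* stands in for zlib_err = zlib_inflate(...) */
    } while (++checks && status == 0);

    return checks;
}
```

Every pass that takes the `continue` still evaluates the condition, so in the original code zlib_err is read before zlib_inflate() has ever assigned it.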

    Signed-off-by: Phillip Lougher
    Reported-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phillip Lougher
     
  • Executed command: fsstress -d /mnt -n 600 -p 850

    crash> bt
    PID: 7947 TASK: ffff880160546a70 CPU: 0 COMMAND: "fsstress"
    #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
    #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
    #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
    #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
    #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
    #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
    #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
    [exception RIP: __lookup_tag+100]
    RIP: ffffffff812274b4 RSP: ffff88016056b998 RFLAGS: 00000287
    RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000006
    RDX: 000000000000001d RSI: ffff88016056bb18 RDI: ffff8800c85366e0
    RBP: ffff88016056b9c8 R8: ffff88016056b9e8 R9: 0000000000000000
    R10: 000000000000000e R11: ffff8800c8536908 R12: 0000000000000010
    R13: 0000000000000040 R14: ffffffffffffffc0 R15: ffff8800c85366e0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

    #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
    #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
    #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
    #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
    #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
    #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
    #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
    #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
    #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
    #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
    #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
    #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
    #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
    #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
    #21 [ffff88016056bf30] sys_write at ffffffff81150691
    #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2

    I think the root cause is the following:

    radix_tree_range_tag_if_tagged() always tags the root tag with settag
    if the root tag is set with iftag even if there are no iftag tags
    in the specified range (Of course, there are some iftag tags
    outside the specified range).

    ===============================================================================
    [[[Detailed description]]]

    (1) Why does radix_tree_gang_lookup_tag_slot() never return?

    __lookup_tag():
    - Returns 0.
    - Returns an index which is not bigger than the old one passed as the
    input parameter.

    Therefore the following "while" repeats forever, because the above
    conditions cause "ret" not to be updated and cur_index cannot
    advance to a bigger value.

    (So, radix_tree_gang_lookup_tag_slot() never returns.)

    radix_tree_gang_lookup_tag_slot():
        while (ret < max_items) {
            unsigned int slots_found;
            unsigned long next_index; /* Index of next search */

            if (cur_index > max_index)
                break;
            slots_found = __lookup_tag(node, results + ret,
                    cur_index, max_items - ret, &next_index, tag);
            ret += slots_found;
            // cannot update ret because slots_found == 0.
            // so, this while loops forever.
            if (next_index == 0)
                break;
            cur_index = next_index;
        }

    (2) Why does __lookup_tag() return 0 without updating the index?

    Assume the following:
    - one of the slots in a radix_tree_node is NULL.
    - the tag which corresponds to that slot is set with
    PAGECACHE_TAG_TOWRITE or another tag.
    - at a certain height (!= 0), the corresponding index is 0.

    a) __lookup_tag() notices that the tag is set.

        static unsigned int
        __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
                unsigned int max_items, unsigned long *next_index, unsigned int tag)
        {
            unsigned int nr_found = 0;
            unsigned int shift, height;

            height = slot->height;
            if (height == 0)
                goto out;
            shift = (height-1) * RADIX_TREE_MAP_SHIFT;

            while (height > 0) {
                unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;

                for (;;) {
                    if (tag_get(slot, tag, i))
                        break;  /* <-- the index is not updated yet */

    b) __lookup_tag() notices that the slot is NULL.

                    index &= ~((1UL << shift) - 1);
                    index += 1UL << shift;
                    if (index == 0)
                        goto out; /* 32-bit wraparound */
                    i++;
                    if (i == RADIX_TREE_MAP_SIZE)
                        goto out;
                }
                height--;
                if (height == 0) { /* Bottom level: grab some items */
                    ...
                }
                shift -= RADIX_TREE_MAP_SHIFT;
                slot = rcu_dereference_raw(slot->slots[i]);
                if (slot == NULL)
                    break;  /* <-- taken, because the slot is NULL */

    c) __lookup_tag() doesn't update the index and returns 0.

            }
        out:
            *next_index = index;  /* <-- index is unchanged */
            return nr_found;
        }

    (3) Why is the slot NULL even if the tag is set?

    Because radix_tree_range_tag_if_tagged() always sets the root tag with
    PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
    even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
    in the specified range (from *first_indexp to last_index). Of course,
    some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
    (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())

        unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
                unsigned long *first_indexp, unsigned long last_index,
                unsigned long nr_to_tag,
                unsigned int iftag, unsigned int settag)
        {
            unsigned int height = root->height;
            struct radix_tree_path path[height];
            struct radix_tree_path *pathp = path;
            struct radix_tree_node *slot;
            unsigned int shift;
            unsigned long tagged = 0;
            unsigned long index = *first_indexp;

            last_index = min(last_index, radix_tree_maxindex(height));
            if (index > last_index)
                return 0;
            if (!nr_to_tag)
                return 0;
            if (!root_tag_get(root, iftag)) {
                *first_indexp = last_index + 1;
                return 0;
            }
            if (height == 0) {
                *first_indexp = last_index + 1;
                root_tag_set(root, settag);
                return 1;
            }
            ...
            root_tag_set(root, settag);  /* <-- always sets the root tag */
            *first_indexp = index;

            return tagged;
        }

    As a result, there is no radix_tree_node which is set with
    PAGECACHE_TAG_TOWRITE, but the root tag (radix_tree_root) is set with
    PAGECACHE_TAG_TOWRITE.

    [figure: inside radix_tree]
    (Please see the figure with typewriter font)
    ===========================================
      [roottag = DIRTY]
             |                     tag=0:NOTHING
      tag[0 0 0 1]                     1:DIRTY
         [x x x +]                     2:WRITEBACK
                |                      3:DIRTY,WRITEBACK
                p                      4:TOWRITE
                                       5:DIRTY,TOWRITE ...
      specified range (index: 0 to 2)

    * There is no DIRTY tag within the specified range.
    (But there is a DIRTY tag outside that range.)

        | | | | | | | | |
      after calling tag_pages_for_writeback()
        | | | | | | | | |
        v v v v v v v v v

      [roottag = DIRTY,TOWRITE]
             |                     p is "page".
      tag[0 0 0 1]                 x is NULL.
         [x x x +]                 +- is a pointer to "page".
                |
                p

    * But TOWRITE tag is set on the root tag.
    ============================================

    After that, radix_tree_extend() is called via radix_tree_insert()
    when the page is added.
    This function sets PAGECACHE_TAG_TOWRITE on the new radix_tree_node
    so that it inherits the state of the root tag.

        static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
        {
            struct radix_tree_node *node;
            unsigned int height;
            int tag;

            /* Figure out what the height should be. */
            height = root->height + 1;
            while (index > radix_tree_maxindex(height))
                height++;

            if (root->rnode == NULL) {
                root->height = height;
                goto out;
            }

            do {
                unsigned int newheight;
                if (!(node = radix_tree_node_alloc(root)))
                    return -ENOMEM;

                /* Increase the height. */
                node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);

                /* Propagate the aggregated tag info into the new root */
                for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
                    if (root_tag_get(root, tag))
                        tag_set(node, tag, 0);  /* <-- copies the root tag */
                }

    ===========================================
      [roottag = DIRTY,TOWRITE]
             |               :
      tag[0 0 0 1]       [0 0 0 0]
         [x x x +]       [+ x x x]
                |         |
                p         p (new page)

        | | | | | | | | |
      after calling radix_tree_insert
        | | | | | | | | |
        v v v v v v v v v

      [roottag = DIRTY,TOWRITE]
              |
      tag [5 0 0 0]                * DIRTY and TOWRITE tags are
          [+ + x x]                  inherited by the new node.
           | |
      tag [0 0 0 1] [0 0 0 0]
          [x x x +] [+ x x x]
                 |   |
                 p   p
    ============================================

    After that, the index 3 page is released by remove_from_page_cache().
    This produces the situation in which the tag is set with
    PAGECACHE_TAG_TOWRITE but the slot which corresponds to the tag is NULL.
    ===========================================
      [roottag = DIRTY,TOWRITE]
              |
      tag [5 0 0 0]
          [+ + x x]
           | |
      tag [0 0 0 1] [0 0 0 0]
          [x x x +] [+ x x x]
                 |   |
                 p   p
             (remove)

        | | | | | | | | |
      after calling remove_page_cache
        | | | | | | | | |
        v v v v v v v v v

      [roottag = DIRTY,TOWRITE]
              |
      tag [4 0 0 0]                * Only the DIRTY tag is cleared
          [x + x x]                  because no TOWRITE tag exists
             |                       in the bottom node.
          [0 0 0 0]
          [+ x x x]
           |
           p
    ============================================

    To solve this problem, change radix_tree_range_tag_if_tagged() so that
    it doesn't tag the root tag if it doesn't set any tags within the
    specified range.

    Like this:
    ============================================
        unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
                unsigned long *first_indexp, unsigned long last_index,
                unsigned long nr_to_tag,
                unsigned int iftag, unsigned int settag)
        {
            ...
            unsigned long tagged = 0;
            ...
            if (tagged)  /* <-- new check */
                root_tag_set(root, settag);
            *first_indexp = index;

            return tagged;
        }

    ============================================
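
The effect of the extra check can be tried out on a deliberately flattened model. This sketch is not the radix tree code: `struct tag_model` is a hypothetical array of per-item tag words with one aggregate root tag word, and `range_tag_if_tagged()` here only mirrors the final `if (tagged)` guard.

```c
#include <assert.h>
#include <stddef.h>

#define NITEMS 8
enum { TAG_DIRTY = 1u << 0, TAG_TOWRITE = 1u << 1 };

/* Hypothetical flat model: one tag word per item plus an aggregate root tag. */
struct tag_model {
    unsigned int root_tags;
    unsigned int item_tags[NITEMS];
};

/* Sets 'settag' on the items in [first, last] that already carry 'iftag'. */
unsigned long range_tag_if_tagged(struct tag_model *m, size_t first, size_t last,
                                  unsigned int iftag, unsigned int settag)
{
    unsigned long tagged = 0;

    for (size_t i = first; i <= last && i < NITEMS; i++) {
        if (m->item_tags[i] & iftag) {
            m->item_tags[i] |= settag;
            tagged++;
        }
    }
    if (tagged)                 /* the fix: otherwise leave the root tag alone */
        m->root_tags |= settag;
    return tagged;
}
```

With a DIRTY item only at index 3, tagging the range 0 to 2 leaves the root tag without TOWRITE, so the root tag can no longer claim a tag that no item carries.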

    Signed-off-by: Toshiyuki Okajima
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshiyuki Okajima
     
  • setup_irq() was called before clockevents_register_device() which is
    needed by the irq handler. Bug was reproducible by restarting the
    kernel using kexec (reliable crash).

    Signed-off-by: Nikolaus Voss
    Cc: David Brownell
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Voss, Nikolaus
     
  • Fix up mem_cgroup_move_parent(), which uses compound_order() in an
    asynchronous manner. compound_order() may return an unknown value
    because we don't take the lock. Use PageTransHuge() and HPAGE_SIZE instead
    of it.

    Also clean up mem_cgroup_move_parent():
    - remove unnecessary initialization of a local variable.
    - rename charge_size -> page_size
    - remove an unnecessary (wrong) comment.
    - add a comment about THP.

    Note:
    The current design takes compound_page_lock() in the caller of move_account().
    This should be revisited when we implement direct move_task of hugepage
    without splitting.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_disabled() should be checked at splitting. If disabled, no
    heavy work is necessary.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 4b53433468 ("memcg: clean up try_charge main loop") removed the
    cancellation of the charge in the case where the memory charge succeeds
    but the mem+swap charge fails.

    This leaks memory usage. Fix it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Callers of migrate_pages() should call putback_lru_pages() to return
    isolated pages to the LRU or free list. The current comment is rather
    confusing: it says the caller always has to call it.

    It is clearer to point out that the caller has to call it if
    migrate_pages()'s return value isn't zero.

    Signed-off-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit 5d6892407 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
    enabled") causes this warning during the configuration process:

    warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
    direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)

    COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP
    either, it is also useful for regular alloc_pages(order > 0) including
    the very kernel stack during fork (THREAD_ORDER = 1). It's always
    better to enable COMPACTION.

    The warning should be an error because we would end up with MIGRATION
    not selected, and COMPACTION wouldn't work without migration (even though
    it seems to build with an inline migrate_pages returning -ENOSYS).

    I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
    for some releases (for full safety the default remains disabled which I
    think is enough).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Luca Tettamanti
    Tested-by: Luca Tettamanti
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
    to the 'put_back' label

        ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
        if (ret || !parent)
            goto put_back;

    where we'll

        if (charge > PAGE_SIZE)
            compound_unlock_irqrestore(page, flags);

    but, we have not assigned anything to 'flags' at this point, nor have we
    called 'compound_lock_irqsave()' (which is what sets 'flags'). The
    'put_back' label should be moved below the call to
    compound_unlock_irqrestore() as per this patch.
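
The label placement can be sketched with a small model of the control flow. This is only an illustration under assumptions: `struct lock_stats` and `move_parent_model()` are hypothetical, and the counters stand in for compound_lock_irqsave() and compound_unlock_irqrestore().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for the compound lock. */
struct lock_stats { int locks; int unlocks; };

/* 'charge_fails' models __mem_cgroup_try_charge() returning an error. */
int move_parent_model(struct lock_stats *s, bool charge_fails, bool is_huge)
{
    int ret = charge_fails ? -1 : 0;

    if (ret)
        goto put_back;          /* the lock has not been taken yet */

    if (is_huge)
        s->locks++;             /* stands in for compound_lock_irqsave() */

    /* ... the actual move would happen here ... */

    if (is_huge)
        s->unlocks++;           /* stands in for compound_unlock_irqrestore() */
put_back:                       /* sits below the unlock, as in the patch */
    return ret;
}
```

Because the label is below the unlock, the early error path never releases a lock it did not take, and lock and unlock counts stay balanced.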

    Signed-off-by: Jesper Juhl
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Commit 0e093d99763e ("writeback: do not sleep on the congestion queue if
    there are no congested BDIs or if significant congestion is not being
    encountered in the current zone") uncovered a livelock in the page
    allocator that resulted in tasks infinitely looping trying to find
    memory and kswapd running at 100% cpu.

    The issue occurs because drain_all_pages() is called immediately
    following direct reclaim when no memory is freed and try_to_free_pages()
    returns non-zero because all zones in the zonelist do not have their
    all_unreclaimable flag set.

    When draining the per-cpu pagesets back to the buddy allocator for each
    zone, the zone->pages_scanned counter is cleared to avoid erroneously
    setting zone->all_unreclaimable later. The problem is that no pages may
    actually be drained and, thus, the unreclaimable logic never fails
    direct reclaim so the oom killer may be invoked.

    This apparently only manifested after wait_iff_congested() was
    introduced and the zone was full of anonymous memory that would not
    congest the backing store. The page allocator would infinitely loop if
    there were no other tasks waiting to be scheduled and clear
    zone->pages_scanned because of drain_all_pages() as the result of this
    change before kswapd could scan enough pages to trigger the reclaim
    logic. Additionally, with every loop of the page allocator and in the
    reclaim path, kswapd would be kicked and would end up running at 100%
    cpu. In this scenario, current and kswapd are all running continuously
    with kswapd incrementing zone->pages_scanned and current clearing it.

    The problem is even more pronounced when current swaps some of its
    memory to swap cache and the reclaimable logic then considers all active
    anonymous memory in the all_unreclaimable logic, which requires a much
    higher zone->pages_scanned value for try_to_free_pages() to return zero
    that is never attainable in this scenario.

    Before wait_iff_congested(), the page allocator would incur an
    unconditional timeout and allow kswapd to elevate zone->pages_scanned to
    a level that the oom killer would be called the next time it loops.

    The fix is to only attempt to drain pcp pages if there is actually a
    quantity to be drained. The unconditional clearing of
    zone->pages_scanned in free_pcppages_bulk() need not be changed since
    other callers already ensure that draining will occur. This patch
    ensures that free_pcppages_bulk() will actually free memory before
    calling into it from drain_all_pages() so zone->pages_scanned is only
    cleared if appropriate.
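
The idea of the fix can be modelled in a few lines. The sketch below is hypothetical userspace code, not the page allocator: `struct zone_model` keeps a scan counter and per-cpu page counts, and the drain helper skips CPUs that hold no pages so the counter is not cleared needlessly.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4

/* Hypothetical model of a zone with per-cpu page counts. */
struct zone_model {
    unsigned long pages_scanned;
    int pcp_count[NR_CPUS_MODEL];
};

/* Frees the per-cpu pages and, as a side effect, clears pages_scanned. */
void free_pcppages_model(struct zone_model *z, int cpu)
{
    z->pcp_count[cpu] = 0;
    z->pages_scanned = 0;
}

/* Drains only the CPUs that actually hold pages; returns true if any did. */
bool drain_all_pages_model(struct zone_model *z)
{
    bool drained = false;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        if (z->pcp_count[cpu] == 0)
            continue;           /* nothing to free: keep pages_scanned */
        free_pcppages_model(z, cpu);
        drained = true;
    }
    return drained;
}
```

When no CPU holds pages, pages_scanned survives the drain, so in this model the scan counter can keep growing toward the unreclaimable threshold.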

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Before 0e093d99763e ("writeback: do not sleep on the congestion queue if
    there are no congested BDIs or if significant congestion is not being
    encountered in the current zone"), preferred_zone was only used for NUMA
    statistics, to determine the zoneidx from which to allocate given
    the type requested, and whether to utilize memory compaction.

    wait_iff_congested(), though, uses preferred_zone to determine if the
    congestion wait should be deferred because its dirty pages are backed by
    a congested bdi. This incorrectly defers the timeout and busy loops in
    the page allocator with various cond_resched() calls if preferred_zone
    is not allowed in the current context, usually consuming 100% of a cpu.

    This patch ensures preferred_zone is an allowed zone in the fastpath
    depending on whether current is constrained by its cpuset or nodes in
    its mempolicy (when the nodemask passed is non-NULL). This is correct
    since the fastpath allocation always passes ALLOC_CPUSET when trying to
    allocate memory. In the slowpath, this patch resets preferred_zone to
    the first zone of the allowed type when the allocation is not
    constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

    This patch also ensures preferred_zone is from the set of allowed nodes
    when called from within direct reclaim since allocations are always
    constrained by cpusets in this context (it is blockable).

    Both of these uses of cpuset_current_mems_allowed are protected by
    get_mems_allowed().

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Both pps_parport and pps_gen_parport are written in a way that they
    can't share a port with any other driver. This can result in locking up
    the process that loads modules or even the whole kernel if the modules
    are compiled in. Use PARPORT_FLAG_EXCL to indicate this.

    Signed-off-by: Alexander Gordeev
    Cc: Alexander Gordeev
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Gordeev
     
  • Signed-off-by: Rodolfo Giometti
    Reported-by: Ingo Molnar
    Cc: Alexander Gordeev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rodolfo Giometti
     
  • parport_unregister_device() should never be used when interrupts are
    enabled in hardware and an irq handler is registered, so there is no need
    to disable interrupts when using waitlist_lock. But there is no way to
    explain these subtle semantics to the lockdep analyzer.

    So disable interrupts here too to simplify things. The price is
    negligible.

    Signed-off-by: Alexander Gordeev
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Gordeev
     
  • The latest kernel has many changes in the IRQ subsystem and its interfaces,
    like adding "irq_eoi" to struct irq_chip. This patch is a follow-up change
    for that.

    Also remove the unnecessary cast for a "void *".

    Signed-off-by: Feng Tang
    Cc: Alek Du
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • Return PTR_ERR(led_dat->pwm) instead of 0 if pwm_request failed

    Signed-off-by: Axel Lin
    Cc: Richard Purdie
    Cc: Luotao Fu
    Cc: Reviewed-by: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Lin
     
  • mips (and sparc32):

    In file included from arch/mips/include/asm/tlb.h:21,
    from mm/pgtable-generic.c:9:
    include/asm-generic/tlb.h: In function `tlb_flush_mmu':
    include/asm-generic/tlb.h:76: error: implicit declaration of function `release_pages'
    include/asm-generic/tlb.h: In function `tlb_remove_page':
    include/asm-generic/tlb.h:105: error: implicit declaration of function `page_cache_release'

    free_pages_and_swap_cache() and free_page_and_swap_cache() are macros
    which call release_pages() and page_cache_release(). The obvious fix is
    to include pagemap.h in swap.h, where those macros are defined. But that
    breaks sparc for weird reasons.

    So fix it within mm/pgtable-generic.c instead.

    Reported-by: Yoichi Yuasa
    Cc: Geert Uytterhoeven
    Acked-by: Sam Ravnborg
    Cc: Sergei Shtylyov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This fixes TRANSPARENT_HUGEPAGE=y with PARAVIRT=y and HIGHMEM64=n.

    The #ifdef that this patch removes was erroneously introduced to fix a
    build error for noPAE (where pmd.pmd doesn't exist). The kernel then
    built, but it failed at runtime because set_pmd_at was a noop. This patch
    corrects it by enabling set_pmd_at for noPAE mode too.

    Signed-off-by: Andrea Arcangeli
    Reported-by: werner
    Reported-by: Minchan Kim
    Tested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
    ALSA: AACI: fix timeout duration
    ALSA: AACI: fix timeout condition checking
    ARM: 6636/1: ep93xx: default multiplexed gpio ports to gpio mode
    ARM: 6637/1: Make the argument to virt_to_phys() "const volatile"
    ARM: twd: ensure timer reload is reprogrammed on entry to periodic mode
    ARM: 6635/2: Configure reference clock for Versatile Express timers
    ARM: versatile: name configuration options after actual board names
    ARM: realview: name configuration options after actual board names
    ARM: realview,vexpress: fix section mismatch warning for pen_release
    ARM: 6632/3: mmci: stop using the blockend interrupts

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
    nilfs2: fix crash after one superblock became unavailable

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
    m68k/amiga: Fix "debug=mem"
    m68k/atari: Rename "scc" to "atari_scc"
    m68k: Uninline strchr()

    Linus Torvalds
     
  • …nel/git/lethal/sh-2.6

    * 'rmobile-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
    ARM: mach-shmobile: AG5EVM LCDC / MIPI-DSI platform data
    ARM: mach-shmobile: sh73a0 CPGA fix for PLL CFG bit
    ARM: mach-shmobile: mackerel: clarify shdi/mmcif switch settings
    ARM: mach-shmobile: sh73a0 CPGA fix for IrDA MSTP
    ARM: mach-shmobile: sh73a0 CPGA fix for FRQCRA M3
    ARM: mach-shmobile: remove sh7367 on-chip set_irq_type()
    ARM: mach-shmobile: sh7372 INTCS MFIS2 interrupt update
    ARM: mach-shmobile: ag5evm: Add IrDA support
    ARM: mach-shmobile: clock-sh7372: fixup pllc2 set_rate
    mmc: sh_mmcif: Convert to __raw_xxx() I/O accessors.
    ARM: mach-shmobile: ag5evm requires GPIOLIB
    ARM: mach-shmobile: fix cpu_base of gic_init() on sh73a0

    Linus Torvalds
     
  • …l/git/lethal/fbdev-2.6

    * 'fbdev-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-2.6:
    mailmap: Add an entry for Axel Lin.
    video: fix some comments in drivers/video/console/vgacon.c
    drivers/video/bf537-lq035.c: Add missing IS_ERR test
    video: pxa168fb: remove a redundant pxa168fb_check_var call
    video: da8xx-fb: fix fb_probe error path
    video: pxa3xx-gcu: Return -EFAULT when copy_from_user() fails
    video: nuc900fb: properly free resources in nuc900fb_remove
    video: nuc900fb: fix compile error

    Linus Torvalds
     
  • * 'sh-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
    sh: Fix build of sh7750 base boards
    sh: update INTC to clear IRQ sense valid flag
    sh: Fix sh build failure when CONFIG_SFC=m
    sh: fix MSIOF0 SPI on ecovec: it conflicts with VOU
    sh: support XZ-compressed kernel.
    sh: Fix up breakage from asm-generic/pgtable.h changes.

    Linus Torvalds
     
  • Fix __key_link_end()'s attempt to fix up the quota if an error occurs.

    There are two erroneous cases: Firstly, we always decrease the quota if
    the preallocated replacement keyring needs cleaning up, irrespective of
    whether or not we should (we may have replaced a pointer rather than
    adding another pointer).

    Secondly, we never clean up the quota if we added a pointer without the
    keyring storage being extended (we allocate multiple pointers at a time,
    even if we're not going to use them all immediately).

    We handle this by setting the bottom bit of the preallocation pointer in
    __key_link_begin() to indicate that the quota needs fixing up, which is
    then passed to __key_link() (which clears the whole thing) and
    __key_link_end().
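
    The bottom-bit technique described above can be sketched in plain C. This is a minimal illustration, not the keyring code itself: the names prealloc_pack(), prealloc_ptr(), prealloc_needs_fixup() and PREALLOC_QUOTA_FIXUP are hypothetical, and the sketch assumes the preallocation is at least 2-byte aligned so that bit 0 of its address is free.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: bit 0 of the word means "quota needs fixing up". */
#define PREALLOC_QUOTA_FIXUP ((uintptr_t)1)

/* Pack an aligned pointer and the fix-up flag into a single word. */
static inline uintptr_t prealloc_pack(void *ptr, bool fixup)
{
	return (uintptr_t)ptr | (fixup ? PREALLOC_QUOTA_FIXUP : 0);
}

/* Recover the real pointer by masking off the flag bit. */
static inline void *prealloc_ptr(uintptr_t word)
{
	return (void *)(word & ~PREALLOC_QUOTA_FIXUP);
}

/* Report whether the flag bit was set when the word was packed. */
static inline bool prealloc_needs_fixup(uintptr_t word)
{
	return (word & PREALLOC_QUOTA_FIXUP) != 0;
}
```

    A cleanup path in this style would test prealloc_needs_fixup() before adjusting the quota, and always go through prealloc_ptr() before dereferencing or freeing.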

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • GPL V2 should be GPL v2

    Signed-off-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • A B C D E ...

    Signed-off-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • busy_loop() returns a negative error code, so change the err variable
    from u32 to int to propagate the error code correctly.

    Also remove unneeded initialization for err and i variables.
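
    Why the type matters can be shown with a small standalone sketch. It is not the driver code: busy_loop_stub() is a hypothetical stand-in that always fails with -EIO, and the explicit cast stands in for the conversion that happens when a negative int is stored in an unsigned 32-bit variable.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a helper that fails with a negative errno. */
static int busy_loop_stub(void)
{
	return -EIO;
}

/* Storing the result in an unsigned 32-bit variable loses the sign:
 * the caller sees a large positive number instead of -EIO. */
static int64_t propagate_via_u32(void)
{
	uint32_t err = (uint32_t)busy_loop_stub();
	return err;
}

/* Storing the result in an int keeps the negative error code intact. */
static int64_t propagate_via_int(void)
{
	int err = busy_loop_stub();
	return err;
}
```

    Any caller that checks for a negative return value would miss the failure in the first variant and catch it in the second.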

    Signed-off-by: Axel Lin
    Signed-off-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Axel Lin
     
  • Russell King
     
  • Relying on the access time of peripherals is unreliable - it depends
    on the speed of the CPU and the bus. On Versatile Express, these
    timeouts were expiring, causing the driver to fail.

    Add udelay(1) to ensure that they don't expire early, and adjust
    timeouts to give a reasonable margin over the response times.

    Signed-off-by: Russell King

    Russell King
     
  • Ensure that a timeout coincident with the condition being waited for
    results in success rather than failure. This helps avoid timeout
    conditions being inappropriately flagged.
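
    One common way to get this behaviour is to re-check the condition once more after the retry budget runs out. The following is a minimal sketch under that assumption, not the AACI driver code: poll_with_timeout(), the callback types and the demo_* helpers are all hypothetical, and demo_delay() stands in for a fixed delay such as udelay(1).

```c
#include <stdbool.h>

typedef bool (*cond_fn)(void *ctx);   /* returns true once the event has happened */
typedef void (*delay_fn)(void);       /* fixed delay between polls */

/* Poll until the condition holds or the retry budget is used up. */
static bool poll_with_timeout(cond_fn done, void *ctx, delay_fn delay,
			      unsigned int retries)
{
	while (retries--) {
		if (done(ctx))
			return true;
		delay();
	}
	/* Final check: a condition that became true during the last
	 * delay counts as success, not as a timeout. */
	return done(ctx);
}

/* Demo condition: becomes true once it has been polled more than
 * ready_after times. */
struct demo_dev {
	unsigned int polls;
	unsigned int ready_after;
};

static bool demo_ready(void *ctx)
{
	struct demo_dev *dev = ctx;
	return ++dev->polls > dev->ready_after;
}

/* Demo delay: does nothing; a driver would wait a fixed time here. */
static void demo_delay(void)
{
}
```

    Using a fixed delay between polls, rather than relying on how long a register access takes, keeps the total timeout independent of CPU and bus speed.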

    Signed-off-by: Russell King

    Russell King
     
  • The EP93xx C and D GPIO ports are multiplexed with the Keypad Interface
    peripheral.  At power-up they default into non-GPIO mode with the Key
    Matrix controller enabled so these ports are unusable for GPIO.  Note
    that the Keypad Interface peripheral is only available in the EP9307,
    EP9312, and EP9315 processor variants.

    The keypad support will clear the DeviceConfig bits appropriately to
    enable the Keypad Interface when the driver is loaded.  And, when the
    driver is unloaded it will set the bits to return the ports to GPIO mode.

    To make these ports available for GPIO after power-up on all EP93xx
    processor variants, set the KEYS and GONK bits in the DeviceConfig
    register.

    Similarly, the E, G, and H ports are multiplexed with the IDE Interface
    peripheral.  At power-up these also default into non-GPIO mode.  Note
    that the IDE peripheral is only available in the EP9312 and EP9315
    processor variants.

    Since an IDE driver is not even available in mainline, set the EONIDE,
    GONIDE, and HONIDE bits in the DeviceConfig register so that these
    ports will be available for GPIO use after power-up.

    Signed-off-by: H Hartley Sweeten
    Acked-by: Ryan Mallon
    Signed-off-by: Russell King

    Hartley Sweeten
     
  • Changing the virt_to_phys() argument to "const volatile void *" avoids
    compiler warnings in some situations where this function is used.
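
    The effect of the wider parameter type can be illustrated outside the kernel. In this sketch demo_virt_to_phys() is a hypothetical identity mapping, not the ARM implementation; the point is only that a "const volatile void *" parameter accepts plain, const, volatile and const volatile pointers without a cast or a discarded-qualifier warning.

```c
#include <stdint.h>

/* Hypothetical identity-mapped translation. The const volatile
 * qualifiers on the parameter let callers pass any qualified pointer. */
static inline uintptr_t demo_virt_to_phys(const volatile void *addr)
{
	return (uintptr_t)addr;
}
```

    With a plain "void *" parameter, passing the address of a const or volatile object would drop qualifiers and trigger a compiler diagnostic.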

    Signed-off-by: Catalin Marinas
    Acked-by: Stephen Boyd
    Acked-by: Arnd Bergmann
    Signed-off-by: Russell King

    Catalin Marinas
     
  • Ensure that the twd timer reload value is reprogrammed each time we
    enter periodic mode. This ensures that the reload value is always
    reset correctly.

    Tested-by: Santosh Shilimkar
    Acked-by: Colin Cross
    Signed-off-by: Russell King

    Russell King
     
  • Timers on the Versatile Express mainboard are used as system clock/event
    sources. The driver assumes that they are clocked with a 1MHz signal.
    Old V2M firmware apparently configured this by default, but on newer
    boards one can observe that a "sleep 1" command takes over 30 seconds
    to finish, as the timers are fed with 32kHz instead...
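
    The "over 30 seconds" figure follows from the ratio of the two clock rates. As a worked sketch, assuming "32kHz" means the common 32.768kHz reference (an assumption, the message does not give the exact rate) and using a hypothetical helper name:

```c
/* Real time taken by a delay that was programmed assuming assumed_hz
 * when the timer is actually clocked at actual_hz. */
static double seconds_taken(double requested_s, double assumed_hz,
			    double actual_hz)
{
	return requested_s * assumed_hz / actual_hz;
}
```

    For a one-second sleep, seconds_taken(1.0, 1000000.0, 32768.0) gives roughly 30.5 seconds, since 1,000,000 timer ticks at 32,768 ticks per second take about 30.5 s.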

    This patch performs the required magic and also removes the code
    clearing the timers' control registers, as exactly the same operations
    are performed by the timer driver a few jiffies later.

    Signed-off-by: Pawel Moll
    Tested-by: Will Deacon
    Signed-off-by: Russell King

    Pawel Moll
     

25 Jan, 2011

1 commit