24 Dec, 2011

1 commit

  • If CONFIG_SCHEDSTATS is defined, the kernel maintains
    information about how long the task was sleeping or,
    in the case of iowait, blocking in the kernel before
    getting woken up.

    This will be useful for sleep time profiling.

    Note: this information is only provided for sched_fair.
    Other scheduling classes may choose to provide this in
    the future.

    Note: the delay includes the time spent on the runqueue
    as well.
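    With CONFIG_SCHEDSTATS enabled, related per-task counters are also
    exposed via /proc/<pid>/schedstat. As a toy illustration (Python,
    not kernel code; the sample line is made up), the three documented
    fields can be split out like this:

```python
# Toy parser for a /proc/<pid>/schedstat line (requires CONFIG_SCHEDSTATS).
# The three fields are: time spent on the cpu (ns), time spent waiting
# on a runqueue (ns), and number of timeslices run.

def parse_schedstat(line):
    """Split a /proc/<pid>/schedstat line into named counters."""
    on_cpu_ns, wait_ns, timeslices = (int(f) for f in line.split())
    return {"on_cpu_ns": on_cpu_ns,
            "runqueue_wait_ns": wait_ns,
            "timeslices": timeslices}

# Hypothetical sample line, not taken from a real system:
stats = parse_schedstat("12345678 2345678 42")
```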

    Signed-off-by: Arun Sharma
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Arnaldo Carvalho de Melo
    Cc: Andrew Vagin
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1324512940-32060-2-git-send-email-asharma@fb.com
    Signed-off-by: Ingo Molnar

    Arun Sharma
     

23 Dec, 2011

1 commit

  • The panic-on-framebuffer code seems to cause a schedule
    to occur during an oops. This causes a bunch of extra
    spew as can be seen in:

    https://bugzilla.redhat.com/attachment.cgi?id=549230

    Don't do scheduler debug checks when we are oopsing already.

    Signed-off-by: Dave Jones
    Link: http://lkml.kernel.org/r/20111222213929.GA4722@redhat.com
    Signed-off-by: Ingo Molnar

    Dave Jones
     

21 Dec, 2011

7 commits

  • There is a small race between try_to_wake_up() and sched_move_task(),
    which is trying to move the process being woken up.

    try_to_wake_up() on CPU0        | sched_move_task() on CPU1
    --------------------------------+---------------------------------
    raw_spin_lock_irqsave(p->pi_lock)
    task_waking_fair()
      -> p.se.vruntime -= cfs_rq->min_vruntime
    ttwu_queue()
      -> send reschedule IPI to CPU1
    raw_spin_unlock_irqsave(p->pi_lock)
                                    | task_rq_lock()
                                    |   -> trying to acquire both p->pi_lock
                                    |      and rq->lock with IRQs disabled
                                    | task_move_group_fair()
                                    |   -> p.se.vruntime
                                    |        -= (old)cfs_rq->min_vruntime
                                    |        += (new)cfs_rq->min_vruntime
                                    | task_rq_unlock()
    (via IPI)
    sched_ttwu_pending()
      raw_spin_lock(rq->lock)
      ttwu_do_activate()
        ...
        enqueue_entity()
          child.se->vruntime += cfs_rq->min_vruntime
      raw_spin_unlock(rq->lock)

    As a result, vruntime of the process becomes far bigger than min_vruntime,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes the problem by simply ignoring such a process in
    task_move_group_fair(), because its vruntime has already been
    normalized in task_waking_fair().
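    The arithmetic of the race can be sketched with a toy Python model
    (not kernel code; the min_vruntime values are made up): the buggy
    path removes the old min_vruntime a second time, so the later
    enqueue leaves the vruntime inflated by roughly (new - old):

```python
# Toy arithmetic model of the vruntime normalization race.
# "old_min" / "new_min" stand for the source and destination
# cfs_rq->min_vruntime; all values are illustrative.

def buggy_move(vruntime, old_min, new_min):
    # task_move_group_fair() renormalizes even though task_waking_fair()
    # already subtracted old_min, so old_min is removed twice.
    return vruntime - old_min + new_min

def fixed_move(vruntime, already_normalized, old_min, new_min):
    # The fix: skip the adjustment when the vruntime was already
    # normalized on the wakeup path.
    if already_normalized:
        return vruntime
    return vruntime - old_min + new_min

old_min, new_min = 1_000, 5_000_000
v = 1_500                       # absolute vruntime on the old cfs_rq
v_waking = v - old_min          # task_waking_fair() normalizes

# enqueue_entity() later adds the destination min_vruntime back:
buggy = buggy_move(v_waking, old_min, new_min) + new_min
fixed = fixed_move(v_waking, True, old_min, new_min) + new_min
expected = v - old_min + new_min    # correct absolute vruntime on new rq
```

    With new_min much larger than old_min, the buggy result exceeds the
    correct one by almost exactly (new_min - old_min).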

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
  • There is a small race between do_fork() and sched_move_task(), which is
    trying to move the child.

    do_fork()                       | sched_move_task()
    --------------------------------+---------------------------------
    copy_process()
      sched_fork()
        task_fork_fair()
          -> vruntime of the child is initialized
             based on that of the parent.
    -> we can see the child in "tasks" file now.
                                    | task_rq_lock()
                                    | task_move_group_fair()
                                    |   -> child.se.vruntime
                                    |        -= (old)cfs_rq->min_vruntime
                                    |        += (new)cfs_rq->min_vruntime
                                    | task_rq_unlock()
    wake_up_new_task()
      ...
      enqueue_entity()
        child.se.vruntime += cfs_rq->min_vruntime

    As a result, vruntime of the child becomes far bigger than min_vruntime,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes the problem by simply ignoring such a process in
    task_move_group_fair(), because its vruntime has already been
    normalized in task_fork_fair().

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
  • There is a small race between task_fork_fair() and sched_move_task(),
    which is trying to move the parent.

    task_fork_fair()                | sched_move_task()
    --------------------------------+---------------------------------
    cfs_rq = task_cfs_rq(current)
      -> cfs_rq is the "old" one.
    curr = cfs_rq->curr
      -> curr is set to the parent.
                                    | task_rq_lock()
                                    | dequeue_task()
                                    |   -> parent.se.vruntime -=
                                    |        (old)cfs_rq->min_vruntime
                                    | enqueue_task()
                                    |   -> parent.se.vruntime +=
                                    |        (new)cfs_rq->min_vruntime
                                    | task_rq_unlock()
    raw_spin_lock_irqsave(rq->lock)
    se->vruntime = curr->vruntime
      -> vruntime of the child is set to that of the parent,
         which has already been updated by sched_move_task().
    se->vruntime -= (old)cfs_rq->min_vruntime
    raw_spin_unlock_irqrestore(rq->lock)

    As a result, vruntime of the child becomes far bigger than expected,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes this problem by setting "cfs_rq" and "curr" after
    holding the rq->lock.
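    A toy Python model of the arithmetic (not kernel code; all values
    are made up) shows why sampling cfs_rq/curr before taking rq->lock
    goes wrong: the early snapshot normalizes the child against the old
    group's min_vruntime even though the parent has already moved:

```python
# Toy arithmetic model of the task_fork_fair() vs. sched_move_task()
# race; names and values are illustrative, not the kernel's.

def child_vruntime(parent_v, old_min, new_min, snapshot_under_lock):
    # sched_move_task() moves the parent to the new group first:
    parent_v = parent_v - old_min + new_min
    # task_fork_fair() copies the parent's vruntime, then normalizes it
    # against the cfs_rq it sampled; the early (buggy) snapshot still
    # points at the old group:
    base = new_min if snapshot_under_lock else old_min
    child = parent_v - base
    # wake_up_new_task() -> enqueue_entity() re-adds the new min_vruntime:
    return child + new_min

old_min, new_min = 1_000, 5_000_000
fixed = child_vruntime(1_500, old_min, new_min, True)
buggy = child_vruntime(1_500, old_min, new_min, False)
```

    The buggy child ends up ahead of the fixed one by exactly
    (new_min - old_min), which is huge when the groups' clocks diverge.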

    Signed-off-by: Daisuke Nishimura
    Acked-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
  • Remove the cfs bandwidth period check from tg_set_cfs_period.
    The bandwidth period's lower/upper limits are denoted by
    min_cfs_quota_period/max_cfs_quota_period respectively, and the
    period is checked against them in tg_set_cfs_bandwidth().

    As pjt pointed out, negative input will result in very large unsigned
    numbers and will be caught by the max allowed period test.

    Signed-off-by: Kamalesh Babulal
    Acked-by: Paul Turner
    [amended changelog to mention negative values]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
    --
    kernel/sched/core.c | 3 ---
    1 file changed, 3 deletions(-)

    Signed-off-by: Ingo Molnar

    Kamalesh Babulal
     
  • The current lock break relies on contention on the rq locks,
    something which might never come because we've got IRQs disabled.
    Or, conversely, contention will be very likely, because on anything
    with more than 2 CPUs a synchronized load-balance pass will very
    likely cause contention on the rq locks.

    Also, the sched_nr_migrate break fails when it gets trapped in the
    loops of either the cgroup muck in load_balance_fair() or the
    move_tasks() load condition.

    Instead, use the new lb_flags field to propagate break/abort
    conditions for all these loops and create a new loop outside the
    IRQ disabled region for when a break is required.
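    A minimal sketch of the idea, in Python rather than the kernel's C
    (flag names, values, and the budget are all illustrative): the inner
    scan records a break condition in a flags word, and an outer loop,
    which would run with IRQs enabled, resumes the scan:

```python
# Toy sketch of propagating a break condition from an inner scan loop
# to an outer retry loop, instead of relying on lock contention.
LBF_NEED_BREAK = 0x01   # illustrative flag values, not the kernel's
LBF_ABORT      = 0x02

def move_tasks(tasks, budget):
    """Scan tasks; set a flag instead of silently stopping."""
    flags, moved = 0, []
    for t in tasks:
        if len(moved) == budget:      # sched_nr_migrate-style limit
            flags |= LBF_NEED_BREAK   # ask the outer loop to resume
            break
        moved.append(t)
    return moved, flags

def load_balance(tasks, budget):
    moved = []
    while True:                       # outer loop, IRQs notionally enabled
        batch, flags = move_tasks(tasks[len(moved):], budget)
        moved += batch
        if not flags & LBF_NEED_BREAK:
            break
    return moved
```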

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Replace the all_pinned argument with a flags field so that we can add
    some extra controls throughout that entire call chain.
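    The shape of that change can be sketched in Python (names are
    illustrative, not the kernel's): the single boolean becomes one bit
    in a flags field with room for additional controls:

```python
# Sketch of widening a boolean out-parameter (all_pinned) into a flags
# field; flag names and values are made up for illustration.
from enum import IntFlag

class LBF(IntFlag):
    ALL_PINNED = 0x1    # replaces the old all_pinned argument
    NEED_BREAK = 0x2    # room for extra controls in the same field

def scan_tasks(pinned_states):
    """Start pessimistic; clear ALL_PINNED once a movable task is seen."""
    flags = LBF.ALL_PINNED
    for pinned in pinned_states:
        if not pinned:
            flags &= ~LBF.ALL_PINNED
    return flags
```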

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Mike reported a 13% drop in netperf TCP_RR performance due to the
    new remote wakeup code. Suresh too noticed some performance issues
    with it.

    Reducing the IPIs to only cross cache domains solves the observed
    performance issues.
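    As a toy model of the policy (Python, with a made-up CPU topology,
    not the kernel's sched_domain code): the remote-wakeup IPI is only
    worth it when waker and target do not share a last-level cache:

```python
# Toy policy: only queue the wakeup remotely (via IPI) when the two
# CPUs are in different cache domains. The topology is invented.
llc_domain = {0: "llcA", 1: "llcA", 2: "llcB", 3: "llcB"}  # cpu -> LLC

def wakeup_needs_ipi(waker_cpu, target_cpu):
    """IPI only across cache domains; same-LLC wakeups stay local."""
    return llc_domain[waker_cpu] != llc_domain[target_cpu]
```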

    Reported-by: Suresh Siddha
    Reported-by: Mike Galbraith
    Acked-by: Suresh Siddha
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Chris Mason
    Cc: Dave Kleikamp
    Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Dec, 2011

1 commit


16 Dec, 2011

1 commit


15 Dec, 2011

8 commits


14 Dec, 2011

16 commits

  • Fixes:
    https://bugs.freedesktop.org/show_bug.cgi?id=43739

    Signed-off-by: Alex Deucher
    Cc: stable@kernel.org
    Signed-off-by: Dave Airlie

    Alex Deucher
     
  • The label 'out_bdi' should be followed by bdi_destroy() instead of
    fput(), which should come after the 'out_fput' label.

    If bdi_setup_and_register() fails then jump to the 'out_fput' label
    instead of the 'out_bdi' one.

    If fget(data.info_fd) fails then jump to the previously fixed 'out_bdi'
    label to call bdi_destroy() otherwise the bdi object will not be
    destroyed.

    Compile tested only.
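    The error-path ordering can be modeled abstractly (a Python toy, not
    the actual kernel code): each cleanup label must undo exactly what
    was set up before the failure point, in reverse order:

```python
# Toy model of goto-style error unwinding; the setup/cleanup names are
# illustrative stand-ins for bdi_setup_and_register()/fget() etc.

def mount_like_setup(fail_at=None):
    """Return (resources set up, cleanups run on failure)."""
    done = []
    try:
        if fail_at == "bdi":
            raise OSError("bdi_setup_and_register failed")
        done.append("bdi")             # bdi_setup_and_register()
        if fail_at == "fget":
            raise OSError("fget failed")
        done.append("file")            # fget(data.info_fd)
        return done, []
    except OSError:
        undo = []
        # Unwind in reverse setup order, like out_fput then out_bdi:
        if "file" in done:
            undo.append("fput")
        if "bdi" in done:
            undo.append("bdi_destroy")
        return done, undo
```

    The bug fixed here corresponds to the fget-failure case forgetting
    the bdi_destroy() step.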

    Signed-off-by: Djalal Harouni
    Signed-off-by: Al Viro

    Djalal Harouni
     
  • We need to zero out the part of a page which is beyond EOF before
    setting it uptodate; otherwise an mmap read or a write will see
    non-zero data beyond EOF.
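    A toy Python model of the zeroing (PAGE_SIZE, the page contents, and
    the sizes are made up; this is not the ext4 code):

```python
# Before marking a page uptodate, zero the part of the last page that
# lies beyond EOF so stale data can't leak to mmap readers.
PAGE_SIZE = 4096

def fill_page(page, file_size, page_index, disk_bytes):
    """Copy disk_bytes into the page, then zero everything past EOF."""
    page[: len(disk_bytes)] = disk_bytes
    in_page_eof = file_size - page_index * PAGE_SIZE
    if 0 <= in_page_eof < PAGE_SIZE:
        page[in_page_eof:] = bytes(PAGE_SIZE - in_page_eof)
    return page

# Page 1 of a file that is 4096 + 100 bytes long: only 100 valid bytes.
page = fill_page(bytearray(b"\xaa" * PAGE_SIZE), PAGE_SIZE + 100, 1,
                 b"\xbb" * 100)
```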

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Yongqiang Yang
     
  • If a file is fallocated on a hole, map->m_lblk + map->m_len may be greater
    than ee_block + ee_len.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Yongqiang Yang
     
  • If a page has been read into memory and never been written, it has
    no buffers, but we should still handle the page in truncate or
    punch hole.

    VFS code of writing operations has handled holes correctly, so this
    patch removes the code handling holes in writing operations.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Yongqiang Yang
     
  • If there is an unwritten but clean buffer in a page and there is a
    dirty buffer after the buffer, then mpage_submit_io does not write the
    dirty buffer out. As a result, da_writepages loops forever.

    This patch fixes the problem by checking the dirty flag.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Yongqiang Yang
     
  • If the pte mapping in generic_perform_write() is unmapped between
    iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(),
    the "copied" parameter to ->write_end can be zero. ext4 couldn't
    cope with that with delayed allocation enabled. This skips the
    i_disksize enlargement logic if copied is zero and no new data was
    appended to the inode.

    gdb> bt
    #0  0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
    #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
    #2  0xffffffff810d97f1 in generic_perform_write (iocb=, iov=, nr_segs=, pos=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2440
    #3  generic_file_buffered_write (iocb=, iov=, nr_segs=, pos=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2482
    #4  0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0xffff88001e26be40) at mm/filemap.c:2600
    #5  0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=, pos=) at mm/filemap.c:2632
    #6  0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) at fs/ext4/file.c:136
    #7  0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=, len=, ppos=0xffff88001e26bf48) at fs/read_write.c:406
    #8  0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960, count=0x4000, pos=0xffff88001e26bf48) at fs/read_write.c:435
    #9  0xffffffff8113816c in sys_write (fd=, buf=0x1ec2960, count=0x4000) at fs/read_write.c:487
    #10
    #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
    #12 0x0000000000000000 in ?? ()
    gdb> print offset
    $22 = 0xffffffffffffffff
    gdb> print idx
    $23 = 0xffffffff
    gdb> print inode->i_blkbits
    $24 = 0xc
    gdb> up
    #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
    2512        if (ext4_da_should_update_i_disksize(page, end)) {
    gdb> print start
    $25 = 0x0
    gdb> print end
    $26 = 0xffffffffffffffff
    gdb> print pos
    $27 = 0x108000
    gdb> print new_i_size
    $28 = 0x108000
    gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
    $29 = 0xd9000
    gdb> down
    2467        for (i = 0; i < idx; i++)
    gdb> print i
    $30 = 0xd44acbee

    This is 100% reproducible with some autonuma development code tuned
    in a very aggressive manner (not the normal way, even for knumad),
    which does "exotic" changes to the ptes. It wouldn't normally
    trigger, but I don't see why it can't happen normally if the page
    is added to the swap cache in between the two faults, leading to
    "copied" being zero (which then hangs in ext4). So it should be
    fixed. It is especially possible with lumpy reclaim (albeit
    disabled if compaction is enabled), as that would ignore the young
    bits in the ptes.
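    The guard itself is simple; a toy Python model (not the ext4 code),
    using the i_disksize and pos values from the trace above:

```python
# Toy model of the ->write_end fix: if the atomic copy faulted and
# copied == 0, skip the i_disksize enlargement entirely.

def write_end(i_disksize, pos, copied):
    """Return the new on-disk size; only grow it when data was copied."""
    new_i_size = pos + copied
    if copied > 0 and new_i_size > i_disksize:
        i_disksize = new_i_size     # the real code updates EXT4_I(inode)
    return i_disksize
```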

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Andrea Arcangeli
     
  • * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "x86, efi: Calling __pa() with an ioremap()ed address is invalid"
    x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: add missing spin_unlock at ceph_mdsc_build_path()
    ceph: fix SEEK_CUR, SEEK_SET regression
    crush: fix mapping calculation when force argument doesn't exist
    ceph: use i_ceph_lock instead of i_lock
    rbd: remove buggy rollback functionality
    rbd: return an error when an invalid header is read
    ceph: fix rasize reporting by ceph_show_options

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: set max_pause to lowest value on zero bdi_dirty
    writeback: permit through good bdi even when global dirty exceeded
    writeback: comment on the bdi dirty threshold
    fs: Make write(2) interruptible by a fatal signal
    writeback: Fix issue on make htmldocs

    Linus Torvalds
     
  • One of the paths was missing a spin_unlock.

    Signed-off-by: Yehuda Sadeh

    Yehuda Sadeh
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • same story as with ubifs

    Signed-off-by: Al Viro

    Al Viro
     
  • doing that before you are ready to handle mount() is a Bad Idea(tm)...

    Signed-off-by: Al Viro

    Al Viro
     
  • * 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm:
    ARM: 7204/1: arch/arm/kernel/setup.c: initialize arm_dma_zone_size earlier
    ARM: 7185/1: perf: don't assign platform_device on unsupported CPUs
    ARM: 7187/1: fix unwinding for XIP kernels
    ARM: 7186/1: fix Kconfig issue with PHYS_OFFSET and !MMU

    Linus Torvalds
     
  • Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that
    it always evaluates as true. This is semantically harmless, but makes
    SEEK_CUR and SEEK_SET needlessly query the server.

    Rewrite the if to explicitly enumerate the cases in which we DO
    need a valid i_size, to make this code less fragile.
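    The shape of the fix can be sketched in Python (illustrative, not
    the ceph code): enumerate the whence values that genuinely need an
    up-to-date i_size, instead of a fragile negated test:

```python
import os

# Only these seek modes need the current file size; SEEK_SET/SEEK_CUR
# can be resolved from f_pos alone. (SEEK_HOLE/SEEK_DATA would belong
# here too, where defined.)
NEEDS_I_SIZE = {os.SEEK_END}

def llseek_needs_getattr(whence):
    """Should llseek query the server for a valid i_size first?"""
    return whence in NEEDS_I_SIZE
```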

    Reported-by: Roel Kluin
    Signed-off-by: Sage Weil

    Sage Weil
     

13 Dec, 2011

5 commits

  • Fix race between lseek(fd, 0, SEEK_CUR) and read/write. This was fixed in
    generic code by commit 5b6f1eb97d (vfs: lseek(fd, 0, SEEK_CUR) race condition).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The test in fuse_file_llseek() "not SEEK_CUR or not SEEK_SET" always evaluates
    to true.

    This was introduced in 3.1 by commit 06222e49 (fs: handle SEEK_HOLE/SEEK_DATA
    properly in all fs's that define their own llseek) and changed the behavior of
    SEEK_CUR and SEEK_SET to always retrieve the file attributes. This is a
    performance regression.

    Fix the test so that it makes sense.

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    CC: Josef Bacik
    CC: Al Viro

    Roel Kluin
     
  • Fix two bugs in fuse_retrieve():

    - retrieving more than one page would yield repeated instances of the
    first page

    - if more than FUSE_MAX_PAGES_PER_REQ pages were requested, then
    the request page array would overflow

    fuse_retrieve() was added in 2.6.36 and these bugs had been there since the
    beginning.
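    Both bugs can be modeled with a toy Python loop (illustrative
    constants, not the fuse code): the page index must advance on each
    iteration, and the count must be clamped to the request array size:

```python
MAX_PAGES_PER_REQ = 32   # stands in for FUSE_MAX_PAGES_PER_REQ

def retrieve(pages_wanted, advance_index=True):
    """Collect page indices for one retrieve request."""
    req_pages, index = [], 0
    while len(req_pages) < pages_wanted:
        if len(req_pages) >= MAX_PAGES_PER_REQ:
            break                    # overflow fix: clamp to array size
        req_pages.append(index)
        if advance_index:
            index += 1               # repetition fix: move past page 0
    return req_pages
```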

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org

    Miklos Szeredi
     
  • Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
    the case of a compile-time constant '1' argument. Probably because it
    originated from the same code, sharing history with the roundup version
    from before the bugfix (for that one, see commit 1a06a52ee1b0: "Fix
    roundup_pow_of_two(1)").

    However, unlike the roundup version, the fix for rounddown is to just
    remove the broken special case entirely. It's simply not needed - the
    generic code

    1UL << ilog2(n)

    does the right thing for the constant '1' argument too. The only
    reason roundup needed that special case was because rounding up
    does so by subtracting one from the argument (and then adding one
    to the result), causing the obvious problems with "ilog2(0)".

    But rounddown doesn't do any of that, since ilog2() naturally truncates
    (ie "rounds down") to the right rounded down value. And without the
    ilog2(0) case, there's no reason for the special case that had the wrong
    value.

    tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
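    In Python the generic expression can be modeled directly, with
    bit_length() - 1 playing the role of ilog2 (truncating log2) for
    positive n:

```python
# Model of the generic kernel expression 1UL << ilog2(n).
# For a positive integer n, n.bit_length() - 1 == floor(log2(n)).

def rounddown_pow_of_two(n):
    """Largest power of two <= n, for n >= 1."""
    if n < 1:
        raise ValueError("undefined for n < 1, like ilog2(0)")
    return 1 << (n.bit_length() - 1)
```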

    Acked-by: Dmitry Torokhov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
    hwmon: (jz4740) Staticise jz4740_hwmon_driver
    hwmon: (jz4740) fix signedness bug

    Linus Torvalds