24 Dec, 2011
1 commit
-
If CONFIG_SCHEDSTATS is defined, the kernel maintains
information about how long the task was sleeping or
in the case of iowait, blocking in the kernel before
getting woken up.This will be useful for sleep time profiling.
Note: this information is only provided for sched_fair.
Other scheduling classes may choose to provide this in
the future.Note: the delay includes the time spent on the runqueue
as well.Signed-off-by: Arun Sharma
Acked-by: Peter Zijlstra
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Arnaldo Carvalho de Melo
Cc: Andrew Vagin
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/r/1324512940-32060-2-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar
23 Dec, 2011
1 commit
-
The panic-on-framebuffer code seems to cause a schedule
to occur during an oops. This causes a bunch of extra
spew as can be seen in:https://bugzilla.redhat.com/attachment.cgi?id=549230
Don't do scheduler debug checks when we are oopsing already.
Signed-off-by: Dave Jones
Link: http://lkml.kernel.org/r/20111222213929.GA4722@redhat.com
Signed-off-by: Ingo Molnar
21 Dec, 2011
7 commits
-
There is a small race between try_to_wake_up() and sched_move_task(),
which is trying to move the process being woken up.try_to_wake_up() on CPU0 sched_move_task() on CPU1
--------------------------------+---------------------------------
raw_spin_lock_irqsave(p->pi_lock)
task_waking_fair()
->p.se.vruntime -= cfs_rq->min_vruntime
ttwu_queue()
->send reschedule IPI to CPU1
raw_spin_unlock_irqsave(p->pi_lock)
task_rq_lock()
-> tring to aquire both p->pi_lock and
rq->lock with IRQ disabled
task_move_group_fair()
-> p.se.vruntime
-= (old)cfs_rq->min_vruntime
+= (new)cfs_rq->min_vruntime
task_rq_unlock()(via IPI)
sched_ttwu_pending()
raw_spin_lock(rq->lock)
ttwu_do_activate()
...
enqueue_entity()
child.se->vruntime += cfs_rq->min_vruntime
raw_spin_unlock(rq->lock)As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_waking_fair().Signed-off-by: Daisuke Nishimura
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar -
There is a small race between do_fork() and sched_move_task(), which is
trying to move the child.do_fork() sched_move_task()
--------------------------------+---------------------------------
copy_process()
sched_fork()
task_fork_fair()
-> vruntime of the child is initialized
based on that of the parent.
-> we can see the child in "tasks" file now.
task_rq_lock()
task_move_group_fair()
-> child.se.vruntime
-= (old)cfs_rq->min_vruntime
+= (new)cfs_rq->min_vruntime
task_rq_unlock()
wake_up_new_task()
...
enqueue_entity()
child.se.vruntime += cfs_rq->min_vruntimeAs a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_fork_fair().Signed-off-by: Daisuke Nishimura
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar -
There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.task_fork_fair() sched_move_task()
--------------------------------+---------------------------------
cfs_rq = task_cfs_rq(current)
-> cfs_rq is the "old" one.
curr = cfs_rq->curr
-> curr is set to the parent.
task_rq_lock()
dequeue_task()
->parent.se.vruntime -= (old)cfs_rq->min_vruntime
enqueue_task()
->parent.se.vruntime += (new)cfs_rq->min_vruntime
task_rq_unlock()
raw_spin_lock_irqsave(rq->lock)
se->vruntime = curr->vruntime
-> vruntime of the child is set to that of the parent
which has already been updated by sched_move_task().
se->vruntime -= (old)cfs_rq->min_vruntime.
raw_spin_unlock_irqrestore(rq->lock)As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.This patch fixes this problem by setting "cfs_rq" and "curr" after
holding the rq->lock.Signed-off-by: Daisuke Nishimura
Acked-by: Paul Turner
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar -
Remove cfs bandwidth period check from tg_set_cfs_period.
Invalid bandwidth period's lower/upper limits are denoted
by min_cfs_quota_period/max_cfs_quota_period repsectively,
and are checked against valid period in tg_set_cfs_bandwidth().As pjt pointed out, negative input will result in very large unsigned
numbers and will be caught by the max allowed period test.Signed-off-by: Kamalesh Babulal
Acked-by: Paul Turner
[ammended changelog to mention negative values]
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
--
kernel/sched/core.c | 3 ---
1 file changed, 3 deletions(-)Signed-off-by: Ingo Molnar
-
The current lock break relies on contention on the rq locks, something
which might never come because we've got IRQs disabled. Or will be
very likely because on anything with more than 2 cpus a synchronized
load-balance pass will very likely cause contention on the rq locks.Also the sched_nr_migrate thing fails when it gets trapped the loops
of either the cgroup muck in load_balance_fair() or the move_tasks()
load condition.Instead, use the new lb_flags field to propagate break/abort
conditions for all these loops and create a new loop outside the irq
disabled on the break being required.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org
Signed-off-by: Ingo Molnar -
Replace the all_pinned argument with a flags field so that we can add
some extra controls throughout that entire call chain.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org
Signed-off-by: Ingo Molnar -
Mike reported a 13% drop in netperf TCP_RR performance due to the
new remote wakeup code. Suresh too noticed some performance issues
with it.Reducing the IPIs to only cross cache domains solves the observed
performance issues.Reported-by: Suresh Siddha
Reported-by: Mike Galbraith
Acked-by: Suresh Siddha
Acked-by: Mike Galbraith
Signed-off-by: Peter Zijlstra
Cc: Chris Mason
Cc: Dave Kleikamp
Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins
Signed-off-by: Ingo Molnar
20 Dec, 2011
1 commit
-
Conflicts:
drivers/cpufreq/cpufreq_conservative.c
drivers/cpufreq/cpufreq_ondemand.c
drivers/macintosh/rack-meter.c
fs/proc/stat.c
fs/proc/uptime.c
kernel/sched/core.c
16 Dec, 2011
1 commit
-
Wrap another ->real_parent dereference while under rcu_read_lock.
Signed-off-by: Kees Cook
Cc: Peter Zijlstra
Cc: Glauber Costa
Cc: Suresh Siddha
Cc: KAMEZAWA Hiroyuki
Link: http://lkml.kernel.org/r/20111215164918.GA13003@www.outflux.net
[ tidied up the changelog ]
Signed-off-by: Ingo Molnar
15 Dec, 2011
8 commits
-
For 32-bit architectures using standard jiffies the idletime calculation
in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds
of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000.
Switch to 64-bit calculations.Cc: stable@vger.kernel.org
Cc: Michael Abbott
Signed-off-by: Martin Schwidefsky -
Make cputime_t and cputime64_t nocast to enable sparse checking to
detect incorrect use of cputime. Drop the cputime macros for simple
scalar operations. The conversion macros are still needed.Signed-off-by: Martin Schwidefsky
-
The parent and real_parent pointers should be considered __rcu,
since they should be held under either tasklist_lock or
rcu_read_lock.Signed-off-by: Kees Cook
Cc: Peter Zijlstra
Cc: Paul E. McKenney
Cc: Al Viro
Link: http://lkml.kernel.org/r/20111214223925.GA27578@www.outflux.net
Signed-off-by: Ingo Molnar -
Merge reason: Pick up the latest fixes.
Signed-off-by: Ingo Molnar
-
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/kms: add some new pci ids -
* tag 'tytso-for-linus-20111214' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: handle EOF correctly in ext4_bio_write_page()
ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
ext4: correctly handle pages w/o buffers in ext4_discard_partial_buffers()
ext4: avoid potential hang in mpage_submit_io() when blocksize < pagesize
ext4: avoid hangs in ext4_da_should_update_i_disksize()
ext4: display the correct mount option in /proc/mounts for [no]init_itable
ext4: Fix crash due to getting bogus eh_depth value on big-endian systems
ext4: fix ext4_end_io_dio() racing against fsync().. using the new signed tag merge of git that now verifies the gpg
signature automatically. Yay. The branchname was just 'dev', which is
prettier. I'll tell Ted to use nicer tag names for future cases. -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: llseek fix race
fuse: fix llseek bug
fuse: fix fuse_retrieve -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/ncpfs: fix error paths and goto statements in ncp_fill_super()
configfs: register_filesystem() called too early
fuse: register_filesystem() called too early
ubifs: too early register_filesystem()
... and the same kind of leak for mqueue
procfs: fix a vfsmount longterm reference leak
14 Dec, 2011
16 commits
-
Fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=43739Signed-off-by: Alex Deucher
Cc: stable@kernel.org
Signed-off-by: Dave Airlie -
The label 'out_bdi' should be followed by bdi_destroy() instead of
fput() which should be after the 'out_fput' label.If bdi_setup_and_register() fails then jump to the 'out_fput' label
instead of the 'out_bdi' one.If fget(data.info_fd) fails then jump to the previously fixed 'out_bdi'
label to call bdi_destroy() otherwise the bdi object will not be
destroyed.Compile tested only.
Signed-off-by: Djalal Harouni
Signed-off-by: Al Viro -
We need to zero out part of a page which beyond EOF before setting uptodate,
otherwise, mapread or write will see non-zero data beyond EOF.Signed-off-by: Yongqiang Yang
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org -
If a file is fallocated on a hole, map->m_lblk + map->m_len may be greater
than ee_block + ee_len.Signed-off-by: Yongqiang Yang
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org -
If a page has been read into memory and never been written, it has no
buffers, but we should handle the page in truncate or punch hole.VFS code of writing operations has handled holes correctly, so this
patch removes the code handling holes in writing operations.Signed-off-by: Yongqiang Yang
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org -
If there is an unwritten but clean buffer in a page and there is a
dirty buffer after the buffer, then mpage_submit_io does not write the
dirty buffer out. As a result, da_writepages loops forever.This patch fixes the problem by checking dirty flag.
Signed-off-by: Yongqiang Yang
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org -
If the pte mapping in generic_perform_write() is unmapped between
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
"copied" parameter to ->end_write can be zero. ext4 couldn't cope with
it with delayed allocations enabled. This skips the i_disksize
enlargement logic if copied is zero and no new data was appeneded to
the inode.gdb> bt
, count=0x4\
#0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
#2 0xffffffff810d97f1 in generic_perform_write (iocb=, iov=, nr_segs=, pos=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2440
#3 generic_file_buffered_write (iocb=, iov=, nr_segs=, p\
os=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2482
#4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
xffff88001e26be40) at mm/filemap.c:2600
#5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=, pos=) at mm/filemap.c:2632
#6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
t fs/ext4/file.c:136
#7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=, len=, \
ppos=0xffff88001e26bf48) at fs/read_write.c:406
#8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960
000, pos=0xffff88001e26bf48) at fs/read_write.c:435
#9 0xffffffff8113816c in sys_write (fd=, buf=0x1ec2960 , count=0x\
4000) at fs/read_write.c:487
#10
#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
#12 0x0000000000000000 in ?? ()
gdb> print offset
$22 = 0xffffffffffffffff
gdb> print idx
$23 = 0xffffffff
gdb> print inode->i_blkbits
$24 = 0xc
gdb> up
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
2512 if (ext4_da_should_update_i_disksize(page, end)) {
gdb> print start
$25 = 0x0
gdb> print end
$26 = 0xffffffffffffffff
gdb> print pos
$27 = 0x108000
gdb> print new_i_size
$28 = 0x108000
gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
$29 = 0xd9000
gdb> down
2467 for (i = 0; i < idx; i++)
gdb> print i
$30 = 0xd44acbeeThis is 100% reproducible with some autonuma development code tuned in
a very aggressive manner (not normal way even for knumad) which does
"exotic" changes to the ptes. It wouldn't normally trigger but I don't
see why it can't happen normally if the page is added to swap cache in
between the two faults leading to "copied" being zero (which then
hangs in ext4). So it should be fixed. Especially possible with lumpy
reclaim (albeit disabled if compaction is enabled) as that would
ignore the young bits in the ptes.Signed-off-by: Andrea Arcangeli
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org -
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86, efi: Calling __pa() with an ioremap()ed address is invalid"
x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: add missing spin_unlock at ceph_mdsc_build_path()
ceph: fix SEEK_CUR, SEEK_SET regression
crush: fix mapping calculation when force argument doesn't exist
ceph: use i_ceph_lock instead of i_lock
rbd: remove buggy rollback functionality
rbd: return an error when an invalid header is read
ceph: fix rasize reporting by ceph_show_options -
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: set max_pause to lowest value on zero bdi_dirty
writeback: permit through good bdi even when global dirty exceeded
writeback: comment on the bdi dirty threshold
fs: Make write(2) interruptible by a fatal signal
writeback: Fix issue on make htmldocs -
one of the paths was missing spin_unlock
Signed-off-by: Yehuda Sadeh
-
Signed-off-by: Al Viro
-
same story as with ubifs
Signed-off-by: Al Viro
-
doing that before you are ready to handle mount() is a Bad Idea(tm)...
Signed-off-by: Al Viro
-
* 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm:
ARM: 7204/1: arch/arm/kernel/setup.c: initialize arm_dma_zone_size earlier
ARM: 7185/1: perf: don't assign platform_device on unsupported CPUs
ARM: 7187/1: fix unwinding for XIP kernels
ARM: 7186/1: fix Kconfig issue with PHYS_OFFSET and !MMU -
Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that
it always evaluates as true. This is semantically harmless, but makes
SEEK_CUR and SEEK_SET needlessly query the server.Rewrite the if to explicitly enumerate the cases we DO need a valid i_size
to make this code less fragile.Reported-by: Roel Kluin
Signed-off-by: Sage Weil
13 Dec, 2011
5 commits
-
Fix race between lseek(fd, 0, SEEK_CUR) and read/write. This was fixed in
generic code by commit 5b6f1eb97d (vfs: lseek(fd, 0, SEEK_CUR) race condition).Signed-off-by: Miklos Szeredi
-
The test in fuse_file_llseek() "not SEEK_CUR or not SEEK_SET" always evaluates
to true.This was introduced in 3.1 by commit 06222e49 (fs: handle SEEK_HOLE/SEEK_DATA
properly in all fs's that define their own llseek) and changed the behavior of
SEEK_CUR and SEEK_SET to always retrieve the file attributes. This is a
performance regression.Fix the test so that it makes sense.
Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
CC: Josef Bacik
CC: Al Viro -
Fix two bugs in fuse_retrieve():
- retrieving more than one page would yield repeated instances of the
first page- if more than FUSE_MAX_PAGES_PER_REQ pages were requested than the
request page array would overflowfuse_retrieve() was added in 2.6.36 and these bugs had been there since the
beginning.Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org -
Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
the case of a compile-time constant '1' argument. Probably because it
originated from the same code, sharing history with the roundup version
from before the bugfix (for that one, see commit 1a06a52ee1b0: "Fix
roundup_pow_of_two(1)").However, unlike the roundup version, the fix for rounddown is to just
remove the broken special case entirely. It's simply not needed - the
generic code1UL << ilog2(n)
does the right thing for the constant '1' argment too. The only reason
roundup needed that special case was because rounding up does so by
subtracting one from the argument (and then adding one to the result)
causing the obvious problems with "ilog2(0)".But rounddown doesn't do any of that, since ilog2() naturally truncates
(ie "rounds down") to the right rounded down value. And without the
ilog2(0) case, there's no reason for the special case that had the wrong
value.tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
Acked-by: Dmitry Torokhov
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds -
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (jz4740) Staticise jz4740_hwmon_driver
hwmon: (jz4740) fix signedness bug