12 Dec, 2011

19 commits

  • It is assumed that RCU won't be used once we switch to tickless
    mode and until we restart the tick. However, this is not always
    true, as on x86-64, where we dereference the idle notifiers after
    the tick is stopped.

    To prepare for fixing this, add two new APIs:
    tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().

    If no use of RCU is made in the idle loop between the
    tick_nohz_idle_enter() and tick_nohz_idle_exit() calls, the arch
    must instead call the new *_norcu() versions, so that it does not
    need to call rcu_idle_enter() and rcu_idle_exit() itself.

    Otherwise the arch must call tick_nohz_idle_enter() and
    tick_nohz_idle_exit() and also explicitly call:

    - rcu_idle_enter() after its last use of RCU before the CPU is put
    to sleep.
    - rcu_idle_exit() before the first use of RCU after the CPU is woken
    up.
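
    A minimal sketch of the two patterns, assuming a hypothetical
    architecture idle loop (arch_cpu_sleep() and the idle-notifier
    calls are illustrative names, not taken from any real arch):

      /* Case 1: no RCU use in the idle loop. */
      tick_nohz_idle_enter_norcu();     /* also calls rcu_idle_enter() */
      arch_cpu_sleep();                 /* hypothetical low-power wait */
      tick_nohz_idle_exit_norcu();      /* also calls rcu_idle_exit() */

      /* Case 2: the idle loop uses RCU, e.g. via idle notifiers. */
      tick_nohz_idle_enter();
      arch_enter_idle_notify();         /* hypothetical; may use RCU */
      rcu_idle_enter();                 /* after the last use of RCU */
      arch_cpu_sleep();
      rcu_idle_exit();                  /* before the first use of RCU */
      arch_exit_idle_notify();          /* hypothetical; may use RCU */
      tick_nohz_idle_exit();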

    Signed-off-by: Frederic Weisbecker
    Cc: Mike Frysinger
    Cc: Guan Xuetao
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: Hans-Christian Egtvedt
    Cc: Ralf Baechle
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Russell King
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     
  • The tick_nohz_stop_sched_tick() function, which tries to delay
    the next timer tick as long as possible, can be called from two
    places:

    - From the idle loop, to start the dyntick idle mode
    - From interrupt exit, if we have interrupted the dyntick
    idle mode, so that we reprogram the next tick event in
    case the irq changed some internal state that requires this
    action.

    There are only a few minor differences between the two cases,
    handled inside that function and driven by the ts->inidle per-CPU
    variable and the inidle parameter. Together these guarantee that
    we only update the dyntick mode on irq exit if we actually
    interrupted the dyntick idle mode, and that we enter the RCU
    extended quiescent state from idle loop entry only.

    Split this function into:

    - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
    dynticks idle mode unconditionally if it can, and enters the RCU
    extended quiescent state.

    - tick_nohz_irq_exit(), which only updates the dynticks idle mode
    when ts->inidle is set (i.e. if tick_nohz_idle_enter() has been called).

    To maintain symmetry, tick_nohz_restart_sched_tick() has been
    renamed tick_nohz_idle_exit().

    This simplifies the code and micro-optimizes the irq exit path (no
    need for local_irq_save() there). It also prepares for the split
    between the dynticks and RCU extended quiescent state logic, which
    we'll need to further fix illegal uses of RCU in extended quiescent
    states in the idle loop.
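
    A hedged sketch of the resulting call sites (simplified, not the
    exact kernel code; the ts->inidle check lives inside
    tick_nohz_irq_exit(), and cpu_sleep() is an illustrative name):

      /* idle loop iteration */
      tick_nohz_idle_enter();   /* sets ts->inidle, may stop the tick,
                                 * enters RCU extended quiescent state */
      while (!need_resched())
              cpu_sleep();      /* illustrative low-power wait */
      tick_nohz_idle_exit();    /* restarts the tick, exits the EQS */

      /* interrupt exit path */
      tick_nohz_irq_exit();     /* reprograms the next tick event only
                                 * if ts->inidle is set, i.e. only if
                                 * we interrupted dyntick idle mode */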

    Signed-off-by: Frederic Weisbecker
    Cc: Mike Frysinger
    Cc: Guan Xuetao
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: Hans-Christian Egtvedt
    Cc: Ralf Baechle
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Russell King
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • A common debug_lockdep_rcu_enabled() function is used to check
    whether RCU lockdep splats should be reported, but
    srcu_read_lock_held() does not use it. This commit therefore brings
    srcu_read_lock_held() up to date.
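
    The updated helper is essentially the following (a close
    approximation of the resulting code, shown for illustration):

      static inline int srcu_read_lock_held(struct srcu_struct *sp)
      {
              if (!debug_lockdep_rcu_enabled())
                      return 1;
              return lock_is_held(&sp->dep_map);
      }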

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Catch SRCU up to the other variants of RCU by making PROVE_RCU
    complain if either srcu_read_lock() or srcu_read_lock_held() is
    used from within RCU-idle mode.

    Frederic reworked this to allow for the new versions of his patches
    that check for extended quiescent states.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Simplify things a bit by substituting the definitions of the single-line
    rcu_read_acquire(), rcu_read_release(), rcu_read_acquire_bh(),
    rcu_read_release_bh(), rcu_read_acquire_sched(), and
    rcu_read_release_sched() functions at their call points.
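
    Illustratively (a sketch, not the verbatim diff; the lockdep
    helper's exact arguments are elided):

      /* Before: rcu_read_lock() went through a one-line wrapper. */
      static inline void rcu_read_lock(void)
      {
              __rcu_read_lock();
              __acquire(RCU);
              rcu_read_acquire();               /* one-line wrapper */
      }

      /* After: the shared helper is called directly at the call point. */
      static inline void rcu_read_lock(void)
      {
              __rcu_read_lock();
              __acquire(RCU);
              rcu_lock_acquire(&rcu_lock_map);  /* direct call */
      }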

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • We are currently able to detect uses of rcu_dereference_check() inside
    extended quiescent states (such as the RCU-free window in idle).
    But rcu_read_lock() and friends can be used without rcu_dereference(),
    so the earlier commit checking for use of rcu_dereference() and
    friends while in RCU idle mode misses some error conditions. This
    commit therefore adds extended quiescent state checking to
    rcu_read_lock() and friends.

    Uses of RCU from within RCU-idle mode are totally ignored by
    RCU, hence the importance of these checks.
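
    A sketch of the idea (simplified from the actual change; the exact
    diagnostic used upstream may differ):

      static inline void rcu_read_lock(void)
      {
              __rcu_read_lock();
              __acquire(RCU);
              rcu_lock_acquire(&rcu_lock_map);
              /* New: complain if called from an RCU extended quiescent
               * state, where this rcu_read_lock() would be ignored. */
              WARN_ON_ONCE(rcu_is_cpu_idle());
      }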

    Signed-off-by: Frederic Weisbecker
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Lai Jiangshan
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Inform the user if an RCU usage error is detected by lockdep while in
    an extended quiescent state (in this case, the RCU-free window in idle).
    This is accomplished by adding a line to the RCU lockdep splat indicating
    whether or not the splat occurred in extended quiescent state.

    Uses of RCU from within extended quiescent state mode are totally ignored
    by RCU, hence the importance of this diagnostic.

    Signed-off-by: Frederic Weisbecker
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Lai Jiangshan
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Report that none of the rcu read lock maps are held while in an RCU
    extended quiescent state (the section between rcu_idle_enter()
    and rcu_idle_exit()). This helps detect any use of rcu_dereference()
    and friends from within the section in idle where RCU is not allowed.

    This way we can guarantee an extended quiescent window where the CPU
    can be put in dyntick idle mode or can simply avoid taking part in
    any global grace period completion while in the idle loop.

    Uses of RCU from such mode are totally ignored by RCU, hence the
    importance of these checks.

    Signed-off-by: Frederic Weisbecker
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Lai Jiangshan
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Empty void functions do not need "return", so this commit removes it
    from rcu_report_exp_rnp().

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Thomas Gleixner
     
  • When setting up an expedited grace period, if there were no readers, the
    task will awaken itself. This commit removes this useless self-awakening.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Thomas Gleixner
     
  • Because rcu_is_cpu_idle() is to be used to check for extended quiescent
    states in RCU-preempt read-side critical sections, it cannot assume that
    preemption is disabled. And preemption must be disabled when accessing
    the dyntick-idle state, because otherwise the following sequence of events
    could occur:

    1. Task A on CPU 1 enters rcu_is_cpu_idle() and picks up the pointer
    to CPU 1's per-CPU variables.

    2. Task B preempts Task A and starts running on CPU 1.

    3. Task A migrates to CPU 2.

    4. Task B blocks, leaving CPU 1 idle.

    5. Task A continues execution on CPU 2, accessing CPU 1's dyntick-idle
    information using the pointer fetched in step 1 above, and finds
    that CPU 1 is idle.

    6. Task A therefore incorrectly concludes that it is executing in
    an extended quiescent state, possibly issuing a spurious splat.

    Therefore, this commit disables preemption within the rcu_is_cpu_idle()
    function.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • One of lclaudio's systems was seeing RCU CPU stall warnings from idle.
    These turned out to be caused by a bug that stopped scheduling-clock
    tick interrupts from being sent to a given CPU for several hundred seconds.
    This commit therefore updates the documentation to call this out as a
    possible cause for RCU CPU stall warnings.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Trace the rcutorture RCU accesses and dump the trace buffer when the
    first failure is detected.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Add an EXPORT_SYMBOL_GPL() so that rcutorture can dump the trace buffer
    upon detection of an RCU error.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Earlier versions of RCU used the scheduling-clock tick to detect idleness
    by checking for the idle task, but handled idleness differently for
    CONFIG_NO_HZ=y. But there are now a number of uses of RCU read-side
    critical sections in the idle task, for example, for tracing. A more
    fine-grained detection of idleness is therefore required.

    This commit presses the old dyntick-idle code into full-time service,
    so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
    always invoked at the beginning of an idle loop iteration. Similarly,
    rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
    at the end of an idle-loop iteration. This allows the idle task to
    use RCU everywhere except between consecutive rcu_idle_enter() and
    rcu_idle_exit() calls, in turn allowing architecture maintainers to
    specify exactly where in the idle loop that RCU may be used.

    Because some of the userspace upcall uses can result in what looks
    to RCU like half of an interrupt, it is not possible to expect that
    the irq_enter() and irq_exit() hooks will give exact counts. This
    patch therefore expands the ->dynticks_nesting counter to 64 bits
    and uses two separate bitfields to count process/idle transitions
    and interrupt entry/exit transitions. It is presumed that userspace
    upcalls do not happen in the idle loop or from usermode execution
    (though usermode might do a system call that results in an upcall).
    The counter is hard-reset on each process/idle transition, which
    avoids the interrupt entry/exit error from accumulating. Overflow
    is avoided by the 64-bitness of the ->dynticks_nesting counter.

    This commit also adds warnings if a non-idle task asks RCU to enter
    idle state (these checks will need some adjustment before applying
    Frederic's OS-jitter patches, http://lkml.org/lkml/2011/10/7/246).
    In addition, validation of ->dynticks and ->dynticks_nesting is added.
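
    A hedged sketch of the counter scheme (field and constant names
    follow the commit text; the constant's value is illustrative):

      #define DYNTICK_TASK_NESTING    (LLONG_MAX / 2 - 1)

      /* Process/idle transitions hard-reset the counter, so irq
       * entry/exit miscounts cannot accumulate across idle periods. */
      rdtp->dynticks_nesting = 0;                    /* rcu_idle_enter() */
      rdtp->dynticks_nesting = DYNTICK_TASK_NESTING; /* rcu_idle_exit()  */
      rdtp->dynticks_nesting++;                      /* rcu_irq_enter()  */
      rdtp->dynticks_nesting--;                      /* rcu_irq_exit()   */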

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • There are a number of bugs that can leak or overuse lock classes,
    which can cause the maximum number of lock classes (currently 8191)
    to be exceeded. However, the documentation does not tell you how to
    track down these problems. This commit addresses this shortcoming.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • When synchronize_sched_expedited() takes its second and subsequent
    snapshots of sync_sched_expedited_started, it subtracts 1. This
    means that if a concurrent caller of synchronize_sched_expedited()
    incremented the counter to exactly that value, it is prevented from
    seeing our successful completion and cannot take advantage of it.
    This restriction is pointless, given that our full expedited grace
    period happened after the other caller started, and thus can serve
    as a proxy for that caller successfully executing
    try_stop_cpus().

    This commit therefore removes the subtraction of 1.
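
    A sketch of the change (variable names as described above):

      /* Before: the caller that incremented the counter to exactly
       * this value could not piggyback on our grace period. */
      snap = atomic_read(&sync_sched_expedited_started) - 1;

      /* After: that caller started after we did, so our completion is
       * a valid proxy for its try_stop_cpus() succeeding. */
      snap = atomic_read(&sync_sched_expedited_started);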

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Because rcu_read_unlock_special() samples rcu_preempted_readers_exp(rnp)
    after dropping rnp->lock, the following sequence of events is possible:

    1. Task A exits its RCU read-side critical section, and removes
    itself from the ->blkd_tasks list, releases rnp->lock, and is
    then preempted. Task B remains on the ->blkd_tasks list, and
    blocks the current expedited grace period.

    2. Task B exits from its RCU read-side critical section and removes
    itself from the ->blkd_tasks list. Because it is the last task
    blocking the current expedited grace period, it ends that
    expedited grace period.

    3. Task A resumes, and samples rcu_preempted_readers_exp(rnp) which
    of course indicates that nothing is blocking the nonexistent
    expedited grace period. Task A is again preempted.

    4. Some other CPU starts an expedited grace period. There are several
    tasks blocking this expedited grace period queued on the
    same rcu_node structure that Task A was using in step 1 above.

    5. Task A examines its state and incorrectly concludes that it was
    the last task blocking the expedited grace period on the current
    rcu_node structure. It therefore reports completion up the
    rcu_node tree.

    6. The expedited grace period can then incorrectly complete before
    the tasks blocked on this same rcu_node structure exit their
    RCU read-side critical sections. Arbitrarily bad things happen.

    This commit therefore takes a snapshot of rcu_preempted_readers_exp(rnp)
    prior to dropping the lock, so that only the last task thinks that it is
    the last task, thus avoiding the failure scenario laid out above.
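
    A hedged sketch of the fixed unlock path (simplified from
    rcu_read_unlock_special(); surrounding declarations elided):

      raw_spin_lock_irqsave(&rnp->lock, flags);
      empty_exp = !rcu_preempted_readers_exp(rnp);  /* snapshot under lock */
      list_del_init(&t->rcu_node_entry);
      empty_exp_now = !rcu_preempted_readers_exp(rnp);
      raw_spin_unlock_irqrestore(&rnp->lock, flags);

      /* Only the task whose removal emptied the list reports
       * expedited-grace-period completion up the rcu_node tree. */
      if (!empty_exp && empty_exp_now)
              rcu_report_exp_rnp(rsp, rnp, true);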

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The ->signaled field was named before complications in the form of
    dyntick-idle mode and offlined CPUs. These complications have required
    that force_quiescent_state() be implemented as a state machine, instead
    of simply unconditionally sending reschedule IPIs. Therefore, this
    commit renames ->signaled to ->fqs_state to catch up with the new
    force_quiescent_state() reality.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

10 Dec, 2011

9 commits


09 Dec, 2011

12 commits

  • In order to safely dereference current->real_parent inside an
    rcu_read_lock()-protected section, we need an rcu_dereference().
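
    The pattern looks like this (modeled on getppid-style usage; shown
    for illustration):

      pid_t ppid;

      rcu_read_lock();
      ppid = task_tgid_vnr(rcu_dereference(current->real_parent));
      rcu_read_unlock();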

    Signed-off-by: Mandeep Singh Baines
    Cc: Thomas Gleixner
    Cc: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • Modify the initialization of PCIe capability registers in the
    Tsi721 mport driver:
    - change the Completion Timeout value to avoid unexpected data
    transfer aborts during intensive traffic.
    - replace the hardcoded offset of the PCIe capability block with
    the common lookup function (sketched below).

    This patch is applicable to kernel versions starting from 3.2-rc1.
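
    A hedged sketch of the second change (the register access shown is
    illustrative, not taken from the driver):

      u16 devctl;
      int pos = pci_pcie_cap(pdev);   /* common lookup, replaces the
                                       * hardcoded capability offset */

      pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL, &devctl);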

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Bug fix for the Tsi721 RapidIO mport driver: Tsi721 supports four
    RapidIO mailboxes (MBOX0 - MBOX3) as defined by the RapidIO
    specification. Mailbox resources have to be properly reported to
    allow use of all available mailboxes (the initial version reports
    only MBOX0).

    This patch is applicable to kernel versions starting from 3.2-rc1.

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Replace the dma_alloc_coherent()+memset() pair with the new
    dma_zalloc_coherent() added by Andrew Morton for kernel version 3.2.
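
    The substitution is mechanical:

      /* Before: */
      buf = dma_alloc_coherent(dev, size, &handle, GFP_KERNEL);
      if (buf)
              memset(buf, 0, size);

      /* After: */
      buf = dma_zalloc_coherent(dev, size, &handle, GFP_KERNEL);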

    Signed-off-by: Alexandre Bounine
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
    iowait times") we are reporting idle/io_wait time also while a CPU is
    tickless. We rely on get_{idle,iowait}_time functions to retrieve
    proper data.

    These functions, however, use usecs_to_cputime to translate
    microseconds to cputime64_t. This is just an alias for
    usecs_to_jiffies, which reduces the data type from u64 to unsigned
    int and also checks whether the given parameter overflows
    jiffies_to_usecs(MAX_JIFFY_OFFSET), returning MAX_JIFFY_OFFSET in
    that case.

    When we overflow depends on CONFIG_HZ, but especially for
    CONFIG_HZ_300 the threshold is quite low (1431649781), so we get
    MAX_JIFFY_OFFSET for more than 3000s until the unsigned int
    overflows. Just for reference, CONFIG_HZ_100 has an overflow window
    of around 20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000 ~2s.

    This resulted in a bug where people saw [h]top go mad, reporting
    100% CPU usage even though there was basically no CPU load. The
    reason was simply that /proc/stat stopped reporting idle/io_wait
    changes (reporting MAX_JIFFY_OFFSET instead), so the only values
    still changing were user and system time.

    Let's use nsecs_to_jiffies64 instead, which doesn't reduce the
    precision to a 32b type and is much more appropriate for cumulative
    time values (unlike usecs_to_jiffies, which is intended for timeout
    calculations).

    Signed-off-by: Michal Hocko
    Tested-by: Artem S. Tashkinov
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist
    after they are fully initialised. Unfortunately, it did not check
    that __vmalloc_area_node() successfully populated the area. In the
    event of allocation failure, the vmalloc area is freed but the
    pointer to the freed memory is inserted into the vmlist, leading to
    a crash later in get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range(). It does not use "goto fail" as in the
    previous error path, because a warning was already displayed by
    __vmalloc_area_node() before it called vfree() in its failure path.

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • setup_zone_migrate_reserve() expects zone->start_pfn to start at a
    pageblock_nr_pages-aligned pfn; otherwise we could access beyond an
    existing memblock, resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid() is always called at the start of each
    pageblock. Architectures with holes in pageblocks will be correctly
    handled by pfn_valid_within() in pageblock_is_reserved().
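
    A sketch of the alignment fix in setup_zone_migrate_reserve():

      start_pfn = zone->zone_start_pfn;
      end_pfn = start_pfn + zone->spanned_pages;
      /* Scan pageblock-aligned pfns only, so pfn_valid() is checked
       * once at the start of each pageblock. */
      start_pfn = roundup(start_pfn, pageblock_nr_pages);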

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Avoid unlocking an already-unlocked page if we failed to lock it.

    Signed-off-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps
    all page_tail->_count at zero at all times. But the current kernel
    does not set page_tail->_count to zero if a 1GB page is utilized.
    So when an IOMMU 1GB page is used by KVM, it will result in a
    kernel oops because a tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • With the 3.2-rc kernel, IOMMU 2M pages in KVM work. But when I
    tried to use IOMMU 1GB pages in KVM, I encountered an oops and the
    1GB page failed to be used.

    The root cause is that 1GB page allocation calls gup_huge_pud()
    while 2M page allocation calls gup_huge_pmd(). If compound pages
    are used and the page is a tail page, gup_huge_pmd() increases
    _mapcount to record that tail pages are mapped, while
    gup_huge_pud() does not do that.

    So when the mapped page is released, it will result in a kernel
    oops because the page is not marked as mapped.

    This patch adds tail-page processing for compound pages in the 1GB
    huge page path, matching what the 2M page path does (see the sketch
    below).
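
    A hedged sketch of the added tail handling in gup_huge_pud(),
    mirroring gup_huge_pmd():

      head = pte_page(pte);
      page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
      do {
              VM_BUG_ON(compound_head(page) != head);
              pages[*nr] = page;
              if (PageTail(page))
                      get_huge_page_tail(page);  /* the missing step */
              (*nr)++;
              page++;
      } while (addr += PAGE_SIZE, addr != end);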

    Reproduce like:
    1. Add grub boot option: hugepagesz=1G hugepages=8
    2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
    3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
    -net none -device pci-assign,host=07:00.1

    kernel BUG at mm/swap.c:114!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    put_page+0x15/0x37
    kvm_release_pfn_clean+0x31/0x36
    kvm_iommu_put_pages+0x94/0xb1
    kvm_iommu_unmap_memslots+0x80/0xb6
    kvm_assign_device+0xba/0x117
    kvm_vm_ioctl_assigned_device+0x301/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP put_compound_page+0xd4/0x168

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock
    race") introduced another silly bug where we would try to acquire
    an already-held lock. Avoid this.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • More players have joined memory cgroup development, and Johannes'
    great work has changed the internal design of memory cgroup
    dramatically, with more to come. Michal Hocko has fixed many bugs
    and knows memory cgroup very well. Daisuke Nishimura helped us very
    much but seems busy now; thanks for all his work.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki