Eric Lee / smarc-fsl-linux-kernel

30 May, 2011

1 commit

6345d24da mm: Fix boot crash in mm_alloc() ... Browse Code »

Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] find_next_bit+0x55/0xb0
Call Trace:
[] cpumask_any_but+0x2a/0x70
[] flush_tlb_mm+0x2b/0x80
[] pud_populate+0x35/0x50
[] pgd_alloc+0x9a/0xf0
[] mm_init+0xec/0x120
[] mm_alloc+0x53/0xd0

which was introduced by commit de03c72cfce5 ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask

Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.

Reported-by: Thomas Gleixner
Reported-by: Ingo Molnar
Cc: KOSAKI Motohiro
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-05-30 02:32:28 +0800

27 May, 2011

1 commit

a77aea920 cgroup: remove the ns_cgroup ... Browse Code »

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:

* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling in a exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.

Signed-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Cc: Eric W. Biederman
Cc: Jamal Hadi Salim
Reviewed-by: Li Zefan
Acked-by: Paul Menage
Acked-by: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Lezcano
2011-05-27 08:12:34 +0800

25 May, 2011

4 commits

162a7e750 printk: allocate kernel log buffer earlier ... Browse Code »

On larger systems, because of the numerous ACPI, Bootmem and EFI messages,
the static log buffer overflows before the larger one specified by the
log_buf_len param is allocated. Minimize the overflow by allocating the
new log buffer as soon as possible.

On kernels without memblock, a later call to setup_log_buf from
kernel/init.c is the fallback.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_PRINTK=n build]
Signed-off-by: Mike Travis
Cc: Yinghai Lu
Cc: "H. Peter Anvin"
Cc: Jack Steiner
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Travis
2011-05-25 23:39:48 +0800
d2b463135 init/calibrate.c: fix for critical bogoMIPS intermittent calculation failure ... Browse Code »

A fix to the TSC (Time Stamp Counter) based bogoMIPS calculation used on
secondary CPUs which has two faults:

1: Not handling wrapping of the lower 32 bits of the TSC counter on
32bit kernel - perhaps TSC is not reset by a warm reset?

2: TSC and Jiffies are no incrementing together properly. Either
jiffies increment too quickly or Time Stamp Counter isn't incremented
in during an SMI but the real time clock is and jiffies are
incremented.

Case 1 can result in a factor of 16 too large a value which makes udelay()
values too small and can cause mysterious driver errors. Case 2 appears
to give smaller 10-15% errors after averaging but enough to cause
occasional failures on my own board

I have tested this code on my own branch and attach patch suitable for
current kernel code. See below for examples of the failures and how the
fix handles these situations now.

I reported this issue earlier here:
Intermittent problem with BogoMIPs calculation on Intel AP CPUs -
http://marc.info/?l=linux-kernel&m=129947246316875&w=4

I suspect this issue has been seen by others but as it is intermittent and
bogoMIPS for secondary CPUs are no longer printed out it might have been
difficult to identify this as the cause. Perhaps these unresolved issues,
although quite old, might be relevant as possibly this fault has been
around for a while. In particular Case 1 may only be relevant to 32bit
kernels on newer HW (most people run 64bit kernels?). Case 2 is less
dramatic since the earlier fix in this area and also intermittent.

Re: bogomips discrepancy on Intel Core2 Quad CPU -
http://marc.info/?l=linux-kernel&m=118929277524298&w=4
slow system and bogus bogomips -
http://marc.info/?l=linux-kernel&m=116791286716107&w=4
Re: Re: [RFC-PATCH] clocksource: update lpj if clocksource has -
http://marc.info/?l=linux-kernel&m=128952775819467&w=4

This issue is masked a little by commit feae3203d711db0a ("timers, init:
Limit the number of per cpu calibration bootup messages") which only
prints out the first bogoMIPS value making it much harder to notice other
values differing. Perhaps it should be changed to only suppress them when
they are similar values?

Here are some outputs showing faults occurring and the new code handling
them properly. See my earlier message for examples of the original
failure.

Case 1: A Time Stamp Counter wrap:
...
Calibrating delay loop (skipped), value calculated using timer
frequency.. 6332.70 BogoMIPS (lpj=31663540)
....
calibrate_delay_direct() timer_rate_max=31666493
timer_rate_min=31666151 pre_start=4170369255 pre_end=4202035539
calibrate_delay_direct() timer_rate_max=2425955274
timer_rate_min=2425954941 pre_start=4265368533 pre_end=2396356387
calibrate_delay_direct() ignoring timer_rate as we had a TSC wrap
around start=4265368581 >=post_end=2396356511
calibrate_delay_direct() timer_rate_max=31666274
timer_rate_min=31665942 pre_start=2440373374 pre_end=2472039515
calibrate_delay_direct() timer_rate_max=31666492
timer_rate_min=31666160 pre_start=2535372139 pre_end=2567038422
calibrate_delay_direct() timer_rate_max=31666455
timer_rate_min=31666207 pre_start=2630371084 pre_end=2662037415
Calibrating delay using timer specific routine.. 6333.28 BogoMIPS (lpj=31666428)
Total of 2 processors activated (12665.99 BogoMIPS).
....

Case 2: Some thing (presumably the SMM interrupt?) causing the
very low increase in TSC counter for the DELAY_CALIBRATION_TICKS
increase in jiffies
...
Calibrating delay loop (skipped), value calculated using timer
frequency.. 6333.25 BogoMIPS (lpj=31666270)
...
calibrate_delay_direct() timer_rate_max=31666483
timer_rate_min=31666074 pre_start=4199536526 pre_end=4231202809
calibrate_delay_direct() timer_rate_max=864348 timer_rate_min=864016
pre_start=2405343672 pre_end=2406207897
calibrate_delay_direct() timer_rate_max=31666483
timer_rate_min=31666179 pre_start=2469540464 pre_end=2501206823
calibrate_delay_direct() timer_rate_max=31666511
timer_rate_min=31666122 pre_start=2564539400 pre_end=2596205712
calibrate_delay_direct() timer_rate_max=31666084
timer_rate_min=31665685 pre_start=2659538782 pre_end=2691204657
calibrate_delay_direct() dropping min bogoMips estimate 1 = 864348
Calibrating delay using timer specific routine.. 6333.27 BogoMIPS (lpj=31666390)
Total of 2 processors activated (12666.53 BogoMIPS).
...

After 70 boots I saw 2 variations
Reviewed-by: Phil Carmody
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Worsley
2011-05-25 23:39:46 +0800
de03c72cf mm: convert mm->cpu_vm_cpumask into cpumask_var_t ... Browse Code »

cpumask_t is very big struct and cpu_vm_mask is placed wrong position.
It might lead to reduce cache hit ratio.

This patch has two change.
1) Move the place of cpumask into last of mm_struct. Because usually cpumask
is accessed only front bits when the system has cpu-hotplug capability
2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory
footprint if cpumask_size() will use nr_cpumask_bits properly in future.

In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var.
It may help to detect out of tree cpu_vm_mask users.

This patch has no functional change.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro
Cc: David Howells
Cc: Koichi Yasutake
Cc: Hugh Dickins
Cc: Chris Metcalf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2011-05-25 23:39:21 +0800
2bb732cdb Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 ... Browse Code »

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
kbuild: make KBUILD_NOCMDDEP=1 handle empty built-in.o
scripts/kallsyms.c: fix potential segfault
scripts/gen_initramfs_list.sh: Convert to a /bin/sh script
kbuild: Fix GNU make v3.80 compatibility
kbuild: Fix passing -Wno-* options to gcc 4.4+
kbuild: move scripts/basic/docproc.c to scripts/docproc.c
kbuild: Fix Makefile.asm-generic for um
kbuild: Allow to combine multiple W= levels
kbuild: Disable -Wunused-but-set-variable for gcc 4.6.0
Fix handling of backlash character in LINUX_COMPILE_BY name
kbuild: asm-generic support
kbuild: implement several W= levels
kbuild: Fix build with binutils <= 2.19
initramfs: Use KBUILD_BUILD_TIMESTAMP for generated entries
kbuild: Allow to override LINUX_COMPILE_BY and LINUX_COMPILE_HOST macros
kbuild: Drop unused LINUX_COMPILE_TIME and LINUX_COMPILE_DOMAIN macros
kbuild: Use the deterministic mode of ar
kbuild: Call gzip with -n
kbuild: move KALLSYMS_EXTRA_PASS from Kconfig to Makefile
Kconfig: improve KALLSYMS_ALL documentation

Fix up trivial conflict in Makefile

Linus Torvalds
2011-05-25 04:31:37 +0800

23 May, 2011

2 commits

e98bae759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: (28 commits)
sparc32: fix build, fix missing cpu_relax declaration
SCHED_TTWU_QUEUE is not longer needed since sparc32 now implements IPI
sparc32,leon: Remove unnecessary page_address calls in LEON DMA API.
sparc: convert old cpumask API into new one
sparc32, sun4d: Implemented SMP IPIs support for SUN4D machines
sparc32, sun4m: Implemented SMP IPIs support for SUN4M machines
sparc32,leon: Implemented SMP IPIs for LEON CPU
sparc32: implement SMP IPIs using the generic functions
sparc32,leon: SMP power down implementation
sparc32,leon: added some SMP comments
sparc: add {read,write}*_be routines
sparc32,leon: don't rely on bootloader to mask IRQs
sparc32,leon: operate on boot-cpu IRQ controller registers
sparc32: always define boot_cpu_id
sparc32: removed unused code, implemented by generic code
sparc32: avoid build warning at mm/percpu.c:1647
sparc32: always register a PROM based early console
sparc32: probe for cpu info only during startup
sparc: consolidate show_cpuinfo in cpu.c
sparc32,leon: implement genirq CPU affinity
...

Linus Torvalds
2011-05-23 13:06:24 +0800
281dc5c5e Give up on pushing CC_OPTIMIZE_FOR_SIZE ... Browse Code »

I still happen to believe that I$ miss costs are a major thing, but
sadly, -Os doesn't seem to be the solution. With or without it, gcc
will miss some obvious code size improvements, and with it enabled gcc
will sometimes make choices that aren't good even with high I$ miss
ratios.

For example, with -Os, gcc on x86 will turn a 20-byte constant memcpy
into a "rep movsl". While I sincerely hope that x86 CPU's will some day
do a good job at that, they certainly don't do it yet, and the cost is
higher than a L1 I$ miss would be.

Some day I hope we can re-enable this.

Signed-off-by: Linus Torvalds

Linus Torvalds
2011-05-23 05:30:36 +0800

21 May, 2011

1 commit

17d9f311e SCHED_TTWU_QUEUE is not longer needed since sparc32 now implements IPI ... Browse Code »

Signed-off-by: Daniel Hellstrom
Reported-by: Peter Zijlstra
Acked-by: Peter Zijlstra
Signed-off-by: David S. Miller

Daniel Hellstrom
2011-05-21 04:10:55 +0800

20 May, 2011

3 commits

eb04f2f04 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip ... Browse Code »

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
...

Linus Torvalds
2011-05-20 09:14:34 +0800
80fe02b5d Merge branches 'sched-core-for-linus' and 'sched-urgent-for-linus' of git://git.… ... Browse Code »

…kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
sched: Fix and optimise calculation of the weight-inverse
sched: Avoid going ahead if ->cpus_allowed is not changed
sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
sched: Remove unused parameters from sched_fork() and wake_up_new_task()
sched: Shorten the construction of the span cpu mask of sched domain
sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
sched: Remove unused 'this_best_prio arg' from balance_tasks()
sched: Remove noop in alloc_rt_sched_group()
sched: Get rid of lock_depth
sched: Remove obsolete comment from scheduler_tick()
sched: Fix sched_domain iterations vs. RCU
sched: Next buddy hint on sleep and preempt path
sched: Make set_*_buddy() work on non-task entities
sched: Remove need_migrate_task()
sched: Move the second half of ttwu() to the remote cpu
sched: Restructure ttwu() some more
sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
sched: Remove rq argument from ttwu_stat()
sched: Remove rq->lock from the first half of ttwu()
sched: Drop rq->lock from sched_exec()
...

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix rt_rq runtime leakage bug

Linus Torvalds
2011-05-20 08:41:22 +0800
9b090f2da kmemleak: Initialise kmemleak after debug_objects_mem_init() ... Browse Code »

Kmemleak frees objects via RCU and when CONFIG_DEBUG_OBJECTS_RCU_HEAD
is enabled, the RCU callback triggers a call to free_object() in
lib/debugobjects.c. Since kmemleak is initialised before debug objects
initialisation, it may result in a kernel panic during booting. This
patch moves the kmemleak_init() call after debug_objects_mem_init().

Reported-by: Marcin Slusarz
Tested-by: Tejun Heo
Signed-off-by: Catalin Marinas
Cc:

Catalin Marinas
2011-05-20 00:36:37 +0800

12 May, 2011

1 commit

9cb5baba5 Merge commit 'v2.6.39-rc7' into sched/core Browse Code »

Ingo Molnar
2011-05-12 15:36:18 +0800

11 May, 2011

1 commit

21a43e397 slub: Revert "[PARISC] slub: fix panic with DISCONTIGMEM" ... Browse Code »

This reverts commit 4a5fa3590f09, which did not allow SLUB to be used
on architectures that use DISCONTIGMEM without compiling NUMA support
without CONFIG_BROKEN also set.

The slub panic that it was intended to prevent is addressed by
d9b41e0b54fd ("[PARISC] set memory ranges in N_NORMAL_MEMORY when
onlined") on parisc so there is no further slub issues with such a
configuration.

The reverts allows SLUB now to be used on such architectures since
there haven't been any reports of additional errors.

Cc: James Bottomley
Signed-off-by: David Rientjes
Signed-off-by: Linus Torvalds

David Rientjes
2011-05-11 08:37:34 +0800

06 May, 2011

1 commit

27f4d2805 rcu: priority boosting for TREE_PREEMPT_RCU ... Browse Code »

Add priority boosting for TREE_PREEMPT_RCU, similar to that for
TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST
kernel parameter. The priority to which to boost preempted
RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
(defaulting to real-time priority 1) and the time to wait before
boosting the readers who are blocking a given grace period is
controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
500 milliseconds).

Signed-off-by: Paul E. McKenney
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett

Paul E. McKenney
2011-05-06 14:16:55 +0800

28 Apr, 2011

1 commit

e8dad6940 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6 ... Browse Code »

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] slub: fix panic with DISCONTIGMEM
[PARISC] set memory ranges in N_NORMAL_MEMORY when onlined

Linus Torvalds
2011-04-28 06:20:33 +0800

27 Apr, 2011

1 commit

6befe5f69 init/Kconfig: fix EXPERT menu list ... Browse Code »

The EXPERT menu list was recently broken by the insertion of a
kconfig symbol (EMBEDDED) at the beginning of the EXPERT list of
kconfig items. Broken by:

commit 6a108a14fa356ef607be308b68337939e56ea94e
Author: David Rientjes
Date: Thu Jan 20 14:44:16 2011 -0800
kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT

Restore the EXPERT menu list -- don't inject a symbol (EMBEDDED)
that does not depend on EXPERT into the list.

Signed-off-by: Randy Dunlap
Cc: David Rientjes
Cc: Peter Foley
Signed-off-by: Linus Torvalds

Randy Dunlap
2011-04-27 11:48:37 +0800

23 Apr, 2011

1 commit

4a5fa3590 [PARISC] slub: fix panic with DISCONTIGMEM ... Browse Code »

Slub makes assumptions about page_to_nid() which are violated by
DISCONTIGMEM and !NUMA. This violation results in a panic because
page_to_nid() can be non-zero for pages in the discontiguous ranges and
this leads to a null return by get_node(). The assertion by the
maintainer is that DISCONTIGMEM should only be allowed when NUMA is also
defined. However, at least six architectures: alpha, ia64, m32r, m68k,
mips, parisc violate this. The panic is a regression against slab, so
just mark slub broken in the problem configuration to prevent users
reporting these panics.

Cc: stable@kernel.org
Acked-by: David Rientjes
Acked-by: Pekka Enberg
Signed-off-by: James Bottomley

James Bottomley
2011-04-23 04:42:46 +0800

15 Apr, 2011

2 commits

1e2795a11 kbuild: move KALLSYMS_EXTRA_PASS from Kconfig to Makefile ... Browse Code »

At the moment we have the CONFIG_KALLSYMS_EXTRA_PASS Kconfig switch,
which users can enable or disable while configuring the kernel. This
option is then used by 'make' to determine whether an extra kallsyms
pass is needed or not.

However, this approach is not nice and confusing, and this patch moves
CONFIG_KALLSYMS_EXTRA_PASS from Kconfig to Makefile instead. The
rationale is below.

1. CONFIG_KALLSYMS_EXTRA_PASS is really about the build time, not
run-time. There is no real need for it to be in Kconfig. It is
just an additional work-around which should be used only in rare
cases, when someone breaks kallsyms, so Kbuild/Makefile is much
better place for this option.
2. Grepping CONFIG_KALLSYMS_EXTRA_PASS shows that many defconfigs have
it enabled, probably not because they try to work-around a kallsyms
bug, but just because the Kconfig help text is confusing and does
not really make it clear that this option should not be used unless
except when kallsyms is broken.
3. And since many people have CONFIG_KALLSYMS_EXTRA_PASS enabled in
their Kconfig, we do might fail to notice kallsyms bugs in time. E.g.,
many testers use "make allyesconfig" to test builds, which will enable
CONFIG_KALLSYMS_EXTRA_PASS and kallsyms breakage will not be noticed.

To address that, this patch:

1. Kills CONFIG_KALLSYMS_EXTRA_PASS
2. Changes Makefile so that people can use "make KALLSYMS_EXTRA_PASS=1"
to enable the extra pass if needed. Additionally, they may define
KALLSYMS_EXTRA_PASS as an environment variable.
3. By default KALLSYMS_EXTRA_PASS is disabled and if kallsyms has issues,
"make" should print a warning and suggest using KALLSYMS_EXTRA_PASS

Signed-off-by: Artem Bityutskiy
[mmarek: Removed make help text, is not necessary]
Signed-off-by: Michal Marek

Artem Bityutskiy
2011-04-15 21:56:15 +0800
71a83ec7d Kconfig: improve KALLSYMS_ALL documentation ... Browse Code »

Dumb users like myself are not able to grasp from the existing KALLSYMS_ALL
documentation that this option is not what they need. Improve the help
message and make it clearer that KALLSYMS is enough in the majority of
use cases, and KALLSYMS_ALL should really be used very rarely.

Signed-off-by: Artem Bityutskiy
Signed-off-by: Michal Marek

Artem Bityutskiy
2011-04-15 21:48:01 +0800

14 Apr, 2011

1 commit

317f39416 sched: Move the second half of ttwu() to the remote cpu ... Browse Code »
1

Now that we've removed the rq->lock requirement from the first part of
ttwu() and can compute placement without holding any rq->lock, ensure
we execute the second half of ttwu() on the actual cpu we want the
task to run on.

This avoids having to take rq->lock and doing the task enqueue
remotely, saving lots on cacheline transfers.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 647278 worker burns per second
patched: run time 30 seconds 816715 worker burns per second

Reviewed-by: Frank Rowand
Cc: Mike Galbraith
Cc: Nick Piggin
Cc: Linus Torvalds
Cc: Andrew Morton
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl

Peter Zijlstra
2011-04-14 14:52:41 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings ... Browse Code »

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

24 Mar, 2011

2 commits

59607db36 userns: add a user_namespace as creator/owner of uts_namespace ... Browse Code »

The expected course of development for user namespaces targeted
capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

Goals:

- Make it safe for an unprivileged user to unshare namespaces. They
will be privileged with respect to the new namespace, but this should
only include resources which the unprivileged user already owns.

- Provide separate limits and accounting for userids in different
namespaces.

Status:

Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
CAP_SETGID capabilities. What this gets you is a whole new set of
userids, meaning that user 500 will have a different 'struct user' in
your namespace than in other namespaces. So any accounting information
stored in struct user will be unique to your namespace.

However, throughout the kernel there are checks which

- simply check for a capability. Since root in a child namespace
has all capabilities, this means that a child namespace is not
constrained.

- simply compare uid1 == uid2. Since these are the integer uids,
uid 500 in namespace 1 will be said to be equal to uid 500 in
namespace 2.

As a result, the lxc implementation at lxc.sf.net does not use user
namespaces. This is actually helpful because it leaves us free to
develop user namespaces in such a way that, for some time, user
namespaces may be unuseful.

Bugs aside, this patchset is supposed to not at all affect systems which
are not actively using user namespaces, and only restrict what tasks in
child user namespace can do. They begin to limit privilege to a user
namespace, so that root in a container cannot kill or ptrace tasks in the
parent user namespace, and can only get world access rights to files.
Since all files currently belong to the initila user namespace, that means
that child user namespaces can only get world access rights to *all*
files. While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.

I've run the 'runltplite.sh' with and without this patchset and found no
difference.

This patch:

copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
will have the new user namespace as its owner. That is what we want,
since we want root in that new userns to be able to have privilege over
it.

Changelog:
Feb 15: don't set uts_ns->user_ns if we didn't create
a new uts_ns.
Feb 23: Move extern init_user_ns declaration from
init/version.c to utsname.h.

Signed-off-by: Serge E. Hallyn
Acked-by: "Eric W. Biederman"
Acked-by: Daniel Lezcano
Acked-by: David Howells
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2011-03-24 10:46:59 +0800
45a68628d pid: remove the child_reaper special case in init/main.c ... Browse Code »

This patchset is a cleanup and a preparation to unshare the pid namespace.
These prerequisites prepare for Eric's patchset to give a file descriptor
to a namespace and join an existing namespace.

This patch:

It turns out that the existing assignment in copy_process of the
child_reaper can handle the initial assignment of child_reaper we just
need to generalize the test in kernel/fork.c

Signed-off-by: Eric W. Biederman
Signed-off-by: Daniel Lezcano
Cc: Oleg Nesterov
Cc: Alexey Dobriyan
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2011-03-24 10:46:57 +0800

23 Mar, 2011

6 commits

ea611b269 init: return proper error code in do_mounts_rd() ... Browse Code »

In do_mounts_rd() if memory cannot be allocated, return -ENOMEM.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2011-03-23 08:44:15 +0800
b1b5f65e5 calibrate: retry with wider bounds when converge seems to fail ... Browse Code »

Systems with unmaskable interrupts such as SMIs may massively
underestimate loops_per_jiffy, and fail to converge anywhere near the real
value. A case seen on x86_64 was an initial estimate of 256<<<
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Tested-by: Stephen Boyd
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Phil Carmody
2011-03-23 08:44:12 +0800
191e56880 calibrate: home in on correct lpj value more quickly ... Browse Code »

Binary chop with a jiffy-resync on each step to find an upper bound is
slow, so just race in a tight-ish loop to find an underestimate.

If done with lots of individual steps, sometimes several hundreds of
iterations would be required, which would impose a significant overhead,
and make the initial estimate very low. By taking slowly increasing steps
there will be less overhead.

E.g. an x86_64 2.67GHz could have fitted in 613 individual small delays,
but in reality should have been able to fit in a single delay 644 times
longer, so underestimated by 31 steps. To reach the equivalent of 644
small delays with the accelerating scheme now requires about 130
iterations, so has
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Tested-by: Stephen Boyd
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Phil Carmody
2011-03-23 08:44:11 +0800
71c696b1d calibrate: extract fall-back calculation into own helper ... Browse Code »

The motivation for this patch series is that currently our OMAP calibrates
itself using the trial-and-error binary chop fallback that some other
architectures no longer need to perform. This is a lengthy process,
taking 0.2s in an environment where boot time is of great interest.

Patch 2/4 has two optimisations. Firstly, it replaces the initial
repeated- doubling to find the relevant power of 2 with a tight loop that
just does as much as it can in a jiffy. Secondly, it doesn't binary chop
over an entire power of 2 range, it choses a much smaller range based on
how much it squeezed in, and failed to squeeze in, during the first stage.
Both are significant optimisations, and bring our calibration down from
23 jiffies to 5, and, in the process, often arrive at a more accurate lpj
value.

The 'bands' and 'sub-logarithmic' growth may look over-engineered, but
they only cost a small level of inaccuracy in the initial guess (for all
architectures) in order to avoid the very large inaccuracies that appeared
during testing (on x86_64 architectures, and presumably others with less
metronomic operation). Note that due to the existence of the TSC and
other timers, the x86_64 will not typically use this fallback routine, but
I wanted to code defensively, able to cope with all kinds of processor
behaviours and kernel command line options.

Patch 3/4 is an additional trap for the nightmare scenario where the
initial estimate is very inaccurate, possibly due to things like SMIs.
It simply retries with a larger bound.

Stephen said:

I tried this patch set out on an MSM7630.
:
: Before:
:
: Calibrating delay loop... 681.57 BogoMIPS (lpj=3407872)
:
: After:
:
: Calibrating delay loop... 680.75 BogoMIPS (lpj=3403776)
:
: But the really good news is calibration time dropped from ~247ms to ~56ms.
: Sadly we won't be able to benefit from this should my udelay patches make
: it into ARM because we would be using calibrate_delay_direct() instead (at
: least on machines who choose to). Can we somehow reapply the logic behind
: this to calibrate_delay_direct()? That would be even better, but this is
: definitely a boot time improvement.
:
: Or maybe we could just replace calibrate_delay_direct() with this fallback
: calculation? If __delay() is a thin wrapper around read_current_timer()
: it should work just as well (plus patch 3 makes it handle SMIs). I'll try
: that out.

This patch:

... so that it can be modified more clinically.

This is almost entirely cosmetic. The only change to the operation
is that the global variable is only set once after the estimation is
completed, rather than taking on all the intermediate values. However,
there are no readers of that variable, so this change is unimportant.

Signed-off-by: Phil Carmody
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Tested-by: Stephen Boyd
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Phil Carmody
2011-03-23 08:44:11 +0800
34db18a05 smp: move smp setup functions to kernel/smp.c ... Browse Code »

Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
setup functions from init/main.c to kenrel/smp.c, saves some #ifdef
CONFIG_SMP.

Signed-off-by: WANG Cong
Cc: Rakib Mullick
Cc: David Howells
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Arnd Bergmann
Cc: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amerigo Wang
2011-03-23 08:44:11 +0800
80cdc6dae fs: use appropriate printk priority levels ... Browse Code »

printk()s without a priority level default to KERN_WARNING. To reduce
noise at KERN_WARNING, this patch set the priority level appriopriately
for unleveled printks()s. This should be useful to folks that look at
dmesg warnings closely.

Signed-off-by: Mandeep Singh Baines
Cc: Jens Axboe
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mandeep Singh Baines
2011-03-23 08:44:10 +0800

19 Mar, 2011

1 commit

e16b396ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...

Trivial conflict in fs/eventpoll.c (spelling vs addition)

Linus Torvalds
2011-03-19 01:37:40 +0800

17 Mar, 2011

2 commits

f74b94441 Merge branch 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl ... Browse Code »

* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
BKL: That's all, folks
fs/locks.c: Remove stale FIXME left over from BKL conversion
ipx: remove the BKL
appletalk: remove the BKL
x25: remove the BKL
ufs: remove the BKL
hpfs: remove the BKL
drivers: remove extraneous includes of smp_lock.h
tracing: don't trace the BKL
adfs: remove the big kernel lock

Linus Torvalds
2011-03-17 08:21:00 +0800
a5e6b135b Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git… ... Browse Code »

…/gregkh/driver-core-2.6

* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits)
printk: do not mangle valid userspace syslog prefixes
efivars: Add Documentation
efivars: Expose efivars functionality to external drivers.
efivars: Parameterize operations.
efivars: Split out variable registration
efivars: parameterize efivars
efivars: Make efivars bin_attributes dynamic
efivars: move efivars globals into struct efivars
drivers:misc: ti-st: fix debugging code
kref: Fix typo in kref documentation
UIO: add PRUSS UIO driver support
Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches
firmware: Fix unaligned memory accesses in dmi-sysfs
firmware: Add documentation for /sys/firmware/dmi
firmware: Expose DMI type 15 System Event Log
firmware: Break out system_event_log in dmi-sysfs
firmware: Basic dmi-sysfs support
firmware: Add DMI entry types to the headers
Driver core: convert platform_{get,set}_drvdata to static inline functions
Translate linux-2.6/Documentation/magic-number.txt into Chinese
...

Linus Torvalds
2011-03-17 06:05:40 +0800

16 Mar, 2011

1 commit

a926021cb Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/… ... Browse Code »

…git/tip/linux-2.6-tip

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits)
perf probe: Clean up probe_point_lazy_walker() return value
tracing: Fix irqoff selftest expanding max buffer
tracing: Align 4 byte ints together in struct tracer
tracing: Export trace_set_clr_event()
tracing: Explain about unstable clock on resume with ring buffer warning
ftrace/graph: Trace function entry before updating index
ftrace: Add .ref.text as one of the safe areas to trace
tracing: Adjust conditional expression latency formatting.
tracing: Fix event alignment: skb:kfree_skb
tracing: Fix event alignment: mce:mce_record
tracing: Fix event alignment: kvm:kvm_hv_hypercall
tracing: Fix event alignment: module:module_request
tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup
tracing: Remove lock_depth from event entry
perf header: Stop using 'self'
perf session: Use evlist/evsel for managing perf.data attributes
perf top: Don't let events to eat up whole header line
perf top: Fix events overflow in top command
ring-buffer: Remove unused #include <linux/trace_irq.h>
tracing: Add an 'overwrite' trace_option.
...

Linus Torvalds
2011-03-16 09:31:30 +0800

15 Mar, 2011

1 commit

990d6c2d7 vfs: Add name to file handle conversion support ... Browse Code »

The syscall also return mount id which can be used
to lookup file system specific information such as uuid
in /proc//mountinfo

Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Al Viro

Aneesh Kumar K.V
2011-03-15 14:21:37 +0800

05 Mar, 2011

1 commit

4ba8216cd BKL: That's all, folks ... Browse Code »

This removes the implementation of the big kernel lock,
at last. A lot of people have worked on this in the
past, I so the credit for this patch should be with
everyone who participated in the hunt.

The names on the Cc list are the people that were the
most active in this, according to the recorded git
history, in alphabetical order.

Signed-off-by: Arnd Bergmann
Acked-by: Alan Cox
Cc: Alessio Igor Bogani
Cc: Al Viro
Cc: Andrew Hendry
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Eric W. Biederman
Cc: Frederic Weisbecker
Cc: Hans Verkuil
Acked-by: Ingo Molnar
Cc: Jan Blunck
Cc: John Kacur
Cc: Jonathan Corbet
Cc: Linus Torvalds
Cc: Matthew Wilcox
Cc: Oliver Neukum
Cc: Paul Menage
Acked-by: Thomas Gleixner
Cc: Trond Myklebust

Arnd Bergmann
2011-03-05 17:56:00 +0800

04 Mar, 2011

1 commit

2d0f25201 perf cgroup: Fix a typo in kernel config ... Browse Code »

s/specificied/specified

Signed-off-by: Li Zefan
Acked-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Li Zefan
2011-03-04 18:32:51 +0800

16 Feb, 2011

1 commit

e5d1367f1 perf: Add cgroup support ... Browse Code »

This kernel patch adds the ability to filter monitoring based on
container groups (cgroups). This is for use in per-cpu mode only.

The cgroup to monitor is passed as a file descriptor in the pid
argument to the syscall. The file descriptor must be opened to
the cgroup name in the cgroup filesystem. For instance, if the
cgroup name is foo and cgroupfs is mounted in /cgroup, then the
file descriptor is opened to /cgroup/foo. Cgroup mode is
activated by passing PERF_FLAG_PID_CGROUP in the flags argument
to the syscall.

For instance to measure in cgroup foo on CPU1 assuming
cgroupfs is mounted under /cgroup:

struct perf_event_attr attr;
int cgroup_fd, fd;

cgroup_fd = open("/cgroup/foo", O_RDONLY);
fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
close(cgroup_fd);

Signed-off-by: Stephane Eranian
[ added perf_cgroup_{exit,attach} ]
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Stephane Eranian
2011-02-16 20:30:48 +0800

15 Feb, 2011

1 commit

0a9d59a24 Merge branch 'master' into for-next Browse Code »

Jiri Kosina
2011-02-15 17:24:31 +0800

11 Feb, 2011

1 commit

70a062286 fix jiffy calculations in calibrate_delay_direct to handle overflow ... Browse Code »

Fixes a hang when booting as dom0 under Xen, when jiffies can be
quite large by the time the kernel init gets this far.

Signed-off-by: Tim Deegan
[jbeulich@novell.com: !time_after() -> time_before_eq() as suggested by Jiri Slaby]
Signed-off-by: Jan Beulich
Cc: Jiri Slaby
Cc: Jeremy Fitzhardinge
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Tim Deegan
2011-02-11 03:00:09 +0800