24 May, 2012

2 commits

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace
    together with code that has not been converted to be user namespace
    safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards, the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need for an
    additional check that the compared uids come from the same user
    namespace.

    - With user namespaces compiled out, performance is as good as or
    better than it is today.

    - For most operations, absolutely nothing changes, either in
    performance or operationally, with user namespaces enabled.

    - The worst-case performance hit I could come up with was timing 1
    billion cache-cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (i.e. from 156ns
    to 164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as internal error values.
    Most uid/gid-setting system calls treat these values specially
    anyway, so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that cannot be mapped, setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid, and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we cannot map a uid, we lie and return overflowuid. The LFS
    experience suggests that not lying and returning an error code might
    be better, but the historical precedent with uids is different and I
    cannot think of anything that would break if we lie about a uid we
    can't map.

    - Capabilities are localized to the current user namespace, making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c
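
    For reference, the kuid_t scheme described above boils down to
    roughly the following (a simplified sketch after this era's
    include/linux/uidgid.h, not the verbatim header):

        /*
         * kuid_t is a distinct struct type, so comparing it against a
         * raw uid_t without an explicit conversion is a compile-time
         * type error.
         */
        typedef struct {
                uid_t val;
        } kuid_t;

        static inline bool uid_eq(kuid_t left, kuid_t right)
        {
                return left.val == right.val;
        }

        /* Map a uid seen in user namespace 'ns' to a kernel kuid_t;
         * yields INVALID_UID if no mapping exists. */
        extern kuid_t make_kuid(struct user_namespace *ns, uid_t uid);

        /* Map back for reporting to userspace; the _munged variant
         * returns overflowuid instead of failing, which is the stat()
         * behaviour described above. */
        extern uid_t from_kuid(struct user_namespace *ns, kuid_t kuid);
        extern uid_t from_kuid_munged(struct user_namespace *ns,
                                      kuid_t kuid);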

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     
  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf fixes from Ingo Molnar:

    - Leftover AMD PMU driver fix from the end of the v3.4
    stabilization cycle.

    - Late tools/perf/ changes that missed the first round:
    * endianness fixes
    * event parsing improvements
    * libtraceevent fixes factored out from trace-cmd
    * perl scripting engine fixes related to libtraceevent
    * testcase improvements
    * perf inject / pipe mode fixes
    * plus a kernel side fix

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Update event scheduling constraints for AMD family 15h models

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "sched, perf: Use a single callback into the scheduler"
    perf evlist: Show event attribute details
    perf tools: Bump default sample freq to 4 kHz
    perf buildid-list: Work better with pipe mode
    perf tools: Fix piped mode read code
    perf inject: Fix broken perf inject -b
    perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
    perf tools: Add union u64_swap type for swapping u64 data
    perf tools: Carry perf_event_attr bitfield throught different endians
    perf record: Fix documentation for branch stack sampling
    perf target: Add cpu flag to sample_type if target has cpu
    perf tools: Always try to build libtraceevent
    perf tools: Rename libparsevent to libtraceevent in Makefile
    perf script: Rename struct event to struct event_format in perl engine
    perf script: Explicitly handle known default print arg type
    perf tools: Add hardcoded name term for pmu events
    perf tools: Separate 'mem:' event scanner bits
    perf tools: Use allocated list for each parsed event
    perf tools: Add support for displaying event parser debug info
    perf test: Move parse event automated tests to separated object

    Linus Torvalds
     

23 May, 2012

4 commits

  • This reverts commit cb04ff9ac424 ("sched, perf: Use a single
    callback into the scheduler").

    Before this change was introduced, the process switch worked
    like this (with respect to perf event scheduling):

    schedule (prev, next)
    - schedule out all perf events for prev
    - switch to next
    - schedule in all perf events for current (next)

    After the commit, the process switch looks like:

    schedule (prev, next)
    - schedule out all perf events for prev
    - schedule in all perf events for (next)
    - switch to next

    The problem is that after we schedule perf events in, the PMU
    is enabled and we can receive events even before we make the
    switch to next - so "current" is still the prev process (and
    event SAMPLE data are filled in based on the value of
    "current").

    That's exactly what we see in the test__PERF_RECORD test: we
    receive SAMPLEs with the PID of the process that our tracee is
    scheduled from.
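
    In kernel/sched terms, the two-hook ordering this revert restores
    looks roughly like this (a simplified sketch after
    kernel/sched/core.c, with almost everything elided):

        static inline void
        prepare_task_switch(struct rq *rq, struct task_struct *prev,
                            struct task_struct *next)
        {
                /* PMU goes quiet here, before 'current' changes
                 * hands ... */
                perf_event_task_sched_out(prev, next);
        }

        static void finish_task_switch(struct rq *rq,
                                       struct task_struct *prev)
        {
                /* ... and is re-enabled only once 'current' is the
                 * next task, so SAMPLEs carry the right PID. */
                perf_event_task_sched_in(prev, current);
        }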

    Discussed with Peter Zijlstra:

    > Bah!, yeah I guess reverting is the right thing for now. Sad
    > though.
    >
    > So by having the two hooks we have a black-spot between them
    > where we receive no events at all, this black-spot covers the
    > hand-over of current and we thus don't receive the 'wrong'
    > events.
    >
    > I rather liked we could do away with both that black-spot and
    > clean up the code a little, but apparently people rely on it.

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: acme@redhat.com
    Cc: paulus@samba.org
    Cc: cjashfor@linux.vnet.ibm.com
    Cc: fweisbec@gmail.com
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Pull scheduler changes from Ingo Molnar:
    "The biggest change is the cleanup/simplification of the load-balancer:
    instead of the current practice of architectures twiddling scheduler
    internal data structures and providing the scheduler domains in
    colorfully inconsistent ways, we now have generic scheduler code in
    kernel/sched/core.c:sched_init_numa() that looks at the architecture's
    node_distance() parameters and (while not fully trusting it) deduces a
    NUMA topology from it.

    This inevitably changes balancing behavior - hopefully for the better.

    There are various smaller optimizations, cleanups and fixlets as well"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
    sched: Remove stale power aware scheduling remnants and dysfunctional knobs
    sched/debug: Fix printing large integers on 32-bit platforms
    sched/fair: Improve the ->group_imb logic
    sched/nohz: Fix rq->cpu_load[] calculations
    sched/numa: Don't scale the imbalance
    sched/fair: Revert sched-domain iteration breakage
    sched/x86: Rewrite set_cpu_sibling_map()
    sched/numa: Fix the new NUMA topology bits
    sched/numa: Rewrite the CONFIG_NUMA sched domain support
    sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
    sched/fair: Add some serialization to the sched_domain load-balance walk
    sched/fair: Let minimally loaded cpu balance the group
    sched: Change rq->nr_running to unsigned int
    x86/numa: Check for nonsensical topologies on real hw as well
    x86/numa: Hard partition cpu topology masks on node boundaries
    x86/numa: Allow specifying node_distance() for numa=fake
    x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
    sched: Update documentation and comments
    sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()

    Linus Torvalds
     
  • Pull perf changes from Ingo Molnar:
    "Lots of changes:

    - (much) improved assembly annotation support in perf report, with
    jump visualization, searching, navigation, visual output
    improvements and more.

    - kernel support for AMD IBS PMU hardware features. Notably 'perf
    record -e cycles:p' and 'perf top -e cycles:p' should work without
    skid now, like PEBS does on the Intel side, because it takes
    advantage of IBS transparently.

    - the libtraceevent library: it is the first step towards unifying
    tracing tooling and perf, and it also gives a tracing library for
    external tools like powertop to rely on.

    - infrastructure: various improvements and refactoring of the UI
    modules and related code

    - infrastructure: cleanup and simplification of the profiling
    targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)

    - tons of robustness fixes all around

    - various ftrace updates: speedups, cleanups, robustness
    improvements.

    - typing 'make' in tools/ will now give you a menu of projects to
    build and a short help text to explain what each does.

    - ... and lots of other changes I forgot to list.

    The perf record make bzImage + perf report regression you reported
    should be fixed."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
    tracing: Remove kernel_lock annotations
    tracing: Fix initial buffer_size_kb state
    ring-buffer: Merge separate resize loops
    perf evsel: Create events initially disabled -- again
    perf tools: Split term type into value type and term type
    perf hists: Fix callchain ip printf format
    perf target: Add uses_mmap field
    ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
    ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
    ftrace: Make ftrace_modify_all_code() global for archs to use
    ftrace: Return record ip addr for ftrace_location()
    ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
    ftrace: Speed up search by skipping pages by address
    ftrace: Remove extra helper functions
    ftrace: Sort all function addresses, not just per page
    tracing: change CPU ring buffer state from tracing_cpumask
    tracing: Check return value of tracing_dentry_percpu()
    ring-buffer: Reset head page before running self test
    ring-buffer: Add integrity check at end of iter read
    ring-buffer: Make addition of pages in ring buffer atomic
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

22 May, 2012

2 commits

  • Pull smp hotplug cleanups from Thomas Gleixner:
    "This series is merily a cleanup of code copied around in arch/* and
    not changing any of the real cpu hotplug horrors yet. I wish I'd had
    something more substantial for 3.5, but I underestimated the lurking
    horror..."

    Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
    arch/sparc/include/asm/thread_info_32.h

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    um: Remove leftover declaration of alloc_task_struct_node()
    task_allocator: Use config switches instead of magic defines
    sparc: Use common threadinfo allocator
    score: Use common threadinfo allocator
    sh-use-common-threadinfo-allocator
    mn10300: Use common threadinfo allocator
    powerpc: Use common threadinfo allocator
    mips: Use common threadinfo allocator
    hexagon: Use common threadinfo allocator
    m32r: Use common threadinfo allocator
    frv: Use common threadinfo allocator
    cris: Use common threadinfo allocator
    x86: Use common threadinfo allocator
    c6x: Use common threadinfo allocator
    fork: Provide kmemcache based thread_info allocator
    tile: Use common threadinfo allocator
    fork: Provide weak arch_release_[task_struct|thread_info] functions
    fork: Move thread info gfp flags to header
    fork: Remove the weak insanity
    sh: Remove cpu_idle_wait()
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "This is the v3.5 RCU tree from Paul E. McKenney:

    1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
    more on the way for 3.6). Posted to LKML:
    https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
    https://lkml.org/lkml/2012/4/16/611 (commit 4),
    https://lkml.org/lkml/2012/4/30/390 (commit 6), and
    https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
    the other commits for the convenience of the tester).

    2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks. Posted to LKML:
    https://lkml.org/lkml/2012/4/23/322.

    3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all that
    survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined. The full set was posted to LKML at
    https://lkml.org/lkml/2012/4/14/143, and the first and third patches
    of that set remain.

    4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier(). A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs the
    execution of other CPUs. This work is based on earlier
    implementations by Peter Zijlstra and Paul E. McKenney. Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

    5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML."
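
    For orientation, the new call_srcu()/srcu_barrier() entry points
    follow the usual SRCU usage pattern; a minimal sketch (hypothetical
    srcu_struct and callback, setup and error handling elided):

        #include <linux/srcu.h>

        static struct srcu_struct my_srcu; /* init_srcu_struct() at setup */

        static void reader(void)
        {
                /* Reader side: cheap and never blocks updaters for long. */
                int idx = srcu_read_lock(&my_srcu);
                /* ... access SRCU-protected data ... */
                srcu_read_unlock(&my_srcu, idx);
        }

        static void updater(struct rcu_head *head,
                            void (*func)(struct rcu_head *))
        {
                /* Queue an async callback to run after a grace period. */
                call_srcu(&my_srcu, head, func);
        }

        static void teardown(void)
        {
                /* Wait for pending SRCU callbacks, then clean up. */
                srcu_barrier(&my_srcu);
                cleanup_srcu_struct(&my_srcu);
        }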

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    rcu: Make rcu_barrier() less disruptive
    rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
    rcu: Make RCU_FAST_NO_HZ handle timer migration
    rcu: Update RCU maintainership
    rcu: Make exit_rcu() more precise and consolidate
    rcu: Move PREEMPT_RCU preemption to switch_to() invocation
    rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
    rcu: Add rcutorture test for call_srcu()
    rcu: Implement per-domain single-threaded call_srcu() state machine
    rcu: Use single value to handle expedited SRCU grace periods
    rcu: Improve srcu_readers_active_idx()'s cache locality
    rcu: Remove unused srcu_barrier()
    rcu: Implement a variant of Peter's SRCU algorithm
    rcu: Improve SRCU's wait_idx() comments
    rcu: Flip ->completed only once per SRCU grace period
    rcu: Increment upper bit only for srcu_read_lock()
    rcu: Remove fast check path from __synchronize_srcu()
    rcu: Direct algorithmic SRCU implementation
    rcu: Introduce rcutorture testing for rcu_barrier()
    timer: Fix mod_timer_pinned() header comment
    ...

    Linus Torvalds
     

19 May, 2012

1 commit

  • Merge reason: We are going to queue up a dependent patch:

    "perf tools: Move parse event automated tests to separated object"

    That depends on:

    commit e7c72d8
    perf tools: Add 'G' and 'H' modifiers to event parsing

    Conflicts:
    tools/perf/builtin-stat.c

    Conflicted with the recent 'perf_target' patches when checking the
    result of perf_evsel open routines to see if a retry is needed to cope
    with older kernels where the exclude guest/host perf_event_attr bits
    were not used.

    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

17 May, 2012

2 commits

  • It's been broken forever (i.e. it's not scheduling in a power
    aware fashion), as reported by Suresh and others sending
    patches, and nobody cares enough to fix it properly ...
    so remove it to free up space for something better.

    There are various problems with the code as it stands today, first
    and foremost the user interface, which is bound to topology
    levels and has multiple values per level. This results in a
    state explosion which the administrator or distro needs to
    master and almost nobody does.

    Furthermore, large configuration state spaces aren't good: the thing
    doesn't just work right, because either it's under so many
    impossible-to-meet constraints, or, even if there's an achievable
    state, workloads have to be aware of it precisely and can never meet
    it for dynamic workloads.

    So pushing this kind of decision to user-space was a bad idea
    even with a single knob - it's exponentially worse with knobs
    on every node of the topology.

    There is a proposal to replace the user interface with a single
    3 state knob:

    sched_balance_policy := { performance, power, auto }

    where 'auto' would be the preferred default which looks at things
    like Battery/AC mode and possible cpufreq state or whatever the hw
    exposes to show us power use expectations - but there's been no
    progress on it in the past many months.

    Aside from that, the actual implementation of the various knobs
    is known to be broken. There have been sporadic attempts at
    fixing things but these always stop short of reaching a mergeable
    state.

    Therefore, remove it wholesale, in the hope of spurring people
    who care to come forward once again and work on a coherent
    replacement.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Cc: Arjan van de Ven
    Cc: Vincent Guittot
    Cc: Vaidyanathan Srinivasan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: bring together all the pending scheduler bits,
    for the sched/numa changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 May, 2012

7 commits

    Some numbers, like nr_running and nr_uninterruptible, are
    fundamentally unsigned, since it's impossible to have a negative
    number of tasks, yet we still print them as signed so that an
    underflow condition is easy to recognise.

    rq->nr_uninterruptible has 'special' accounting and can in fact very
    easily become negative on a per-cpu basis.

    It was noted that since the P() macro assumes things are long long,
    and the promotion of unsigned 'int/long' to long long on 32-bit does
    not sign-extend, we print silly large numbers instead of the easier
    to read signed numbers.

    Therefore extend the P() macro so that it does not require the sign
    extension.
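
    The promotion trap is easy to reproduce in plain C; a minimal
    user-space illustration (not the kernel macro itself):

        #include <stdio.h>

        int main(void)
        {
                /* an underflowed counter, on a 32-bit platform where
                 * unsigned long is 32 bits */
                unsigned long n = (unsigned long)-1;

                /* unsigned -> long long zero-extends: 4294967295 */
                printf("%lld\n", (long long)n);

                /* via the signed type it sign-extends: -1 */
                printf("%lld\n", (long long)(long)n);

                return 0;
        }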

    Reported-by: Diwakar Tundlam
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Group imbalance is meant to deal with situations where affinity masks
    and sched domains don't align well, such as 3 cpus from one group and
    6 from another. In this case the domain-based balancer will want to
    put an equal number of tasks on each side even though they don't
    have an equal number of cpus.

    Currently group_imb is set whenever two cpus of a group have a weight
    difference of at least one avg task and the heaviest cpu has at least
    two tasks. A group with imbalance set will always be picked as busiest
    and a balance pass will be forced.

    The problem is that even if there are no affinity masks this stuff
    can trigger and cause weird balancing decisions. E.g. the observed
    behaviour was that of 6 cpus, 5 had 2 tasks and 1 had 3 tasks; due
    to the difference of 1 avg load (they all had the same weight) and
    nr_running being >1, the group_imbalance logic triggered and did the
    weird thing of pulling more load instead of trying to move the 1
    excess task to the other domain of 6 cpus, which had 5 cpus with 2
    tasks and 1 cpu with 1 task.

    Curb the group_imbalance stuff by weakening the nr_running condition:
    also track min_nr_running, and use the difference in nr_running over
    the set instead of the absolute max nr_running, as sketched below.
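
    The shape of the weakened trigger is roughly the following (a
    sketch only, with the group statistics passed in explicitly; not
    the literal diff):

        /* Flag imbalance on the spread of nr_running across the
         * group rather than on the absolute maximum. */
        static void check_group_imb(struct sg_lb_stats *sgs,
                                    unsigned long max_cpu_load,
                                    unsigned long min_cpu_load,
                                    unsigned long avg_load_per_task,
                                    unsigned int max_nr_running,
                                    unsigned int min_nr_running)
        {
                if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
                    (max_nr_running - min_nr_running) > 1)
                        sgs->group_imb = 1;
        }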

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    While investigating why the load-balancer was misbehaving I found
    that the rq->cpu_load[] tables were completely screwy... a bit more
    digging revealed that the updates that got through were missing
    ticks, followed by a catch-up of 2 ticks.

    The catch-up assumes the cpu was idle during that time (since only
    nohz can cause missed ticks and the machine is idle etc..), which
    means that especially the higher indices were significantly lower
    than they ought to be.

    The reason for this is that it's not correct to compare against
    jiffies on every jiffy on any cpu other than the cpu that updates
    jiffies.

    This patch kludges around it by only doing the catch-up stuff from
    nohz_idle_balance() and doing the regular stuff unconditionally from
    the tick.

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Cc: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    It's far too easy to get a ridiculously large imbalance pct when you
    scale it like that. Use a fixed 125% for now.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
    group") and 0ce90475 ("sched/fair: Add some serialization to the
    sched_domain load-balance walk") are horribly broken so revert them.

    The problem is that while it sounds good to have the minimally loaded
    cpu do the pulling of more load, the way we walk the domains there is
    absolutely no guarantee this cpu will actually get to the domain. In
    fact it's very likely it won't. Therefore the higher up the tree we
    get, the less likely it is we'll balance at all.

    The first-of-mask approach always walks up; while sucky in that it
    accumulates load on the first cpu and needs extra passes to spread
    it out, it at least guarantees that a cpu gets up that far and that
    load-balancing happens at all.

    Since it's now always the first cpu, and idle cpus should always be
    able to balance (so they get a task as fast as possible), we can
    also do away with the added serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    There's no need to convert a node number to a node number by
    pretending it's a cpu number.

    Reported-by: Yinghai Lu
    Reported-and-Tested-by: Greg Pearson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …/linux-rcu into core/rcu

    Pull the v3.5 RCU tree from Paul E. McKenney:

    1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature
    (with more on the way for 3.6). Posted to LKML:
    https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
    https://lkml.org/lkml/2012/4/16/611 (commit 4),
    https://lkml.org/lkml/2012/4/30/390 (commit 6), and
    https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
    the other commits for the convenience of the tester).

    2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks. Posted to LKML:
    https://lkml.org/lkml/2012/4/23/322.

    3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all
    that survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined. The full set was posted to
    LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
    third patches of that set remain.

    4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier(). A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs
    the execution of other CPUs. This work is based on earlier
    implementations by Peter Zijlstra and Paul E. McKenney. Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

    5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

09 May, 2012

7 commits

    We can easily use a single callback for both sched-in and sched-out.
    This reduces the code footprint in the scheduler path and removes
    the PMU black spot otherwise present between the out and in
    callbacks.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The current code groups up to 16 nodes in a level and then puts an
    ALLNODES domain spanning the entire tree on top of that. This doesn't
    reflect the numa topology, and especially for the smaller,
    not-fully-connected machines out there today this can make a
    difference.

    Therefore, build a proper numa topology based on node_distance().

    Since there are no fixed numa layers anymore, the static SD_NODE_INIT
    and SD_ALLNODES_INIT aren't usable anymore; the new code tries to
    construct something similar, scaling some values on the number of
    cpus in the domain and/or the node_distance() ratio.
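
    The gist of the new setup code is to enumerate the distinct
    node_distance() values and build one topology level per distance;
    a heavily simplified sketch (hypothetical local names, the real
    code is sched_init_numa() in kernel/sched/core.c):

        static void build_numa_levels(void)
        {
                int distances[MAX_NUMNODES];
                int nr_levels = 0;
                int i, j, k;

                for_each_online_node(i) {
                        for_each_online_node(j) {
                                int d = node_distance(i, j);
                                bool known = false;

                                for (k = 0; k < nr_levels; k++)
                                        if (distances[k] == d)
                                                known = true;
                                if (!known)
                                        distances[nr_levels++] = d;
                        }
                }
                /* ... one sched_domain level per distinct distance,
                 * with flags/values scaled as described above ... */
        }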

    Signed-off-by: Peter Zijlstra
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ivan Kokshaysky
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-sh@vger.kernel.org
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: sparclinux@vger.kernel.org
    Cc: Tony Luck
    Cc: x86@kernel.org
    Cc: Dimitri Sivanich
    Cc: Greg Pearson
    Cc: KAMEZAWA Hiroyuki
    Cc: bob.picco@oracle.com
    Cc: chris.mason@oracle.com
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • More function argument passing reduction.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the sched_domain walk is completely unserialized (!SD_SERIALIZE)
    it is possible that multiple cpus in the group get elected to do the
    next level. Avoid this by adding some serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we let the leftmost (or first idle) cpu ascend the
    sched_domain tree and perform load-balancing. The result is that the
    busiest cpu in the group might be performing this function and pull
    more load to itself. The next load balance pass will then try to
    equalize this again.

    Change this to pick the least loaded cpu to perform higher domain
    balancing.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Since there's a PID space limit of 30 bits (see
    futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a
    lower bound of 2 pages per task) would still take 8T of memory (2^30
    tasks x 2 x 4 KiB pages = 8 TiB), it seems reasonable to say that
    unsigned int is sufficient for rq->nr_running.

    When we do get anywhere near that number of tasks I suspect other
    things would go funny; load-balancer load computations would really
    need to be hoisted to 128 bits, etc.

    So save a few bytes and convert rq->nr_running and friends to
    unsigned int.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    If we have one cpu that failed to boot, the boot cpu gave up on
    waiting for it, and then another cpu is being booted, the kernel
    might crash with the following OOPS:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] __bitmap_weight+0x30/0x80
    Call Trace:
    [] build_sched_domains+0x7b6/0xa50

    The crash happens in init_sched_groups_power(), which expects
    sched_groups to be a circular linked list. However that is not
    always true, since the sched_groups preallocated in __sdt_alloc are
    initialized in build_sched_groups, and it may exit early

    if (cpu != cpumask_first(sched_domain_span(sd)))
    return 0;

    without initializing the sd->groups->next field.

    Fix the bug by initializing the next field right after the
    sched_group is allocated.
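
    Roughly, the one-liner lands in __sdt_alloc(), right after the
    allocation (context abbreviated and paraphrased, not the literal
    patch):

        sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
                          GFP_KERNEL, cpu_to_node(j));
        if (!sg)
                return -ENOMEM;

        /* keep the list circular even if build_sched_groups()
         * exits early and never links this group further */
        sg->next = sg;

        *per_cpu_ptr(sdd->sg, j) = sg;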

    Also-Reported-by: Jiang Liu
    Signed-off-by: Igor Mammedov
    Cc: a.p.zijlstra@chello.nl
    Cc: pjt@google.com
    Cc: seto.hidetoshi@jp.fujitsu.com
    Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
    Signed-off-by: Ingo Molnar

    Igor Mammedov
     

05 May, 2012

1 commit

  • All archs define init_task in the same way (except ia64, but there is
    no particular reason why ia64 cannot use the common version). Create a
    generic instance so all archs can be converted over.

    The config switch is temporary and will be removed when all archs are
    converted over.

    Signed-off-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hirokazu Takata
    Cc: James E.J. Bottomley
    Cc: Jesper Nilsson
    Cc: Jonas Bonn
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Mike Frysinger
    Cc: Paul Mundt
    Cc: Ralf Baechle
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de

    Thomas Gleixner
     

26 Apr, 2012

3 commits

    Under extreme memory pressure, percpu allocation might fail. We
    hit it when the system goes to suspend-to-ram, causing a kworker
    panic:

    EIP: [] build_sched_domains+0x23a/0xad0
    Kernel panic - not syncing: Fatal exception
    Pid: 3026, comm: kworker/u:3
    3.0.8-137473-gf42fbef #1

    Call Trace:
    [] panic+0x66/0x16c
    [...]
    [] partition_sched_domains+0x287/0x4b0
    [] cpuset_update_active_cpus+0x1fe/0x210
    [] cpuset_cpu_inactive+0x1d/0x30
    [...]

    With this fix applied, build_sched_domains() will return -ENOMEM and
    the suspend attempt fails.

    Signed-off-by: he, bo
    Reviewed-by: Zhang, Yanmin
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc:
    Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
    [ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
    I don't like that kind of sad behavior but nevertheless it should
    not crash under high memory load. ]
    Signed-off-by: Ingo Molnar

    he, bo
     
  • Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
    load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
    left some more wreckage.

    By setting loop_max unconditionally to ->nr_running, load-balancing
    could take a lot of time on very long runqueues (hackbench!). So keep
    the sysctl as a max limit on the number of tasks we'll iterate.

    Furthermore, the min load filter for migration completely fails with
    cgroups since inequality in per-cpu state can easily lead to such
    small loads :/

    Furthermore, the change to add new tasks to the tail of the queue
    instead of the head seems to have some effect... not quite sure I
    understand why.

    Combined, these fixes solve the huge hackbench regression reported by
    Tim when hackbench is run in a cgroup.

    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
    [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    All SMP architectures have magic to fork the idle task and to store
    it for reuse when cpu hotplug is enabled. Provide a generic
    infrastructure for it.

    Create/reinit the idle thread for the cpu which is brought up in the
    generic code and hand the thread pointer to the architecture code via
    __cpu_up().

    Note that fork_idle() is called via a workqueue, because this
    guarantees that the idle thread does not get a reference to a user
    space VM. This can happen when the boot process did not bring up all
    possible cpus and a later cpu_up() is initiated via the sysfs
    interface. In that case fork_idle() would be called in the context of
    the user space task and take a reference on the user space VM.
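
    A condensed sketch of the resulting interface (after
    kernel/smpboot.c, with locking and error paths elided):

        static DEFINE_PER_CPU(struct task_struct *, idle_threads);

        /* Called from the generic cpu_up() path; reuses the idle task
         * forked earlier instead of forking a fresh one on re-plug. */
        struct task_struct *idle_thread_get(unsigned int cpu)
        {
                struct task_struct *tsk = per_cpu(idle_threads, cpu);

                if (!tsk)
                        return ERR_PTR(-ENOMEM);
                init_idle(tsk, cpu);    /* reinitialize for reuse */
                return tsk;
        }

        /* The architecture then receives the thread pointer directly: */
        int __cpu_up(unsigned int cpu, struct task_struct *tidle);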

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Paul E. McKenney
    Cc: Srivatsa S. Bhat
    Cc: Matt Turner
    Cc: Russell King
    Cc: Mike Frysinger
    Cc: Jesper Nilsson
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: David Howells
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Chris Metcalf
    Cc: Richard Weinberger
    Cc: x86@kernel.org
    Acked-by: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de

    Thomas Gleixner
     

13 Apr, 2012

1 commit

    Migration status depends on a difference of weight from 0 and 1.
    If weight > 1 (<= 1) and old weight <= 1 (> 1) then the task becomes
    pushable (or not pushable). We are not interested in its exact
    values, whether it is 3 or 4, for example.
    Now if we are changing affinity from a set of 3 cpus to a set of 4,
    the task will be dequeued and enqueued sequentially without any
    important difference in comparison with the initial state. The only
    difference is in the internal representation of the plist queue of
    pushable tasks and the fact that the task may not be the first in a
    sequence of tasks of the same priority. But it seems to me it gains
    nothing.
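
    In code, the short-circuit boils down to something like this (a
    paraphrased sketch of the set_cpus_allowed_rt() test, not the
    literal patch):

        /* Only a weight transition across 1 changes whether the task
         * is pushable, so skip the requeue dance otherwise. */
        static bool pushable_state_changes(struct task_struct *p,
                                           const struct cpumask *new_mask)
        {
                int weight = cpumask_weight(new_mask);

                /* same side of the "more than one allowed cpu"
                 * boundary means nothing to update */
                return (p->rt.nr_cpus_allowed > 1) != (weight > 1);
        }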

    Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Tkhai Kirill
    Signed-off-by: Steven Rostedt

    Kirill Tkhai
     

02 Apr, 2012

1 commit

  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    A termination entry is added to the cftype arrays, and populate
    callbacks are replaced with cgroup_subsys->base_cftypes
    initializations.

    This is a functionally identical transformation. There shouldn't be
    any visible behavior change.
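
    The shape of the conversion, for one controller (a schematic
    sketch with hypothetical names, not any one controller's actual
    table):

        static u64 my_usage_read(struct cgroup *cgrp, struct cftype *cft);
        static int my_limit_write(struct cgroup *cgrp, struct cftype *cft,
                                  u64 val);

        static struct cftype my_files[] = {
                {
                        .name = "usage",
                        .read_u64 = my_usage_read,
                },
                {
                        .name = "limit",
                        .write_u64 = my_limit_write,
                },
                { }     /* termination entry */
        };

        struct cgroup_subsys my_subsys = {
                .name = "my",
                .base_cftypes = my_files,
                /* ... create/destroy/attach callbacks ... */
        };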

    memcg is rather special and will be converted separately.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     

01 Apr, 2012

1 commit