08 Apr, 2014

5 commits

  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem (see the sketch after this
    entry).

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
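
    A hedged illustration of the substitution described above (the function
    names here are made up; only the macro expansion matters):

    /* What <linux/compiler.h> provides for gcc builds (via compiler-gcc.h). */
    #define __weak __attribute__((weak))

    /* Before the cleanup: open-coded gcc attribute syntax. */
    void __attribute__((weak)) example_hook(void) { }

    /* After the cleanup: the portable convenience macro. */
    void __weak other_example_hook(void) { }

    int main(void) { example_hook(); other_example_hook(); return 0; }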
     
  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads      before       after
         16     1249409     1244487
         32     1281786     1246783
         48     1239175     1239138
         64     1244642     1241841
         80     1244346     1248918
         96     1266436     1254316
        112     1307398     1312135
        128     1327607     1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
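
    A hedged sketch of the fast-path change described above (simplified; the
    real test sits in the CONFIG_SLAB allocation path and the helper name
    here is invented):

    static inline bool want_alternate_node_alloc(struct task_struct *tsk)
    {
            /* before: return unlikely(tsk->flags & PF_MEMPOLICY);  */
            /* after: no per-process flag, just test the pointer.   */
            return unlikely(tsk->mempolicy != NULL);
    }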
     
  • copy_flags() does not use the clone_flags formal and can be collapsed
    into copy_process() for cleaner code.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit-rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% hit rate just by adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  50.61%  |      19.90       |
    | patched        |  73.45%  |      13.58       |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  75.28%  |      11.03       |
    | patched        |  88.09%  |       9.31       |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  70.66%  |      17.14       |
    | patched        |  91.15%  |      12.57       |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's are
    just about non-existent. The cycle counts can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   1.06%  |      91.54       |
    | patched        |  99.97%  |      14.18       |
    +----------------+----------+------------------+

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
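
    A hedged sketch of the per-thread cache described above (names and layout
    are illustrative; the real code lives in mm/vmacache.c):

    #define VMACACHE_SLOTS 4        /* small enough to live in task_struct */

    struct thread_vmacache {
            unsigned int seqnum;                 /* snapshot of the mm's seqnum */
            struct vm_area_struct *vmas[VMACACHE_SLOTS];
    };

    /* Lookup/replacement slot: hash on the page number holding the address. */
    static inline unsigned int vmacache_slot(unsigned long addr)
    {
            return (addr >> PAGE_SHIFT) & (VMACACHE_SLOTS - 1);
    }

    /* Invalidation just bumps the mm-wide 32-bit sequence number; a mismatch
     * at lookup time means all slots are stale.  Only on seqnum overflow are
     * the caches of every thread sharing the mm explicitly flushed. */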
     
  • Adds VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs, and
    adds a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMAs will pick up the setting as they are created
    (see the usage sketch after this entry).

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
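
    A hedged userspace sketch of the new control described above (constants
    per include/uapi/linux/prctl.h; the fallback defines cover older headers):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41
    #endif
    #ifndef PR_GET_THP_DISABLE
    #define PR_GET_THP_DISABLE 42
    #endif

    int main(void)
    {
            /* Set the THP disable bit in mm->def_flags; every VMA created by
             * this task from now on inherits it. */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
                    perror("PR_SET_THP_DISABLE");

            printf("THP disabled for new mappings: %ld\n",
                   (long)prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
            return 0;
    }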
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

29 Mar, 2014

1 commit

  • cgroup_exit() is called in fork and exit path. If it's called in the
    failure path during fork, PF_EXITING isn't set, and then lockdep will
    complain.

    Fix this by removing cgroup_exit() in that failure path. cgroup_fork()
    does nothing that needs cleanup.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

11 Mar, 2014

1 commit

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

24 Jan, 2014

4 commits


22 Jan, 2014

1 commit

  • while_each_thread() and next_thread() should die; almost every lockless
    usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits; next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader, since it can exec.

    2. Even if while_each_thread() itself was correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T; even the first next_thread()
    can point to already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.

    Note: of course it is ugly to have both task_struct->thread_node and the
    old task_struct->thread_group, we will kill it later, after we change
    the users of while_each_thread() to use for_each_thread().

    Perhaps we can kill it even before we convert all users, we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because this will lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(): currently its semantics are not clear
    unless thread_group_leader(p), and we need to audit the callers before we
    can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
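
    A hedged sketch of the new iteration pattern (the helper is the
    for_each_thread() added by this patch; the counting function is just an
    example):

    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    /* Count the threads of @g.  Safe under rcu_read_lock() as long as the
     * caller pins @g's task_struct; no pid_alive() dance is needed for the
     * walk itself because the list head in signal_struct is stable. */
    static int count_threads(struct task_struct *g)
    {
            struct task_struct *t;
            int n = 0;

            rcu_read_lock();
            for_each_thread(g, t)
                    n++;
            rcu_read_unlock();

            return n;
    }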
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

18 Jan, 2014

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add <name>" to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is a real mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
    No it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     

13 Jan, 2014

4 commits

  • Some method is needed to deal with rt-mutexes and make sched_dl interact
    with the current PI code; this raises all but trivial issues that need
    (according to us) to be solved with some restructuring of the pi-code
    (i.e., going toward a proxy-execution-ish implementation).

    That work is under development; in the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime)
    used for those replenishments, during the whole period it holds the
    pi-lock, are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock-owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, all but
    clean solution to the problem. In the end it's only a way to re-start
    discussion within the community. So, as always, comments, ideas, rants,
    etc.. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
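
    A hedged sketch of the ordering rules listed above (simplified; lower
    ->prio value means higher priority, and -deadline tasks carry a prio
    below every -priority task, so the second rule falls out of the first
    test):

    static int waiter_less(struct rt_mutex_waiter *left,
                           struct rt_mutex_waiter *right)
    {
            if (left->prio < right->prio)
                    return 1;

            /* Both sides are -deadline: the earlier absolute deadline wins. */
            if (dl_prio(left->prio))
                    return left->task->dl.deadline < right->task->dl.deadline;

            return 0;
    }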
     
  • Introduces the data structures, constants and symbols needed for
    SCHED_DEADLINE implementation.

    Core data structure of SCHED_DEADLINE are defined, along with their
    initializers. Hooks for checking if a task belongs to the new policy
    are also added where they are needed.

    Adds a scheduling class, in sched/dl.c and a new policy called
    SCHED_DEADLINE. It is an implementation of the Earliest Deadline
    First (EDF) scheduling algorithm, augmented with a mechanism (called
    Constant Bandwidth Server, CBS) that makes it possible to isolate
    the behaviour of tasks between each other.

    The typical -deadline task will be made up of a computation phase
    (instance) which is activated in a periodic or sporadic fashion. The
    expected (maximum) duration of such computation is called the task's
    runtime; the time interval by which each instance needs to be completed
    is called the task's relative deadline. The task's absolute deadline
    is dynamically calculated as the time instant a task (better, an
    instance) activates plus the relative deadline.

    The EDF algorithm selects the task with the smallest absolute
    deadline as the one to be executed first, while the CBS ensures that
    each task runs for at most its runtime in every (relative) deadline
    length time interval, avoiding any interference between different
    tasks (bandwidth isolation).
    Thanks to this feature, even tasks that do not strictly comply with
    the computational model sketched above can effectively use the new
    policy.

    To summarize, this patch:
    - introduces the data structures, constants and symbols needed;
    - implements the core logic of the scheduling algorithm in the new
    scheduling class file;
    - provides all the glue code between the new scheduling class and
    the core scheduler and refines the interactions between sched/dl
    and the other existing scheduling classes.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Fabio Checconi
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
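
    A hedged userspace sketch of admitting a task to the new policy (there
    was no glibc wrapper, so the raw syscall is used; the struct layout
    follows include/uapi/linux/sched.h and the parameters are arbitrary
    examples):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;      /* ns */
            uint64_t sched_deadline;     /* ns */
            uint64_t sched_period;       /* ns */
    };

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  =  10 * 1000 * 1000;     /* 10ms budget ...  */
            attr.sched_deadline = 100 * 1000 * 1000;     /* ... every 100ms  */
            attr.sched_period   = 100 * 1000 * 1000;

            /* __NR_sched_setattr needs kernel headers that know the syscall;
             * the policy is available to privileged users only at this point. */
            if (syscall(__NR_sched_setattr, 0, &attr, 0) != 0)
                    perror("sched_setattr");

            return 0;
    }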
     
  • Pick up the latest fixes before applying new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Dec, 2013

1 commit

  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                        CPU B                         CPU C

                                                               load TLB entry
    make entry PTE/PMD_NUMA
                                 fault on entry
                                                               read/write old page
                                 start migrating page
                                 change PTE/PMD to new page
                                                               read/write old page [*]
    flush TLB
                                                               reload TLB from new entry
                                                               read/write new page
                                                               lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making
    pte_accessible() aware of the fact that PROT_NONE and PROT_NUMA memory
    may still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
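
    A hedged sketch of the x86 side of the fix (close to, but not necessarily
    identical to, the final code): a non-present PROT_NONE/NUMA pte is still
    reported accessible while a TLB flush for the mm is pending.

    static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
    {
            if (pte_flags(a) & _PAGE_PRESENT)
                    return true;

            /* Not present, but some CPU may still hold a stale translation
             * until the pending flush completes. */
            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                    return true;

            return false;
    }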
     

27 Nov, 2013

2 commits

  • A zombie task obviously can't fork(), so remove the unnecessary
    initialization of child->exit_state; it is zero anyway after
    dup_task_struct().

    Note: copy_process() is huge and it has a lot of chaotic
    initializations, probably it makes sense to move them into the
    new helper called by dup_task_struct().

    Signed-off-by: Oleg Nesterov
    Cc: David Laight
    Cc: Geert Uytterhoeven
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131113143612.GA10540@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Serge Hallyn writes:
    > Hi Oleg,
    >
    > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
    > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
    > breaks lxc-attach in 3.12. That code forks a child which does
    > setns() and then does a clone(CLONE_PARENT). That way the
    > grandchild can be in the right namespaces (which the child was
    > not) and be a child of the original task, which is the monitor.
    >
    > lxc-attach in 3.11 was working fine with no side effects that I
    > could see. Is there a real danger in allowing CLONE_PARENT
    > when current->nsproxy->pidns_for_children is not our pidns,
    > or was this done out of an "over-abundance of caution"? Can we
    > safely revert that new extra check?

    The two fundamental things I know we can not allow are:
    - A shared signal queue aka CLONE_THREAD. Because we compute the pid
    and uid of the signal when we place it in the queue.

    - Changing the pid and by extension pid_namespace of an existing
    process.

    From a parent's perspective there is nothing special about the pid
    namespace, to deny CLONE_PARENT, because the parent simply won't know or
    care.

    From the child's perspective all that is special really are shared signal
    queues.

    User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
    in different pid namespaces is almost certainly going to break because
    it is complicated. But shared signal handlers can look at per thread
    information to know which pid namespace a process is in, so I don't know
    of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
    at the kernel level. It would be absolutely stupid to implement but
    that is a different thing.

    So hmm.

    Because it can do no harm, and because it is a regression let's remove
    the CLONE_PARENT check and send it stable.

    Cc: stable@vger.kernel.org
    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
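
    A hedged userspace sketch of the lxc-attach pattern described above
    (paths, names and error handling are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>

    static char payload_stack[64 * 1024];

    static int attach_payload(void *arg)
    {
            /* Runs inside the target namespaces, as a child of the monitor. */
            return 0;
    }

    /* The monitor forks a helper; the helper joins the container's namespaces
     * and then uses CLONE_PARENT so the payload becomes the monitor's child. */
    int attach(const char *ns_path)        /* e.g. "/proc/<pid>/ns/pid" */
    {
            int fd = open(ns_path, O_RDONLY);

            if (fd < 0 || setns(fd, 0) != 0)
                    return -1;

            return clone(attach_payload,
                         payload_stack + sizeof(payload_stack),
                         CLONE_PARENT | SIGCHLD, NULL);
    }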
     

15 Nov, 2013

2 commits

  • The basic idea is the same as with PTE level: the lock is embedded into
    struct page of table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of table's page
    for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful
    and false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct
    page, so dynamic allocation is required.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
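
    A hedged sketch of the constructor contract described above (allocation
    details are illustrative; the point is that callers must now handle a
    failing ctor):

    static pmd_t *pmd_table_alloc_sketch(void)
    {
            struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);

            if (!page)
                    return NULL;

            if (!pgtable_pmd_page_ctor(page)) {
                    /* e.g. -rt, where the split lock must be allocated */
                    __free_pages(page, 0);
                    return NULL;
            }

            return (pmd_t *)page_address(page);
    }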
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Nov, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
     

30 Oct, 2013

2 commits

  • uprobe_copy_process() does nothing if the child shares ->mm with
    the forking process, but there is a special case: CLONE_VFORK.
    In this case it would be more correct to do dup_utask() but avoid
    dup_xol(). This is not that important (the child should not unwind
    its stack too much, since that can corrupt the parent's stack), but at
    least we need this to allow ret-probing __vfork() itself.

    Note: in theory, it would be better to check task_pt_regs(p)->sp
    instead of CLONE_VFORK; we need to dup_utask() if and only if the
    child can return from the function called by the parent. But this
    needs an arch-dependent helper, and I think that nobody actually
    does clone(same_stack, CLONE_VM).

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Preparation for the next patches.

    Move the callsite of uprobe_copy_process() in copy_process() down
    to the successful return. We do not care if copy_process() fails;
    uprobe_free_utask() won't be called in this case, so the wrong
    ->utask != NULL doesn't matter.

    OTOH, with this change we know that copy_process() can't fail when
    uprobe_copy_process() is called; the new task should either return
    to user-mode or call do_exit(). This way uprobe_copy_process() can:

    1. setup p->utask != NULL if necessary

    2. setup uprobes_state.xol_area

    3. use task_work_add(p)

    Also, move the definition of uprobe_copy_process() down so that it
    can see get_utask().

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

09 Oct, 2013

2 commits

  • A newly spawned thread inside a process should stay on the same
    NUMA node as its parent. This prevents processes from being "torn"
    across multiple NUMA nodes every time they spawn a new thread.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • PTE scanning and NUMA hinting fault handling is expensive so commit
    5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
    on a new node") deferred the PTE scan until a task had been scheduled on
    another node. The problem is that in the purely shared memory case that
    this may never happen and no NUMA hinting fault information will be
    captured. We are not ruling out the possibility that something better
    can be done here, but for now this patch needs to be reverted so that we
    depend entirely on the scan_delay to avoid punishing short-lived processes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

12 Sep, 2013

4 commits

  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers (a sketch follows
    this entry).

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
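
    A hedged sketch of the helper described above (following the entry's own
    mpol_dup() + vma_set_policy() description; the real helper may differ in
    detail):

    static int vma_dup_policy_sketch(struct vm_area_struct *src,
                                     struct vm_area_struct *dst)
    {
            struct mempolicy *pol = mpol_dup(vma_policy(src));

            if (IS_ERR(pol))
                    return PTR_ERR(pol);

            vma_set_policy(dst, pol);
            return 0;
    }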
     
  • do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

    Then later copy_process() denies CLONE_SIGHAND if the new process will
    be in a different pid namespace (task_active_pid_ns() doesn't match
    current->nsproxy->pid_ns).

    This looks confusing and inconsistent. CLONE_NEWPID is very similar to
    the case when ->pid_ns was already unshared; we want the same
    restrictions, so copy_process() should also nack CLONE_PARENT.

    And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
    just for consistency.

    Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
    copy_process() to do the same check along with ->pid_ns check we already
    have.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
    unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does
    the same check.

    Remove the CLONE_NEWPID check to cleanup the code and prepare for the
    next change.

    Test-case:

    static int child(void *arg)
    {
            return 0;
    }

    static char stack[16 * 1024];

    int main(void)
    {
            pid_t pid;

            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

            pid = clone(child, stack + sizeof(stack) / 2,
                        CLONE_NEWPID | SIGCHLD, NULL);
            assert(pid < 0 && errno == EINVAL);

            return 0;
    }

    clone(CLONE_NEWPID) correctly fails with or without this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
    pid_ns; this obviously breaks vfork:

    int main(void)
    {
            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
            assert(vfork() >= 0);
            _exit(0);
            return 0;
    }

    fails without this patch.

    Change this check to use CLONE_SIGHAND instead. This also forbids
    CLONE_THREAD automatically, and this is what the comment implies.

    We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
    would be safer to not do this. The current check denies CLONE_SIGHAND
    implicitly, and there is no reason to change this.

    Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
    Having shared signal handling between two different pid namespaces is
    the case that we are fundamentally guarding against."

    Signed-off-by: Oleg Nesterov
    Reported-by: Colin Walters
    Acked-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit

  • I goofed when I made unshare(CLONE_NEWPID) only work in a
    single-threaded process. There is no need for that requirement and in
    fact I analyzed things right for setns. The hard requirement
    is for tasks that share a VM to all be in the same pid namespace, and
    we properly prevent that in do_fork.

    Just to be certain I took a look through do_wait and
    forget_original_parent and there are no cases that make it any harder
    for children to be in the multiple pid namespaces than it is for
    children to be in the same pid namespace. I also performed a check to
    see if there were any uses of task->nsproxy_pid_ns I was not familiar
    with, but it is only used when allocating a new pid for a new task,
    and in checks to prevent craziness from happening.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Fix inadvertent breakage in the clone syscall ABI for Microblaze that
    was introduced in commit f3268edbe6fe ("microblaze: switch to generic
    fork/vfork/clone").

    The Microblaze syscall ABI for clone takes the parent tid address in the
    4th argument; the third argument slot is used for the stack size. The
    incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
    slot.

    This commit restores the original ABI so that existing userspace libc
    code will work correctly.

    All kernel versions from v3.8-rc1 were affected.

    Signed-off-by: Michal Simek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Simek
     

31 Jul, 2013

1 commit

  • On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
    > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
    > > When using a large number of threads performing AIO operations the
    > > IOCTX list may get a significant number of entries which will cause
    > > significant overhead. For example, when running this fio script:
    > >
    > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=512; thread; loops=100
    > >
    > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
    > > 30% CPU time spent by lookup_ioctx:
    > >
    > > 32.51% [guest.kernel] [g] lookup_ioctx
    > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
    > > 4.40% [guest.kernel] [g] lock_release
    > > 4.19% [guest.kernel] [g] sched_clock_local
    > > 3.86% [guest.kernel] [g] local_clock
    > > 3.68% [guest.kernel] [g] native_sched_clock
    > > 3.08% [guest.kernel] [g] sched_clock_cpu
    > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
    > > 2.60% [guest.kernel] [g] memcpy
    > > 2.33% [guest.kernel] [g] lock_acquired
    > > 2.25% [guest.kernel] [g] lock_acquire
    > > 1.84% [guest.kernel] [g] do_io_submit
    > >
    > > This patch converts the ioctx list to a radix tree. For a performance
    > > comparison the above FIO script was run on a 2-socket, 8-core
    > > machine. These are the results (average and %rsd of 10 runs) for the
    > > original list based implementation and for the radix tree based
    > > implementation:
    > >
    > > cores        1          2          4          8          16         32
    > > list      109376 ms  69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
    > > %rsd      0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
    > > radix     73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
    > > %rsd      1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
    > > % of radix
    > > relative  66.12%     65.59%     66.63%     72.31%     77.26%     83.66%
    > > to list
    > >
    > > To consider the impact of the patch on the typical case of having
    > > only one ctx per process the following FIO script was run:
    > >
    > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=1; thread; loops=100
    > >
    > > on the same system and the results are the following:
    > >
    > > list 58892 ms
    > > %rsd 0.91%
    > > radix 59404 ms
    > > %rsd 0.81%
    > > % of radix
    > > relative 100.87%
    > > to list
    >
    > So, I was just doing some benchmarking/profiling to get ready to send
    > out the aio patches I've got for 3.11 - and it looks like your patch is
    > causing a ~1.5% throughput regression in my testing :/
    ...

    I've got an alternate approach for fixing this wart in lookup_ioctx()...
    Instead of using an rbtree, just use the reserved id in the ring buffer
    header to index an array pointing to the ioctx. It's not finished yet, and
    it needs to be tidied up, but is most of the way there.

    -ben
    --
    "Thought is the essence of where you are now."
    --
    kmo> And, a rework of Ben's code, but this was entirely his idea
    kmo> -Kent

    bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
    free memory.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker