20 Oct, 2007
5 commits
-
When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.This version names cgroups which are automatically created using
cgroup_clone() as "node_" where pid is the pid of the unsharing or
cloned process. (Thanks Pavel for the idea) This is safe because if the
process unshares again, it will create/cgroups/(...)/node_/node_
The only possibilities (AFAICT) for a -EEXIST on unshare are
1. pid wraparound
2. a process fails an unshare, then tries again.Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
node_ will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.Changelog:
Name cloned cgroups as "node_".[clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
Signed-off-by: Serge E. Hallyn
Cc: Paul Menage
Signed-off-by: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the cgroup framework.Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.Portions contributed by Balbir Singh
Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystemThe "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Generic Process Control Groups
--------------------------There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernelThis patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2007
3 commits
-
Kconfig.preempt is not included on some archs (for example, m68k). On those
archs, the Kconfig machinery complains that KVM selects an undefined symbol
PREEMPT_NOTIFIERS (which lives in Kconfig.preempt).So move the offending symbol into a Kconfig file which is included by
everyone.Cc: Roman Zippel
Cc: Geert Uytterhoeven
Signed-off-by: Avi Kivity
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Grouping pages by mobility can be disabled at compile-time. This was
considered undesirable by a number of people. However, in the current stack of
patches, it is not a simple case of just dropping the configurable patch as it
would cause merge conflicts. This patch backs out the configuration option.Signed-off-by: Mel Gorman
Acked-by: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The grouping mechanism has some memory overhead and a more complex allocation
path. This patch allows the strategy to be disabled for small memory systems
or if it is known the workload is suffering because of the strategy. It also
acts to show where the page groupings strategy interacts with the standard
buddy allocator.Signed-off-by: Mel Gorman
Signed-off-by: Joel Schopp
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Oct, 2007
6 commits
-
Fix coding style issues reported by Randy Dunlap and others
Signed-off-by: Dhaval Giani
Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Ingo Molnar
Reviewed-by: Thomas Gleixner -
enable CONFIG_FAIR_GROUP_SCHED=y by default.
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Reviewed-by: Thomas Gleixner -
fair-group sched, cleanups.
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Reviewed-by: Thomas Gleixner -
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Dhaval Giani
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Reviewed-by: Thomas Gleixner -
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Dhaval Giani
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Reviewed-by: Thomas Gleixner -
Add interface to control cpu bandwidth allocation to task-groups.
(not yet configurable, due to missing CONFIG_CONTAINERS)
Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Dhaval Giani
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
20 Sep, 2007
1 commit
-
There is still some confusion and disagreement over what this interface should
actually do. So it is best that we disable it in 2.6.23 until we get that
fully sorted out.(sys_timerfd() was present in 2.6.22 but it was apparently broken, so here we
assume that nobody is using it yet).Cc: Michael Kerrisk
Cc: Davide Libenzi
Acked-by: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Aug, 2007
2 commits
-
Remove the top level menu "Code maturity level options", and moves its
options into menu "General setup".This makes Kconfig less cluttered and easier to setup.
Signed-off-by: Al Boldi
Acked-by: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There doesn't seem to be a good reason for ANON_INODES being
an user visible option.Signed-off-by: Adrian Bunk
Acked-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Jul, 2007
1 commit
-
Presently we only use this with CONFIG_EXPERIMENTAL, but it is
something that can be supported commonly.Signed-off-by: Paul Mundt
18 Jul, 2007
1 commit
-
There are some reports that 2.6.22 has SLUB as the default. Not
true!This will make SLUB the default for 2.6.23.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jul, 2007
4 commits
-
Basically, it will allow a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.A new root user (uid == 0) is added to the user namespace upon creation.
Such root users have full privileges and it seems that theses privileges
should be controlled through some means (process capabilities ?)The unshare is not included in this patch.
Changes since [try #4]:
- Updated get_user_ns and put_user_ns to accept NULL, and
get_user_ns to return the namespace.Changes since [try #3]:
- moved struct user_namespace to files user_namespace.{c,h}Changes since [try #2]:
- removed struct user_namespace* argument from find_user()Changes since [try #1]:
- removed struct user_namespace* argument from find_user()
- added a root_user per user namespaceSigned-off-by: Cedric Le Goater
Signed-off-by: Serge E. Hallyn
Acked-by: Pavel Emelianov
Cc: Herbert Poetzl
Cc: Kirill Korotaev
Cc: Eric W. Biederman
Cc: Chris Wright
Cc: Stephen Smalley
Cc: James Morris
Cc: Andrew Morgan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only
deactivate the unshare of the uts and ipc namespaces and do not improve
performance.Signed-off-by: Cedric Le Goater
Acked-by: "Serge E. Hallyn"
Cc: Eric W. Biederman
Cc: Herbert Poetzl
Cc: Pavel Emelianov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Change menuconfig objects from "menu, config" into "menuconfig" so that the
user can disable the whole feature without entering its menu first.Signed-off-by: Jan Engelhardt
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently slob is disabled if we're using sparsemem, due to an earlier
patch from Goto-san. Slob and static sparsemem work without any trouble as
it is, and the only hiccup is a missing slab_is_available() in the case of
sparsemem extreme. With this, we're rid of the last set of restrictions
for slob usage.Signed-off-by: Paul Mundt
Acked-by: Pekka Enberg
Acked-by: Matt Mackall
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Jul, 2007
1 commit
-
"menu, endmenu" that did not get cleaned up in the block patch
[ http://lkml.org/lkml/2007/4/10/251 ]Signed-off-by: Jan Engelhardt
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe
17 May, 2007
2 commits
-
No arch sets ARCH_USES_SLAB_PAGE_STRUCT anymore.
Remove the experimental dependency as well since we want to have it as
a real alternative to SLAB.It all comes down to killing a single line from init/Kconfig.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The SLOB allocator should implement SLAB_DESTROY_BY_RCU correctly, because
even on UP, RCU freeing semantics are not equivalent to simply freeing
immediately. This also allows SLOB to be used on SMP.Signed-off-by: Nick Piggin
Acked-by: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 May, 2007
5 commits
-
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only). It can be used instead of pipe(2) in all cases where those
would simply be used to signal events. Their kernel overhead is much lower
than pipes, and they do not consume two fds. When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it. The API is:int eventfd(unsigned int count);
The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.The read(2) function reads the __u64 counter value, and reset the internal
value to zero. If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that has never
been retired - unlickely, but possible).The write(2) call writes an __u64 count value, and adds it to the current
counter. The eventfd fd supports O_NONBLOCK also.On the kernel side, we have:
struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).The kernel can then call eventfd_signal() every time it wants to post an event
to userspace. The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:http://www.xmailserver.org/eventfd-bench.c
This is the eventfd-based version of pipetest-4 (pipe(2) based):
http://www.xmailserver.org/pipetest-4.c
Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch introduces a new system call for timers events delivered though
file descriptors. This allows timer event to be used with standard POSIX
poll(2), select(2) and read(2). As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.The system call is defined as:
int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
s new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
tv_nsec == 0), this is the period at which the following ticks should be
generated.The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
returned.The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.A quick test program, shows timerfd working correctly on my amd64 box:
http://www.xmailserver.org/timerfd-test.c
[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch add an anonymous inode source, to be used for files that need
and inode only in order to create a file*. We do not care of having an
inode for each file, and we do not even care of having different names in
the associated dentries (dentry names will be same for classes of file*).
This allow code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).Signed-off-by: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Otherwise people get asked about SLUB_DEBUG even if they have another
slab allocator enabled.Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 May, 2007
3 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...Fix trivial comment conflict in kernel/relay.c.
-
Fix some of the spelling issues. Fix sentences. Discourage SLOB use
since SLUB can pack objects denser.Signed-off-by: Christoph Lameter
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components
of SLUB. Thus SLUB will be able to replace SLOB. SLUB can arrange objects in
a denser way than SLOB and the code size should be minimal without debugging
and sysfs support.Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG.
CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB. SLUB enables
debugging via a boot parameter. SLUB debug code should always be present.CONFIG_SLUB_DEBUG can be modified in the embedded config section.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 May, 2007
3 commits
-
Remove the reference to an external mqueue library since that was
merged into glibc in 2004.Signed-off-by: Robert P. J. Day
Signed-off-by: Adrian Bunk -
Fix several typos in help text in Kconfig* files.
Signed-off-by: David Sterba
Signed-off-by: Adrian Bunk -
Several people have observed that perhaps LOG_BUF_SHIFT should be in a more
obvious place than under DEBUG_KERNEL. Under some circumstances (such as the
PARISC architecture), DEBUG_KERNEL can increase kernel size, which is an
undesirable trade off for something as trivial as increasing the kernel log
buffer size.Instead, move LOG_BUF_SHIFT into "General Setup", so that people are more
likely to be able to change it such a circumstance that the default buffer
size is insufficient.Signed-off-by: Alistair John Strachan
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2007
2 commits
-
This adds support for the Analog Devices Blackfin processor architecture, and
currently supports the BF533, BF532, BF531, BF537, BF536, BF534, and BF561
(Dual Core) devices, with a variety of development platforms including those
avaliable from Analog Devices (BF533-EZKit, BF533-STAMP, BF537-STAMP,
BF561-EZKIT), and Bluetechnix! Tinyboards.The Blackfin architecture was jointly developed by Intel and Analog Devices
Inc. (ADI) as the Micro Signal Architecture (MSA) core and introduced it in
December of 2000. Since then ADI has put this core into its Blackfin
processor family of devices. The Blackfin core has the advantages of a clean,
orthogonal,RISC-like microprocessor instruction set. It combines a dual-MAC
(Multiply/Accumulate), state-of-the-art signal processing engine and
single-instruction, multiple-data (SIMD) multimedia capabilities into a single
instruction-set architecture.The Blackfin architecture, including the instruction set, is described by the
ADSP-BF53x/BF56x Blackfin Processor Programming Reference
http://blackfin.uclinux.org/gf/download/frsrelease/29/2549/Blackfin_PRM.pdfThe Blackfin processor is already supported by major releases of gcc, and
there are binary and source rpms/tarballs for many architectures at:
http://blackfin.uclinux.org/gf/project/toolchain/frs There is complete
documentation, including "getting started" guides available at:
http://docs.blackfin.uclinux.org/ which provides links to the sources and
patches you will need in order to set up a cross-compiling environment for
bfin-linux-uclibcThis patch, as well as the other patches (toolchain, distribution,
uClibc) are actively supported by Analog Devices Inc, at:
http://blackfin.uclinux.org/We have tested this on LTP, and our test plan (including pass/fails) can
be found at:
http://docs.blackfin.uclinux.org/doku.php?id=testing_the_linux_kernel[m.kozlowski@tuxland.pl: balance parenthesis in blackfin header files]
Signed-off-by: Bryan Wu
Signed-off-by: Mariusz Kozlowski
Signed-off-by: Aubrey Li
Signed-off-by: Jie Zhang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.A. Management of object queues
A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
each allocating CPU and use objects from a slab directly instead of
queueing them up.B. Storage overhead of object queues
SLAB Object queues exist per node, per CPU. The alien cache queue even
has a queue array that contain a queue for each processor on each
node. For very large systems the number of queues and the number of
objects that may be caught in those queues grows exponentially. On our
systems with 1k nodes / processors we have several gigabytes just tied up
for storing references to objects for those queues This does not include
the objects that could be on those queues. One fears that the whole
memory of the machine could one day be consumed by those queues.C. SLAB meta data overhead
SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block. SLUB keeps
all meta data in the corresponding page_struct. Objects can be naturally
aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
boundaries and can fit tightly into a 4k page with no bytes left over.
SLAB cannot do this.D. SLAB has a complex cache reaper
SLUB does not need a cache reaper for UP systems. On SMP systems
the per CPU slab may be pushed back into partial list but that
operation is simple and does not require an iteration over a list
of objects. SLAB expires per CPU, shared and alien object queues
during cache reaping which may cause strange hold offs.E. SLAB has complex NUMA policy layer support
SLUB pushes NUMA policy handling into the page allocator. This means that
allocation is coarser (SLUB does interleave on a page level) but that
situation was also present before 2.6.13. SLABs application of
policies to individual slab objects allocated in SLAB is
certainly a performance concern due to the frequent references to
memory policies which may lead a sequence of objects to come from
one node after another. SLUB will get a slab full of objects
from one node and then will switch to the next.F. Reduction of the size of partial slab lists
SLAB has per node partial lists. This means that over time a large
number of partial slabs may accumulate on those lists. These can
only be reused if allocator occur on specific nodes. SLUB has a global
pool of partial slabs and will consume slabs from that pool to
decrease fragmentation.G. Tunables
SLAB has sophisticated tuning abilities for each slab cache. One can
manipulate the queue sizes in detail. However, filling the queues still
requires the uses of the spin lock to check out slabs. SLUB has a global
parameter (min_slab_order) for tuning. Increasing the minimum slab
order can decrease the locking overhead. The bigger the slab order the
less motions of pages between per CPU and partial lists occur and the
better SLUB will be scaling.G. Slab merging
We often have slab caches with similar parameters. SLUB detects those
on boot up and merges them into the corresponding general caches. This
leads to more effective memory use. About 50% of all caches can
be eliminated through slab merging. This will also decrease
slab fragmentation because partial allocated slabs can be filled
up again. Slab merging can be switched off by specifying
slub_nomerge on boot up.Note that merging can expose heretofore unknown bugs in the kernel
because corrupted objects may now be placed differently and corrupt
differing neighboring objects. Enable sanity checks to find those.H. Diagnostics
The current slab diagnostics are difficult to use and require a
recompilation of the kernel. SLUB contains debugging code that
is always available (but is kept out of the hot code paths).
SLUB diagnostics can be enabled via the "slab_debug" option.
Parameters can be specified to select a single or a group of
slab caches for diagnostics. This means that the system is running
with the usual performance and it is much more likely that
race conditions can be reproduced.I. Resiliency
If basic sanity checks are on then SLUB is capable of detecting
common error conditions and recover as best as possible to allow the
system to continue.J. Tracing
Tracing can be enabled via the slab_debug=T, option
during boot. SLUB will then protocol all actions on that slabcache
and dump the object contents on free.K. On demand DMA cache creation.
Generally DMA caches are not needed. If a kmalloc is used with
__GFP_DMA then just create this single slabcache that is needed.
For systems that have no ZONE_DMA requirement the support is
completely eliminated.L. Performance increase
Some benchmarks have shown speed improvements on kernbench in the
range of 5-10%. The locking overhead of slub is based on the
underlying base allocation size. If we can reliably allocate
larger order pages then it is possible to increase slub
performance much further. The anti-fragmentation patches may
enable further performance increases.Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + SimulatorSLUB Boot options
slub_nomerge Disable merging of slabs
slub_min_order=x Require a minimum order for slab caches. This
increases the managed chunk size and therefore
reduces meta data and locking overhead.
slub_min_objects=x Mininum objects per slab. Default is 8.
slub_max_order=x Avoid generating slabs larger than order specified.
slub_debug Enable all diagnostics for all caches
slub_debug= Enable selective options for all caches
slub_debug=, Enable selective options for a certain set of
cachesAvailable Debug options
F Double Free checking, sanity and resiliency
R Red zoning
P Object / padding poisoning
U Track last free / alloc
T Trace all allocs / frees (only use for individual slabs).To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 May, 2007
1 commit
-
Clarify the creation of the LOCALVERSION_AUTO string during kernel
configuration, and fix a couple typoes while we're there.Signed-off-by: Robert P. J. Day
Signed-off-by: Andrew Morton
Signed-off-by: Sam Ravnborg