Eric Lee / smarc-fsl-linux-kernel

08 Feb, 2008

9 commits

0eea10301 Memory controller improve user interface ... Browse Code »

Change the interface to use bytes instead of pages. Page sizes can vary
across platforms and configurations. A new strategy routine has been added
to the resource counters infrastructure to format the data as desired.

Suggested by David Rientjes, Andrew Morton and Herbert Poetzl

Tested on a UML setup with the config for memory control enabled.

[kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
Signed-off-by: Balbir Singh
Signed-off-by: Pavel Emelianov
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: David Rientjes
Cc: Vaidyanathan Srinivasan
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2008-02-08 00:42:18 +0800
78fb74669 Memory controller: accounting setup ... Browse Code »

Basic setup routines, the mm_struct has a pointer to the cgroup that
it belongs to and the the page has a page_cgroup associated with it.

Signed-off-by: Pavel Emelianov
Signed-off-by: Balbir Singh
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelianov
2008-02-08 00:42:18 +0800
e552b6617 Memory controller: resource counters ... Browse Code »

With fixes from David Rientjes

Introduce generic structures and routines for resource accounting.

Each resource accounting cgroup is supposed to aggregate it,
cgroup_subsystem_state and its resource-specific members within.

Signed-off-by: Pavel Emelianov
Signed-off-by: Balbir Singh
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Vaidyanathan Srinivasan
Signed-off-by: David Rientjes
Cc: Pavel Emelianov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelianov
2008-02-08 00:42:18 +0800
e9685a03c kernel/cgroup.c: make 2 functions static ... Browse Code »

cgroup_is_releasable() and notify_on_release() should be static,
not global inline.

Signed-off-by: Adrian Bunk
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-08 00:42:18 +0800
8dc4f3e17 cgroups: move cgroups destroy() callbacks to cgroup_diput() ... Browse Code »

Move the calls to the cgroup subsystem destroy() methods from
cgroup_rmdir() to cgroup_diput(). This allows control file reads and
writes to access their subsystem state without having to be concerned with
locking against cgroup destruction - the control file dentry will keep the
cgroup and its subsystem state objects alive until the file is closed.

The documentation is updated to reflect the changed semantics of destroy();
additionally the locking comments for destroy() and some other methods were
clarified and decrustified.

Signed-off-by: Paul Menage
Cc: Paul Jackson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Menage
2008-02-08 00:42:18 +0800
622d42cac cgroup simplify space stripping ... Browse Code »

Simplify the space stripping code in cgroup file write.

[akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
Signed-off-by: Paul Jackson
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2008-02-08 00:42:18 +0800
e18f6318e cgroup brace coding style fix ... Browse Code »

Coding style fix - one line conditionals don't get braces.

Signed-off-by: Paul Jackson
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2008-02-08 00:42:17 +0800
3cdeed298 kernel/cgroup.c: remove dead code ... Browse Code »

This patch removes dead code spotted by the Coverity checker
(look at the "(nbytes >= PATH_MAX)" check).

Signed-off-by: Adrian Bunk
Cc: Paul Jackson
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-08 00:42:17 +0800
eccba0689 gfs2: make gfs2_glock.gl_owner_pid be a struct pid * ... Browse Code »

The gl_owner_pid field is used to get the lock owning task by its pid, so make
it in a proper manner, i.e. by using the struct pid pointer and pid_task()
function.

The pid_task() becomes exported for the gfs2 module.

Signed-off-by: Pavel Emelyanov
Cc: "Eric W. Biederman"
Acked-by: Steven Whitehouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2008-02-08 00:42:06 +0800

07 Feb, 2008

19 commits

f47cd9b55 kprobes: kretprobe user entry-handler ... Browse Code »

Provide support to add an optional user defined callback to be run at
function entry of a kretprobe'd function. Also modify the kprobe smoke
tests to include an entry-handler during the kretprobe sanity test.

Signed-off-by: Abhishek Sagar
Cc: Prasanna S Panchamukhi
Cc: Ananth N Mavinakayanahalli
Cc: Anil S Keshavamurthy
Acked-by: Jim Keniston
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Abhishek Sagar
2008-02-07 02:41:11 +0800
ec03d7073 speed up jiffies conversion functions if HZ==USER_HZ ... Browse Code »

Avoid calling do_div(x, 1) in this case.

Cc: David Fries
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2008-02-07 02:41:10 +0800
6ffc787a4 system timer: fix crash in <100Hz system timer ... Browse Code »

The kernel has a divide by zero crash when trying to run the system timer
less than 100Hz. The problem is x/(HZ/USER_HZ) and related. Now
x*(USER_HZ/HZ) will be used if HZ
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Fries
2008-02-07 02:41:10 +0800
1bf47346d kernel/sys.c: get rid of expensive divides in groups_sort() ... Browse Code »

groups_sort() can be quite long if user loads a large gid table.

This is because GROUP_AT(group_info, some_integer) uses an integer divide.
So having to do XXX thousand divides during one syscall can lead to very
high latencies. (NGROUPS_MAX=65536)

In the past (25 Mar 2006), an analog problem was found in groups_search()
(commit d74beb9f33a5f16d2965f11b275e401f225c949d ) and at that time I
changed some variables to unsigned int.

I believe that a more generic fix is to make sure NGROUPS_PER_BLOCK is
unsigned.

Signed-off-by: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2008-02-07 02:41:09 +0800
6b2fb3c65 idle_regs() must be __cpuinit ... Browse Code »

Fix the following section mismatch with CONFIG_HOTPLUG=n,
CONFIG_HOTPLUG_CPU=y:

WARNING: vmlinux.o(.text+0x399a6): Section mismatch: reference to .init.text.5:idle_regs (between 'fork_idle' and 'get_task_mm')

Signed-off-by: Adrian Bunk
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "Luck, Tony"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-07 02:41:08 +0800
eb38a996e kernel/params.c: remove sparse-warning (different signedness) ... Browse Code »

Fixing:
CHECK kernel/params.c
kernel/params.c:329:41: warning: incorrect type in argument 8 (different signedness)
kernel/params.c:329:41: expected int *num
kernel/params.c:329:41: got unsigned int *

Signed-off-by: Richard Knutsson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Richard Knutsson
2008-02-07 02:41:08 +0800
6c6080f74 stopmachine: semaphore to mutex ... Browse Code »

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Daniel Walker
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daniel Walker
2008-02-07 02:41:08 +0800
1a669c2f1 Add arch_ptrace_stop ... Browse Code »

This adds support to allow asm/ptrace.h to define two new macros,
arch_ptrace_stop_needed and arch_ptrace_stop. These control special
machine-specific actions to be done before a ptrace stop. The new code
compiles away to nothing when the new macros are not defined. This is the
case on all machines to begin with.

On ia64, these macros will be defined to solve the long-standing issue of
ptrace vs register backing store.

Signed-off-by: Roland McGrath
Cc: Petr Tesarik
Cc: Tony Luck
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2008-02-07 02:41:07 +0800
a1e096129 relay: nopage ... Browse Code »

Convert relay from nopage to fault.
Remove redundant vma range checks.
Switch from OOM to SIGBUS if the resource is not available.

Signed-off-by: Nick Piggin
Cc: Tom Zanussi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-02-07 02:41:07 +0800
9cfe015aa get rid of NR_OPEN and introduce a sysctl_nr_open ... Browse Code »

NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.

Unfortunatly some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of vmalloc space potential
exhaust.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.

[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet
Cc: Alan Cox
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: "David S. Miller"
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2008-02-07 02:41:06 +0800
0a76fe8e5 do_wait: remove one "else if" branch ... Browse Code »

Minor cleanup. We can remove one "else if" branch.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-02-07 02:41:04 +0800
eed4a2aba printk.c: use unsigned ints instead of longs for logbuf index ... Browse Code »

Stop using unsigned _longs_ for printk buffer indexes. Log buffer is way
smaller than 2 gigabytes and unsigned ints will work too . Indeed, they do
work nicely on all 32-bit platforms where longs and ints are the same.

With this patch, we have following size savings on amd64:

text data bss dec hex filename
5997 313 17736 24046 5dee 2.6.23.1.t64/kernel/printk.o
5858 313 17700 23871 5d3f 2.6.23.1.printk.t64/kernel/printk.o

Signed-off-by: Denys Vlasenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Denys Vlasenko
2008-02-07 02:41:04 +0800
5e2cb1018 time: fix sysfs_show_{available,current}_clocksources() buffer overflow problem ... Browse Code »

I found that there is a buffer overflow problem in the following code.

Version: 2.6.24-rc2,
File: kernel/time/clocksource.c:417-432
--------------------------------------------------------------------
static ssize_t
sysfs_show_available_clocksources(struct sys_device *dev, char *buf)
{
struct clocksource *src;
char *curr = buf;

spin_lock_irq(&clocksource_lock);
list_for_each_entry(src, &clocksource_list, list) {
curr += sprintf(curr, "%s ", src->name);
}
spin_unlock_irq(&clocksource_lock);

curr += sprintf(curr, "\n");

return curr - buf;
}
-----------------------------------------------------------------------

sysfs_show_current_clocksources() also has the same problem though in practice
the size of current clocksource's name won't exceed PAGE_SIZE.

I fix the bug by using snprintf according to the specification of the kernel
(Version:2.6.24-rc2,File:Documentation/filesystems/sysfs.txt)

Fix sysfs_show_available_clocksources() and sysfs_show_current_clocksources()
buffer overflow problem with snprintf().

Signed-off-by: Miao Xie
Cc: WANG Cong
Cc: Thomas Gleixner
Cc: john stultz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2008-02-07 02:41:03 +0800
c166f23cb kernel/notifier.c should #include <linux/reboot.h> ... Browse Code »

Every file should include the headers containing the prototypes for its global
functions (in this case {,un}register_reboot_notifier()).

Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-07 02:41:02 +0800
bb695170d make srcu_readers_active() static ... Browse Code »

Make the needlessly global srcu_readers_active() static.

Signed-off-by: Adrian Bunk
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-07 02:41:02 +0800
f17d30a80 kernel/ptrace.c should #include <linux/syscalls.h> ... Browse Code »

Every file should include the headers containing the prototypes for its global
functions (in this case sys_ptrace()).

Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-07 02:41:02 +0800
a3b81113f remove support for un-needed _extratext section ... Browse Code »

When passing a zero address to kallsyms_lookup(), the kernel thought it was
a valid kernel address, even if it is not. This is because is_ksym_addr()
called is_kernel_extratext() and checked against labels that don't exist on
many archs (which default as zero). Since PPC was the only kernel which
defines _extra_text, (in 2005), and no longer needs it, this patch removes
_extra_text support.

For some history (provided by Jon):
http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019734.html
http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019736.html
http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019751.html

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Robin Getz
Cc: David Woodhouse
Cc: Jon Loeliger
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Getz
2008-02-07 02:41:01 +0800
d9ae90ac4 use __set_task_state() for TRACED/STOPPED tasks ... Browse Code »

1. It is much easier to grep for ->state change if __set_task_state() is used
instead of the direct assignment.

2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
unneeded mb() (btw even if we use mb() it is still possible that do_wait()
sees the new ->state but not ->exit_code, but this is ok).

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-02-07 02:41:00 +0800
06b8e878a taskstats scaled time cleanup ... Browse Code »

This moves the ability to scale cputime into generic code. This allows us
to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
add an unscaled value to the scaled utime/stime.

This adds a cputime_to_scaled function. As before, the POWERPC version
does the scaling based on the last SPURR/PURR ratio calculated. The
generic and s390 (only other arch to implement asm/cputime.h) versions are
both NOPs.

Also moves the SPURR and PURR snapshots closer.

Signed-off-by: Michael Neuling
Cc: Jay Lan
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Heiko Carstens
Cc: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Neuling
2008-02-07 02:41:00 +0800

06 Feb, 2008

12 commits

f011e2e2d latency.c: use QoS infrastructure ... Browse Code »

Replace latency.c use with pm_qos_params use.

Signed-off-by: mark gross
Cc: "John W. Linville"
Cc: Len Brown
Cc: Jaroslav Kysela
Cc: Takashi Iwai
Cc: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mark Gross
2008-02-06 01:44:22 +0800
d82b35186 pm qos infrastructure and interface ... Browse Code »

The following patch is a generalization of the latency.c implementation done
by Arjan last year. It provides infrastructure for more than one parameter,
and exposes a user mode interface for processes to register pm_qos
expectations of processes.

This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.

Currently we have {cpu_dma_latency, network_latency, network_throughput} as
the initial set of pm_qos parameters.

The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implement is defined by pm_qos_power_init()
and pm_qos_params.h. This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.

For each parameter a list of performance requirements is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the requirement list or elements of the list. Typically the
aggregated target value is simply the max or min of the requirement values
held in the parameter list elements.

>From kernel mode the use of this interface is simple:

pm_qos_add_requirement(param_id, name, target_value):

Will insert a named element in the list for that identified PM_QOS
parameter with the target value. Upon change to this list the new target is
recomputed and any registered notifiers are called only if the target value
is now different.

pm_qos_update_requirement(param_id, name, new_target_value):

Will search the list identified by the param_id for the named list element
and then update its target value, calling the notification tree if the
aggregated target is changed. with that name is already registered.

pm_qos_remove_requirement(param_id, name):

Will search the identified list for the named element and remove it, after
removal it will update the aggregate target and call the notification tree
if the target was changed as a result of removing the named requirement.

>From user mode:

Only processes can register a pm_qos requirement. To provide for
automatic cleanup for process the interface requires the process to register
its parameter requirements in the following way:

To register the default pm_qos target for the specific parameter, the
process must open one of /dev/[cpu_dma_latency, network_latency,
network_throughput]

As long as the device node is held open that process has a registered
requirement on the parameter. The name of the requirement is
"process_" derived from the current->pid from within the open system
call.

To change the requested target value the process needs to write a s32
value to the open device node. This translates to a
pm_qos_update_requirement call.

To remove the user mode request for a target value simply close the device
node.

[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix build again]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: mark gross
Cc: "John W. Linville"
Cc: Len Brown
Cc: Jaroslav Kysela
Cc: Takashi Iwai
Cc: Arjan van de Ven
Cc: Venki Pallipadi
Cc: Adam Belay
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mark Gross
2008-02-06 01:44:22 +0800
4ef7229ff make kernel_shutdown_prepare() static ... Browse Code »

kernel_shutdown_prepare() can now become static.

Signed-off-by: Adrian Bunk
Acked-by: Pavel Machek
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-06 01:44:22 +0800
47a460d5a kernel/power/disk.c: make code static ... Browse Code »

resume_file[] and create_image() can become static.

Signed-off-by: Adrian Bunk
Acked-by: Pavel Machek
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-02-06 01:44:22 +0800
3b7391de6 capabilities: introduce per-process capability bounding set ... Browse Code »

The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system. It can be manipulated through sysctl,
but only init can add capabilities. Root can remove capabilities. By
default it includes all caps except CAP_SETPCAP.

This patch makes the bounding set per-process when file capabilities are
enabled. It is inherited at fork from parent. Noone can add elements,
CAP_SETPCAP is required to remove them.

One example use of this is to start a safer container. For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.

The bounding set will not affect pP and pE immediately. It will only
affect pP' and pE' after subsequent exec()s. It also does not affect pI,
and exec() does not constrain pI'. So to really start a shell with no way
of regain CAP_MKNOD, you would do

prctl(PR_CAPBSET_DROP, CAP_MKNOD);
cap_t cap = cap_get_proc();
cap_value_t caparray[1];
caparray[0] = CAP_MKNOD;
cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
cap_set_proc(cap);
cap_free(cap);

The following test program will get and set the bounding
set (but not pI). For instance

./bset get
(lists capabilities in bset)
./bset drop cap_net_raw
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)

************************************************************
cap_bound.c
************************************************************
#include
#include
#include
#include
#include
#include
#include

#ifndef PR_CAPBSET_READ
#define PR_CAPBSET_READ 23
#endif

#ifndef PR_CAPBSET_DROP
#define PR_CAPBSET_DROP 24
#endif

int usage(char *me)
{
printf("Usage: %s get\n", me);
printf(" %s drop \n", me);
return 1;
}

#define numcaps 32
char *captable[numcaps] = {
"cap_chown",
"cap_dac_override",
"cap_dac_read_search",
"cap_fowner",
"cap_fsetid",
"cap_kill",
"cap_setgid",
"cap_setuid",
"cap_setpcap",
"cap_linux_immutable",
"cap_net_bind_service",
"cap_net_broadcast",
"cap_net_admin",
"cap_net_raw",
"cap_ipc_lock",
"cap_ipc_owner",
"cap_sys_module",
"cap_sys_rawio",
"cap_sys_chroot",
"cap_sys_ptrace",
"cap_sys_pacct",
"cap_sys_admin",
"cap_sys_boot",
"cap_sys_nice",
"cap_sys_resource",
"cap_sys_time",
"cap_sys_tty_config",
"cap_mknod",
"cap_lease",
"cap_audit_write",
"cap_audit_control",
"cap_setfcap"
};

int getbcap(void)
{
int comma=0;
unsigned long i;
int ret;

printf("i know of %d capabilities\n", numcaps);
printf("capability bounding set:");
for (i=0; i< 0)
perror("prctl");
else if (ret==1)
printf("%s%s", (comma++) ? ", " : " ", captable[i]);
}
printf("\n");
return 0;
}

int capdrop(char *str)
{
unsigned long i;

int found=0;
for (i=0; i
Signed-off-by: Andrew G. Morgan
Cc: Stephen Smalley
Cc: James Morris
Cc: Chris Wright
Cc: Casey Schaufler a
Signed-off-by: "Serge E. Hallyn"
Tested-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2008-02-06 01:44:20 +0800
e338d263a Add 64-bit capability support to the kernel ... Browse Code »

The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities. If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.

FWIW libcap-2.00 supports this change (and earlier capability formats)

http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

[akpm@linux-foundation.org: coding-syle fixes]
[akpm@linux-foundation.org: use get_task_comm()]
[ezk@cs.sunysb.edu: build fix]
[akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
[akpm@linux-foundation.org: unused var]
[serue@us.ibm.com: export __cap_ symbols]
Signed-off-by: Andrew G. Morgan
Cc: Stephen Smalley
Acked-by: Serge Hallyn
Cc: Chris Wright
Cc: James Morris
Cc: Casey Schaufler
Signed-off-by: Erez Zadok
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morgan
2008-02-06 01:44:20 +0800
195cf453d mm/page-writeback: highmem_is_dirtyable option ... Browse Code »

Add vm.highmem_is_dirtyable toggle

A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.

Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.

[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana
Cc: Ethan Solomita
Cc: Peter Zijlstra
Cc: WU Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bron Gondwana
2008-02-06 01:44:18 +0800
5e5419734 add mm argument to pte/pmd/pud/pgd_free ... Browse Code »

(with Martin Schwidefsky )

The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.

[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt
Signed-off-by: Martin Schwidefsky
Cc:
Signed-off-by: Kamalesh Babulal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Benjamin Herrenschmidt
2008-02-06 01:44:18 +0800
9f8f21725 Page allocator: clean up pcp draining functions ... Browse Code »

- Add comments explaing how drain_pages() works.

- Eliminate useless functions

- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.

- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.

- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c

- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter
Acked-by: Mel Gorman
Cc: "Rafael J. Wysocki"
Cc: Daniel Walker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2008-02-06 01:44:17 +0800
4d672e7ac timerfd: new timerfd API ... Browse Code »

This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.

The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not
NULL).

The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.

The timerfd_gettime() API returns the next expiration time of the timer, or
{0, 0} if the timerfd has not been set yet.

Like the previous timerfd API implementation, read(2) and poll(2) are
supported (with the same interface). Here's a simple test program I used to
exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c

[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: fix ia64 build]
[akpm@linux-foundation.org: fix m68k build]
[akpm@linux-foundation.org: fix mips build]
[akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
[heiko.carstens@de.ibm.com: fix s390]
[akpm@linux-foundation.org: fix powerpc build]
[akpm@linux-foundation.org: fix sparc64 more]
Signed-off-by: Davide Libenzi
Cc: Michael Kerrisk
Cc: Thomas Gleixner
Cc: Davide Libenzi
Cc: Michael Kerrisk
Cc: Martin Schwidefsky
Signed-off-by: Heiko Carstens
Cc: Michael Kerrisk
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davide Libenzi
2008-02-06 01:44:07 +0800
ed5d2cac1 exec: rework the group exit and fix the race with kill ... Browse Code »

As Roland pointed out, we have the very old problem with exec. de_thread()
sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
clears signal->flags. All signals (even fatal ones) sent in this window
(which is not too small) will be lost.

With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
the new helper, should be used to detect exit_group() or exec() in progress.
It can have more users, but this patch does only strictly necessary changes.

Signed-off-by: Oleg Nesterov
Cc: Davide Libenzi
Cc: Ingo Molnar
Cc: Robin Holt
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-02-06 01:44:07 +0800
f558b7e40 remove handle_group_stop() in favor of do_signal_stop() ... Browse Code »

Every time we set SIGNAL_GROUP_EXIT or clear SIGNAL_STOP_DEQUEUED we also
reset ->group_stop_count.

This means that the SIGNAL_GROUP_EXIT check in handle_group_stop() is not
needed, and do_signal_stop() should check SIGNAL_STOP_DEQUEUED only when
->group_stop_count == 0. With these changes handle_group_stop() becomes the
subset of do_signal_stop(), we can kill it and use do_signal_stop() instead.

Also, a preparation for the next patch.

Signed-off-by: Oleg Nesterov
Cc: Davide Libenzi
Cc: Ingo Molnar
Cc: Robin Holt
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-02-06 01:44:07 +0800