24 Mar, 2011
40 commits
-
When dmesg_restrict is set to 1, CAP_SYS_ADMIN is needed to read the kernel
ring buffer. But a root user without CAP_SYS_ADMIN is able to reset
dmesg_restrict to 0.

This is an issue when e.g. LXC (Linux Containers) is used and the complete
user space is running without CAP_SYS_ADMIN. An unprivileged and jailed
root user can bypass the dmesg_restrict protection.

With this patch, writing to dmesg_restrict is only allowed when root has
CAP_SYS_ADMIN.

Signed-off-by: Richard Weinberger
Acked-by: Dan Rosenberg
Acked-by: Serge E. Hallyn
Cc: Eric Paris
Cc: Kees Cook
Cc: James Morris
Cc: Eugene Teo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add boundaries of allowed input ranges for: dirty_expire_centisecs,
drop_caches, overcommit_memory, page-cluster and panic_on_oom.

Signed-off-by: Petr Holasek
Acked-by: Dave Young
Cc: David Rientjes
Cc: Wu Fengguang
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Drop dead code.
Signed-off-by: Denis Kirjanov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since the for loop already checks table->procname, drop the useless
table->procname checks inside the loop body.

Signed-off-by: Denis Kirjanov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If rio is not a switch then "rswitch" is null.
Signed-off-by: Dan Carpenter
Cc: Matt Porter
Cc: Kumar Gala
Signed-off-by: Alexandre Bounine
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Removes resource reservation from the common subsystem initialization code
and makes it part of mport driver initialization. This resolves a conflict
with resource reservation by device-specific mport drivers.

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Changes mport ID and host destination ID assignment to implement a unified
method common to all mport drivers. Makes the "riohdid=" kernel command line
parameter common for all architectures, with support for more than one host
destination ID assignment.

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Subsystem initialization sequence modified to support presence of multiple
RapidIO controllers in the system. The new sequence is compatible with
initialization of PCI devices.

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
1. Add an option to include RapidIO support if the PCI is available.
2. Add FSL_RIO configuration option to enable controller selection.
3. Add RapidIO support option into x86 and MIPS architectures.

Signed-off-by: Alexandre Bounine
Acked-by: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This set of patches eliminates RapidIO dependency on PowerPC architecture
and makes it available to other architectures (x86 and MIPS). It also
enables support of new platform independent RapidIO controllers such as
PCI-to-SRIO and PCI Express-to-SRIO.

This patch:

Extend number of mport callback functions to eliminate direct linking of
architecture specific mport operations.

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add RapidIO documentation files as discussed earlier (see thread
http://marc.info/?l=linux-kernel&m=129202338918062&w=2).

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add new sysfs attributes.
1. Routing information required to reach the RIO device:
destid - device destination ID (real for endpoint, route for switch)
hopcount - hopcount for maintenance requests (switches only)

2. Device linking information:
lprev - name of the device that precedes the given device in the enumeration
or discovery order (displayed along with the port to which it
is attached).
lnext - names of devices (with corresponding port numbers) that are
attached to the given device as next in the enumeration or
discovery order (switches only)

Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Reduce the lines of code and simplify the logic.
Signed-off-by: Changli Gao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert calls to func_enter on leaving a function to func_exit.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

//
@@
@@

- func_enter();
+ func_exit();
  return...;
//

Signed-off-by: Julia Lawall
Cc: Roger Wolff
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
put_tty_driver calls tty_driver_kref_put on its argument, and then
tty_driver_kref_put calls kref_put on the address of a field of this
argument. kref_put checks for NULL, but in this case the field is likely
to have some offset and so the result of taking its address will not be
NULL. Labels are added to be able to skip over the call to put_tty_driver
when the argument will be NULL.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

//
@@
expression *x;
@@

*if (x == NULL)
{ ...
* put_tty_driver(x);
...
return ...;
}
//

Signed-off-by: Julia Lawall
Cc: Torben Hohn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add smd_pkt driver which provides device interface to smd packet ports.
Signed-off-by: Niranjana Vishwanathapura
Cc: Brian Swetland
Cc: Greg KH
Cc: Alan Cox
Cc: David Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
commit d2478521afc2022 ("char/ipmi: fix OOPS caused by
pnp_unregister_driver on unregistered driver") introduced a section
mismatch by calling __exit cleanup_ipmi_si from __devinit init_ipmi_si.

Remove the __exit annotation from cleanup_ipmi_si.
Signed-off-by: Sergey Senozhatsky
Acked-by: Corey Minyard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While mm->start_stack was protected from cross-uid viewing (commit
f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
processes")), the start_code and end_code values were not. This would
allow the text location of a PIE binary to leak, defeating ASLR.

Note that the value "1" is used instead of "0" for a protected value since
"ps", "killall", and likely other readers of /proc/pid/stat, take
start_code of "0" to mean a kernel thread and will misbehave. Thanks to
Brad Spengler for pointing this out.

Addresses CVE-2011-0726
Signed-off-by: Kees Cook
Cc:
Cc: Alexey Dobriyan
Cc: David Howells
Cc: Eugene Teo
Cc: Martin Schwidefsky
Cc: Brad Spengler
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
1. namelen is declared "unsigned short", which hints at "maybe space savings".
Indeed, in 2.4 struct proc_dir_entry looked like:

struct proc_dir_entry {
	unsigned short low_ino;
	unsigned short namelen;

Now low_ino is "unsigned int", so all savings have been gone for a long time.
"struct proc_dir_entry" is not numerous enough to worry about its size,
anyway.

2. Converting from unsigned short to int/unsigned int can only create
problems, so we better play it safe.

Space is not really conserved, because of natural alignment for the next
field. sizeof(struct proc_dir_entry) remains the same.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
[root@wei 1]# cat /proc/1/mem
cat: /proc/1/mem: No such process

The error code -ESRCH is wrong in this situation. Return -EPERM instead.
Signed-off-by: Jovi Zhang
Reviewed-by: KOSAKI Motohiro
Cc: Alexey Dobriyan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The current code fails to print the "[heap]" marking if the heap is split
into multiple mappings.

Fix the check so that the marking is displayed in all possible cases:
1. vma matches exactly the heap
2. the heap vma is merged e.g. with bss
3. the heap vma is split e.g. due to locked pages

Test cases. In all cases, the process should have mapping(s) with
[heap] marking:

(1) vma matches exactly the heap
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main (void)
{
	if (sbrk(4096) != (void *)-1) {
		printf("check /proc/%d/maps\n", (int)getpid());
		while (1)
			sleep(1);
	}
	return 0;
}

# ./test1
check /proc/553/maps
[1] + Stopped ./test1
# cat /proc/553/maps | head -4
00008000-00009000 r-xp 00000000 01:00 3113640 /test1
00010000-00011000 rw-p 00000000 01:00 3113640 /test1
00011000-00012000 rw-p 00000000 00:00 0 [heap]
4006f000-40070000 rw-p 00000000 00:00 0

(2) the heap vma is merged
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

char foo[4096] = "foo";
char bar[4096];

int main (void)
{
	if (sbrk(4096) != (void *)-1) {
		printf("check /proc/%d/maps\n", (int)getpid());
		while (1)
			sleep(1);
	}
	return 0;
}

# ./test2
check /proc/556/maps
[2] + Stopped ./test2
# cat /proc/556/maps | head -4
00008000-00009000 r-xp 00000000 01:00 3116312 /test2
00010000-00012000 rw-p 00000000 01:00 3116312 /test2
00012000-00014000 rw-p 00000000 00:00 0 [heap]
4004a000-4004b000 rw-p 00000000 00:00 0

(3) the heap vma is split (this fails without the patch)
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int main (void)
{
	if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
	    (sbrk(4096) != (void *)-1)) {
		printf("check /proc/%d/maps\n", (int)getpid());
		while (1)
			sleep(1);
	}
	return 0;
}

# ./test3
check /proc/559/maps
[1] + Stopped ./test3
# cat /proc/559/maps|head -4
00008000-00009000 r-xp 00000000 01:00 3119108 /test3
00010000-00011000 rw-p 00000000 01:00 3119108 /test3
00011000-00012000 rw-p 00000000 00:00 0 [heap]
00012000-00013000 rw-p 00000000 00:00 0 [heap]

It looks like the bug has been there forever, and since it only results in
some information missing from a procfile, it does not fulfil the -stable
"critical issue" criteria.

Signed-off-by: Aaro Koskinen
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This file is readable by the task owner. Hide kernel addresses from
unprivileged users, but leave them the function names and offsets.

Signed-off-by: Konstantin Khlebnikov
Acked-by: Kees Cook
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Changing cpuset->mems/cpuset->cpus should be protected under
callback_mutex.

cpuset_clone() doesn't follow this rule. It's ok because it's
called when creating and initializing a cgroup, but we'd better
hold the lock to avoid subtle breakage in the future.

Signed-off-by: Li Zefan
Acked-by: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Those functions that use NODEMASK_ALLOC() can't propagate errno
to users, but will fail silently.

Fix it by using a static nodemask_t variable for each function; those
variables are protected by cgroup_mutex.

[akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
Signed-off-by: Li Zefan
Cc: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
to cpuset_migrate_mm().

Signed-off-by: Li Zefan
Cc: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

As spotted by Paul, a side effect is that we fix a bug where the function can
return -ENOMEM but the caller doesn't expect a negative return value.
Therefore change the return value of cpuset_sprintf_cpulist() and
cpuset_sprintf_memlist() from int to size_t.

Signed-off-by: Li Zefan
Acked-by: Paul Menage
Acked-by: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.

This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.

Signed-off-by: David Rientjes
Cc: Balbir Singh
Cc: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
fs/fuse/dev.c::fuse_try_move_page() does
(1) remove a page by ->steal()
(2) re-add the page to page cache
(3) link the page to LRU if it was not on LRU at (1)

This implies the page is _on_ LRU when it's added to the radix-tree. So, the
page is added to the memory cgroup while it's on LRU, because LRU is lazy and
no one flushes it.

This is the same behavior as SwapCache and needs special care:
- remove the page from LRU before overwriting pc->mem_cgroup.
- add the page to LRU after overwriting pc->mem_cgroup.

And we need to take care of pagevec.
If PageLRU(page) is set before we add the PCG_USED bit, the page will not be
added to memcg's LRU (for a short period). So, regardless of the PageLRU(page)
value before commit_charge(), we need to check PageLRU(page) after
commit_charge().

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432
Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Johannes Weiner
Acked-by: Daisuke Nishimura
Cc: Miklos Szeredi
Cc: Balbir Singh
Reported-by: Daniel Poelzleithner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
PageReserved because we never store the array on reserved pages (neither
alloc_pages_exact nor vmalloc use those pages).

So we can replace the check by a BUG_ON.
Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently we are allocating a single page_cgroup array per memory section
(stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is
a correct but memory-inefficient solution because the allocated memory
(unless we fall back to vmalloc) is not kmalloc friendly:

- 32b - 16384 entries (20B per entry) fit into 327680B so the
524288B slab cache is used
- 32b with PAE - 131072 entries with 2621440B fit into 4194304B
- 64b - 32768 entries (40B per entry) fit into 2097152 cache

This is ~37% wasted space per memory section and it sums up for the whole
memory. On an x86_64 machine it is something like 6MB per 1GB of RAM.

We can reduce the internal fragmentation by using alloc_pages_exact which
allocates PAGE_SIZE aligned blocks so we will get down to
Cc: Dave Hansen
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm/memcontrol.c: In function 'mem_cgroup_force_empty':
mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

It's a false positive.
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Greg Thelen
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The statistic counters are in units of pages, there is no reason to make
them 64-bit wide on 32-bit machines.

Make them native words. Since they are signed, this leaves 31 bits on
32-bit machines, which can represent roughly 8TB assuming a page size of
4k.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Weiner
Signed-off-by: Greg Thelen
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For increasing and decreasing per-cpu cgroup usage counters it makes sense
to use signed types, as single per-cpu values might go negative during
updates. But this is not the case for only-ever-increasing event
counters.

All the counters have been signed 64-bit so far, which was enough to count
events even with the sign bit wasted.

This patch:
- divides s64 counters into signed usage counters and unsigned
monotonically increasing event counters.
- converts unsigned event counters into 'unsigned long' rather than
'u64'. This matches the type used by the /proc/vmstat event counters.

The next patch narrows the signed usage counters type (on 32-bit CPUs,
that is).

Signed-off-by: Johannes Weiner
Signed-off-by: Greg Thelen
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is no clear pattern when we pass a page count and when we pass a
byte count that is a multiple of PAGE_SIZE.

We never charge or uncharge subpage quantities, so convert it all to page
counts.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We never uncharge subpage quantities.
Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We never keep subpage quantities in the per-cpu stock.
Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We have two charge cancelling functions: one takes a page count, the other
a page size. The second one just divides the parameter by PAGE_SIZE and
then calls the first one. This is trivial, no need for an extra function.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The reclaim_param_lock is only taken around single reads and writes to
integer variables and is thus superfluous. Drop it.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
page_cgroup_zoneinfo() will never return NULL for a charged page, remove
the check for it in mem_cgroup_get_reclaim_stat_from_page().

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In struct page_cgroup, we have a full word for flags but only a few are
reserved. Use the remaining upper bits to encode, depending on
configuration, the node or the section, to enable page_cgroup-to-page
lookups without a direct pointer.

This saves a full word for every page in a system with memory cgroups
enabled.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Minchan Kim
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds