09 Jan, 2009
40 commits
-
This is to avoid name clashes for the introduction of a global swap()
macro.Signed-off-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In preparation for the introduction of a generic swap() macro.
Signed-off-by: Wu Fengguang
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In preparation for the introduction of a generic swap() macro.
Signed-off-by: Wu Fengguang
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Cc: Ingo Molnar
Cc: Thomas Gleixner
Acked-by: Theodore Ts'o
Acked-by: Mark Fasheh
Acked-by: David S. Miller
Cc: James Morris
Acked-by: Casey Schaufler
Acked-by: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
romfs_strnlen() returns int
unsigned X >= 0 is always true[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: roel kluin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove the saved_max_pfn check from the /proc/vmcore function
read_from_oldmem(). No need to verify, we should be able to just trust
that "elfcorehdr=" is correctly passed to the crash kernel on the kernel
command line like we do with other parameters.The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes
sense to use saved_max_pfn. For oldmem it is used to determine when to
stop reading. For vmcore we already have the elf header info pointing out
the physical memory regions, no need to pass the end-of- old-memory twice.Removing the saved_max_pfn check from vmcore makes it possible for
architectures to skip oldmem but still support crash dump through vmcore -
without the need for the old saved_max_pfn cruft.Architectures that want to play safe can do the saved_max_pfn check in
copy_oldmem_page(). Not sure why anyone would want to do that, but that's
even safer than today - the saved_max_pfn check in vmcore removed by this
patch only checks the first page.Signed-off-by: Magnus Damm
Acked-by: Vivek Goyal
Acked-by: Simon Horman
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Send completion status of the commands to the userspace. Message and
protocol are described in the documentation.Signed-off-by: Evgeniy Polyakov
Cc: Paul Alfille
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Command which allows to reset the bus.
Signed-off-by: Evgeniy Polyakov
Cc: Paul Alfille
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Evgeniy Polyakov
Cc: Paul Alfille
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This small patchset extendes existing commands with reset, master IO and
status messages. Reset is used to reset the bus for given master device,
master IO command allows to initiate IO against bus itself not selecting
slave device first, which can be used to probe the device for example.
And status messages carry command completion status back to the userspace
(namely very useful to get -ENODEV from when requested device was not
found).Great thanks to Paul Alfille of OWFS for testing and commands suggestions.
This patch:
Allow starting of IO not against already found slave devices, but against
the bus itself, which can be used for example to probe devices.[akpm@linux-foundation.org: reindent switch statements]
Signed-off-by: Evgeniy Polyakov
Cc: Paul Alfille
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Evgeniy Polyakov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Initiates search (or alarm search) and returns all found devices to
userspace. Found devices are not added into the system (i.e. they are
not attached to family devices or bus masters), it will be done via (if
was not done yet) usual timed searching.Signed-off-by: Evgeniy Polyakov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Writes and returns sampled data back to userspace.
Signed-off-by: Evgeniy Polyakov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch series introduces and extends several userspace commands
used with netlink protocol.Touch block command allows to write data and return sampled data to
the userspace.Extended search and alarm seach commands to return list of slave
devices found during given search.List masters command allows to send all registered master IDs to the
userspace.Great thanks to Paul Alfille (owfs) who
tested this implementation and wrote w1-to-network daemon
http://sourceforge.net/projects/w1repeater/ andFrederik Deweerdt and Randy Dunlap for review.
This patch:
Returns list of registered bus master devices.
Signed-off-by: Evgeniy Polyakov
Cc: Paul Alfille
Cc: Frederik Deweerdt
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch adds support for the 1-wire master interface for i.MX27 and
i.MX31.Signed-off-by: Luotao Fu
Signed-off-by: Sascha Hauer
Signed-off-by: Evgeniy Polyakov
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Marcel Selhorst
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a driver for controlling Dell-specific backlight and rfkill interfaces.
This driver makes use of the dcdbas interface to the Dell firmware to
allow the backlight and rfkill interfaces on Dell systems to be driven
through the standardised sysfs interfaces.Signed-off-by: Matthew Garrett
Cc: Matt Domsch
Cc: Ivo van Doorn
Cc: Len Brown
Cc: Richard Purdie
Cc: Henrique de Moraes Holschuh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The dcdbas code allows calls to be made into the firmware on Dell systems.
Exporting this to other drivers allows them to implement Dell-specific
functionality in a safe way.Signed-off-by: Matthew Garrett
Cc: Matt Domsch
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled. This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html
Ulrich said:
glibc needs right after startup a bit of random data for internal
protections (stack canary etc). What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it. For
every process startup. That's slow....
The solution is to provide a limited amount of random data to the
starting process in the aux vector. I suggested 16 bytes and this is
what the patch implements. If we need only 16 bytes or less we use the
data directly. If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose. It might not
be the same pool as that for /dev/urandom.Concerns were expressed about the depletion of the randomness pool. But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.Signed-off-by: Kees Cook
Cc: Jakub Jelinek
Cc: Andi Kleen
Cc: Ulrich Drepper
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a process registers for asynchronous notification on a POSIX message
queue, it gets a signal and a siginfo_t structure when a message arrives
on the message queue. The si_pid in the siginfo_t structure is set to the
PID of the process that sent the message to the message queue.The principle is the following:
. when mq_notify(SIGEV_SIGNAL) is called, the caller registers for
notification when a msg arrives. The associated pid structure is stroed into
inode_info->notify_owner. Let's call this process P1.
. when mq_send() is called by say P2, P2 sends a signal to P1 to notify
him about msg arrival.The way .si_pid is set today is not correct, since it doesn't take into account
the fact that the process that is sending the message might not be in the
same namespace as the notified one.This patch proposes to set si_pid to the sender's pid into the notify_owner
namespace.Signed-off-by: Nadia Derbey
Signed-off-by: Sukadev Bhattiprolu
Acked-by: Oleg Nesterov
Cc: Roland McGrath
Cc: Bastian Blank
Cc: Pavel Emelyanov
Cc: Eric W. Biederman
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently task_active_pid_ns is not safe to call after a task becomes a
zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
reading the pid namespace from the pid of the task we can trivially solve
this problem at the cost of one extra memory read in what should be the
same cacheline as we read the namespace from.When moving things around I have made task_active_pid_ns out of line
because keeping it in pid_namespace.h would require adding includes of
pid.h and sched.h that I don't think we want.This change does make task_active_pid_ns unsafe to call during
copy_process until we attach a pid on the task_struct which seems to be a
reasonable trade off.Signed-off-by: Eric W. Biederman
Signed-off-by: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Bastian Blank
Cc: Pavel Emelyanov
Cc: Nadia Derbey
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A current problem with the pid namespace is that it is easy to do pid
related work after exit_task_namespaces which drops the nsproxy pointer.However if we are doing pid namespace related work we are always operating
on some struct pid which retains the pid_namespace pointer of the pid
namespace it was allocated in.So provide ns_of_pid which allows us to find the pid namespace a pid was
allocated in.Using this we have the needed infrastructure to do pid namespace related
work at anytime we have a struct pid, removing the chance of accidentally
having a NULL pointer dereference when accessing current->nsproxy.Signed-off-by: Eric W. Biederman
Signed-off-by: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Bastian Blank
Cc: Pavel Emelyanov
Cc: Nadia Derbey
Acked-by: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Impact: cleanups, use new cpumask API
Final trivial cleanups: mainly s/cpumask_t/struct cpumask
Note there is a FIXME in generate_sched_domains(). A future patch will
change struct cpumask *doms to struct cpumask *doms[].
(I suppose Rusty will do this.)Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Impact: use new cpumask API
This patch mainly does the following things:
- change cs->cpus_allowed from cpumask_t to cpumask_var_t
- call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early()
- call alloc_cpumask_var() for other cpusets
- replace cpus_xxx() to cpumask_xxx()Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Impact: cleanups, reduce stack usage
This patch prepares for the next patch. When we convert
cpuset.cpus_allowed to cpumask_var_t, (trialcs = *cs) no longer works.Another result of this patch is reducing stack usage of trialcs.
sizeof(*cs) can be as large as 148 bytes on x86_64, so it's really not
good to have it on stack.Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Impact: reduce stack usage
Allocate a global cpumask_var_t at boot, and use it in cpuset_attach(), so
we won't fail cpuset_attach().Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Impact: reduce stack usage
Just use cs->cpus_allowed, and no need to allocate a cpumask_var_t.
Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patchset converts cpuset to use new cpumask API, and thus
remove on stack cpumask_t to reduce stack usage.Before:
# cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
21
After:
# cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t
0This patch:
Impact: reduce stack usage
It's safe to call cpulist_scnprintf inside callback_mutex, and thus we can
just remove the cpumask_t and no need to allocate a cpumask_var_t.Signed-off-by: Li Zefan
Cc: Ingo Molnar
Cc: Rusty Russell
Acked-by: Mike Travis
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I found a bug on my dual-cpu box. I created a sub cpuset in top cpuset
and assign 1 to its cpus. And then we attach some tasks into this sub
cpuset. After this, we offline CPU1. Now, the tasks in this new cpuset
are moved into top cpuset automatically because there is no cpu in sub
cpuset. Then we online CPU1, we find all the tasks which doesn't belong
to top cpuset originally just run on CPU0.We fix this bug by setting task's cpu_allowed to cpu_possible_map when
attaching it into top cpuset. This method needn't modify the current
behavior of cpusets on CPU hotplug, and all of tasks in top cpuset use
cpu_possible_map to initialize their cpu_allowed.Signed-off-by: Miao Xie
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
task_cs() calls task_subsys_state().
We must use rcu_read_lock() to protect cgroup_subsys_state().
It's correct that top_cpuset is never freed, but cgroup_subsys_state()
accesses css_set, this css_set maybe freed when task_cs() called.We use use rcu_read_lock() to protect it.
Signed-off-by: Lai Jiangshan
Acked-by: Paul Menage
Cc: KAMEZAWA Hiroyuki
Cc: Pavel Emelyanov
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add css_tryget(), that obtains a counted reference on a CSS. It is used
in situations where the caller has a "weak" reference to the CSS, i.e.
one that does not protect the cgroup from removal via a reference count,
but would instead be cleaned up by a destroy() callback.css_tryget() will return true on success, or false if the cgroup is being
removed.This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
with the difference that in the event of css_tryget() racing with a
cgroup_rmdir(), css_tryget() will only return false if the cgroup really
does get removed.This implementation is done by biasing css->refcnt, so that a refcnt of 1
means "releasable" and 0 means "released or releasing". In the event of a
race, css_tryget() distinguishes between "released" and "releasing" by
checking for the CSS_REMOVED flag in css->flags.Signed-off-by: Paul Menage
Tested-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Update the memory controller to use its hierarchy_mutex rather than
calling cgroup_lock() to protected against cgroup_mkdir()/cgroup_rmdir()
from occurring in its hierarchy.Signed-off-by: Paul Menage
Tested-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
These patches introduce new locking/refcount support for cgroups to
reduce the need for subsystems to call cgroup_lock(). This will
ultimately allow the atomicity of cgroup_rmdir() (which was removed
recently) to be restored.These three patches give:
1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
use to prevent changes to its own cgroup tree2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
memory controller3/3 - introduce a css_tryget() function similar to the one recently
proposed by Kamezawa, but avoiding spurious refcount failures in
the event of a race between a css_tryget() and an unsuccessful
cgroup_rmdir()Future patches will likely involve:
- using hierarchy mutex in place of cgroup_lock() in more subsystems
where appropriate- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()
This patch:
Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
the hierarchy observed by that subsystem. It is taken by the cgroup
subsystem (in addition to cgroup_mutex) for the following operations:- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
bind() callback)Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.Signed-off-by: Paul Menage
Tested-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now, you can see following even when swap accounting is enabled.
1. Create Group 01, and 02.
2. allocate a "file" on tmpfs by a task under 01.
3. swap out the "file" (by memory pressure)
4. Read "file" from a task in group 02.
5. the charge of "file" is moved to group 02.This is not ideal behavior. This is because SwapCache which was loaded
by read-ahead is not taken into account..This is a patch to fix shmem's swapcache behavior.
- remove mem_cgroup_cache_charge_swapin().
- Add SwapCache handler routine to mem_cgroup_cache_charge().
By this, shmem's file cache is charged at add_to_page_cache()
with GFP_NOWAIT.
- pass the page of swapcache to shrink_mem_cgroup.Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Paul Menage
Cc: Li Zefan
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now, a page can be deleted from SwapCache while do_swap_page().
memcg-fix-swap-accounting-leak-v3.patch handles that, but, LRU handling is
still broken. (above behavior broke assumption of memcg-synchronized-lru
patch.)This patch is a fix for LRU handling (especially for per-zone counters).
At charging SwapCache,
- Remove page_cgroup from LRU if it's not used.
- Add page cgroup to LRU if it's not linked to.Reported-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Paul Menage
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
From:KAMEZAWA Hiroyuki
css_tryget() newly is added and we can know css is alive or not and get
refcnt of css in very safe way. ("alive" here means "rmdir/destroy" is
not called.)This patch replaces css_get() to css_tryget(), where I cannot explain
why css_get() is safe. And removes memcg->obsolete flag.Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Paul Menage
Cc: Daisuke Nishimura
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
1. Fix double-free BUG in error route of mem_cgroup_create().
mem_cgroup_free() itself frees per-zone-info.
2. Making refcnt of memcg simple.
Add 1 refcnt at creation and call free when refcnt goes down to 0.Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Paul Menage
Cc: Daisuke Nishimura
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix swapin charge operation of memcg.
Now, memcg has hooks to swap-out operation and checks SwapCache is really
unused or not. That check depends on contents of struct page. I.e. If
PageAnon(page) && page_mapped(page), the page is recoginized as
still-in-use.Now, reuse_swap_page() calles delete_from_swap_cache() before establishment
of any rmap. Then, in followinig sequence(Page fault with WRITE)
try_charge() (charge += PAGESIZE)
commit_charge() (Check page_cgroup is used or not..)
reuse_swap_page()
-> delete_from_swapcache()
-> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
......
New charge is uncharged soon....
To avoid this, move commit_charge() after page_mapcount() goes up to 1.
By this,try_charge() (usage += PAGESIZE)
reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
commit_charge() (If page_cgroup is not marked as PCG_USED,
add new charge.)
Accounting will be correct.Changelog (v2) -> (v3)
- fixed invalid charge to swp_entry==0.
- updated documentation.
Changelog (v1) -> (v2)
- fixed comment.[nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Tested-by: Balbir Singh
Cc: Hugh Dickins
Cc: Daisuke Nishimura
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_hierarchicl_reclaim() works properly even when !use_hierarchy
now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch), so, instead of
try_to_free_mem_cgroup_pages(), it should be used in many cases.The only exception is force_empty. The group has no children in this
case.Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mpol_rebind_mm(), which can be called from cpuset_attach(), does
down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
called under cgroup_mutex.OTOH, page fault path does down_read(mm->mmap_sem) and calls
mem_cgroup_try_charge_xxx(), which may eventually calls
mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls
cgroup_lock(). This means cgroup_lock() can be called under
down_read(mm->mmap_sem).If those two paths race, deadlock can happen.
This patch avoid this deadlock by:
- remove cgroup_lock() from mem_cgroup_out_of_memory().
- define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task()
(->attach handler of memory cgroup) and mem_cgroup_out_of_memory.Signed-off-by: Daisuke Nishimura
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds