18 Dec, 2009
4 commits
-
…/rusty/linux-2.6-for-linus
* 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
cpumask: rename tsk_cpumask to tsk_cpus_allowed
cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
cpumask: avoid dereferencing struct cpumask
cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
cpumask: avoid deprecated function in mm/slab.c
cpumask: use cpu_online in kernel/perf_event.c -
…s/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
NOMMU: Optimise away the {dac_,}mmap_min_addr tests
security/min_addr.c: make init_mmap_min_addr() static
keys: PTR_ERR return of wrong pointer in keyctl_get_security() -
* 'kmemleak' of git://linux-arm.org/linux-2.6:
kmemleak: fix kconfig for crc32 build error
kmemleak: Reduce the false positives by checking for modified objects
kmemleak: Show the age of an unreferenced object
kmemleak: Release the object lock before calling put_object()
kmemleak: Scan the _ftrace_events section in modules
kmemleak: Simplify the kmemleak_scan_area() function prototype
kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE -
I added blk_run_backing_dev on page_cache_async_readahead so readahead I/O
is unpluged to improve throughput on especially RAID environment.The normal case is, if page N become uptodate at time T(N), then T(N)
Acked-by: Wu Fengguang
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Tested-by: Ronald
Cc: Bart Van Assche
Cc: Vladislav Bolkhovitin
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Dec, 2009
11 commits
-
These days we use cpumask_empty() which takes a pointer.
Signed-off-by: Rusty Russell
Acked-by: Christoph Lameter -
Replacing
error = 0;
if (error)
op
with nothing is not quite an equivalent transformation ;-)Signed-off-by: Al Viro
-
In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
skipped by the compiler. We do this as the minimum mmap address doesn't make
any sense in NOMMU mode.mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
Signed-off-by: David Howells
Acked-by: Eric Paris
Signed-off-by: James Morris -
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
HWPOISON: Remove stray phrase in a comment
HWPOISON: Try to allocate migration page on the same node
HWPOISON: Don't do early filtering if filter is disabled
HWPOISON: Add a madvise() injector for soft page offlining
HWPOISON: Add soft page offline support
HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
HWPOISON: Use new shake_page in memory_failure
HWPOISON: Use correct name for MADV_HWPOISON in documentation
HWPOISON: mention HWPoison in Kconfig entry
HWPOISON: Use get_user_page_fast in hwpoison madvise
HWPOISON: add an interface to switch off/on all the page filters
HWPOISON: add memory cgroup filter
memcg: add accessor to mem_cgroup.css
memcg: rename and export try_get_mem_cgroup_from_page()
HWPOISON: add page flags filter
mm: export stable page flags
HWPOISON: limit hwpoison injector to known page types
HWPOISON: add fs/device filters
HWPOISON: return 0 to indicate success reliably
HWPOISON: make semantics of IGNORED/DELAYED clear
... -
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
direct I/O fallback sync simplification
ocfs: stop using do_sync_mapping_range
cleanup blockdev_direct_IO locking
make generic_acl slightly more generic
sanitize xattr handler prototypes
libfs: move EXPORT_SYMBOL for d_alloc_name
vfs: force reval of target when following LAST_BIND symlinks (try #7)
ima: limit imbalance msg
Untangling ima mess, part 3: kill dead code in ima
Untangling ima mess, part 2: deal with counters
Untangling ima mess, part 1: alloc_file()
O_TRUNC open shouldn't fail after file truncation
ima: call ima_inode_free ima_inode_free
IMA: clean up the IMA counts updating code
ima: only insert at inode creation time
ima: valid return code from ima_inode_alloc
fs: move get_empty_filp() deffinition to internal.h
Sanitize exec_permission_lite()
Kill cached_lookup() and real_lookup()
Kill path_lookup_open()
...Trivial conflicts in fs/direct-io.c
-
In the case of direct I/O falling back to buffered I/O we sync data
twice currently: once at the end of generic_file_buffered_write using
filemap_write_and_wait_range and once a little later in
__generic_file_aio_write using do_sync_mapping_range with all flags set.The wait before write of the do_sync_mapping_range call does not make
any sense, so just keep the filemap_write_and_wait_range call and move
it to the right spot.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
Now that we cache the ACL pointers in the generic inode all the generic_acl
cruft can go away and generic_acl.c can directly implement xattr handlers
dealing with the full Posix ACL semantics for in-memory filesystems.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods. This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute. With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.[with GFS2 bits updated by Steven Whitehouse ]
Signed-off-by: Christoph Hellwig
Reviewed-by: James Morris
Acked-by: Joel Becker
Signed-off-by: Al Viro -
There are 2 groups of alloc_file() callers:
* ones that are followed by ima_counts_get
* ones giving non-regular files
So let's pull that ima_counts_get() into alloc_file();
it's a no-op in case of non-regular files.Signed-off-by: Al Viro
-
... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
16 Dec, 2009
25 commits
-
Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
Remove it.[akpm@linux-foundation.org: cleanup]
Signed-off-by: Bob Liu
Cc: Daisuke Nishimura
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock
caused by race between oom and cpuset_attach) instead of cgroup_mutex to
fix a deadlock problem. The cgroup_mutex, which was removed by the
commit, in mem_cgroup_out_of_memory() was originally introduced at commit
c7ba5c9e (Memory controller: OOM handling).IIUC, the intention of this cgroup_mutex was to prevent task move during
select_bad_process() so that situations like below can be avoided.Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
1. Process A, which has been in cgroup "baa" and uses large memory, is just
moved to cgroup "foo". Process A can be the candidates for being killed.
2. Process B, which has been in cgroup "foo" and uses large memory, is just
moved from cgroup "foo". Process B can be excluded from the candidates for
being killed.But these race window exists anyway even if we hold a lock, because
__mem_cgroup_try_charge() decides wether it should trigger oom or not
outside of the lock. So the original cgroup_mutex in
mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And
IMHO, those races are not so critical for users.This patch removes it and make codes simpler.
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
task_in_mem_cgroup(), which is called by select_bad_process() to check
whether a task can be a candidate for being oom-killed from memcg's limit,
checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs
to).But this check return true(it's false positive) when:
/aa use_hierarchy == 0 /aa/00 use_hierarchy == 1
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_move_parent() calls try_charge first and cancel_charge on
failure. IMHO, charge/uncharge(especially charge) is high cost operation,
so we should avoid it as far as possible.This patch tries to delay try_charge in mem_cgroup_move_parent() by
re-ordering checks it does.And this patch renames mem_cgroup_move_account() to
__mem_cgroup_move_account(), changes the return value of
__mem_cgroup_move_account() from int to void, and adds a new
wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid
for moving account and calls __mem_cgroup_move_account().This patch removes the last caller of trylock_page_cgroup(), so removes
its definition too.Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are some places calling both res_counter_uncharge() and css_put() to
cancel the charge and the refcnt we have got by mem_cgroup_tyr_charge().This patch introduces mem_cgroup_cancel_charge() and call it in those
places.Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Daisuke Nishimura
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes
grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPEDAnd in global VM, mapped shared memory is accounted into FILE_MAPPED.
But memcg doesn't. fix it.
Note:
page_is_file_cache() just checks SwapBacked or not.
So, we need to check PageAnon.Cc: Balbir Singh
Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a patch for coalescing access to res_counter at charging by percpu
caching. At charge, memcg charges 64pages and remember it in percpu
cache. Because it's cache, drain/flush if necessary.This version uses public percpu area.
2 benefits for using public percpu area.
1. Sum of stocked charge in the system is limited to # of cpus
not to the number of memcg. This shows better synchonization.
2. drain code for flush/cpuhotplug is very easy (and quick)The most important point of this patch is that we never touch res_counter
in fast path. The res_counter is system-wide shared counter which is modified
very frequently. We shouldn't touch it as far as we can for avoiding
false sharing.On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.[without memcg config]
40156968 page-faults # 0.085 M/sec ( +- 0.046% )
27.67 cache-miss/faults[root cgroup]
36659599 page-faults # 0.077 M/sec ( +- 0.247% )
31.58 cache miss/faults[in a child cgroup]
18444157 page-faults # 0.039 M/sec ( +- 0.133% )
69.96 cache miss/faults[ + coalescing uncharge patch]
27133719 page-faults # 0.057 M/sec ( +- 0.155% )
47.16 cache miss/faults[ + coalescing uncharge patch + this patch ]
34224709 page-faults # 0.072 M/sec ( +- 0.173% )
34.69 cache miss/faultsChangelog (since Oct/2):
- updated comments
- replaced get_cpu_var() with __get_cpu_var() if possible.
- removed mutex for system-wide drain. adds a counter instead of it.
- removed CONFIG_HOTPLUG_CPUChangelog (old):
- rebased onto the latest mmotm
- moved charge size check before __GFP_WAIT check for avoiding unnecesary
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.[akpm@linux-foundation.org: tweak comments]
[nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In massive parallel enviroment, res_counter can be a performance
bottleneck. One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.Considering charge/uncharge chatacteristic,
- charge is done one by one via demand-paging.
- uncharge is done by
- in chunk at munmap, truncate, exit, execve...
- one by one via vmscan/paging.It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.This patch is a for coalescing uncharge. For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task. A reason for per-task batching is for making use
of caller's context information. We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).The degree of coalescing depends on callers
- at invalidate/trucate... pagevec size
- at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.[without memcg config]
40156968 page-faults # 0.085 M/sec ( +- 0.046% )
27.67 cache-miss/faults
[root cgroup]
36659599 page-faults # 0.077 M/sec ( +- 0.247% )
31.58 miss/faults
[in a child cgroup]
18444157 page-faults # 0.039 M/sec ( +- 0.133% )
69.96 miss/faults
[child with this patch]
27133719 page-faults # 0.057 M/sec ( +- 0.155% )
47.16 miss/faultsWe can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.Changelog(since 2009/10/02):
- renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
- some clean up and commentary/description updates.
- added initialize code to copy_process(). (possible bug fix)Changelog(old):
- fixed !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
- unified patch for callers
- added commetns.
- make ->do_batch as bool.
- removed css_get() at el. We don't need it.Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum
of the usage of pages and swapents in the cgroup. Presently the root
cgroup's memsw.usage_in_bytes shows the wrong value - the number of
swapents are not added.So take MEM_CGROUP_STAT_SWAPOUT into account.
Signed-off-by: Kirill A. Shutemov
Reviewed-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix node-oriented allocation handling in oom-kill.c I myself think of this
as a bugfix not as an ehnancement.In these days, things are changed as
- alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
- mempolicy don't maintain its own private zonelists.
(And cpuset doesn't use nodemask for __alloc_pages_nodemask())So, current oom-killer's check function is wrong.
This patch does
- check nodemask, if nodemask && nodemask doesn't cover all
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- Scan all zonelist under nodemask, if it hits cpuset's wall
this faiulre is from cpuset.
And
- modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
This doesn't change "current" behavior. If callers use __GFP_THISNODE
it should handle "page allocation failure" by itself.- handle __GFP_NOFAIL+__GFP_THISNODE path.
This is something like a FIXME but this gfpmask is not used now.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: David Rientjes
Cc: Daisuke Nishimura
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In a typical oom analysis scenario, we frequently want to know whether the
killed process has a memory leak or not at the first step. This patch
adds vsz and rss information to the oom log to help this analysis. To
save time for the debugging.example:
===================================================================
rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
Call Trace:
[] ?_spin_unlock+0x2b/0x40
[] oom_kill_process+0xbe/0x2b0(snip)
492283 pages non-shared
Out of memory: kill process 2341 (memhog) score 527276 or a child
Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
===========================================================================
^
|
here[rientjes@google.com: fix race, add pid & comm to message]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Better to have complete sentences.
Signed-off-by: Andi Kleen
-
Signed-off-by: Andi Kleen
-
Signed-off-by: Andi Kleen
-
Process based injection is much easier to handle for test programs,
who can first bring a page into a specific state and then test.
So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
to the existing hard offline injector.Signed-off-by: Andi Kleen
-
This is a simpler, gentler variant of memory_failure() for soft page
offlining controlled from user space. It doesn't kill anything, just
tries to invalidate and if that doesn't work migrate the
page away.This is useful for predictive failure analysis, where a page has
a high rate of corrected errors, but hasn't gone bad yet. Instead
it can be offlined early and avoided.The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.We use the page isolate facility to prevent re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory allocation I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer currently, but the simplest solution
currently.When the page is not free or LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could be potentially
improved later.Thanks to Fengguang Wu and Haicheng Li for some fixes.
[Added fix from Andrew Morton to adapt to new migrate_pages prototype]
Signed-off-by: Andi Kleen -
Signed-off-by: Andi Kleen
-
shake_page handles more types of page caches than
the much simpler lru_add_drain_all:- slab (quite inefficiently for now)
- any other caches with a shrinker callback
- per cpu page allocator pages
- per CPU LRUUse this call to try to turn pages into free or LRU pages.
Then handle the case of the page becoming free after drain everything.Signed-off-by: Andi Kleen
-
Signed-off-by: Andi Kleen
-
The previous version didn't take the mmap_sem before calling gup(),
which is racy.Use get_user_pages_fast() instead which doesn't need any locks.
This is also faster of course, but then it doesn't really matter
because this is just a testing path.Based on report from Nick Piggin.
Cc: npiggin@suse.deSigned-off-by: Andi Kleen
-
In some use cases, user doesn't need extra filtering. E.g. user program
can inject errors through madvise syscall to its own pages, however it
might not know what the page state exactly is or which inode the page
belongs to.So introduce an one-off interface "corrupt-filter-enable".
Echo 0 to switch off page filters, and echo 1 to switch on the filters.
[AK: changed default to 0]Signed-off-by: Haicheng Li
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen -
The hwpoison test suite need to inject hwpoison to a collection of
selected task pages, and must not touch pages not owned by them and
thus kill important system processes such as init. (But it's OK to
mis-hwpoison free/unowned pages as well as shared clean pages.
Mis-hwpoison of shared dirty pages will kill all tasks, so the test
suite will target all or non of such tasks in the first place.)The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, and tell the hwpoison
injection code to only kill pages associated with some active memory
cgroup.The prerequisite for doing hwpoison stress tests with mem_cgroup is,
the mem_cgroup code tracks task pages _accurately_ (unless page is
locked). Which we believe is/should be true.The benefits are simplification of hwpoison injector code. Also the
mem_cgroup code will automatically be tested by hwpoison test cases.The alternative interfaces pin-pfn/unpin-pfn can also delegate the
(process and page flags) filtering functions reliably to user space.
However prototype implementation shows that this scheme adds more
complexity than we wanted.Example test case:
mkdir /cgroup/hwpoison
usemem -m 100 -s 1000 &
echo `jobs -p` > /cgroup/hwpoison/tasksmemcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcgpage-types -p `pidof init` --hwpoison # shall do nothing
page-types -p `pidof usemem` --hwpoison # poison its pages[AK: Fix documentation]
[Add fix for problem noticed by Li Zefan ;
dentry in the css could be NULL]CC: KOSAKI Motohiro
CC: Hugh Dickins
CC: Daisuke Nishimura
CC: Balbir Singh
CC: KAMEZAWA Hiroyuki
CC: Li Zefan
CC: Paul Menage
CC: Nick Piggin
CC: Andi Kleen
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen -
So that an outside user can free the reference count grabbed by
try_get_mem_cgroup_from_page().CC: KOSAKI Motohiro
CC: Hugh Dickins
CC: Daisuke Nishimura
CC: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen -
So that the hwpoison injector can get mem_cgroup for arbitrary page
and thus know whether it is owned by some mem_cgroup task(s).[AK: Merged with latest git tree]
CC: KOSAKI Motohiro
CC: Hugh Dickins
CC: Daisuke Nishimura
CC: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen -
When specified, only poison pages if ((page_flags & mask) == value).
- corrupt-filter-flags-mask
- corrupt-filter-flags-valueThis allows stress testing of many kinds of pages.
Strictly speaking, the buddy pages requires taking zone lock, to avoid
setting PG_hwpoison on a "was buddy but now allocated to someone" page.
However we can just do nothing because we set PG_locked in the beginning,
this prevents the page allocator from allocating it to someone. (It will
BUG() on the unexpected PG_locked, which is fine for hwpoison testing.)[AK: Add select PROC_PAGE_MONITOR to satisfy dependency]
CC: Nick Piggin
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen