01 Nov, 2011

11 commits

  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed, since killing them
    does not lead to future memory freeing. The counter could be fixed to
    represent all threads sharing the same mm, but it's better to remove
    the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • After selecting a task to kill, the oom killer iterates all processes and
    kills all other threads that share the same mm_struct in different thread
    groups. It would not otherwise be helpful to kill a thread if its memory
    would not be subsequently freed.

    A kernel thread, however, may assume a user thread's mm by using
    use_mm(). This is only temporary and should not result in sending a
    SIGKILL to that kthread.

    This patch ensures that only user threads and not kthreads are sent a
    SIGKILL if they share the same mm_struct as the oom killed task.
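
    A minimal sketch of the idea, assuming the usual task iteration
    helpers (the exact placement in oom_kill.c may differ):

    for_each_process(q)
        if (q->mm == mm && !same_thread_group(q, p) &&
            !(q->flags & PF_KTHREAD))
            force_sig(SIGKILL, q); /* skip kthreads that only use_mm() */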

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread has been oom killed and is frozen, thaw it before returning to
    the page allocator. Otherwise, it can stay frozen indefinitely and no
    memory will be freed.
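
    A rough sketch of the check, assuming the freezer helpers of that
    era (frozen() and thaw_process() are the assumed names), applied to
    a task already marked TIF_MEMDIE:

    /* an oom-killed task must be thawed or it can never exit
     * and free memory */
    if (test_tsk_thread_flag(p, TIF_MEMDIE) && unlikely(frozen(p)))
        thaw_process(p);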

    Signed-off-by: David Rientjes
    Reported-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Looks like someone got distracted after adding the comment characters.

    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A per-task block plug can reduce block queue lock contention and
    increase request merging. Currently page reclaim doesn't use it. I
    originally thought page reclaim didn't need it, because the kswapd
    thread count is limited and file cache writeback is mostly done by
    the flusher threads.

    When I tested a workload with heavy swap on a 4-node machine, each
    CPU was doing direct page reclaim and swapping, which caused block
    queue lock contention. In my test, without the patch below, the CPU
    utilization is about 2% ~ 7%; with the patch, it is about 1% ~ 3%.
    Disk throughput isn't changed. This should improve normal kswapd
    writeout and file cache writeback too (by increasing request
    merging, for example), but the effect might be less obvious for the
    reasons above.
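
    A minimal sketch of wrapping a reclaim loop in a per-task plug (the
    exact placement inside vmscan.c is illustrative):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* ... isolate pages and issue swap-out/writeback I/O ... */
    blk_finish_plug(&plug);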

    Signed-off-by: Shaohua Li
    Cc: Jens Axboe
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • unmap_and_move() is one big, messy function. Clean it up.

    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In the __zone_reclaim case, we don't want to reclaim mapped pages.
    Nonetheless, we isolate mapped pages and re-add them to the head of
    the LRU. That is unnecessary CPU overhead and causes LRU churning.

    Of course, a page might be mapped when we isolate it but no longer
    mapped by the time we try to reclaim it, so it could have been
    reclaimed after all. But that race is rare, and even when it
    happens, it's no big deal.
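
    The filter comes down to one early check in __isolate_lru_page(),
    roughly:

    /* let callers that cannot handle mapped pages skip them */
    if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
        return ret;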

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In async mode, compaction doesn't migrate dirty or writeback pages,
    so it's pointless to pick such a page and re-add it to the LRU list.

    Of course, a page might be dirty or under writeback when compaction
    isolates it but no longer by the time we try to migrate it, so it
    could have been migrated. But that is very unlikely, since the
    isolate-and-migrate cycle is much faster than writeout.

    So this patch reduces CPU overhead and prevents unnecessary LRU
    churning.
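
    The corresponding check in __isolate_lru_page() is, roughly:

    /* async compaction asks for clean pages only */
    if ((mode & ISOLATE_CLEAN) &&
        (PageDirty(page) || PageWriteback(page)))
        return ret;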

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type.
    Macros are generally discouraged here, as they are type-unsafe and
    make debugging harder because the symbols cannot be passed through
    to the debugger.

    Quote from Johannes:
    "Hmm, it would probably be cleaner to fully convert the isolation
    mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch also moves the isolate mode definitions from swap.h to
    mmzone.h (which is pulled in via memcontrol.h).
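
    A sketch of the sparse-checkable bitwise type and flags this
    converts to:

    typedef unsigned __bitwise__ isolate_mode_t;

    /* Isolate inactive pages */
    #define ISOLATE_INACTIVE    ((__force isolate_mode_t)0x1)
    /* Isolate active pages */
    #define ISOLATE_ACTIVE      ((__force isolate_mode_t)0x2)
    /* Isolate clean file pages */
    #define ISOLATE_CLEAN       ((__force isolate_mode_t)0x4)
    /* Isolate unmapped file pages */
    #define ISOLATE_UNMAPPED    ((__force isolate_mode_t)0x8)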

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Compaction's acct_isolated() uses page_lru_base_type(), which
    returns only the base type of an LRU list, so it never returns
    LRU_ACTIVE_ANON or LRU_ACTIVE_FILE. In addition, cc->nr_anon and
    cc->nr_file are used only in acct_isolated(), so they don't need to
    be fields in compact_control.

    This patch removes the fields from compact_control and makes the
    job of acct_isolated() clear: counting the number of anon/file
    pages isolated.
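
    A sketch of the simplified function:

    static void acct_isolated(struct zone *zone, struct compact_control *cc)
    {
        struct page *page;
        unsigned int count[2] = { 0, };

        list_for_each_entry(page, &cc->migratepages, lru)
            count[!!page_is_file_cache(page)]++;

        __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
        __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
    }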

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues
      with using it:
      - It does not allow specifying iovecs for both src and dest;
        assuming preadv or pwritev were implemented, either the area
        read from or the area written to would need to be contiguous.
      - Currently mem_read allows only processes that are currently
        ptrace'ing the target, and are still able to ptrace it, to read
        from the target. This check could possibly be moved to the open
        call, but it's not clear exactly what race this restriction is
        stopping (the reason appears to have been lost).
      - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a
        unix domain socket is a bit ugly from a userspace point of
        view, especially when you may have hundreds if not (eventually)
        thousands of processes that all need to do this with each
        other.
      - It doesn't allow for some future uses of the interface we would
        like to consider adding (see below).
      - Interestingly, reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice was considered instead,
    but it has problems. Since you need the reader and writer working
    co-operatively, if the pipe is not drained then you block, which
    requires some wrapping to do non-blocking sends or to poll on the
    receive side. In all-to-all communication it requires ordering,
    otherwise you can deadlock. And in the example of many MPI tasks
    writing to one MPI task, vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single-copy
    interface does not get us all the performance gain we could have.
    For example, in an MPI_Reduce, rather than copying the data from
    the source we would like to use it directly in a math op (say the
    reduce is doing a sum), as this would save us a copy; we don't need
    to keep a copy of the data from the source. I haven't implemented
    this, but I think this interface could do all of that in the future
    through the use of the flags - e.g. one could specify the math
    operation and type, and the kernel, rather than just copying the
    data, would apply the specified operation between the source and
    destination and store the result in the destination.

    Although we don't have a "second user" of the interface yet (though
    I've had some nibbles from people who may be interested in using it
    for intra-process messaging that is not MPI), this interface is
    something hardware vendors are already implementing in their custom
    drivers for fast local communication. So in addition to being
    useful for OpenMPI, it would mean the driver maintainers don't have
    to fix things up when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
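
    As a rough usage sketch, following the prototype in the proposed
    man page and assuming a libc wrapper with that signature (the pid,
    remote address, and length are illustrative):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* copy 'len' bytes at 'remote_addr' in process 'pid' into 'buf' */
    static ssize_t read_remote(pid_t pid, void *remote_addr,
                               void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* the flags argument must currently be 0 */
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }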

    This has been implemented for x86 and powerpc; other architectures
    should (I think) mainly just need to add syscall numbers for
    process_vm_readv and process_vm_writev. There are 32-bit
    compatibility versions for 64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Oct, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
    leases: fix write-open/read-lease race
    nfs: drop unnecessary locking in llseek
    ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
    vfs: add generic_file_llseek_size
    vfs: do (nearly) lockless generic_file_llseek
    direct-io: merge direct_io_walker into __blockdev_direct_IO
    direct-io: inline the complete submission path
    direct-io: separate map_bh from dio
    direct-io: use a slab cache for struct dio
    direct-io: rearrange fields in dio/dio_submit to avoid holes
    direct-io: fix a wrong comment
    direct-io: separate fields only used in the submission path from struct dio
    vfs: fix spinning prevention in prune_icache_sb
    vfs: add a comment to inode_permission()
    vfs: pass all mask flags check_acl and posix_acl_permission
    vfs: add hex format for MAY_* flag values
    vfs: indicate that the permission functions take all the MAY_* flags
    compat: sync compat_stats with statfs.
    vfs: add "device" tag to /proc/self/mountstats
    cleanup: vfs: small comment fix for block_invalidatepage
    ...

    Fix up trivial conflict in fs/gfs2/file.c (llseek changes)

    Linus Torvalds
     

28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, the pointer to the iovec
    array can be incremented, but the nr_segs value in the iov_iter
    struct is not decremented. The result is an iov_iter struct with an
    nr_segs value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.
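
    A sketch of the advance loop keeping nr_segs in step as whole
    iovecs are consumed (field names as in the iov_iter of that era):

    const struct iovec *iov = i->iov;
    size_t base = i->iov_offset;
    unsigned long nr_segs = i->nr_segs;

    while (bytes) {
        size_t copy = min(bytes, iov->iov_len - base);

        bytes -= copy;
        base += copy;
        if (iov->iov_len == base) {
            iov++;
            nr_segs--;   /* the fix: shrink nr_segs past each iovec */
            base = 0;
        }
    }
    i->iov = iov;
    i->iov_offset = base;
    i->nr_segs = nr_segs;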

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
     

26 Oct, 2011

1 commit


25 Oct, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     
  • * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
    TOMOYO: Fix incomplete read after seek.
    Smack: allow to access /smack/access as normal user
    TOMOYO: Fix unused kernel config option.
    Smack: fix: invalid length set for the result of /smack/access
    Smack: compilation fix
    Smack: fix for /smack/access output, use string instead of byte
    Smack: domain transition protections (v3)
    Smack: Provide information for UDS getsockopt(SO_PEERCRED)
    Smack: Clean up comments
    Smack: Repair processing of fcntl
    Smack: Rule list lookup performance
    Smack: check permissions from user space (v2)
    TOMOYO: Fix quota and garbage collector.
    TOMOYO: Remove redundant tasklist_lock.
    TOMOYO: Fix domain transition failure warning.
    TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
    TOMOYO: Simplify garbage collector.
    TOMOYO: Fix make namespacecheck warnings.
    target: check hex2bin result
    encrypted-keys: check hex2bin result
    ...

    Linus Torvalds
     

20 Oct, 2011

1 commit

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] [] migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Sep, 2011

3 commits

  • A slab should be discarded when the node's partial count exceeds
    min_partial. Otherwise, the node's partial slabs may eat up all
    memory.

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Correct comment errors that mistake the cpu partial object count
    for a page count and may make readers misunderstand.

    Signed-off-by: Alex Shi
    Reviewed-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Historically /proc/slabinfo and files under /sys/kernel/slab/* have
    world read permissions and are accessible to the world. slabinfo
    contains rather private information related both to the kernel and
    userspace tasks. Depending on the situation, it might reveal either
    private information per se or information useful to make another
    targeted attack. Some examples of what can be learned by
    reading/watching for /proc/slabinfo entries:

    1) dentry (and different *inode*) numbers might reveal other
    processes' fs activity. The number of dentry "active objects"
    doesn't strictly show the count of files opened/touched by a
    process; however, there is a good correlation between them. The
    patch "proc: force dcache drop on unauthorized access" relies on
    the privacy of the dentry count.

    2) different inode entries might reveal the same information as
    (1), but these are finer-grained counters. If a filesystem is
    mounted at a private mount point (or even in a private namespace)
    and its fs type differs from the other mounted fs types, fs
    activity at this mount point/namespace is revealed. If there is a
    single ecryptfs mount point, the whole fs activity of a single user
    is revealed. The number of files at an ecryptfs mount point is
    private information per se.

    3) fuse_* reveals the number of files / fs activity of a user at a
    user-private mount point. It is of approximately the same severity
    as the ecryptfs infoleak in (2).

    4) sysfs_dir_cache, similar to (2), reveals device addition/
    removal, which can otherwise be hidden by "chmod 0700 /sys/". With
    0444 slabinfo the precise number of sysfs files is known to the
    world.

    5) buffer_head might reveal some kernel activity. Combined with
    other information leaks, an attacker might identify which specific
    kernel routines generate buffer_head activity.

    6) *kmalloc* infoleaks are very situational. An attacker should
    watch the specific kmalloc size entry and filter out noise from
    unrelated kernel activity. With a relatively silent victim system,
    he might get rather precise counters.

    Additional information sources might significantly increase the
    benefits of the slabinfo infoleak. E.g. if an attacker knows that
    process activity on the system is very low (only core daemons like
    syslog and cron), he may run setxid binaries / trigger local daemon
    activity / trigger network service activity / await sporadic cron
    job activity / etc. and get rather precise counters for the fs and
    network activity of these privileged tasks, which is unknown
    otherwise.

    Also, hiding slabinfo and /sys/kernel/slab/* is one step toward
    complicating the exploitation of kernel heap overflows (and
    possibly other bugs). The related discussion:

    http://thread.gmane.org/gmane.linux.kernel/1108378

    To keep compatibility with the old permission model, where a
    non-root monitoring daemon could watch for kernel memleaks through
    slabinfo, one should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

    And add the following commands to init scripts (to mountall.conf in
    Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

    Signed-off-by: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Reviewed-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    CC: Valdis.Kletnieks@vt.edu
    CC: Linus Torvalds
    CC: Alan Cox
    Signed-off-by: Pekka Enberg

    Vasiliy Kulikov
     

22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

19 Sep, 2011

2 commits


15 Sep, 2011

9 commits

  • Fast-forward merge with Linus to be able to merge patches based on
    a more recent version of the tree.

    Jiri Kosina
     
  • Signed-off-by: Joe Perches
    Acked-by: Paul Menage
    Signed-off-by: Jiri Kosina

    Joe Perches
     
  • The entries found by find_get_pages() could all be swap entries.
    In this case we skip the entries, but make sure the skipped entries
    are accounted for, so we don't keep looping.

    Using nr_found > nr_skip simplifies the code, as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Xen backend drivers (e.g., blkback and netback) would sometimes fail to
    map grant pages into the vmalloc address space allocated with
    alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
    not find the page (in the L2 table) containing the PTEs it needed to
    update.

    (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

    netback and blkback were making the hypercall from a kernel thread where
    task->active_mm != &init_mm and alloc_vm_area() was only updating the page
    tables for init_mm. The usual method of deferring the update to the page
    tables of other processes (i.e., after taking a fault) doesn't work as a
    fault cannot occur during the hypercall.

    This would work on some systems depending on what else was using vmalloc.

    Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
    from alloc_vm_area()") and add a comment to explain why it's needed.
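
    A sketch of the restored behaviour (error handling elided; f is the
    empty per-pte callback used there to force PTE allocation):

    struct vm_struct *area = get_vm_area(size, VM_IOREMAP);

    /* populate init_mm's page tables for the new area */
    apply_to_page_range(&init_mm, (unsigned long)area->addr,
                        size, f, NULL);

    /*
     * A hypercall touching this area cannot take the lazy vmalloc
     * fault, so sync the new PTEs into all page tables now.
     */
    vmalloc_sync_all();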

    Signed-off-by: David Vrabel
    Cc: Jeremy Fitzhardinge
    Cc: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Keir Fraser
    Cc: [3.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Vrabel
     
  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Without swap, anonymous pages are not scanned. As such, they should not
    count when considering force-scanning a small target if there is no swap.

    Otherwise, targets are not force-scanned even when their effective scan
    number is zero and the other conditions--kswapd/memcg--apply.

    This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
    targets").

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
    yet it is referenced for per-node vmstat with CONFIG_NUMA:

    drivers/built-in.o: In function `node_read_vmstat':
    node.c:(.text+0x1106df): undefined reference to `vmstat_text'

    Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
    vmstats").

    Define the array for CONFIG_NUMA as well.
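
    The guard around the definition then needs to cover all three
    users; a sketch:

    /* vmstat_text is needed by procfs, sysfs, and the NUMA node code */
    #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) || \
        defined(CONFIG_NUMA)
    const char * const vmstat_text[] = {
        /* ... */
    };
    #endif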

    [akpm@linux-foundation.org: remove unneeded ifdefs]
    Signed-off-by: David Rientjes
    Reported-by: Cong Wang
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When compiling mm/mempolicy.c with strict user copy checks, the
    following warning is shown:

    In file included from arch/x86/include/asm/uaccess.h:572,
    from include/linux/uaccess.h:5,
    from include/linux/highmem.h:7,
    from include/linux/pagemap.h:10,
    from include/linux/mempolicy.h:70,
    from mm/mempolicy.c:68:
    In function `copy_from_user',
    inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
    arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
    LD mm/built-in.o

    Fix this by passing the correct buffer size.
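
    A sketch of the kind of fix, clamping the copy to the size of the
    on-stack bitmap (nm and alloc_size as in compat_sys_get_mempolicy):

    DECLARE_BITMAP(bm, MAX_NUMNODES);
    unsigned long copy_size;

    /* never copy more than the local bitmap can hold */
    copy_size = min_t(unsigned long, sizeof(bm), alloc_size);
    err = copy_from_user(bm, nm, copy_size);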

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't
    really fix the mbind vma merge problem, due to a wrong pgoff value
    being passed to vma_merge(), which made vma_merge() always return
    NULL.

    Before the patch applied, we are getting a result like:

    addr = 0x7fa58f00c000
    [snip]
    7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
    7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
    7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0

    here at 7fa58f00c000->7fa58f00f000 we get 3 VMAs which were
    expected to be merged, as described in commit 9d8cebd.

    Re-testing the patched kernel with the reproducer provided in commit
    9d8cebd, we get the correct result:

    addr = 0x7ffa5aaa2000
    [snip]
    7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
    7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack]
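
    The fix, roughly, is to compute the pgoff for the candidate range
    before calling vma_merge() (a sketch of mbind_range()'s loop):

    /* pgoff must correspond to vmstart, not to vma->vm_start */
    pgoff = vma->vm_pgoff +
            ((vmstart - vma->vm_start) >> PAGE_SHIFT);
    prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
                     vma->anon_vma, vma->vm_file, pgoff, new_pol);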

    Signed-off-by: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Caspar Zhang
     

14 Sep, 2011

1 commit


03 Sep, 2011

2 commits


27 Aug, 2011

2 commits

  • Whether a slab is added to the head or the tail of the partial list
    is performance-sensitive, so use DEACTIVATE_TO_TAIL and
    DEACTIVATE_TO_HEAD explicitly to document the intent and avoid
    getting it wrong.
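
    A sketch of how the flag steers placement in add_partial():

    static inline void add_partial(struct kmem_cache_node *n,
                                   struct page *page, int tail)
    {
        n->nr_partial++;
        if (tail == DEACTIVATE_TO_TAIL)
            list_add_tail(&page->lru, &n->partial);
        else
            list_add(&page->lru, &n->partial);
    }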

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • When a slab has just one free object, adding it to the head of the
    partial list doesn't make sense, and it can cause lock contention.
    For example:
    1. a CPU takes the slab from the partial list
    2. it fetches an object
    3. it switches to another slab
    4. it frees an object, and the slab is added to the partial list
       again
    In this way n->list_lock will be heavily contended. In fact, Alex
    hit a hackbench regression: 3.1-rc1 performance dropped about 70%
    against 3.0. This patch fixes it.

    Acked-by: Christoph Lameter
    Reported-by: Alex Shi
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     

26 Aug, 2011

3 commits

  • Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter") tried to oom lock the hierarchy and roll back upon
    encountering an already locked memcg.

    The code is confused when it comes to detecting a locked memcg, though,
    so it would fail and rollback after locking one memcg and encountering
    an unlocked second one.

    The result is that oom-locking hierarchies fails unconditionally and
    that every oom killer invocation simply goes to sleep on the oom
    waitqueue forever. The tasks practically hang forever without anyone
    intervening, possibly holding locks that trip up unrelated tasks, too.
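
    Roughly what the fixed locking looks like, assuming the
    for_each_mem_cgroup_tree_cond() style iterator used by that code:

    struct mem_cgroup *iter, *failed = NULL;
    bool cond = true;

    for_each_mem_cgroup_tree_cond(iter, mem, cond) {
        if (iter->oom_lock) {
            /* part of the hierarchy is already locked: remember
             * where we stopped and bail out */
            failed = iter;
            cond = false;
        } else
            iter->oom_lock = true;
    }
    if (!failed)
        return true;

    /* roll back only the memcgs we actually locked */
    cond = true;
    for_each_mem_cgroup_tree_cond(iter, mem, cond) {
        if (iter == failed) {
            cond = false;
            continue;
        }
        iter->oom_lock = false;
    }
    return false;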

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • ZONE_CONGESTED is only cleared in kswapd, but pages can be freed by
    any task. It's possible ZONE_CONGESTED isn't cleared in some cases:

    1. the zone is already balanced on entering balance_pgdat() for
    order-0, because concurrent tasks freed memory. In this case, the
    later check will skip the zone as balanced, so the flag isn't
    cleared.

    2. a high-order balance falls back to order-0. Quote from Mel: at
    the end of balance_pgdat(), kswapd uses the following logic;

    If reclaiming at high order {
        for each zone {
            if all_unreclaimable
                skip
            if watermark is not met
                order = 0
                loop again

            /* watermark is met */
            clear congested
        }
    }

    i.e. it clears ZONE_CONGESTED if the zone is balanced; if not, it
    restarts balancing at order-0. However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED, as
    that only happens after a zone is shrunk. This can mean that
    wait_iff_congested() stalls unnecessarily.

    This patch makes kswapd clear ZONE_CONGESTED during its initial
    highmem->dma scan for zones that are already balanced.
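
    A sketch of the added clearing in that initial scan (helper names
    from that era's balance_pgdat()):

    if (!zone_watermark_ok_safe(zone, order,
                                high_wmark_pages(zone), 0, 0)) {
        end_zone = i;
        break;
    } else {
        /* the zone is already balanced, so no reclaim will run
         * against it: clear the congestion flag here */
        zone_clear_flag(zone, ZONE_CONGESTED);
    }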

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I get the below warning:

    BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
    caller is native_sched_clock+0x37/0x6e
    Pid: 746, comm: bash Tainted: G W 3.0.0+ #254
    Call Trace:
    [] debug_smp_processor_id+0xc2/0xdc
    [] native_sched_clock+0x37/0x6e
    [] try_to_free_mem_cgroup_pages+0x7d/0x270
    [] mem_cgroup_force_empty+0x24b/0x27a
    [] ? sys_close+0x38/0x138
    [] ? sys_close+0x38/0x138
    [] mem_cgroup_force_empty_write+0x17/0x19
    [] cgroup_file_write+0xa8/0xba
    [] vfs_write+0xb3/0x138
    [] sys_write+0x4a/0x71
    [] ? sys_close+0xf0/0x138
    [] system_call_fastpath+0x16/0x1b

    sched_clock() can't be used with preemption enabled, and we don't
    need a fast way to read the clock here, so let's use the ktime API.
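
    A minimal sketch of the substitution:

    /* before: sched_clock(), which is per-cpu and preempt-unsafe */
    ktime_t start = ktime_get();        /* preempt-safe */
    /* ... do the reclaim work ... */
    s64 elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), start));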

    Signed-off-by: Shaohua Li
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li