17 Jan, 2011

2 commits


15 Jan, 2011

1 commit

  • LIO target is a full featured in-kernel target framework with the
    following feature set:

    High-performance, non-blocking, multithreaded architecture with SIMD
    support.

    Advanced SCSI feature set:

    * Persistent Reservations (PRs)
    * Asymmetric Logical Unit Assignment (ALUA)
    * Protocol and intra-nexus multiplexing, load-balancing and failover (MC/S)
    * Full Error Recovery (ERL=0,1,2)
    * Active/active task migration and session continuation (ERL=2)
    * Thin LUN provisioning (UNMAP and WRITE_SAMExx)

    Multiprotocol target plugins

    Storage media independence:

    * Virtualization of all storage media; transparent mapping of IO to LUNs
    * No hard limits on number of LUNs per Target; maximum LUN size ~750 TB
    * Backstores: SATA, SAS, SCSI, BluRay, DVD, FLASH, USB, ramdisk, etc.

    Standards compliance:

    * Full compliance with IETF (RFC 3720)
    * Full implementation of SPC-4 PRs and ALUA

    Significant code cleanups done by Christoph Hellwig.

    [jejb: fix up for new block bdev exclusive interface. Minor fixes from
    Randy Dunlap and Dan Carpenter.]
    Signed-off-by: Nicholas A. Bellinger
    Signed-off-by: James Bottomley

    Nicholas Bellinger
     

14 Jan, 2011

37 commits

  • SDEV_MEDIA_CHANGE event was first added by commit a341cd0f (SCSI: add
    asynchronous event notification API) for SATA AN support and then
    extended to cover generic media change events by commit 285e9670
    ([SCSI] sr,sd: send media state change modification events).

    This event was mapped to the block device in userland, with all properties
    stripped, to simulate a CHANGE event on the block device, which in turn was
    used to trigger further userspace action on media change.

    The recent addition of the disk event framework kept this event for
    backward compatibility, but it turns out to be unnecessary and causes
    erratic and inefficient behavior. The new disk event framework generates
    proper events on the block devices, and the compat events are mapped to
    the block device with all properties stripped, so the block device ends up
    generating multiple duplicate events for a single actual event.

    This patch removes the compat event generation from both sr and sd, as
    suggested by Kay Sievers. Both existing and newer versions of udev and the
    associated tools behave better with these events removed, as they have
    expected events on the block devices from the beginning.

    Signed-off-by: Tejun Heo
    Acked-by: Kay Sievers
    Signed-off-by: James Bottomley

    Tejun Heo
     
  • Replace sd_media_change() with sd_check_events().

    * Move media removed logic into set_media_not_present() and
    media_not_present() and set sdev->changed iff an existing media is
    removed or the device indicates UNIT_ATTENTION.

    * Make sd_check_events() set sdev->changed if previously missing
    media becomes present.

    * Event is reported only if sdev->changed is set.

    This makes the media presence event be reported only if
    scsi_disk->media_present actually changed or the device indicated
    UNIT_ATTENTION. For backward compatibility, SDEV_EVT_MEDIA_CHANGE is
    generated each time sd_check_events() detects a media change event.
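
    A minimal sketch of the check_events-style callback shape this conversion
    targets, for orientation only; the struct name and helper below are
    hypothetical, while DISK_EVENT_MEDIA_CHANGE and the callback signature come
    from the block layer's disk event framework:

      /* Polled by the block layer instead of the old media_changed() hook. */
      static unsigned int example_check_events(struct gendisk *disk,
                                               unsigned int clearing)
      {
              struct example_dev *dev = disk->private_data;  /* hypothetical */
              unsigned int events = 0;

              /* Probe the hardware, updating "changed" state as described above. */
              if (example_media_changed(dev))                /* hypothetical helper */
                      events |= DISK_EVENT_MEDIA_CHANGE;

              return events;  /* report only when a change was actually seen */
      }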

    [jejb: fix boot failure]
    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Tejun Heo
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
    ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
    ACPI: fix resource check message
    ACPI / Battery: Update information on info notification and resume
    ACPI: Drop device flag wake_capable
    ACPI: Always check if _PRW is present before trying to evaluate it
    ACPI / PM: Check status of power resources under mutexes
    ACPI / PM: Rename acpi_power_off_device()
    ACPI / PM: Drop acpi_power_nocheck
    ACPI / PM: Drop acpi_bus_get_power()
    Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
    ACPI / Fan: Rework the handling of power resources
    ACPI / PM: Register power resource devices as soon as they are needed
    ACPI / PM: Register acpi_power_driver early
    ACPI / PM: Add function for updating device power state consistently
    ACPI / PM: Add function for device power state initialization
    ACPI / PM: Introduce __acpi_bus_get_power()
    ACPI / PM: Introduce function for refcounting device power resources
    ACPI / PM: Add functions for manipulating lists of power resources
    ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
    ACPICA: Update version to 20101209
    ...

    Linus Torvalds
     
  • * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
    cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer
    intel_idle: open broadcast clock event
    cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific
    cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle
    cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions
    SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW
    cpuidle: delete NOP CPUIDLE_FLAG_POLL
    ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs
    cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL
    ACPI, intel_idle: Cleanup idle= internal variables
    cpuidle: Make cpuidle_enable_device() call poll_idle_init()
    intel_idle: update Sandy Bridge core C-state residency targets

    Linus Torvalds
     
  • * 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6:
    SFI: use ioremap_cache() instead of ioremap()

    Linus Torvalds
     
  • …t/npiggin/linux-npiggin

    * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
    fs: fix do_last error case when need_reval_dot
    nfs: add missing rcu-walk check
    fs: hlist UP debug fixup
    fs: fix dropping of rcu-walk from force_reval_path
    fs: force_reval_path drop rcu-walk before d_invalidate
    fs: small rcu-walk documentation fixes

    Fixed up trivial conflicts in Documentation/filesystems/porting

    Linus Torvalds
     
  • When open(2) without O_DIRECTORY opens an existing dir, it should return
    EISDIR. In do_last(), the variable 'error' is initialized to EISDIR, but it
    is changed by d_revalidate(), which returns any positive value to mean
    'the target dir is valid.'

    We should keep and return the initialized 'error' in this case.
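
    A hedged sketch of the pattern being fixed, not the literal fs/namei.c
    code; the revalidate callback here is a hypothetical stand-in:

      /* Illustrative only: a positive "valid" status must not replace the
       * prepared -EISDIR result. */
      static int open_existing_dir_example(struct dentry *dentry,
                                           int (*revalidate)(struct dentry *))
      {
              int error = -EISDIR;             /* prepared return value */
              int status = revalidate(dentry); /* positive means "dentry is valid" */

              if (status < 0)
                      return status;           /* real failure: propagate it */

              /* Keep the prepared error rather than the positive status. */
              return error;
      }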

    Signed-off-by: Nick Piggin

    J. R. Okajima
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/p2m: Fix module linking error.
    xen p2m: clear the old pte when adding a page to m2p_override
    xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
    xen: introduce gnttab_map_refs and gnttab_unmap_refs
    xen p2m: transparently change the p2m mappings in the m2p override
    xen/gntdev: Fix circular locking dependency
    xen/gntdev: stop using "token" argument
    xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
    xen: add m2p override mechanism
    xen: move p2m handling to separate file
    xen/gntdev: add VM_PFNMAP to vma
    xen/gntdev: allow usermode to map granted pages
    xen: define gnttab_set_map_op/unmap_op

    Fix up trivial conflict in drivers/xen/Kconfig

    Linus Torvalds
     
  • * 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
    xen: rename platform-pci module to xen-platform-pci.
    xen-platform: use PCI interfaces to request IO and MEM resources.

    Linus Torvalds
     
  • Po-Yu Chuang noticed that hlist_bl_set_first could
    crash on a UP system when LIST_BL_LOCKMASK is 0, because

    LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));

    always evaluates to true.

    Fix the expression, and also avoid a dependency between bit spinlock
    implementation and list bl code (list code shouldn't know anything
    except that bit 0 is set when adding and removing elements). Eventually
    if a good use case comes up, we might use this list to store 1 or more
    arbitrary bits of data, so it really shouldn't be tied to locking either,
    but for now they are helpful for debugging.
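
    A sketch of why the assertion misfires on UP and the corrected shape of the
    check, assuming the LIST_BL_LOCKMASK semantics described above:

      /* On UP (no debug spinlocks) LIST_BL_LOCKMASK is 0, so
       *     (unsigned long)h->first & LIST_BL_LOCKMASK  ==  0
       * and !0 is always true: the old assertion fires unconditionally.
       * Only demand the low bit when a lock bit actually exists:
       */
      LIST_BL_BUG_ON(((unsigned long)h->first & LIST_BL_LOCKMASK) !=
                     LIST_BL_LOCKMASK);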

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • As J. R. Okajima noted, force_reval_path passes in the same dentry to
    d_revalidate as the one in the nameidata structure (other callers pass in a
    child), so the locking breaks. This can oops with a chrooted nfs mount, for
    example. Similarly there can be other problems with revalidating a dentry
    which is already in nameidata of the path walk.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
    d_revalidate can return in rcu-walk mode even when it returns 0. We can't
    just call any old dcache function on an rcu-walk dentry (the dentry is
    unstable, so even though d_lock can safely be taken, the result may no
    longer be what we expect -- careful re-checks would be required). So just
    drop rcu in this case.

    (I missed this conversion when switching to the rcu-walk convention that Linus
    suggested)

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are trying to migrate a shmem swapcache, its page->mapping is
    NULL from the beginning, so the check would be invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even if it
    hasn't, so "newpage" would be freed while it's not uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.
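
    A hedged sketch of the resulting interface change; the parameter name is
    illustrative, the point being that the caller now passes the migration
    result instead of it being inferred from oldpage->mapping:

      /* After: success/failure is passed in explicitly. */
      void mem_cgroup_end_migration(struct mem_cgroup *memcg,
                                    struct page *oldpage,
                                    struct page *newpage,
                                    bool migration_ok);  /* e.g. rc == 0 from migration */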

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
    followed by memset() to zero the memory. This can be more efficiently
    achieved by using kzalloc() and vzalloc(). There's also one situation
    where we can use kzalloc_node() - this is what's new in this version of
    the patch.
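
    A minimal sketch of the substitution, with the size threshold and function
    name illustrative rather than copied from mem_cgroup_alloc():

      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static void *alloc_zeroed_example(size_t size)
      {
              if (size <= PAGE_SIZE)
                      return kzalloc(size, GFP_KERNEL); /* was kmalloc() + memset(0) */
              return vzalloc(size);                     /* was vmalloc() + memset(0) */
      }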

    Signed-off-by: Jesper Juhl
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Commit b1dd693e ("memcg: avoid deadlock between move charge and
    try_charge()") can cause another deadlock on mmap_sem during task migration
    if cpuset and memcg are mounted onto the same mount point.

    After the commit, cgroup_attach_task() has a sequence like:

    cgroup_attach_task()
      ss->can_attach()
        cpuset_can_attach()
        mem_cgroup_can_attach()
          down_read(&mmap_sem)       (1)
      ss->attach()
        cpuset_attach()
          mpol_rebind_mm()
            down_write(&mmap_sem)    (2)
            up_write(&mmap_sem)
          cpuset_migrate_mm()
            do_migrate_pages()
              down_read(&mmap_sem)
              up_read(&mmap_sem)
        mem_cgroup_move_task()
          mem_cgroup_clear_mc()
            up_read(&mmap_sem)

    We can cause a deadlock at (2) because we've already acquired the mmap_sem
    at (1).

    But the commit itself is necessary to fix deadlocks which have existed
    before the commit like:

    Ex.1)
              move charge               |        try charge
    ------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot acquire the lock    |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

    Ex.2)
              move charge               |        try charge
    ------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot acquire the lock      |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

    This patch fixes all of these problems by:
    1. reverting the commit.
    2. to fix Ex.1, setting mc.moving_task only after mem_cgroup_count_precharge()
       has released the mmap_sem.
    3. to fix Ex.2, using down_read_trylock() instead of down_read() in
       mem_cgroup_move_charge() and, if the lock cannot be acquired, cancelling
       all extra charges, waking up all waiters, and retrying the trylock.
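
    A hedged sketch of the trylock-and-retry shape from fix 3 (control flow
    condensed; the cancel-and-wake helper is a hypothetical stand-in for the
    behaviour described above, not the in-tree function name):

      static void move_charge_example(struct mm_struct *mm)
      {
      retry:
              if (!down_read_trylock(&mm->mmap_sem)) {
                      /* Cancel extra charges and wake anyone sleeping in
                       * __mem_cgroup_try_charge(), then try again. */
                      cancel_precharge_and_wake_waiters();  /* hypothetical */
                      cond_resched();
                      goto retry;
              }
              /* ... walk the address space and move charges ... */
              up_read(&mm->mmap_sem);
      }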

    Signed-off-by: Daisuke Nishimura
    Reported-by: Ben Blum
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Paul Menage
    Cc: Hiroyuki Kamezawa
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Adding the number of swap pages to the byte limit of a memory control
    group makes no sense. Convert the pages to bytes before adding them.

    The only user of this code is the OOM killer, and the way it is used means
    that the error results in a higher OOM badness value. Since the cgroup
    limit is the same for all tasks in the cgroup, the error should have no
    practical impact at the moment.

    But let's not wait for future or changing users to trip over it.
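
    The fix amounts to shifting the swap page count by PAGE_SHIFT before the
    addition; a hedged sketch, with the limit variable name illustrative:

      /* memsw_limit_bytes stands in for the memcg's memory+swap limit. */
      static u64 oom_limit_example(u64 memsw_limit_bytes)
      {
              /* Before: pages were added directly to a byte count:
               *     return memsw_limit_bytes + total_swap_pages;
               * After: convert swap pages to bytes first. */
              return memsw_limit_bytes + ((u64)total_swap_pages << PAGE_SHIFT);
      }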

    Signed-off-by: Johannes Weiner
    Cc: Greg Thelen
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
    accounting and migration code. This reworks the locking scheme of
    _update_stat() and _move_account() by adding a new lock bit, PCG_MOVE_LOCK,
    which is always taken with IRQs disabled.

    1. If pages are being migrated from a memcg, then updates to that
    memcg's page statistics are protected by grabbing PCG_MOVE_LOCK using
    move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
    accounting will be updating memcg page accounting (specifically: num
    writeback pages) from IRQ context (softirq). Avoid a deadlocking
    nested spin lock attempt by disabling irq on the local processor when
    grabbing the PCG_MOVE_LOCK.

    2. The lock for update_page_stat is used only to avoid races with
    move_account(). So, IRQ awareness of lock_page_cgroup() itself is not
    a problem. The problem is between mem_cgroup_update_page_stat() and
    mem_cgroup_move_account_page().
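
    A hedged sketch of the lock pair this adds (close to, but not guaranteed to
    be identical with, the helpers introduced by the patch):

      /* PCG_MOVE_LOCK is a new bit in page_cgroup->flags; it is always taken
       * with local IRQs off so a softirq statistics update cannot deadlock
       * against a move in progress on the same CPU. */
      static inline void move_lock_page_cgroup(struct page_cgroup *pc,
                                               unsigned long *flags)
      {
              local_irq_save(*flags);
              bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
      }

      static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
                                                 unsigned long *flags)
      {
              bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
              local_irq_restore(*flags);
      }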

    Trade-off:
    * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) would have some impact on performance, and I think
    it's bad to disable IRQs when it's not necessary.
    * Adding a new lock makes move_account() slower. Scores are
    below.

    Performance Impact: moving an 8G anon process.

    Before:
    real 0m0.792s
    user 0m0.000s
    sys 0m0.780s

    After:
    real 0m0.854s
    user 0m0.000s
    sys 0m0.842s

    This result is not great, but planned optimization patches can reduce
    the impact.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Thelen
    Reviewed-by: Minchan Kim
    Acked-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Balbir Singh
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace usage of the mem_cgroup_update_file_mapped() memcg
    statistic update routine with two new routines:
    * mem_cgroup_inc_page_stat()
    * mem_cgroup_dec_page_stat()

    As before, only the file_mapped statistic is managed. However, these more
    general interfaces allow for new statistics to be more easily added. New
    statistics are added with memcg dirty page accounting.
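
    A hedged sketch of how a caller changes; the statistic enum value follows
    the naming in this series, but exact identifiers should be treated as
    illustrative:

      /* Illustrative caller (rmap-style accounting), after the conversion. */
      static void account_file_mapped_example(struct page *page, bool mapped)
      {
              if (mapped)
                      mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
              else
                      mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
      }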

    Signed-off-by: Greg Thelen
    Signed-off-by: Andrea Righi
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Document cgroup dirty memory interfaces and statistics.

    [akpm@linux-foundation.org: fix use_hierarchy description]
    Signed-off-by: Andrea Righi
    Signed-off-by: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • This patchset provides the ability for each cgroup to have independent
    dirty page limits.

    Limiting dirty memory is like fixing the max amount of dirty (hard to
    reclaim) page cache used by a cgroup. So, in case of multiple cgroup
    writers, they will not be able to consume more than their designated share
    of dirty pages and will be forced to perform write-out if they cross that
    limit.

    The patches are based on a series proposed by Andrea Righi in Mar 2010.

    Overview:

    - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
    unstable.

    - Extend mem_cgroup to record the total number of pages in each of the
    interesting dirty states (dirty, writeback, unstable_nfs).

    - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_*
    limits to mem_cgroup. The mem_cgroup dirty parameters are accessible
    via cgroupfs control files.

    - Consider both system and per-memcg dirty limits in page writeback when
    deciding to queue background writeback or block for foreground writeback.

    Known shortcomings:

    - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
    write back dirty inodes. Bdi writeback considers inodes from any cgroup, not
    just inodes contributing dirty pages to the cgroup exceeding its limit.

    - When memory.use_hierarchy is set, dirty limits are disabled. This is an
    implementation detail. An enhanced implementation would need to check the
    chain of parents to ensure that no dirty limit is exceeded.

    Performance data:
    - A page fault microbenchmark workload was used to measure performance, which
    can be called in read or write mode:
        f = open(foo.$cpu)
        truncate(f, 4096)
        alarm(60)
        while (1) {
            p = mmap(f, 4096)
            if (write)
                *p = 1
            else
                x = *p
            munmap(p)
        }

    - The workload was called for several points in the patch series in different
    modes:
    - s_read is a single threaded reader
    - s_write is a single threaded writer
    - p_read is a 16 thread reader, each operating on a different file
    - p_write is a 16 thread writer, each operating on a different file

    - Measurements were collected on a 16 core non-numa system using "perf stat
    --repeat 3". The -a option was used for parallel (p_*) runs.

    - All numbers are page fault rate (M/sec). Higher is better.

    - To compare the performance against a kernel without memcg, compare the
    first and last rows; neither has memcg configured. The first row does not
    include any of these memcg patches.

    - To compare the performance of using memcg dirty limits, compare the
    baseline (2nd row, titled "w/ memcg") with the code and memcg enabled
    (2nd to last row, titled "all patches").

                           root_cgroup                        child_cgroup
                     s_read s_write p_read p_write   s_read s_write p_read p_write
    mmotm w/o memcg  0.428  0.390   0.429  0.388
    mmotm w/ memcg   0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
    all patches      0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
    all patches      0.431  0.402   0.427  0.395
      w/o memcg

    This patch:

    Add additional flags to page_cgroup to track dirty pages within a
    mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Righi
    Signed-off-by: Greg Thelen
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We could batch activate_page() calls to reduce the lock
    contention. The batched pages will be added to the zone list when the pool
    is full or when page reclaim is trying to drain them.
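
    A hedged sketch of the batching shape (the pagevec helpers are real; the
    per-cpu pool, the callback name and the exact checks are condensed from the
    description above rather than copied from the patch):

      /* One small per-CPU buffer; flushed to the zone LRU in a single
       * zone->lru_lock acquisition when it fills up (or when reclaim drains it). */
      static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);

      void activate_page_batched_example(struct page *page)
      {
              if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
                      struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

                      get_page(page);
                      if (!pagevec_add(pvec, page))   /* buffer full: drain it */
                              pagevec_lru_move_fn(pvec, __activate_page, NULL);
                      put_cpu_var(activate_page_pvecs);
              }
      }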

    For example, on a 4 socket, 64 CPU system, create a sparse file and 64
    processes that share a mapping of the file. Each process reads the whole
    file and then exits. The process exit does unmap_vmas() and causes a lot of
    activate_page() calls. In such a workload, we saw about a 58% total time
    reduction with the patch below. Other workloads with a lot of
    activate_page() calls benefit a lot too.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. The others are basically run-to-run
    variation, since each run differs slightly.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Clean up and remove duplicate code. The next patch will also use the
    pagevec_lru_move_fn() introduced here.

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
    text data bss dec hex filename
    39884 1241317 18808 1300009 13d629 mm/page_alloc.o (before)
    39838 1241317 18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.
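
    A hedged sketch of the locking shape described, with the actual unmap/move
    steps elided (the helpers named here are the ones mentioned above):

      static struct anon_vma *grab_anon_vma_example(struct page *page)
      {
              struct anon_vma *anon_vma = NULL;

              if (PageAnon(page)) {
                      /* page_lock_anon_vma() does the hard thinking for us. */
                      anon_vma = page_lock_anon_vma(page);
                      if (anon_vma) {
                              get_anon_vma(anon_vma); /* keep it across the move */
                              page_unlock_anon_vma(anon_vma);
                      }
              }
              return anon_vma;  /* caller pairs with drop_anon_vma() when done */
      }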

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When parsing changes to the huge page pool sizes made from userspace via
    the sysfs interface, bogus input values are being covered up by
    nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
    when strict_strtoul returns an error. This can cause an infinite loop in
    the nr_hugepages_store code. This patch changes the return value for
    these functions to -EINVAL when strict_strtoul returns an error.
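
    A hedged sketch of the store-function fix (names and surrounding code
    condensed from the changelog, not copied from mm/hugetlb.c):

      static ssize_t nr_hugepages_store_example(const char *buf, size_t len)
      {
              unsigned long count;
              int err;

              err = strict_strtoul(buf, 10, &count);
              if (err)
                      return -EINVAL;  /* was: return 0, which made sysfs loop forever */

              /* ... adjust the pool to 'count' pages ... */
              return len;
      }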

    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Eric B Munson
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Huge pages with order >= MAX_ORDER must be allocated at boot via the
    kernel command line, they cannot be allocated or freed once the kernel is
    up and running. Currently we allow values to be written to the sysfs and
    sysctl files controlling pool size for these huge page sizes. This patch
    makes the store functions for nr_hugepages and nr_overcommit_hugepages
    return -EINVAL when the pool for a page size >= MAX_ORDER is changed.
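
    A hedged sketch of the guard the store functions gain (field name per the
    hugetlb code of that era; placement illustrative):

      static bool pool_resize_allowed_example(struct hstate *h)
      {
              /* Gigantic pages (order >= MAX_ORDER) are boot-time only:
               * their pool cannot be resized via sysfs/sysctl. */
              return h->order < MAX_ORDER;
      }

    The store functions return -EINVAL when this check fails.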

    [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
    [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • proc_doulongvec_minmax may fail if the given buffer doesn't represent a
    valid number. If we provide something invalid we will initialize the
    resulting value (nr_overcommit_huge_pages in this case) to a random value
    from the stack.

    The issue was introduced by a3d0c6aa when the default handler was
    replaced by a helper function that does not check the return value.

    Reproducer:
    echo "" > /proc/sys/vm/nr_overcommit_hugepages

    [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
    Signed-off-by: Michal Hocko
    Cc: CAI Qian
    Cc: Nishanth Aravamudan
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The sync_inodes_sb() function does not have a return value. Remove the
    outdated documentation comment.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Hajnoczi
     
  • As it stands this code will degenerate into a busy-wait if the calling task
    has signal_pending().

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dma_pool_free() scans for the page to free in the pool list while holding
    the pool lock, then releases the lock only to acquire it again immediately.
    Modify the code to take the lock only once.

    This does some additional loops and computations with the lock held if
    memory debugging is activated. If it is not activated, the only new
    operations under this lock are one if and one subtraction.

    Signed-off-by: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • The previous approach for calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly as well as enhancing readability.
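
    A short worked check of why the two expressions agree:

      /* buddy_idx = page_idx ^ (1 << order) differs from page_idx only in bit
       * 'order', so ANDing the two clears exactly that bit:
       *
       *     page_idx & buddy_idx  ==  page_idx & ~(1 << order)
       *
       * Worked example with order = 2, page_idx = 13 (0b1101):
       *     buddy_idx            = 13 ^ 4  = 9  (0b1001)
       *     page_idx & buddy_idx = 13 & 9  = 9  (0b1001)
       *     page_idx & ~(1 << 2) = 13 & ~4 = 9  (0b1001)
       */
      unsigned long combined_idx = page_idx & buddy_idx;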

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-unintialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
    be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in the sys_brk() implementation
    is wrong, as it solely takes the COMPAT_BRK setting into account, assuming
    that the brk start is not randomized. But that might not be the case if the
    randomize_va_space sysctl has been set to '2' at the time the binary was
    loaded from disk.

    In such a case, the check has to be done in the same way as in the
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
    but mm->end_data instead, as that's where legacy applications expect the
    brk section to start (i.e. immediately after the last global variable).
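
    A hedged sketch of the corrected lower-bound selection in sys_brk(); the
    "was the brk start randomized" condition is written out descriptively here,
    and the in-tree test may differ:

      unsigned long min_brk;

      #ifdef CONFIG_COMPAT_BRK
              /* CONFIG_COMPAT_BRK can still be overridden at run time by
               * randomize_va_space = 2, in which case the brk start was
               * randomized and must be honoured just as in the
               * !CONFIG_COMPAT_BRK case. */
              if (brk_start_was_randomized)           /* illustrative condition */
                      min_brk = mm->start_brk;
              else
                      min_brk = mm->end_data;         /* brk follows the data segment */
      #else
              min_brk = mm->start_brk;
      #endif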

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina