18 Nov, 2011

2 commits

  • Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-gntalloc: signedness bug in add_grefs()
    xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
    xen-gntdev: integer overflow in gntdev_alloc_map()
    xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.
    xen/balloon: Avoid OOM when requesting highmem
    xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI
    xen: map foreign pages for shared rings by updating the PTEs directly

    Linus Torvalds
     
  • Merge branch 'for-linus' of git://git.kernel.dk/linux-block

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add missed trace_block_plug
    paride: fix potential information leak in pg_read()
    bio: change some signed vars to unsigned
    block: avoid unnecessary plug list flush
    cciss: auto engage SCSI mid layer at driver load time
    loop: cleanup set_status interface
    include/linux/bio.h: use a static inline function for bio_integrity_clone()
    loop: prevent information leak after failed read
    block: Always check length of all iov entries in blk_rq_map_user_iov()
    The Windows driver .inf disables ASPM on all cciss devices. Do the same.
    backing-dev: ensure wakeup_timer is deleted
    block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"

    Linus Torvalds
     

17 Nov, 2011

3 commits

  • They are not used any more.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
    The sleep-based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
    for each dirtied 4KB page, which means it cannot throttle a task below
    4KB/200ms = 20KB/s. So when more than 512 dd tasks are writing to a
    10MB/s USB stick, its bdi dirty pages can grow out of control.

    Even if we could increase MAX_PAUSE, the minimal rate
    (task_ratelimit = 1) still means a floor of 4KB/s.

    They can eventually be safeguarded by the global limit check
    (nr_dirty < dirty_thresh). However, if someone is also writing to an
    HDD at the same time, they will see poor HDD write performance.

    We at least want to maintain good write performance for the other
    devices when one device is attacked by some "massively parallel"
    workload, suffers from slow write bandwidth, or somehow gets stalled
    by some error condition (e.g. an NFS server not responding).

    For a stalled device, we need to completely block its dirtiers, too,
    before its bdi dirty pages grow all the way up to the global limit and
    leave no space for the other functional devices.

    So change the loop exit condition to

        /*
         * Always enforce global dirty limit; also enforce bdi dirty limit
         * if the normal max_pause sleeps cannot keep things under control.
         */
        if (nr_dirty < dirty_thresh &&
            (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
                break;

    which can be further simplified to

        if (task_ratelimit)
                break;

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

16 Nov, 2011

3 commits

    There is no reason why a task in balance_dirty_pages() shouldn't be
    killable, and it helps in recovering from some error conditions (like
    when a filesystem goes into an error state and cannot accept writeback
    anymore, but we still want to be able to kill the processes using it
    so that it can be unmounted). A sketch of the check follows below.

    There will be follow-up patches to further abort generic_perform_write()
    and other filesystem write loops, to avoid a large write + SIGKILL
    combination exceeding the dirty limit and possibly causing a strange OOM.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
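
    A minimal sketch of the change described above (placement within the
    throttle loop is an assumption, not the literal diff):

        /* in balance_dirty_pages(), around the throttle sleep */
        __set_current_state(TASK_KILLABLE);
        io_schedule_timeout(pause);

        /* let a SIGKILLed task exit instead of keeping it throttled */
        if (fatal_signal_pending(current))
                break;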
     
  • If we fail to prepare an anon_vma, the {new, old}_page should be released,
    or they will leak.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    Commit c9f01245 ("oom: remove oom_disable_count") removed the
    oom_disable_count counter, which had been used to break out of
    oom_badness() early so that a task with oom_score_adj set to
    OOM_SCORE_ADJ_MIN (OOM disabled) could never be selected.

    Now that the counter is gone, we always go through the heuristic
    calculation and always return a positive value. This means we can end
    up killing a task with OOM disabled, because it is indistinguishable
    from a regular task using 1% of memory (or a CAP_SYS_ADMIN task using
    3%), or from a task with oom_score_adj set but OOM enabled.

    Let's break out early if the task should have OOM disabled, as
    sketched below.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
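
    A sketch of the early break-out (assuming the 3.2-era oom_badness()
    shape, where returning 0 means "never select this task"):

        /* in oom_badness(), before the memory-usage heuristics */
        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                return 0;       /* OOM disabled: never pick this task */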
     

11 Nov, 2011

1 commit

    bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the following dump captured after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destroy(); a sketch
    follows at the end of this entry.

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---

    Cc: stable@kernel.org
    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe

    Rabin Vincent
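
    A sketch of the fix (the surrounding bdi_destroy() body is elided;
    the wakeup_timer lives in the embedded bdi_writeback struct):

        void bdi_destroy(struct backing_dev_info *bdi)
        {
                /*
                 * __mark_inode_dirty() may have re-armed the timer after
                 * bdi_unregister()'s del_timer_sync(), so kill it again
                 * now that the bdi is going away for good.
                 */
                del_timer_sync(&bdi->wb.wakeup_timer);
                /* ... rest of bdi_destroy() ... */
        }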
     

07 Nov, 2011

3 commits

    In balance_dirty_pages(), task_ratelimit may be used uninitialized by
    the tracing hook, because its initialization is skipped by the
    goto pause.

    Fix it by moving the task_ratelimit assignment before the goto pause,
    as sketched below.

    Reported-by: Witold Baryluk
    Signed-off-by: Wu Fengguang

    Wu Fengguang
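
    Roughly, the fix hoists the assignment above the early exit (a
    sketch; the real function computes more intermediate values):

        /* compute task_ratelimit before any path that jumps to pause:,
         * so the tracepoint there never sees an uninitialized value */
        task_ratelimit = (u64)dirty_ratelimit *
                                pos_ratio >> RATELIMIT_CALC_SHIFT;
        pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
        if (unlikely(pause <= 0)) {
                pause = 1;      /* avoid a zero-length sleep */
                goto pause;
        }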
     
  • Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

    * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     
  • Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

    * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

05 Nov, 2011

1 commit

  • Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block

    * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

03 Nov, 2011

10 commits

  • Says Andrew:

    "60 patches. That's good enough for -rc1 I guess. I have quite a lot
    of detritus to be rechecked, work through maintainers, etc.

    - most of the remains of MM
    - rtc
    - various misc
    - cgroups
    - memcg
    - cpusets
    - procfs
    - ipc
    - rapidio
    - sysctl
    - pps
    - w1
    - drivers/misc
    - aio"

    * akpm: (60 commits)
    memcg: replace ss->id_lock with a rwlock
    aio: allocate kiocbs in batches
    drivers/misc/vmw_balloon.c: fix typo in code comment
    drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
    w1: disable irqs in critical section
    drivers/w1/w1_int.c: multiple masters used same init_name
    drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
    drivers/power/ds2780_battery.c: add a nolock function to w1 interface
    drivers/power/ds2780_battery.c: create central point for calling w1 interface
    w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
    pps gpio client: add missing dependency
    pps: new client driver using GPIO
    pps: default echo function
    include/linux/dma-mapping.h: add dma_zalloc_coherent()
    sysctl: make CONFIG_SYSCTL_SYSCALL default to n
    sysctl: add support for poll()
    RapidIO: documentation update
    drivers/net/rionet.c: fix ethernet address macros for LE platforms
    RapidIO: fix potential null deref in rio_setup_device()
    RapidIO: add mport driver for Tsi721 bridge
    ...

    Linus Torvalds
     
  • warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
    Various code in memcontrol.c calls this_cpu_read() on calculations
    combining two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the writes go to
    the correct places.

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

        charge:                          putback:
        SetPageCgroupUsed                SetPageLRU
        PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial;
    otherwise the charge may observe !PageLRU while the putback observes
    !PageCgroupUsed, and the page is not linked to the memcg LRU at all
    (an illustrative ordering sketch follows below).

    Global memory pressure may fix this by trying to isolate and put back
    the page for reclaim, where that putback would link it to the memcg LRU
    again. Without that, the memory cgroup is undeletable due to a charge
    whose physical page cannot be found and moved out.

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
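
    An illustrative sketch of the crucial ordering (the helper
    add_page_to_memcg_lru() is hypothetical; the pattern is "set my
    flag, then test the other side's flag", with a barrier so at least
    one side must observe the other's store):

        /* charge side */
        SetPageCgroupUsed(pc);
        smp_mb();
        if (PageLRU(page))
                add_page_to_memcg_lru(page);

        /* putback side */
        SetPageLRU(page);
        smp_mb();
        if (PageCgroupUsed(pc))
                add_page_to_memcg_lru(page);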
     
    Reclaim decides to skip scanning an active list when the corresponding
    inactive list is above a certain relative size, in order to leave the
    assumed working set alone while there are still enough reclaim
    candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is [un|partly]initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
    Before calling schedule_timeout(), the task state should be changed
    (the idiom is sketched below).

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
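
    The idiom in question, sketched:

        /* buggy: still TASK_RUNNING, so this does not actually sleep */
        schedule_timeout(HZ);

        /* fixed: declare how we intend to sleep first */
        __set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(HZ);

        /* or, equivalently: */
        schedule_timeout_interruptible(HZ);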
     
  • The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
    "struct mem_cgroup *memcg". Rename all mem variables to memcg in source
    file.

    Signed-off-by: Raghavendra K T
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     
    When the cgroup base was allocated with kmalloc(), it was necessary to
    annotate the variable with kmemleak_not_leak(). But it has recently
    been changed to be allocated with alloc_page(), which kmemleak does not
    track, and the stale annotation now causes a warning on boot up.

    I was triggering this output:

    allocated 8388608 bytes of page_cgroup
    please try 'cgroup_disable=memory' option if you don't want memory cgroups
    kmemleak: Trying to color unknown object at 0xf5840000 as Grey
    Pid: 0, comm: swapper Not tainted 3.0.0-test #12
    Call Trace:
    [] ? printk+0x1d/0x1f
    [] paint_ptr+0x4f/0x78
    [] kmemleak_not_leak+0x58/0x7d
    [] ? __rcu_read_unlock+0x9/0x7d
    [] kmemleak_init+0x19d/0x1e9
    [] start_kernel+0x346/0x3ec
    [] ? loglevel+0x18/0x18
    [] i386_start_kernel+0xaa/0xb0

    After a bit of debugging I tracked the object 0xf5840000 (and others)
    down to the cgroup code. With the change from kmalloc() to
    alloc_page(), the base allocation no longer goes through
    kmemleak_alloc(), which would have added the pointer to the
    object_tree_root; kmemleak_not_leak(), however, still adds it to the
    crt_early_log[] table. On kmemleak_init(), the entry is found in the
    early_log[] but not in the object_tree_root, and this error message is
    displayed.

    If alloc_page() fails, the code defaults back to vmalloc(), which still
    goes through kmemleak_alloc(), so the kmemleak_not_leak() call is still
    needed there. The solution is to call kmemleak_alloc() directly when
    alloc_page() succeeds, as sketched below.

    Reviewed-by: Michal Hocko
    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Jonathan Nieder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
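
    A sketch of the resulting allocation path (modeled on the
    mm/page_cgroup.c allocator of that era; names and flags here are
    assumptions):

        static void *alloc_page_cgroup(size_t size, int nid)
        {
                void *addr;

                addr = alloc_pages_exact_nid(nid, size,
                                             GFP_KERNEL | __GFP_NOWARN);
                if (addr) {
                        /* page allocations are invisible to kmemleak;
                         * register the object by hand */
                        kmemleak_alloc(addr, size, 1, GFP_KERNEL);
                        return addr;
                }
                /* the vmalloc fallback is already tracked by kmemleak */
                return vzalloc_node(size, nid);
        }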
     
    Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize
    with page_cache_get_speculative() during the speculative radix tree
    lookups that use get_page_unless_zero() in SMP, if the radix tree page
    is freed and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call
    get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by
    get_user_pages(). That wasn't entirely safe, because the two
    atomic_incs in get_page() weren't atomic as a pair. By contrast, other
    get_user_pages() users, like the secondary-MMU page faults that
    establish shadow pagetables, never call any superfluous get_page()
    after get_user_pages() returns. It's safer to make get_page()
    universally safe for tail pages and to use get_page_foll() within
    follow_page() (inside get_user_pages()). get_page_foll() can safely do
    the refcounting for tail pages without taking any locks, because it
    runs within PT-lock-protected critical sections (the PT lock for ptes
    and page_table_lock for pmd_trans_huge); a simplified sketch follows
    below.

    The standard get_page() as invoked by direct-io will instead take the
    compound_lock, but still only for tail pages. The direct-io paths are
    usually I/O bound, and the compound_lock is per-THP and therefore very
    fine-grained, so there's no risk of scalability issues with it. A
    simple direct-io benchmark with all the lockdep prove-locking and
    spinlock debugging infrastructure enabled shows identical performance
    and no overhead. So it's worth it. Ideally direct-io should stop
    calling get_page() on pages returned by get_user_pages(). The spinlock
    in get_page() is already optimized away for no-THP builds, and doing
    get_page() on tail pages returned by GUP is generally a rare operation,
    usually run only in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
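
    A simplified sketch of the scheme (the real get_page_foll() adds
    VM_BUG_ONs and shares code with the compound_lock path):

        static inline void get_page_foll(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        /* tail refs are accounted in ->_mapcount and
                         * folded into ->_count when the THP is split */
                        atomic_inc(&page->_mapcount);
                        /* the head pin keeps the whole compound alive */
                        atomic_inc(&page->first_page->_count);
                } else {
                        atomic_inc(&page->_count);
                }
        }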
     

01 Nov, 2011

15 commits

    Avoid false sharing of the vm_stat array (the alignment change is
    sketched below).

    This was found to adversely affect tmpfs I/O performance.

    Tests run on a 640 cpu UV system.

    With 120 threads doing parallel writes, each to different tmpfs mounts:
    No patch: ~300 MB/sec
    With vm_stat alignment: ~430 MB/sec

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
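
    The change amounts to cacheline-aligning the global array (a sketch
    of the declaration in mm/vmstat.c):

        /* give vm_stat its own cacheline(s) so hot counter updates from
         * many CPUs don't false-share with neighboring data */
        atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]
                        ____cacheline_aligned_in_smp;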
     
  • A process spent 30 minutes exiting, just munlocking the pages of a large
    anonymous area that had been alternately mprotected into page-sized vmas:
    for every single page there's an anon_vma walk through all the other
    little vmas to find the right one.

    A general fix for that would be a lot more complicated (use a prio_tree
    on anon_vma?), but there's one very simple thing we can do to speed up
    the common case: if a page to be munlocked is mapped only once, then it
    is our vma that it is mapped into, and there's no need whatever to walk
    through all the others (see the sketch below).

    Okay, there is a very remote race in munlock_vma_pages_range(): if,
    between its follow_page() and lock_page(), another process were to
    munlock the same page, page reclaim were to remove it from our vma, and
    another process were to mlock it again, we would find it with
    page_mapcount 1 yet still mlocked in another process. But never mind,
    that's much less likely than the down_read_trylock() failure which
    munlocking already tolerates (in try_to_unmap_one()): in due course
    page reclaim will discover the page and move it to unevictable instead.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Hugh Dickins
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
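
    A sketch of the fast path (condition simplified; the real check sits
    in the munlock path next to the try_to_munlock() call):

        /*
         * Mapped more than once: another vma may still hold the page
         * mlocked, so do the expensive rmap walk to find out. Mapped
         * exactly once: that mapping is ours, nothing else to check.
         */
        if (page_mapcount(page) > 1)
                try_to_munlock(page);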
     
    There are three calls to update_mmu_cache() in the file, and the one in
    collapse_huge_page() has a typo in its last parameter, which is
    corrected here based on the other two.

    Given how update_mmu_cache() is defined on x86, currently the only arch
    that implements THP, the change has no practical effect there, but it
    could save a minute or two of effort for archs that are likely to
    support THP in the future.

    Signed-off-by: Hillf Danton
    Cc: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    The THP copy-on-write handler falls back to regular-sized pages for a
    huge page replacement upon allocation failure, or if THP has been
    individually disabled in the target VMA. The loop responsible for
    copying page-sized chunks accidentally uses multiples of PAGE_SHIFT
    instead of PAGE_SIZE as the virtual address argument for
    copy_user_highpage(); the corrected loop is sketched below.

    Signed-off-by: Hillf Danton
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
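
    The corrected loop, sketched from the description (loop shape assumed
    from the usual THP fallback code):

        /* copy the huge page in PAGE_SIZE chunks for the fallback */
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                copy_user_highpage(pages[i], page + i,
                                   haddr + PAGE_SIZE * i, /* not PAGE_SHIFT */
                                   vma);
                cond_resched();
        }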
     
    MCL_FUTURE does not move pages between LRU lists, and draining the
    per-cpu LRU pagevecs is a costly activity. Avoid doing it
    unnecessarily; see the sketch below.

    Signed-off-by: Christoph Lameter
    Cc: David Rientjes
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
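
    A sketch of the change (placement in sys_mlockall() assumed):

        /* only MCL_CURRENT moves pages onto the LRU lists right away;
         * MCL_FUTURE alone has nothing to drain */
        if (flags & MCL_CURRENT)
                lru_add_drain_all();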
     
    If compaction can proceed, shrink_zones() stops doing any work, but its
    callers still call shrink_slab(), which raises the priority and
    potentially sleeps. This is unnecessary and wasteful, so this patch
    aborts direct reclaim/compaction entirely if compaction can proceed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Josh Boyer
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When suffering from memory fragmentation due to unfreeable pages, THP page
    faults will repeatedly try to compact memory. Due to the unfreeable
    pages, compaction fails.

    Needless to say, at that point page reclaim also fails to create free
    contiguous 2MB areas. However, that doesn't stop the current code from
    trying, over and over again, and freeing a minimum of 4MB (2UL <<
    sc->order pages) at every single invocation.

    This resulted in my 12GB system having 2-3GB free memory, a corresponding
    amount of used swap and very sluggish response times.

    This can be avoided by having the direct reclaim code not reclaim from
    zones that already have plenty of free memory available for compaction;
    see the sketch below.

    If compaction still fails due to unmovable memory, doing additional
    reclaim will only hurt the system, not help.

    [jweiner@redhat.com: change comment to explain the order check]
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
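
    A sketch of the zone filter (the helper name compaction_ready() is
    an assumption; the test lives in shrink_zones()):

        /* for higher-order allocations, leave a zone alone once it
         * already has enough free memory for compaction to succeed */
        if (COMPACTION_BUILD && sc->order &&
            compaction_ready(zone, sc->order))
                continue;       /* more reclaim here would only hurt */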
     
    When a race between putback_lru_page() and shmem_lock() with lock=0
    happens, program execution order is as follows, but the clear_bit on
    processor #1 could be reordered to right before processor #1's
    spin_unlock. The page would then be stranded on the unevictable list.

        Processor #1                        Processor #2
        (shmem_lock(lock=0))                (putback_lru_page())

                                            spin_lock
                                            SetPageLRU
                                            spin_unlock
        clear_bit(AS_UNEVICTABLE)
        spin_lock
        if PageLRU()
          if !test_bit(AS_UNEVICTABLE)
            move evictable list
                                            smp_mb
                                            if !test_bit(AS_UNEVICTABLE)
                                              move evictable list
        spin_unlock

    But pagevec_lookup() in scan_mapping_unevictable_pages() takes
    rcu_read_[un]lock(), which happens to prevent the reordering before
    test_bit(AS_UNEVICTABLE) is reached on processor #1, so the problem
    never actually occurs. That is an unexpected side effect, though, and
    we should solve the problem properly.

    This patch adds a barrier after mapping_clear_unevictable, as sketched
    below.

    I didn't hit this problem; I just found it during review.

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
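
    The fix, sketched (caller shape assumed; the barrier pairs with the
    smp_mb in putback_lru_page() shown in the diagram above):

        mapping_clear_unevictable(mapping);
        /*
         * Make the cleared bit visible before page state is tested while
         * rescuing pages, so the clear cannot drift down past our own
         * test_bit and leave the page stranded as unevictable.
         */
        smp_mb();
        scan_mapping_unevictable_pages(mapping);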
     
  • Quiet the sparse noise:

    warning: symbol 'khugepaged_scan' was not declared. Should it be static?
    warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

    Signed-off-by: H Hartley Sweeten
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
    Quiet the sparse noise:

    warning: symbol 'default_policy' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise:

    warning: symbol 'swap_token_memcg' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise in this file:

    warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Tomi Valkeinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • At one point, anonymous pages were supposed to go on the unevictable list
    when no swap space was configured, and the idea was to manually rescue
    those pages after adding swap and making them evictable again. But
    nowadays, swap-backed pages on the anon LRU list are not scanned without
    available swap space anyway, so there is no point in moving them to a
    separate list anymore.

    The manual rescue could also be used in case pages were stranded on the
    unevictable list due to race conditions. But the code has been around for
    a while now and newly discovered bugs should be properly reported and
    dealt with instead of relying on such a manual fixup.

    In addition to the lack of a usecase, the sysfs interface to rescue pages
    from a specific NUMA node has been broken since its introduction, so it's
    unlikely that anybody ever relied on that.

    This patch removes the functionality behind the sysctl and the
    node-interface and emits a one-time warning when somebody tries to access
    either of them.

    Signed-off-by: Johannes Weiner
    Reported-by: Kautuk Consul
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    write_scan_unevictable_node() checks the value req returned by
    strict_strtoul() and returns 1 if req is 0.

    However, when strict_strtoul() returns 0, it means a successful
    conversion of buf to an unsigned long.

    Due to this, the function never proceeded to scan the zones for
    unevictable pages, even when a valid value was written to the
    scan_unevictable_pages sysfs file.

    Change this check slightly: treat an invalid value in buf, as well as
    a 0 stored in res after successful conversion, as reasons not to scan
    this node's zones. The corrected check is sketched below.

    Signed-off-by: Kautuk Consul
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
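
    The corrected check, sketched (variable names from the message; the
    surrounding sysfs handler is elided):

        unsigned long res;
        int err;

        err = strict_strtoul(buf, 10, &res);
        /*
         * strict_strtoul() returns 0 on success, not the converted
         * value. Skip the scan on a parse error or an explicit 0.
         */
        if (err || res == 0)
                return 1;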
     
  • Signed-off-by: Li Haifeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng