06 Oct, 2012

1 commit


03 Oct, 2012

5 commits

  • Pull frontswap update from Konrad Rzeszutek Wilk:
    "Features:
    - Support exclusive get if backend is capable.
    Bug-fixes:
    - Fix compile warnings
    - Add comments/cleanup doc
    - Fix wrong if condition"

    * tag 'stable/for-linus-3.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: support exclusive gets if tmem backend is capable
    mm: frontswap: fix a wrong if condition in frontswap_shrink
    mm/frontswap: fix uninit'ed variable warning
    mm/frontswap: cleanup doc and comment error
    mm: frontswap: remove unneeded headers

    Linus Torvalds
     
  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c split-off - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
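
    The fdget()/fdput() helpers listed above replace the common
    fget_light()/fput_light() pattern. A minimal sketch of the conversion
    (the syscall-like wrapper functions are invented for illustration):

    #include <linux/file.h>
    #include <linux/fs.h>

    /* Old pattern: fget_light() hands back a struct file plus a flag
     * telling the caller whether fput() is needed. */
    static long example_sync_old(unsigned int fd)
    {
            int fput_needed;
            struct file *file = fget_light(fd, &fput_needed);
            long ret = -EBADF;

            if (file) {
                    ret = vfs_fsync(file, 0);
                    fput_light(file, fput_needed);
            }
            return ret;
    }

    /* New pattern: struct fd bundles the file pointer and the flag. */
    static long example_sync_new(unsigned int fd)
    {
            struct fd f = fdget(fd);
            long ret = -EBADF;

            if (f.file) {
                    ret = vfs_fsync(f.file, 0);
                    fdput(f);
            }
            return ret;
    }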
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netprioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer that executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
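
    A minimal sketch of the mod_delayed_work() change described above (the
    work item and callback are invented for illustration; the old
    cancel+queue pattern is shown only for contrast):

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    static void poll_fn(struct work_struct *work)
    {
            /* ... periodic work ... */
    }
    static DECLARE_DELAYED_WORK(poll_work, poll_fn);

    /* Before: with no mod_timer() counterpart, callers cancelled and
     * requeued, or open-coded a timer plus a work item. */
    static void reschedule_poll_old(unsigned long delay_ms)
    {
            cancel_delayed_work(&poll_work);
            queue_delayed_work(system_wq, &poll_work,
                               msecs_to_jiffies(delay_ms));
    }

    /* After: mod_delayed_work() adjusts a pending item's timeout in one
     * call, and since all workqueues are now non-reentrant,
     * flush_delayed_work() is as strong as the old _sync() variant. */
    static void reschedule_poll_new(unsigned long delay_ms)
    {
            mod_delayed_work(system_wq, &poll_work,
                             msecs_to_jiffies(delay_ms));
    }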
     

02 Oct, 2012

2 commits

  • Pull RCU changes from Ingo Molnar:

    0. 'idle RCU':

    Adds RCU APIs that allow non-idle tasks to enter RCU idle mode and
    provides x86 code to make use of them, allowing RCU to treat
    user-mode execution as an extended quiescent state when the new
    RCU_USER_QS kernel configuration parameter is specified. (Work is
    in progress to port this to a few other architectures, but is not
    part of this series.)

    1. A fix for a latent bug that has been in RCU ever since the addition
    of CPU stall warnings. This bug results in false-positive stall
    warnings, but thus far only on embedded systems with severely
    cut-down userspace configurations.

    2. Further reductions in latency spikes for huge systems, along with
    additional boot-time adaptation to the actual hardware.

    This is a large change, as it moves RCU grace-period initialization
    and cleanup, along with quiescent-state forcing, from softirq to a
    kthread. However, it appears to be in quite good shape (famous
    last words).

    3. Updates to documentation and rcutorture, the latter category
    including keeping statistics on CPU-hotplug latencies and fixing
    some initialization-time races.

    4. CPU-hotplug fixes and improvements.

    5. Idle-loop fixes that were omitted on an earlier submission.

    6. Miscellaneous fixes and improvements

    In certain RCU configurations new kernel threads will show up (rcu_bh,
    rcu_sched), showing RCU processing overhead.

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
    rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling
    rcu: Userspace RCU extended QS selftest
    x86: Exit RCU extended QS on notify resume
    x86: Use the new schedule_user API on userspace preemption
    rcu: Exit RCU extended QS on user preemption
    rcu: Exit RCU extended QS on kernel preemption after irq/exception
    x86: Exception hooks for userspace RCU extended QS
    x86: Unspaghettize do_general_protection()
    x86: Syscall hooks for userspace RCU extended QS
    rcu: Switch task's syscall hooks on context switch
    rcu: Ignore userspace extended quiescent state by default
    rcu: Allow rcu_user_enter()/exit() to nest
    rcu: Settle config for userspace extended quiescent state
    rcu: Make RCU_FAST_NO_HZ handle adaptive ticks
    rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs
    rcu: New rcu_user_enter() and rcu_user_exit() APIs
    ia64: Add missing RCU idle APIs on idle loop
    xtensa: Add missing RCU idle APIs on idle loop
    score: Add missing RCU idle APIs on idle loop
    parisc: Add missing RCU idle APIs on idle loop
    ...

    Linus Torvalds
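
    For the "Add missing RCU idle APIs on idle loop" commits above, the
    per-architecture change follows roughly this shape (a simplified
    sketch, not any one architecture's actual cpu_idle() loop):

    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    void cpu_idle_sketch(void)
    {
            while (1) {
                    /* Tell RCU this CPU is entering an extended quiescent
                     * state so grace periods need not wait on it. */
                    rcu_idle_enter();
                    while (!need_resched())
                            cpu_relax();    /* or the arch's low-power wait */
                    rcu_idle_exit();

                    schedule_preempt_disabled();
            }
    }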
     
  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup writeback_sb_inodes kerneldoc
    btrfs: fix the comment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • Speculative cache pagecache lookups can elevate the refcount from
    under us, so avoid the false positive. If the refcount is < 2 we'll be
    notified by a VM_BUG_ON in put_page_testzero as there are two
    put_page(src_page) in a row before returning from this function.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Sep, 2012

4 commits


26 Sep, 2012

1 commit


23 Sep, 2012

1 commit


21 Sep, 2012

2 commits

  • Tmem, as originally specified, assumes that "get" operations
    performed on persistent pools never flush the page of data out
    of tmem on a successful get, waiting instead for a flush
    operation. This is intended to mimic the model of a swap
    disk, where a disk read is non-destructive. Unlike a
    disk, however, freeing up the RAM can be valuable. Over
    the years that frontswap was in the review process, several
    reviewers (and notably Hugh Dickins in 2010) pointed out that
    this would result, at least temporarily, in two copies of the
    data in RAM: one (compressed for zcache) copy in tmem,
    and one copy in the swap cache. We wondered if this could
    be done differently, at least optionally.

    This patch allows tmem backends to instruct the frontswap
    code that this backend performs exclusive gets. Zcache2
    already contains hooks to support this feature. Other
    backends are completely unaffected unless/until they are
    updated to support this feature.

    While it is not clear that exclusive gets are a performance
    win on all workloads at all times, this small patch allows for
    experimentation by backends.

    P.S. Let's not quibble about the naming of "get" vs "read" vs
    "load" etc. The naming is currently horribly inconsistent between
    cleancache and frontswap and existing tmem backends, so will need
    to be straightened out as a separate patch. "Get" is used
    by the tmem architecture spec, existing backends, and
    all documentation and presentation material so I am
    using it in this patch.

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
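
    A rough sketch of how a backend would opt in, based on this
    description (the init function name is invented, and the exact
    frontswap call may differ in detail from what finally merged):

    #include <linux/frontswap.h>

    static int __init my_tmem_backend_init(void)
    {
            /*
             * Declare that a successful frontswap get also removes the
             * page from this backend, so frontswap must treat the copy
             * it gets back as the only one (e.g. re-dirty the page)
             * instead of assuming the backend still holds a duplicate.
             */
            frontswap_tmem_exclusive_gets(true);

            /* ... register the backend's struct frontswap_ops as usual ... */
            return 0;
    }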
     
    pages_to_unuse is set to 0 to unuse all frontswap pages,
    but that doesn't happen since a wrong condition in frontswap_shrink
    cancels it.

    -v2: Add comment to explain return value of __frontswap_shrink,
    as suggested by Dan Carpenter, thanks

    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Konrad Rzeszutek Wilk

    Zhenzhong Duan
     

18 Sep, 2012

6 commits

  • There may be a bug when registering section info. For example, on my
    Itanium platform, the pfn range of node0 includes the other nodes, so
    other nodes' section info will be registered twice, and the memmap's
    page count will equal 3.

    node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
    node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
    node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
    node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000

    free_all_bootmem_node()
      register_page_bootmem_info_node()
        register_page_bootmem_info_section()

    When hot-removing memory, we can't free the memmap's page because
    page_count() is 2 after put_page_bootmem().

    sparse_remove_one_section()
      free_section_usemap()
        free_map_bootmem()
          put_page_bootmem()

    [akpm@linux-foundation.org: add code comment]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Yasuaki Ishimatsu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    qiuxishi
     
  • The heuristic method for buddy has been introduced since commit
    43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
    of adjacent buddy lists"). But the page address of higher page's buddy
    was wrongly calculated, which causes page_is_buddy() to fail forever.
    IOW, the heuristic method is effectively disabled by the wrong address
    of the higher page's buddy.

    The page address of the higher page's buddy should be calculated from
    higher_page, using the offset between the higher page's index and the
    index of its buddy.

    Signed-off-by: Haifeng Li
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: KyongHo Cho
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
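
    A sketch of the arithmetic being corrected, paraphrasing the relevant
    lines of __free_one_page() (the wrapper function exists only for
    illustration):

    static void higher_buddy_sketch(struct page *page, unsigned long page_idx,
                                    unsigned long buddy_idx, unsigned int order)
    {
            unsigned long combined_idx = buddy_idx & page_idx;
            struct page *higher_page = page + (combined_idx - page_idx);
            struct page *higher_buddy;

            buddy_idx = __find_buddy_index(combined_idx, order + 1);

            /* Broken: offsets from 'page', whose index is page_idx, not
             * combined_idx, so page_is_buddy() can never match. */
            higher_buddy = page + (buddy_idx - combined_idx);

            /* Fixed: the offset is relative to higher_page, whose index
             * is combined_idx. */
            higher_buddy = higher_page + (buddy_idx - combined_idx);
            (void)higher_buddy;
    }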
     
  • get_partial() is currently not checking pfmemalloc_match() meaning that
    it is possible for pfmemalloc pages to leak to non-pfmemalloc users.
    This is a problem in the following situation. Assume that there is a
    request from normal allocation and there are no objects in the per-cpu
    cache and no node-partial slab.

    In this case, slab_alloc enters the slow path and new_slab_objects() is
    called which may return a PFMEMALLOC page. As the current user is not
    allowed to access PFMEMALLOC page, deactivate_slab() is called
    ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
    checks]) and returns an object from PFMEMALLOC page.

    Next time, when we get another request from normal allocation,
    slab_alloc() enters the slow-path and calls new_slab_objects(). In
    new_slab_objects(), we call get_partial() and get a partial slab which
    was just deactivated but is a pfmemalloc page. We extract one object
    from it and re-deactivate.

    "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly.

    As a result, access to the PFMEMALLOC page is not properly restricted
    and the frequent deactivation can cause a performance degradation.

    This patch changes get_partial_node() to take pfmemalloc_match() into
    account and prevents the "deactivate -> re-get in get_partial()"
    scenario. Instead, new_slab() is called.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
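
    The shape of the change described above is to skip unusable slabs
    while walking the node's partial list; a paraphrased fragment, not the
    verbatim kernel code:

    /* In get_partial_node(), now also given the allocation's gfp flags: */
    list_for_each_entry_safe(page, page2, &n->partial, lru) {
            /* Never hand a pfmemalloc slab to a normal allocation;
             * skip it so new_slab() is used instead. */
            if (!pfmemalloc_match(page, flags))
                    continue;

            /* ... acquire_slab() and proceed as before ... */
    }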
     
    In the array cache there is an object at index 0 as well, so check it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Right now, we call ClearSlabPfmemalloc() for first page of slab when we
    clear SlabPfmemalloc flag. This is fine for most swap-over-network use
    cases as it is expected that order-0 pages are in use. Unfortunately it
    is possible that __ac_put_obj() checks SlabPfmemalloc on a tail
    page and while this is harmless, it is sloppy. This patch ensures that
    the head page is always used.

    This problem was originally identified by Joonsoo Kim.

    [js1304@gmail.com: Original implementation and problem identification]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
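
    The essence of the fix, as described, is to resolve an object to its
    slab's head page before testing the flag; a paraphrased fragment,
    with helper names following the existing pfmemalloc code in mm/slab.c:

    /* Was virt_to_page(objp): on a high-order slab that can return a
     * tail page, while the SlabPfmemalloc state lives on the head page. */
    struct page *page = virt_to_head_page(objp);

    if (PageSlabPfmemalloc(page))
            set_obj_pfmemalloc(&objp);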
     
    If kthread_run() fails, pgdat->kswapd contains an errno value. When we
    stop this thread, we only check whether pgdat->kswapd is NULL before
    accessing it; if it holds an errno value, that access causes a page
    fault. Resetting pgdat->kswapd to NULL when kernel thread creation
    fails avoids this problem.

    Signed-off-by: Wen Congyang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
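
    A sketch of the fixed kswapd_run() error path (paraphrased; the
    pr_err text is illustrative):

    int kswapd_run(int nid)
    {
            pg_data_t *pgdat = NODE_DATA(nid);
            int ret = 0;

            if (pgdat->kswapd)
                    return 0;

            pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
            if (IS_ERR(pgdat->kswapd)) {
                    /* failure at boot is fatal */
                    BUG_ON(system_state == SYSTEM_BOOTING);
                    pr_err("Failed to start kswapd on node %d\n", nid);
                    ret = PTR_ERR(pgdat->kswapd);
                    /* Don't leave an ERR_PTR behind: kswapd_stop() only
                     * checks for NULL before calling kthread_stop(). */
                    pgdat->kswapd = NULL;
            }
            return ret;
    }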
     

15 Sep, 2012

2 commits

  • Pull a core sparse warning fix from Ingo Molnar

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm/memblock: Use NULL instead of 0 for pointers

    Linus Torvalds
     
  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Having users rely on separate hierarchies
    and expect completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
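
    From a controller's point of view the opt-out is a single flag; a
    minimal sketch (the subsystem name is invented and the other ops are
    elided):

    #include <linux/cgroup.h>

    struct cgroup_subsys example_subsys = {
            .name             = "example",
            /* .create, .destroy and the rest of the ops as usual ... */

            /*
             * Nesting is not handled properly yet; with this set the
             * cgroup core prints a one-time warning when a child cgroup
             * is created for this controller.
             */
            .broken_hierarchy = true,
    };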
     

07 Sep, 2012

1 commit

  • Trivially triggerable, found by trinity:

    kernel BUG at mm/mempolicy.c:2546!
    Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
    Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
    RIP: mpol_to_str+0x156/0x360

    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     

05 Sep, 2012

1 commit


30 Aug, 2012

1 commit

    cache_grow() can re-enable irqs, so the cpu (and node) can change; ensure
    that we take list_lock on the correct nodelist.

    This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
    knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
    node was taken after growing the cache.

    Reported-and-tested-by: Haggai Eran
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Aug, 2012

1 commit


26 Aug, 2012

1 commit

  • Pull block-related fixes from Jens Axboe:

    - Improvements to the buffered and direct write IO plugging from
    Fengguang.

    - Abstract out the mapping of a bio in a request, and use that to
    provide a blk_bio_map_sg() helper. Useful for mapping just a bio
    instead of a full request.

    - Regression fix from Hugh, fixing up a patch that went into the
    previous release cycle (and marked stable, too) attempting to prevent
    a loop in __getblk_slow().

    - Updates to discard requests, fixing up the sizing and how we align
    them. Also a change to disallow merging of discard requests, since
    that doesn't really work properly yet.

    - A few drbd fixes.

    - Documentation updates.

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: replace __getblk_slow misfix by grow_dev_page fix
    drbd: Write all pages of the bitmap after an online resize
    drbd: Finish requests that completed while IO was frozen
    drbd: fix drbd wire compatibility for empty flushes
    Documentation: update tunable options in block/cfq-iosched.txt
    Documentation: update tunable options in block/cfq-iosched.txt
    Documentation: update missing index files in block/00-INDEX
    block: move down direct IO plugging
    block: remove plugging at buffered write time
    block: disable discard request merge temporarily
    bio: Fix potential memory leak in bio_find_or_create_slab()
    block: Don't use static to define "void *p" in show_partition_start()
    block: Add blk_bio_map_sg() helper
    block: Introduce __blk_segment_map_sg() helper
    fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
    block: split discard into aligned requests
    block: reorganize rounding of max_discard_sectors

    Linus Torvalds
     

25 Aug, 2012

1 commit

  • Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.

    $ size vmlinux.o
    text data bss dec hex filename
    4658782 880729 5195032 10734543 a3cbcf vmlinux.o
    $ size vmlinux.o
    text data bss dec hex filename
    4658957 880729 5195032 10734718 a3cc7e vmlinux.o

    v7:
    - checkpatch warnings fixed
    - Implement the changes requested by Hugh Dickins:
    - make simple_xattrs_init and simple_xattrs_free inline
    - get rid of locking and list reinitialization in simple_xattrs_free,
    they're not needed
    v6:
    - no changes
    v5:
    - no changes
    v4:
    - move simple_xattrs_free() to fs/xattr.c
    v3:
    - in kmem_xattrs_free(), reinitialize the list
    - use simple_xattr_* prefix
    - introduce simple_xattr_add() to prevent direct list usage

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Acked-by: Hugh Dickins
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
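
    A rough sketch of the extracted API from a consumer's point of view
    (exact signatures may differ slightly; the attribute name and value
    are purely illustrative):

    #include <linux/xattr.h>

    static void simple_xattr_demo(void)
    {
            struct simple_xattrs xattrs;
            char buf[16];

            simple_xattrs_init(&xattrs);

            /* flags follow setxattr(2): XATTR_CREATE / XATTR_REPLACE / 0 */
            simple_xattr_set(&xattrs, "trusted.example", "42", 2, XATTR_CREATE);
            simple_xattr_get(&xattrs, "trusted.example", buf, sizeof(buf));

            simple_xattrs_free(&xattrs);
    }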
     

24 Aug, 2012

1 commit


22 Aug, 2012

6 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high while the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
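
    The helper's shape, as described above, is roughly the following (a
    paraphrase, not the verbatim mm/compaction.c code):

    /* Returns true with the lock held, or false if async compaction
     * should abort because the lock is contended or we need to resched. */
    static bool compact_checklock_irqsave(spinlock_t *lock,
                                          unsigned long *flags,
                                          bool locked,
                                          struct compact_control *cc)
    {
            if (need_resched() || spin_is_contended(lock)) {
                    if (locked) {
                            spin_unlock_irqrestore(lock, *flags);
                            locked = false;
                    }

                    /* Async compaction aborts rather than waits; the
                     * caller sees cc->contended and fails the THP
                     * allocation instead of thrashing on the lock. */
                    if (!cc->sync) {
                            cc->contended = true;
                            return false;
                    }

                    cond_resched();
            }

            if (!locked)
                    spin_lock_irqsave(lock, *flags);
            return true;
    }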
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount work the free
    page scanner does in compaction. However, it has a problem. Consider
    two processes simultaneously scanning free pages

    C
    Process A M S F
    |---------------------------------------|
    Process B M FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

    if (cc->order > 0 && (!cc->wrapped ||
                          zone->compact_cached_free_pfn >
                          cc->start_free_pfn))
            pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong update to compact_cached_free_pfn which is just
    random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolated pages it
    checks if compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest
    pageblock that pages were isolated from. This value will not
    be updated again until a free page scanner wraps and resets
    compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
    ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
    setting, but it missed some places the pfmemalloc should be set.

    So, in __slab_alloc, the mismatch between page->pfmemalloc and
    ALLOC_NO_WATERMARKS causes an incorrect deactivate_slab() on our core2
    server:

    64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |---0.34%-- deactivate_slab
    | __slab_alloc
    | kmem_cache_alloc
    | |

    That causes our fio sync write performance to have a 40% regression.

    Move the check into get_page_from_freelist(), which resolves this issue.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Cc: David Miller
    Tested-by: Eric Dumazet
    Tested-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
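
    The essence of the move, paraphrased: set the flag once, at the point
    where get_page_from_freelist() hands the page back, from the
    alloc_flags actually used for that allocation:

    /* In get_page_from_freelist(), after a successful allocation: */
    page = buffered_rmqueue(preferred_zone, zone, order,
                            gfp_mask, migratetype);
    if (page)
            /*
             * Set page->pfmemalloc only when ALLOC_NO_WATERMARKS was
             * really used, so the slab allocators no longer see a stale
             * or mismatched flag.
             */
            page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);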
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed bad deferring policy but made mistake about checking
    compact_order_failed in __compact_pgdat(). So it can't update
    compact_order_failed with the new order. This ends up preventing
    correct operation of policy deferral. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
    that not all the page tables allocated could be found and freed when
    exit_mmap() tore down the user address space.

    There's usually nothing we can say about it, beyond that it's probably a
    sign of some bad memory or memory corruption; though it might still
    indicate a bug in vma or page table management (and did recently reveal a
    race in THP, fixed a few months ago).

    But one overdue change we can make is from BUG_ON to WARN_ON.

    It's fairly likely that the system will crash shortly afterwards in some
    other way (for example, the BUG_ON(page_mapped(page)) in
    __delete_from_page_cache(), once an inode mapped into the lost page tables
    gets evicted); but might tell us more before that.

    Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less
    eager, since that one has several times led to fixes.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Initializers for deferrable delayed_work are confusingly named.

    * __DEFERRED_WORK_INITIALIZER()
    * DECLARE_DEFERRED_WORK()
    * INIT_DELAYED_WORK_DEFERRABLE()

    Rename them to

    * __DEFERRABLE_WORK_INITIALIZER()
    * DECLARE_DEFERRABLE_WORK()
    * INIT_DEFERRABLE_WORK()

    This patch doesn't cause any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
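
    After the rename, deferrable work items are declared and initialized
    like this (a minimal sketch; the reaper names are invented):

    #include <linux/workqueue.h>

    static void reap_fn(struct work_struct *work)
    {
            /* periodic, deferrable housekeeping */
    }

    /* Static definition, new name (was DECLARE_DEFERRED_WORK): */
    static DECLARE_DEFERRABLE_WORK(reap_work, reap_fn);

    /* Runtime initialization, new name (was INIT_DELAYED_WORK_DEFERRABLE): */
    static struct delayed_work late_reap_work;

    static void reaper_setup(void)
    {
            INIT_DEFERRABLE_WORK(&late_reap_work, reap_fn);
            queue_delayed_work(system_wq, &reap_work, HZ);
    }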
     

21 Aug, 2012

1 commit

  • This patch fixes:

    https://bugzilla.redhat.com/show_bug.cgi?id=843640

    If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
    does unmap_region() but does not remove the soon-to-be-freed vma
    from rb tree. Actually there are more problems but this is how
    William noticed this bug.

    Perhaps we could do do_munmap() + return in this case, but in
    fact it is simply wrong to abort if uprobe_mmap() fails. Until
    at least we move the !UPROBE_COPY_INSN code from
    install_breakpoint() to uprobe_register().

    For example, uprobe_mmap()->install_breakpoint() can fail if the
    probed insn is not supported (remember, uprobe_register()
    succeeds if nobody mmaps inode/offset), mmap() should not fail
    in this case.

    dup_mmap()->uprobe_mmap() is wrong too by the same reason,
    fork() can race with uprobe_register() and fail for no reason if
    it wins the race and does install_breakpoint() first.

    And, if nothing else, both mmap_region() and dup_mmap() return
    success if uprobe_mmap() fails. Change them to ignore the error
    code from uprobe_mmap().

    Reported-and-tested-by: William Cohen
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Cc: # v3.5
    Cc: Anton Arapov
    Cc: William Cohen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
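
    The mmap_region() side of the change, paraphrased (dup_mmap() gets
    the same treatment):

    /* Before: a failed breakpoint insertion aborted the whole mmap(). */
    if (file && uprobe_mmap(vma))
            goto unmap_and_free_vma;

    /* After: probe-insertion failures are not fatal to mmap(). */
    if (file)
            uprobe_mmap(vma);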
     

14 Aug, 2012

1 commit