Eric Lee / smarc-fsl-linux-kernel

16 Jan, 2016

11 commits

3c28c9cca Merge tag 'md/4.5' of git://neil.brown.name/md

Pull md updates from Neil Brown:
"Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and
other minor stuff.

One patch removes me from the MAINTAINERS file and adds a record of my
md maintainership to Credits"

Many thanks to Neil, who has been around for a _looong_ time.

* tag 'md/4.5' of git://neil.brown.name/md: (26 commits)
md/raid: only permit hot-add of compatible integrity profiles
Remove myself as MD Maintainer, and add to Credits.
raid5-cache: handle journal hotadd in quiesce
MD: add journal with array suspended
md: set MD_HAS_JOURNAL in correct places
md: Remove 'ready' field from mddev.
md: remove unnecesary md_new_event_inintr
raid5: allow r5l_io_unit allocations to fail
raid5-cache: use a mempool for the metadata block
raid5-cache: use a bio_set
raid5-cache: add journal hot add/remove support
drivers: md: use ktime_get_real_seconds()
md: avoid warning for 32-bit sector_t
raid5-cache: free meta_page earlier
raid5-cache: simplify r5l_move_io_unit_list
md: update comment for md_allow_write
md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
md-cluster: Protect communication with mutexes
md-cluster: Defer MD reloading to mddev->thread
md-cluster: update the documentation
...

Linus Torvalds
2016-01-16 04:28:00 +0800
4b43ea2a7 Merge tag 'regulator-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator updates from Mark Brown:
"Aside from a fix for a spurious warning (which caused more problems
than it fixed in the fixing really) this is all driver updates,
including new drivers for Dialog PV88060/90 and TI LM363x and TPS65086
devices. The qcom_smd driver has had PM8916 and PMA8084 support
added"

* tag 'regulator-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (36 commits)
regulator: core: remove some dead code
regulator: core: use dev_to_rdev
regulator: lp872x: Get rid of duplicate reference to DVS GPIO
regulator: lp872x: Add missing of_match in regulators descriptions
regulator: axp20x: Fix GPIO LDO enable value for AXP22x
regulator: lp8788: constify regulator_ops structures
regulator: wm8*: constify regulator_ops structures
regulator: da9*: constify regulator_ops structures
regulator: mt6311: Use REGCACHE_RBTREE
regulator: tps65917/palmas: Add bypass ops for LDOs with bypass capability
regulator: qcom-smd: Add support for PMA8084
regulator: qcom-smd: Add PM8916 support
soc: qcom: documentation: Update SMD/RPM Docs
regulator: pv88090: logical vs bitwise AND typo
regulator: pv88090: Fix irq leak
regulator: pv88090: new regulator driver
regulator: wm831x-ldo: Use platform_register/unregister_drivers()
regulator: wm831x-dcdc: Use platform_register/unregister_drivers()
regulator: lp8788-ldo: Use platform_register/unregister_drivers()
regulator: core: Fix nested locking of supplies
...

Linus Torvalds
2016-01-16 04:14:47 +0800
7aca74e7c Merge branch 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration

Pull mailbox fixlet from Jassi Brar.

* 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
mailbox: constify mbox_chan_ops structure

Linus Torvalds
2016-01-16 04:13:58 +0800
1d3671df7 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF fixes and quota cleanups from Jan Kara:
"Several UDF fixes and some minor quota cleanups"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Check output buffer length when converting name to CS0
udf: Prevent buffer overrun with multi-byte characters
quota: constify qtree_fmt_operations structures
udf: avoid uninitialized variable use
udf: Fix lost indirect extent block
udf: Factor out code for creating indirect extent
udf: limit the maximum number of indirect extents in a row
udf: limit the maximum number of TD redirections
fs: make quota/dquot.c explicitly non-modular
fs: make quota/netlink.c explicitly non-modular

Linus Torvalds
2016-01-16 03:51:51 +0800
875fc4f5d Merge branch 'akpm' (patches from Andrew)

Merge first patch-bomb from Andrew Morton:

- A few hotfixes which missed 4.4 because I was asleep. cc'ed to
-stable

- A few misc fixes

- OCFS2 updates

- Part of MM. Including pretty large changes to page-flags handling
and to thp management which have been buffered up for 2-3 cycles now.

I have a lot of MM material this time.

[ It turns out the THP part wasn't quite ready, so that got dropped from
this series - Linus ]

* emailed patches from Andrew Morton: (117 commits)
zsmalloc: reorganize struct size_class to pack 4 bytes hole
mm/zbud.c: use list_last_entry() instead of list_tail_entry()
zram/zcomp: do not zero out zcomp private pages
zram: pass gfp from zcomp frontend to backend
zram: try vmalloc() after kmalloc()
zram/zcomp: use GFP_NOIO to allocate streams
mm: add tracepoint for scanning pages
drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
mm/page_isolation: use macro to judge the alignment
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
mm: rework virtual memory accounting
include/linux/memblock.h: fix ordering of 'flags' argument in comments
mm: move lru_to_page to mm_inline.h
Documentation/filesystems: describe the shared memory usage/accounting
memory-hotplug: don't BUG() in register_memory_resource()
hugetlb: make mm and fs code explicitly non-modular
mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
mm: make sure isolate_lru_page() is never called for tail page
vmstat: make vmstat_updater deferrable again and shut down on idle
...

Linus Torvalds
2016-01-16 03:41:44 +0800
7dfa46122 zsmalloc: reorganize struct size_class to pack 4 bytes hole

Reorder the pages_per_zspage field in struct size_class to eliminate
the 4-byte hole between it and the stats field.

Signed-off-by: Weijie Yang
Reviewed-by: Sergey Senozhatsky
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Weijie Yang
2016-01-16 03:40:52 +0800
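
The padding hole described in the zsmalloc commit above can be illustrated with a small userspace sketch. The two structs below are hypothetical stand-ins, not the kernel's real struct size_class: the member list is trimmed, `struct stats_sketch` is invented for illustration, and the new position of pages_per_zspage is an assumption. The 4-byte hole assumes a typical 64-bit ABI where uint64_t is 8-byte aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the stats member (8-byte aligned on LP64). */
struct stats_sketch {
	uint64_t objs[4];
};

/* Old-style order: three 4-byte fields sit in front of the aligned stats. */
struct size_class_old {
	int size;
	unsigned int index;
	int pages_per_zspage;
	struct stats_sketch stats;
	bool huge;
};

/* Reordered (assumed placement): the int now shares the tail with 'huge'. */
struct size_class_new {
	int size;
	unsigned int index;
	struct stats_sketch stats;
	int pages_per_zspage;
	bool huge;
};

/* Padding bytes the compiler inserts right before 'stats' in the old order. */
size_t hole_before_stats_old(void)
{
	return offsetof(struct size_class_old, stats) -
	       (offsetof(struct size_class_old, pages_per_zspage) + sizeof(int));
}

/* Padding bytes right before 'stats' once pages_per_zspage has moved. */
size_t hole_before_stats_new(void)
{
	return offsetof(struct size_class_new, stats) -
	       (offsetof(struct size_class_new, index) + sizeof(unsigned int));
}
```

Comparing sizeof() of the two structs on a 64-bit build shows the saving; on the real kernel structure a layout tool such as pahole gives the authoritative picture.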
f58fb5e7f mm/zbud.c: use list_last_entry() instead of list_tail_entry()

list_last_entry() has been defined in list.h, so replace
list_tail_entry() with it.

Signed-off-by: Geliang Tang
Cc: Seth Jennings
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Geliang Tang
2016-01-16 03:40:52 +0800
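
For readers unfamiliar with the helper named in the zbud commit above, here is a minimal userspace sketch of what list_last_entry() does on a circular doubly linked list. It is a simplified re-implementation for illustration, not the kernel's list.h; `struct item`, `list_init()` and `last_value_of_three()` are hypothetical names.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, modelled on the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The last entry is the one whose node the head's prev pointer refers to. */
#define list_last_entry(head, type, member) \
	container_of((head)->prev, type, member)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'node' just before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical payload type embedding a list node. */
struct item {
	int value;
	struct list_head link;
};

/* Builds the list 1, 2, 3 and returns the value stored in its last entry. */
int last_value_of_three(void)
{
	struct list_head head;
	struct item a = { .value = 1 }, b = { .value = 2 }, c = { .value = 3 };

	list_init(&head);
	list_add_tail(&a.link, &head);
	list_add_tail(&b.link, &head);
	list_add_tail(&c.link, &head);

	return list_last_entry(&head, struct item, link)->value;
}
```

As with the kernel macro, the result is only meaningful on a non-empty list; on an empty list head->prev points back at the head itself.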
e02d238c9 zram/zcomp: do not zero out zcomp private pages

Do not use __GFP_ZERO when allocating zcomp ->private pages. We keep
allocated streams around and use them for read/write requests, so we
supply a zeroed-out ->private to the compression algorithm as a scratch
buffer only once: the first time we use that stream. For the rest of
the IO requests served by this stream, ->private usually contains some
temporary data from previous requests.

Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sergey Senozhatsky
2016-01-16 03:40:52 +0800
75d8947a3 zram: pass gfp from zcomp frontend to backend

Each zcomp backend uses its own gfp flags, but that is pointless because
the context in which a backend is called is determined by the upper
layer (i.e. the zcomp frontend). Moreover, the zcomp frontend can call
the backends from different contexts. In one context (i.e. the zram
init part) it is better to make sure the allocation succeeds; in another
(i.e. further stream allocation to accelerate I/O) the allocation is
merely optional. So let's pass gfp down from the driver (i.e. the zcomp
frontend), following the normal MM convention.

[sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
Signed-off-by: Minchan Kim
Signed-off-by: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-01-16 03:40:51 +0800
d913897ab zram: try vmalloc() after kmalloc()

When using LZ4 multi compression streams for zram swap, we saw page
allocation failure messages while running system tests. This happened
not just once but several times (2 - 5 times per test). In addition,
some of the failures recurred continually on order-3 allocation
attempts.

To allocate the private data for parallel compression, we have to call
kzalloc() with order 2/3 at runtime (lzo/lz4). If no order 2/3 memory
is available at that time, the page allocation fails. This patch uses
vmalloc() as a fallback for kmalloc(), which prevents the page
allocation failure warning.

With this patch we no longer saw the warning message in test runs, and
it also reduced process startup latency by about 60-120ms in each case.

For reference, a call trace:

Binder_1: page allocation failure: order:3, mode:0x10c0d0
CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
Call trace:
dump_backtrace+0x0/0x270
show_stack+0x10/0x1c
dump_stack+0x1c/0x28
warn_alloc_failed+0xfc/0x11c
__alloc_pages_nodemask+0x724/0x7f0
__get_free_pages+0x14/0x5c
kmalloc_order_trace+0x38/0xd8
zcomp_lz4_create+0x2c/0x38
zcomp_strm_alloc+0x34/0x78
zcomp_strm_multi_find+0x124/0x1ec
zcomp_strm_find+0xc/0x18
zram_bvec_rw+0x2fc/0x780
zram_make_request+0x25c/0x2d4
generic_make_request+0x80/0xbc
submit_bio+0xa4/0x15c
__swap_writepage+0x218/0x230
swap_writepage+0x3c/0x4c
shrink_page_list+0x51c/0x8d0
shrink_inactive_list+0x3f8/0x60c
shrink_lruvec+0x33c/0x4cc
shrink_zone+0x3c/0x100
try_to_free_pages+0x2b8/0x54c
__alloc_pages_nodemask+0x514/0x7f0
__get_free_pages+0x14/0x5c
proc_info_read+0x50/0xe4
vfs_read+0xa0/0x12c
SyS_read+0x44/0x74
DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB

[minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
[sergey.senozhatsky@gmail.com: tweak comments and styles]
Signed-off-by: Kyeongdon Kim
Signed-off-by: Minchan Kim
Acked-by: Sergey Senozhatsky
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kyeongdon Kim
2016-01-16 03:40:51 +0800
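
The fallback idea in the zram commit above (try the contiguous allocator first, fall back to a non-contiguous one on failure) can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `contig_alloc_sketch()` only imitates a high-order kmalloc() failure by refusing large requests, `noncontig_alloc_sketch()` plays the role of vmalloc(), and `CONTIG_LIMIT` is an arbitrary illustrative threshold. This is not the zram code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Arbitrary illustrative limit above which the "contiguous" path fails. */
#define CONTIG_LIMIT 4096

/* Stand-in for kmalloc(): refuses requests that would need a high order. */
static void *contig_alloc_sketch(size_t size)
{
	if (size > CONTIG_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for vmalloc(): does not need physically contiguous memory. */
static void *noncontig_alloc_sketch(size_t size)
{
	return calloc(1, size);
}

/*
 * Try the contiguous allocator first and only fall back when it fails.
 * *used_fallback reports which path produced the returned buffer.
 */
void *alloc_with_fallback(size_t size, bool *used_fallback)
{
	void *p = contig_alloc_sketch(size);

	*used_fallback = false;
	if (!p) {
		p = noncontig_alloc_sketch(size);
		*used_fallback = (p != NULL);
	}
	return p;
}
```

In this sketch both paths are backed by calloc(), so a single free() releases either buffer; in the kernel the two allocators have different free routines, which is why a helper that handles both cases is normally used for release.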
3d5fe03a3 zram/zcomp: use GFP_NOIO to allocate streams

We can end up allocating a new compression stream with GFP_KERNEL from
within the IO path, which may result in nested (recursive) IO
operations. That can introduce problems if the IO path in question is a
reclaimer, holding some locks that will deadlock nested IOs.

Allocate streams and working memory using GFP_NOIO flag, forbidding
recursive IO and FS operations.

An example:

inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
(jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555
{IN-RECLAIM_FS-W} state was registered at:
__lock_acquire+0x8da/0x117b
lock_acquire+0x10c/0x1a7
start_this_handle+0x52d/0x555
jbd2__journal_start+0xb4/0x237
__ext4_journal_start_sb+0x108/0x17e
ext4_dirty_inode+0x32/0x61
__mark_inode_dirty+0x16b/0x60c
iput+0x11e/0x274
__dentry_kill+0x148/0x1b8
shrink_dentry_list+0x274/0x44a
prune_dcache_sb+0x4a/0x55
super_cache_scan+0xfc/0x176
shrink_slab.part.14.constprop.25+0x2a2/0x4d3
shrink_zone+0x74/0x140
kswapd+0x6b7/0x930
kthread+0x107/0x10f
ret_from_fork+0x3f/0x70
irq event stamp: 138297
hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f
hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f
softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9
softirqs last disabled at (137813): irq_exit+0x41/0x95

other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(jbd2_handle);

lock(jbd2_handle);

*** DEADLOCK ***
5 locks held by git/20158:
#0: (sb_writers#7){.+.+.+}, at: [] mnt_want_write+0x24/0x4b
#1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [] lock_rename+0xd9/0xe3
#2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [] lock_two_nondirectories+0x3f/0x6b
#3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [] lock_two_nondirectories+0x66/0x6b
#4: (jbd2_handle){+.+.?.}, at: [] start_this_handle+0x4ca/0x555

stack backtrace:
CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
Call Trace:
dump_stack+0x4c/0x6e
mark_lock+0x384/0x56d
mark_held_locks+0x5f/0x76
lockdep_trace_alloc+0xb2/0xb5
kmem_cache_alloc_trace+0x32/0x1e2
zcomp_strm_alloc+0x25/0x73 [zram]
zcomp_strm_multi_find+0xe7/0x173 [zram]
zcomp_strm_find+0xc/0xe [zram]
zram_bvec_rw+0x2ca/0x7e0 [zram]
zram_make_request+0x1fa/0x301 [zram]
generic_make_request+0x9c/0xdb
submit_bio+0xf7/0x120
ext4_io_submit+0x2e/0x43
ext4_bio_write_page+0x1b7/0x300
mpage_submit_page+0x60/0x77
mpage_map_and_submit_buffers+0x10f/0x21d
ext4_writepages+0xc8c/0xe1b
do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x84/0x8b
filemap_flush+0x1c/0x1e
ext4_alloc_da_blocks+0xb8/0x117
ext4_rename+0x132/0x6dc
? mark_held_locks+0x5f/0x76
ext4_rename2+0x29/0x2b
vfs_rename+0x540/0x636
SyS_renameat2+0x359/0x44d
SyS_rename+0x1e/0x20
entry_SYSCALL_64_fastpath+0x12/0x6f

[minchan@kernel.org: add stable mark]
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Kyeongdon Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sergey Senozhatsky
2016-01-16 03:40:51 +0800

15 Jan, 2016

29 commits

7d1fc01af Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
floppy: make local variable non-static
exynos: fixes an incorrect header guard
dt-bindings: fixes some incorrect header guards
cpufreq-dt: correct dead link in documentation
cpufreq: ARM big LITTLE: correct dead link in documentation
treewide: Fix typos in printk
Documentation: filesystem: Fix typo in fs/eventfd.c
fs/super.c: use && instead of & for warn_on condition
Documentation: fix sysfs-ptp
lib: scatterlist: fix Kconfig description

Linus Torvalds
2016-01-15 09:04:19 +0800
0f0836b7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching

Pull livepatching updates from Jiri Kosina:

- RO/NX attribute fixes for patch module relocations from Josh
Poimboeuf. As part of this effort, module.c has been cleaned up as
well and livepatching is piggy-backing on this cleanup. Rusty is OK
with this whole lot going through livepatching tree.

- symbol disambiguation support from Chris J Arges. That series is
also

Reviewed-by: Miroslav Benes

but this came in only after I've already pushed out. Didn't want to
rebase because of that, hence I am mentioning it here.

- symbol lookup fix from Miroslav Benes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Cleanup module page permission changes
module: keep percpu symbols in module's symtab
module: clean up RO/NX handling.
module: use a structure to encapsulate layout.
gcov: use within_module() helper.
module: Use the same logic for setting and unsetting RO/NX
livepatch: function,sympos scheme in livepatch sysfs directory
livepatch: add sympos as disambiguator field to klp_reloc
livepatch: add old_sympos as disambiguator field to klp_func

Linus Torvalds
2016-01-15 08:38:02 +0800
c2848f2ee Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid

Pull HID updates from Jiri Kosina:

- appoint Benjamin Tissoires as co-maintainer / designated reviewer

- sysfs report_descriptor visibility fix for unclaimed devices, from
Andy Lutomirski

- suspend/resume fixes for Sony driver from Frank Praznik

- IRQ deadlock fix from Ioan-Adrian Ratiu

- hid-i2c fixes affecting (at least) Yoga 900 from Mika Westerberg and
Srinivas Pandruvada

- a lot of new device support (especially, but not limited to, Wacom)
and assorted small misc fixes

- almost complete G920 support; the only bit that is missing is
switching the device to HID mode automatically; Simon Wood and Michal
Maly are working on it.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (46 commits)
Revert "INPUT: xpad: switch Logitech G920 Wheel into HID mode"
HID: sensor-hub: Add quirk for Lenovo Yoga 900 with ITE Chips
HID: Add new PID for Microchip Pick16F1454
HID: wacom: Use correct report to query pen ID from INTUOSHT2 devices
HID: i2c-hid: Prevent sending reports from racing with device reset
HID: use kobj_to_dev()
HID: wiimote: use dev_to_wii()
HID: add a new helper to_hid_driver()
HID: use to_hid_device()
HID: move to_hid_device() to hid.h
HID: usbhid: use to_usb_device
HID: corsair: Convert to use module_hid_driver
HID: input: ignore the battery in OKLICK Laser BTmouse
HID: wacom: Fix pad button range for CINTIQ_COMPANION_2
HID: wacom: Fix touchring value reporting
HID: wacom: Report 'strip2' values in ABS_RY
HID: wacom: Limit touchstrip data to 13 bits
HID: wacom: bitwise vs logical ORs
HID: wacom: Apply lowres quirk to BAMBOO_TOUCH devices
HID: enable hid device to suspend/resume asynchronously
...

Linus Torvalds
2016-01-15 08:20:42 +0800
75f26df6a Merge tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
"Highlights include:

Stable fixes:
- Fix a regression in the SunRPC socket polling code
- Fix the attribute cache revalidation code
- Fix race in __update_open_stateid()
- Fix an lo->plh_block_lgets imbalance in layoutreturn
- Fix an Oopsable typo in ff_mirror_match_fh()

Features:
- pNFS layout recall performance improvements.
- pNFS/flexfiles: Support server-supplied layoutstats sampling period

Bugfixes + cleanups:
- NFSv4: Don't perform cached access checks before we've OPENed the
file
- Fix starvation issues with background flushes
- Reclaim writes should be flushed as unstable writes if there are
already entries in the commit lists
- Various bugfixes from Chuck to fix NFS/RDMA send queue ordering
problems
- Ensure that we propagate fatal layoutget errors back to the
application
- Fixes for sundry flexfiles layoutstats bugs
- Fix files/flexfiles to not cache invalidated layouts in the DS
commit buckets"

* tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()
NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()
NFS: Use wait_on_atomic_t() for unlock after readahead
SUNRPC: Fixup socket wait for memory
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
SUNRPC: Fix a missing break in rpc_anyaddr()
pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
NFS: Fix attribute cache revalidation
NFS: Ensure we revalidate attributes before using execute_ok()
...

Linus Torvalds
2016-01-15 08:08:23 +0800
63f729cb4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fix from Al Viro:
"Don't put symlink bodies in pagecache into highmem"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Make sure that highmem pages are not added to symlink page cache

Linus Torvalds
2016-01-15 08:03:57 +0800
7d2eba055 mm: add tracepoint for scanning pages

This patch series performs swapin readahead, up to a certain number of
pages, to gain more THP performance, and adds tracepoints for
khugepaged_scan_pmd, collapse_huge_page and __collapse_huge_page_isolate.

This patch series was written to deal with programs that access most,
but not all, of their memory after they get swapped out. Currently
these programs do not get their memory collapsed into THPs after the
system swapped their memory out, while they would get THPs before
swapping happened.

This patch series was tested with a test program that allocates 400MB
of memory, writes to it, and then sleeps. I then force the system to
swap all of it out. Afterwards, the test program touches the area by
writing, leaving a piece of it unwritten. This shows how much swapin
readahead the patch performs.

Test results:

After swapped out
------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
------------------------------------------------------------------
With patch    | 90076 kB  | 88064 kB      | 309928 kB | %99      |
------------------------------------------------------------------
Without patch | 194068 kB | 192512 kB     | 205936 kB | %99      |
------------------------------------------------------------------

After swapped in
------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
------------------------------------------------------------------
With patch    | 201408 kB | 198656 kB     | 198596 kB | %98      |
------------------------------------------------------------------
Without patch | 292624 kB | 192512 kB     | 107380 kB | %65      |
------------------------------------------------------------------

This patch (of 3):

Static tracepoints record data about the functions they instrument.
This helps automate debugging without making a lot of changes to the
source code.

This patch adds tracepoints for khugepaged_scan_pmd, collapse_huge_page
and __collapse_huge_page_isolate.

[dan.carpenter@oracle.com: add a missing tab]
Signed-off-by: Ebru Akagunduz
Acked-by: Kirill A. Shutemov
Acked-by: Rik van Riel
Cc: Naoya Horiguchi
Cc: Andrea Arcangeli
Cc: Joonsoo Kim
Cc: Xie XiuQi
Cc: Cyrill Gorcunov
Cc: Mel Gorman
Cc: David Rientjes
Cc: Vlastimil Babka
Cc: Aneesh Kumar K.V
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ebru Akagunduz
2016-01-15 08:00:49 +0800
cb5490a5e drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64

Fix a bug where a kernel warning is triggered when performing a memory
hotplug on ppc64. This warning may also occur on any architecture that
uses the memory_probe_store interface.

WARNING: at drivers/base/memory.c:200
CPU: 9 PID: 13042 Comm: systemd-udevd Not tainted 4.4.0-rc4-00113-g0bd0f1e-dirty #7
NIP [c00000000055e034] pages_correctly_reserved+0x134/0x1b0
LR [c00000000055e7f8] memory_subsys_online+0x68/0x140
Call Trace:
memory_subsys_online+0x68/0x140
device_online+0xb4/0x120
store_mem_state+0xb0/0x180
dev_attr_store+0x34/0x60
sysfs_kf_write+0x64/0xa0
kernfs_fop_write+0x17c/0x1e0
__vfs_write+0x40/0x160
vfs_write+0xb8/0x200
SyS_write+0x60/0x110
system_call+0x38/0xd0

The warning is triggered because there is a udev rule that automatically
tries to online memory after it has been added. The udev rule varies
from distro to distro, but will generally look something like:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

On any architecture that uses memory_probe_store to reserve memory, the
udev rule will be triggered after the first section of the block is
reserved and will subsequently attempt to online the entire block,
interrupting the memory reservation process and causing the warning.
This patch modifies memory_probe_store to add a block of memory with a
single call to add_memory as opposed to looping through and adding each
section individually. A single call to add_memory is protected by the
mem_hotplug mutex which will prevent the udev rule from onlining memory
until the reservation of the entire block is complete.

Signed-off-by: John Allen
Acked-by: Dave Hansen
Cc: Nathan Fontenot
Cc: Michael Ellerman
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Allen
2016-01-15 08:00:49 +0800
fec174d66 mm/page_isolation: use macro to judge the alignment

Signed-off-by: Wang Xiaoqiang
Reviewed-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2016-01-15 08:00:49 +0800
543dfb2df mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()

Running sparse on drivers/staging/lustre results in dozens of warnings:
include/linux/gfp.h:281:41: warning: odd constant _Bool cast (400000
becomes 1)

Use "!!" to explicitly convert to bool and get rid of the warning.

Signed-off-by: Joshua Clayton
Cc: Mel Gorman
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joshua Clayton
2016-01-15 08:00:49 +0800
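
The "!!" idiom mentioned in the sparse-warning commit above can be shown in isolation. This is a minimal sketch, not the gfp.h code: `FLAG_SKETCH` is a hypothetical name, and its value assumes that the 400000 quoted in the warning is hexadecimal.

```c
/* Hypothetical flag bit; value assumed from the constant in the warning. */
#define FLAG_SKETCH 0x400000u

/* Raw masked value: either 0 or the flag bit itself (0x400000). */
unsigned int flag_raw(unsigned int flags)
{
	return flags & FLAG_SKETCH;
}

/*
 * Double negation normalises any non-zero value to exactly 1, making the
 * conversion to a truth value explicit instead of relying on a cast.
 */
int flag_is_set(unsigned int flags)
{
	return !!(flags & FLAG_SKETCH);
}
```

The first "!" maps any non-zero value to 0 and zero to 1; the second flips that back, so the result is always 0 or 1.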
846383359 mm: rework virtual memory accounting

While inspecting some unclear code inside the prctl(PR_SET_MM_MEM) call
(which tests the RLIMIT_DATA value to figure out whether we're allowed
to assign new @start_brk, @brk, @start_data, @end_data from mm_struct),
it was noticed that RLIMIT_DATA, in the form it's implemented now,
doesn't do anything useful because most user-space libraries use the
mmap() syscall for dynamic memory allocations.

Linus suggested converting the RLIMIT_DATA rlimit into something
suitable for anonymous memory accounting. But in this patch we go
further, and the changes are bundled together as:

* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas

This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:

VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be a strange beast like a readonly-private or
VM_IO area.

- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Cyrill Gorcunov
Cc: Quentin Casasnovas
Cc: Vegard Nossum
Acked-by: Linus Torvalds
Cc: Willy Tarreau
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Vladimir Davydov
Cc: Pavel Emelyanov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2016-01-15 08:00:49 +0800
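
The code/stack/data classification listed in the accounting commit above can be written out as a small predicate. This is a sketch under stated assumptions rather than the kernel's vm_stat_account(): the flag values are illustrative, `enum vm_class` and `classify_vm_flags()` are hypothetical names, and the order in which the three rules are tested is an assumption.

```c
#include <stdint.h>

/* Illustrative flag bits; the values are assumptions for this sketch. */
#define VM_WRITE     0x002u
#define VM_EXEC      0x004u
#define VM_SHARED    0x008u
#define VM_GROWSDOWN 0x100u
#define VM_GROWSUP   0x200u

/* Hypothetical result type; OTHER covers "the rest" from the message. */
enum vm_class {
	VM_CLASS_CODE,
	VM_CLASS_STACK,
	VM_CLASS_DATA,
	VM_CLASS_OTHER
};

/* Classify a mapping purely from its flags, following the three rules. */
enum vm_class classify_vm_flags(uint32_t flags)
{
	int is_stack = (flags & (VM_GROWSUP | VM_GROWSDOWN)) != 0;

	/* VM_EXEC & ~VM_WRITE -> code */
	if ((flags & VM_EXEC) && !(flags & VM_WRITE))
		return VM_CLASS_CODE;

	/* VM_GROWSUP | VM_GROWSDOWN -> stack */
	if (is_stack)
		return VM_CLASS_STACK;

	/* VM_WRITE & ~VM_SHARED & !stack -> data */
	if ((flags & VM_WRITE) && !(flags & VM_SHARED))
		return VM_CLASS_DATA;

	return VM_CLASS_OTHER;
}
```

A writable shared mapping falls through to VM_CLASS_OTHER here, which corresponds to the remainder that the commit message says could loosely be called "shared".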
d30b5545b include/linux/memblock.h: fix ordering of 'flags' argument in comments

for_each_free_mem_range() and for_each_free_mem_range_reverse() both
accept a 'flags' argument, the comment surrounding the macro placed the
'flags' documentation at the very end, while 'flags' is in fact the 3rd
argument to the macro, so let's preserve natural ordering here.

Fixes: fc6daaf931518 ("mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute")
Signed-off-by: Florian Fainelli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Florian Fainelli
2016-01-15 08:00:49 +0800
d72ee9111 mm: move lru_to_page to mm_inline.h

Move lru_to_page() from internal.h to mm_inline.h.

Signed-off-by: Geliang Tang
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Geliang Tang
2016-01-15 08:00:49 +0800
0bc126d46 Documentation/filesystems: describe the shared memory usage/accounting

Shared memory accounting support has been present in the kernel since
commit 4b02108ac1b3 ("mm: oom analysis: add shmem vmstat") and in
userland free(1) since 2014. This patch updates the documentation to
reflect this change.

Signed-off-by: Rodrigo Freire
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rodrigo Freire
2016-01-15 08:00:49 +0800
6f754ba4c memory-hotplug: don't BUG() in register_memory_resource()

An out-of-memory condition is not a bug, and while we can't add new
memory in such a case, crashing the system seems wrong. Propagating the
return value from register_memory_resource() requires an interface
change.

Signed-off-by: Vitaly Kuznetsov
Reviewed-by: Igor Mammedov
Acked-by: David Rientjes
Cc: Tang Chen
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: Sheng Yong
Cc: Zhu Guihua
Cc: Dan Williams
Cc: David Vrabel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Kuznetsov
2016-01-15 08:00:49 +0800
3e89e1c5e hugetlb: make mm and fs code explicitly non-modular

The Kconfig currently controlling compilation of this code is:

config HUGETLBFS
bool "HugeTLB file system support"

...meaning that it currently is not being built as a module by anyone.

Let's remove the modular code that is essentially orphaned, so that when
reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular case,
the init ordering gets moved to earlier levels when we use the more
appropriate initcalls here.

Originally I had the fs part and the mm part as separate commits, just
by happenstance of the nature of how I detected these non-modular use
cases. But that can possibly introduce regressions if the patch merge
ordering puts the fs part 1st -- as the 0-day testing reported a splat
at mount time.

Investigating with "initcall_debug" showed that the delta was
init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
both the fs change and the mm change are here together.

In addition, it worked before due to luck of link order, since they were
both in the same initcall category. So we now have the fs part using
fs_initcall, and the mm part using subsys_initcall, which puts it one
bucket earlier. It now passes the basic sanity test that failed in
earlier 0-day testing.

We delete the MODULE_LICENSE tag and capture that information at the top
of the file alongside author comments, etc.

We don't replace module.h with init.h since the file already has that.
Also note that MODULE_ALIAS is a no-op for non-modular code.

Signed-off-by: Paul Gortmaker
Reported-by: kernel test robot
Cc: Nadia Yvette Chambers
Cc: Alexander Viro
Cc: Naoya Horiguchi
Reviewed-by: Mike Kravetz
Cc: David Rientjes
Cc: Hillf Danton
Acked-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Gortmaker
2016-01-15 08:00:49 +0800
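
The ordering problem described in the hugetlb commit above can be modelled with a toy scheduler: init functions run level by level, and within one level in link (registration) order. This is a hedged sketch, not kernel code; `struct initcall_sketch` and `runs_before()` are hypothetical, and the level numbers are illustrative, chosen only so that subsys comes before fs and fs before device.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative level numbers: lower levels run earlier. */
enum initcall_level {
	LEVEL_SUBSYS = 4,
	LEVEL_FS = 5,
	LEVEL_DEVICE = 6
};

/* Hypothetical registration record; array order models link order. */
struct initcall_sketch {
	const char *name;
	enum initcall_level level;
};

/*
 * Walk the calls in execution order (by level, then by array position)
 * and return 1 if 'first' is reached before 'second', otherwise 0.
 */
int runs_before(const struct initcall_sketch *calls, size_t n,
		const char *first, const char *second)
{
	for (int level = LEVEL_SUBSYS; level <= LEVEL_DEVICE; level++) {
		for (size_t i = 0; i < n; i++) {
			if ((int)calls[i].level != level)
				continue;
			if (strcmp(calls[i].name, first) == 0)
				return 1;
			if (strcmp(calls[i].name, second) == 0)
				return 0;
		}
	}
	return 0;
}
```

With both functions at the device level, the result depends only on array (link) order, which is the "luck of link order" the message refers to. Placing hugetlb_init at the subsys level and init_hugetlbfs_fs at the fs level makes the ordering independent of link order.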
0d576d20c mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations

Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Geliang Tang
2016-01-15 08:00:49 +0800
0e41e2779 mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()

clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY);
VM_SOFTDIRTY was already cleared before walk_page_range().

Signed-off-by: Oleg Nesterov
Acked-by: Kirill A. Shutemov
Acked-by: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2016-01-15 08:00:49 +0800
bb5b85897 mm: make sure isolate_lru_page() is never called for tail page

The VM_BUG_ON_PAGE() would catch such cases if any still exist.

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-01-15 08:00:49 +0800
0eb77e988 vmstat: make vmstat_updater deferrable again and shut down on idle

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51 ("vmstat: do not use deferrable delayed work for
vmstat_update"). This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at

Make vmstat_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode, thus
addressing the issue of the above commit in a clean way.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
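
The interplay between folding on idle and the shepherd can be modelled in user space roughly as follows. This is a simplified sketch under stated assumptions, not the kernel implementation: NR_CPUS, the arrays, and every function name here are hypothetical, and real per-CPU data, work queues and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for the sketch */

static long global_count;            /* folded, globally visible counter */
static int cpu_diff[NR_CPUS];        /* per-CPU differentials */
static bool worker_enabled[NR_CPUS]; /* is the periodic updater armed? */

/* Record an event in the per-CPU differential of the given CPU. */
static void count_event(int cpu, int delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_diff[cpu] += delta;
}

/* Fold one CPU's differential into the global counter.
 * Returns true if there was anything to fold. */
static bool fold_diff(int cpu)
{
	if (cpu_diff[cpu] == 0)
		return false;
	global_count += cpu_diff[cpu];
	cpu_diff[cpu] = 0;
	return true;
}

/* Called when a CPU goes idle: fold now and stop the updater, so the
 * idle CPU is not woken up just to update statistics. */
static void enter_idle(int cpu)
{
	fold_diff(cpu);
	worker_enabled[cpu] = false;
}

/* Shepherd, running on another CPU: re-arm the updater for any CPU
 * whose differentials have changed since it was shut down. */
static void shepherd_scan(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (!worker_enabled[cpu] && cpu_diff[cpu] != 0)
			worker_enabled[cpu] = true;
}

/* Periodic updater: fold, and switch itself off when nothing changed. */
static void worker_run(int cpu)
{
	if (worker_enabled[cpu] && !fold_diff(cpu))
		worker_enabled[cpu] = false;
}
```

In this model a CPU that stays idle never runs worker_run(); it only pays for one fold on the way into idle, and the shepherd restarts the updater once new differentials appear.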

Fixes: ba4877b9ca51 ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2016-01-15 08:00:49 +0800
686739f6a memcg: avoid vmpressure oops when memcg disabled

A CONFIG_MEMCG=y kernel booted with "cgroup_disable=memory" crashes on a
NULL memcg (but non-NULL root_mem_cgroup) when vmpressure kicks in.
Here's the patch I use to avoid that, but you might prefer a test on
mem_cgroup_disabled() somewhere.

Signed-off-by: Hugh Dickins
Acked-by: Johannes Weiner
Cc: David S. Miller
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2016-01-15 08:00:49 +0800
ef12947c9 mm: memcontrol: switch to the updated jump-label API

The direct use of struct static_key is deprecated. Update the socket
and slab accounting code accordingly.
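
For readers unfamiliar with the newer interface, the sketch below shows its general shape: DEFINE_STATIC_KEY_FALSE(), static_branch_unlikely(), static_branch_inc() and static_branch_dec() are kernel names, but the definitions given here are plain user-space mocks backed by a counter (the kernel patches code at run time instead). The key name, charge() and the enable/disable helpers are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space mocks of the jump-label API. These are NOT the kernel
 * definitions; they only mimic the calling convention. */
struct static_key_false { int enabled; };

#define DEFINE_STATIC_KEY_FALSE(name) struct static_key_false name = { 0 }
#define static_branch_unlikely(key)   ((key)->enabled > 0)
#define static_branch_inc(key)        ((void)(key)->enabled++)
#define static_branch_dec(key)        ((void)(key)->enabled--)

/* Hypothetical key guarding an accounting slow path. */
static DEFINE_STATIC_KEY_FALSE(accounting_enabled_key);

static long charged_pages;

/* The fast path skips accounting entirely while the key is off. */
static void charge(long nr_pages)
{
	if (static_branch_unlikely(&accounting_enabled_key))
		charged_pages += nr_pages;
}

/* Each user of the slow path takes a reference on the key. */
static void enable_accounting(void)
{
	static_branch_inc(&accounting_enabled_key);
}

static void disable_accounting(void)
{
	assert(accounting_enabled_key.enabled > 0);
	static_branch_dec(&accounting_enabled_key);
}
```

The older style declared a bare struct static_key and tested it with static_key_false(); the newer style encodes the default state in the key's type and the expected likelihood in the branch helper.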

Signed-off-by: Johannes Weiner
Acked-by: David S. Miller
Reported-by: Jason Baron
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
8e8ae6452 mm: memcontrol: hook up vmpressure to socket pressure

Let the networking stack know when a memcg is under reclaim pressure so
that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
in the socket and tcp memory code that tells it to curb consumption
growth from sockets associated with said control group.

Traditionally, vmpressure reports for the entire subtree of a memcg
under pressure, which drops useful information on the individual groups
reclaimed. However, it's too late to change the user interface, so add a
second reporting mode that reports on the level of reclaim instead of at
the level of pressure, and use that report for sockets.

vmpressure events are naturally edge triggered, so for hysteresis assert
socket pressure for a second to allow for subsequent vmpressure events
to occur before letting the socket code return to normal.
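
The one-second hysteresis can be sketched as a deadline that every MEDIUM or HIGH event pushes forward. This is a minimal user-space model, assuming a hypothetical tick counter with TICKS_PER_SEC ticks per second; struct group, report_pressure() and under_socket_pressure() are illustrative names, not the patch's identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TICKS_PER_SEC 1000UL /* hypothetical tick rate */

enum pressure_level { PRESSURE_LOW, PRESSURE_MEDIUM, PRESSURE_HIGH };

struct group {
	unsigned long socket_pressure_until; /* tick at which pressure ends */
};

/* Edge-triggered event: MEDIUM or HIGH (re)arms a one-second window. */
static void report_pressure(struct group *g, enum pressure_level level,
			    unsigned long now)
{
	assert(g != NULL);
	if (level >= PRESSURE_MEDIUM)
		g->socket_pressure_until = now + TICKS_PER_SEC;
}

/* Level query used by the socket code: true while inside the window.
 * The signed difference keeps the comparison safe across wraparound. */
static bool under_socket_pressure(const struct group *g, unsigned long now)
{
	assert(g != NULL);
	return (long)(now - g->socket_pressure_until) < 0;
}
```

A follow-up event inside the window simply extends it, so a steady stream of events keeps the sockets clamped, while a single event expires on its own after one second.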

This will likely need fine-tuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.

Signed-off-by: Johannes Weiner
Acked-by: David S. Miller
Reviewed-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
f7e1cb6ec mm: memcontrol: account socket memory in unified hierarchy memory controller

Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on the
boot command line to override any runtime configuration and forcibly
exclude socket memory from active memory resource control.

Signed-off-by: Johannes Weiner
Acked-by: David S. Miller
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
110920876 mm: memcontrol: move socket code for unified hierarchy accounting

The unified hierarchy memory controller will account socket memory.
Move the infrastructure functions accordingly.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
7941d2145 mm: memcontrol: do not account memory+swap on unified hierarchy

The unified hierarchy memory controller doesn't expose the memory+swap
counter to userspace, but its accounting is hardcoded in all charge
paths right now, including the per-cpu charge cache ("the stock").

To avoid adding yet more pointless memory+swap accounting with the
socket memory support in unified hierarchy, disable the counter
altogether when in unified hierarchy mode.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
80e95fe0f mm: memcontrol: generalize the socket accounting jump label

The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks. Move it to the memory
controller code and give it a more generic name.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
baac50bbc net: tcp_memcontrol: simplify linkage between socket and page counter

There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and link
sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
e805605c7 net: tcp_memcontrol: sanitize tcp memory accounting callbacks

There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.
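
The loophole can be made concrete with a small numeric model. This is only a sketch of the two comparison schemes as the paragraph above describes them; struct counter, old_may_charge() and new_may_charge() are hypothetical names, and the real code paths are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

struct counter {
	long usage; /* pages currently charged */
	long limit; /* maximum pages allowed */
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Old scheme as described: per-memcg usage is compared against the
 * smaller of the two limits; global usage is never consulted. */
static bool old_may_charge(const struct counter *memcg,
			   const struct counter *global, long nr)
{
	assert(nr >= 0);
	return memcg->usage + nr <= min_long(memcg->limit, global->limit);
}

/* New scheme: each level is checked against its own limit. */
static bool new_may_charge(const struct counter *memcg,
			   const struct counter *global, long nr)
{
	assert(nr >= 0);
	return memcg->usage + nr <= memcg->limit &&
	       global->usage + nr <= global->limit;
}
```

With a global limit of 100 pages and 90 already in use, a group using 30 of its 80 pages could still charge 20 more under the old comparison (50 <= 80), pushing the total to 110; the new comparison rejects the charge because 110 exceeds the global limit.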

Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable. However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either. However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchronously when packets are processed). When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.
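
A minimal model of the resulting pressure handling, assuming a single group and a single global flag: pressure is entered when a packet is dropped at the hard limit, left lazily when the next packet is accepted, and a socket counts as under pressure if either level says so. struct cg, receive_packet(), release_pages() and under_pressure() are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>

struct cg {
	long usage;    /* pages charged to the group */
	long limit;    /* hard limit in pages */
	bool pressure; /* per-group memory pressure state */
};

static bool global_pressure; /* global memory pressure state */

/* Try to charge an incoming packet. Returns false if it is dropped. */
static bool receive_packet(struct cg *cg, long pages)
{
	assert(pages >= 0);
	if (cg->usage + pages > cg->limit) {
		cg->pressure = true; /* enter pressure on a drop */
		return false;
	}
	cg->usage += pages;
	cg->pressure = false; /* leave lazily, on the next accepted packet */
	return true;
}

/* Uncharging does not touch the pressure state. */
static void release_pages(struct cg *cg, long pages)
{
	assert(pages >= 0 && pages <= cg->usage);
	cg->usage -= pages;
}

/* A socket is under pressure when either level asserts it. */
static bool under_pressure(const struct cg *cg)
{
	return cg->pressure || global_pressure;
}
```

Note that freeing memory alone does not clear the per-group state in this model; it stays set until a later packet is actually accepted, which is the lazy behaviour the message describes.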

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800
80f23124f net: tcp_memcontrol: simplify the per-memcg limit access

tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit. Use the latter directly.

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-15 08:00:49 +0800