Eric Lee / smarc-fsl-linux-kernel

02 Apr, 2016

6 commits

05cf8077e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Missing device reference in IPSEC input path results in crashes
during device unregistration. From Subash Abhinov Kasiviswanathan.

2) Per-queue ISR register writes not being done properly in macb
driver, from Cyrille Pitchen.

3) Stats accounting bugs in bcmgenet, from Patri Gynther.

4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
Quentin Armitage.

5) SXGBE driver has off-by-one in probe error paths, from Rasmus
Villemoes.

6) Fix race in save/swap/delete options in netfilter ipset, from
Vishwanath Pai.

7) Ageing time of bridge not set properly when not operating over a
switchdev device. Fix from Haishuang Yan.

8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
Duyck.

9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

10) FEC driver should only access registers that actually exist on the
given chipset, fix from Fabio Estevam.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
net: mvneta: fix changing MTU when using per-cpu processing
stmmac: fix MDIO settings
Revert "stmmac: Fix 'eth0: No PHY found' regression"
stmmac: fix TX normal DESC
net: mvneta: use cache_line_size() to get cacheline size
net: mvpp2: use cache_line_size() to get cacheline size
net: mvpp2: fix maybe-uninitialized warning
tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
rtnl: fix msg size calculation in if_nlmsg_size()
fec: Do not access unexisting register in Coldfire
net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
bpf: make padding in bpf_tunnel_key explicit
ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
bnxt_en: Fix ethtool -a reporting.
bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
bnxt_en: Implement proper firmware message padding.
...

Linus Torvalds
2016-04-02 09:03:33 +0800
bbe3de256 mm/page_isolation: fix tracepoint to mirror check function behavior ... Browse Code »

Page isolation has not failed if the fin pfn extends beyond the end pfn
and test_pages_isolated checks this correctly. Fix the tracepoint to
report the same result as the actual check function.

Signed-off-by: Lucas Stach
Acked-by: Vlastimil Babka
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lucas Stach
2016-04-02 06:03:37 +0800
969e8d7e4 include/linux/huge_mm.h: return NULL instead of false for pmd_trans_huge_lock() ... Browse Code »

The return value of pmd_trans_huge_lock() is a pointer, not a boolean
value, so use NULL instead of false as the return value.

Signed-off-by: Chen Gang
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen Gang
2016-04-02 06:03:37 +0800
a7657f128 stmmac: fix MDIO settings ... Browse Code »

Initially the phy_bus_name was added to manipulate the
driver name but it was recently just used to manage the
fixed-link and then to take some decision at run-time.
So the patch uses the is_pseudo_fixed_link and removes
the phy_bus_name variable not necessary anymore.

The driver can manage the mdio registration by using phy-handle,
dwmac-mdio and own parameter e.g. snps,phy-addr.
This patch takes care about all these possible configurations
and fixes the mdio registration in case of there is a real
transceiver or a switch (that needs to be managed by using
fixed-link).

Signed-off-by: Giuseppe Cavallaro
Reviewed-by: Andreas Färber
Tested-by: Frank Schäfer
Cc: Gabriel Fernandez
Cc: Dinh Nguyen
Cc: David S. Miller
Cc: Phil Reid
Signed-off-by: David S. Miller

Giuseppe CAVALLARO
2016-04-02 02:38:59 +0800
d7e944c8d Revert "stmmac: Fix 'eth0: No PHY found' regression" ... Browse Code »

This reverts commit 88f8b1bb41c6208f81b6a480244533ded7b59493.
due to problems on GeekBox and Banana Pi M1 board when
connected to a real transceiver instead of a switch via
fixed-link.

Signed-off-by: Giuseppe Cavallaro
Cc: Gabriel Fernandez
Cc: Andreas Färber
Cc: Frank Schäfer
Cc: Dinh Nguyen
Cc: David S. Miller
Signed-off-by: David S. Miller

Giuseppe CAVALLARO
2016-04-02 02:38:58 +0800
5a5abb1fa tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter ... Browse Code »

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

[ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
[ 52.765688] other info that might help us debug this:
[ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
[ 52.765701] 1 lock held by a.out/1525:
[ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
[ 52.765721] stack backtrace:
[ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
[...]
[ 52.765768] Call Trace:
[ 52.765775] [] dump_stack+0x85/0xc8
[ 52.765784] [] lockdep_rcu_suspicious+0xd5/0x110
[ 52.765792] [] sk_detach_filter+0x82/0x90
[ 52.765801] [] tun_detach_filter+0x35/0x90 [tun]
[ 52.765810] [] __tun_chr_ioctl+0x354/0x1130 [tun]
[ 52.765818] [] ? selinux_file_ioctl+0x130/0x210
[ 52.765827] [] tun_chr_ioctl+0x13/0x20 [tun]
[ 52.765834] [] do_vfs_ioctl+0x96/0x690
[ 52.765843] [] ? security_file_ioctl+0x43/0x60
[ 52.765850] [] SyS_ioctl+0x79/0x90
[ 52.765858] [] do_syscall_64+0x62/0x140
[ 52.765866] [] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in 994051625981 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.

Reported-by: Sasha Levin
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2016-04-02 02:33:46 +0800

31 Mar, 2016

1 commit

c0e760c9c bpf: make padding in bpf_tunnel_key explicit ... Browse Code »

Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
and tunnel_label members explicit. No issue has been observed, and
gcc/llvm does padding for the old struct already, where tunnel_label
was not yet present, so the current code works, but since it's part
of uapi, make sure we don't introduce holes in structs.

Therefore, add tunnel_ext that we can use generically in future
(f.e. to flag OAM messages for backends, etc). Also add the offset
to the compat tests to be sure should some compilers not padd the
tail of the old version of bpf_tunnel_key.

Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2016-03-31 07:01:33 +0800

29 Mar, 2016

1 commit

fc0c20281 x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem() ... Browse Code »

Update the definition of memcpy_from_pmem() to return 0 or a negative
error code. Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().

Cc: Borislav Petkov
Cc: Tony Luck
Cc: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Linus Torvalds
Acked-by: Ingo Molnar
Reviewed-by: Ross Zwisler
Signed-off-by: Dan Williams

Dan Williams
2016-03-29 08:19:31 +0800

28 Mar, 2016

1 commit

596cf3fe5 netfilter: ipset: fix race condition in ipset save, swap and delete ... Browse Code »

This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to *_head functions is because previously we would
increment ref whenever we called these functions, we don't do that
anymore.

Reviewed-by: Joshua Hunt
Signed-off-by: Vishwanath Pai
Signed-off-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso

Vishwanath Pai
2016-03-28 23:57:45 +0800

27 Mar, 2016

3 commits

d5a38f6e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...

Linus Torvalds
2016-03-27 06:53:16 +0800
b4cec5f66 Merge tag 'ntb-4.6' of git://github.com/jonmason/ntb ... Browse Code »

Pull NTB bug fixes from Jon Mason:
"NTB bug fixes for tasklet from spinning forever, link errors,
translation window setup, NULL ptr dereference, and ntb-perf errors.

Also, a modification to the driver API that makes _addr functions
optional"

* tag 'ntb-4.6' of git://github.com/jonmason/ntb:
NTB: Remove _addr functions from ntb_hw_amd
NTB: Make _addr functions optional in the API
NTB: Fix incorrect clean up routine in ntb_perf
NTB: Fix incorrect return check in ntb_perf
ntb: fix possible NULL dereference
ntb: add missing setup of translation window
ntb: stop link work when we do not have memory
ntb: stop tasklet from spinning forever during shutdown.
ntb: perf test: fix address space confusion

Linus Torvalds
2016-03-27 02:37:42 +0800
895a1067d Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi ... Browse Code »

Pull more SCSI updates from James Bottomley:
"The only new stuff which missed the first pull request is an update to
the UFS driver.

The rest is an assortment of bug fixes and minor tweaks which appeared
recently (some are fixes for recent code and some are stuff spotted
recently by the checkers or the new gcc-6 compiler [most of Arnd's
stuff])"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
scsi_common: do not clobber fixed sense information
scsi: ufs: select CONFIG_NLS
scsi: fc: use get/put_unaligned64 for wwn access
fnic: move printk()s outside of the critical code section.
qla2xxx: avoid maybe_uninitialized warning
megaraid_sas: add missing curly braces in ioctl handler
lpfc: fix misleading indentation
scsi_transport_sas: add 'scsi_target_id' sysfs attribute
scsi_dh_alua: uninitialized variable in alua_check_vpd()
scsi: ufs-qcom: add printouts of testbus debug registers
scsi: ufs-qcom: enable/disable the device ref clock
scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
scsi: ufs: add device quirk delay before putting UFS rails in LPM
scsi: ufs: fix leakage during link off state
scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
scsi: ufs: handle non spec compliant bkops behaviour by device
scsi: ufs: add retry for query descriptors
scsi: ufs: add error recovery after DL NAC error
scsi: ufs: make error handling bit faster
scsi: ufs: disable vccq if it's not needed by UFS device
...

Linus Torvalds
2016-03-27 02:31:01 +0800

26 Mar, 2016

20 commits

cd11016e5 mm, kasan: stackdepot implementation. Enable stackdepot for SLAB ... Browse Code »

Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot
will allow KASAN store allocation/deallocation stack traces for memory
chunks. The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator. Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.

[akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabinin@virtuozzo.com: comment style fixes]
Signed-off-by: Alexander Potapenko
Signed-off-by: Andrey Ryabinin
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Steven Rostedt
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Potapenko
2016-03-26 07:37:42 +0800
be7635e72 arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections ... Browse Code »

KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.

Move the definition of __irq_entry to so that the
users don't need to pull in . Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.

Signed-off-by: Alexander Potapenko
Acked-by: Steven Rostedt
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Potapenko
2016-03-26 07:37:42 +0800
505f5dcb1 mm, kasan: add GFP flags to KASAN API ... Browse Code »

Add GFP flags to KASAN hooks for future patches to use.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Steven Rostedt
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Potapenko
2016-03-26 07:37:42 +0800
7ed2f9e66 mm, kasan: SLAB support ... Browse Code »

Add KASAN hooks to SLAB allocator.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Steven Rostedt
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Potapenko
2016-03-26 07:37:42 +0800
aaf4fb712 include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill() ... Browse Code »

A leftover from commit c32b3cbe0d06 ("oom, PM: make OOM detection in the
freezer path raceless").

Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tetsuo Handa
2016-03-26 07:37:42 +0800
bb29902a7 oom, oom_reaper: protect oom_reaper_list using simpler way ... Browse Code »

"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
by simply checking tsk->oom_reaper_list != NULL.

Signed-off-by: Tetsuo Handa
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tetsuo Handa
2016-03-26 07:37:42 +0800
29c696e1c oom: make oom_reaper_list single linked ... Browse Code »

Entries are only added/removed from oom_reaper_list at head so we can
use a single linked list and hence save a word in task_struct.

Signed-off-by: Vladimir Davydov
Signed-off-by: Michal Hocko
Cc: Tetsuo Handa
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-03-26 07:37:42 +0800
855b01832 oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task ... Browse Code »

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
the same task multiple times - e.g. by sacrificing the same child
multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.

Signed-off-by: Michal Hocko
Reported-by: Tetsuo Handa
Cc: David Rientjes
Cc: Mel Gorman
Cc: Oleg Nesterov
Cc: Hugh Dickins
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-03-26 07:37:42 +0800
03049269d mm, oom_reaper: implement OOM victims queuing ... Browse Code »

wake_oom_reaper has allowed only 1 oom victim to be queued. The main
reason for that was the simplicity as other solutions would require some
way of queuing. The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck

out_of_memory
wake_oom_reaper
cmpxchg // OK
oom_reaper
oom_reap_task
__oom_reap_task
oom_victim terminates
atomic_inc_not_zero // fail
out_of_memory
wake_oom_reaper
cmpxchg // fails
task_to_reap = NULL

This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible. E.g. the original victim
might have not released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used a more
robust queuing. This is what this patch implements. This means adding
oom_reaper_list list_head into task_struct (eat a hole before embeded
thread_struct for that purpose) and a oom_reaper_lock spinlock for
queuing synchronization. wake_oom_reaper will then add the task on the
queue and oom_reaper will dequeue it.

Signed-off-by: Michal Hocko
Cc: Vladimir Davydov
Cc: Andrea Argangeli
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Oleg Nesterov
Cc: Rik van Riel
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-03-26 07:37:42 +0800
36324a990 oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space ... Browse Code »

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.

This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.

Signed-off-by: Michal Hocko
Suggested-by: Johannes Weiner
Suggested-by: Tetsuo Handa
Cc: Andrea Argangeli
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Oleg Nesterov
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-03-26 07:37:42 +0800
aac453635 mm, oom: introduce oom reaper ... Browse Code »

This patch (of 5):

This is based on the idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory. Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event (e.g.
lock) which is blocked by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned
by the oom victim under an assumption that such a memory won't be needed
when its owner is killed and kicked from the userspace anyway. There is
one notable exception to this, though, if the OOM victim was in the
process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between. Users of
mmap_sem which need it for write should be carefully reviewed to use
_killable waiting as much as possible and reduce allocations requests
done with the lock held to absolute minimum to reduce the risk even
further.

The API between oom killer and oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
NULL->mm transition and oom_reaper clear this atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
the time. As the operation is potentially disruptive we are trying to
limit it to the ncessary minimum and the reaper blocks any updates while
it operates on an mm. mm_struct is pinned by mm_count to allow parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).

Signed-off-by: Michal Hocko
Suggested-by: Oleg Nesterov
Suggested-by: Mel Gorman
Acked-by: Mel Gorman
Acked-by: David Rientjes
Cc: Mel Gorman
Cc: Tetsuo Handa
Cc: Oleg Nesterov
Cc: Hugh Dickins
Cc: Andrea Argangeli
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-03-26 07:37:42 +0800
69b27baf0 sched: add schedule_timeout_idle() ... Browse Code »

This will be needed in the patch "mm, oom: introduce oom reaper".

Acked-by: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2016-03-26 07:37:42 +0800
315f24088 ceph: fix security xattr deadlock ... Browse Code »

When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.

The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng

Yan, Zheng
2016-03-26 01:51:55 +0800
2c63f49a7 libceph: add helper that duplicates last extent operation ... Browse Code »

This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2016-03-26 01:51:43 +0800
3f1af42ad libceph: enable large, variable-sized OSD requests ... Browse Code »

Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:43 +0800
7665d85b7 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op ... Browse Code »

This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2016-03-26 01:51:42 +0800
de2aa102e libceph: rename ceph_osd_req_op::payload_len to indata_len ... Browse Code »

Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:41 +0800
168b9090c libceph: monc hunt rate is 3s with backoff up to 30s ... Browse Code »

Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:39 +0800
58d81b129 libceph: monc ping rate is 10s ... Browse Code »

Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:39 +0800
82dcabad7 libceph: revamp subs code, switch to SUBSCRIBE2 protocol ... Browse Code »

It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime". To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs. Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:38 +0800

25 Mar, 2016

8 commits

f98c2135f Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux ... Browse Code »

Pull drm fixes from Dave Airlie:
"Just a couple of dma-buf related fixes and some amdgpu fixes, along
with a regression fix for radeon off but default feature, but makes my
30" monitor happy again"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/mst: cleanup code indentation
drm/radeon/mst: fix regression in lane/link handling.
drm/amdgpu: add invalidate_page callback for userptrs
drm/amdgpu: Revert "remove the userptr rmn->lock"
drm/amdgpu: clean up path handling for powerplay
drm/amd/powerplay: fix memory leak of tdp_table
dma-buf/fence: fix fence_is_later v2
dma-buf: Update docs for SYNC ioctl
drm: remove excess description
dma-buf, drm, ion: Propagate error code from dma_buf_start_cpu_access()
drm/atmel-hlcdc: use helper to get crtc state
drm/atomic: use helper to get crtc state

Linus Torvalds
2016-03-25 23:48:31 +0800
4cef191d0 net: phy: bcm7xxx: Add entries for Broadcom BCM7346 and BCM7362 ... Browse Code »

Add PHY entries for the Broadcom BCM7346 and BCM7362 chips, these are
40nm generation Ethernet PHY.

Fixes: 815717d1473e ("net: phy: bcm7xxx: Remove wildcard entries")
Signed-off-by: Jaedon Shin
Signed-off-by: David S. Miller

Jaedon Shin
2016-03-25 23:37:57 +0800
11caf57f6 Merge tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic ... Browse Code »

Pull asm-generic updates from Arnd Bergmann:
"There are only three patches this time, most other changes to files in
include/asm-generic tend to go through the tree of whoever depends on
the change.

Two patches are cleanups for stuff that is no longer needed, the main
change is to adapt the generic version of BUG_ON() for CONFIG_BUG=n to
make it behave consistently with BUG().

This avoids undefined behavior along with a number of warnings about
that undefined behavior in randconfig builds when we keep going on
after hitting a BUG_ON()"

* tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: remove old nonatomic-io wrapper files
asm-generic: default BUG_ON(x) to if(x)BUG()
asm-generic: page.h: Remove useless get_user_page and free_user_page

Linus Torvalds
2016-03-25 14:13:48 +0800
3d66c6ba3 Merge tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull more power management and ACPI updates from Rafael Wysocki:
"The second batch of power management and ACPI updates for v4.6.

Included are fixups on top of the previous PM/ACPI pull request and
other material that didn't make into it but still should go into 4.6.

Among other things, there's a fix for an intel_pstate driver issue
uncovered by recent cpufreq changes, a workaround for a boot hang on
Skylake-H related to the handling of deep C-states by the platform and
a PCI/ACPI fix for the handling of IO port resources on non-x86
architectures plus some new device IDs and similar.

Specifics:

- Fix for an intel_pstate driver issue related to the handling of MSR
updates uncovered by the recent cpufreq rework (Rafael Wysocki).

- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking fix
for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).

- intel_idle driver update preventing some Skylake-H systems from
hanging during initialization by disabling deep C-states mishandled
by the platform in the problematic configurations (Len Brown).

- Intel Xeon Phi Processor x200 support for intel_idle
(Dasaratharaman Chandramouli).

- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the fallback
C-state on x86 when they are set below its exit latency) and to
restore the previous behavior to fall back to C1 if the next timer
event is set far enough in the future that was changed in 4.4 which
led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).

- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).

- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).

- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

- Fix for the ACPI backend of the generic device properties API to
make it parse non-device (data node only) children of an ACPI
device correctly (Irina Tirdea).

- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).

- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).

- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven)"

* tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM / AVS: rockchip-io: add io selectors and supplies for rk3399
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
ACPI / PM: Runtime resume devices when waking from hibernate
PM / sleep: Clear pm_suspend_global_flags upon hibernate
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
PCI: ACPI: IA64: fix IO port generic range check
ACPI / util: cast data to u64 before shifting to fix sign extension
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpuidle: menu: Fall back to polling if next timer event is near
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
ACPI / property: fix data node parsing in acpi_get_next_subnode()
ACPI / APD: Add device HID for future AMD UART controller
...

Linus Torvalds
2016-03-25 13:59:58 +0800
1d02369db Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"Final round of fixes for this merge window - some of this has come up
after the initial pull request, and some of it was put in a post-merge
branch before the merge window.

This contains:

- Fix for a bad check for an error on dma mapping in the mtip32xx
driver, from Alexey Khoroshilov.

- A set of fixes for lightnvm, from Javier, Matias, and Wenwei.

- An NVMe completion record corruption fix from Marta, ensuring that
we read things in the right order.

- Two writeback fixes from Tejun, marked for stable@ as well.

- A blk-mq sw queue iterator fix from Thomas, fixing an oops for
sparse CPU maps. They hit this in the hot plug/unplug rework"

* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: avoid cqe corruption when update at the same time as read
writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
blk-mq: Use proper cpumask iterator
mtip32xx: fix checks for dma mapping errors
lightnvm: do not load L2P table if not supported
lightnvm: do not reserve lun on l2p loading
nvme: lightnvm: return ppa completion status
lightnvm: add a bitmap of luns
lightnvm: specify target's logical address area
null_blk: add lightnvm null_blk device to the nullb_list

Linus Torvalds
2016-03-25 11:00:44 +0800
8f40842e4 Merge tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd ... Browse Code »

Pull MTD updates from Brian Norris:
"NAND:
- Add sunxi_nand randomizer support
- begin refactoring NAND ecclayout structs
- fix pxa3xx_nand dmaengine usage
- brcmnand: fix support for v7.1 controller
- add Qualcomm NAND controller driver

SPI NOR:
- add new ls1021a, ls2080a support to Freescale QuadSPI
- add new flash ID entries
- support bottom-block protection for Winbond flash
- support Status Register Write Protect
- remove broken QPI support for Micron SPI flash

JFFS2:
- improve post-mount CRC scan efficiency

General:
- refactor bcm63xxpart parser, to later extend for NAND
- add writebuf size parameter to mtdram

Other minor code quality improvements"

* tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd: (72 commits)
mtd: nand: remove kerneldoc for removed function parameter
mtd: nand: Qualcomm NAND controller driver
dt/bindings: qcom_nandc: Add DT bindings
mtd: nand: don't select chip in nand_chip's block_bad op
mtd: spi-nor: support lock/unlock for a few Winbond chips
mtd: spi-nor: add TB (Top/Bottom) protect support
mtd: spi-nor: add SPI_NOR_HAS_LOCK flag
mtd: spi-nor: use BIT() for flash_info flags
mtd: spi-nor: disallow further writes to SR if WP# is low
mtd: spi-nor: make lock/unlock bounds checks more obvious and robust
mtd: spi-nor: silently drop lock/unlock for already locked/unlocked region
mtd: spi-nor: wait for SR_WIP to clear on initial unlock
mtd: nand: simplify nand_bch_init() usage
mtd: mtdswap: remove useless if (!mtd->ecclayout) test
mtd: create an mtd_oobavail() helper and make use of it
mtd: kill the ecclayout->oobavail field
mtd: nand: check status before reporting timeout
mtd: bcm63xxpart: give width specifier an 'int', not 'size_t'
mtd: mtdram: Add parameter for setting writebuf size
mtd: nand: pxa3xx_nand: kill unused field 'drcmr_cmd'
...

Linus Torvalds
2016-03-25 10:57:15 +0800
8b306a2e7 Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull more nfsd updates from Bruce Fields:
"Apologies for the previous request, which omitted the top 8 commits
from my for-next branch (including the SCSI layout commits). Thanks
to Trond for spotting my error!"

This actually includes the new layout types, so here's that part of
the pull message repeated:

"Support for a new pnfs layout type from Christoph Hellwig. The new
layout type is a variant of the block layout which uses SCSI features
to offer improved fencing and device identification.

Note this pull request also includes the client side of SCSI layout,
with Trond's permission"

* tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux:
nfsd: use short read as well as i_size to set eof
nfsd: better layoutupdate bounds-checking
nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK
nfsd: add SCSI layout support
nfsd: move some blocklayout code
nfsd: add a new config option for the block layout driver
nfs/blocklayout: add SCSI layout support
nfs4.h: add SCSI layout definitions

Linus Torvalds
2016-03-25 10:50:32 +0800
2162b80fc Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild ... Browse Code »

Pull kbuild updates from Michal Marek:

- make dtbs_install fix

- Error handling fix fixdep and link-vmlinux.sh

- __UNIQUE_ID fix for clang

- Fix for if_changed_* to suppress the "is up to date." message

- The kernel is built with -Werror=incompatible-pointer-types

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kbuild: Add option to turn incompatible pointer check into error
kbuild: suppress annoying "... is up to date." message
kbuild: fixdep: Check fstat(2) return value
scripts/link-vmlinux.sh: force error on kallsyms failure
Kbuild: provide a __UNIQUE_ID for clang
dtbsinstall: don't move target directory out of the way

Linus Torvalds
2016-03-25 10:26:47 +0800