Eric Lee / smarc-fsl-linux-kernel

06 Apr, 2016

1 commit

e43569e6d sctp: flush if we can't fit another DATA chunk

There is no point in delaying the packet if we can't fit a single byte
of data on it anymore. So let's just reduce the threshold by the amount
that a data chunk with 4 bytes (rounding) would use.

v2: based on the right tree

Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2016-04-06 03:39:44 +0800
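
The threshold change in the sctp commit above can be pictured with a small Python model. This is only a sketch, not the kernel code: the function name, its parameters, and the sample sizes are assumptions made for illustration. The 16-byte DATA chunk header size is taken from RFC 4960.

```python
# Illustrative model only; names and sample numbers are assumptions.
SCTP_DATA_CHUNK_HDR = 16  # DATA chunk header size per RFC 4960
MIN_PADDED_PAYLOAD = 4    # chunk payloads are padded to a 4-byte boundary


def should_flush(chunk_len, queued_len, pathmtu, overhead):
    """Return True when delaying the packet cannot make it any fuller."""
    # Room that remains once the smallest useful DATA chunk is reserved.
    threshold = pathmtu - overhead - SCTP_DATA_CHUNK_HDR - MIN_PADDED_PAYLOAD
    return chunk_len + queued_len > threshold
```

With a path MTU of 1500 and 48 bytes of overhead, 1440 pending bytes already trigger a flush in this model, because fewer than 20 bytes would remain for one more chunk.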

05 Apr, 2016

3 commits

c862cc9b7 bridge: Fix incorrect variable assignment on error path in br_sysfs_addbr

This fixes the incorrect variable assignment on the error path in
br_sysfs_addbr. When the call to kobject_create_and_add fails, err
must be set to -EINVAL. Otherwise the function returns zero, because
err still holds the successful return value of the earlier call to
sysfs_create_group, and callers think the function has succeeded.

Signed-off-by: Bastien Philbert
Signed-off-by: David S. Miller

Bastien Philbert
2016-04-05 04:12:37 +0800
be447f305 ipv6: l2tp: fix a potential issue in l2tp_ip6_recv

pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
right place.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller

Haishuang Yan
2016-04-05 04:00:28 +0800
5745b8232 ipv4: l2tp: fix a potential issue in l2tp_ip_recv

pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
right place.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller

Haishuang Yan
2016-04-05 04:00:28 +0800
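
The hazard behind the two l2tp fixes above can be imitated in Python. The `Skb` class below is a hypothetical toy model, not the kernel's sk_buff: its `may_pull()` stands in for pskb_may_pull() and replaces the linear buffer when it has to pull bytes in, which is why a pointer loaded earlier goes stale.

```python
class Skb:
    """Toy packet buffer: a linear part plus not-yet-linear fragment bytes."""

    def __init__(self, linear, frags):
        self.data = bytearray(linear)
        self.frags = bytearray(frags)

    def may_pull(self, n):
        """Make the first n bytes linear; may replace self.data."""
        if n <= len(self.data):
            return True
        need = n - len(self.data)
        if need > len(self.frags):
            return False
        # Concatenation builds a NEW buffer, like a reallocation would.
        self.data = self.data + self.frags[:need]
        del self.frags[:need]
        return True


skb = Skb(b"\x00\x00\x00\x00", b"\x01\x02\x03\x04")
stale = skb.data        # loaded too early
ok = skb.may_pull(8)
fresh = skb.data        # loaded at the right place, after the pull
```

After the pull, `stale` still refers to the old 4-byte buffer, while `fresh` sees all 8 bytes.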

02 Apr, 2016

2 commits

05cf8077e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Missing device reference in IPSEC input path results in crashes
during device unregistration. From Subash Abhinov Kasiviswanathan.

2) Per-queue ISR register writes not being done properly in macb
driver, from Cyrille Pitchen.

3) Stats accounting bugs in bcmgenet, from Petri Gynther.

4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
Quentin Armitage.

5) SXGBE driver has off-by-one in probe error paths, from Rasmus
Villemoes.

6) Fix race in save/swap/delete options in netfilter ipset, from
Vishwanath Pai.

7) Ageing time of bridge not set properly when not operating over a
switchdev device. Fix from Haishuang Yan.

8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
Duyck.

9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

10) FEC driver should only access registers that actually exist on the
given chipset, fix from Fabio Estevam.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
net: mvneta: fix changing MTU when using per-cpu processing
stmmac: fix MDIO settings
Revert "stmmac: Fix 'eth0: No PHY found' regression"
stmmac: fix TX normal DESC
net: mvneta: use cache_line_size() to get cacheline size
net: mvpp2: use cache_line_size() to get cacheline size
net: mvpp2: fix maybe-uninitialized warning
tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
rtnl: fix msg size calculation in if_nlmsg_size()
fec: Do not access unexisting register in Coldfire
net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
bpf: make padding in bpf_tunnel_key explicit
ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
bnxt_en: Fix ethtool -a reporting.
bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
bnxt_en: Implement proper firmware message padding.
...

Linus Torvalds
2016-04-02 09:03:33 +0800
5a5abb1fa tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

[ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
[ 52.765688] other info that might help us debug this:
[ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
[ 52.765701] 1 lock held by a.out/1525:
[ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
[ 52.765721] stack backtrace:
[ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
[...]
[ 52.765768] Call Trace:
[ 52.765775] [] dump_stack+0x85/0xc8
[ 52.765784] [] lockdep_rcu_suspicious+0xd5/0x110
[ 52.765792] [] sk_detach_filter+0x82/0x90
[ 52.765801] [] tun_detach_filter+0x35/0x90 [tun]
[ 52.765810] [] __tun_chr_ioctl+0x354/0x1130 [tun]
[ 52.765818] [] ? selinux_file_ioctl+0x130/0x210
[ 52.765827] [] tun_chr_ioctl+0x13/0x20 [tun]
[ 52.765834] [] do_vfs_ioctl+0x96/0x690
[ 52.765843] [] ? security_file_ioctl+0x43/0x60
[ 52.765850] [] SyS_ioctl+0x79/0x90
[ 52.765858] [] do_syscall_64+0x62/0x140
[ 52.765866] [] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereference the
filter with rcu_dereference_protected(), checking whether the socket
lock is held in the control path.

Since its introduction in 994051625981 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.

Reported-by: Sasha Levin
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2016-04-02 02:33:46 +0800
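
The idea of the tun/bpf fix above, passing the lock condition in from the caller, can be sketched in Python. Everything here is a hypothetical model: the dictionary-based socket, the function names, and the exception that stands in for the lockdep warning are assumptions, not kernel interfaces.

```python
import errno


def rcu_dereference_protected(ptr, condition):
    """Model of the lockdep check: complain when the claimed lock is not held."""
    if not condition:
        raise RuntimeError("suspicious rcu_dereference_protected() usage")
    return ptr


def detach_filter_checked(sk, locked):
    """Shared helper: the caller states which lock protects the filter."""
    filt = rcu_dereference_protected(sk["filter"], locked)
    if filt is None:
        return -errno.ENOENT
    sk["filter"] = None
    return 0


def sk_detach_filter(sk):
    # Regular sockets: protected by the socket lock.
    return detach_filter_checked(sk, sk["owned_by_user"])


def tun_detach_filter(sk, rtnl_held):
    # Tap filters: protected by RTNL, so that is the condition passed in.
    return detach_filter_checked(sk, rtnl_held)
```

A tap socket whose filter is managed under RTNL never has the socket lock held, so only the second entry point passes the check in this model.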

01 Apr, 2016

1 commit

c57c7a95d rtnl: fix msg size calculation in if_nlmsg_size()

Size of the attribute IFLA_PHYS_PORT_NAME was missing.

Fixes: db24a9044ee1 ("net: add support for phys_port_name")
CC: David Ahern
Signed-off-by: Nicolas Dichtel
Acked-by: David Ahern
Signed-off-by: David S. Miller

Nicolas Dichtel
2016-04-01 04:49:54 +0800
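
To see why one missing attribute matters for the rtnl fix above, here is a minimal Python sketch of netlink attribute sizing. The 4-byte attribute header and 4-byte alignment mirror the netlink definitions; treating IFLA_PHYS_PORT_NAME as an IFNAMSIZ-sized (16-byte) payload is an assumption made for illustration, as is the helper `msg_size()`.

```python
NLA_ALIGNTO = 4
NLA_HDRLEN = 4   # size of struct nlattr, already aligned
IFNAMSIZ = 16    # assumed payload size for IFLA_PHYS_PORT_NAME


def nla_align(length):
    """Round length up to the netlink attribute alignment."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_total_size(payload):
    """Bytes one attribute occupies in a message, header and padding included."""
    return nla_align(NLA_HDRLEN + payload)


def msg_size(attr_payloads):
    """Hypothetical helper: total size of a list of attribute payloads."""
    return sum(nla_total_size(p) for p in attr_payloads)


without_name = msg_size([4, 4])
with_name = msg_size([4, 4, IFNAMSIZ])
```

In this model the estimate that forgets the name attribute comes up 20 bytes short.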

31 Mar, 2016

5 commits

c0e760c9c bpf: make padding in bpf_tunnel_key explicit

Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
and tunnel_label members explicit. No issue has been observed, and
gcc/llvm does padding for the old struct already, where tunnel_label
was not yet present, so the current code works, but since it's part
of uapi, make sure we don't introduce holes in structs.

Therefore, add tunnel_ext that we can use generically in future
(f.e. to flag OAM messages for backends, etc). Also add the offset
to the compat tests to be sure should some compilers not pad the
tail of the old version of bpf_tunnel_key.

Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2016-03-31 07:01:33 +0800
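
The hole described in the bpf_tunnel_key commit above can be reproduced with Python's ctypes, which lays out structures with the platform's native C alignment. The two classes are simplified stand-ins: the address union is reduced to a single 16-byte array named `remote`, and the class names are made up for this sketch.

```python
import ctypes


class KeyImplicitPad(ctypes.Structure):
    """Layout with a compiler-inserted 2-byte hole before tunnel_label."""
    _fields_ = [
        ("tunnel_id", ctypes.c_uint32),
        ("remote", ctypes.c_uint32 * 4),   # stand-in for the address union
        ("tunnel_tos", ctypes.c_uint8),
        ("tunnel_ttl", ctypes.c_uint8),
        ("tunnel_label", ctypes.c_uint32),
    ]


class KeyExplicitPad(ctypes.Structure):
    """Same layout, but the hole is a named member (tunnel_ext)."""
    _fields_ = [
        ("tunnel_id", ctypes.c_uint32),
        ("remote", ctypes.c_uint32 * 4),
        ("tunnel_tos", ctypes.c_uint8),
        ("tunnel_ttl", ctypes.c_uint8),
        ("tunnel_ext", ctypes.c_uint16),
        ("tunnel_label", ctypes.c_uint32),
    ]


implicit_label_offset = KeyImplicitPad.tunnel_label.offset
explicit_label_offset = KeyExplicitPad.tunnel_label.offset
```

Both variants place tunnel_label at the same offset and have the same size, which is why naming the padding does not change the binary layout.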
2d4212261 ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates

IPv6 counters updates use a different macro than IPv4.

Fixes: 36cbb2452cbaf ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts")
Signed-off-by: Eric Dumazet
Cc: Rick Jones
Cc: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2016-03-31 07:01:33 +0800
c3483384e gro: Allow tunnel stacking in the case of FOU/GUE

This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO. The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.

Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2016-03-31 04:02:33 +0800
28fd34985 sctp: really allow using GFP_KERNEL on sctp_packet_transmit

Somehow my patch for commit cea8768f333e ("sctp: allow
sctp_transmit_packet and others to use gfp") missed two important
chunks, which are now added.

Fixes: cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp")
Signed-off-by: Marcelo Ricardo Leitner
Acked-By: Neil Horman
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2016-03-31 03:41:22 +0800
5e263f712 bridge: Allow set bridge ageing time when switchdev disabled

When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP,
we should ignore this error code and continue to set the ageing time.

Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev")
Signed-off-by: Haishuang Yan
Acked-by: Ido Schimmel
Signed-off-by: David S. Miller

Haishuang Yan
2016-03-31 03:38:13 +0800
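
The error handling described in the bridge ageing-time commit above amounts to treating one specific error code as "nothing to offload". A minimal Python sketch follows, with a dictionary standing in for the bridge and a callback standing in for switchdev_port_attr_set(); both are assumptions for illustration.

```python
import errno


def set_ageing_time(bridge, ageing_time, switchdev_set):
    """Apply the ageing time unless the offload fails with a real error."""
    err = switchdev_set(ageing_time)
    # -EOPNOTSUPP only means there is no switchdev support: carry on.
    if err and err != -errno.EOPNOTSUPP:
        return err
    bridge["ageing_time"] = ageing_time
    return 0


no_switchdev = {}
rc_unsupported = set_ageing_time(no_switchdev, 300, lambda t: -errno.EOPNOTSUPP)
broken = {}
rc_failed = set_ageing_time(broken, 300, lambda t: -errno.EINVAL)
```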

29 Mar, 2016

1 commit

0c84ea17f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) There was a race condition between parallel save/swap and delete,
which resulted in a kernel crash: save increased the reference count,
swap exchanged it, and the later decrease hit the wrong set. Reported
and fixed by Vishwanath Pai.

2) OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. From Jarno Rajahalme.

3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

4) Early validation of entry->target_offset to make sure it doesn't take us
out from the blob, from Florian Westphal.

5) Again early validation of entry->next_offset to make sure it doesn't take
us out from the blob, also from Florian.

6) Check that entry->target_offset is always sizeof(struct xt_entry)
for unconditional entries, when checking both from check_underflow()
and when checking for loops in mark_source_chains(), again from
Florian.

7) Fix inconsistent behaviour in nfnetlink_queue when
NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
overrun, we have to reinject the packet as the user expects.

8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
requests.

9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
this fixes a recent update of the code to namespaceify
ip_default_ttl, patch from Liping Zhang.

This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unprivileged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-03-29 03:38:59 +0800

28 Mar, 2016

11 commits

29421198c netfilter: ipv4: fix NULL dereference

Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
uses sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always set, so when it is NULL, an oops will happen.

Signed-off-by: Liping Zhang
Reviewed-by: Nikolay Borisov
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
2016-03-28 23:59:29 +0800
b301f2538 netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES

Make sure the table name passed via getsockopt GET_ENTRIES is
nul-terminated in ebtables and all the x_tables variants and their
respective compat code. Uncovered by KASAN.

Reported-by: Baozeng Ding
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2016-03-28 23:59:24 +0800
931401137 netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails

When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.

I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.

The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.

Reported-by: "Yigal Reiss (yreiss)"
Signed-off-by: Pablo Neira Ayuso
Tested-by: "Yigal Reiss (yreiss)"

Pablo Neira Ayuso
2016-03-28 23:59:20 +0800
54d83fc74 netfilter: x_tables: fix unconditional helper

Ben Hawkes says:

In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.

Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so it's supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function unconditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeded to the next (non-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.

Reported-by: Ben Hawkes
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-03-28 23:59:15 +0800
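
The unified definition of an "unconditional rule" from the commit above boils down to two conditions that must both hold. A hedged Python sketch, in which the `Rule` class and its fields are invented for illustration and do not mirror struct ipt_entry:

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    wildcard_addresses: bool                       # matches any source/destination
    matches: list = field(default_factory=list)    # extra "-m" matches, if any


def unconditional(rule):
    """A rule is unconditional only with wildcard addresses AND no matches."""
    return rule.wildcard_addresses and not rule.matches


plain = Rule(wildcard_addresses=True)
with_match = Rule(wildcard_addresses=True, matches=["tcp"])
```

Checking only the addresses would wrongly classify `with_match` as unconditional, which is the disagreement between callers that the commit removes.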
6e94e0cfb netfilter: x_tables: make sure e->next_offset covers remaining blob size

Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-03-28 23:59:08 +0800
bdf533de6 netfilter: x_tables: validate e->target_offset early

We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-03-28 23:59:04 +0800
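
The two x_tables offset checks above (target_offset and next_offset) can be summarized as simple bounds tests on a user-supplied blob. The Python function below is a sketch under assumptions: its name, its parameters, and the header sizes used in the example are hypothetical, not the kernel's definitions.

```python
def check_entry_offsets(entry_pos, target_offset, next_offset,
                        blob_size, target_hdr_size):
    """Return True when both offsets keep every access inside the blob."""
    # The target header must fit before the next entry begins.
    if target_offset + target_hdr_size > next_offset:
        return False
    # The entry itself must end inside the ruleset blob.
    if entry_pos + next_offset > blob_size:
        return False
    return True


# Example with made-up sizes: a 112-byte entry header and a 32-byte target.
sane = check_entry_offsets(0, 112, 144, 144, 32)
huge_next = check_entry_offsets(0, 112, 4096, 144, 32)
bad_target = check_entry_offsets(0, 4000, 144, 144, 32)
```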
99b7248e2 openvswitch: call only into reachable nf-nat code

The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases and also make the code more readable.

The same #ifdef exists in ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.

Signed-off-by: Arnd Bergmann
Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer
Signed-off-by: Pablo Neira Ayuso

Arnd Bergmann
2016-03-28 23:58:59 +0800
5745b0be0 openvswitch: Fix checking for new expected connections.

OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection. The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.

Fixes: 05752523e565 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter
Signed-off-by: Jarno Rajahalme
Signed-off-by: Pablo Neira Ayuso

Jarno Rajahalme
2016-03-28 23:58:51 +0800
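
The first of the two errors named in the openvswitch commit above, ANDing with a bit number instead of a mask, is easy to demonstrate. In this Python sketch the two constants follow the conntrack definitions (bit number 0, mask 1 << 0); the two helper functions are hypothetical. The actual fix goes further and uses the ctinfo value IP_CT_RELATED, which this sketch does not model.

```python
IPS_EXPECTED_BIT = 0                    # bit number
IPS_EXPECTED = 1 << IPS_EXPECTED_BIT    # mask


def buggy_is_expected(status):
    """ANDs with the bit number; with bit 0 this is always False."""
    return bool(status & IPS_EXPECTED_BIT)


def is_expected(status):
    """ANDs with the mask, as a status test should."""
    return bool(status & IPS_EXPECTED)


status = IPS_EXPECTED   # an expected connection
```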
596cf3fe5 netfilter: ipset: fix race condition in ipset save, swap and delete

This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to the *_head functions are needed because previously
we would increment ref whenever we called these functions; we don't do
that anymore.

Reviewed-by: Joshua Hunt
Signed-off-by: Vishwanath Pai
Signed-off-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso

Vishwanath Pai
2016-03-28 23:57:45 +0800
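
The two-counter scheme from the ipset commit above can be modeled in a few lines of Python. This is a simplified sketch: the `IpSet` class and the boolean return values are assumptions, and the real swap exchanges more than the counter.

```python
class IpSet:
    def __init__(self, name):
        self.name = name
        self.ref = 0           # ordinary references; exchanged by swap()
        self.ref_netlink = 0   # held by netlink dumps; never exchanged


def swap(a, b):
    """Refuse to swap while a dump holds either set."""
    if a.ref_netlink or b.ref_netlink:
        return False
    a.ref, b.ref = b.ref, a.ref
    return True


def destroy(s):
    """Refuse to destroy a set that is still referenced in any way."""
    return s.ref == 0 and s.ref_netlink == 0


ip3, ip2 = IpSet("hash_ip3"), IpSet("hash_ip2")
ip2.ref_netlink = 1                  # "ipset save" is dumping hash_ip2
swap_during_dump = swap(ip3, ip2)
destroy_during_dump = destroy(ip2)
ip2.ref_netlink = 0                  # dump finished
swap_after_dump = swap(ip3, ip2)
```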
ac71b46ef openvswitch: Use proper buffer size in nla_memcpy

For the input parameter count, it's better to use the size of the
destination buffer, as nla_memcpy already takes the length of the
source netlink attribute into account when data is copied from an
attribute.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller

Haishuang Yan
2016-03-28 23:37:14 +0800
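
The reasoning in the nla_memcpy commit above rests on the copy length being the smaller of `count` and the attribute's payload length. A Python sketch of that behaviour, where the function is a model written for this note and operates on byte strings rather than struct nlattr:

```python
def nla_memcpy(dest, src_payload, count):
    """Copy at most count bytes, never more than the source attribute holds."""
    n = min(count, len(src_payload))
    dest[:n] = src_payload[:n]
    return n


big_dest = bytearray(8)
copied_short = nla_memcpy(big_dest, b"abc", len(big_dest))

small_dest = bytearray(2)
copied_clamped = nla_memcpy(small_dest, b"abcdef", len(small_dest))
```

Passing the destination size as `count` is therefore safe in both directions: a short attribute copies fewer bytes, and a long one is clamped to the buffer.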
995096a0a Fix returned tc and hoplimit values for route with IPv6 encapsulation

For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link

Signed-off-by: Quentin Armitage
Signed-off-by: David S. Miller

Quentin Armitage
2016-03-28 10:35:02 +0800

27 Mar, 2016

1 commit

d5a38f6e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...

Linus Torvalds
2016-03-27 06:53:16 +0800

26 Mar, 2016

15 commits

5ee61e95b libceph: use KMEM_CACHE macro

Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang
Signed-off-by: Ilya Dryomov

Geliang Tang
2016-03-26 01:51:57 +0800
89f081730 libceph: use sizeof_footer() more

Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder

Ilya Dryomov
2016-03-26 01:51:53 +0800
2c63f49a7 libceph: add helper that duplicates last extent operation

This helper duplicates the last extent operation in an OSD request,
then adjusts the new extent operation's offset and length. The helper
is for scattered page writeback, which adds nonconsecutive dirty
pages to a single OSD request.

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2016-03-26 01:51:43 +0800
3f1af42ad libceph: enable large, variable-sized OSD requests

Turn r_ops into a flexible array member to enable large OSD requests
consisting of up to 16 ops. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:43 +0800
9e767adbd libceph: osdc->req_mempool should be backed by a slab pool

ceph_osd_request_cache was introduced a long time ago. Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:43 +0800
ae458f5a1 libceph: make r_request msg_size calculation clearer

Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.

No functional change.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:42 +0800
7665d85b7 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op

This avoids defining a large array of r_reply_op_{len,result} in
struct ceph_osd_request.

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov

Yan, Zheng
2016-03-26 01:51:42 +0800
de2aa102e libceph: rename ceph_osd_req_op::payload_len to indata_len

Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:41 +0800
b5d91704f libceph: behave in mon_fault() if cur_mon < 0

This can happen if __close_session() in ceph_monc_stop() races with
a connection reset. We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.

The (two!) con->private tests are useless, as nothing ever clears
con->private. Nuke them.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:40 +0800
bee3a37c4 libceph: reschedule tick in mon_fault()

Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.

What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:40 +0800
1752b50ca libceph: introduce and switch to reopen_session()

hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around. The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:39 +0800
168b9090c libceph: monc hunt rate is 3s with backoff up to 30s

Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multiplier by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:39 +0800
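
The backoff rule in the monc hunt-rate commit above can be traced with a small Python model. The 3 s base interval, the factor of 2, and the 30 s ceiling come from the commit title and text; the class, its method names, and the cap expressed as a multiplier of 10 are assumptions of this sketch.

```python
HUNT_INTERVAL = 3.0   # seconds, base hunt interval
HUNT_BACKOFF = 2      # multiplier applied on every reopen
HUNT_MAX_MULT = 10    # 3 s * 10 = 30 s ceiling


class MonClientModel:
    def __init__(self):
        self.hunt_mult = 1
        self.had_a_connection = False   # False while the client is first set up

    def reopen_session(self):
        # No backoff during initial setup.
        if self.had_a_connection:
            self.hunt_mult = min(self.hunt_mult * HUNT_BACKOFF, HUNT_MAX_MULT)

    def finish_hunting(self):
        # Connection completed: reduce the multiplier by 50%, never below 1.
        self.had_a_connection = True
        self.hunt_mult = max(self.hunt_mult // 2, 1)

    def hunt_delay(self):
        return HUNT_INTERVAL * self.hunt_mult


monc = MonClientModel()
monc.reopen_session()
initial_delay = monc.hunt_delay()
monc.finish_hunting()
for _ in range(5):
    monc.reopen_session()
capped_delay = monc.hunt_delay()
monc.finish_hunting()
```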
58d81b129 libceph: monc ping rate is 10s

Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:39 +0800
0e04dc26c libceph: pick a different monitor when reconnecting

Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.

For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway. __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:38 +0800
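
One way to pick a monitor that differs from the current one, as the commit above requires, is to draw from a range that is one element shorter and skip over the old index. The Python function below is a sketch of that technique; its signature is an assumption and it is not a transcription of pick_new_mon().

```python
import random


def pick_new_mon(num_mon, cur_mon, rng=random):
    """Return a monitor index, avoiding cur_mon whenever there is a choice."""
    if num_mon == 1:
        return 0
    if cur_mon < 0 or cur_mon >= num_mon:
        # No valid previous monitor: any index will do.
        return rng.randrange(num_mon)
    n = rng.randrange(num_mon - 1)
    # Shift indices at or above the old one up by one to skip it.
    return n + 1 if n >= cur_mon else n
```

The old value of cur_mon has to be available at this point, which is why the commit stops clearing it in __close_session().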
82dcabad7 libceph: revamp subs code, switch to SUBSCRIBE2 protocol

It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime". To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs. Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
it before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov

Ilya Dryomov
2016-03-26 01:51:38 +0800