28 Jan, 2015

1 commit


27 Jan, 2015

9 commits

  • Steffen Klassert says:

    ====================
    ipsec 2015-01-26

    Just two small fixes for _decode_session6() where we
    might decode to wrong header information in some rare
    situations.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Lubomir Rintel reported that during replacing a route the interface
    reference counter isn't correctly decremented.

    To quote bug :
    | [root@rhel7-5 lkundrak]# sh -x lal
    | + ip link add dev0 type dummy
    | + ip link set dev0 up
    | + ip link add dev1 type dummy
    | + ip link set dev1 up
    | + ip addr add 2001:db8:8086::2/64 dev dev0
    | + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20
    | + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10
    | + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20
    | + ip link del dev0 type dummy
    | Message from syslogd@rhel7-5 at Jan 23 10:54:41 ...
    | kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
    |
    | Message from syslogd@rhel7-5 at Jan 23 10:54:51 ...
    | kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2

    During replacement of a rt6_info we must walk all parent nodes and check
    if the to be replaced rt6_info got propagated. If so, replace it with
    an alive one.

    Fixes: 4a287eba2de3957 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
    Reported-by: Lubomir Rintel
    Signed-off-by: Hannes Frederic Sowa
    Tested-by: Lubomir Rintel
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • An exception is seen in ICMP ping receive path where the skb
    destructor sock_rfree() tries to access a freed socket. This happens
    because ping_rcv() releases socket reference with sock_put() and this
    internally frees up the socket. Later icmp_rcv() will try to free the
    skb and, as part of this, the skb destructor is called, which leads
    to a kernel panic as the socket was already freed in ping_rcv().

    -->|exception
    -007|sk_mem_uncharge
    -007|sock_rfree
    -008|skb_release_head_state
    -009|skb_release_all
    -009|__kfree_skb
    -010|kfree_skb
    -011|icmp_rcv
    -012|ip_local_deliver_finish

    Fix this incorrect free by cloning this skb and processing this cloned
    skb instead.

    This patch was suggested by Eric Dumazet

    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Eric Dumazet
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    subashab@codeaurora.org
     
  • While working on rhashtable walking I noticed that the UDP diag
    dumping code is buggy. In particular, the socket skipping within
    a chain never happens, even though we record the number of sockets
    that should be skipped.

    As this code was supposedly copied from TCP, this patch does what
    TCP does and resets num before we walk a chain.
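
    The skip logic can be illustrated outside the kernel. Below is a minimal
    sketch, assuming a toy hash table and a resumable dump that records the
    slot and the number of entries already emitted in it. struct cursor,
    table and dump_table() are hypothetical names, not the inet_diag code.

```c
#include <stddef.h>

#define NSLOTS 2
#define CHAIN_MAX 4

/* Hypothetical resume state: slot to restart in, entries to skip there. */
struct cursor {
    int slot;
    int num;
};

/* A toy table: each slot holds a short chain of ids (0 terminates). */
static const int table[NSLOTS][CHAIN_MAX] = {
    { 1, 2, 3, 0 },
    { 4, 5, 0, 0 },
};

/*
 * Copy up to 'room' ids into 'out', resuming from 'cur'.
 * Returns the number of ids copied and updates 'cur'.
 */
static int dump_table(struct cursor *cur, int *out, int room)
{
    int copied = 0;
    int s_num = cur->num;       /* entries already dumped in cur->slot */

    for (int slot = cur->slot; slot < NSLOTS; slot++, s_num = 0) {
        int num = 0;            /* reset for every chain */

        for (int i = 0; i < CHAIN_MAX && table[slot][i]; i++, num++) {
            if (num < s_num)
                continue;       /* dumped by an earlier call */
            if (copied == room) {
                cur->slot = slot;
                cur->num = num;
                return copied;
            }
            out[copied++] = table[slot][i];
        }
    }
    cur->slot = NSLOTS;
    cur->num = 0;
    return copied;
}
```

    Calling dump_table() repeatedly with room for two ids walks the table
    without repeating or dropping an entry; without the per-chain reset the
    comparison against s_num would not line up with the position in the chain.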

    Signed-off-by: Herbert Xu
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • …kernel/git/jberg/mac80211

    Another set of last-minute fixes:
    * fix station double-removal when suspending while associating
    * fix the HT (802.11n) header length calculation
    * fix the CCK radiotap flag used for monitoring, a pretty
    old regression but a simple one-liner
    * fix per-station group-key handling

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Not caching dst_entries which cause redirects could be exploited by hosts
    on the same subnet, causing a severe DoS attack. This effect has been
    aggravated since commit f88649721268999 ("ipv4: fix dst race in
    sk_dst_get()").

    Lookups causing redirects will be allocated with DST_NOCACHE set which
    will force dst_release to free them via RCU. Unfortunately waiting for
    RCU grace period just takes too long, we can end up with >1M dst_entries
    waiting to be released and the system will run OOM. rcuos threads cannot
    catch up under high softirq load.

    Attaching the flag to emit a redirect later on to the specific skb allows
    us to cache those dst_entries thus reducing the pressure on allocation
    and deallocation.

    This issue was discovered by Marcelo Leitner.

    Cc: Julian Anastasov
    Signed-off-by: Marcelo Leitner
    Signed-off-by: Florian Westphal
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • When hitting an INIT collision case during the 4WHS with AUTH enabled, as
    already described in detail in commit 1be9a950c646 ("net: sctp: inherit
    auth_capable on INIT collisions"), it can happen that we occasionally
    still remotely trigger the following panic on server side which seems to
    have been uncovered after the fix from commit 1be9a950c646 ...

    [ 533.876389] BUG: unable to handle kernel paging request at 00000000ffffffff
    [ 533.913657] IP: [] __kmalloc+0x95/0x230
    [ 533.940559] PGD 5030f2067 PUD 0
    [ 533.957104] Oops: 0000 [#1] SMP
    [ 533.974283] Modules linked in: sctp mlx4_en [...]
    [ 534.939704] Call Trace:
    [ 534.951833] [] ? crypto_init_shash_ops+0x60/0xf0
    [ 534.984213] [] crypto_init_shash_ops+0x60/0xf0
    [ 535.015025] [] __crypto_alloc_tfm+0x6d/0x170
    [ 535.045661] [] crypto_alloc_base+0x4c/0xb0
    [ 535.074593] [] ? _raw_spin_lock_bh+0x12/0x50
    [ 535.105239] [] sctp_inet_listen+0x161/0x1e0 [sctp]
    [ 535.138606] [] SyS_listen+0x9d/0xb0
    [ 535.166848] [] system_call_fastpath+0x16/0x1b

    ... or depending on the application, for example this one:

    [ 1370.026490] BUG: unable to handle kernel paging request at 00000000ffffffff
    [ 1370.026506] IP: [] kmem_cache_alloc+0x75/0x1d0
    [ 1370.054568] PGD 633c94067 PUD 0
    [ 1370.070446] Oops: 0000 [#1] SMP
    [ 1370.085010] Modules linked in: sctp kvm_amd kvm [...]
    [ 1370.963431] Call Trace:
    [ 1370.974632] [] ? SyS_epoll_ctl+0x53f/0x960
    [ 1371.000863] [] SyS_epoll_ctl+0x53f/0x960
    [ 1371.027154] [] ? anon_inode_getfile+0xd3/0x170
    [ 1371.054679] [] ? __alloc_fd+0xa7/0x130
    [ 1371.080183] [] system_call_fastpath+0x16/0x1b

    With slab debugging enabled, we can see that the poison has been overwritten:

    [ 669.826368] BUG kmalloc-128 (Tainted: G W ): Poison overwritten
    [ 669.826385] INFO: 0xffff880228b32e50-0xffff880228b32e50. First byte 0x6a instead of 0x6b
    [ 669.826414] INFO: Allocated in sctp_auth_create_key+0x23/0x50 [sctp] age=3 cpu=0 pid=18494
    [ 669.826424] __slab_alloc+0x4bf/0x566
    [ 669.826433] __kmalloc+0x280/0x310
    [ 669.826453] sctp_auth_create_key+0x23/0x50 [sctp]
    [ 669.826471] sctp_auth_asoc_create_secret+0xcb/0x1e0 [sctp]
    [ 669.826488] sctp_auth_asoc_init_active_key+0x68/0xa0 [sctp]
    [ 669.826505] sctp_do_sm+0x29d/0x17c0 [sctp] [...]
    [ 669.826629] INFO: Freed in kzfree+0x31/0x40 age=1 cpu=0 pid=18494
    [ 669.826635] __slab_free+0x39/0x2a8
    [ 669.826643] kfree+0x1d6/0x230
    [ 669.826650] kzfree+0x31/0x40
    [ 669.826666] sctp_auth_key_put+0x19/0x20 [sctp]
    [ 669.826681] sctp_assoc_update+0x1ee/0x2d0 [sctp]
    [ 669.826695] sctp_do_sm+0x674/0x17c0 [sctp]

    Since this only triggers in some collision-cases with AUTH, the problem at
    heart is that sctp_auth_key_put() on asoc->asoc_shared_key is called twice
    when having refcnt 1, once directly in sctp_assoc_update() and yet again
    from within sctp_auth_asoc_init_active_key() via sctp_assoc_update() on
    the already kzfree'd memory, which is also consistent with the observation
    of the poison decrease from 0x6b to 0x6a (note: the overwrite is detected
    at a later point in time when poison is checked on new allocation).

    Reference counting of auth keys revisited:

    Shared keys for AUTH chunks are being stored in endpoints and associations
    in endpoint_shared_keys list. On endpoint creation, a null key is being
    added; on association creation, all endpoint shared keys are being cached
    and thus cloned over to the association. struct sctp_shared_key only holds
    a pointer to the actual key bytes, that is, struct sctp_auth_bytes which
    keeps track of users internally through refcounting. Naturally, on assoc
    or endpoint destruction, sctp_shared_key are being destroyed directly and
    the reference on sctp_auth_bytes dropped.

    User space can add keys to either list via setsockopt(2) through struct
    sctp_authkey and by passing that to sctp_auth_set_key() which replaces or
    adds a new auth key. There, sctp_auth_create_key() creates a new sctp_auth_bytes
    with refcount 1 and in case of replacement drops the reference on the old
    sctp_auth_bytes. A key can be set active from user space through setsockopt()
    on the id via sctp_auth_set_active_key(), which iterates through either
    endpoint_shared_keys and in case of an assoc, invokes (one of various places)
    sctp_auth_asoc_init_active_key().

    sctp_auth_asoc_init_active_key() computes the actual secret from local's
    and peer's random, hmac and shared key parameters and returns a new key
    directly as sctp_auth_bytes, that is asoc->asoc_shared_key, plus drops
    the reference if there was a previous one. The secret, on which we
    eventually double drop the ref, comes from sctp_auth_asoc_set_secret() with
    an initial refcount of 1, which also stays unchanged eventually in
    sctp_assoc_update(). This key is later being used for crypto layer to
    set the key for the hash in crypto_hash_setkey() from sctp_auth_calculate_hmac().

    To close the loop: asoc->asoc_shared_key is freshly allocated secret
    material and independent of the sctp_shared_key management keeping track
    of only shared keys in endpoints and assocs. Hence, also commit 4184b2a79a76
    ("net: sctp: fix memory leak in auth key management") is independent of
    this bug here since it concerns a different layer (though same structures
    being used eventually). asoc->asoc_shared_key is reference dropped correctly
    on assoc destruction in sctp_association_free() and when active keys are
    being replaced in sctp_auth_asoc_init_active_key(), it always has a refcount
    of 1. Hence, it's freed prematurely in sctp_assoc_update(). Simple fix is
    to remove that sctp_auth_key_put() from there which fixes these panics.
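
    The ownership rule can be sketched in plain C. This is an illustration
    only, assuming a simplified refcounted buffer; struct key_bytes,
    struct assoc and the functions below are hypothetical stand-ins and not
    the SCTP structures. The point is that the old secret is dropped in one
    place only, when the new one is installed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted key buffer. */
struct key_bytes {
    int refcnt;
};

static int keys_freed;          /* counts frees, for illustration */

static struct key_bytes *key_create(void)
{
    struct key_bytes *k = malloc(sizeof(*k));

    if (k)
        k->refcnt = 1;          /* creator owns the only reference */
    return k;
}

static void key_put(struct key_bytes *k)
{
    if (k && --k->refcnt == 0) {
        free(k);
        keys_freed++;
    }
}

/* Hypothetical association holding one reference on its active secret. */
struct assoc {
    struct key_bytes *shared_key;
};

/*
 * Install a freshly computed secret. The old one is dropped here and
 * nowhere else, so each reference is put exactly once.
 */
static int assoc_init_active_key(struct assoc *a)
{
    struct key_bytes *secret = key_create();

    if (!secret)
        return -1;
    key_put(a->shared_key);
    a->shared_key = secret;
    return 0;
}

/* Update path: no extra key_put() on a->shared_key before re-initialising. */
static int assoc_update(struct assoc *a)
{
    return assoc_init_active_key(a);
}
```

    An additional key_put(a->shared_key) in assoc_update() would free the
    buffer at refcount 1 and leave assoc_init_active_key() putting already
    freed memory, which is the pattern described above.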

    Fixes: 730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing")
    Signed-off-by: Daniel Borkmann
    Acked-by: Vlad Yasevich
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When creating a bpf classifier in tc with priority collisions and
    invoking automatic unique handle assignment, cls_bpf_grab_new_handle()
    will return a wrong handle id which in fact is non-unique. Usually
    altering of specific filters is being addressed over major id, but
    in case of collisions we result in a filter chain, where handle ids
    address individual cls_bpf_progs inside the classifier.

    Issue is, in cls_bpf_grab_new_handle() we probe for head->hgen handle
    in cls_bpf_get() and in case we found a free handle, we're supposed
    to use exactly head->hgen. In case of insufficient numbers of handles,
    we bail out later as handle id 0 is not allowed.
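
    As a rough sketch of the intended behaviour, assuming a tiny handle
    space and a boolean table in place of the cls_bpf_get() lookup
    (struct head, HANDLE_MAX and grab_new_handle() are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

#define HANDLE_MAX 4u           /* tiny handle space for the sketch */

struct head {
    uint32_t hgen;              /* last generated handle */
    bool used[HANDLE_MAX + 1];  /* used[h] is true if handle h is taken */
};

/* Return a free handle in 1..HANDLE_MAX, or 0 if none is left. */
static uint32_t grab_new_handle(struct head *head)
{
    unsigned int tries = HANDLE_MAX;

    do {
        if (++head->hgen > HANDLE_MAX)
            head->hgen = 1;     /* wrap, skipping the invalid id 0 */
    } while (--tries > 0 && head->used[head->hgen]);

    if (head->used[head->hgen])
        return 0;               /* every handle is taken */
    return head->hgen;          /* exactly the handle that was probed */
}
```

    The value handed back is the one that was just probed and found free,
    and 0 signals that the handle space is exhausted.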

    Fixes: 7d1d65cb84e1 ("net: sched: cls_bpf: add BPF-based classifier")
    Signed-off-by: Daniel Borkmann
    Acked-by: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In cls_bpf_modify_existing(), we read out the number of filter blocks,
    do some sanity checks, allocate a block on that size, and copy over the
    BPF instruction blob from user space, then pass everything through the
    classic BPF checker prior to installation of the classifier.

    We should reject mismatches here, there are 2 scenarios: the number of
    filter blocks could be smaller than the provided instruction blob, so
    we do a partial copy of the BPF program, and thus the instructions will
    either be rejected from the verifier or a valid BPF program will be run;
    in the other case, we'll end up copying more than we're supposed to,
    and most likely the trailing garbage will be rejected by the verifier
    as well (i.e. we need to fit instruction pattern, ret {A,K} needs to be
    last instruction, load/stores must be correct, etc); in case not, we
    would leak memory when dumping back instruction patterns. The code should
    have only used nla_len() as Dave noted to avoid this from the beginning.
    Anyway, let's fix it by rejecting such load attempts.
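
    A minimal sketch of such a consistency check, assuming the instruction
    count and the attribute payload length are already at hand. struct insn
    mirrors the classic BPF instruction layout; check_ops_len() and
    MAX_INSNS are hypothetical names for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSNS 4096u         /* assumed upper bound for the sketch */

/* Layout of a classic BPF instruction: 8 bytes in total. */
struct insn {
    uint16_t code;
    uint8_t jt;
    uint8_t jf;
    uint32_t k;
};

/*
 * Return 0 if 'num_ops' instructions occupy exactly 'blob_len' bytes,
 * -EINVAL otherwise.
 */
static int check_ops_len(uint16_t num_ops, size_t blob_len)
{
    if (num_ops == 0 || num_ops > MAX_INSNS)
        return -EINVAL;
    if ((size_t)num_ops * sizeof(struct insn) != blob_len)
        return -EINVAL;
    return 0;
}
```

    Both mismatch directions fail the same comparison, so neither a
    partial copy nor a copy of trailing bytes can happen after the check.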

    Fixes: 7d1d65cb84e1 ("net: sched: cls_bpf: add BPF-based classifier")
    Signed-off-by: Daniel Borkmann
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

26 Jan, 2015

2 commits

  • In my last commit (a3c00e4: ipv6: Remove BACKTRACK macro), the changes in
    __ip6_route_redirect are incorrect. The following case is missed:
    1. The for loop tries to find a valid gateway rt. If it fails to find
    one, rt will be NULL.
    2. When rt is NULL, it is set to the ip6_null_entry.
    3. The newly added 'else if', from a3c00e4, will stop the backtrack from
    happening.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • When registering an MDIO bus, Linux assumes that every port has a PHY and tries
    to scan it. If a switch port has no PHY registered, DSA will fail to register
    the slave MII bus. To fix this, set the slave MII bus PHY mask to the switch
    PHYs mask.

    As an example, if we use a Marvell MV88E6352 (which is a 7-port switch with no
    registered PHYs for port 5 and port 6), with the following declared names:

    static struct dsa_chip_data switch_cdata = {
            [...]
            .port_names[0] = "sw0",
            .port_names[1] = "sw1",
            .port_names[2] = "sw2",
            .port_names[3] = "sw3",
            .port_names[4] = "sw4",
            .port_names[5] = "cpu",
    };

    DSA will fail to create the switch instance. With the PHY mask set for the
    slave MII bus, only the PHY for ports 0-4 will be scanned and the instance will
    be successfully created.
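
    The mask arithmetic, as a small sketch. It assumes a bus scan that
    skips every address whose bit is set in the skip mask, and a switch
    mask with one bit per port that has a PHY (0x1f for ports 0-4 in the
    example above). scan_skip_mask() and should_scan() are hypothetical
    helpers.

```c
#include <stdint.h>

/*
 * phys_mask has bit N set when port N has a PHY; the returned value has
 * bit N set when address N must be skipped during the bus scan.
 */
static uint32_t scan_skip_mask(uint32_t phys_mask)
{
    return ~phys_mask;
}

/* Non-zero if address 'addr' would be probed under 'skip_mask'. */
static int should_scan(uint32_t skip_mask, unsigned int addr)
{
    return addr < 32 && !(skip_mask & (UINT32_C(1) << addr));
}
```

    With phys_mask 0x1f, addresses 0 to 4 are probed and 5 and 6 are left
    alone.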

    Signed-off-by: Vivien Didelot
    Tested-by: Florian Fainelli
    Acked-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Vivien Didelot
     

25 Jan, 2015

1 commit


23 Jan, 2015

4 commits

  • In case userspace attempts to obtain key information for or delete a
    unicast key, this is currently erroneously rejected unless the driver
    sets the WIPHY_FLAG_IBSS_RSN flag. Apparently enough drivers do so it
    was never noticed.

    Fix that, and while at it fix a potential memory leak: the error path
    in the get_key() function was placed after allocating a message but
    didn't free it - move it to a better place. Luckily admin permissions
    are needed to call this operation.

    Cc: stable@vger.kernel.org
    Fixes: e31b82136d1ad ("cfg80211/mac80211: allow per-station GTKs")
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Fix a regression introduced by commit a5e70697d0c4 ("mac80211: add radiotap flag
    and handling for 5/10 MHz") where the IEEE80211_CHAN_CCK channel type flag was
    incorrectly replaced by the IEEE80211_CHAN_OFDM flag. This commit fixes that by
    using the CCK flag again.

    Cc: stable@vger.kernel.org
    Fixes: a5e70697d0c4 ("mac80211: add radiotap flag and handling for 5/10 MHz")
    Signed-off-by: Mathy Vanhoef
    Signed-off-by: Johannes Berg

    Mathy Vanhoef
     
  • HT Control field may also be present in management frames, as defined
    in 8.2.4.1.10 of 802.11-2012. Account for this in calculation of header
    length.
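
    A sketch of the adjusted calculation for management frames, assuming
    the frame control word is already in host byte order. The constants
    follow the 802.11 frame control layout; mgmt_hdrlen() is a hypothetical
    helper and not the mac80211 function.

```c
#include <stdint.h>

/* Frame control bits, host byte order. */
#define FCTL_FTYPE   0x000c     /* frame type field */
#define FTYPE_MGMT   0x0000     /* management frame */
#define FCTL_ORDER   0x8000     /* Order bit */
#define MGMT_HDR_LEN 24         /* basic management header */
#define HT_CTL_LEN   4          /* HT Control field */

/* Header length of a management frame, or 0 for other frame types. */
static unsigned int mgmt_hdrlen(uint16_t fc)
{
    unsigned int hdrlen = MGMT_HDR_LEN;

    if ((fc & FCTL_FTYPE) != FTYPE_MGMT)
        return 0;               /* not handled in this sketch */
    if (fc & FCTL_ORDER)
        hdrlen += HT_CTL_LEN;   /* HT Control field follows the header */
    return hdrlen;
}
```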

    Signed-off-by: Fred Chou
    Signed-off-by: Johannes Berg

    Fred Chou
     
  • In normal cases (i.e. when we are fully associated), cfg80211 takes
    care of removing all the stations before calling suspend in mac80211.

    But in the corner case when we suspend during authentication or
    association, mac80211 needs to roll back the station states. But we
    shouldn't roll back the station states in the suspend function,
    because this is taken care of in other parts of the code, except for
    WDS interfaces. For AP types of interfaces, cfg80211 takes care of
    disconnecting all stations before calling the driver's suspend code.
    For station interfaces, this is done in the quiesce code.

    For WDS interfaces we still need to do it here, so move the code into
    a new switch case for WDS.

    Cc: stable@kernel.org [3.15+]
    Signed-off-by: Luciano Coelho
    Signed-off-by: Johannes Berg

    Luciano Coelho
     

20 Jan, 2015

1 commit

  • Reduce the attack vector and stop generating IPv6 Fragment Header for
    paths with an MTU smaller than the minimum required IPv6 MTU
    size (1280 bytes) - called atomic fragments.

    See IETF I-D "Deprecating the Generation of IPv6 Atomic Fragments" [1]
    for more information and how this "feature" can be misused.

    [1] https://tools.ietf.org/html/draft-ietf-6man-deprecate-atomfrag-generation-00

    Signed-off-by: Fernando Gont
    Signed-off-by: Hagen Paul Pfeifer
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     

18 Jan, 2015

1 commit

  • I.e. one-to-many sockets in SCTP are not required to explicitly
    call into connect(2) or sctp_connectx(2) prior to data exchange.
    Instead, they can directly invoke sendmsg(2) and the SCTP stack
    will automatically trigger connection establishment through 4WHS
    via sctp_primitive_ASSOCIATE(). However, this in its current
    implementation is racy: INIT is being sent out immediately (as
    it cannot be bundled anyway) and the rest of the DATA chunks are
    queued up for later xmit when connection is established, meaning
    sendmsg(2) will return successfully. This behaviour can result
    in an undesired side-effect that the kernel made the application
    think the data has already been transmitted, although none of it
    has actually left the machine, worst case even after close(2)'ing
    the socket.

    Instead, when the association from client side has been shut down
    e.g. first gracefully through SCTP_EOF and then close(2), the
    client could afterwards still receive the server's INIT_ACK due
    to a connection with higher latency. This INIT_ACK is then considered
    out of the blue and hence responded with ABORT as there was no
    alive assoc found anymore. This can be easily reproduced f.e.
    with sctp_test application from lksctp. One way to fix this race
    is to wait for the handshake to actually complete.

    The fix defers waiting after sctp_primitive_ASSOCIATE() and
    sctp_primitive_SEND() succeeded, so that DATA chunks cooked up
    from sctp_sendmsg() have already been placed into the output
    queue through the side-effect interpreter, and therefore can then
    be bundled together with COOKIE_ECHO control chunks.

    strace from example application (shortened):

    socket(PF_INET, SOCK_SEQPACKET, IPPROTO_SCTP) = 3
    sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
    msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
    sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
    msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
    sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
    msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
    sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
    msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
    sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
    msg_iov(0)=[], msg_controllen=48, {cmsg_len=48, cmsg_level=0x84 /* SOL_??? */, cmsg_type=, ...},
    msg_flags=0}, 0) = 0 // graceful shutdown for SOCK_SEQPACKET via SCTP_EOF
    close(3) = 0

    tcpdump before patch (fooling the application):

    22:33:36.306142 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 3879023686] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3139201684]
    22:33:36.316619 IP 192.168.1.115.8888 > 192.168.1.114.41462: sctp (1) [INIT ACK] [init tag: 3345394793] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 3380109591]
    22:33:36.317600 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [ABORT]

    tcpdump after patch:

    14:28:58.884116 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 438593213] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3092969729]
    14:28:58.888414 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [INIT ACK] [init tag: 381429855] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 2141904492]
    14:28:58.888638 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [COOKIE ECHO] , (2) [DATA] (B)(E) [TSN: 3092969729] [...]
    14:28:58.893278 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [COOKIE ACK] , (2) [SACK] [cum ack 3092969729] [a_rwnd 106491] [#gap acks 0] [#dup tsns 0]
    14:28:58.893591 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969730] [...]
    14:28:59.096963 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969730] [a_rwnd 106496] [#gap acks 0] [#dup tsns 0]
    14:28:59.097086 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969731] [...] , (2) [DATA] (B)(E) [TSN: 3092969732] [...]
    14:28:59.103218 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969732] [a_rwnd 106486] [#gap acks 0] [#dup tsns 0]
    14:28:59.103330 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN]
    14:28:59.107793 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SHUTDOWN ACK]
    14:28:59.107890 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN COMPLETE]

    Looks like this bug is from the pre-git history museum. ;)

    Fixes: 08707d5482df ("lksctp-2_5_31-0_5_1.patch")
    Signed-off-by: Daniel Borkmann
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

17 Jan, 2015

2 commits

  • In addition to the problem Jeff Layton reported, I looked at the code
    and reproduced the same warning by subscribing and removing the genl
    family with a socket still open. This is a fairly tricky race which
    originates in the fact that generic netlink allows the family to go
    away while sockets are still open - unlike regular netlink which has
    a module refcount for every open socket so in general this cannot be
    triggered.

    Trying to resolve this issue by the obvious locking isn't possible as
    it will result in deadlocks between unregistration and group unbind
    notification (which incidentally lockdep doesn't find due to the home
    grown locking in the netlink table.)

    To really resolve this, introduce a "closing socket" reference counter
    (for generic netlink only, as it's the only affected family) in the
    core netlink code and use that in generic netlink to wait for all the
    sockets that are being closed at the same time as a generic netlink
    family is removed.

    This fixes the race where, when a socket is closed, it should call
    the unbind, but if the family is removed at the same time the unbind
    will not find it, leading to the warning. The real problem though is
    that in this case the unbind could actually find a new family that is
    registered to have a multicast group with the same ID, and call its
    mcast_unbind() leading to confusion.

    Also remove the warning since it would still trigger, but is now no
    longer a problem.

    This also moves the code in af_netlink.c to before unreferencing the
    module to avoid having the same problem in the normal non-genl case.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Jeff Layton reported that he could trigger the multicast unbind warning
    in generic netlink using trinity. I originally thought it was a race
    condition between unregistering the generic netlink family and closing
    the socket, but there's a far simpler explanation: genetlink currently
    allows subscribing to groups that don't (yet) exist, and the warning is
    triggered when unsubscribing again while the group still doesn't exist.

    Originally, I had a warning in the subscribe case and accepted it out of
    userspace API concerns, but the warning was of course wrong and removed
    later.

    However, I now think that allowing userspace to subscribe to groups that
    don't exist is wrong and could possibly become a security problem:
    Consider a (new) genetlink family implementing a permission check in
    the mcast_bind() function similar to what the audit code does today;
    it would be possible to bypass the permission check by guessing the ID
    and subscribing to the group before it exists. This is only possible in case a
    family like that would be dynamically loaded, but it doesn't seem like a
    huge stretch, for example wireless may be loaded when you plug in a USB
    device.

    To avoid this, reject such subscription attempts.

    If this ends up causing userspace issues we may need to add a workaround
    in af_netlink to deny such requests but not return an error.

    Reported-by: Jeff Layton
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

16 Jan, 2015

3 commits

  • softnet_data.input_pkt_queue is protected by a spinlock that
    we must hold when transferring packets from victim queue to an active
    one. This is because other cpus could still be trying to enqueue packets
    into victim queue.

    A second problem is that when we transfer the NAPI poll_list from
    victim to current cpu, we absolutely need to special case the percpu
    backlog, because we do not want to add complex locking to protect
    process_queue : Only owner cpu is allowed to manipulate it, unless cpu
    is offline.

    Based on initial patch from Prasad Sodagudi & Subash Abhinov
    Kasiviswanathan.

    This version is better because we do not slow down packet processing,
    only make migration safer.

    Reported-by: Prasad Sodagudi
    Reported-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
    structure is defined and allocated on the stack as

    struct {
            struct sock_extended_err ee;
            struct sockaddr_in(6) offender;
    } errhdr;

    The second part is only initialized for certain SO_EE_ORIGIN values.
    Always initialize it completely.

    An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
    would return uninitialized bytes.
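
    The idea in a user-space sketch, assuming simplified stand-ins for the
    two structures. struct ext_err, struct offender_addr, the ORIGIN_*
    values and fill_errhdr() are hypothetical; only the pattern of clearing
    the whole structure before the origin-specific fill is the point.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the extended error and the offender address. */
struct ext_err {
    uint32_t ee_errno;
    uint8_t ee_origin;
};

struct offender_addr {
    uint16_t family;
    uint16_t port;
    uint32_t addr;
};

struct errhdr {
    struct ext_err ee;
    struct offender_addr offender;
};

#define ORIGIN_LOCAL 1          /* no offender address available */
#define ORIGIN_ICMP  2          /* offender address is known */

static void fill_errhdr(struct errhdr *h, uint8_t origin, uint32_t addr)
{
    memset(h, 0, sizeof(*h));   /* no stale bytes, whatever the origin */
    h->ee.ee_origin = origin;
    if (origin == ORIGIN_ICMP) {
        h->offender.family = 2; /* AF_INET on Linux */
        h->offender.addr = addr;
    }
}
```

    For ORIGIN_LOCAL the offender part stays all zero instead of carrying
    whatever was on the stack before.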

    Signed-off-by: Willem de Bruijn

    ----

    Also verified that there is no padding between errhdr.ee and
    errhdr.offender that could leak additional kernel data.
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • …kernel/git/jberg/mac80211

    Just two fixes - one for an uninitialized variable and
    one for a deadlock in regulatory processing.

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     

15 Jan, 2015

3 commits

  • Pull networking fixes from David Miller:

    1) Don't use uninitialized data in IPVS, from Dan Carpenter.

    2) conntrack race fixes from Pablo Neira Ayuso.

    3) Fix TX hangs with i40e, from Jesse Brandeburg.

    4) Fix budget return from poll calls in dnet and alx, from Eric
    Dumazet.

    5) Fix bogus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
    Jaeger.

    6) Fix bug introduced by conversion to list_head in TIPC retransmit
    code, from Jon Paul Maloy.

    7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
    Khoroshilov.

    8) Fix bridge build with INET disabled, from Arnd Bergmann.

    9) Fix netlink array overrun for PROBE attributes in openvswitch, from
    Thomas Graf.

    10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
    Prashant Sreedharan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
    tg3: Release tp->lock before invoking synchronize_irq()
    tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
    tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
    team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
    openvswitch: packet messages need their own probe attribtue
    i40e: adds FCoE configure option
    cxgb4vf: Fix queue allocation for 40G adapter
    netdevice: Add missing parentheses in macro
    bridge: only provide proxy ARP when CONFIG_INET is enabled
    neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
    net: fec: fix MDIO bus assignement for dual fec SoC's
    xen-netfront: use different locks for Rx and Tx stats
    drivers: net: cpsw: fix multicast flush in dual emac mode
    cxgb4vf: Initialize mdio_addr before using it
    net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
    usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
    MAINTAINERS: add me as ibmveth maintainer
    tipc: fix bug in broadcast retransmit code
    update ip-sysctl.txt documentation (v2)
    net/at91_ether: prepare and unprepare clock
    ...

    Linus Torvalds
     
  • User space is currently sending a OVS_FLOW_ATTR_PROBE for both flow
    and packet messages. This leads to an out-of-bounds access in
    ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
    OVS_PACKET_ATTR_MAX.

    Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
    as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
    while remaining binary compatible with existing OVS binaries.
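
    Why the index runs off the array, as a sketch. The enum values below
    are illustrative only and do not match the openvswitch uapi header;
    get_attr() is a hypothetical bounds-checked lookup over an attribute
    table sized for the packet family.

```c
#include <stddef.h>

/* Illustrative values only, not the real uapi numbers. */
enum { FLOW_ATTR_PROBE = 5 };
enum { PACKET_ATTR_OLD_MAX = 3 };
enum {
    PACKET_ATTR_PROBE = FLOW_ATTR_PROBE,    /* same number on the wire */
    PACKET_ATTR_MAX = PACKET_ATTR_PROBE,
};

/* Look up attribute 'type' in a table holding max + 1 entries. */
static const void *get_attr(const void *const *attrs, int max, int type)
{
    if (type < 0 || type > max)
        return NULL;            /* index would be out of bounds */
    return attrs[type];
}
```

    With the old maximum, the flow probe attribute lies past the end of
    the table; once the packet range is grown to include the same number,
    the lookup is in bounds and existing senders keep working unchanged.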

    Fixes: 05da589 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.")
    Reported-by: Sander Eikelenboom
    Tracked-down-by: Florian Westphal
    Signed-off-by: Thomas Graf
    Reviewed-by: Jesse Gross
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • When IPV4 support is disabled, we cannot call arp_send from
    the bridge code, which would result in a kernel link error:

    net/built-in.o: In function `br_handle_frame_finish':
    :(.text+0x59914): undefined reference to `arp_send'
    :(.text+0x59a50): undefined reference to `arp_tbl'

    This makes the newly added proxy ARP support in the bridge
    code depend on the CONFIG_INET symbol and lets the compiler
    optimize the code out to avoid the link error.

    Signed-off-by: Arnd Bergmann
    Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
    Cc: Kyeyoon Park
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

14 Jan, 2015

1 commit

  • When setting base_reachable_time or base_reachable_time_ms on a
    specific interface through sysctl or netlink, the reachable_time
    value is not updated.

    This means that neighbour entries will continue to be updated using the
    old value until it is recomputed in neigh_period_work (which
    recomputes the value every 300*HZ).
    On systems with HZ equal to 1000 for instance, it means 5 minutes before
    the change is effective.

    This patch changes this behavior by recomputing reachable_time after
    each set on base_reachable_time or base_reachable_time_ms.
    The new value will become effective the next time the neighbour's timer
    is triggered.

    Changes are made in two places: the netlink code for set and the sysctl
    handling code. For sysctl, I use a proc_handler. The ipv6 network
    code does provide its own handler but it already refreshes
    reachable_time correctly so it's not an issue.
    Any other user of neighbour which provides its own handlers must
    refresh reachable_time.

    Signed-off-by: Jean-Francois Remy
    Signed-off-by: David S. Miller

    Jean-Francois Remy
     

13 Jan, 2015

1 commit

  • In commit 58dc55f25631178ee74cd27185956a8f7dcb3e32 ("tipc: use generic
    SKB list APIs to manage link transmission queue") we replaced all list
    traversal loops with the macros skb_queue_walk() or
    skb_queue_walk_safe(). While the previous loops were based on the
    assumption that the list was NULL-terminated, the standard macros
    stop when the iterator reaches the list head, which is non-NULL.

    In the function bclink_retransmit_pkt() this macro replacement has
    led to a bug. When we receive a BCAST STATE_MSG we unconditionally
    call the function bclink_retransmit_pkt(), whether there really is
    anything to retransmit or not, assuming that the sequence number
    comparisons will lead to the correct behavior. However, if the
    transmission queue is empty, or if there are no eligible buffers in
    the transmission queue, we will by mistake pass the list head pointer
    to the function tipc_link_retransmit(). Since the list head is not a
    valid sk_buff, this leads to a crash.

    In this commit we fix this by only calling tipc_link_retransmit()
    if we actually found eligible buffers in the transmission queue.
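
    The difference between a NULL-terminated walk and a walk that stops at
    the list head can be sketched in plain C. This is only an illustration:
    struct node, find_first_eligible() and walk_stop_position() are
    hypothetical stand-ins rather than the TIPC or sk_buff code, and the
    plain >= comparison ignores sequence number wrap-around.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular queue; not the kernel sk_buff types. */
struct node {
    struct node *next;
    struct node *prev;
    unsigned int seqno;
};

/* An empty circular list has a head that points at itself. */
static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
    head->seqno = 0;
}

static void list_add_tail(struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Buggy pattern: return the cursor where the walk stopped. When nothing
 * matches (or the list is empty) the cursor is the head itself, which is
 * non-NULL but is not a real element.
 */
static struct node *walk_stop_position(struct node *head, unsigned int from)
{
    struct node *pos;

    for (pos = head->next; pos != head; pos = pos->next)
        if (pos->seqno >= from)
            break;
    return pos;
}

/*
 * Fixed pattern: only report an element that was actually found, and
 * return NULL otherwise so the caller can skip the retransmission.
 */
static struct node *find_first_eligible(struct node *head, unsigned int from)
{
    struct node *pos;

    assert(head != NULL);
    for (pos = head->next; pos != head; pos = pos->next)
        if (pos->seqno >= from)
            return pos;
    return NULL;
}
```

    A caller following the fixed pattern would retransmit only when
    find_first_eligible() returns a non-NULL element.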

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     

12 Jan, 2015

2 commits

  • Pablo Neira Ayuso says:

    ====================
    netfilter/ipvs fixes for net

    The following patchset contains netfilter/ipvs fixes, they are:

    1) Small fix for the FTP helper in IPVS, a diff variable may be left
    unset when CONFIG_IP_VS_IPV6 is set. Patch from Dan Carpenter.

    2) Fix nf_tables port NAT in little endian archs, patch from leroy
    christophe.

    3) Fix race condition between conntrack confirmation and flush from
    userspace. This is the second reincarnation to resolve this problem.

    4) Make sure inner messages in the batch come with the nfnetlink header.

    5) Relax strict check from nfnetlink_bind() that may break old userspace
    applications using all 1s group mask.

    6) Schedule removal of chains once no sets and rules refer to them in
    the new nf_tables ruleset flush command. Reported by Asbjoern Sloth
    Toennesen.

    Note that this batch comes later than usual because of the short
    winter holidays.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Due to a misplaced parenthesis, the expression

    (unlikely(offset) < 0),

    which expands to

    (__builtin_expect(!!(offset), 0) < 0),

    never evaluates to true. Therefore, when sending packets with
    PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended
    if the creation of the layer 2 header fails.
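
    The effect of the parenthesis placement can be reproduced in userspace.
    The sketch below assumes GCC or Clang (for __builtin_expect) and
    redefines unlikely() locally with the expansion quoted above; the
    function names are hypothetical and are not taken from packet_snd().

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the expansion quoted above, so the sketch builds standalone. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Buggy form: unlikely(offset) collapses offset to 0 or 1 first, so the
 * comparison with 0 can never be true.
 */
static bool l2_header_failed_buggy(int offset)
{
    return unlikely(offset) < 0;
}

/* Fixed form: the comparison happens inside the branch hint. */
static bool l2_header_failed_fixed(int offset)
{
    return unlikely(offset < 0);
}

/* Exercises both forms with a negative (error) and a positive offset. */
static void check_parenthesis_placement(void)
{
    assert(!l2_header_failed_buggy(-1)); /* the error goes unnoticed */
    assert(l2_header_failed_fixed(-1));  /* the error is detected */
    assert(!l2_header_failed_fixed(14)); /* a valid offset is accepted */
}
```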

    Spotted by Coverity - CID 1259975 ("Operands don't affect result").

    Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
    Signed-off-by: Christoph Jaeger
    Acked-by: Eric Dumazet
    Acked-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Christoph Jaeger
     

10 Jan, 2015

2 commits


09 Jan, 2015

1 commit


08 Jan, 2015

1 commit

  • A struct xdr_stream at a page boundary might point to the end of one
    page or the beginning of the next, but xdr_truncate_encode isn't
    prepared to handle the former.

    This can cause corruption of NFSv4 READDIR replies in the case that a
    readdir entry that would have exceeded the client's dircount/maxcount
    limit would have ended exactly on a 4k page boundary. You're more
    likely to hit this case on large directories.

    Other xdr_truncate_encode callers are probably also affected.

    Reported-by: Holger Hoffstätte
    Tested-by: Holger Hoffstätte
    Fixes: 3e19ce762b53 "rpc: xdr_truncate_encode"
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

07 Jan, 2015

5 commits

  • If a P2P GO is active, the cfg80211_reg_can_beacon function will take
    the wdev lock, in its call to cfg80211_go_permissive_chan. But the wdev lock
    is already taken by the parent channel-checking function, causing a
    deadlock.
    Split the checking code into two parts. The first part checks whether
    the wdev is active and saves the channel under the wdev lock. The second
    part checks the actual channel validity according to type.

    Signed-off-by: Arik Nemtsov
    Reviewed-by: Ilan Peer
    Reviewed-by: Emmanuel Grumbach
    Signed-off-by: Johannes Berg

    Arik Nemtsov
     
  • The return value should be initialized to false so that there's a
    valid return value when there are no sessions that need work to be
    done on them. Luckily, the side effect of using the uninitialized
    value is an extra harmless driver call.
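
    As a minimal sketch of the pattern, assuming a hypothetical struct
    session and helper that are not the mac80211 code: initializing the
    result to false gives a defined return value even when the loop never
    finds a session that needs work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-session record; the real mac80211 state differs. */
struct session {
    bool needs_work;
};

/*
 * Returns true if at least one session needed work. Starting from false
 * keeps the result defined when no session qualifies, including n == 0.
 */
static bool any_session_needs_work(const struct session *s, size_t n)
{
    bool ret = false;
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i].needs_work)
            ret = true;
    }
    return ret;
}
```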

    Coverity: CID 1260096
    Fixes: 02219b3abca59 ("mac80211: add WMM admission control support")
    Signed-off-by: John W. Linville
    [extend commit message]
    Signed-off-by: Johannes Berg

    John Linville
     
  • Pull networking fixes from David Miller:
    "Just a pile of random fixes, including:

    1) Do not apply TSO limits to non-TSO packets, fix from Herbert Xu.

    2) MDI{,X} eeprom check in e100 driver is reversed, from John W.
    Linville.

    3) Missing error return assignments in several ethernet drivers, from
    Julia Lawall.

    4) Altera TSE device doesn't come back up after ifconfig down/up
    sequence, fix from Kostya Belezko.

    5) Add more cases to the check for whether the qmi_wwan device has a
    bogus MAC address and needs to be assigned a random one. From
    Kristian Evensen.

    6) Fix interrupt hangs in CPSW, from Felipe Balbi.

    7) Implement ndo_features_check in r8152 so that the stack doesn't
    feed GSO packets which are outside of the chip's capabilities.
    From Hayes Wang"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
    qla3xxx: don't allow never end busy loop
    xen-netback: fixing the propagation of the transmit shaper timeout
    r8152: support ndo_features_check
    batman-adv: fix potential TT client + orig-node memory leak
    batman-adv: fix multicast counter when purging originators
    batman-adv: fix counter for multicast supporting nodes
    batman-adv: fix lock class for decoding hash in network-coding.c
    batman-adv: fix delayed foreign originator recognition
    batman-adv: fix and simplify condition when bonding should be used
    Revert "mac80211: Fix accounting of the tailroom-needed counter"
    net: ethernet: cpsw: fix hangs with interrupts
    enic: free all rq buffs when allocation fails
    qmi_wwan: Set random MAC on devices with buggy fw
    openvswitch: Consistently include VLAN header in flow and port stats.
    tcp: Do not apply TSO segment limit to non-TSO packets
    Altera TSE: Add missing phydev
    net/mlx4_core: Fix error flow in mlx4_init_hca()
    net/mlx4_core: Correcly update the mtt's offset in the MR re-reg flow
    qlcnic: Fix return value in qlcnic_probe()
    net: axienet: fix error return code
    ...

    Linus Torvalds
     
  • Jumping between chains doesn't mix well with flush ruleset. Rules
    from a different chain and set elements may still refer to us.

    [ 353.373791] ------------[ cut here ]------------
    [ 353.373845] kernel BUG at net/netfilter/nf_tables_api.c:1159!
    [ 353.373896] invalid opcode: 0000 [#1] SMP
    [ 353.373942] Modules linked in: intel_powerclamp uas iwldvm iwlwifi
    [ 353.374017] CPU: 0 PID: 6445 Comm: 31c3.nft Not tainted 3.18.0 #98
    [ 353.374069] Hardware name: LENOVO 5129CTO/5129CTO, BIOS 6QET47WW (1.17 ) 07/14/2010
    [...]
    [ 353.375018] Call Trace:
    [ 353.375046] [] ? nf_tables_commit+0x381/0x540
    [ 353.375101] [] nfnetlink_rcv+0x3d8/0x4b0
    [ 353.375150] [] netlink_unicast+0x105/0x1a0
    [ 353.375200] [] netlink_sendmsg+0x32e/0x790
    [ 353.375253] [] sock_sendmsg+0x8e/0xc0
    [ 353.375300] [] ? move_addr_to_kernel.part.20+0x19/0x70
    [ 353.375357] [] ? move_addr_to_kernel+0x19/0x30
    [ 353.375410] [] ? verify_iovec+0x42/0xd0
    [ 353.375459] [] ___sys_sendmsg+0x3f0/0x400
    [ 353.375510] [] ? native_sched_clock+0x2a/0x90
    [ 353.375563] [] ? acct_account_cputime+0x17/0x20
    [ 353.375616] [] ? account_user_time+0x88/0xa0
    [ 353.375667] [] __sys_sendmsg+0x3d/0x80
    [ 353.375719] [] ? int_check_syscall_exit_work+0x34/0x3d
    [ 353.375776] [] SyS_sendmsg+0xd/0x20
    [ 353.375823] [] system_call_fastpath+0x16/0x1b

    Release objects in this order: rules -> sets -> chains -> tables, to
    make sure no references to chains are held anymore.

    Reported-by: Asbjoern Sloth Toennesen
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Relax the checking that was introduced in 97840cb ("netfilter:
    nfnetlink: fix insufficient validation in nfnetlink_bind") when the
    subscription bitmask is used. Existing userspace code may request
    to listen to all of the existing netlink groups by setting an all-ones
    subscription group bitmask. Netlink already validates subscription via
    setsockopt() for us.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso