15 Aug, 2014

6 commits

  • Silences the following sparse warnings:
    net/netlink/af_netlink.c:2926:21: warning: context imbalance in 'netlink_seq_start' - wrong count at exit
    net/netlink/af_netlink.c:2972:13: warning: context imbalance in 'netlink_seq_stop' - unexpected unlock

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Fix TCP FRTO logic so that it always notices when snd_una advances,
    indicating that any RTO after that point will be a new and distinct
    loss episode.

    Previously there was a very specific sequence that could cause FRTO to
    fail to notice a new loss episode had started:

    (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
    (2) receiver ACKs packet 1
    (3) FRTO sends 2 more packets
    (4) RTO timer fires again (should start a new loss episode)

    The problem was in step (3) above, where tcp_process_loss() returned
    early (in the spot marked "Step 2.b"), so that it never got to the
    logic to clear icsk_retransmits. Thus icsk_retransmits stayed
    non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
    icsk_retransmits, decide that this RTO is not a new episode, and
    decide not to cut ssthresh and remember the current cwnd and ssthresh
    for undo.

    There were two main consequences to the bug that we have
    observed. First, ssthresh was not decreased in step (4). Second, when
    there was a series of such FRTO (1-4) sequences that happened to be
    followed by an FRTO undo, we would restore the cwnd and ssthresh from
    before the entire series started (instead of the cwnd and ssthresh
    from before the most recent RTO). This could result in cwnd and
    ssthresh being restored to values much bigger than the proper values.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO")
    Signed-off-by: David S. Miller

    Neal Cardwell
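
    The (1)-(4) sequence above can be replayed in a toy model. This is only a
    sketch: the Conn class, the halving of ssthresh and the cwnd growth are
    assumptions made for illustration, not the kernel's code. Only the role of
    icsk_retransmits and the early return at "Step 2.b" come from the commit
    message.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    cwnd: int = 40
    ssthresh: int = 100
    icsk_retransmits: int = 0
    prior_cwnd: int = 0        # saved for a later undo
    prior_ssthresh: int = 0    # saved for a later undo


def enter_loss(c: Conn) -> None:
    """RTO fired. Only a new loss episode saves undo state and cuts ssthresh."""
    if c.icsk_retransmits == 0:
        c.prior_cwnd, c.prior_ssthresh = c.cwnd, c.ssthresh
        c.ssthresh = max(c.cwnd // 2, 2)   # toy reduction, not a real CC module
    c.cwnd = 1
    c.icsk_retransmits += 1


def process_loss_ack(c: Conn, snd_una_advanced: bool,
                     frto_step_2b: bool, fixed: bool) -> None:
    """ACK handling while in loss state (hypothetical simplification)."""
    if fixed and snd_una_advanced:
        c.icsk_retransmits = 0             # fixed: noticed before any early return
    if frto_step_2b:
        c.cwnd += 2                        # F-RTO sends two new packets
        return                             # early return, "Step 2.b"
    if snd_una_advanced:
        c.icsk_retransmits = 0             # never reached in the buggy sequence


def two_rtos(fixed: bool) -> Conn:
    c = Conn()
    enter_loss(c)                                          # step (1)
    process_loss_ack(c, snd_una_advanced=True,
                     frto_step_2b=True, fixed=fixed)       # steps (2) and (3)
    enter_loss(c)                                          # step (4)
    return c
```

    With fixed=False the second RTO leaves ssthresh and the saved undo values
    untouched, which is the behaviour the commit describes; with fixed=True
    the second RTO is treated as a new episode.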
     
  • tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
    ordering of incoming connections and teardowns without the need to
    hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
    the last measured RTO. To do so, we keep the last seen timestamp in a
    per-host indexed data structure and verify if the incoming timestamp
    in a connection request is strictly greater than the saved one during
    last connection teardown. Thus we can verify later on that no old data
    packets will be accepted by the new connection.

    When moving a socket to time-wait state we already verify whether timestamps
    were seen on a connection. Only if that was the case do we let the
    time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
    will be used. But we don't verify this on incoming SYN packets. If a
    connection teardown was less than TCP_PAWS_MSL seconds in the past we
    cannot guarantee to not accept data packets from an old connection if
    no timestamps are present. We should drop this SYN packet. This patch
    closes this loophole.

    Please note, this patch does not make tcp_tw_recycle in any way more
    usable but only adds another safety check:
    Sporadic drops of SYN packets because of reordering in the network or
    in the socket backlog queues can happen. Users behind NAT trying to
    connect to a tcp_tw_recycle enabled server can get caught in blackholes
    and their connection requests may regularly get dropped because hosts
    behind an address translator don't have synchronized tcp timestamp clocks.
    tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.

    In general, use of tcp_tw_recycle is discouraged.

    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
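
    The added safety check can be sketched as a small decision function. This
    is a simplified model, not the kernel code: the function name, the tuple
    layout of the per-host record and the value 60 for TCP_PAWS_MSL are
    assumptions, and timestamp wraparound is ignored.

```python
TCP_PAWS_MSL = 60  # seconds; assumed value for this sketch


def accept_syn(now, syn_tsval, saved):
    """Decide whether a SYN from a host may be accepted.

    now       -- current time in seconds
    syn_tsval -- TCP timestamp value carried by the SYN, or None if absent
    saved     -- None, or (teardown_time, last_tsval) remembered per host
    """
    if saved is None:
        return True                      # nothing known about this host
    teardown_time, last_tsval = saved
    if now - teardown_time >= TCP_PAWS_MSL:
        return True                      # old connection is long gone
    if syn_tsval is None:
        return False                     # the loophole closed by this patch
    return syn_tsval > last_tsval        # must be strictly newer
```

    For example, accept_syn(100, None, (90, 5000)) rejects a SYN without
    timestamps that arrives 10 seconds after a teardown, while the same SYN
    arriving after TCP_PAWS_MSL seconds is accepted.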
     
  • Make sure we use the correct address-family-specific function for
    handling MTU reductions from within tcp_release_cb().

    Previously AF_INET6 sockets were incorrectly always using the IPv6
    code path when sometimes they were handling IPv4 traffic and thus had
    an IPv4 dst.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Willem de Bruijn
    Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications")
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Neal Cardwell
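
    The difference between dispatching on the socket family and dispatching on
    a per-socket handler can be illustrated with a toy model. All names below
    (Sock, the mtu_reduced field, the two handler functions) are hypothetical
    and only mirror the idea described in the commit message.

```python
from dataclasses import dataclass
from typing import Callable


def v4_mtu_reduced(sk):
    return "ipv4 path"


def v6_mtu_reduced(sk):
    return "ipv6 path"


@dataclass
class Sock:
    family: str
    mtu_reduced: Callable   # handler matching the traffic the socket carries


def release_cb_by_family(sk):
    # Old behaviour in this model: AF_INET6 always takes the IPv6 path.
    return v6_mtu_reduced(sk) if sk.family == "AF_INET6" else v4_mtu_reduced(sk)


def release_cb_by_ops(sk):
    # Fixed behaviour in this model: use the handler stored on the socket.
    return sk.mtu_reduced(sk)


# An AF_INET6 socket that is handling IPv4 traffic and has an IPv4 dst.
mapped = Sock("AF_INET6", v4_mtu_reduced)
```

    Here release_cb_by_family(mapped) picks the IPv6 path even though the
    socket carries IPv4 traffic, whereas release_cb_by_ops(mapped) picks the
    IPv4 path.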
     
  • As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"),
    when looking up a tunnel, the tunnel's underlying interface (t->parms.link)
    is verified to match incoming traffic's ingress device.

    However the comparison was incorrectly based on skb->dev->iflink.

    Instead, dev->ifindex should be used, which correctly represents the
    interface from which the IP stack hands the ipip6 packets.

    This allows setting up sit tunnels bound to vlan interfaces (otherwise
    incoming ipip6 traffic on the vlan interface was dropped due to
    ipip6_tunnel_lookup match failure).

    Signed-off-by: Shmulik Ladkani
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Shmulik Ladkani
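
    Why the choice of field matters for vlan interfaces can be shown with a
    small model. It assumes that a vlan device's iflink holds the index of its
    lower device while ifindex is the vlan device's own index; the classes,
    the index numbers and the "link 0 means unbound" rule are illustrative
    assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetDevice:
    name: str
    ifindex: int   # the device's own index
    iflink: int    # index of the lower device (own index if there is none)


@dataclass(frozen=True)
class Tunnel:
    name: str
    link: int      # models t->parms.link; 0 means "not bound" in this sketch


def tunnel_lookup(tunnels, ingress_index):
    """Return the first tunnel whose link matches the ingress device index."""
    for t in tunnels:
        if t.link == 0 or t.link == ingress_index:
            return t
    return None


eth0 = NetDevice("eth0", ifindex=2, iflink=2)
vlan = NetDevice("eth0.100", ifindex=5, iflink=2)
sit1 = Tunnel("sit1", link=vlan.ifindex)   # tunnel bound to the vlan interface
```

    Looking up with vlan.iflink finds nothing, so traffic arriving on the vlan
    interface would be dropped; looking up with vlan.ifindex finds sit1.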
     
    We don't know the right timestamp for repaired skbs. Wrong RTT estimates
    are harmful, because some congestion control modules depend heavily on them.

    This patch adds the TCPCB_REPAIRED flag, which is included in
    TCPCB_RETRANS.

    Thanks to Eric for the advice on how to fix this issue.

    This patch fixes the warning:
    [ 879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
    [ 879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
    [ 879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 879.568177] 0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
    [ 879.568776] 0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
    [ 879.569386] ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
    [ 879.569982] Call Trace:
    [ 879.570264] [] dump_stack+0x4d/0x66
    [ 879.570599] [] warn_slowpath_common+0x7d/0xa0
    [ 879.570935] [] warn_slowpath_null+0x1a/0x20
    [ 879.571292] [] tcp_ack+0x11f5/0x1380
    [ 879.571614] [] tcp_rcv_established+0x1ed/0x710
    [ 879.571958] [] tcp_v4_do_rcv+0x10a/0x370
    [ 879.572315] [] release_sock+0x89/0x1d0
    [ 879.572642] [] do_tcp_setsockopt.isra.36+0x120/0x860
    [ 879.573000] [] ? rcu_read_lock_held+0x6e/0x80
    [ 879.573352] [] tcp_setsockopt+0x32/0x40
    [ 879.573678] [] sock_common_setsockopt+0x14/0x20
    [ 879.574031] [] SyS_setsockopt+0x80/0xf0
    [ 879.574393] [] system_call_fastpath+0x16/0x1b
    [ 879.574730] ---[ end trace a17cbc38eb8c5c00 ]---

    v2: move the setting of skb->when for repaired skbs into tcp_write_xmit,
    where it's set for other skbs.

    Fixes: 431a91242d8d ("tcp: timestamp SYN+DATA messages")
    Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
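
    Including TCPCB_REPAIRED in the TCPCB_RETRANS mask means that any check
    which already skips retransmitted skbs also skips repaired ones. A minimal
    sketch of that idea follows; the numeric flag values and the helper name
    usable_for_rtt are assumptions for illustration only.

```python
# Illustrative bit values; treat them as assumptions, not kernel constants.
TCPCB_SACKED_RETRANS = 0x02
TCPCB_REPAIRED = 0x10
TCPCB_EVER_RETRANS = 0x80

# The repaired bit is folded into the "was retransmitted" mask.
TCPCB_RETRANS = TCPCB_SACKED_RETRANS | TCPCB_EVER_RETRANS | TCPCB_REPAIRED


def usable_for_rtt(sacked: int) -> bool:
    """An skb yields an RTT sample only if none of the RETRANS bits is set."""
    return not (sacked & TCPCB_RETRANS)
```

    A repaired skb (sacked == TCPCB_REPAIRED) is therefore never used for an
    RTT sample, just like a retransmitted one.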
     

14 Aug, 2014

9 commits

  • Bytestream timestamps are correlated with a single byte in the skbuff,
    recorded in skb_shinfo(skb)->tskey. When fragmenting skbuffs, ensure
    that the tskey is set for the fragment in which the tskey falls
    (seqno < end_seqno).

    The original implementation did not address fragmentation in
    tcp_fragment or tso_fragment. Add code to inspect the sequence numbers
    and move both tskey and the relevant tx_flags if necessary.

    Reported-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
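
    The rule "the tskey travels with the fragment that contains it" can be
    sketched as follows. The Skb class, the TSTAMP_FLAGS mask and the plain
    integer comparison (no sequence number wraparound) are simplifying
    assumptions, not the kernel implementation.

```python
from dataclasses import dataclass

TSTAMP_FLAGS = 0x3   # hypothetical mask of the timestamp request bits


@dataclass
class Skb:
    seq: int
    end_seq: int
    tx_flags: int = 0
    tskey: int = 0


def fragment(skb: Skb, split_seq: int) -> Skb:
    """Split skb at split_seq and return the new tail fragment."""
    tail = Skb(seq=split_seq, end_seq=skb.end_seq)
    skb.end_seq = split_seq
    # If the timestamped byte now lives in the tail, move key and flags there.
    if skb.tx_flags & TSTAMP_FLAGS and skb.tskey >= tail.seq:
        tail.tx_flags |= skb.tx_flags & TSTAMP_FLAGS
        skb.tx_flags &= ~TSTAMP_FLAGS
        tail.tskey, skb.tskey = skb.tskey, 0
    return tail
```

    Splitting an skb covering bytes 1000..3000 at 2000 moves a tskey of 2999
    to the tail, while a tskey of 1500 stays with the head.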
     
  • ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
    can break out early, causing the timestamp code to be skipped. Move
    the code up before the break.

    Reported-by: David S. Miller

    Also fix a boundary condition: tp->snd_una is the next unacknowledged
    byte and between() tests are inclusive, so the upper bound of the test
    is snd_una - 1.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
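
    The boundary condition can be checked with a model of an inclusive
    range test over 32-bit sequence numbers. The helper ack_covers_tskey is a
    hypothetical name, and the sketch assumes snd_una has advanced past
    prior_snd_una.

```python
MASK32 = 0xFFFFFFFF


def between(seq1: int, seq2: int, seq3: int) -> bool:
    """True when seq2 <= seq1 <= seq3 in wrapping 32-bit sequence space."""
    return ((seq3 - seq2) & MASK32) >= ((seq1 - seq2) & MASK32)


def ack_covers_tskey(tskey: int, prior_snd_una: int, snd_una: int) -> bool:
    """snd_una is the next unacknowledged byte, so the last acked is snd_una - 1."""
    return between(tskey, prior_snd_una, (snd_una - 1) & MASK32)
```

    With prior_snd_una = 1000 and snd_una = 2000, byte 1999 is covered by the
    ACK but byte 2000 is not.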
     
  • Signed-off-by: Maks Naumov
    Signed-off-by: David S. Miller

    Maks Naumov
     
  • b67bfe0d42cac56c512dd5da4b1b347a23f4b70a (hlist: drop the node
    parameter from iterators) dropped the node parameter from
    iterators which lec_tbl_walk() was using to iterate the list.

    Signed-off-by: Chas Williams
    Signed-off-by: David S. Miller

    chas williams - CONTRACTOR
     
  • One should not call blocking primitives inside a wait loop, since both
    require task_struct::state to sleep, so the inner will destroy the
    outer state.

    sigd_enq() will possibly sleep for alloc_skb(). Move sigd_enq() before
    prepare_to_wait() to avoid sleeping while waiting interruptibly. You do
    not actually need to call sigd_enq() after the initial prepare_to_wait()
    because we test the termination condition before calling schedule().

    Based on suggestions from Peter Zijlstra.

    Signed-off-by: Chas Williams
    Acked-by: Peter Zijlstra
    Signed-off-by: David S. Miller

    chas williams - CONTRACTOR
     
  • ovs_vport_alloc() bails out without freeing the memory 'vport' points to.

    Picked up by Coverity - CID 1230503.

    Fixes: 5cd667b0a4 ("openvswitch: Allow each vport to have an array of 'port_id's.")
    Signed-off-by: Christoph Jaeger
    Signed-off-by: David S. Miller

    Christoph Jaeger
     
  • Pull networking fixes from David Miller:
    "Several networking final fixes and tidies for the merge window:

    1) Changes during the merge window unintentionally took away the
    ability to build bluetooth modular, fix from Geert Uytterhoeven.

    2) Several phy_node reference count bug fixes from Uwe Kleine-König.

    3) Fix ucc_geth build failures, also from Uwe Kleine-König.

    4) Fix klog false positives when netlink messages go to network
    taps, by properly resetting the network header. Fix from Daniel
    Borkmann.

    5) Sizing estimate of VF netlink messages is too small, from Jiri
    Benc.

    6) New APM X-Gene SoC ethernet driver, from Iyappan Subramanian.

    7) VLAN untagging is erroneously dependent upon whether the VLAN
    module is loaded or not, but there are generic dependencies that
    matter wrt what can be expected as the SKB enters the stack.
    Make the basic untagging generic code, and do it unconditionally.
    From Vlad Yasevich.

    8) xen-netfront only has so many slots in its transmit queue so
    linearize packets that have too many frags. From Zoltan Kiss.

    9) Fix suspend/resume PHY handling in bcmgenet driver, from Florian
    Fainelli"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (55 commits)
    net: bcmgenet: correctly resume adapter from Wake-on-LAN
    net: bcmgenet: update UMAC_CMD only when link is detected
    net: bcmgenet: correctly suspend and resume PHY device
    net: bcmgenet: request and enable main clock earlier
    net: ethernet: myricom: myri10ge: myri10ge.c: Cleaning up missing null-terminate after strncpy call
    xen-netfront: Fix handling packets on compound pages with skb_linearize
    net: fec: Support phys probed from devicetree and fixed-link
    smsc: replace WARN_ON() with WARN_ON_SMP()
    xen-netback: Don't deschedule NAPI when carrier off
    net: ethernet: qlogic: qlcnic: Remove duplicate object file from Makefile
    wan: wanxl: Remove typedefs from struct names
    m68k/atari: EtherNEC - ethernet support (ne)
    net: ethernet: ti: cpmac.c: Cleaning up missing null-terminate after strncpy call
    hdlc: Remove typedefs from struct names
    airo_cs: Remove typedef local_info_t
    atmel: Remove typedef atmel_priv_ioctl
    com20020_cs: Remove typedef com20020_dev_t
    ethernet: amd: Remove typedef local_info_t
    net: Always untag vlan-tagged traffic on input.
    drivers: net: Add APM X-Gene SoC ethernet driver support.
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    - stable fix for a bug in nfs3_list_one_acl()
    - speed up NFS path walks by supporting LOOKUP_RCU
    - more read/write code cleanups
    - pNFS fixes for layout return on close
    - fixes for the RCU handling in the rpcsec_gss code
    - more NFS/RDMA fixes"

    * tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
    nfs: reject changes to resvport and sharecache during remount
    NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
    SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
    NFS: fix two problems in lookup_revalidate in RCU-walk
    NFS: allow lockless access to access_cache
    NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
    NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
    NFS: support RCU_WALK in nfs_permission()
    sunrpc/auth: allow lockless (rcu) lookup of credential cache.
    NFS: prepare for RCU-walk support but pushing tests later in code.
    NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
    NFS: add checks for returned value of try_module_get()
    nfs: clear_request_commit while holding i_lock
    pnfs: add pnfs_put_lseg_async
    pnfs: find swapped pages on pnfs commit lists too
    nfs: fix comment and add warn_on for PG_INODE_REF
    nfs: check wait_on_bit_lock err in page_group_lock
    sunrpc: remove "ec" argument from encrypt_v2 operation
    sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
    sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
    ...

    Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "There is a lot of refactoring and hardening of the libceph and rbd
    code here from Ilya that fix various smaller bugs, and a few more
    important fixes with clone overlap. The main fix is a critical change
    to the request_fn handling to not sleep that was exposed by the recent
    mutex changes (which will also go to the 3.16 stable series).

    Yan Zheng has several fixes in here for CephFS fixing ACL handling,
    time stamps, and request resends when the MDS restarts.

    Finally, there are a few cleanups from Himangi Saraogi based on
    Coccinelle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
    libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
    rbd: remove extra newlines from rbd_warn() messages
    rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
    rbd: rework rbd_request_fn()
    ceph: fix kick_requests()
    ceph: fix append mode write
    ceph: fix sizeof(struct tYpO *) typo
    ceph: remove redundant memset(0)
    rbd: take snap_id into account when reading in parent info
    rbd: do not read in parent info before snap context
    rbd: update mapping size only on refresh
    rbd: harden rbd_dev_refresh() and callers a bit
    rbd: split rbd_dev_spec_update() into two functions
    rbd: remove unnecessary asserts in rbd_dev_image_probe()
    rbd: introduce rbd_dev_header_info()
    rbd: show the entire chain of parent images
    ceph: replace comma with a semicolon
    rbd: use rbd_segment_name_free() instead of kfree()
    ceph: check zero length in ceph_sync_read()
    ceph: reset r_resend_mds after receiving -ESTALE
    ...

    Linus Torvalds
     

12 Aug, 2014

2 commits

  • Currently the functionality to untag traffic on input resides
    as part of the vlan module and is built only when VLAN support
    is enabled in the kernel. When VLAN is disabled, the function
    vlan_untag() turns into a stub and doesn't really untag the
    packets. This seems to create an interesting interaction
    between VMs supporting checksum offloading and some network drivers.

    There are some drivers that do not allow the user to change
    tx-vlan-offload feature of the driver. These drivers also seem
    to assume that any VLAN-tagged traffic they transmit will
    have the vlan information in the vlan_tci and not in the vlan
    header already in the skb. When transmitting skbs that already
    have tagged data with partial checksum set, the checksum doesn't
    appear to be updated correctly by the card thus resulting in a
    failure to establish TCP connections.

    The following is a packet trace taken on the receiver where the
    sender is a VM with a VLAN configured. The host the VM is running on
    does not have VLAN support and the outgoing interface on the
    host is tg3:
    10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
    (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
    offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
    -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
    4294837885 ecr 0,nop,wscale 7], length 0
    10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
    (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
    offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
    -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
    4294838888 ecr 0,nop,wscale 7], length 0

    This connection finally times out.

    I only have access to the TG3 hardware in this configuration and thus have
    only tested this with the TG3 driver. There are a lot of other drivers
    that do not permit user changes to vlan acceleration features, and
    I don't know if they all suffer from a similar issue.

    This patch attempts to fix this another way. It moves the vlan header
    stripping code out of the vlan module and always builds it into the
    kernel network core. This way, even if vlan is not supported on
    a virtualization host, the virtual machines running on top of such
    a host will still work with VLANs enabled.

    CC: Patrick McHardy
    CC: Nithin Nayak Sujir
    CC: Michael Chan
    CC: Jiri Pirko
    Signed-off-by: Vladislav Yasevich
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Yasevich
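
    What "untagging" means at the frame level can be sketched on raw bytes:
    the 4-byte 802.1Q tag after the two MAC addresses is removed and its TCI
    is kept as out-of-band metadata. The function below is a user-space
    illustration with an assumed name and return shape, not the kernel's
    vlan_untag().

```python
import struct

ETH_P_8021Q = 0x8100   # 802.1Q tag protocol identifier


def untag_frame(frame: bytes):
    """Return (frame_without_tag, tci); tci is None if the frame is untagged."""
    if len(frame) < 18:                          # too short to carry a tag
        return frame, None
    (tpid,) = struct.unpack_from("!H", frame, 12)
    if tpid != ETH_P_8021Q:
        return frame, None
    (tci,) = struct.unpack_from("!H", frame, 14)
    # Drop bytes 12..15 (TPID + TCI); the inner ethertype moves up to offset 12.
    return frame[:12] + frame[16:], tci
```

    The low 12 bits of the returned TCI are the VLAN ID (100 in the trace
    above), which corresponds to the information the drivers expect to find
    in vlan_tci rather than in the packet data.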
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains fixes for your net tree, they are:

    1) Uninitialize the set element key and data from the commit path,
    otherwise this leaks chain refcount if the transaction is aborted,
    reported by Thomas Graf.

    2) Fix crash when updating chains without counters in nf_tables,
    this slipped through in the new transaction infrastructure, reported
    by Matteo Croce.

    3) Replace all mutex_lock_interruptible() by mutex_lock() in the Netfilter
    tree, suggested by Patrick McHardy. This implicitly fixes the problem
    that Eric Dumazet reported in: http://patchwork.ozlabs.org/patch/373076/

    4) Fix error return code in nf_tables when deleting a set element
    if the transaction cannot be allocated, from Julia Lawall.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Aug, 2014

2 commits

  • Pull namespace updates from Eric Biederman:
    "This is a bunch of small changes built against 3.16-rc6. The most
    significant change for users is the first patch which makes setns
    dramatically faster by removing unneeded rcu handling.

    The next chunk of changes are so that "mount -o remount,.." will not
    allow the user namespace root to drop flags on a mount set by the
    system wide root. Aka this forces read-only mounts to stay read-only,
    no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
    mounts to stay no-exec and it prevents unprivileged users from messing
    with a mount's atime settings. I have included my test case as the
    last patch in this series so people performing backports can verify
    this change works correctly.

    The next change fixes a bug in NFS that was discovered while auditing
    nsproxy users for the first optimization. Today you can oops the
    kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
    with pid namespaces. I rebased and fixed the build of the
    !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
    that no one to my knowledge bases anything on my tree fixing the typo
    in place seems more responsible than requiring a typo-fix to be
    backported as well.

    The last change is a small semantic cleanup introducing
    /proc/thread-self and pointing /proc/mounts and /proc/net at it. This
    prevents several kinds of problematic corner cases. It is a
    user-visible change so it has a minute chance of causing regressions
    so the change to /proc/mounts and /proc/net are individual one line
    commits that can be trivially reverted. Unfortunately I lost and
    could not find the email of the original reporter so he is not
    credited. From at least one perspective this change to /proc/net is a
    regression fix to allow pthread /proc/net uses that were broken by
    the introduction of the network namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
    proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
    proc: Implement /proc/thread-self to point at the directory of the current thread
    proc: Have net show up under /proc//task/
    NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
    mnt: Add tests for unprivileged remount cases that have found to be faulty
    mnt: Change the default remount atime from relatime to the existing value
    mnt: Correct permission checks in do_remount
    mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
    mnt: Only change user settable mount flags in remount
    namespaces: Use task_lock and not rcu to protect nsproxy

    Linus Torvalds
     
  • Pull nfsd updates from Bruce Fields:
    "This includes a major rewrite of the NFSv4 state code, which has
    always depended on a single mutex. As an example, open creates are no
    longer serialized, fixing a performance regression on NFSv3->NFSv4
    upgrades. Thanks to Jeff, Trond, and Benny, and to Christoph for
    review.

    Also some RDMA fixes from Chuck Lever and Steve Wise, and
    miscellaneous fixes from Kinglong Mee and others"

    * 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
    svcrdma: remove rdma_create_qp() failure recovery logic
    nfsd: add some comments to the nfsd4 object definitions
    nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
    nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
    nfsd: remove nfs4_lock_state: nfs4_laundromat
    nfsd: Remove nfs4_lock_state(): reclaim_complete()
    nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
    nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
    nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
    nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
    nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
    nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
    nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
    nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
    nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
    nfsd: remove old fault injection infrastructure
    nfsd: add more granular locking to *_delegations fault injectors
    nfsd: add more granular locking to forget_openowners fault injector
    nfsd: add more granular locking to forget_locks fault injector
    nfsd: add a list_head arg to nfsd_foreach_client_lock
    ...

    Linus Torvalds
     

09 Aug, 2014

3 commits

  • Determining ->last_piece based on the value of ->page_offset + length
    is incorrect because length here is the length of the entire message.
    ->last_piece set to false even if page array data item length is /dev/null
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    # rbd_resize calls librbd rbd_resize(), size is in bytes
    ./rbd_resize bar $(((4 << 20) + 512))
    rbd resize --size 10 bar
    BAR_DEV=$(rbd map bar)
    # trigger a 512-byte copyup -- 512-byte page array data item
    dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

    The problem exists only in ceph_msg_data_pages_cursor_init(),
    ceph_msg_data_pages_advance() does the right thing. The size_t cast is
    unnecessary.

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Commit 1d8faf48c74b8 ("net/core: Add VF link state control") added new
    attribute to IFLA_VF_INFO group in rtnl_fill_ifinfo but did not adjust size
    of the allocated memory in if_nlmsg_size/rtnl_vfinfo_size. As the result, we
    may trigger warnings in rtnl_getlink and similar functions when many VF
    links are enabled, as the information does not fit into the allocated skb.

    Fixes: 1d8faf48c74b8 ("net/core: Add VF link state control")
    Reported-by: Yulong Pei
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
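
    The effect of a size estimate that misses one attribute can be worked
    through with netlink's attribute padding rule (4-byte header, 4-byte
    alignment). The attribute names and payload sizes below are hypothetical
    stand-ins for the per-VF attributes; only the mechanism follows the commit
    message.

```python
NLA_HDRLEN = 4
NLA_ALIGNTO = 4


def nla_total_size(payload: int) -> int:
    """Attribute header plus payload, padded up to the alignment."""
    return (NLA_HDRLEN + payload + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


# Hypothetical payload sizes in bytes.
COUNTED = {"mac": 36, "vlan": 12, "tx_rate": 8, "spoofchk": 8}
UNCOUNTED = {"link_state": 8}    # filled into the message, but not estimated


def vf_bytes(attrs: dict, num_vfs: int) -> int:
    return num_vfs * sum(nla_total_size(size) for size in attrs.values())


def shortfall(num_vfs: int) -> int:
    """Bytes written by the fill path but missing from the size estimate."""
    actual = vf_bytes({**COUNTED, **UNCOUNTED}, num_vfs)
    estimated = vf_bytes(COUNTED, num_vfs)
    return actual - estimated
```

    In this model each VF is under-counted by 12 bytes, so the gap grows
    linearly with the number of VFs, which matches the observation that the
    warnings appear when many VF links are enabled.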
     
    Since fib_lookup can no longer return ESRCH,
    checking for this error code is no longer necessary.

    Signed-off-by: Niv Yehezkel
    Signed-off-by: David S. Miller

    Niv Yehezkel
     

08 Aug, 2014

8 commits

  • Convert a zero return value on error to a negative one, as returned
    elsewhere in the function.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @@
    identifier ret; expression e1,e2;
    @@
    (
    if (\(ret < 0\|ret != 0\))
    { ... return ret; }
    |
    ret = 0
    )
    ... when != ret = e1
    when != &ret
    *if(...)
    {
    ... when != ret = e2
    when forall
    return ret;
    }
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Pablo Neira Ayuso

    Julia Lawall
     
  • Eric Dumazet reports that getsockopt() or setsockopt() sometimes
    returns -EINTR instead of -ENOPROTOOPT, causing headaches to
    application developers.

    This patch replaces all the mutex_lock_interruptible() by mutex_lock()
    in the netfilter tree, as there is no reason we should sleep for a
    long time there.

    Reported-by: Eric Dumazet
    Suggested-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Julian Anastasov

    Pablo Neira Ayuso
     
  • Fix possible replacement of the per-cpu chain counters by null
    pointer when updating an existing chain in the commit path.

    Reported-by: Matteo Croce
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This should happen once the element has been effectively released in
    the commit path, not before. This fixes a possible chain refcount leak
    if the transaction is aborted.

    Reported-by: Thomas Graf
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • netlink doesn't set any network header offset thus when the skb is
    being passed to tap devices via dev_queue_xmit_nit(), it emits klog
    false positives due to it being unset like:

    ...
    [ 124.990397] protocol 0000 is buggy, dev nlmon0
    [ 124.990411] protocol 0000 is buggy, dev nlmon0
    ...

    So just reset the network header before passing to the device; for
    packet sockets that just means nothing will change - mac and net
    offset hold the same value just as before.

    Reported-by: Marcel Holtmann
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The header multicast.h was included twice, so delete one of them.

    Signed-off-by: Jean Sacren
    Cc: Marek Lindner
    Cc: Simon Wunderlich
    Cc: Antonio Quartulli
    Cc: b.a.t.m.a.n@lists.open-mesh.org
    Signed-off-by: David S. Miller

    Jean Sacren
     
  • The #include headers net/genetlink.h and linux/genetlink.h both were
    included twice, so delete each duplicate.

    Signed-off-by: Jean Sacren
    Cc: Pravin Shelar
    Cc: dev@openvswitch.org
    Signed-off-by: David S. Miller

    Jean Sacren
     
  • Change config symbol 6LOWPAN from type bool to type tristate, so
    6LoWPAN can be built modular, just like IPV6

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Marcel Holtmann
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     

07 Aug, 2014

7 commits

  • Merge incoming from Andrew Morton:
    - Various misc things.
    - arch/sh updates.
    - Part of ocfs2. Review is slow.
    - Slab updates.
    - Most of -mm.
    - printk updates.
    - lib/ updates.
    - checkpatch updates.

    * emailed patches from Andrew Morton: (226 commits)
    checkpatch: update $declaration_macros, add uninitialized_var
    checkpatch: warn on missing spaces in broken up quoted
    checkpatch: fix false positives for --strict "space after cast" test
    checkpatch: fix false positive MISSING_BREAK warnings with --file
    checkpatch: add test for native c90 types in unusual order
    checkpatch: add signed generic types
    checkpatch: add short int to c variable types
    checkpatch: add for_each tests to indentation and brace tests
    checkpatch: fix brace style misuses of else and while
    checkpatch: add --fix option for a couple OPEN_BRACE misuses
    checkpatch: use the correct indentation for which()
    checkpatch: add fix_insert_line and fix_delete_line helpers
    checkpatch: add ability to insert and delete lines to patch/file
    checkpatch: add an index variable for fixed lines
    checkpatch: warn on break after goto or return with same tab indentation
    checkpatch: emit a warning on file add/move/delete
    checkpatch: add test for commit id formatting style in commit log
    checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
    checkpatch: improve "no space after cast" test
    checkpatch: allow multiple const * types
    ...

    Linus Torvalds
     
  • Although RCU protection would be possible during diag dump, doing
    so allows for concurrent table mutations which can render the
    in-table offset between individual Netlink messages invalid and
    thus cause legitimate sockets to be skipped in the dump.

    Since the diag dump is relatively low volume and consistency is
    more important than performance, the table mutex is held during
    dump.

    Reported-by: Andrey Wagin
    Signed-off-by: Thomas Graf
    Fixes: e341694e3eb57fc ("netlink: Convert netlink_lookup() to use RCU protected hash table")
    Signed-off-by: David S. Miller

    Thomas Graf
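
    How a table mutation between two dump calls can cause a socket to be
    skipped is easy to reproduce with a list and a saved offset. This is a toy
    model with assumed function names: the list stands in for the hash table
    and the offset for the position remembered between Netlink messages.

```python
def dump_chunk(table, offset, limit):
    """Return up to `limit` entries starting at `offset`, plus the new offset."""
    chunk = table[offset:offset + limit]
    return chunk, offset + len(chunk)


def dump_all(table, limit, mutate_after_first=None):
    """Dump the table in chunks; optionally mutate it after the first chunk."""
    seen, offset, first = [], 0, True
    while True:
        chunk, offset = dump_chunk(table, offset, limit)
        if not chunk:
            break
        seen += chunk
        if first and mutate_after_first is not None:
            mutate_after_first(table)    # concurrent change between messages
        first = False
    return seen
```

    Dumping ["a", "b", "c", "d"] two entries at a time returns everything, but
    if "a" is removed after the first chunk, the saved offset now points past
    "c" and that still-present entry is never reported. Holding a lock for the
    whole dump, as the commit does with the table mutex, rules this out.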
     
  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
    doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
    no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
    calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
    from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.

    Signed-off-by: Dmitry Popov
    Signed-off-by: David S. Miller

    Dmitry Popov
     
  • A set of small fixes pointed out just after the merge:
    - make tcp_tx_timestamp static
    - make tcp_gso_tstamp static
    - use before() to compare TCP seqno, instead of cast to u64
    - add tstamp to tx_flags in GSO, instead of overwrite tx_flags
    - record skb_shinfo(skb)->tskey for all timestamps, also HW.
    - optimization in tcp_tx_timestamp: call sock_tx_timestamp only if
      a tstamp option is set.
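
To illustrate the sequence-number item in the list above, here is a hedged user-space sketch of a wraparound-safe "before" comparison. The helper names `seq_before()` and `naive_before()` are hypothetical stand-ins chosen for this example; the sketch only shows why a signed 32-bit difference behaves correctly where a widening cast to 64 bits does not.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe comparison: seq1 is "before" seq2 if the 32-bit
 * difference, read as a signed value, is negative. The conversion to
 * int32_t relies on the usual two's-complement behaviour.
 */
static inline bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/*
 * Naive comparison after widening to 64 bits: gives the wrong answer
 * once the sequence space has wrapped past 2^32.
 */
static inline bool naive_before(uint32_t seq1, uint32_t seq2)
{
	return (uint64_t)seq1 < (uint64_t)seq2;
}
```

With `seq1 = 0xFFFFFFF0` and `seq2 = 0x10` (32 bytes later, after the wrap), the signed-difference form reports that `seq1` comes first, while the widened comparison reports the opposite.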

    Signed-off-by: Willem de Bruijn
    Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • sock_tx_timestamp() should not ignore initial *tx_flags value, as TCP
    stack can store SKBTX_SHARED_FRAG in it.

    Also first argument (struct sock *) can be const.
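
The difference between overwriting and preserving the caller's flags can be sketched as follows. This is an illustrative user-space example under stated assumptions: the `DEMO_*` flag values and both function names are hypothetical and do not reproduce the kernel's definitions or the body of `sock_tx_timestamp()`.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define DEMO_TX_SW_TSTAMP	(1u << 1)
#define DEMO_TX_SHARED_FRAG	(1u << 5)

/* Buggy pattern: discards whatever the caller already stored in *tx_flags. */
static void demo_tx_timestamp_overwrite(int want_sw_tstamp, uint8_t *tx_flags)
{
	*tx_flags = 0;
	if (want_sw_tstamp)
		*tx_flags |= DEMO_TX_SW_TSTAMP;
}

/* Fixed pattern: only ORs in timestamp bits, keeping pre-existing flags. */
static void demo_tx_timestamp_preserve(int want_sw_tstamp, uint8_t *tx_flags)
{
	if (want_sw_tstamp)
		*tx_flags |= DEMO_TX_SW_TSTAMP;
}
```

If the caller has already set `DEMO_TX_SHARED_FRAG`, the first variant silently drops it, while the second leaves it intact alongside the timestamp bit.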

    Signed-off-by: Eric Dumazet
    Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
    Cc: Willem de Bruijn
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pull networking updates from David Miller:
    "Highlights:

    1) Steady transitioning of the BPF infrastructure to a generic spot so
    all kernel subsystems can make use of it, from Alexei Starovoitov.

    2) SFC driver supports busy polling, from Alexandre Rames.

    3) Take advantage of hash table in UDP multicast delivery, from David
    Held.

    4) Lighten locking, in particular by getting rid of the LRU lists, in
    inet frag handling. From Florian Westphal.

    5) Add support for various RFC6458 control messages in SCTP, from
    Geir Ola Vaagland.

    6) Allow filtering of bridge forwarding database dumps by device, from
    Jamal Hadi Salim.

    7) virtio-net also now supports busy polling, from Jason Wang.

    8) Some low level optimization tweaks in pktgen from Jesper Dangaard
    Brouer.

    9) Add support for ipv6 address generation modes, so that userland
    can have some input into the process. From Jiri Pirko.

    10) Consolidate common TCP connection request code in ipv4 and ipv6,
    from Octavian Purdila.

    11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

    12) Generic resizable RCU hash table, with initial users in netlink and
    nftables. From Thomas Graf.

    13) Maintain a name assignment type so that userspace can see where a
    network device name came from (enumerated by kernel, assigned
    explicitly by userspace, etc.) From Tom Gundersen.

    14) Automatic flow label generation on transmit in ipv6, from Tom
    Herbert.

    15) New packet timestamping facilities from Willem de Bruijn, meant to
    assist in measuring latencies going into/out-of the packet
    scheduler, latency from TCP data transmission to ACK, etc"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
    cxgb4 : Disable recursive mailbox commands when enabling vi
    net: reduce USB network driver config options.
    tg3: Modify tg3_tso_bug() to handle multiple TX rings
    amd-xgbe: Perform phy connect/disconnect at dev open/stop
    amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
    net: sun4i-emac: fix memory leak on bad packet
    sctp: fix possible seqlock seadlock in sctp_packet_transmit()
    Revert "net: phy: Set the driver when registering an MDIO bus device"
    cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
    team: Simplify return path of team_newlink
    bridge: Update outdated comment on promiscuous mode
    net-timestamp: ACK timestamp for bytestreams
    net-timestamp: TCP timestamping
    net-timestamp: SCHED timestamp on entering packet scheduler
    net-timestamp: add key to disambiguate concurrent datagrams
    net-timestamp: move timestamp flags out of sk_flags
    net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
    cxgb4i : Move stray CPL definitions to cxgb4 driver
    tcp: reduce spurious retransmits due to transient SACK reneging
    qlcnic: Initialize dcbnl_ops before register_netdev
    ...

    Linus Torvalds
     

06 Aug, 2014

3 commits

  • Pull security subsystem updates from James Morris:
    "In this release:

    - PKCS#7 parser for the key management subsystem from David Howells
    - appoint Kees Cook as seccomp maintainer
    - bugfixes and general maintenance across the subsystem"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
    X.509: Need to export x509_request_asymmetric_key()
    netlabel: shorter names for the NetLabel catmap funcs/structs
    netlabel: fix the catmap walking functions
    netlabel: fix the horribly broken catmap functions
    netlabel: fix a problem when setting bits below the previously lowest bit
    PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
    tpm: simplify code by using %*phN specifier
    tpm: Provide a generic means to override the chip returned timeouts
    tpm: missing tpm_chip_put in tpm_get_random()
    tpm: Properly clean sysfs entries in error path
    tpm: Add missing tpm_do_selftest to ST33 I2C driver
    PKCS#7: Use x509_request_asymmetric_key()
    Revert "selinux: fix the default socket labeling in sock_graft()"
    X.509: x509_request_asymmetric_keys() doesn't need string length arguments
    PKCS#7: fix sparse non static symbol warning
    KEYS: revert encrypted key change
    ima: add support for measuring and appraising firmware
    firmware_class: perform new LSM checks
    security: introduce kernel_fw_from_file hook
    PKCS#7: Missing inclusion of linux/err.h
    ...

    Linus Torvalds
     
  • Conflicts:
    drivers/net/Makefile
    net/ipv6/sysctl_net_ipv6.c

    Two ipv6_table_template[] additions overlap, so the index
    of the ipv6_table[x] assignments needed to be adjusted.

    In the drivers/net/Makefile case, we've gotten rid of the
    garbage whereby we had to list every single USB networking
    driver in the top-level Makefile, there is just one
    "USB_NETWORKING" that guards everything.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Dave reported the following splat, caused by improper use of
    IP_INC_STATS_BH() in process context.

    BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
    caller is __this_cpu_preempt_check+0x13/0x20
    CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
    ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
    0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
    ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] check_preemption_disabled+0xfa/0x100
    [] __this_cpu_preempt_check+0x13/0x20
    [] sctp_packet_transmit+0x692/0x710 [sctp]
    [] sctp_outq_flush+0x2a2/0xc30 [sctp]
    [] ? mark_held_locks+0x7c/0xb0
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] sctp_outq_uncork+0x1a/0x20 [sctp]
    [] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
    [] sctp_do_sm+0xdb/0x330 [sctp]
    [] ? preempt_count_sub+0xab/0x100
    [] ? sctp_cname+0x70/0x70 [sctp]
    [] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
    [] sctp_sendmsg+0x88f/0xe30 [sctp]
    [] ? lock_release_holdtime.part.28+0x9a/0x160
    [] ? put_lock_stats.isra.27+0xe/0x30
    [] inet_sendmsg+0x104/0x220
    [] ? inet_sendmsg+0x5/0x220
    [] sock_sendmsg+0x9e/0xe0
    [] ? might_fault+0xb9/0xc0
    [] ? might_fault+0x5e/0xc0
    [] SYSC_sendto+0x124/0x1c0
    [] ? syscall_trace_enter+0x250/0x330
    [] SyS_sendto+0xe/0x10
    [] tracesys+0xdd/0xe2

    This is a followup of commits f1d8cba61c3c4b ("inet: fix possible
    seqlock deadlocks") and 7f88c6b23afbd315 ("ipv6: fix possible seqlock
    deadlock in ip6_finish_output2")

    Signed-off-by: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Reported-by: Dave Jones
    Acked-by: Neil Horman
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet