10 Dec, 2014

1 commit


20 Nov, 2014

1 commit


29 Aug, 2014

1 commit


18 Aug, 2014

1 commit


30 Jul, 2014

2 commits


18 Jul, 2014

1 commit

  • The current code always selects XPRT_TRANSPORT_BC_TCP for the back
    channel, even when the forward channel was not TCP (e.g., RDMA). When
    a 4.1 mount is attempted with RDMA, the server panics in the TCP BC
    code when trying to send CB_NULL.

    Instead, construct the transport protocol number from the forward
    channel transport or'd with XPRT_TRANSPORT_BC. Transports that do
    not support bi-directional RPC will not have registered a "BC"
    transport, causing create_backchannel_client() to fail immediately.

    Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    Chuck Lever
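
    A minimal userspace sketch of the idea above, not the kernel code itself.
    The numeric values are assumed to mirror the kernel's transport
    identifiers, and the registration table and all bc_* names are
    hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed identifier values; every bc_* name is illustrative only. */
#define BC_TRANSPORT_FLAG  (UINT32_C(1) << 31)  /* "this is a back channel" */
#define BC_TRANSPORT_TCP   UINT32_C(6)
#define BC_TRANSPORT_RDMA  UINT32_C(256)

/* Hypothetical table of registered transports: only TCP has a BC variant. */
static const uint32_t bc_registered[] = {
    BC_TRANSPORT_TCP,
    BC_TRANSPORT_RDMA,
    BC_TRANSPORT_TCP | BC_TRANSPORT_FLAG,
};

/* Derive the back-channel identifier from the forward-channel one. */
static uint32_t bc_ident_for(uint32_t forward_ident)
{
    return forward_ident | BC_TRANSPORT_FLAG;
}

static int bc_is_registered(uint32_t ident)
{
    size_t i;

    for (i = 0; i < sizeof(bc_registered) / sizeof(bc_registered[0]); i++)
        if (bc_registered[i] == ident)
            return 1;
    return 0;
}

/* Returns 0 on success, -1 when no matching BC transport is registered. */
static int bc_create_backchannel(uint32_t forward_ident)
{
    return bc_is_registered(bc_ident_for(forward_ident)) ? 0 : -1;
}
```

    In this model a TCP forward channel finds its back channel, while an
    RDMA forward channel fails cleanly at lookup time instead of falling
    through to TCP-specific code.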
     

31 May, 2014

1 commit

  • When debugging is enabled, rpc prints messages from
    dprintk(KERN_WARNING ...) with a "^A4" prefix:

    [ 2780.339988] ^A4nfsd: connect from unprivileged port: 127.0.0.1, port=35316

    Trond says:
    > dprintk != printk. We have NEVER supported dprintk(KERN_WARNING...)

    This patch removes the uses of dprintk with KERN_WARNING.

    Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     

23 May, 2014

2 commits

  • If an incoming NFS request is coming from the local host, then
    nfsd will need to perform some special handling. So detect that
    possibility and make the source visible in rq_local.

    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields

    NeilBrown
     
  • An NFS/RDMA client's source port is meaningless for RDMA transports.
    The transport layer typically sets the source port value on the
    connection to a random ephemeral port.

    Currently, NFS server administrators must specify the "insecure"
    export option to enable clients to access exports via RDMA.

    But this means NFS clients can access such an export via IP using an
    ephemeral port, which may not be desirable.

    This patch eliminates the need to specify the "insecure" export
    option to allow NFS/RDMA clients access to an export.

    BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    Chuck Lever
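
    The source-port reasoning in the commit above can be modelled as a small
    decision function. This is a sketch under stated assumptions, not the
    patch's actual mechanism: the 1024 threshold for privileged ports is the
    conventional one, and every port_* name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional boundary below which ports are "privileged" (assumption). */
#define PORT_PRIVILEGED_LIMIT 1024

enum port_transport { PORT_XPRT_IP, PORT_XPRT_RDMA };

/* Does the request count as coming from a "secure" source port? */
static bool port_request_is_secure(enum port_transport t, uint16_t src_port)
{
    /* The source port carries no meaning on RDMA, so do not test it. */
    if (t == PORT_XPRT_RDMA)
        return true;
    return src_port < PORT_PRIVILEGED_LIMIT;
}

/* Export access check: "insecure" waives the port test entirely. */
static bool port_export_allows(bool insecure_option,
                               enum port_transport t, uint16_t src_port)
{
    return insecure_option || port_request_is_secure(t, src_port);
}
```

    With this shape, an export without "insecure" still rejects IP clients
    on ephemeral ports while accepting RDMA clients.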
     

13 Apr, 2014

1 commit

  • Pull yet more networking updates from David Miller:

    1) Various fixes to the new Redpine Signals wireless driver, from
    Fariya Fatima.

    2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
    Dmitry Petukhov.

    3) UFO and TSO packets differ in whether they include the protocol
    header in gso_size, account for that in skb_gso_transport_seglen().
    From Florian Westphal.

    4) If VLAN untagging fails, we double free the SKB in the bridging
    output path. From Toshiaki Makita.

    5) Several call sites of sk->sk_data_ready() were referencing an SKB
    just added to the socket receive queue in order to calculate the
    second argument via skb->len. This is dangerous because the moment
    the skb is added to the receive queue it can be consumed in another
    context and freed up.

    It turns out also that none of the sk->sk_data_ready()
    implementations even care about this second argument.

    So just kill it off and thus fix all these use-after-free bugs as a
    side effect.

    6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.

    7) pktgen needs to do locking properly for LLTX devices, from Daniel
    Borkmann.

    8) xen-netfront driver initializes TX array entries in RX loop :-) From
    Vincenzo Maffione.

    9) After refactoring, some tunnel drivers allow a tunnel to be
    configured on top of itself. Fix from Nicolas Dichtel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    vti: don't allow to add the same tunnel twice
    gre: don't allow to add the same tunnel twice
    drivers: net: xen-netfront: fix array initialization bug
    pktgen: be friendly to LLTX devices
    r8152: check RTL8152_UNPLUG
    net: sun4i-emac: add promiscuous support
    net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    net: ipv6: Fix oif in TCP SYN+ACK route lookup.
    drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
    drivers: net: cpsw: discard all packets received when interface is down
    net: Fix use after free by removing length arg from sk_data_ready callbacks.
    Drivers: net: hyperv: Address UDP checksum issues
    Drivers: net: hyperv: Negotiate suitable ndis version for offload support
    Drivers: net: hyperv: Allocate memory for all possible per-pecket information
    bridge: Fix double free and memory leak around br_allowed_ingress
    bonding: Remove debug_fs files when module init fails
    i40evf: program RSS LUT correctly
    i40evf: remove open-coded skb_cow_head
    ixgb: remove open-coded skb_cow_head
    igbvf: remove open-coded skb_cow_head
    ...

    Linus Torvalds
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access potentially
    reads freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses
    the length argument. And since nobody cared about its
    value, lots of call sites pass arbitrary values such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
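
    A userspace sketch of the callback shape after this change. The dr_*
    types are hypothetical stand-ins for the socket and SKB, and the queue
    is reduced to counters. The point is that the notifier takes only the
    socket, so the caller never needs to touch the buffer after queueing it:

```c
#include <stddef.h>

struct dr_sock;

/* Callback without a length argument: it only learns "data arrived". */
typedef void (*dr_data_ready_fn)(struct dr_sock *sk);

struct dr_sock {
    size_t queued_bytes;        /* stand-in for the receive queue */
    unsigned int wakeups;       /* how often the callback fired */
    dr_data_ready_fn data_ready;
};

struct dr_buf {
    size_t len;
};

/* Default callback: just records a wakeup. */
static void dr_def_readable(struct dr_sock *sk)
{
    sk->wakeups++;
}

/* After this call the queue owns buf; the caller must not read it. */
static void dr_queue_tail(struct dr_sock *sk, struct dr_buf *buf)
{
    sk->queued_bytes += buf->len;
}

static void dr_deliver(struct dr_sock *sk, struct dr_buf *buf)
{
    dr_queue_tail(sk, buf);
    sk->data_ready(sk);         /* no buf->len access after queueing */
}
```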
     

01 Apr, 2014

1 commit

  • The NFSd file system can be mounted in a network namespace different
    from the socket's one, as in the case below:

    "ip netns exec" creates a new network and mount namespace, which duplicates
    the NFSd mount point created in the init_net context. Stopping the NFS server
    in the nested network context therefore destroys the RPCBIND client in init_net.
    Then, when NFSd is started in the nested network context, the rpc.nfsd process
    creates a socket in the nested net and passes it into "write_ports", which
    creates the RPCBIND sockets in the init_net context for the same reason (the
    NFSd mount point was created in the init_net context). An attempt to register
    the passed socket in the nested net leads to a panic, because no RPCBIND
    client is present in the nested network namespace.

    This patch adds a check that the passed socket's net matches the NFSd
    superblock's one, and returns -EINVAL to user space otherwise.

    v2: Put socket on exit.

    Reported-by: Weng Meiling
    Signed-off-by: Stanislav Kinsbursky
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Stanislav Kinsbursky
     

10 Oct, 2013

1 commit


09 Oct, 2013

1 commit

  • TCP listener refactoring, part 4 :

    To speed up inet lookups, we moved IPv4 addresses from inet to struct
    sock_common.

    Now it is time to do the same for IPv6, because it permits us to have fast
    lookups for all kinds of sockets, including upcoming SYN_RECV.

    Getting IPv6 addresses in TCP lookups currently requires two extra cache
    lines, plus a dereference (and memory stall).

    inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

    This patch is way bigger than its IPv4 counterpart, because for IPv4,
    we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
    it's not doable easily.

    inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
    inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

    And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
    at the same offset.

    We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
    macro.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Aug, 2013

1 commit


01 Aug, 2013

1 commit

  • Since we enabled auto-tuning for sunrpc TCP connections we do not
    guarantee that there is enough write-space on each connection to
    queue a reply.

    If memory pressure causes the window to shrink too small, the request
    throttling in sunrpc/svc will not accept any requests so no more requests
    will be handled. Even when pressure decreases the window will not
    grow again until data is sent on the connection.
    This means we get a deadlock: no requests will be handled until there
    is more space, and no space will be allocated until a request is
    handled.

    This can be simulated by modifying svc_tcp_has_wspace to inflate the
    number of bytes required and removing the 'svc_sock_setbufsize' calls
    in svc_setup_socket.

    I found that multiplying by 16 was enough to make the requirement
    exceed the default allocation. With this modification in place:
    mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
    would block and eventually time out because the nfs server could not
    accept any requests.

    This patch relaxes the request throttling to always allow at least one
    request through per connection. It does this by checking that both
    sk_stream_min_wspace() and xprt->xpt_reserved
    are zero.
    The first is zero when the TCP transmit queue is empty.
    The second is zero when there are no RPC requests being processed.
    When both of these are zero the socket is idle and so one more
    request can safely be allowed through.

    Applying this patch allows the above mount command to succeed cleanly.
    Tracing shows that the allocated write buffer space quickly grows and
    after a few requests are handled, the extra tests are no longer needed
    to permit further requests to be processed.

    The main purpose of request throttling is to handle the case when one
    client is slow at collecting replies and the send queue gets full of
    replies that the client hasn't acknowledged (at the TCP level) yet.
    As we only change behaviour when the send queue is empty this main
    purpose is still preserved.

    Reported-by: Ben Myers
    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields

    NeilBrown
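
    The relaxed throttling rule described above can be sketched in userspace
    as follows. This is a simplified model, not the kernel function: the
    thr_* names are hypothetical, and the two helpers only approximate what
    sk_stream_wspace() and sk_stream_min_wspace() compute.

```c
#include <stdbool.h>

struct thr_conn {
    int sndbuf;       /* current send buffer size in bytes */
    int wmem_queued;  /* bytes sitting in the transmit queue */
    int reserved;     /* reply space reserved by RPCs in progress */
};

/* Free space in the send buffer (model of sk_stream_wspace()). */
static int thr_wspace(const struct thr_conn *c)
{
    return c->sndbuf - c->wmem_queued;
}

/* Zero when the transmit queue is empty (model of sk_stream_min_wspace()). */
static int thr_min_wspace(const struct thr_conn *c)
{
    return c->wmem_queued >> 1;
}

/* May one more request be accepted on this connection? */
static bool thr_has_wspace(const struct thr_conn *c, int max_mesg)
{
    int required = c->reserved + max_mesg;

    /* Normal case: enough room for every reply we might have to send. */
    if (thr_wspace(c) >= required)
        return true;

    /* Idle socket: nothing queued, nothing in flight; let one through. */
    return thr_min_wspace(c) == 0 && c->reserved == 0;
}
```

    The second test only fires on an idle connection, so a client that is
    slow to collect replies is still throttled as before.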
     

25 Jul, 2013

1 commit

  • Several call sites use the following hardcoded condition:

    sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

    Let's use a helper because TCP_NOTSENT_LOWAT support will change this
    condition for TCP sockets.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
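
    A sketch of what wrapping that condition in a helper looks like, using a
    hypothetical userspace struct in place of struct sock. The field names
    and the mf_* helpers are assumptions made for illustration:

```c
#include <stdbool.h>

struct mf_sock {
    int sndbuf;       /* send buffer size in bytes */
    int wmem_queued;  /* bytes currently queued for transmit */
};

static int mf_stream_wspace(const struct mf_sock *sk)
{
    return sk->sndbuf - sk->wmem_queued;
}

static int mf_stream_min_wspace(const struct mf_sock *sk)
{
    return sk->wmem_queued >> 1;
}

/* Single place that answers "is there room to write?". */
static bool mf_stream_memory_free(const struct mf_sock *sk)
{
    return mf_stream_wspace(sk) >= mf_stream_min_wspace(sk);
}
```

    Call sites then ask mf_stream_memory_free(sk) instead of repeating the
    comparison, so a later change to the condition touches one function.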
     

02 Jul, 2013

2 commits

  • Though clients we care about mostly don't do this, it is possible for
    rpc requests to be sent in multiple fragments. Here we have a sanity
    check to ensure that the final received rpc isn't too small--except that
    the number we're actually checking is the length of just the final
    fragment, not of the whole rpc. So a perfectly legal rpc that's
    unluckily fragmented could cause the server to close the connection
    here.

    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • If we detect that an rpc is too short, we abort and close the
    connection. Except, there's a bug here: we're leaving sk_datalen
    nonzero without leaving any pages in the sk_pages array. The most
    likely result of the inconsistency is a subsequent crash in
    svc_tcp_clear_pages.

    Also demote the BUG_ON in svc_tcp_clear_pages to a WARN.

    Cc: stable@kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
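
    Both fixes above can be illustrated with one small userspace sketch. It
    is not the kernel code: the frag_* names are hypothetical, and the
    8-byte minimum is taken from the 04 Dec, 2012 commit text further down.
    The length check runs against the accumulated RPC size rather than the
    final fragment, and the accumulated length is reset on abort so no
    stale state survives:

```c
#include <stdbool.h>
#include <stddef.h>

#define FRAG_MIN_RPC_BYTES 8   /* smallest RPC accepted (assumed) */

struct frag_rx {
    size_t datalen;            /* RPC payload accumulated so far */
};

/*
 * Account for one received fragment.
 * Returns 0 if more fragments are expected, 1 if a complete RPC was
 * accepted, -1 if the complete RPC is too short.
 */
static int frag_rx_fragment(struct frag_rx *rx, size_t frag_len, bool final)
{
    rx->datalen += frag_len;
    if (!final)
        return 0;

    /* Check the whole RPC, not just the last fragment. */
    if (rx->datalen < FRAG_MIN_RPC_BYTES) {
        rx->datalen = 0;       /* leave no inconsistent length behind */
        return -1;
    }

    rx->datalen = 0;           /* ready for the next RPC */
    return 1;
}
```

    A 44-byte RPC split as 40 + 4 bytes is accepted here, whereas a check
    on the final fragment alone would have rejected it.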
     

01 Feb, 2013

1 commit


21 Dec, 2012

1 commit

  • Pull nfsd update from Bruce Fields:
    "Included this time:

    - more nfsd containerization work from Stanislav Kinsbursky: we're
    not quite there yet, but should be by 3.9.

    - NFSv4.1 progress: implementation of basic backchannel security
    negotiation and the mandatory BACKCHANNEL_CTL operation. See

    http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues

    for remaining TODO's

    - Fixes for some bugs that could be triggered by unusual compounds.
    Our xdr code wasn't designed with v4 compounds in mind, and it
    shows. A more thorough rewrite is still a todo.

    - If you've ever seen "RPC: multiple fragments per record not
    supported" logged while using some sort of odd userland NFS client,
    that should now be fixed.

    - Further work from Jeff Layton on our mechanism for storing
    information about NFSv4 clients across reboots.

    - Further work from Bryan Schumaker on his fault-injection mechanism
    (which allows us to discard selective NFSv4 state, to exercise
    rarely-taken recovery code paths in the client.)

    - The usual mix of miscellaneous bugs and cleanup.

    Thanks to everyone who tested or contributed this cycle."

    * 'for-3.8' of git://linux-nfs.org/~bfields/linux: (111 commits)
    nfsd4: don't leave freed stateid hashed
    nfsd4: free_stateid can use the current stateid
    nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
    nfsd: warn on odd reply state in nfsd_vfs_read
    nfsd4: fix oops on unusual readlike compound
    nfsd4: disable zero-copy on non-final read ops
    svcrpc: fix some printks
    NFSD: Correct the size calculation in fault_inject_write
    NFSD: Pass correct buffer size to rpc_ntop
    nfsd: pass proper net to nfsd_destroy() from NFSd kthreads
    nfsd: simplify service shutdown
    nfsd: replace boolean nfsd_up flag by users counter
    nfsd: simplify NFSv4 state init and shutdown
    nfsd: introduce helpers for generic resources init and shutdown
    nfsd: make NFSd service structure allocated per net
    nfsd: make NFSd service boot time per-net
    nfsd: per-net NFSd up flag introduced
    nfsd: move per-net startup code to separated function
    nfsd: pass net to __write_ports() and down
    nfsd: pass net to nfsd_set_nrthreads()
    ...

    Linus Torvalds
     

18 Dec, 2012

2 commits


04 Dec, 2012

5 commits

  • Over TCP, RPC's are preceded by a single 4-byte field telling you how
    long the rpc is (in bytes). The spec also allows you to send an RPC in
    multiple such records (the high bit of the length field is used to tell
    you whether this is the final record).

    We've survived for years without supporting this because in practice the
    clients we care about don't use it. But the userland rpc libraries do,
    and every now and then an experimental client will run into this. (Most
    recently I noticed it while trying to write a pynfs check.) And we're
    really on the wrong side of the spec here--let's fix this.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Keep a separate field, sk_datalen, that tracks only the data contained
    in a fragment, not including the fragment header.

    For now, this is always just max(0, sk_tcplen - 4), but after we allow
    multiple fragments sk_datalen will accumulate the total rpc data size
    while sk_tcplen only tracks progress receiving the current fragment.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • The full reclen doesn't include the fragment header, but sk_tcplen does.
    Fix this to make it an apples-to-apples comparison.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Soon we want to support multiple fragments, in which case it may be
    legal for a single fragment to be smaller than 8 bytes, so we'll want to
    delay this check till we've reached the last fragment.

    Also fix an outdated comment.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Byte-swapping in place is always a little dubious.

    Let's instead define this field to always be big-endian, and do the
    swapping on demand where we need it.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
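
    The record-marking details in the five commits above can be sketched in
    userspace as follows. This is a model under stated assumptions, not the
    kernel code: the rm_* names and the struct layout are hypothetical. The
    marker is kept big-endian exactly as received and decoded on demand,
    and the data length excludes the 4-byte fragment header:

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define RM_LAST_FRAGMENT UINT32_C(0x80000000)  /* high bit: final fragment */
#define RM_SIZE_MASK     UINT32_C(0x7fffffff)  /* low 31 bits: length */
#define RM_HEADER_BYTES  4

struct rm_sock {
    uint32_t reclen_be;  /* fragment header, always big-endian */
    uint32_t tcplen;     /* bytes received for this fragment, header included */
};

/* Build a fragment header in wire (big-endian) order. */
static uint32_t rm_make_marker(uint32_t len, bool final)
{
    return htonl((len & RM_SIZE_MASK) | (final ? RM_LAST_FRAGMENT : 0));
}

/* Fragment payload length; byte order is converted only here. */
static uint32_t rm_reclen(const struct rm_sock *s)
{
    return ntohl(s->reclen_be) & RM_SIZE_MASK;
}

static bool rm_is_final(const struct rm_sock *s)
{
    return (ntohl(s->reclen_be) & RM_LAST_FRAGMENT) != 0;
}

/* max(0, tcplen - 4): payload bytes received so far, header excluded. */
static uint32_t rm_fragment_data(const struct rm_sock *s)
{
    return s->tcplen > RM_HEADER_BYTES ? s->tcplen - RM_HEADER_BYTES : 0;
}
```

    Comparing rm_fragment_data() with rm_reclen() is then a like-for-like
    comparison, since neither value counts the header.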
     

05 Nov, 2012

1 commit


10 Sep, 2012

1 commit

  • You can use nfsd/portlist to give nfsd additional sockets to listen on.
    In theory you can also remove listening sockets this way. But nobody's
    ever done that as far as I can tell.

    Also this was partially broken in 2.6.25, by
    a217813f9067b785241cb7f31956e51d2071703a "knfsd: Support adding
    transports by writing portlist file".

    (Note that we decide whether to take the "delfd" case by checking for a
    digit--but what's actually expected in that case is something made by
    svc_one_sock_name(), which won't begin with a digit.)

    So, let's just rip out this stuff.

    Acked-by: NeilBrown
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

22 Aug, 2012

6 commits


21 Aug, 2012

1 commit

  • Examination of svc_tcp_clear_pages shows that it assumes sk_tcplen is
    consistent with sk_pages[] (in particular, sk_pages[n] can't be NULL if
    sk_tcplen would lead us to expect n pages of data).

    svc_tcp_restore_pages zeroes out sk_pages[] while leaving sk_tcplen.
    This is OK, since both functions are serialized by XPT_BUSY. However,
    that means the inconsistency must be repaired before dropping XPT_BUSY.

    Therefore we should be ensuring that svc_tcp_save_pages repairs the
    problem before exiting svc_tcp_recv_record on error.

    Symptoms were a BUG() in svc_tcp_clear_pages.

    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

28 Jun, 2012

1 commit

  • dropwatch wrongly diagnoses all received UDP packets as drops.

    This patch removes trace_kfree_skb() done in skb_free_datagram_locked().

    Locations calling skb_free_datagram_locked() should do it on their own.

    As a result, drops are accounted on the right function.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 May, 2012

1 commit