06 Jul, 2016

1 commit

  • dn_fib_count_nhs() could enter an infinite loop if nhp->rtnh_len == 0
    (i.e. if userspace passes a malformed netlink message).

    Let's use the helpers from net/nexthop.h which take care of all this
    stuff. We can do exactly the same as e.g. fib_count_nexthops() and
    fib_get_nhs() from net/ipv4/fib_semantics.c.

    This fixes the softlockup for me.

    Cc: Thomas Graf
    Signed-off-by: Vegard Nossum
    Signed-off-by: David S. Miller

    Vegard Nossum
     

11 Apr, 2016

1 commit


15 Dec, 2015

1 commit

  • 郭永刚 reported that one could simply crash the kernel as root by
    using a simple program:

    int socket_fd;
    struct sockaddr_in addr;
    addr.sin_port = 0;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_family = 10;

    socket_fd = socket(10,3,0x40000000);
    connect(socket_fd , &addr,16);

    AF_INET, AF_INET6 sockets actually only support 8-bit protocol
    identifiers. inet_sock's skc_protocol field thus is sized accordingly,
    thus larger protocol identifiers simply cut off the higher bits and
    store a zero in the protocol fields.

    This could lead to e.g. NULL function pointer because as a result of
    the cut off inet_num is zero and we call down to inet_autobind, which
    is NULL for raw sockets.

    kernel: Call Trace:
    kernel: [] ? inet_autobind+0x2e/0x70
    kernel: [] inet_dgram_connect+0x54/0x80
    kernel: [] SYSC_connect+0xd9/0x110
    kernel: [] ? ptrace_notify+0x5b/0x80
    kernel: [] ? syscall_trace_enter_phase2+0x108/0x200
    kernel: [] SyS_connect+0xe/0x10
    kernel: [] tracesys_phase2+0x84/0x89

    I found no particular commit which introduced this problem.

    CVE: CVE-2015-8543
    Cc: Cong Wang
    Reported-by: 郭永刚
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

02 Dec, 2015

1 commit

  • This patch is a cleanup to make following patch easier to
    review.

    Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to a (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async()

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int net, struct sock *sk) are added so that
    following patch can change their implementation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Oct, 2015

2 commits

  • This merge resolves conflicts with 75aec9df3a78 ("bridge: Remove
    br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
    netns support in the network stack that reached upstream via David's
    net-next tree.

    Signed-off-by: Pablo Neira Ayuso

    Conflicts:
    net/bridge/br_netfilter_hooks.c

    Pablo Neira Ayuso
     
  • A recent change to the dst_output handling caused a new warning
    when the call to NF_HOOK() is the only used of a local variable
    passed as 'dev', and CONFIG_NETFILTER is disabled:

    net/ipv6/ip6_output.c: In function 'ip6_output':
    net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]

    The reason for this is that the NF_HOOK macro in this case does
    not reference the variable at all, and the call to dev_net(dev)
    got removed from the ip6_output function. To avoid that warning now
    and in the future, this changes the macro into an equivalent
    inline function, which tells the compiler that the variable is
    passed correctly but still unused.

    The dn_forward function apparently had the same problem in
    the past and added a local workaround that no longer works
    with the inline function. In order to avoid a regression, we
    have to also remove the #ifdef from decnet in the same patch.

    Fixes: ede2059dbaf9 ("dst: Pass net into dst->output")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     

08 Oct, 2015

2 commits


19 Sep, 2015

1 commit


18 Sep, 2015

3 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done it's job passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known which allows the network namespace to
    be easily and reliabily.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in then the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Add a sock paramter to dst_output making dst_output_sk superfluous.
    Add a skb->sk parameter to all of the callers of dst_output
    Have the callers of dst_output_sk call dst_output.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

10 Sep, 2015

1 commit

  • This switches IPv6 policy routing to use the shared
    fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
    multicast routing for IPv4 as well as IPv6.

    The motivation for this patch is a complaint about iproute2 behaving
    inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
    IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
    assigned priority value was decreased with each rule added.

    Since then all users of the default_pref field have been converted to
    assign the generic function fib_default_rule_pref(), fib_nl_newrule()
    may just use it directly instead. Therefore get rid of the function
    pointer altogether and make fib_default_rule_pref() static, as it's not
    used outside fib_rules.c anymore.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     

11 May, 2015

1 commit


08 Apr, 2015

1 commit

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, is potentially the socket used to control a tunneling
    socket, such as one the encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     

07 Apr, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/mellanox/mlx4/cmd.c
    net/core/fib_rules.c
    net/ipv4/fib_frontend.c

    The fib_rules.c and fib_frontend.c conflicts were locking adjustments
    in 'net' overlapping addition and removal of code in 'net-next'.

    The mlx4 conflict was a bug fix in 'net' happening in the same
    place a constant was being replaced with a more suitable macro.

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Apr, 2015

1 commit


03 Apr, 2015

1 commit

  • We have to hold rtnl lock for fib_rules_unregister()
    otherwise the following race could happen:

    fib_rules_unregister(): fib_nl_delrule():
    ... ...
    ... ops = lookup_rules_ops();
    list_del_rcu(&ops->list);
    list_for_each_entry(ops->rules) {
    fib_rules_cleanup_ops(ops); ...
    list_del_rcu(); list_del_rcu();
    }

    Note, net->rules_mod_lock is actually not needed at all,
    either upper layer netns code or rtnl lock guarantees
    we are safe.

    Cc: Alexander Duyck
    Cc: Thomas Graf
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

10 Mar, 2015

1 commit

  • After my change to neigh_hh_init to obtain the protocol from the
    neigh_table there are no more users of protocol in struct dst_ops.
    Remove the protocol field from dst_ops and all of it's initializers.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

07 Mar, 2015

1 commit

  • Other users users of the neighbour table use neigh->output as the method
    to decided when and which link-layer header to place on a packet.
    DECnet has been using neigh->output to decide which DECnet headers to
    place on a packet depending which neighbour the packet is destined for.

    The DECnet usage isn't totally wrong but it can run into problems if the
    neighbour output function is run for a second time as the teql driver
    and the bridge netfilter code can do.

    Therefore to avoid pathologic problems later down the line and make the
    neighbour code easier to understand by refactoring the decnet output
    code to only use a neighbour method to add a link layer header to a
    packet.

    This is done by moving the neigbhour operations lookup from
    dn_to_neigh_output to dn_neigh_output_packet.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

04 Mar, 2015

2 commits

  • While looking at the mpls code I found myself writing yet another
    version of neigh_lookup_noref. We currently have __ipv4_lookup_noref
    and __ipv6_lookup_noref.

    So to make my work a little easier and to make it a smidge easier to
    verify/maintain the mpls code in the future I stopped and wrote
    ___neigh_lookup_noref. Then I rewote __ipv4_lookup_noref and
    __ipv6_lookup_noref in terms of this new function. I tested my new
    version by verifying that the same code is generated in
    ip_finish_output2 and ip6_finish_output2 where these functions are
    inlined.

    To get to ___neigh_lookup_noref I added a new neighbour cache table
    function key_eq. So that the static size of the key would be
    available.

    I also added __neigh_lookup_noref for people who want to to lookup
    a neighbour table entry quickly but don't know which neibhgour table
    they are going to look up.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Conflicts:
    drivers/net/ethernet/rocker/rocker.c

    The rocker commit was two overlapping changes, one to rename
    the ->vport member to ->pport, and another making the bitmask
    expression use '1ULL' instead of plain '1'.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Mar, 2015

2 commits

  • - Add protocol to neigh_tbl so that dst->ops->protocol is not needed
    - Acquire the device from neigh->dev

    This results in a neigh_hh_init that will cache the samve values
    regardless of the packets flowing through it.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • After TIPC doesn't depend on iocb argument in its internal
    implementations of sendmsg() and recvmsg() hooks defined in proto
    structure, no any user is using iocb argument in them at all now.
    Then we can drop the redundant iocb argument completely from kinds of
    implementations of both sendmsg() and recvmsg() in the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

24 Feb, 2015

1 commit


19 Jan, 2015

1 commit

  • Commit 053c095a82cf ("netlink: make nlmsg_end() and genlmsg_end()
    void") didn't catch all of the cases where callers were breaking out
    on the return value being equal to zero, which they no longer should
    when zero means success.

    Fix all such cases.

    Reported-by: Marcel Holtmann
    Reported-by: Scott Feldman
    Signed-off-by: David S. Miller

    David S. Miller
     

18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
    check for
    Signed-off-by: David S. Miller

    Johannes Berg
     

06 Jan, 2015

1 commit

  • This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
    control metric to be set up and dumped back to user space.

    While the internal representation of RTAX_CC_ALGO is handled as a u32
    key, we avoided to expose this implementation detail to user space, thus
    instead, we chose the netlink attribute that is being exchanged between
    user space to be the actual congestion control algorithm name, similarly
    as in the setsockopt(2) API in order to allow for maximum flexibility,
    even for 3rd party modules.

    It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
    it should have been stored in RTAX_FEATURES instead, we first thought
    about reusing it for the congestion control key, but it brings more
    complications and/or confusion than worth it.

    Joint work with Florian Westphal.

    Signed-off-by: Florian Westphal
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

24 Nov, 2014

2 commits


12 Nov, 2014

1 commit

  • Currently there are only three neigh tables in the whole kernel:
    arp table, ndisc table and decnet neigh table. What's more,
    we don't support registering multiple tables per family.
    Therefore we can just make these tables statically built-in.

    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

23 Aug, 2014

3 commits

  • The functions time_before, time_before_eq, time_after, and time_after_eq
    are more robust for comparing jiffies against other values.

    A simplified version of the Coccinelle semantic patch making this change
    is as follows:

    @change@
    expression E1,E2,E3;
    @@
    - jiffies - E1 >= (E2*E3)
    + time_after_eq(jiffies, E1+E2*E3)

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
     
  • The functions time_before, time_before_eq, time_after, and time_after_eq
    are more robust for comparing jiffies against other values.

    A simplified version of the Coccinelle semantic patch making this change
    is as follows:

    @change@
    expression E1,E2;
    @@
    - (jiffies - E1) >= E2
    + time_after_eq(jiffies, E1+E2)

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
     
  • The functions time_before, time_before_eq, time_after, and time_after_eq
    are more robust for comparing jiffies against other values.

    A simplified version of the Coccinelle semantic patch making this change
    is as follows:

    @change@
    expression E1,E2;
    @@

    (
    - (jiffies - E1) < E2
    + time_before(jiffies, E1+E2)
    )

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
     

24 May, 2014

1 commit

  • Define separate fields in the sock structure for configuring disabling
    checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
    The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
    removed UDP_CSUM_* defines since they are no longer necessary.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

25 Apr, 2014

1 commit

  • It is possible by passing a netlink socket to a more privileged
    executable and then to fool that executable into writing to the socket
    data that happens to be valid netlink message to do something that
    privileged executable did not intend to do.

    To keep this from happening replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
    Which act the same as the previous calls except they verify that the
    opener of the socket had the desired permissions as well.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

16 Apr, 2014

1 commit

  • In the dst->output() path for ipv4, the code assumes the skb it has to
    transmit is attached to an inet socket, specifically via
    ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
    provider of the packet is an AF_PACKET socket.

    The dst->output() method gets an additional 'struct sock *sk'
    parameter. This needs a cascade of changes so that this parameter can
    be propagated from vxlan to final consumer.

    Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
    Reported-by: lucien xin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->s_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    to freed up memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback actually uses
    the length argument. And since nobody actually cared about it's
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Feb, 2014

2 commits

  • Move prototype declaration of functions to header file include/net/dn.h
    from net/decnet/af_decnet.c because they are used by more than one file.

    This eliminates the following warning in net/decnet/af_decnet.c:
    net/decnet/sysctl_net_decnet.c:354:6: warning: no previous prototype for ‘dn_register_sysctl’ [-Wmissing-prototypes]
    net/decnet/sysctl_net_decnet.c:359:6: warning: no previous prototype for ‘dn_unregister_sysctl’ [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: David S. Miller

    Rashika Kheria
     
  • Move prototype declaration of functions to header file include/net/dn_route.h
    from net/decnet/af_decnet.c because it is used by more than one file.

    This eliminates the following warning in net/decnet/dn_route.c:
    net/decnet/dn_route.c:629:5: warning: no previous prototype for ‘dn_route_rcv’ [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: David S. Miller

    Rashika Kheria