10 May, 2019

1 commit

  • commit 517d7c79bdb398 ("tipc: fix hanging poll() for stream sockets")
    introduced a regression for clients using non-blocking sockets.
    After the commit, we send EPOLLOUT event to the client even in
    TIPC_CONNECTING state. This causes the subsequent send() to fail
    with ENOTCONN, as the socket is still not in TIPC_ESTABLISHED state.

    In this commit, we:
    - improve the fix for hanging poll() by replacing sk_data_ready()
    with sk_state_change() to wake up all clients.
    - revert the faulty updates introduced by commit 517d7c79bdb398
    ("tipc: fix hanging poll() for stream sockets").
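
The intended behaviour can be modelled with a small, self-contained sketch. It is not the kernel's poll implementation: the state names, the `demo_poll_mask()` helper and its parameters are hypothetical and only mirror the states named in the text above. The point it illustrates is that a socket that is still connecting must not be reported as writable.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical connection states, named after the ones in the commit text. */
enum demo_state { DEMO_CONNECTING, DEMO_ESTABLISHED, DEMO_DISCONNECTING };

/*
 * Return the poll events to report for a stream socket.
 * rx_pending is nonzero when data is queued for reading;
 * tx_congested is nonzero when the send path cannot accept more data.
 */
short demo_poll_mask(enum demo_state state, int rx_pending, int tx_congested)
{
    short mask = 0;

    switch (state) {
    case DEMO_ESTABLISHED:
        if (!tx_congested)
            mask |= POLLOUT;        /* send() can succeed now */
        if (rx_pending)
            mask |= POLLIN;
        break;
    case DEMO_CONNECTING:
        /* Not writable yet: a send() here would fail with ENOTCONN. */
        break;
    case DEMO_DISCONNECTING:
        mask |= POLLIN | POLLHUP;   /* let readers observe the shutdown */
        break;
    }
    return mask;
}
```

With this model, a non-blocking client that waits for POLLOUT after connect() is only woken once the connection is established.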

    Fixes: 517d7c79bdb398 ("tipc: fix hanging poll() for stream sockets")
    Signed-off-by: Parthasarathy Bhuvaragan
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     

04 May, 2019

1 commit

  • A TIPC link can temporarily fall into a "half-established" state in
    which only one of the link endpoints is ESTABLISHED and starts to send
    traffic and PROTOCOL messages, whereas the other link endpoint is not
    up (e.g. the network interface goes down immediately after the
    endpoint receives ACTIVATE_MSG).

    This is a normal situation and will be settled because the link
    endpoint will be eventually brought down after the link tolerance time.

    However, the situation becomes worse when the second link is
    established before the first link endpoint goes down.
    For example:

    1. Both links , down
    2. Link endpoint 2A up, but 1A still down (e.g. due to network
    disturbance, wrong session, etc.)
    3. Link up
    4. Link endpoint 2A down (e.g. due to link tolerance timeout)
    5. Node B starts failover onto link

    ==> Node A never starts link failover.

    When the "half-failover" situation happens, two consequences have been
    observed:

    a) Peer link/node gets stuck in FAILINGOVER state;
    b) Traffic or user messages that peer node is trying to failover onto
    the second link can be partially or completely dropped by this node.

    The consequence a) was actually solved by commit c140eb166d68 ("tipc:
    fix failover problem"), but that commit didn't cover the b). It's due
    to the fact that the tunnel link endpoint has never been prepared for a
    failover, so the 'l->drop_point' (and the other data...) is not set
    correctly. When a TUNNEL_MSG from peer node arrives on the link,
    depending on the inner message's seqno and the current 'l->drop_point'
    value, the message can be dropped (- treated as a duplicate message) or
    processed.
    At this early stage, the traffic messages from peer are likely to be
    NAME_DISTRIBUTORs, this means some name table entries will be missed on
    the node forever!

    The commit resolves the issue by starting the FAILOVER process on this
    node as well. Another benefit from this solution is that we ensure the
    link will not be re-established until the failover ends.

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

28 Apr, 2019

3 commits

  • Add options to strictly validate messages and dump messages.
    Sometimes validating dump messages non-strictly may be required,
    so add an option for that as well.

    Since none of this can really be applied to existing commands,
    set the options everywhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
    {
    .cmd = X,
    + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
    ...
    },
    ...
    };

    For new commands one should just not copy the .validate 'opt-out'
    flags and thus get strict validation.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
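
The split into independent options can be pictured as a bitmask. The sketch below is a toy model, not the kernel's netlink code: the `DEMO_*` names, the structs and both helper functions are hypothetical, chosen only to echo the four options listed above and the two legacy combinations.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag names that echo the four options listed above. */
enum demo_validate {
    DEMO_VALIDATE_TRAILING     = 1 << 0,
    DEMO_VALIDATE_MAXTYPE      = 1 << 1,
    DEMO_VALIDATE_UNSPEC       = 1 << 2,
    DEMO_VALIDATE_STRICT_ATTRS = 1 << 3,
};

/* The old "regular" parsing: none of the checks. */
#define DEMO_VALIDATE_LIBERAL 0u
/* The old *_strict() behaviour: TRAILING and MAXTYPE only. */
#define DEMO_VALIDATE_DEPRECATED_STRICT \
    (DEMO_VALIDATE_TRAILING | DEMO_VALIDATE_MAXTYPE)
/* What future users should get: everything. */
#define DEMO_VALIDATE_STRICT \
    (DEMO_VALIDATE_TRAILING | DEMO_VALIDATE_MAXTYPE | \
     DEMO_VALIDATE_UNSPEC | DEMO_VALIDATE_STRICT_ATTRS)

struct demo_attr {
    int type;
    size_t len;
};

struct demo_policy {
    int has_entry;          /* 0 models an NLA_UNSPEC policy slot */
    size_t expected_len;
};

/* Return 0 if the attribute passes under the given flags, -1 otherwise. */
int demo_validate_attr(const struct demo_attr *a, const struct demo_policy *pol,
                       int maxtype, unsigned int flags)
{
    if (a->type < 0)
        return -1;
    if (a->type > maxtype)
        return (flags & DEMO_VALIDATE_MAXTYPE) ? -1 : 0;
    if (!pol[a->type].has_entry)
        return (flags & DEMO_VALIDATE_UNSPEC) ? -1 : 0;
    if (a->len < pol[a->type].expected_len)
        return -1;          /* in this model, too short always fails */
    if (a->len > pol[a->type].expected_len &&
        (flags & DEMO_VALIDATE_STRICT_ATTRS))
        return -1;
    return 0;
}

/* Bytes left over after the last attribute are only an error with TRAILING. */
int demo_check_trailing(size_t leftover, unsigned int flags)
{
    return (leftover && (flags & DEMO_VALIDATE_TRAILING)) ? -1 : 0;
}
```

An oversized attribute, for instance, passes under the liberal and deprecated-strict combinations in this model and is rejected only once STRICT_ATTRS is set.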

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)
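
The compatibility concern can be shown with a short sketch. The `DEMO_*` constants are local copies of the flag layout in the netlink uapi header (the two top bits of nla_type carry flags); the four helper functions are hypothetical stand-ins for what a sender and a userspace parser might do.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi layout: the two top bits of nla_type are flags. */
#define DEMO_NLA_F_NESTED        (1 << 15)
#define DEMO_NLA_F_NET_BYTEORDER (1 << 14)
#define DEMO_NLA_TYPE_MASK       (~(DEMO_NLA_F_NESTED | DEMO_NLA_F_NET_BYTEORDER))

/* nla_type as a "noflag" nest start would emit it. */
uint16_t demo_nest_type_noflag(uint16_t attrtype)
{
    return attrtype;
}

/* nla_type as the new flag-setting wrapper would emit it. */
uint16_t demo_nest_type(uint16_t attrtype)
{
    return (uint16_t)(attrtype | DEMO_NLA_F_NESTED);
}

/* Fragile userspace check: compares the raw field, flags included. */
int demo_match_raw(uint16_t nla_type, uint16_t wanted)
{
    return nla_type == wanted;
}

/* Robust userspace check: masks the flag bits out first. */
int demo_match_masked(uint16_t nla_type, uint16_t wanted)
{
    return (nla_type & DEMO_NLA_TYPE_MASK) == wanted;
}
```

A parser written like `demo_match_raw()` stops recognizing an attribute as soon as the kernel starts setting the flag, which is why existing callers keep the flagless variant.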

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

27 Apr, 2019

1 commit


25 Apr, 2019

1 commit

  • First thing tipc_udp_recv() does is to use rcu_dereference_sk_user_data(),
    and this is really hinting we already own rcu_read_lock() from the caller
    (UDP stack).

    No need to add another rcu_read_lock()/rcu_read_unlock() pair.

    Also use rcu_dereference() instead of rcu_dereference_rtnl()
    in the data path.

    Signed-off-by: Eric Dumazet
    Cc: Jon Maloy
    Cc: Ying Xue
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Apr, 2019

1 commit

  • When using TIPC_SOCK_RECVQ_DEPTH for getsockopt(), it returns the
    number of buffers in the receive socket buffer, which is not very
    helpful for user space applications.

    This commit introduces the new option TIPC_SOCK_RECVQ_USED, which
    returns the number of bytes currently allocated in the receive socket
    buffer. This helps user space applications dimension their buffer
    usage to avoid buffer overload issues.

    Signed-off-by: Tung Nguyen
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Tung Nguyen
     

18 Apr, 2019

1 commit


17 Apr, 2019

2 commits

  • We find that sysctl_tipc_rmem and named_timeout do not have the right minimum
    setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem,
    and named_timeout, as a timeout setting, should not be less than zero.

    Fixes: cc79dd1ba9c10 ("tipc: change socket buffer overflow control to respect sk_rcvbuf")
    Fixes: a5325ae5b8bff ("tipc: add name distributor resiliency queue")
    Signed-off-by: Jie Liu
    Reported-by: Qiang Ning
    Reviewed-by: Zhiqiang Liu
    Reviewed-by: Miaohe Lin
    Signed-off-by: David S. Miller

    Jie Liu
     
  • According to the link FSM, when a link endpoint got RESET_MSG (- a
    traditional one without the stopping bit) from its peer, it moves to
    PEER_RESET state and raises a LINK_DOWN event which then resets the
    link itself. Its state will become ESTABLISHING after the reset event
    and the link will be re-established soon after this endpoint starts to
    send ACTIVATE_MSG to the peer.

    There is no problem with this mechanism, however the link resetting has
    cleared the link 'in_session' flag (along with the other important link
    data such as: the link 'mtu') that was correctly set up at the 1st step
    (i.e. when this endpoint received the peer RESET_MSG). As a result, the
    link will become ESTABLISHED, but the 'in_session' flag is not set, and
    all STATE_MSG from its peer will be dropped at the link_validate_msg().
    This means the link is not in sync and will sooner or later fail.

    Since the link reset action is obviously needed for a new link session
    (this is also true in the other situations), the problem here is that
    the link is re-established a bit too early when the link endpoints are
    not really in-sync yet. The commit forces a resync as already done in
    the previous commit 91986ee166cf ("tipc: fix link session and
    re-establish issues") by simply varying the link 'peer_session' value
    at the link_reset().

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     

12 Apr, 2019

1 commit

  • In the function tipc_node_create() we protect the peer capability field
    by using the node rw_lock. However, we access the lock directly instead
    of using the dedicated functions for this, as we do everywhere else in
    node.c. This cosmetic spot is fixed here.

    Fixes: 40999f11ce67 ("tipc: make link capability update thread safe")
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

11 Apr, 2019

1 commit

  • Binding multiple services with specific types 1Ki, 2Ki, ... leads to
    some entries in the publications name table being missing when
    listed via 'tipc name show'.

    The problem lies in how a zero last_type value provided via netlink
    is interpreted. It has two meanings: first, it is the initial 'type'
    when a name table dump starts; second, it marks the continuation of a
    dump whose last type was zero (the node state service type). In the
    second case, the lookup function fails to find the node state service
    type in the next iteration.

    To solve this, add a condition that marks the type as dirty, so that
    the correct service type is looked up for the next iteration instead
    of selecting the first service as is done for the initial 'type' zero.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     

06 Apr, 2019

1 commit


05 Apr, 2019

4 commits

  • In commit 0ae955e2656d ("tipc: improve TIPC throughput by Gap ACK
    blocks"), we enhance the link transmq by releasing as many packets as
    possible with the multi-ACKs from peer node. This also means the queue
    is now non-linear and the peer link deferdq becomes vital.

    Whereas, in the case of link failover, all messages in the link transmq
    need to be transmitted as tunnel messages in such a way that message
    sequentiality and cardinality per sender is preserved. This requires us
    to maintain the link deferdq somehow, so that when the tunnel messages
    arrive, the inner user messages along with the ones in the deferdq will
    be delivered to upper layer correctly.

    The commit accomplishes this by defining a new queue in the TIPC link
    structure to hold the old link deferdq when link failover happens and
    process it upon receipt of tunnel messages.

    Also, in the case of link syncing, the link deferdq will not be purged
    to avoid unnecessary retransmissions that in the worst case will fail
    because the packets might have been freed on the sending side.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • For unicast transmission, the current NACK sending algorithm is
    overactive: it forces the sending side to retransmit a packet that is
    not really lost but just arrived at the receiving side with some delay,
    or even to retransmit packets that have already been retransmitted
    before. As a result, many duplicates are observed even under normal
    conditions, i.e. without packet loss.

    One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9. When node2
    receives packet #10, it puts it into the deferdq. When packet #5 comes,
    it sends a NACK with gap [6 - 9]. However, shortly after that, when
    packet #6 arrives, it pulls packet #10 out of the deferdq, but it is
    still out of order, so it makes another NACK with gap [7 - 9], and so
    on. Finally, node1 has to retransmit the packets 5 6 7 8 9 a number of
    times although none of them was actually lost, hence the duplicates!

    This commit reduces duplicates by changing the condition for sending a
    NACK, and by restricting the retransmissions of individual packets via
    a timer of about 1ms. Note, however, that too strict a condition for
    NACKs or too long a timeout for retransmissions will reduce
    performance! The criteria in this commit are found to be effective at
    meeting both requirements: reducing duplicates without affecting
    performance.
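
The per-packet retransmission limit can be sketched as follows. This is a minimal model, assuming a monotonic nanosecond clock supplied by the caller; `struct demo_pkt`, `demo_try_retransmit()` and the interval constant are hypothetical names, not the link code itself.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RETRANS_INTERVAL_NS 1000000ULL  /* about 1 ms, as in the text */

struct demo_pkt {
    uint16_t seqno;
    uint64_t next_retrans_ns;  /* earliest time another retransmission is allowed */
};

/*
 * Return 1 and re-arm the per-packet timer if the packet may be
 * retransmitted now; return 0 if a recent retransmission suppresses it.
 */
int demo_try_retransmit(struct demo_pkt *p, uint64_t now_ns)
{
    if (now_ns < p->next_retrans_ns)
        return 0;
    p->next_retrans_ns = now_ns + DEMO_RETRANS_INTERVAL_NS;
    return 1;
}
```

Repeated NACKs for the same packet that arrive within the interval then cost nothing on the wire; only the first one triggers a retransmission.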

    The tipc_link_rcv() is also improved to only dequeue skb from the link
    deferdq if it is expected (ie. its seqno
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • During unicast link transmission, it is very often observed that,
    because of one or a few lost/disordered packets, the sending side
    quickly reaches the send window limit and must wait for the packets to
    arrive at the receiving side or, in the worst case, for a
    retransmission to be done first. The sending side cannot release a lot
    of subsequent packets in its transmq even though all of them might have
    already been received by the receiving side.
    That is, one or two packets are disordered/lost and dozens of packets
    have to wait, which obviously reduces the overall throughput!

    This commit introduces an algorithm to overcome this by using "Gap ACK
    blocks". Basically, a Gap ACK block will consist of numbers
    that describes the link deferdq where packets have been got by the
    receiving side but with gaps, for example:

    link deferdq: [1 2 3 4 10 11 13 14 15 20]
    --> Gap ACK blocks: , , ,

    The Gap ACK blocks will be sent to the sending side along with the
    traditional ACK or NACK message. Immediately when receiving the message
    the sending side will now not only release from its transmq the packets
    ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
    be enqueued and transmitted.
    In addition, the sending side can now do "multi-retransmissions"
    according to the Gaps reported in the Gap ACK blocks.
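
One way to derive such blocks from a deferred queue is sketched below. The encoding is an assumption made for illustration: each block records the last sequence number of a contiguous run and the number of packets missing before the next run. `struct demo_gap_ack` and `demo_build_gap_acks()` are hypothetical names, and the input is assumed to be sorted and free of duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed block layout for this sketch. */
struct demo_gap_ack {
    uint16_t ack;  /* last seqno of a contiguous run */
    uint16_t gap;  /* packets missing before the next run starts */
};

/*
 * Build blocks from a sorted, duplicate-free list of deferred seqnos.
 * Returns the number of blocks written, at most max_blocks.
 */
size_t demo_build_gap_acks(const uint16_t *deferdq, size_t n,
                           struct demo_gap_ack *out, size_t max_blocks)
{
    size_t blocks = 0;
    size_t i;

    for (i = 0; i < n && blocks < max_blocks; i++) {
        int last = (i + 1 == n);

        /* A run ends at the final entry or where the next seqno is not consecutive. */
        if (last || (uint16_t)(deferdq[i + 1] - deferdq[i]) != 1) {
            out[blocks].ack = deferdq[i];
            out[blocks].gap = last ? 0
                : (uint16_t)(deferdq[i + 1] - deferdq[i] - 1);
            blocks++;
        }
    }
    return blocks;
}
```

For the deferdq shown above, this sketch produces four blocks, one per contiguous run. A sender can release everything covered by a run and retransmit only the packets that fall inside the reported gaps.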

    The new algorithm as verified helps greatly improve the TIPC throughput
    especially under packet loss condition.

    So far, a maximum of 32 blocks has been quite enough, without any "Too
    few Gap ACK blocks" reports at a 5.0% packet loss rate; however, this
    number can be increased in the future if needed.

    Also, the patch is backward compatible.

    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • If the skb has somehow been dequeued from the inputq before processing,
    a NULL pointer is dereferenced and the kernel crashes.

    Add a check that the skb is valid before using it.

    Fixes: c55c8edafa9 ("tipc: smooth change between replicast and broadcast")
    Reported-by: Tuong Lien Tong
    Acked-by: Ying Xue
    Signed-off-by: Hoang Le
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Le
     

01 Apr, 2019

3 commits

  • Syzbot found a crash:

    BUG: KMSAN: uninit-value in tipc_nl_compat_name_table_dump+0x54f/0xcd0 net/tipc/netlink_compat.c:872
    Call Trace:
    tipc_nl_compat_name_table_dump+0x54f/0xcd0 net/tipc/netlink_compat.c:872
    __tipc_nl_compat_dumpit+0x59e/0xda0 net/tipc/netlink_compat.c:215
    tipc_nl_compat_dumpit+0x63a/0x820 net/tipc/netlink_compat.c:280
    tipc_nl_compat_handle net/tipc/netlink_compat.c:1226 [inline]
    tipc_nl_compat_recv+0x1b5f/0x2750 net/tipc/netlink_compat.c:1265
    genl_family_rcv_msg net/netlink/genetlink.c:601 [inline]
    genl_rcv_msg+0x185f/0x1a60 net/netlink/genetlink.c:626
    netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
    genl_rcv+0x63/0x80 net/netlink/genetlink.c:637
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

    Uninit was created at:
    __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
    alloc_skb include/linux/skbuff.h:1012 [inline]
    netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
    netlink_sendmsg+0xb82/0x1300 net/netlink/af_netlink.c:1892
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

    It was supposed to be fixed on commit 974cb0e3e7c9 ("tipc: fix uninit-value
    in tipc_nl_compat_name_table_dump") by checking TLV_GET_DATA_LEN(msg->req)
    in cmd->header()/tipc_nl_compat_name_table_dump_header(), which is called
    ahead of tipc_nl_compat_name_table_dump().

    However, tipc_nl_compat_dumpit() doesn't handle the error returned from cmd
    header function. It means even when the check added in that fix fails, it
    won't stop calling tipc_nl_compat_name_table_dump(), and the issue will be
    triggered again.

    So this patch adds handling of the error returned from the cmd header
    function in tipc_nl_compat_dumpit().
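
The control flow being fixed can be modelled in a few lines. This is a simplified sketch with hypothetical types and callbacks (`struct demo_cmd`, `demo_header()`, `demo_dump()`, `demo_dumpit()`), not the compat layer itself; it only shows that the dump body must not run once the header callback has rejected the request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_msg {
    size_t req_len;  /* length of the request payload */
    int dumped;      /* set once the dump body has run */
};

struct demo_cmd {
    int (*header)(struct demo_msg *msg);
    int (*dump)(struct demo_msg *msg);
};

/* Header callback: rejects requests that carry no payload. */
int demo_header(struct demo_msg *msg)
{
    return msg->req_len == 0 ? -EINVAL : 0;
}

/* Dump body: would read the request payload, so it relies on the check above. */
int demo_dump(struct demo_msg *msg)
{
    msg->dumped = 1;
    return 0;
}

int demo_dumpit(const struct demo_cmd *cmd, struct demo_msg *msg)
{
    if (cmd->header) {
        int err = cmd->header(msg);

        if (err)
            return err;  /* propagate instead of falling through to the dump */
    }
    return cmd->dump(msg);
}
```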

    Reported-by: syzbot+3ce8520484b0d4e260a5@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • A similar issue as fixed by Patch "tipc: check bearer name with right
    length in tipc_nl_compat_bearer_enable" was also found by syzbot in
    tipc_nl_compat_link_set().

    The length to check with should be 'TLV_GET_DATA_LEN(msg->req) -
    offsetof(struct tipc_link_config, name)'.

    Reported-by: syzbot+de00a87b8644a582ae79@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • Syzbot reported the following crash:

    BUG: KMSAN: uninit-value in memchr+0xce/0x110 lib/string.c:961
    memchr+0xce/0x110 lib/string.c:961
    string_is_valid net/tipc/netlink_compat.c:176 [inline]
    tipc_nl_compat_bearer_enable+0x2c4/0x910 net/tipc/netlink_compat.c:401
    __tipc_nl_compat_doit net/tipc/netlink_compat.c:321 [inline]
    tipc_nl_compat_doit+0x3aa/0xaf0 net/tipc/netlink_compat.c:354
    tipc_nl_compat_handle net/tipc/netlink_compat.c:1162 [inline]
    tipc_nl_compat_recv+0x1ae7/0x2750 net/tipc/netlink_compat.c:1265
    genl_family_rcv_msg net/netlink/genetlink.c:601 [inline]
    genl_rcv_msg+0x185f/0x1a60 net/netlink/genetlink.c:626
    netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
    genl_rcv+0x63/0x80 net/netlink/genetlink.c:637
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

    Uninit was created at:
    __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
    alloc_skb include/linux/skbuff.h:1012 [inline]
    netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
    netlink_sendmsg+0xb82/0x1300 net/netlink/af_netlink.c:1892
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

    It was triggered when the bearer name size < TIPC_MAX_BEARER_NAME: the
    name was checked against a wrong length, TLV_GET_DATA_LEN(msg->req),
    which also includes the priority and disc_domain lengths.

    This patch fixes it by checking against the right length:
    'TLV_GET_DATA_LEN(msg->req) - offsetof(struct tipc_bearer_config, name)'.
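
The length arithmetic can be tried out in isolation. The struct below is a hypothetical stand-in with two fixed fields in front of the name, as the text describes; `demo_string_is_valid()` and `demo_bearer_name_ok()` are illustrative helpers, not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_BEARER_NAME 32

/* Hypothetical layout: two fixed fields followed by the name. */
struct demo_bearer_config {
    uint32_t priority;
    uint32_t disc_domain;
    char name[DEMO_MAX_BEARER_NAME];
};

/* A string is valid only if a NUL terminator appears within len bytes. */
int demo_string_is_valid(const char *s, size_t len)
{
    return len > 0 && memchr(s, '\0', len) != NULL;
}

/* Validate the name using only the bytes the sender actually supplied. */
int demo_bearer_name_ok(const struct demo_bearer_config *cfg, size_t tlv_data_len)
{
    size_t name_off = offsetof(struct demo_bearer_config, name);
    size_t len;

    if (tlv_data_len <= name_off)
        return 0;                    /* no room for a name at all */
    len = tlv_data_len - name_off;   /* exclude priority and disc_domain */
    if (len > DEMO_MAX_BEARER_NAME)
        len = DEMO_MAX_BEARER_NAME;
    return demo_string_is_valid(cfg->name, len);
}
```

Scanning with the full payload length instead would let memchr() read past the supplied name into bytes the sender never initialized.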

    Reported-by: syzbot+8b707430713eb46e1e45@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

30 Mar, 2019

1 commit

  • The number of stubs is growing and has nothing to do with addrconf.
    Move the definition of the stubs to a separate header file and update
    users. In the move, drop the vxlan specific comment before ipv6_stub.

    Code move only; no functional change intended.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

28 Mar, 2019

1 commit


27 Mar, 2019

2 commits

  • Fix the return value check, which tests the wrong variable
    in tipc_mcast_send_sync().

    Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
    Reported-by: Hulk Robot
    Signed-off-by: Wei Yongjun
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • When running a syz script, a panic occurred:

    [ 156.088228] BUG: KASAN: use-after-free in tipc_disc_timeout+0x9c9/0xb20 [tipc]
    [ 156.094315] Call Trace:
    [ 156.094844]
    [ 156.095306] dump_stack+0x7c/0xc0
    [ 156.097346] print_address_description+0x65/0x22e
    [ 156.100445] kasan_report.cold.3+0x37/0x7a
    [ 156.102402] tipc_disc_timeout+0x9c9/0xb20 [tipc]
    [ 156.106517] call_timer_fn+0x19a/0x610
    [ 156.112749] run_timer_softirq+0xb51/0x1090

    It was caused by the netns freed without deleting the discoverer timer,
    while later on the netns would be accessed in the timer handler.

    The timer should have been deleted by tipc_net_stop() when cleaning up a
    netns. However, tipc has been able to enable a bearer and start d->timer
    without the local node_addr set since Commit 52dfae5c85a4 ("tipc: obtain
    node identity from interface by default"), which caused the timer not to
    be deleted in tipc_net_stop() then.

    So fix it in tipc_net_stop() by changing to check local node_id instead
    of local node_addr, as Jon suggested.

    While at it, remove the call to tipc_nametbl_withdraw() there, since
    tipc_nametbl_stop() will take care of freeing the nametbl afterwards.

    Fixes: 52dfae5c85a4 ("tipc: obtain node identity from interface by default")
    Reported-by: syzbot+a25307ad099309f1c2b9@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Xin Long
     

24 Mar, 2019

1 commit

  • When checking the code with clang -Wsometimes-uninitialized we get the
    following warning:

    if (!tipc_link_is_establishing(l)) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    net/tipc/node.c:847:46: note: uninitialized use occurs here
    tipc_bearer_xmit(n->net, bearer_id, &xmitq, maddr);

    net/tipc/node.c:831:2: note: remove the 'if' if its condition is always
    true
    if (!tipc_link_is_establishing(l)) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    net/tipc/node.c:821:31: note: initialize the variable 'maddr' to silence
    this warning
    struct tipc_media_addr *maddr;

    We fix this by initializing 'maddr' to NULL. For the sake of clarity,
    we also test if 'xmitq' is non-empty before we use it and 'maddr'
    further down in the function. It will never happen that 'xmitq' is
    non-empty at the same time as 'maddr' is NULL, so this is a sufficient
    test.

    Fixes: 598411d70f85 ("tipc: make resetting of links non-atomic")
    Reported-by: Nathan Chancellor
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Maloy
     

22 Mar, 2019

4 commits

  • Since maxattr is common, the policy can't really differ sanely,
    so make it common as well.

    The only user that did in fact manage to make a non-common policy
    is taskstats, which has to be really careful about it (since it's
    still using a common maxattr!). This is no longer supported, but
    we can fake it using pre_doit.

    This reduces the size of e.g. nl80211.o (which has lots of commands):

    text data bss dec hex filename
    398745 14323 2240 415308 6564c net/wireless/nl80211.o (before)
    397913 14331 2240 414484 65314 net/wireless/nl80211.o (after)
    --------------------------------
    -832 +8 0 -824

    Which is obviously just 8 bytes for each command, and an added 8
    bytes for the new policy pointer. I'm not sure why the ops list is
    counted as .text though.

    Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
    {
    - .policy = POLICY,
    },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
    .ops = OPS,
    .maxattr = M,
    + .policy = POLICY,
    ...
    };

    This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
    the cb->data as ops, which we want to change in a later genl patch.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast") we introduced a new method to eliminate the risk of message
    reordering that can happen between different nodes.
    Unfortunately, we forgot to add a check on the receiving side to ignore
    intra-node messages.

    We fix this by checking whether the arrived message is an intra-node
    one and, if so, returning.

    syzbot report:

    ==================================================================
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
    Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
    24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
    3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
    RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
    RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
    RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
    R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
    R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
    FS: 000000000106a880(0000) GS:ffff8880ae800000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
    tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
    tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
    tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
    tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
    tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
    __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
    tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
    sock_sendmsg_nosec net/socket.c:651 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:661
    ___sys_sendmsg+0x806/0x930 net/socket.c:2260
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
    __do_sys_sendmsg net/socket.c:2307 [inline]
    __se_sys_sendmsg net/socket.c:2305 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4401c9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
    48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
    3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
    RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
    R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace ba79875754e1708f ]---

    Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
    Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     
  • The skb is freed in:
    1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
    2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
    This leads to a "use-after-free" access in the next condition.

    We fix this by initializing the variable at declaration, so that it is
    safe to check this variable and continue processing only if the
    condition matches.

    syzbot report:

    ==================================================================
    BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
    net/tipc/socket.c:2167
    Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7

    CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    Workqueue: tipc_send tipc_conn_send_work
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
    kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
    tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
    tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
    tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
    tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
    tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
    tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
    process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x357/0x430 kernel/kthread.c:253
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

    Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
    Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
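
The fix in the commit above comes down to reading a header field while the buffer is still guaranteed to be alive. A minimal userspace sketch of that pattern follows. The names `struct buf`, `consume_buf`, `make_buf` and `filter_rcv` are hypothetical stand-ins for the skb handling and are not the TIPC code itself.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb carrying a message header. */
struct buf {
    int msg_type;
};

/* Hypothetical consumer: takes ownership of the buffer and frees it,
 * like a processing step that may consume the skb. */
static void consume_buf(struct buf **bp)
{
    free(*bp);
    *bp = NULL;
}

/* Reads msg_type at declaration, before any step that may free the
 * buffer, so the later check never touches freed memory. */
static bool filter_rcv(struct buf *b, bool consume_early, int wanted_type)
{
    if (!b)
        return false;

    int mtyp = b->msg_type;   /* captured while b is still valid */

    if (consume_early)
        consume_buf(&b);      /* b is NULL from here on */

    bool match = (mtyp == wanted_type);  /* uses the saved copy only */

    if (b)
        free(b);
    return match;
}

/* Hypothetical helper that allocates a buffer of a given type. */
static struct buf *make_buf(int type)
{
    struct buf *b = malloc(sizeof(*b));
    if (b)
        b->msg_type = type;
    return b;
}
```

Calling `filter_rcv(make_buf(5), true, 5)` returns true even though the buffer was released before the comparison, because only the saved copy is inspected.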
     
  • When cancelling a subscription, we have to clear the cancel bit in the
    request before iterating over any established subscriptions with memcmp.
    Otherwise no subscription will ever be found, and it will not be
    possible to explicitly unsubscribe individual subscriptions.

    Fixes: 8985ecc7c1e0 ("tipc: simplify endianness handling in topology subscriber")
    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
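
The reason the cancel bit has to be cleared first can be seen in a small sketch. Stored subscriptions do not carry the cancel bit, so a byte-wise comparison against an unmodified cancel request can never match. The struct layout, the flag value `SUB_CANCEL` and the function name below are assumptions made for illustration, not the kernel definitions.

```c
#include <stdint.h>
#include <string.h>

#define SUB_CANCEL 0x04u  /* assumed flag value for this sketch */

/* Hypothetical, padding-free subscription record compared with memcmp. */
struct sub_req {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
    uint32_t filter;
};

/* Returns the index of the stored subscription matching req, or -1. */
static int find_sub_to_cancel(const struct sub_req *subs, int n,
                              struct sub_req req)
{
    /* Clear the cancel bit in the local copy of the request; with the
     * bit still set, memcmp could never match a stored subscription. */
    req.filter &= ~SUB_CANCEL;

    for (int i = 0; i < n; i++) {
        if (memcmp(&subs[i], &req, sizeof(req)) == 0)
            return i;
    }
    return -1;
}
```

The request is passed by value here so that the caller's copy keeps its cancel bit; only the comparison key is normalized.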
     

20 Mar, 2019

3 commits

  • Currently, a multicast stream may start out using replicast, because
    there are few destinations, and then it should ideally switch to
    L2/broadcast IGMP/multicast when the number of destinations grows beyond
    a certain limit. The opposite should happen when the number decreases
    below the limit.

    To eliminate the risk of message reordering caused by a method change,
    a sending socket must stick to a previously selected method until it
    enters an idle period of 5 seconds, i.e., until there is a 5-second
    pause in the traffic from the sender socket.

    If the sender never makes such a pause, the method will never change,
    and transmission may become very inefficient as the cluster grows.

    With this commit, we allow such a switch between replicast and
    broadcast without any need for a traffic pause.

    The solution is to send a dummy message consisting of only the header,
    with the SYN bit set, via broadcast or replicast. The data message
    also has the SYN bit set and is sent via the other method (replicast
    or broadcast, the inverse of the one used for the dummy).

    Then, at the receiving side, any messages that follow the first
    SYN-marked message (data or dummy) are held in the deferred queue until
    the other message of the pair (dummy or data) has arrived on the
    other link.

    v2: reverse christmas tree declaration

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
     
  • As a preparation for introducing smooth switching between the
    replicast and broadcast methods for multicast messages, we have to
    introduce a new capability flag, TIPC_MCAST_RBCTL, to handle this new
    feature.

    During a cluster upgrade, a node can come back with this new
    capability, which must also be reflected in the cluster capabilities
    field. The new feature is only applicable if all nodes in the cluster
    support this new capability.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
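
The rule that a feature applies only when every node supports it can be sketched as a bitwise AND over the per-node capability masks. The bit position used for `CAP_MCAST_RBCTL` and both function names are assumptions for illustration; the commit message does not state them.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_MCAST_RBCTL (1u << 7)  /* bit position assumed for this sketch */

/* Cluster capabilities: only bits set on every node survive the AND. */
static uint16_t cluster_capabilities(const uint16_t *node_caps, int n)
{
    uint16_t caps = 0xffffu;

    for (int i = 0; i < n; i++)
        caps &= node_caps[i];
    return caps;
}

/* True if the whole cluster advertises the given capability flag. */
static bool cluster_supports(uint16_t cluster_caps, uint16_t flag)
{
    return (cluster_caps & flag) == flag;
}
```

During a rolling upgrade, a single node that still lacks the flag clears it for the whole cluster, so the feature stays off until the last node has been upgraded.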
     
  • Currently, a multicast stream uses either broadcast or replicast as
    its transmission method, based on the ratio between the number of
    actual destination nodes and the cluster size.

    However, when an L2 interface (e.g., VXLAN) provides pseudo
    broadcast support, this becomes very inefficient, as it blindly
    replicates multicast packets to all cluster/subnet nodes,
    irrespective of whether they host actual target sockets or not.

    The TIPC multicast algorithm is able to distinguish real destination
    nodes from other nodes, and hence provides a smarter and more
    efficient method for transferring multicast messages than
    pseudo broadcast can do.

    Because of this, we now make it possible for users to force
    the broadcast link to permanently switch to using replicast,
    irrespective of which capabilities the bearer provides,
    or pretend to provide.
    Conversely, we also make it possible to force the broadcast link
    to always use true broadcast. While maybe less useful in
    deployed systems, this may at least be useful for testing the
    broadcast algorithm in small clusters.

    We retain the current AUTOSELECT ability, i.e., to let the broadcast link
    automatically select which algorithm to use, and to switch back and forth
    between broadcast and replicast as the ratio between destination
    node number and cluster size changes. This remains the default method.

    Furthermore, we make it possible to configure the threshold ratio for
    such switches. The default ratio is now set to 10%, down from 25% in the
    earlier implementation.

    Acked-by: Jon Maloy
    Signed-off-by: Hoang Le
    Signed-off-by: David S. Miller

    Hoang Le
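
The three modes and the configurable threshold described in the commit above can be sketched as a small selection function. The enum, the function name and the strict comparison are assumptions for illustration; the commit message gives only the default ratio of 10% and the fact that replicast is used when destinations are few.

```c
#include <stdbool.h>

/* Hypothetical names for the three modes described in the text. */
enum bc_mode { BC_AUTOSELECT, BC_FORCE_BCAST, BC_FORCE_RCAST };

/* Returns true when replicast should be used, false for broadcast.
 * ratio_pct is the threshold in percent (e.g. 10 for the new default). */
static bool use_replicast(enum bc_mode mode, unsigned dests,
                          unsigned cluster_size, unsigned ratio_pct)
{
    if (mode == BC_FORCE_RCAST)
        return true;
    if (mode == BC_FORCE_BCAST)
        return false;

    /* AUTOSELECT: dests / cluster_size < ratio_pct / 100,
     * rewritten as a multiplication to avoid integer division. */
    return dests * 100u < cluster_size * ratio_pct;
}
```

With 20 destinations in a 100-node cluster, this sketch picks broadcast at a 10% threshold and replicast at the earlier 25% threshold.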
     

18 Mar, 2019

1 commit

  • We move the check that prevents connecting service ranges to after
    the RDM/DGRAM check, and move address sanity control to a separate
    function that also validates the service range.

    Fixes: 23998835be98 ("tipc: improve address sanity check in tipc_connect()")
    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     

17 Mar, 2019

2 commits


06 Mar, 2019

1 commit

  • Fix regression bug introduced in
    commit 365ad353c256 ("tipc: reduce risk of user starvation during link
    congestion")

    Only signal -EDESTADDRREQ for RDM/DGRAM if we don't have a cached
    sockaddr.

    Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
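
The intended behaviour for RDM/DGRAM sends can be sketched as follows: an explicit destination wins, a cached address is the fallback, and -EDESTADDRREQ is returned only when neither exists. The types and the function name are hypothetical and greatly simplified compared to the socket code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified destination address. */
struct addr {
    unsigned type;
    unsigned instance;
};

/* Hypothetical datagram socket state holding an optional cached peer. */
struct dgram_sock {
    bool has_cached;
    struct addr cached;
};

/* Picks the destination for a send. Returns 0 on success, or
 * -EDESTADDRREQ when no explicit and no cached address is available. */
static int pick_dest(const struct dgram_sock *sk,
                     const struct addr *explicit_dest, struct addr *out)
{
    if (explicit_dest) {
        *out = *explicit_dest;
        return 0;
    }
    if (sk->has_cached) {
        *out = sk->cached;
        return 0;
    }
    return -EDESTADDRREQ;
}
```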
     

03 Mar, 2019

1 commit


27 Feb, 2019

1 commit

  • When sending multicast messages via a blocking socket,
    if the sending link is congested (tsk->cong_link_cnt is set to 1),
    the sending thread will be put to sleep. However,
    tipc_sk_filter_rcv() is called under the socket spin lock but
    tipc_wait_for_cond() is not. So, there is no guarantee that
    the setting of tsk->cong_link_cnt to 0 in tipc_sk_proto_rcv() on
    CPU-1 will be perceived by CPU-0. If that is the case, the sending
    thread on CPU-0, after being woken up, will continue to see
    tsk->cong_link_cnt as 1 and go back to sleep. The sending thread
    will then sleep forever.

        CPU-0                             | CPU-1
    tipc_wait_for_cond()                  |
    {                                     |
     // condition_ = !tsk->cong_link_cnt  |
     while ((rc_ = !(condition_))) {      |
      ...                                 |
      release_sock(sk_);                  |
      wait_woken();                       |
                                          | if (!sock_owned_by_user(sk))
                                          |  tipc_sk_filter_rcv()
                                          |  {
                                          |   ...
                                          |   tipc_sk_proto_rcv()
                                          |   {
                                          |    ...
                                          |    tsk->cong_link_cnt--;
                                          |    ...
                                          |    sk->sk_write_space(sk);
                                          |    ...
                                          |   }
                                          |   ...
                                          |  }
      sched_annotate_sleep();             |
      lock_sock(sk_);                     |
      remove_wait_queue();                |
     }                                    |
    }                                     |

    This commit fixes it by adding memory barriers to tipc_sk_proto_rcv()
    and tipc_wait_for_cond().

    Acked-by: Jon Maloy
    Signed-off-by: Tung Nguyen
    Signed-off-by: David S. Miller

    Tung Nguyen
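
The pairing of a barrier on the writer side with one on the reader side can be illustrated in userspace with C11 atomics. This is an analogue only: the commit message does not name the kernel primitives it adds, and the variable and function names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for tsk->cong_link_cnt; starts at 1, i.e. congested. */
static atomic_int cong_link_cnt = 1;

/* Waker side (the role of CPU-1): the release ordering makes the
 * decrement, and everything written before it, visible to a reader
 * that observes the new value with acquire ordering. */
static void link_decongested(void)
{
    atomic_fetch_sub_explicit(&cong_link_cnt, 1, memory_order_release);
}

/* Waiter side (the role of CPU-0): the acquire load pairs with the
 * release above, so a woken sender does not act on a stale count. */
static bool may_send(void)
{
    return atomic_load_explicit(&cong_link_cnt, memory_order_acquire) == 0;
}
```

A waiter would re-evaluate `may_send()` each time it is woken. Without ordering on both sides, nothing guarantees that the re-check observes the decrement, which is the hang described in the commit above.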
     

25 Feb, 2019

1 commit

  • Three conflicts, one of which, for marvell10g.c, is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller