15 Dec, 2015

2 commits

  • [ Upstream commit 6adc5fd6a142c6e2c80574c1db0c7c17dedaa42e ]

    Proxy entries could have null pointer to net-device.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 84920c1420e2 ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • [ Upstream commit 6900317f5eff0a7070c5936e5383f589e0de7a09 ]

    David and HacKurx reported a following/similar size overflow triggered
    in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:

    (Already fixed in later grsecurity versions by Brad and PaX Team.)

    [ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
    cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
    [ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
    [ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
    [ 1002.296153] ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
    [ 1002.296162] ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
    [ 1002.296169] ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
    [ 1002.296176] Call Trace:
    [ 1002.296190] [] dump_stack+0x45/0x57
    [ 1002.296200] [] report_size_overflow+0x38/0x60
    [ 1002.296209] [] scm_detach_fds+0x2ce/0x300
    [ 1002.296220] [] unix_stream_read_generic+0x609/0x930
    [ 1002.296228] [] unix_stream_recvmsg+0x4f/0x60
    [ 1002.296236] [] ? unix_set_peek_off+0x50/0x50
    [ 1002.296243] [] sock_recvmsg+0x47/0x60
    [ 1002.296248] [] ___sys_recvmsg+0xe2/0x1e0
    [ 1002.296257] [] __sys_recvmsg+0x46/0x80
    [ 1002.296263] [] SyS_recvmsg+0x2c/0x40
    [ 1002.296271] [] entry_SYSCALL_64_fastpath+0x12/0x85

    Further investigation showed that this can happen when an *odd* number of
    fds are being passed over AF_UNIX sockets.

    In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
    where i is the number of successfully passed fds, differ by 4 bytes due
    to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
    on 64 bit. The padding is used to align subsequent cmsg headers in the
    control buffer.

    When the control buffer passed in from the receiver side *lacks* these 4
    bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
    overflow in scm_detach_fds():

    int cmlen = CMSG_LEN(i * sizeof(int)); cmsg_level);
    if (!err)
    err = put_user(SCM_RIGHTS, &cm->cmsg_type);
    if (!err)
    err = put_user(cmlen, &cm->cmsg_len);
    if (!err) {
    cmlen = CMSG_SPACE(i * sizeof(int)); msg_control += cmlen;
    msg->msg_controllen -= cmlen; msg_controllen of 20 bytes, and the sender
    properly transferred 1 fd to the receiver, so that its CMSG_LEN results
    in 20 bytes and CMSG_SPACE in 24 bytes.

    In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
    issue in my tests as alignment seems always on 4 byte boundary. Same
    should be in case of native 32 bit, where we end up with 4 byte boundaries
    as well.

    In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
    a single fd would mean that on successful return, msg->msg_controllen is
    being set by the kernel to 24 bytes instead, thus more than the input
    buffer advertised. It could f.e. become an issue if such application later
    on zeroes or copies the control buffer based on the returned msg->msg_controllen
    elsewhere.

    Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).

    Going over the code, it seems like msg->msg_controllen is not being read
    after scm_detach_fds() in scm_recv() anymore by the kernel, good!

    Relevant recvmsg() handler are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
    and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
    and ___sys_recvmsg() places the updated length, that is, new msg_control -
    old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
    in the example).

    Long time ago, Wei Yongjun fixed something related in commit 1ac70e7ad24a
    ("[NET]: Fix function put_cmsg() which may cause usr application memory
    overflow").

    RFC3542, section 20.2. says:

    The fields shown as "XX" are possible padding, between the cmsghdr
    structure and the data, and between the data and the next cmsghdr
    structure, if required by the implementation. While sending an
    application may or may not include padding at the end of last
    ancillary data in msg_controllen and implementations must accept both
    as valid. On receiving a portable application must provide space for
    padding at the end of the last ancillary data as implementations may
    copy out the padding at the end of the control message buffer and
    include it in the received msg_controllen. When recvmsg() is called
    if msg_controllen is too small for all the ancillary data items
    including any trailing padding after the last item an implementation
    may set MSG_CTRUNC.

    Since we didn't place MSG_CTRUNC for already quite a long time, just do
    the same as in 1ac70e7ad24a to avoid an overflow.

    Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
    error in SCM_RIGHTS code sample"). Some people must have copied this (?),
    thus it got triggered in the wild (reported several times during boot by
    David and HacKurx).

    No Fixes tag this time as pre 2002 (that is, pre history tree).

    Reported-by: David Sterba
    Reported-by: HacKurx
    Cc: PaX Team
    Cc: Emese Revfy
    Cc: Brad Spengler
    Cc: Wei Yongjun
    Cc: Eric Dumazet
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

10 Dec, 2015

1 commit

  • [ Upstream commit d69bbf88c8d0b367cf3e3a052f6daadf630ee566 ]

    Only cpu seeing dst refcount going to 0 can safely
    dereference dst->flags.

    Otherwise an other cpu might already have freed the dst.

    Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst")
    Reported-by: Greg Thelen
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

27 Oct, 2015

3 commits

  • [ Upstream commit 077cb37fcf6f00a45f375161200b5ee0cd4e937b ]

    It seems that kernel memory can leak into userspace by a
    kmalloc, ethtool_get_strings, then copy_to_user sequence.

    Avoid this by using kcalloc to zero fill the copied buffer.

    Signed-off-by: Joe Perches
    Acked-by: Ben Hutchings
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Joe Perches
     
  • [ Upstream commit 93d08b6966cf730ea669d4d98f43627597077153 ]

    When sockets have a native eBPF program attached through
    setsockopt(sk, SOL_SOCKET, SO_ATTACH_BPF, ...), and then try to
    dump these over getsockopt(sk, SOL_SOCKET, SO_GET_FILTER, ...),
    the following panic appears:

    [49904.178642] BUG: unable to handle kernel NULL pointer dereference at (null)
    [49904.178762] IP: [] sk_get_filter+0x39/0x90
    [49904.182000] PGD 86fc9067 PUD 531a1067 PMD 0
    [49904.185196] Oops: 0000 [#1] SMP
    [...]
    [49904.224677] Call Trace:
    [49904.226090] [] sock_getsockopt+0x319/0x740
    [49904.227535] [] ? sock_has_perm+0x63/0x70
    [49904.228953] [] ? release_sock+0x108/0x150
    [49904.230380] [] ? selinux_socket_getsockopt+0x23/0x30
    [49904.231788] [] SyS_getsockopt+0xa6/0xc0
    [49904.233267] [] entry_SYSCALL_64_fastpath+0x12/0x71

    The underlying issue is the very same as in commit b382c0865600
    ("sock, diag: fix panic in sock_diag_put_filterinfo"), that is,
    native eBPF programs don't store an original program since this
    is only needed in cBPF ones.

    However, sk_get_filter() wasn't updated to test for this at the
    time when eBPF could be attached. Just throw an error to the user
    to indicate that eBPF cannot be dumped over this interface.
    That way, it can also be known that a program _is_ attached (as
    opposed to just return 0), and a different (future) method needs
    to be consulted for a dump.

    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 31b33dfb0a144469dd805514c9e63f4993729a48 ]

    Earlier patch 6ae459bda tried to detect void ckecksum partial
    skb by comparing pull length to checksum offset. But it does
    not work for all cases since checksum-offset depends on
    updates to skb->data.

    Following patch fixes it by validating checksum start offset
    after skb-data pointer is updated. Negative value of checksum
    offset start means there is no need to checksum.

    Fixes: 6ae459bda ("skbuff: Fix skb checksum flag on skb pull")
    Reported-by: Andrew Vagin
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pravin B Shelar
     

03 Oct, 2015

2 commits

  • [ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

    dump_rules returns skb length and not error.
    But when family == AF_UNSPEC, the caller of dump_rules
    assumes that it returns an error. Hence, when family == AF_UNSPEC,
    we continue trying to dump on -EMSGSIZE errors resulting in
    incorrect dump idx carried between skbs belonging to the same dump.
    This results in fib rule dump always only dumping rules that fit
    into the first skb.

    This patch fixes dump_rules to return error so that we exit correctly
    and idx is correctly maintained between skbs that are part of the
    same dump.

    Signed-off-by: Wilson Kok
    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wilson Kok
     
  • [ Upstream commit b382c08656000c12a146723a153b85b13a855b49 ]

    diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
    upon request to user space (ss -0 -b). However, native eBPF programs
    attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

    Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
    unconditionally tries to access its filter length resp. wants to copy
    the filter insns from there. Internal cBPF to eBPF transformations
    attached to sockets don't have this issue, as orig_prog state is kept.

    It's currently only used by packet sockets. If we would want to add
    native eBPF support in the future, this needs to be done through
    a different attribute than PACKET_DIAG_FILTER to not confuse possible
    user space disassemblers that work on diag data.

    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Signed-off-by: Daniel Borkmann
    Acked-by: Nicolas Dichtel
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

30 Sep, 2015

11 commits

  • [ Upstream commit 2235f2ac75fd2501c251b0b699a9632e80239a6d ]

    reqsk_queue_destroy() and reqsk_queue_unlink() should use
    del_timer_sync() instead of del_timer() before calling reqsk_put(),
    otherwise we could free a req still used by another cpu.

    But before doing so, reqsk_queue_destroy() must release syn_wait_lock
    spinlock or risk a dead lock, as reqsk_timer_handler() might
    need to take this same spinlock from reqsk_queue_unlink() (called from
    inet_csk_reqsk_queue_drop())

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit a0a2a6602496a45ae838a96db8b8173794b5d398 ]

    The commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ("net: Clone
    skb before setting peeked flag") introduced a use-after-free bug
    in skb_recv_datagram. This is because skb_set_peeked may create
    a new skb and free the existing one. As it stands the caller will
    continue to use the old freed skb.

    This patch fixes it by making skb_set_peeked return the new skb
    (or the old one if unchanged).

    Fixes: 738ac1ebb96d ("net: Clone skb before setting peeked flag")
    Reported-by: Brenden Blanco
    Signed-off-by: Herbert Xu
    Tested-by: Brenden Blanco
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 89c22d8c3b278212eef6a8cc66b570bc840a6f5a ]

    When we calculate the checksum on the recv path, we store the
    result in the skb as an optimisation in case we need the checksum
    again down the line.

    This is in fact bogus for the MSG_PEEK case as this is done without
    any locking. So multiple threads can peek and then store the result
    to the same skb, potentially resulting in bogus skb states.

    This patch fixes this by only storing the result if the skb is not
    shared. This preserves the optimisations for the few cases where
    it can be done safely due to locking or other reasons, e.g., SIOCINQ.

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ]

    Shared skbs must not be modified and this is crucial for broadcast
    and/or multicast paths where we use it as an optimisation to avoid
    unnecessary cloning.

    The function skb_recv_datagram breaks this rule by setting peeked
    without cloning the skb first. This causes funky races which leads
    to double-free.

    This patch fixes this by cloning the skb and replacing the skb
    in the list when setting skb->peeked.

    Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv")
    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 2c17d27c36dcce2b6bf689f41a46b9e909877c21 ]

    Incoming packet should be either in backlog queue or
    in RCU read-side section. Otherwise, the final sequence of
    flush_backlog() and synchronize_net() may miss packets
    that can run without device reference:

    CPU 1 CPU 2
    skb->dev: no reference
    process_backlog:__skb_dequeue
    process_backlog:local_irq_enable

    on_each_cpu for
    flush_backlog => IPI(hardirq): flush_backlog
    - packet not found in backlog

    CPU delayed ...
    synchronize_net
    - no ongoing RCU
    read-side sections

    netdev_run_todo,
    rcu_barrier: no
    ongoing callbacks
    __netif_receive_skb_core:rcu_read_lock
    - too late
    free dev
    process packet for freed dev

    Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
    Cc: Eric W. Biederman
    Cc: Stephen Hemminger
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     
  • [ Upstream commit e9e4dd3267d0c5234c5c0f47440456b10875dec9 ]

    commit 381c759d9916 ("ipv4: Avoid crashing in ip_error")
    fixes a problem where processed packet comes from device
    with destroyed inetdev (dev->ip_ptr). This is not expected
    because inetdev_destroy is called in NETDEV_UNREGISTER
    phase and packets should not be processed after
    dev_close_many() and synchronize_net(). Above fix is still
    required because inetdev_destroy can be called for other
    reasons. But it shows the real problem: backlog can keep
    packets for long time and they do not hold reference to
    device. Such packets are then delivered to upper levels
    at the same time when device is unregistered.
    Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
    accounts all packets from backlog but before that some packets
    continue to be delivered to upper levels long after the
    synchronize_net call which is supposed to wait the last
    ones. Also, as Eric pointed out, processed packets, mostly
    from other devices, can continue to add new packets to backlog.

    Fix the problem by moving flush_backlog early, after the
    device driver is stopped and before the synchronize_net() call.
    Then use netif_running check to make sure we do not add more
    packets to backlog. We have to do it in enqueue_to_backlog
    context when the local IRQ is disabled. As result, after the
    flush_backlog and synchronize_net sequence all packets
    should be accounted.

    Thanks to Eric W. Biederman for the test script and his
    valuable feedback!

    Reported-by: Vittorio Gambaletta
    Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
    Cc: Eric W. Biederman
    Cc: Stephen Hemminger
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     
  • [ Upstream commit fecdf8be2d91e04b0a9a4f79ff06499a36f5d14f ]

    pktgen_thread_worker() is obviously racy, kthread_stop() can come
    between the kthread_should_stop() check and set_current_state().

    Signed-off-by: Oleg Nesterov
    Reported-by: Jan Stancek
    Reported-by: Marcelo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • [ Upstream commit 4f7d2cdfdde71ffe962399b7020c674050329423 ]

    Jason Gunthorpe reported that since commit c02db8c6290b ("rtnetlink: make
    SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
    anymore with respect to their policy, that is, ifla_vfinfo_policy[].

    Before, they were part of ifla_policy[], but they have been nested since
    placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
    which is another nested attribute for the actual VF attributes such as
    IFLA_VF_MAC, IFLA_VF_VLAN, etc.

    Despite the policy being split out from ifla_policy[] in this commit,
    it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
    testing for struct nlattr, but it doesn't know about the data context and
    their requirements.

    Fix, on top of Jason's initial work, does 1) parsing of the attributes
    with the right policy, and 2) using the resulting parsed attribute table
    from 1) instead of the nla_for_each_nested() loop (just like we used to
    do when still part of ifla_policy[]).

    Reference: http://thread.gmane.org/gmane.linux.network/368913
    Fixes: c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric")
    Reported-by: Jason Gunthorpe
    Cc: Chris Wright
    Cc: Sucheta Chakraborty
    Cc: Greg Rose
    Cc: Jeff Kirsher
    Cc: Rony Efraim
    Cc: Vlad Zolotarov
    Cc: Nicolas Dichtel
    Cc: Thomas Graf
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Daniel Borkmann
    Acked-by: Vlad Zolotarov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 95ec655bc465ccb2a3329d4aff9a45e3c8188db5 ]

    This reverts commit e1622baf54df8cc958bf29d71de5ad545ea7d93c.

    The side effect of this commit is to add a '@NONE' after each virtual
    interface name with a 'ip link'. It may break existing scripts.

    Reported-by: Olivier Hartkopp
    Signed-off-by: Nicolas Dichtel
    Tested-by: Oliver Hartkopp
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Dichtel
     
  • [ Upstream commit d339727c2b1a10f25e6636670ab6e1841170e328 ]

    User space can crash kernel with

    ip link add ifb10 numtxqueues 100000 type ifb

    We must replace a BUG_ON() by proper test and return -EINVAL for
    crazy values.

    Fixes: 60877a32bce00 ("net: allow large number of tx queues")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • commit 2f064f3485cd29633ad1b3cfb00cc519509a3d72 upstream.

    Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
    skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set set page->mapping
    to NULL and leave page->index value alone. Due to being in union, a
    non-zero page->index will be interpreted as true page->pfmemalloc.

    So the assumption is invalid if the networking code can see such a page.
    And it seems it can. We have encountered this with a NFS over loopback
    setup when such a page is attached to a new skbuf. There is no copying
    going on in this case so the page confuses __skb_fill_page_desc which
    interprets the index as pfmemalloc flag and the network stack drops
    packets that have been allocated using the reserves unless they are to
    be queued on sockets handling the swapping which is the case here and
    that leads to hangs when the nfs client waits for a response from the
    server which has been dropped and thus never arrive.

    The struct page is already heavily packed so rather than finding another
    hole to put it in, let's do a trick instead. We can reuse the index
    again but define it to an impossible value (-1UL). This is the page
    index so it should never see the value that large. Replace all direct
    users of page->pfmemalloc by page_is_pfmemalloc which will hide this
    nastiness from unspoiled eyes.

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

11 Jul, 2015

1 commit

  • [ Upstream commit 2c51a97f76d20ebf1f50fef908b986cb051fdff9 ]

    The lockless lookups can return entry that is unlinked.
    Sometimes they get reference before last neigh_cleanup_and_release,
    sometimes they do not need reference. Later, any
    modification attempts may result in the following problems:

    1. entry is not destroyed immediately because neigh_update
    can start the timer for dead entry, eg. on change to NUD_REACHABLE
    state. As result, entry lives for some time but is invisible
    and out of control.

    2. __neigh_event_send can run in parallel with neigh_destroy
    while refcnt=0 but if timer is started and expired refcnt can
    reach 0 for second time leading to second neigh_destroy and
    possible crash.

    Thanks to Eric Dumazet and Ying Xue for their work and analyze
    on the __neigh_event_send change.

    Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour")
    Fixes: a263b3093641 ("ipv4: Make neigh lookups directly in output packet path.")
    Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
    Cc: Eric Dumazet
    Cc: Ying Xue
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     

12 Jun, 2015

1 commit

  • We saw excessive direct memory compaction triggered by skb_page_frag_refill.
    This causes performance issues and add latency. Commit 5640f7685831e0
    introduces the order-3 allocation. According to the changelog, the order-3
    allocation isn't a must-have but to improve performance. But direct memory
    compaction has high overhead. The benefit of order-3 allocation can't
    compensate the overhead of direct memory compaction.

    This patch makes the order-3 page allocation atomic. If there is no memory
    pressure and memory isn't fragmented, the alloction will still success, so we
    don't sacrifice the order-3 benefit here. If the atomic allocation fails,
    direct memory compaction will not be triggered, skb_page_frag_refill will
    fallback to order-0 immediately, hence the direct memory compaction overhead is
    avoided. In the allocation failure case, kswapd is waken up and doing
    compaction, so chances are allocation could success next time.

    alloc_skb_with_frags is the same.

    The mellanox driver does similar thing, if this is accepted, we must fix
    the driver too.

    V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
    V2: make the changelog clearer

    Cc: Eric Dumazet
    Cc: Chris Mason
    Cc: Debabrata Banerjee
    Signed-off-by: Shaohua Li
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shaohua Li
     

11 Jun, 2015

1 commit

  • Jeff Layton reported the following;

    [ 74.232485] ------------[ cut here ]------------
    [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
    [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
    [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
    [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
    [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
    [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
    [ 74.249483] Call Trace:
    [ 74.249872] [] dump_stack+0x45/0x57
    [ 74.250703] [] warn_slowpath_common+0x8a/0xc0
    [ 74.251655] [] warn_slowpath_null+0x1a/0x20
    [ 74.252585] [] sk_clear_memalloc+0x51/0x80
    [ 74.253519] [] xs_disable_swap+0x42/0x80 [sunrpc]
    [ 74.254537] [] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
    [ 74.255610] [] nfs_swap_deactivate+0x27/0x30 [nfs]
    [ 74.256582] [] destroy_swap_extents+0x74/0x80
    [ 74.257496] [] SyS_swapoff+0x222/0x5c0
    [ 74.258318] [] ? syscall_trace_leave+0xc7/0x140
    [ 74.259253] [] system_call_fastpath+0x12/0x71
    [ 74.260158] ---[ end trace 2530722966429f10 ]---

    The warning in question was unnecessary but with Jeff's series the rules
    are also clearer. This patch removes the warning and updates the comment
    to explain why sk_mem_reclaim() may still be called.

    [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
    points out that it's not needed.]

    Cc: Leon Romanovsky
    Signed-off-by: Mel Gorman
    Signed-off-by: Jeff Layton
    Signed-off-by: David S. Miller

    Mel Gorman
     

09 Jun, 2015

1 commit


02 Jun, 2015

1 commit

  • This reverts commit f96dee13b8e10f00840124255bed1d8b4c6afd6f.

    It isn't right, ethtool is meant to manage one PHY instance
    per netdevice at a time, and this is selected by the SET
    command. Therefore by definition the GET command must only
    return the settings for the configured and selected PHY.

    Reported-by: Ben Hutchings
    Signed-off-by: David S. Miller

    David S. Miller
     

23 May, 2015

1 commit

  • When trying to configure the settings for PHY1, using commands
    like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
    modify other settings apart from the speed of the PHY1, in the
    above case.

    The ethtool seems to query the settings for PHY0, and use this
    as the base to apply the new settings to the PHY1. This is
    causing the other settings of the PHY 1 to be wrongly
    configured.

    The issue is caused by the '_ethtool_get_settings()' API, which
    gets called because of the 'ETHTOOL_GSET' command, is clearing
    the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
    memset. This clears all the parameters (if any) passed for the
    'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
    with 'cmd->phy_address' as '0'.

    The '_ethtool_get_settings()' is called from other files in the
    'net/core'. So the fix is applied to the 'ethtool_get_settings()'
    which is only called in the context of the 'ethtool'.

    Signed-off-by: Arun Parameswaran
    Reviewed-by: Ray Jui
    Reviewed-by: Scott Branden
    Signed-off-by: David S. Miller

    Arun Parameswaran
     

18 May, 2015

1 commit

  • Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
    causes the kernel to send a rtnl message for the bond2 interface, with an
    ifindex 0.

    'ip monitor' shows:
    0: bond2: mtu 1500 state DOWN group default
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    9: bond2@NONE: mtu 1500 qdisc noop state DOWN group default
    link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
    [snip]

    The patch fixes the spotted bug by checking in bond driver if the interface
    is registered before calling the notifier chain.
    It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
    future.

    Fixes: d4261e565000 ("bonding: create netlink event when bonding option is changed")
    CC: Jiri Pirko
    Reported-by: Julien Meunier
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

13 May, 2015

1 commit


05 May, 2015

1 commit

  • The code in __netdev_upper_dev_link() has an over-stringent
    loop detection logic that actually prevents valid configurations
    from working correctly.

    In particular, the logic returns an error if an upper device
    is already in the list of all upper devices for a given dev.
    This particular check seems to be a overzealous as it disallows
    perfectly valid configurations. For example:
    # ip l a link eth0 name eth0.10 type vlan id 10
    # ip l a dev br0 typ bridge
    # ip l s eth0.10 master br0
    # ip l s eth0 master br0
    Acked-by: Jiri Pirko
    Acked-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

04 May, 2015

1 commit

  • This reverts commit c243d7e20996254f89c28d4838b5feca735c030d.

    That patch is solving a non-existant problem while creating a
    real problem. Just because a socket is allocated in the init
    name space doesn't mean that it gets hashed in the init name space.

    When we unhash it the name space must be the same as the one
    we had when we hashed it. So this patch is completely bogus
    and causes socket leaks.

    Reported-by: Andrey Wagin
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

30 Apr, 2015

1 commit

  • NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
    it is sent only at the end of a dump.

    Libraries like libnl will wait forever for NLMSG_DONE.

    Fixes: e5a55a898720 ("net: create generic bridge ops")
    Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
    CC: John Fastabend
    CC: Sathya Perla
    CC: Subbu Seetharaman
    CC: Ajit Khaparde
    CC: Jeff Kirsher
    CC: intel-wired-lan@lists.osuosl.org
    CC: Jiri Pirko
    CC: Scott Feldman
    CC: Stephen Hemminger
    CC: bridge@lists.linux-foundation.org
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

27 Apr, 2015

1 commit

  • Commit 567e4b79731c ("net: rfs: add hash collision detection") had one
    mistake :

    RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
    and get_rps_cpu(), as @next_cpu was the result of an AND with
    rps_cpu_mask

    This bug showed up on a host with 72 cpus :
    next_cpu was 0x7f, and the code was trying to access percpu data of an
    non existent cpu.

    In a follow up patch, we might get rid of compares against nr_cpu_ids,
    if we init the tables with 0. This is silly to test for a very unlikely
    condition that exists only shortly after table initialization, as
    we got rid of rps_reset_sock_flow() and similar functions that were
    writing this RPS_NO_CPU magic value at flow dismantle : When table is
    old enough, it never contains this value anymore.

    Fixes: 567e4b79731c ("net: rfs: add hash collision detection")
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Apr, 2015

1 commit

  • When I added pfmemalloc support in build_skb(), I forgot netlink
    was using build_skb() with a vmalloc() area.

    In this patch I introduce __build_skb() for netlink use,
    and build_skb() is a wrapper handling both skb->head_frag and
    skb->pfmemalloc

    This means netlink no longer has to hack skb->head_frag

    [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
    [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    [ 1567.700067] Dumping ftrace buffer:
    [ 1567.700067] (ftrace buffer empty)
    [ 1567.700067] Modules linked in:
    [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
    [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
    [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
    [ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202
    [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
    [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
    [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
    [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
    [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
    [ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
    [ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
    [ 1567.700067] Stack:
    [ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
    [ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
    [ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
    [ 1567.700067] Call Trace:
    [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
    [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
    [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
    [ 1567.774369] sock_write_iter (net/socket.c:823)
    [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
    [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
    [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
    [ 1567.774369] ? default_llseek (fs/read_write.c:487)
    [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
    [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
    [ 1567.774369] vfs_write (fs/read_write.c:539)
    [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
    [ 1567.774369] ? SyS_read (fs/read_write.c:577)
    [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
    [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
    [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
    [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

    Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Eric Dumazet
    Reported-by: Sasha Levin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2015

1 commit

  • build_skb() should look at the page pfmemalloc status.
    If set, this means page allocator allocated this page in the
    expectation it would help to free other pages. Networking
    stack can do that only if skb->pfmemalloc is also set.

    Also, we must refrain using high order pages from the pfmemalloc
    reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
    them. Under memory pressure, using order-0 pages is probably the best
    strategy.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Apr, 2015

1 commit

  • In commit 04ffcb255f22 ("net: Add ndo_gso_check") Tom originally
    added the 'dev' argument to be able to call ndo_gso_check().

    Then later, when generalizing this in commit 5f35227ea34b
    ("net: Generalize ndo_gso_check to ndo_features_check")
    Jesse removed the call to ndo_gso_check() in netif_needs_gso()
    by calling the new ndo_features_check() in a different place.
    This made the 'dev' argument unused.

    Remove the unused argument and go back to the code as before.

    Cc: Tom Herbert
    Cc: Jesse Gross
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

17 Apr, 2015

4 commits

  • On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
    > Le 15/04/2015 15:57, Herbert Xu a écrit :
    > >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
    > [snip]
    > >Subject: skbuff: Do not scrub skb mark within the same name space
    > >
    > >The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
    > Maybe add a Fixes tag?
    > Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
    >
    > >harmonize cleanup done on skb on rx path") broke anyone trying to
    > >use netfilter marking across IPv4 tunnels. While most of the
    > >fields that are cleared by skb_scrub_packet don't matter, the
    > >netfilter mark must be preserved.
    > >
    > >This patch rearranges skb_scurb_packet to preserve the mark field.
    > nit: s/scurb/scrub
    >
    > Else it's fine for me.

    Sure.

    PS I used the wrong email for James the first time around. So
    let me repeat the question here. Should secmark be preserved
    or cleared across tunnels within the same name space? In fact,
    do our security models even support name spaces?

    ---8
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch reverts commit b8fb4e0648a2ab3734140342002f68fb0c7d1602
    because the secmark must be preserved even when a packet crosses
    namespace boundaries. The reason is that security labels apply to
    the system as a whole and is not per-namespace.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • For the short-term solution, lets fix bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data in order to
    get the same eBPF programs with cls_bpf and act_bpf work on ingress
    and egress qdisc path. We need to ensure that mac_header is set
    before calling into programs. This is effectively the first option
    from below referenced discussion.

    More long term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this, and implemented later
    as it's too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb() as has been suggested
    as second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    For dealing with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwriteable
    headers.

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Remove duplicated include.

    Signed-off-by: Wei Yongjun
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Wei Yongjun
     

14 Apr, 2015

2 commits

  • Al Viro says:

    ====================
    netdev-related stuff in vfs.git

    There are several commits sitting in vfs.git that probably ought to go in
    via net-next.git. First of all, there's merge with vfs.git#iocb - that's
    Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
    and ->recvmsg() patches a while ago. It's not so much Christoph's stuff
    that ought to be in net-next, as (pretty simple) conflict resolution on merge.
    The next chunk is switch to {compat_,}import_iovec/import_single_range - new
    safer primitives for initializing iov_iter. The primitives themselves come
    from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
    conversion of net/socket.c syscalls belongs in net-next, IMO. Next there's
    afs and rxrpc stuff from dhowells. And then there's sanitizing kernel_sendmsg
    et.al. + missing inlined helper for "how much data is left in msg->msg_iter" -
    this stuff is used in e.g. cifs stuff, but it belongs in net-next.

    That pile is pullable from
    git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

    I'll post the individual patches in there in followups; could you take a look
    and tell if everything in there is OK with you?
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Even if we make use of classifier and actions from the egress
    path, we're going into handle_ing() executing additional code
    on a per-packet cost for ingress qdisc, just to realize that
    nothing is attached on ingress.

    Instead, this can just be blinded out as a no-op entirely with
    the use of a static key. On input fast-path, we already make
    use of static keys in various places, e.g. skb time stamping,
    in RPS, etc. It makes sense to not waste time when we're assured
    that no ingress qdisc is attached anywhere.

    Enabling/disabling of that code path is being done via two
    helpers, namely net_{inc,dec}_ingress_queue(), that are being
    invoked under RTNL mutex when a ingress qdisc is being either
    initialized or destructed.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann