02 Mar, 2017

12 commits

  • It is now very clear that silly TCP listeners might play with
    enabling/disabling timestamping while new children are added
    to their accept queue.

    Meaning net_enable_timestamp() can be called from BH context
    while current state of the static key is not enabled.

    Let's play safe and allow all contexts.

    The work queue is scheduled only under the problematic cases,
    which are the static key enable/disable transition, to not slow down
    critical paths.

    This extends and improves what we did in commit 5fa8bbda38c6 ("net: use
    a work queue to defer net_disable_timestamp() work")

    Fixes: b90e5794c5bd ("net: dont call jump_label_dec from irq context")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
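
The idea of only deferring the enable/disable transition can be sketched in user space with C11 atomics. This is a minimal model, not the kernel code: the names `wanted`, `deferred`, `sketch_enable_timestamp()` and `sketch_timestamp_worker()` are hypothetical, and the real implementation schedules a work item where the comment says so.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the kernel's counters. */
static atomic_int wanted;    /* current number of timestamp users */
static atomic_int deferred;  /* enables waiting for the worker */

/*
 * Take a reference from any context. Returns true if the fast path
 * succeeded, false if the request was deferred because it would be the
 * 0 -> 1 transition that flips the (slow, sleeping) static key.
 */
bool sketch_enable_timestamp(void)
{
	int cur = atomic_load(&wanted);

	while (cur > 0) {
		/* Key already enabled: just bump the count, no transition. */
		if (atomic_compare_exchange_weak(&wanted, &cur, cur + 1))
			return true;
		/* cur was reloaded by the failed compare-exchange; retry. */
	}
	/* A real implementation would schedule the work item here. */
	atomic_fetch_add(&deferred, 1);
	return false;
}

/* Worker, process context: apply all deferred enables at once. */
void sketch_timestamp_worker(void)
{
	int n = atomic_exchange(&deferred, 0);

	atomic_fetch_add(&wanted, n);
}
```

Only the first user pays for the deferral; once the count is above zero, further callers stay on the lock-free path.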
     
  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
    uninitialized memory in packet_bind_spkt():
    Acked-by: Eric Dumazet

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    CPU: 0 PID: 1074 Comm: packet Not tainted 4.8.0-rc6+ #1891
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    0000000000000000 ffff88006b6dfc08 ffffffff82559ae8 ffff88006b6dfb48
    ffffffff818a7c91 ffffffff85b9c870 0000000000000092 ffffffff85b9c550
    0000000000000000 0000000000000092 00000000ec400911 0000000000000002
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x238/0x290 lib/dump_stack.c:51
    [] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1003
    [] __msan_warning+0x5b/0xb0
    mm/kmsan/kmsan_instr.c:424
    [< inline >] strlen lib/string.c:484
    [] strlcpy+0x9d/0x200 lib/string.c:144
    [] packet_bind_spkt+0x144/0x230
    net/packet/af_packet.c:3132
    [] SYSC_bind+0x40d/0x5f0 net/socket.c:1370
    [] SyS_bind+0x82/0xa0 net/socket.c:1356
    [] entry_SYSCALL_64_fastpath+0x13/0x8f
    arch/x86/entry/entry_64.o:?
    chained origin: 00000000eba00911
    [] save_stack_trace+0x27/0x50
    arch/x86/kernel/stacktrace.c:67
    [< inline >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
    [< inline >] kmsan_save_stack mm/kmsan/kmsan.c:334
    [] kmsan_internal_chain_origin+0x118/0x1e0
    mm/kmsan/kmsan.c:527
    [] __msan_set_alloca_origin4+0xc3/0x130
    mm/kmsan/kmsan_instr.c:380
    [] SYSC_bind+0x129/0x5f0 net/socket.c:1356
    [] SyS_bind+0x82/0xa0 net/socket.c:1356
    [] entry_SYSCALL_64_fastpath+0x13/0x8f
    arch/x86/entry/entry_64.o:?
    origin description: ----address@SYSC_bind (origin=00000000eb400911)
    ==================================================================
    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    , when I run the following program as root:

    =====================================
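    /* The header names were not preserved; these are the ones the code needs. */
    #include <string.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <net/ethernet.h>

    int main() {
      struct sockaddr addr;
      memset(&addr, 0xff, sizeof(addr));
      addr.sa_family = AF_PACKET;
      int fd = socket(PF_PACKET, SOCK_PACKET, htons(ETH_P_ALL));
      bind(fd, &addr, sizeof(addr));
      return 0;
    }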
    =====================================

    This happens because addr.sa_data copied from the userspace is not
    zero-terminated, and copying it with strlcpy() in packet_bind_spkt()
    results in calling strlen() on the kernel copy of that non-terminated
    buffer.

    Signed-off-by: Alexander Potapenko
    Signed-off-by: David S. Miller

    Alexander Potapenko
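
One way to avoid running strlen() over a buffer that may lack a terminator is to copy a fixed number of bytes into a slightly larger local buffer and terminate it explicitly. The sketch below shows that pattern in user space; `SKETCH_NAME_LEN` and `sketch_copy_name()` are hypothetical names, and this is an illustration of the technique rather than the exact kernel change.

```c
#include <string.h>
#include <sys/socket.h>

/* One byte more than sa_data so there is always room for a terminator. */
#define SKETCH_NAME_LEN (sizeof(((struct sockaddr *)0)->sa_data) + 1)

/*
 * Copy the possibly unterminated sa_data into `name` and terminate it.
 * memcpy() reads exactly sizeof(sa_data) bytes, so nothing past the end
 * of the source buffer is ever touched.
 */
void sketch_copy_name(char name[SKETCH_NAME_LEN], const struct sockaddr *addr)
{
	memcpy(name, addr->sa_data, sizeof(addr->sa_data));
	name[sizeof(addr->sa_data)] = '\0';
}
```

After the copy, string functions can be applied to `name` safely even when the caller filled `sa_data` with non-zero bytes, as the reproducer above does.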
     
  • Even with multicast flooding turned off, IPv6 ND should still work so
    that IPv6 connectivity is provided. Allow this by continuing to flood
    multicast traffic originated by us.

    Fixes: b6cb5ac8331b ("net: bridge: add per-port multicast flood flag")
    Cc: Nikolay Aleksandrov
    Signed-off-by: Mike Manning
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Mike Manning
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    First round of fixes - details in the commits:
    * use a valid hrtimer clock ID in mac80211_hwsim
    * don't reorder frames prior to BA session
    * flush a delayed work at suspend so the state is all valid before
    suspend/resume
    * fix packet statistics in fast-RX, the RX packets
    counter increment was simply missing
    * don't try to re-transmit filtered frames in an aggregation session
    * shorten (for tracing) a debug message
    * typo fix in another debug message
    * fix nul-termination with HWSIM_ATTR_RADIO_NAME in hwsim
    * fix mgmt RX processing when station is looked up by driver/device
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • SYN processing really was meant to be handled from BH.

    When I got rid of BH blocking while processing socket backlog
    in commit 5413d1babe8f ("net: do not block BH while processing socket
    backlog"), I forgot that a malicious user could transition to TCP_LISTEN
    from a state that allowed (SYN) packets to be parked in the socket
    backlog while socket is owned by the thread doing the listen() call.

    Sure enough syzkaller found this and reported the bug ;)

    =================================
    [ INFO: inconsistent lock state ]
    4.10.0+ #60 Not tainted
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    syz-executor0/5090 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
    [] spin_lock include/linux/spinlock.h:299 [inline]
    (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
    [] inet_ehash_insert+0x240/0xad0
    net/ipv4/inet_hashtables.c:407
    {IN-SOFTIRQ-W} state was registered at:
    mark_irqflags kernel/locking/lockdep.c:2923 [inline]
    __lock_acquire+0xbcf/0x3270 kernel/locking/lockdep.c:3295
    lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
    reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
    inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
    tcp_conn_request+0x25cc/0x3310 net/ipv4/tcp_input.c:6399
    tcp_v4_conn_request+0x157/0x220 net/ipv4/tcp_ipv4.c:1262
    tcp_rcv_state_process+0x802/0x4130 net/ipv4/tcp_input.c:5889
    tcp_v4_do_rcv+0x56b/0x940 net/ipv4/tcp_ipv4.c:1433
    tcp_v4_rcv+0x2e12/0x3210 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x4ce/0xc40 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip_local_deliver+0x1ce/0x710 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:492 [inline]
    ip_rcv_finish+0xb1d/0x2110 net/ipv4/ip_input.c:396
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip_rcv+0xd90/0x19c0 net/ipv4/ip_input.c:487
    __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4179
    __netif_receive_skb+0x2a/0x170 net/core/dev.c:4217
    netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4245
    napi_skb_finish net/core/dev.c:4602 [inline]
    napi_gro_receive+0x4e6/0x680 net/core/dev.c:4636
    e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4033 [inline]
    e1000_clean_rx_irq+0x5e0/0x1490
    drivers/net/ethernet/intel/e1000/e1000_main.c:4489
    e1000_clean+0xb9a/0x2910 drivers/net/ethernet/intel/e1000/e1000_main.c:3834
    napi_poll net/core/dev.c:5171 [inline]
    net_rx_action+0xe70/0x1900 net/core/dev.c:5236
    __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364 [inline]
    irq_exit+0x19e/0x1d0 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:658 [inline]
    do_IRQ+0x81/0x1a0 arch/x86/kernel/irq.c:250
    ret_from_intr+0x0/0x20
    native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
    arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
    default_idle+0x8f/0x410 arch/x86/kernel/process.c:271
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262
    default_idle_call+0x36/0x60 kernel/sched/idle.c:96
    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
    do_idle+0x348/0x440 kernel/sched/idle.c:243
    cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345
    start_secondary+0x344/0x440 arch/x86/kernel/smpboot.c:272
    verify_cpu+0x0/0xfc
    irq event stamp: 1741
    hardirqs last enabled at (1741): []
    __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160
    [inline]
    hardirqs last enabled at (1741): []
    _raw_spin_unlock_irqrestore+0xf7/0x1a0 kernel/locking/spinlock.c:191
    hardirqs last disabled at (1740): []
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
    hardirqs last disabled at (1740): []
    _raw_spin_lock_irqsave+0xa2/0x110 kernel/locking/spinlock.c:159
    softirqs last enabled at (1738): []
    __do_softirq+0x7cf/0xb7d kernel/softirq.c:310
    softirqs last disabled at (1571): []
    do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&hashinfo->ehash_locks[i])->rlock);

    lock(&(&hashinfo->ehash_locks[i])->rlock);

    *** DEADLOCK ***

    1 lock held by syz-executor0/5090:
    #0: (sk_lock-AF_INET6){+.+.+.}, at: [] lock_sock
    include/net/sock.h:1460 [inline]
    #0: (sk_lock-AF_INET6){+.+.+.}, at: []
    sock_setsockopt+0x233/0x1e40 net/core/sock.c:683

    stack backtrace:
    CPU: 1 PID: 5090 Comm: syz-executor0 Not tainted 4.10.0+ #60
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:15 [inline]
    dump_stack+0x292/0x398 lib/dump_stack.c:51
    print_usage_bug+0x3ef/0x450 kernel/locking/lockdep.c:2387
    valid_state kernel/locking/lockdep.c:2400 [inline]
    mark_lock_irq kernel/locking/lockdep.c:2602 [inline]
    mark_lock+0xf30/0x1410 kernel/locking/lockdep.c:3065
    mark_irqflags kernel/locking/lockdep.c:2941 [inline]
    __lock_acquire+0x6dc/0x3270 kernel/locking/lockdep.c:3295
    lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
    reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
    inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
    dccp_v6_conn_request+0xada/0x11b0 net/dccp/ipv6.c:380
    dccp_rcv_state_process+0x51e/0x1660 net/dccp/input.c:606
    dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
    sk_backlog_rcv include/net/sock.h:896 [inline]
    __release_sock+0x127/0x3a0 net/core/sock.c:2052
    release_sock+0xa5/0x2b0 net/core/sock.c:2539
    sock_setsockopt+0x60f/0x1e40 net/core/sock.c:1016
    SYSC_setsockopt net/socket.c:1782 [inline]
    SyS_setsockopt+0x2fb/0x3a0 net/socket.c:1765
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: 0033:0x4458b9
    RSP: 002b:00007fe8b26c2b58 EFLAGS: 00000292 ORIG_RAX: 0000000000000036
    RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00000000004458b9
    RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000006
    RBP: 00000000006e2110 R08: 0000000000000010 R09: 0000000000000000
    R10: 00000000208c3000 R11: 0000000000000292 R12: 0000000000708000
    R13: 0000000020000000 R14: 0000000000001000 R15: 0000000000000000

    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Fix error path order in nbp_vlan_init, so that if the
    switchdev_port_attr_set call fails, the vlan_hash is not destroyed
    before it has been initialized.

    Fixes: efa5356b0d97 ("bridge: per vlan dst_metadata netlink support")
    CC: Roopa Prabhu
    Signed-off-by: Yotam Gigi
    Acked-by: Roopa Prabhu
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Yotam Gigi
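
The general rule behind this kind of fix is that error labels must unwind in the reverse order of setup, so that each failure point jumps past the cleanup of anything not yet initialized. The following is a generic sketch with hypothetical names (`struct sketch_port`, `sketch_port_init()`, a `fail_step` parameter for fault injection); it does not reproduce the bridge code.

```c
#include <stdbool.h>

/* Hypothetical resource flags standing in for the port attribute and hash. */
struct sketch_port {
	bool attr_set;
	bool hash_inited;
	int  bad_destroys;	/* counts destroys of an uninitialized hash */
};

void sketch_hash_destroy(struct sketch_port *p)
{
	if (!p->hash_inited)
		p->bad_destroys++;	/* the bug class being avoided */
	p->hash_inited = false;
}

/* Returns 0 on success, -1 on failure; `fail_step` injects a failure. */
int sketch_port_init(struct sketch_port *p, int fail_step)
{
	if (fail_step == 1)
		goto err_attr;		/* nothing set up yet */
	p->attr_set = true;

	if (fail_step == 2)
		goto err_hash;		/* hash init failed: undo attr only */
	p->hash_inited = true;

	if (fail_step == 3)
		goto err_late;		/* later step failed: undo everything */
	return 0;

err_late:
	sketch_hash_destroy(p);
err_hash:
	p->attr_set = false;
err_attr:
	return -1;
}
```

Because the labels fall through in reverse setup order, a failure in the first step never reaches `sketch_hash_destroy()`.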
     
  • Add stricter validation for the RTA_MARK attribute.

    Signed-off-by: Liping Zhang
    Signed-off-by: David S. Miller

    Liping Zhang
     
  • The addr_gen_mode variable can be accessed by both sysctl and netlink.
    Replace rtnl_lock() with rtnl_trylock() to protect the sysctl operation
    and avoid a possible deadlock.

    Signed-off-by: Felix Jia
    Signed-off-by: David S. Miller

    Felix Jia
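
The try-lock pattern can be illustrated in user space as follows. This is a sketch under assumptions: an `atomic_flag` stands in for the RTNL mutex, all `sketch_*` names are hypothetical, and returning `-EAGAIN` stands in for whatever retry mechanism the real handler uses when the lock is busy.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical global lock standing in for the RTNL mutex. */
static atomic_flag sketch_rtnl = ATOMIC_FLAG_INIT;

int sketch_rtnl_trylock(void)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set(&sketch_rtnl);
}

void sketch_rtnl_unlock(void)
{
	atomic_flag_clear(&sketch_rtnl);
}

/*
 * Sysctl-style writer: never blocks on the lock. If the lock is busy the
 * caller gets -EAGAIN and is expected to retry the whole operation later.
 */
int sketch_set_addr_gen_mode(int *mode, int new_mode)
{
	if (!sketch_rtnl_trylock())
		return -EAGAIN;
	*mode = new_mode;
	sketch_rtnl_unlock();
	return 0;
}
```

Since the writer never sleeps waiting for the lock, it cannot take part in a lock-ordering cycle with another path that already holds it.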
     
  • While playing with mlx4 hardware timestamping of RX packets, I found
    that some packets were received by TCP stack with a ~200 ms delay...

    Since the timestamp was provided by the NIC, and my probe was added
    in tcp_v4_rcv() while in BH handler, I was confident it was not
    a sender issue, or a drop in the network.

    This would happen with a very low probability, but hurting RPC
    workloads.

    A NAPI driver normally arms the IRQ after the napi_complete_done(),
    after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
    it.

    The problem is that if another point in the stack grabs the
    NAPI_STATE_SCHED bit while IRQs are not disabled, we might later have an
    IRQ firing and finding this bit set, right before napi_complete_done()
    clears it.

    This can happen with busy polling users, or if gro_flush_timeout is
    used. But some other uses of napi_schedule() in drivers can cause this
    as well.

    thread 1                                 thread 2 (could be on same cpu, or not)

    // busy polling or napi_watchdog()
    napi_schedule();
    ...
    napi->poll()

    device polling:
    read 2 packets from ring buffer
                                             Additional 3rd packet is
                                             available.
                                             device hard irq

                                             // does nothing because
                                             NAPI_STATE_SCHED bit is owned by thread 1
                                             napi_schedule();

    napi_complete_done(napi, 2);
    rearm_irq();

    Note that rearm_irq() will not force the device to send an additional
    IRQ for the packet it already signaled (3rd packet in my example)

    This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
    can set if it could not grab NAPI_STATE_SCHED

    Then napi_complete_done() properly reschedules the napi to make sure
    we do not miss something.

    Since we manipulate multiple bits at once, use cmpxchg() like in
    sk_busy_loop() to provide proper transactions.

    In v2, I changed napi_watchdog() to use a relaxed variant of
    napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

    In v3, I added more details in the changelog and cleared
    NAPI_STATE_MISSED in busy_poll_stop()

    In v4, I added the ideas given by Alexander Duyck in v3 review

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
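
The two-bit protocol described above can be modelled in user space with a compare-and-exchange loop. This is a simplified sketch, not the kernel implementation: the bit values `SK_SCHED` and `SK_MISSED` and the function names are hypothetical, and details such as the disable state are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's real layout may differ. */
#define SK_SCHED  (1u << 0)	/* someone owns the poll */
#define SK_MISSED (1u << 1)	/* a schedule attempt lost the race */

/*
 * Try to take ownership. If SCHED is already set, record the failed
 * attempt in MISSED in the same atomic step. Returns true if the caller
 * now owns the poll.
 */
bool sketch_schedule_prep(atomic_uint *state)
{
	unsigned int val = atomic_load(state);
	unsigned int new;

	do {
		new = val | SK_SCHED;
		if (val & SK_SCHED)
			new |= SK_MISSED;
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return !(val & SK_SCHED);
}

/*
 * Finish a poll. If a schedule attempt was missed meanwhile, keep SCHED,
 * clear MISSED, and return false so the caller polls again instead of
 * relying on an interrupt that may never come.
 */
bool sketch_complete_done(atomic_uint *state)
{
	unsigned int val = atomic_load(state);
	unsigned int new;

	do {
		new = val & ~(SK_SCHED | SK_MISSED);
		if (val & SK_MISSED)
			new |= SK_SCHED;
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return !(val & SK_MISSED);
}
```

In the scenario from the diagram, the hard irq's failed schedule attempt leaves the missed bit behind, so the completion in thread 1 returns false and the third packet is picked up by another poll.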
     
  • The variables rds_ib_mr_1m_pool_size and rds_ib_mr_8k_pool_size
    are used only in the ib.c file. Declare them static to limit
    their scope to this file.

    Cc: Joe Jin
    Cc: Junxiao Bi
    Signed-off-by: Zhu Yanjun
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Zhu Yanjun
     
  • Commit cd2b70875058 ("sctp: check duplicate node before inserting a
    new transport") called rhltable_lookup() to check for the duplicate
    transport node in transport rhashtable.

    But rhltable_lookup() doesn't call rcu_read_lock() itself, so it could
    cause a use-after-free if it dereferences a node that another CPU has
    already freed. Note that the sock lock cannot prevent this, as it is
    per sock.

    Fix this by calling rcu_read_lock() before checking for duplicate
    transport nodes.

    Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport")
    Reported-by: Andrey Konovalov
    Signed-off-by: Xin Long
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Xin Long
     
  • All the routines by which rxrpc is accessed from the outside are serialised
    by means of the socket lock (sendmsg, recvmsg, bind,
    rxrpc_kernel_begin_call(), ...) and this presents a problem:

    (1) If a number of calls on the same socket are in the process of
    connecting to the same peer, a maximum of four concurrent live calls
    are permitted before further calls need to wait for a slot.

    (2) If a call is waiting for a slot, it is deep inside sendmsg() or
    rxrpc_kernel_begin_call() and the entry function is holding the socket
    lock.

    (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
    from servicing the other calls as they need to take the socket lock to
    do so.

    (4) The socket is stuck until a call is aborted and makes its slot
    available to the waiter.

    Fix this by:

    (1) Provide each call with a mutex ('user_mutex') that arbitrates access
    by the users of rxrpc separately for each specific call.

    (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
    they've got a call and taken its mutex.

    Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
    set but someone else has the lock. Should I instead only return
    EWOULDBLOCK if there's nothing currently to be done on a socket, and
    sleep in this particular instance because there is something to be
    done, but we appear to be blocked by the interrupt handler doing its
    ping?

    (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
    call, locking its user mutex and adding it to the socket's call tree.
    The call is returned locked so that sendmsg() can add data to it
    immediately.

    From the moment the call is in the socket tree, it is subject to
    access by sendmsg() and recvmsg() - even if it isn't connected yet.

    (4) Lock new service calls in the UDP data_ready handler (in
    rxrpc_new_incoming_call()) because they may already be in the socket's
    tree and the data_ready handler makes them live immediately if a user
    ID has already been preassigned.

    Note that the new call is locked before any notifications are sent
    that it is live, so doing mutex_trylock() *ought* to always succeed.
    Userspace is prevented from doing sendmsg() on calls that are in a
    too-early state in rxrpc_do_sendmsg().

    (5) Make rxrpc_new_incoming_call() return the call with the user mutex
    held so that a ping can be scheduled immediately under it.

    Note that it might be worth moving the ping call into
    rxrpc_new_incoming_call() and then we can drop the mutex there.

    (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
    release the socket after adding the call to the socket's tree. This
    is slightly tricky as we've dequeued the call by that point and have
    to requeue it.

    Note that requeuing emits a trace event.

    (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
    new mutex immediately and don't bother with the socket mutex at all.

    This patch has the nice bonus that calls on the same socket are now to some
    extent parallelisable.

    Note that we might want to move rxrpc_service_prealloc() calls out from the
    socket lock and give it its own lock, so that we don't hang progress in
    other calls because we're waiting for the allocator.

    We probably also want to avoid calling rxrpc_notify_socket() from within
    the socket lock (rxrpc_accept_call()).

    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
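
The lock hand-off in step (2) can be sketched in user space. The model below is an assumption-laden illustration: `atomic_flag` try-locks stand in for the socket lock and the per-call `user_mutex`, every `sketch_*` name is hypothetical, and only the non-blocking case (returning `-EWOULDBLOCK` when a lock is busy) is shown.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical simple try-locks standing in for the kernel mutexes. */
struct sketch_call {
	atomic_flag user_lock;	/* per-call lock */
	int bytes_queued;
};

struct sketch_sock {
	atomic_flag lock;	/* socket-wide lock */
	struct sketch_call *calls;
	size_t ncalls;
};

void sketch_sock_init(struct sketch_sock *sk, struct sketch_call *calls,
		      size_t ncalls)
{
	atomic_flag_clear(&sk->lock);
	sk->calls = calls;
	sk->ncalls = ncalls;
	for (size_t i = 0; i < ncalls; i++) {
		atomic_flag_clear(&calls[i].user_lock);
		calls[i].bytes_queued = 0;
	}
}

/*
 * Non-blocking send path: hold the socket lock only long enough to find
 * the call and take its own lock, then drop the socket lock so other
 * calls on the same socket can make progress.
 */
int sketch_sendmsg(struct sketch_sock *sk, size_t id, int len)
{
	struct sketch_call *call;

	if (atomic_flag_test_and_set(&sk->lock))
		return -EWOULDBLOCK;
	if (id >= sk->ncalls) {
		atomic_flag_clear(&sk->lock);
		return -EINVAL;
	}
	call = &sk->calls[id];
	if (atomic_flag_test_and_set(&call->user_lock)) {
		atomic_flag_clear(&sk->lock);
		return -EWOULDBLOCK;	/* someone else is using this call */
	}
	atomic_flag_clear(&sk->lock);	/* socket is free again */

	call->bytes_queued += len;	/* work done under the call lock only */
	atomic_flag_clear(&call->user_lock);
	return len;
}
```

A busy call no longer stalls the whole socket: while one call's lock is held, sends on a different call of the same socket still succeed.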
     

01 Mar, 2017

4 commits

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds
     
  • Pull nfsd updates from Bruce Fields:
    "The nfsd update this round is mainly a lot of miscellaneous cleanups
    and bugfixes.

    A couple changes could theoretically break working setups on upgrade.
    I don't expect complaints in practice, but they seem worth calling out
    just in case:

    - NFS security labels are now off by default; a new security_label
    export flag reenables it per export. But, having them on by default
    is a disaster, as it generally only makes sense if all your clients
    and servers have similar enough selinux policies. Thanks to Jason
    Tibbitts for pointing this out.

    - NFSv4/UDP support is off. It was never really supported, and the
    spec explicitly forbids it. We only ever left it on out of
    laziness; thanks to Jeff Layton for finally fixing that"

    * tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits)
    nfsd: Fix display of the version string
    nfsd: fix configuration of supported minor versions
    sunrpc: don't register UDP port with rpcbind when version needs congestion control
    nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
    sunrpc: flag transports as having congestion control
    sunrpc: turn bitfield flags in svc_version into bools
    nfsd: remove superfluous KERN_INFO
    nfsd: special case truncates some more
    nfsd: minor nfsd_setattr cleanup
    NFSD: Reserve adequate space for LOCKT operation
    NFSD: Get response size before operation for all RPCs
    nfsd/callback: Drop a useless data copy when comparing sessionid
    nfsd/callback: skip the callback tag
    nfsd/callback: Cleanup callback cred on shutdown
    nfsd/idmap: return nfserr_inval for 0-length names
    SUNRPC/Cache: Always treat the invalid cache as unexpired
    SUNRPC: Drop all entries from cache_detail when cache_purge()
    svcrdma: Poll CQs in "workqueue" mode
    svcrdma: Combine list fields in struct svc_rdma_op_ctxt
    svcrdma: Remove unused sc_dto_q field
    ...

    Linus Torvalds
     
  • Pull ceph updates from Ilya Dryomov:
    "This time around we have:

    - support for rbd data-pool feature, which enables rbd images on
    erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
    allow erasure-coded profiles with k+m up to 32.

    - a patch for ceph_d_revalidate() performance regression introduced
    in 4.9, along with some cleanups in the area (Jeff Layton)

    - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff
    Layton)

    - buffered reads are now processed in rsize windows instead of rasize
    windows (Andreas Gerstmayr). The new default for rsize mount option
    is 64M.

    - ack vs commit distinction is gone, greatly simplifying ->fsync()
    and MOSDOpReply handling code (myself)

    ... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
    computations are still serialized though) and several minor fixes and
    cleanups all over"

    * tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits)
    libceph, rbd, ceph: WRITE | ONDISK -> WRITE
    libceph: get rid of ack vs commit
    ceph: remove special ack vs commit behavior
    ceph: tidy some white space in get_nonsnap_parent()
    crush: fix dprintk compilation
    crush: do is_out test only if we do not collide
    ceph: remove req from unsafe list when unregistering it
    rbd: constify device_type structure
    rbd: kill obj_request->object_name and rbd_segment_name_cache
    rbd: store and use obj_request->object_no
    rbd: RBD_V{1,2}_DATA_FORMAT macros
    rbd: factor out __rbd_osd_req_create()
    rbd: set offset and length outside of rbd_obj_request_create()
    rbd: support for data-pool feature
    rbd: introduce rbd_init_layout()
    rbd: use rbd_obj_bytes() more
    rbd: remove now unused rbd_obj_request_wait() and helpers
    rbd: switch rbd_obj_method_sync() to ceph_osdc_call()
    libceph: pass reply buffer length through ceph_osdc_call()
    rbd: do away with obj_request in rbd_obj_read_sync()
    ...

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Don't save TIPC header values before the header has been validated,
    from Jon Paul Maloy.

    2) Fix memory leak in RDS, from Zhu Yanjun.

    3) We fail to initialize the UID in the flow key in some paths, from
    Julian Anastasov.

    4) Fix latent TOS masking bug in the routing cache removal from years
    ago, also from Julian.

    5) We forget to set the sockaddr port in sctp_copy_local_addr_list(),
    fix from Xin Long.

    6) Missing module ref count drop in packet scheduler actions, from
    Roman Mashak.

    7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu.

    8) Fix use after free which happens because L2TP's ipv4 support returns
    non-zero values from its backlog_rcv function, which ipv4 interprets
    as protocol values. Fix from Paul Hüber.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
    qed: Don't use attention PTT for configuring BW
    qed: Fix race with multiple VFs
    l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
    xfrm: provide correct dst in xfrm_neigh_lookup
    rhashtable: Fix RCU dereference annotation in rht_bucket_nested
    rhashtable: Fix use before NULL check in bucket_table_free
    net sched actions: do not overwrite status of action creation.
    rxrpc: Kernel calls get stuck in recvmsg
    net sched actions: decrement module reference count after table flush.
    lib: Allow compile-testing of parman
    ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt
    sctp: set sin_port for addr param when checking duplicate address
    net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
    netfilter: nft_set_bitmap: incorrect bitmap size
    net: s2io: fix typo argumnet argument
    net: vxge: fix typo argumnet argument
    netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.
    ipv4: mask tos for input route
    ipv4: add missing initialization for flowi4_uid
    lib: fix spelling mistake: "actualy" -> "actually"
    ...

    Linus Torvalds
     

28 Feb, 2017

6 commits

  • When I originally introduced using the driver-indicated station as an
    optimisation to avoid the hashtable lookup/iteration, of course it
    wasn't intended to really functionally change anything.

    I neglected, however, to take into account VLAN interfaces, which have
    the property that management and data frames are handled differently:
    data frames go directly to the station and the VLAN while management
    frames continue to be processed over the underlying/associated AP-type
    interface. As a consequence, when a driver used this optimisation for
    management frames and the user enabled VLANs, my change broke things
    since any management frames, particularly disassoc/deauth, were missed
    by hostapd.

    Fix this by restoring the original code path for non-data frames, they
    aren't critical for performance to begin with.

    This fixes https://bugzilla.kernel.org/show_bug.cgi?id=194713.

    Big thanks goes to Jarek who bisected the issue and provided a very
    detailed bug report, including the crucial information that he was
    using VLANs in his configuration.

    Cc: stable@vger.kernel.org
    Fixes: 771e846bea9e ("mac80211: allow passing transmitter station on RX")
    Reported-and-tested-by: Jarek Kamiński
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Now that %z is standardised in C99 there is no reason to support %Z.
    Unlike %L it doesn't even make format strings smaller.

    Use BUILD_BUG_ON in a couple ATM drivers.

    In case anyone didn't notice, lib/vsprintf.o is about half of SLUB, which
    in my opinion is quite an achievement. Hopefully this patch inspires
    someone else to trim vsprintf.c more.

    Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Andy Shevchenko
    Cc: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
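
For reference, the standard C99 form looks like this in user-space code. The helper name `sketch_format_size()` is hypothetical; the `z` length modifier itself is part of standard printf-family formatting.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size_t with the C99 'z' length modifier. No cast is needed,
 * whatever the width of size_t on the target.
 */
int sketch_format_size(char *buf, size_t buflen, size_t value)
{
	return snprintf(buf, buflen, "%zu", value);
}
```

The same modifier works with signed conversions (`%zd`) for types such as ssize_t.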
     
  • Fix typos and add the following to the scripts/spelling.txt:

    varible||variable

    While we are here, tidy up the comment blocks that fit in a single line
    for drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c and
    net/sctp/transport.c.

    Link: http://lkml.kernel.org/r/1481573103-11329-11-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fix typos and add the following to the scripts/spelling.txt:

    aligment||alignment

    I did not touch the "N_BYTE_ALIGMENT" macro in
    drivers/net/wireless/realtek/rtlwifi/wifi.h to avoid unpredictable
    impact.

    I fixed "_aligment_handler" in arch/openrisc/kernel/entry.S because
    it is surrounded by #if 0 ... #endif. It is surely safe and I
    confirmed "_alignment_handler" is correct.

    I also fixed the "controler" I found in the same hunk in
    arch/openrisc/kernel/head.S.

    Link: http://lkml.kernel.org/r/1481573103-11329-8-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fix typos and add the following to the scripts/spelling.txt:

    an user||a user
    an userspace||a userspace

    I also added "userspace" to the list since it is a common word in Linux.
    I found some instances of "an userfaultfd", but I did not add it to the
    list. Looking for every word that starts with "user", such as "userland",
    would be endless, so I had to draw a line somewhere.

    Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fix typos and add the following to the scripts/spelling.txt:

    swith||switch
    swithable||switchable
    swithed||switched
    swithing||switching

    While we are here, fix the "update" to "updates" in the touched hunk in
    drivers/net/wireless/marvell/mwifiex/wmm.c.

    Link: http://lkml.kernel.org/r/1481573103-11329-2-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

27 Feb, 2017

18 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains netfilter fixes for your net tree,
    they are:

    1) Missing ct zone size in the nft_ct initialization path, patch
    from Florian Westphal.

    2) Two patches for netfilter uapi headers, one to remove unnecessary
    sysctl.h inclusion and another to fix compilation of xt_hashlimit.h
    in userspace, from Dmitry V. Levin.

    3) Patch to fix a sloppy change in nf_ct_expect that incorrectly
    simplified nf_ct_expect_related_report() in the previous nf-next
    batch. This also includes another patch for __nf_ct_expect_check()
    to report success by returning 0 to keep it consistent with other
    existing functions. From Jarno Rajahalme.

    4) The ->walk() iterator of the new bitmap set type goes over the real
    bitmap size; this results in incorrect dumps when NFTA_SET_USERDATA
    is used.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Sara Sharon
    Signed-off-by: Johannes Berg

    Sara Sharon
     
  • Tracing is limited to 100 characters and this message passes
    the limit when there are a few buffered frames. Shorten it.

    Signed-off-by: Sara Sharon
    Signed-off-by: Johannes Berg

    Sara Sharon
     
  • iwlwifi now supports RSS and can't let mac80211 track the
    PS state based on the Rx frames since they can come out of
    order. iwlwifi is now advertising AP_LINK_PS, and uses
    explicit notifications to teach mac80211 about the PS state
    of the stations and the PS poll / uAPSD trigger frames
    coming our way from the peers.

    Because of that, the TIM stopped being maintained in
    mac80211. I tried to fix this in commit c68df2e7be0c
    ("mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE")
    but that was later reverted by Felix in commit 6c18a6b4e799
    ("Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"")
    since it broke drivers that do not implement set_tim.

    Since none of the drivers that set AP_LINK_PS have the
    set_tim() handler set besides iwlwifi, I can bail out in
    __sta_info_recalc_tim if AP_LINK_PS is set and .set_tim is not
    implemented.

    Signed-off-by: Emmanuel Grumbach
    Signed-off-by: Johannes Berg

    Emmanuel Grumbach
     
  • When running a BA session, the driver (or the hardware) already takes
    care of retransmitting failed frames, since it has to keep the receiver
    reorder window in sync.

    Adding another layer of retransmit around that does not improve
    anything. In fact, it can only lead to some strong reordering with huge
    latency.

    Cc: stable@vger.kernel.org
    Signed-off-by: Felix Fietkau
    Signed-off-by: Johannes Berg

    Felix Fietkau
     
    When RX aggregation starts, the transmitter may continue to send frames
    with SN smaller than SSN until the AddBA response is received.
    However, the reorder buffer is already initialized at this point,
    which will cause the drop of such frames as duplicates since the
    head SN of the reorder buffer is set to the SSN, which is bigger.
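
    The drop described above follows from how 12-bit sequence numbers are
    compared. The following is a minimal sketch, not the driver code: the
    names sn_less() and dropped_as_duplicate() are hypothetical, and the
    comparison assumes the usual circular ordering over a 4096-entry
    sequence space.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define SN_MASK   0x0FFFu  /* 802.11 sequence numbers are 12 bits wide */
    #define SN_MODULO 0x1000u

    /* True when sn1 lies behind sn2 in the circular 12-bit sequence space. */
    static bool sn_less(uint16_t sn1, uint16_t sn2)
    {
        return ((uint16_t)(sn1 - sn2) & SN_MASK) > (SN_MODULO >> 1);
    }

    /* Hypothetical check: a buffer whose head is head_sn treats any frame
     * behind the head as one it has already released. */
    static bool dropped_as_duplicate(uint16_t head_sn, uint16_t frame_sn)
    {
        return sn_less(frame_sn, head_sn);
    }
    ```

    With the head set to an SSN of 100, a late frame carrying SN 98 is
    classified as a duplicate, while SN 100 and SN 101 are not.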

    Cc: stable@vger.kernel.org
    Signed-off-by: Sara Sharon
    Signed-off-by: Johannes Berg

    Sara Sharon
     
  • The issue was found when entering suspend and resume.
    It triggers a warning in:
    mac80211/key.c: ieee80211_enable_keys()
    ...
    WARN_ON_ONCE(sdata->crypto_tx_tailroom_needed_cnt ||
    sdata->crypto_tx_tailroom_pending_dec);
    ...

    It points out sdata->crypto_tx_tailroom_pending_dec isn't cleaned up successfully
    in a delayed_work during suspend. Add a flush_delayed_work to fix it.

    Cc: stable@vger.kernel.org
    Signed-off-by: Matt Chen
    Signed-off-by: Johannes Berg

    Matt Chen
     
  • When adding per-CPU statistics, which added statistics back
    to mac80211 for the fast-RX path, I evidently forgot to add
    the "stats->packets++" line. The reason for that is likely
    that I didn't see it since it's done in defragmentation for
    the regular RX path.

    Add the missing line to properly count received packets in
    the fast-RX case.

    Fixes: c9c5962b56c1 ("mac80211: enable collecting station statistics per-CPU")
    Reported-by: Oren Givon
    Signed-off-by: Johannes Berg

    Johannes Berg
     
    l2tp_ip_backlog_recv must not return -1 if the packet gets dropped.
    The return value is passed up to ip_local_deliver_finish, which treats
    negative values as an IP protocol number for resubmission.
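
    A small sketch of why a negative return is harmful here. The handler
    type and the deliver() function are hypothetical stand-ins for the
    delivery path; only the rule "a negative value is read as a protocol
    number to resubmit" comes from the description above.

    ```c
    /* Hypothetical handler type: returns 0 when the packet was consumed or
     * dropped, or -N to ask the caller to resubmit it as IP protocol N. */
    typedef int (*proto_handler)(void);

    static int drop_returning_minus_one(void) { return -1; } /* problematic */
    static int drop_returning_zero(void)      { return 0; }  /* intended */

    /* Returns the protocol number the packet would be resubmitted as,
     * or 0 if delivery is finished. */
    static int deliver(proto_handler handler)
    {
        int ret = handler();

        if (ret < 0)
            return -ret;  /* negative value interpreted as a protocol */
        return 0;
    }
    ```

    In this sketch a dropped packet that reports -1 would be resubmitted as
    protocol 1, which is ICMP, instead of simply being finished.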

    Signed-off-by: Paul Hüber
    Signed-off-by: David S. Miller

    Paul Hüber
     
  • Fix xfrm_neigh_lookup to provide dst->path to the
    neigh_lookup dst_ops method.

    When skb is provided, the IP address in packet should already
    match the dst->path address family. But for the non-skb case,
    we should consider the last tunnel address as nexthop address.

    Fixes: f894cbf847c9 ("net: Add optional SKB arg to dst_ops->neigh_lookup().")
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
    nla_memdup_cookie() was overwriting the err value, declared at function
    scope and earlier initialized with the result of ->init(). On success
    nla_memdup_cookie() returns 0, and thus the module refcnt was decremented
    although the action was installed.

    $ sudo tc actions add action pass index 1 cookie 1234
    $ sudo tc actions ls action gact

    action order 0: gact action pass
    random type none pass val 0
    index 1 ref 1 bind 0
    $
    $ lsmod
    Module Size Used by
    act_gact 16384 0
    ...
    $
    $ sudo rmmod act_gact
    [ 52.310283] ------------[ cut here ]------------
    [ 52.312551] WARNING: CPU: 1 PID: 455 at kernel/module.c:1113
    module_put+0x99/0xa0
    [ 52.316278] Modules linked in: act_gact(-) crct10dif_pclmul crc32_pclmul
    ghash_clmulni_intel psmouse pcbc evbug aesni_intel aes_x86_64 crypto_simd
    serio_raw glue_helper pcspkr cryptd
    [ 52.322285] CPU: 1 PID: 455 Comm: rmmod Not tainted 4.10.0+ #11
    [ 52.324261] Call Trace:
    [ 52.325132] dump_stack+0x63/0x87
    [ 52.326236] __warn+0xd1/0xf0
    [ 52.326260] warn_slowpath_null+0x1d/0x20
    [ 52.326260] module_put+0x99/0xa0
    [ 52.326260] tcf_hashinfo_destroy+0x7f/0x90
    [ 52.326260] gact_exit_net+0x27/0x40 [act_gact]
    [ 52.326260] ops_exit_list.isra.6+0x38/0x60
    [ 52.326260] unregister_pernet_operations+0x90/0xe0
    [ 52.326260] unregister_pernet_subsys+0x21/0x30
    [ 52.326260] tcf_unregister_action+0x68/0xa0
    [ 52.326260] gact_cleanup_module+0x17/0xa0f [act_gact]
    [ 52.326260] SyS_delete_module+0x1ba/0x220
    [ 52.326260] entry_SYSCALL_64_fastpath+0x1e/0xad
    [ 52.326260] RIP: 0033:0x7f527ffae367
    [ 52.326260] RSP: 002b:00007ffeb402a598 EFLAGS: 00000202 ORIG_RAX:
    00000000000000b0
    [ 52.326260] RAX: ffffffffffffffda RBX: 0000559b069912a0 RCX: 00007f527ffae367
    [ 52.326260] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559b06991308
    [ 52.326260] RBP: 0000000000000003 R08: 00007f5280264420 R09: 00007ffeb4029511
    [ 52.326260] R10: 000000000000087b R11: 0000000000000202 R12: 00007ffeb4029580
    [ 52.326260] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b069912a0
    [ 52.354856] ---[ end trace 90d89401542b0db6 ]---
    $

    With the fix:

    $ sudo modprobe act_gact
    $ lsmod
    Module Size Used by
    act_gact 16384 0
    ...
    $ sudo tc actions add action pass index 1 cookie 1234
    $ sudo tc actions ls action gact

    action order 0: gact action pass
    random type none pass val 0
    index 1 ref 1 bind 0
    $
    $ lsmod
    Module Size Used by
    act_gact 16384 1
    ...
    $ sudo rmmod act_gact
    rmmod: ERROR: Module act_gact is in use
    $
    $ sudo /home/mrv/bin/tc actions del action gact index 1
    $ sudo rmmod act_gact
    $ lsmod
    Module Size Used by
    $
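
    The pattern behind the bug can be sketched as follows. This is an
    illustration only: init_action(), copy_cookie(), struct module_ref and
    the CREATED value are hypothetical stand-ins, and the assumption is that
    the reference is dropped whenever err does not signal a newly created
    action.

    ```c
    #define CREATED 1  /* assumed "new action created" result of init */

    struct module_ref { int refcnt; };

    static int init_action(void) { return CREATED; }
    static int copy_cookie(void) { return 0; }  /* 0 on success */

    /* Problematic shape: the cookie copy reuses err and loses init's result. */
    static int install_overwriting(struct module_ref *m)
    {
        int err;

        m->refcnt++;              /* reference taken for the new action */
        err = init_action();
        err = copy_cookie();      /* CREATED is replaced by 0 */
        if (err != CREATED)
            m->refcnt--;          /* dropped although the action exists */
        return err;
    }

    /* Corrected shape: the cookie result lives in its own variable. */
    static int install_separate(struct module_ref *m)
    {
        int err, cookie_err;

        m->refcnt++;
        err = init_action();
        cookie_err = copy_cookie();
        if (cookie_err)
            return cookie_err;    /* cleanup omitted in this sketch */
        if (err != CREATED)
            m->refcnt--;
        return err;
    }
    ```

    The first function leaves the count at 0 and the second at 1, which
    mirrors the "Used by" column in the two sessions above.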

    Fixes: 1045ba77a ("net sched actions: Add support for user cookies")
    Signed-off-by: Roman Mashak
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Roman Mashak
     
  • Calls made through the in-kernel interface can end up getting stuck because
    of a missed variable update in a loop in rxrpc_recvmsg_data(). The problem
    is like this:

    (1) A new packet comes in and doesn't cause a notification to be given to
    the client as there's still another packet in the ring - the
    assumption being that the client will keep drawing off data until
    the ring is empty.

    (2) The client is in rxrpc_recvmsg_data(), inside the big while loop that
    iterates through the packets. This copies the window pointers into
    variables rather than using the information in the call struct
    because:

    (a) MSG_PEEK might be in effect;

    (b) we need a barrier after reading call->rx_top to pair with the
    barrier in the softirq routine that loads the buffer.

    (3) The reading of call->rx_top is done outside of the loop, and top is
    never updated whilst we're in the loop. This means that even though
    there's a new packet available, we don't see it and may return -EFAULT
    to the caller - who will happily return to the scheduler and await the
    next notification.

    (4) No further notifications are forthcoming until there's an abort as the
    ring isn't empty.

    The fix is to move the read of call->rx_top inside the loop - but it needs
    to be done before the condition is checked.
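
    The difference between the two loop shapes can be sketched in a
    single-threaded model. Everything here is hypothetical: struct ring,
    packet_arrives() (which stands in for the softirq side adding a packet
    mid-loop) and the drain functions. The memory barriers mentioned in
    (2)(b) are deliberately left out.

    ```c
    struct ring {
        unsigned int hard_ack;  /* last sequence number already consumed */
        unsigned int top;       /* highest sequence number in the ring */
    };

    /* Simulates a packet being added while the reader is inside the loop. */
    static void packet_arrives(struct ring *r) { r->top++; }

    /* top is sampled once, outside the loop, so the new packet is missed. */
    static unsigned int drain_stale(struct ring *r)
    {
        unsigned int seq = r->hard_ack + 1;
        unsigned int top = r->top;
        unsigned int consumed = 0;

        while (seq <= top) {
            if (consumed == 0)
                packet_arrives(r);
            consumed++;
            seq++;
        }
        return consumed;
    }

    /* top is re-read before every evaluation of the loop condition. */
    static unsigned int drain_fresh(struct ring *r)
    {
        unsigned int seq = r->hard_ack + 1;
        unsigned int top;
        unsigned int consumed = 0;

        while (top = r->top, seq <= top) {
            if (consumed == 0)
                packet_arrives(r);
            consumed++;
            seq++;
        }
        return consumed;
    }
    ```

    Starting from one queued packet, the first variant consumes one packet
    and leaves the late arrival behind; the second consumes both.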

    Reported-by: Marc Dionne
    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     
    When tc actions are loaded as a module and no actions have been installed,
    flushing them would result in the actions being removed from memory but
    the module's reference count not being decremented, so that the module
    could not be unloaded.

    The following is an example with the GACT action:

    % sudo modprobe act_gact
    % lsmod
    Module Size Used by
    act_gact 16384 0
    %
    % sudo tc actions ls action gact
    %
    % sudo tc actions flush action gact
    % lsmod
    Module Size Used by
    act_gact 16384 1
    % sudo tc actions flush action gact
    % lsmod
    Module Size Used by
    act_gact 16384 2
    % sudo rmmod act_gact
    rmmod: ERROR: Module act_gact is in use
    ....

    After the fix:
    % lsmod
    Module Size Used by
    act_gact 16384 0
    %
    % sudo tc actions add action pass index 1
    % sudo tc actions add action pass index 2
    % sudo tc actions add action pass index 3
    % lsmod
    Module Size Used by
    act_gact 16384 3
    %
    % sudo tc actions flush action gact
    % lsmod
    Module Size Used by
    act_gact 16384 0
    %
    % sudo tc actions flush action gact
    % lsmod
    Module Size Used by
    act_gact 16384 0
    % sudo rmmod act_gact
    % lsmod
    Module Size Used by
    %

    Fixes: f97017cdefef ("net-sched: Fix actions flushing")
    Signed-off-by: Roman Mashak
    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Roman Mashak
     
  • Commit 5e1859fbcc3c ("ipv4: ipmr: various fixes and cleanups") fixed
    the issue for ipv4 ipmr:

    ip_mroute_setsockopt() & ip_mroute_getsockopt() should not
    access/set raw_sk(sk)->ipmr_table before making sure the socket
    is a raw socket, and protocol is IGMP

    The same fix should be done for ipv6 ipmr as well.

    This patch fixes the panic caused when ip_mroute_setsockopt() is called
    on a socket of another type and overwrites the field at the offset that
    ipmr_table occupies in raw_sk(sk).
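
    The guard quoted above can be sketched like this. The struct, the enum
    and set_mr_table() are hypothetical, and the error code returned for a
    rejected socket is an assumption; the sketch uses the IPv4 wording
    (raw socket, IGMP), and an IPv6 variant would check its own protocol.

    ```c
    #include <errno.h>

    enum sock_type { TYPE_STREAM, TYPE_RAW };  /* hypothetical socket types */
    #define PROTO_IGMP 2                       /* IGMP's IP protocol number */

    struct sock_sketch {
        enum sock_type type;
        int protocol;
        int mr_table;  /* stands in for the per-socket table field */
    };

    /* Refuses to touch mr_table unless the socket is a raw IGMP socket. */
    static int set_mr_table(struct sock_sketch *sk, int table)
    {
        if (sk->type != TYPE_RAW || sk->protocol != PROTO_IGMP)
            return -EOPNOTSUPP;  /* error code is an assumption */
        sk->mr_table = table;
        return 0;
    }
    ```

    Checking before the write is the point: on any other socket type the
    same offset may belong to an unrelated field.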

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • Commit b8607805dd15 ("sctp: not copying duplicate addrs to the assoc's
    bind address list") tried to check for duplicate address before copying
    to asoc's bind_addr list from global addr list.

    But all the addrs' sin_ports in global addr list are 0 while the addrs'
    sin_ports are bp->port in asoc's bind_addr list. It means even if it's
    a duplicate address, af->cmp_addr will still return 0 as their
    sin_ports are different.

    This patch is to fix it by setting the sin_port for addr param with
    bp->port before comparing the addrs.
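
    A minimal sketch of the comparison problem, using a hypothetical
    struct addr and cmp_addr() that, like the callback described above,
    returns 0 when the two addresses differ.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    struct addr {
        uint32_t ip;
        uint16_t port;
    };

    /* Non-zero only when both the address and the port match. */
    static bool cmp_addr(const struct addr *a, const struct addr *b)
    {
        return a->ip == b->ip && a->port == b->port;
    }

    /* Copies the global entry, gives it the bind port, then compares. */
    static bool is_duplicate(struct addr global, const struct addr *bound,
                             uint16_t bind_port)
    {
        global.port = bind_port;  /* the step the fix adds */
        return cmp_addr(&global, bound);
    }
    ```

    Without the port assignment, a global entry with port 0 never compares
    equal to a bound entry for the same IP address.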

    Fixes: b8607805dd15 ("sctp: not copying duplicate addrs to the assoc's bind address list")
    Reported-by: Wei Chen
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • priv->bitmap_size stores the real bitmap size, instead of the full
    struct nft_bitmap object.

    Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert()
    returns void") inadvertently changed the successful return value of
    nf_ct_expect_related_report() from 0 to 1 due to
    __nf_ct_expect_check() returning 1 on success. Prevent this
    regression in the future by changing the return value of
    __nf_ct_expect_check() to 0 on success.
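
    The regression can be sketched with hypothetical stand-ins for the two
    functions. The assumption, not stated above, is that callers treat any
    non-zero result as a failure.

    ```c
    /* Hypothetical stand-ins; the real functions take an expectation. */
    static int check_returns_one(void)  { return 1; }  /* old success value */
    static int check_returns_zero(void) { return 0; }  /* new success value */

    /* Forwards the check result once insertion can no longer fail. */
    static int related_report(int (*check)(void))
    {
        int ret = check();

        if (ret < 0)
            return ret;  /* genuine error */
        /* insertion would happen here; it returns void, so ret is unchanged */
        return ret;
    }

    /* A caller that treats any non-zero result as failure. */
    static int caller_sees_failure(int (*check)(void))
    {
        return related_report(check) != 0;
    }
    ```

    With the check returning 1 on success, such a caller reports a failure
    for a successful insertion; with 0 it does not.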

    Signed-off-by: Jarno Rajahalme
    Acked-by: Joe Stringer
    Signed-off-by: Pablo Neira Ayuso

    Jarno Rajahalme
     
  • Restore the lost masking of TOS in input route code to
    allow ip rules to match it properly.

    Problem [1] noticed by Shmulik Ladkani

    [1] http://marc.info/?t=137331755300040&r=1&w=2
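
    A sketch of what the masking achieves. TOS_MASK mirrors the value of
    IPTOS_TOS_MASK; RT_MASK, route_tos() and rule_matches() are hypothetical
    names, and the assumption is that the routing mask clears the two low
    bits of the TOS mask.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define TOS_MASK 0x1E             /* same value as IPTOS_TOS_MASK */
    #define RT_MASK  (TOS_MASK & ~3)  /* assumed routing mask, 0x1C */

    /* Reduces a packet's TOS byte to the bits used for route lookup. */
    static uint8_t route_tos(uint8_t tos)
    {
        return tos & RT_MASK;
    }

    /* Hypothetical rule check: a rule matches when its TOS equals the key. */
    static bool rule_matches(uint8_t rule_tos, uint8_t key_tos)
    {
        return rule_tos == key_tos;
    }
    ```

    A rule written for TOS 0x10 does not match a raw TOS byte of 0x11, but
    it does match once the byte has been masked to 0x10.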

    Fixes: 89aef8921bfb ("ipv4: Delete routing cache.")
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov