12 Sep, 2015

1 commit

  • Pull Ceph update from Sage Weil:
    "There are a few fixes for snapshot behavior with CephFS and support
    for the new keepalive protocol from Zheng, a libceph fix that affects
    both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
    and several small fixes and cleanups from Jianpeng and others"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: improve readahead for file holes
    ceph: get inode size for each append write
    libceph: check data_len in ->alloc_msg()
    libceph: use keepalive2 to verify the mon session is alive
    rbd: plug rbd_dev->header.object_prefix memory leak
    rbd: fix double free on rbd_dev->header_name
    libceph: set 'exists' flag for newly up osd
    ceph: cleanup use of ceph_msg_get
    ceph: no need to get parent inode in ceph_open
    ceph: remove the useless judgement
    ceph: remove redundant test of head->safe and silence static analysis warnings
    ceph: fix queuing inode to mdsdir's snaprealm
    libceph: rename con_work() to ceph_con_workfn()
    libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
    libceph: remove the unused macro AES_KEY_SIZE
    ceph: invalidate dirty pages after forced umount
    ceph: EIO all operations after forced umount

    Linus Torvalds
     

11 Sep, 2015

3 commits

  • Pull networking fixes from David Miller:

    1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
    Kadlecsik.

    2) Use correct free operation on netfilter conntrack templates, from
    Daniel Borkmann.

    3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.

    4) Fix sizeof(pointer) in mac80211, from Thierry Reding.

    5) Fix cache pointer comparison in ip6mr leading to missed unlock of
    mrt_lock. From Richard Laing.

    6) rds_conn_lookup() needs to consider network namespace in key
    comparison, from Sowmini Varadhan.

    7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
    Dmitriy.

    8) Fix fd leaks in bpf syscall, from Daniel Borkmann.

    9) Fix error recovery when installing ipv6 multipath routes, we would
    delete the old route before we would know if we could fully commit
    to the new set of nexthops. Fix from Roopa Prabhu.

    10) Fix run-time suspend problems in r8152, from Hayes Wang.

    11) In fec, don't program the MAC address into the chip when the clocks
    are gated off. From Fugang Duan.

    12) Fix poll behavior for netlink sockets when using rx ring mmap, from
    Daniel Borkmann.

    13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
    driver, from Corinna Vinschen.

    14) In TCP Cubic congestion control, handle idle periods better where we
    are application limited, in order to keep cwnd from growing out of
    control. From Eric Dumzet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    tcp_cubic: better follow cubic curve after idle period
    tcp: generate CA_EVENT_TX_START on data frames
    xen-netfront: respect user provided max_queues
    xen-netback: respect user provided max_queues
    r8169: Fix sleeping function called during get_stats64, v2
    ether: add IEEE 1722 ethertype - TSN
    netlink, mmap: fix edge-case leakages in nf queue zero-copy
    netlink, mmap: don't walk rx ring on poll if receive queue non-empty
    cxgb4: changes for new firmware 1.14.4.0
    net: fec: add netif status check before set mac address
    r8152: fix the runtime suspend issues
    r8152: split DRIVER_VERSION
    ipv6: fix ifnullfree.cocci warnings
    add microchip LAN88xx phy driver
    stmmac: fix check for phydev being open
    net: qlcnic: delete redundant memsets
    net: mv643xx_eth: use kzalloc
    net: jme: use kzalloc() instead of kmalloc+memset
    net: cavium: liquidio: use kzalloc in setup_glist()
    net: ipv6: use common fib_default_rule_pref
    ...

    Linus Torvalds
     
  • Jana Iyengar found an interesting issue on CUBIC :

    The epoch is only updated/reset initially and when experiencing losses.
    The delta "t" of now - epoch_start can be arbitrary large after app idle
    as well as the bic_target. Consequentially the slope (inverse of
    ca->cnt) would be really large, and eventually ca->cnt would be
    lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.

    This particularly shows up when slow_start_after_idle is disabled
    as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle
    time.

    Jana initial fix was to reset epoch_start if app limited,
    but Neal pointed out it would ask the CUBIC algorithm to recalculate the
    curve so that we again start growing steeply upward from where cwnd is
    now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth
    curve to be the same shape, just shifted later in time by the amount of
    the idle period.

    Reported-by: Jana Iyengar
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Cc: Stephen Hemminger
    Cc: Sangtae Ha
    Cc: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Issuing a CC TX_START event on control frames like pure ACK
    is a waste of time, as a CC should not care.

    Following patch needs this change, as we want CUBIC to properly track
    idle time at a low cost, with a single TX_START being generated.

    Yuchung might slightly refine the condition triggering TX_START
    on a followup patch.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Cc: Jana Iyengar
    Cc: Stephen Hemminger
    Cc: Sangtae Ha
    Cc: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Neal Cardwell
     

10 Sep, 2015

6 commits

  • When netlink mmap on receive side is the consumer of nf queue data,
    it can happen that in some edge cases, we write skb shared info into
    the user space mmap buffer:

    Assume a possible rx ring frame size of only 4096, and the network skb,
    which is being zero-copied into the netlink skb, contains page frags
    with an overall skb->len larger than the linear part of the netlink
    skb.

    skb_zerocopy(), which is generic and thus not aware of the fact that
    shared info cannot be accessed for such skbs then tries to write and
    fill frags, thus leaking kernel data/pointers and in some corner cases
    possibly writing out of bounds of the mmap area (when filling the
    last slot in the ring buffer this way).

    I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
    an advertised length larger than 4096, where the linear part is visible
    at the slot beginning, and the leaked sizeof(struct skb_shared_info)
    has been written to the beginning of the next slot (also corrupting
    the struct nl_mmap_hdr slot header incl. status etc), since skb->end
    points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.

    The fix adds and lets __netlink_alloc_skb() take the actual needed
    linear room for the network skb + meta data into account. It's completely
    irrelevant for non-mmaped netlink sockets, but in case mmap sockets
    are used, it can be decided whether the available skb_tailroom() is
    really large enough for the buffer, or whether it needs to internally
    fallback to a normal alloc_skb().

    >From nf queue side, the information whether the destination port is
    an mmap RX ring is not really available without extra port-to-socket
    lookup, thus it can only be determined in lower layers i.e. when
    __netlink_alloc_skb() is called that checks internally for this. I
    chose to add the extra ldiff parameter as mmap will then still work:
    We have data_len and hlen in nfqnl_build_packet_message(), data_len
    is the full length (capped at queue->copy_range) for skb_zerocopy()
    and hlen some possible part of data_len that needs to be copied; the
    rem_len variable indicates the needed remaining linear mmap space.

    The only other workaround in nf queue internally would be after
    allocation time by f.e. cap'ing the data_len to the skb_tailroom()
    iff we deal with an mmap skb, but that would 1) expose the fact that
    we use a mmap skb to upper layers, and 2) trim the skb where we
    otherwise could just have moved the full skb into the normal receive
    queue.

    After the patch, in my test case the ring slot doesn't fit and therefore
    shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
    thus needs to be picked up via recv().

    Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In case of netlink mmap, there can be situations where received frames
    have to be placed into the normal receive queue. The ring buffer indicates
    this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up
    via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED.

    Commit 0ef707700f1c ("netlink: rx mmap: fix POLLIN condition") changed
    polling, so that we walk in the worst case the whole ring through the
    new netlink_has_valid_frame(), for example, when the ring would have no
    NL_MMAP_STATUS_VALID, but at least one NL_MMAP_STATUS_COPY frame.

    Since we do a datagram_poll() already earlier to pick up a mask that could
    possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY),
    we can skip checking the rx ring entirely.

    In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is
    irrelevant anyway as netlink_poll() is just defined as datagram_poll().

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

    NULL check before some freeing functions is not needed.

    Based on checkpatch warning
    "kfree(NULL) is safe this check is probably not required"
    and kfreeaddr.cocci by Julia Lawall.

    Generated by: scripts/coccinelle/free/ifnullfree.cocci

    CC: Roopa Prabhu
    Signed-off-by: Fengguang Wu
    Signed-off-by: David S. Miller

    Wu Fengguang
     
  • This switches IPv6 policy routing to use the shared
    fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
    multicast routing for IPv4 as well as IPv6.

    The motivation for this patch is a complaint about iproute2 behaving
    inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
    IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
    assigned priority value was decreased with each rule added.

    Since then all users of the default_pref field have been converted to
    assign the generic function fib_default_rule_pref(), fib_nl_newrule()
    may just use it directly instead. Therefore get rid of the function
    pointer altogether and make fib_default_rule_pref() static, as it's not
    used outside fib_rules.c anymore.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     
  • Problem:
    The ecmp route replace support for ipv6 in the kernel, deletes the
    existing ecmp route too early, ie when it installs the first nexthop.
    If there is an error in installing the subsequent nexthops, its too late
    to recover the already deleted existing route leaving the fib
    in an inconsistent state.

    This patch reduces the possibility of this by doing the following:
    a) Changes the existing multipath route add code to a two stage process:
    build rt6_infos + insert them
    ip6_route_add rt6_info creation code is moved into
    ip6_route_info_create.
    b) This ensures that most errors are caught during building rt6_infos
    and we fail early
    c) Separates multipath add and del code. Because add needs the special
    two stage mode in a) and delete essentially does not care.
    d) In any event if the code fails during inserting a route again, a
    warning is printed (This should be unlikely)

    Before the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from
    * kernel */

    After the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Roopa Prabhu
    Reviewed-by: Nikolay Aleksandrov
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • There was no verification that an underlying transport exists when creating
    a connection, this would cause dereferencing a NULL ptr.

    It might happen on sockets that weren't properly bound before attempting to
    send a message, which will cause a NULL ptr deref:

    [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [135546.051270] Modules linked in:
    [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
    [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
    [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
    [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202
    [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
    [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
    [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
    [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
    [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
    [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
    [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
    [135546.064723] Stack:
    [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
    [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
    [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
    [135546.068629] Call Trace:
    [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
    [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
    [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
    [135546.071981] rds_sendmsg (net/rds/send.c:1058)
    [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
    [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
    [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
    [135546.076349] ? __might_fault (mm/memory.c:3795)
    [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
    [135546.078856] SYSC_sendto (net/socket.c:1657)
    [135546.079596] ? SYSC_connect (net/socket.c:1628)
    [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
    [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
    [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
    [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
    [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

09 Sep, 2015

10 commits

  • Pull inifiniband/rdma updates from Doug Ledford:
    "This is a fairly sizeable set of changes. I've put them through a
    decent amount of testing prior to sending the pull request due to
    that.

    There are still a few fixups that I know are coming, but I wanted to
    go ahead and get the big, sizable chunk into your hands sooner rather
    than waiting for those last few fixups.

    Of note is the fact that this creates what is intended to be a
    temporary area in the drivers/staging tree specifically for some
    cleanups and additions that are coming for the RDMA stack. We
    deprecated two drivers (ipath and amso1100) and are waiting to hear
    back if we can deprecate another one (ehca). We also put Intel's new
    hfi1 driver into this area because it needs to be refactored and a
    transfer library created out of the factored out code, and then it and
    the qib driver and the soft-roce driver should all be modified to use
    that library.

    I expect drivers/staging/rdma to be around for three or four kernel
    releases and then to go away as all of the work is completed and final
    deletions of deprecated drivers are done.

    Summary of changes for 4.3:

    - Create drivers/staging/rdma
    - Move amso1100 driver to staging/rdma and schedule for deletion
    - Move ipath driver to staging/rdma and schedule for deletion
    - Add hfi1 driver to staging/rdma and set TODO for move to regular
    tree
    - Initial support for namespaces to be used on RDMA devices
    - Add RoCE GID table handling to the RDMA core caching code
    - Infrastructure to support handling of devices with differing read
    and write scatter gather capabilities
    - Various iSER updates
    - Kill off unsafe usage of global mr registrations
    - Update SRP driver
    - Misc mlx4 driver updates
    - Support for the mr_alloc verb
    - Support for a netlink interface between kernel and user space cache
    daemon to speed path record queries and route resolution
    - Ininitial support for safe hot removal of verbs devices"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
    IB/ipoib: Suppress warning for send only join failures
    IB/ipoib: Clean up send-only multicast joins
    IB/srp: Fix possible protection fault
    IB/core: Move SM class defines from ib_mad.h to ib_smi.h
    IB/core: Remove unnecessary defines from ib_mad.h
    IB/hfi1: Add PSM2 user space header to header_install
    IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
    mlx5: Fix incorrect wc pkey_index assignment for GSI messages
    IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
    IB/uverbs: reject invalid or unknown opcodes
    IB/cxgb4: Fix if statement in pick_local_ip6adddrs
    IB/sa: Fix rdma netlink message flags
    IB/ucma: HW Device hot-removal support
    IB/mlx4_ib: Disassociate support
    IB/uverbs: Enable device removal when there are active user space applications
    IB/uverbs: Explicitly pass ib_dev to uverbs commands
    IB/uverbs: Fix race between ib_uverbs_open and remove_one
    IB/uverbs: Fix reference counting usage of event files
    IB/core: Make ib_dealloc_pd return void
    IB/srp: Create an insecure all physical rkey only if needed
    ...

    Linus Torvalds
     
  • Only ->alloc_msg() should check data_len of the incoming message
    against the preallocated ceph_msg, doing it in the messenger is not
    right. The contract is that either ->alloc_msg() returns a ceph_msg
    which will fit all of the portions of the incoming message, or it
    returns NULL and possibly sets skip, signaling whether NULL is due to
    an -ENOMEM. ->alloc_msg() should be the only place where we make the
    skip/no-skip decision.

    I stumbled upon this while looking at con/osd ref counting. Right now,
    if we get a non-extent message with a larger data portion than we are
    prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
    it in the messenger, we don't put the con/osd ref acquired in
    ceph_con_in_msg_alloc() (which is normally put in process_message()),
    so this also fixes a memory leak.

    An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
    corrupt random memory should a buggy ->alloc_msg() return an unfit
    ceph_msg.

    While at it, I changed the "unknown tid" dout() to a pr_warn() to make
    sure all skips are seen and unified format strings.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • If an attempt to wake up users of broadcast link is made when there is
    no enough place in send queue than it may hang up inside the
    tipc_sk_rcv() function since the loop breaks only after the wake up
    queue becomes empty. This can lead to complete CPU stall with the
    following message generated by RCU:

    INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies
    g=54225 c=54224 q=11465)
    Task dump for CPU 0:
    tpch R running task 0 39949 39948 0x0000000a
    ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
    ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
    0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
    Call Trace:
    [] sched_show_task+0xae/0x120
    [] dump_cpu_task+0x38/0x40
    [] rcu_dump_cpu_stacks+0x90/0xd0
    [] rcu_check_callbacks+0x3eb/0x6e0
    [] ? account_system_time+0x7f/0x170
    [] update_process_times+0x34/0x60
    [] tick_sched_handle.isra.18+0x31/0x40
    [] tick_sched_timer+0x3c/0x70
    [] __run_hrtimer.isra.34+0x3d/0xc0
    [] hrtimer_interrupt+0xc5/0x1e0
    [] ? native_smp_send_reschedule+0x42/0x60
    [] local_apic_timer_interrupt+0x34/0x60
    [] smp_apic_timer_interrupt+0x3c/0x60
    [] apic_timer_interrupt+0x6b/0x70
    [] ? _raw_spin_unlock_irqrestore+0x9/0x10
    [] __wake_up_sync_key+0x4f/0x60
    [] tipc_write_space+0x31/0x40 [tipc]
    [] filter_rcv+0x31f/0x520 [tipc]
    [] ? tipc_sk_lookup+0xc9/0x110 [tipc]
    [] ? _raw_spin_lock_bh+0x19/0x30
    [] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
    [] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
    [] tipc_node_unlock+0x186/0x190 [tipc]
    [] ? kfree_skb+0x2c/0x40
    [] tipc_rcv+0x2ac/0x8c0 [tipc]
    [] tipc_l2_rcv_msg+0x38/0x50 [tipc]
    [] __netif_receive_skb_core+0x5a3/0x950
    [] __netif_receive_skb+0x13/0x60
    [] netif_receive_skb_internal+0x1e/0x90
    [] napi_gro_receive+0x78/0xa0
    [] tg3_poll_work+0xc54/0xf40 [tg3]
    [] ? consume_skb+0x2c/0x40
    [] tg3_poll_msix+0x41/0x160 [tg3]
    [] net_rx_action+0xe2/0x290
    [] __do_softirq+0xda/0x1f0
    [] irq_exit+0x76/0xa0
    [] do_IRQ+0x55/0xf0
    [] common_interrupt+0x6b/0x6b

    The issue occurs only when tipc_sk_rcv() is used to wake up postponed
    senders:

    tipc_bclink_wakeup_users()
    // wakeupq - is a queue which consists of special
    // messages with SOCK_WAKEUP type.
    tipc_sk_rcv(wakeupq)
    ...
    while (skb_queue_len(inputq)) {
    filter_rcv(skb)
    // Here the type of message is checked
    // and if it is SOCK_WAKEUP then
    // it tries to wake up a sender.
    tipc_write_space(sk)
    wake_up_interruptible_sync_poll()
    }

    After the sender thread is woke up it can gather control and perform
    an attempt to send a message. But if there is no enough place in send
    queue it will call link_schedule_user() function which puts a message
    of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
    Thus the size of the queue actually is not changed and the while()
    loop never exits.

    The approach I proposed is to wake up only senders for which there is
    enough place in send queue so the described issue can't occur.
    Moreover the same approach is already used to wake up senders on
    unicast links.

    I have got into the issue on our product code but to reproduce the
    issue I changed a benchmark test application (from
    tipcutils/demos/benchmark) to perform the following scenario:
    1. Run 64 instances of test application (nodes). It can be done
    on the one physical machine.
    2. Each application connects to all other using TIPC sockets in
    RDM mode.
    3. When setup is done all nodes start simultaneously send
    broadcast messages.
    4. Everything hangs up.

    The issue is reproducible only when a congestion on broadcast link
    occurs. For example, when there are only 8 nodes it works fine since
    congestion doesn't occur. Send queue limit is 40 in my case (I use a
    critical importance level) and when 64 nodes send a message at the
    same moment a congestion occurs every time.

    Signed-off-by: Dmitry S Kolmakov
    Reviewed-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Kolmakov Dmitriy
     
  • Remove the unnecessary switchdev.h include from br_netlink.c.

    Signed-off-by: Vivien Didelot
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Since __vlan_del can return an error code, change its inner function
    __vlan_vid_del to return an eventual error from switchdev_port_obj_del.

    Signed-off-by: Vivien Didelot
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Signed-off-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Even though it's static, con_work(), being a work func, shows up in
    various stacktraces a lot. Prefix it with ceph_.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • ceph_msgr_slab_init may fail due to a temporary ENOMEM.

    Delay a bit the initialization of zero_page in ceph_msgr_init and
    reorder its cleanup in _ceph_msgr_exit so it's done in reverse
    order of setup.

    BUG_ON() will not suffer to be postponed in case it is triggered.

    Signed-off-by: Benoît Canet
    Reviewed-by: Alex Elder
    Signed-off-by: Ilya Dryomov

    Benoît Canet
     
  • This removes the no longer used macro AES_KEY_SIZE as no functions use
    this macro anymore and thus this macro can be removed due it no longer
    being required.

    Signed-off-by: Nicholas Krause
    Signed-off-by: Ilya Dryomov

    Nicholas Krause
     

08 Sep, 2015

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix atomicity of pNFS commit list updates
    - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
    - nfs_set_pgio_error sometimes misses errors
    - Fix a thinko in xs_connect()
    - Fix borkage in _same_data_server_addrs_locked()
    - Fix a NULL pointer dereference of migration recovery ops for v4.2
    client
    - Don't let the ctime override attribute barriers.
    - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
    - Ensure flexfiles pNFS driver updates the inode after write finishes
    - flexfiles must not pollute the attribute cache with attrbutes from
    the DS
    - Fix a protocol error in layoutreturn
    - Fix a protocol issue with NFSv4.1 CLOSE stateids

    Bugfixes + cleanups
    - pNFS blocks bugfixes from Christoph
    - Various cleanups from Anna
    - More fixes for delegation corner cases
    - Don't fsync twice for O_SYNC/IS_SYNC files
    - Fix pNFS and flexfiles layoutstats bugs
    - pnfs/flexfiles: avoid duplicate tracking of mirror data
    - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
    issues
    - pnfs/flexfiles: error handling retries a layoutget before fallback
    to MDS

    Features:
    - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
    Kinglong
    - More RDMA client transport improvements from Chuck
    - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
    verbs from the SUNRPC, Lustre and core infiniband tree.
    - Optimise away the close-to-open getattr if there is no cached data"

    * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
    NFSv4: Respect the server imposed limit on how many changes we may cache
    NFSv4: Express delegation limit in units of pages
    Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
    NFS: Optimise away the close-to-open getattr if there is no cached data
    NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
    NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
    nfs: Remove unneeded checking of the return value from scnprintf
    nfs: Fix truncated client owner id without proto type
    NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
    NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
    NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
    NFSv4.1/flexfiles: Fix freeing of mirrors
    NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
    NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
    NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
    NFSv4.1: Fix a protocol issue with CLOSE stateids
    NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
    SUNRPC: Prevent SYN+SYNACK+RST storms
    SUNRPC: xs_reset_transport must mark the connection as disconnected
    NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
    ...

    Linus Torvalds
     

07 Sep, 2015

2 commits

  • There's no particular desire to have conntrack action support in Open
    vSwitch as an independently configurable bit, rather just to ensure
    there is not a hard dependency. This exposed option doesn't accurately
    reflect the conntrack dependency when enabled, so simplify this by
    removing the option. Compile the support if NF_CONNTRACK is enabled.

    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Signed-off-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    For the first round of fixes, we have this:
    * fix for the sizeof() pointer type issue
    * a fix for regulatory getting into a restore loop
    * a fix for rfkill global 'all' state, it needs to be stored
    everywhere to apply correctly to new rfkill instances
    * properly refuse CQM RSSI when it cannot actually be used
    * protect HT TDLS traffic properly in non-HT networks
    * don't incorrectly advertise 80 MHz support when not allowed
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     

06 Sep, 2015

5 commits

  • Only return a conn if the rds_conn_net(conn) matches the struct
    net passed to rds_conn_lookup().

    Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints,
    one per netns.")

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • switchdev_port_fdb_dump is used as .ndo_fdb_dump. Its return value is
    idx, so we cannot return errval.

    Fixes: 45d4122ca7cd ("switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.")
    Signed-off-by: Jiri Pirko
    Acked-by: Sridhar Samudrala
    Acked-by: Scott Feldman
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Conflicts:
    include/net/netfilter/nf_conntrack.h

    The conflict was an overlap between changing the type of the zone
    argument to nf_ct_tmpl_alloc() whilst exporting nf_ct_tmpl_free.

    Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net, they are:

    1) Oneliner to restore maps in nf_tables since we support addressing registers
    at 32 bits level.

    2) Restore previous default behaviour in bridge netfilter when CONFIG_IPV6=n,
    oneliner from Bernhard Thaler.

    3) Out of bound access in ipset hash:net* set types, reported by Dave Jones'
    KASan utility, patch from Jozsef Kadlecsik.

    4) Fix ipset compilation with gcc 4.4.7 related to C99 initialization of
    unnamed unions, patch from Elad Raz.

    5) Add a workaround to address inconsistent endianess in the res_id field of
    nfnetlink batch messages, reported by Florian Westphal.

    6) Fix error paths of CT/synproxy since the conntrack template was moved to use
    kmalloc, patch from Daniel Borkmann.

    All of them look good to me to reach 4.2, I can route this to -stable myself
    too, just let me know what you prefer.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In the IPv6 multicast routing code the mrt_lock was not being released
    correctly in the MFC iterator, as a result adding or deleting a MIF would
    cause a hang because the mrt_lock could not be acquired.

    This fix is a copy of the code for the IPv4 case and ensures that the lock
    is released correctly.

    Signed-off-by: Richard Laing
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Richard Laing
     
  • Pull nfsd updates from Bruce Fields:
    "Nothing major, but:

    - Add Jeff Layton as an nfsd co-maintainer: no change to existing
    practice, just an acknowledgement of the status quo.

    - Two patches ("nfsd: ensure that...") for a race overlooked by the
    state locking rewrite, causing a crash noticed by multiple users.

    - Lots of smaller bugfixes all over from Kinglong Mee.

    - From Jeff, some cleanup of server rpc code in preparation for
    possible shift of nfsd threads to workqueues"

    * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits)
    nfsd: deal with DELEGRETURN racing with CB_RECALL
    nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
    nfsd: ensure that delegation stateid hash references are only put once
    nfsd: ensure that the ol stateid hash reference is only put once
    net: sunrpc: fix tracepoint Warning: unknown op '->'
    nfsd: allow more than one laundry job to run at a time
    nfsd: don't WARN/backtrace for invalid container deployment.
    fs: fix fs/locks.c kernel-doc warning
    nfsd: Add Jeff Layton as co-maintainer
    NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
    NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
    nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
    nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
    NFSD: Store parent's stat in a separate value
    nfsd: Fix two typos in comments
    lockd: NLM grace period shouldn't block NFSv4 opens
    nfsd: include linux/nfs4.h in export.h
    sunrpc: Switch to using hash list instead single list
    sunrpc/nfsd: Remove redundant code by exports seq_operations functions
    sunrpc: Store cache_detail in seq_file's private directly
    ...

    Linus Torvalds
     

05 Sep, 2015

2 commits

  • userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
    of the current hardcoded 1 (that would wake just the first waitqueue in
    the head list).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

04 Sep, 2015

8 commits

  • When beacon filtering is enabled the mac80211 software implementation
    for RSSI CQM cannot work as beacons will not be available. Rather than
    accepting such a configuration without proper effect, reject it.

    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Currently if 80MHz channels are not allowed for use, the VHT IE is not
    included in the probe request for an AP. This is not good enough if the
    AP is configured with the wrong regulatory and supports VHT even where
    prohibited or in TDLS scenarios.
    Mark the ifmgd with the DISABLE_VHT flag for the misbehaving-AP case, and
    unset VHT support from the peer-station entry for the TDLS case.

    Signed-off-by: Arik Nemtsov
    Signed-off-by: Emmanuel Grumbach
    Signed-off-by: Johannes Berg

    Arik Nemtsov
     
  • restore_regulatory_settings() should restore alpha2
    as computed in restore_alpha2(), not raw user_alpha2 to
    behave as described in the comment just above that code.

    This fixes endless loop of calling CRDA for "00" and "97"
    countries after resume from suspend on my laptop.

    Looks like others had the same problem, too:
    http://ath9k-devel.ath9k.narkive.com/knY5W6St/ath9k-and-crda-messages-in-logs
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/899335
    https://forum.porteus.org/viewtopic.php?t=4975&p=36436
    https://forums.opensuse.org/showthread.php/483356-Authentication-Regulatory-Domain-issues-ath5k-12-2

    Signed-off-by: Maciej Szmigiero
    Signed-off-by: Johannes Berg

    Maciej S. Szmigiero
     
  • When switching the state of all RFKill switches of type all we need to
    replicate the RFKILL_TYPE_ALL global state to all the other types global
    state, so it is used to initialize persistent RFKill switches on
    register.

    Signed-off-by: João Paulo Rechi Vita
    Acked-by: Marcel Holtmann
    Signed-off-by: Johannes Berg

    João Paulo Rechi Vita
     
  • HT TDLS traffic should be protected in a non-HT BSS to avoid
    collisions. Therefore, when TDLS peers join/leave, check if
    protection is (now) needed and set the ht_operation_mode of
    the virtual interface according to the HT capabilities of the
    TDLS peer(s).

    This works because a non-HT BSS connection never sets (or
    otherwise uses) the ht_operation_mode; it just means that
    drivers must be aware that this field applies to all HT
    traffic for this virtual interface, not just the traffic
    within the BSS. Document that.

    Signed-off-by: Avri Altman
    Signed-off-by: Johannes Berg

    Avri Altman
     
  • The rate_control_cap_mask() function takes a parameter mcs_mask, which
    GCC will take to be u8 * even though it was declared with a fixed size.
    This causes the following warning:

    net/mac80211/rate.c: In function 'rate_control_cap_mask':
    net/mac80211/rate.c:719:25: warning: 'sizeof' on array function parameter 'mcs_mask' will return size of 'u8 * {aka unsigned char *}' [-Wsizeof-array-argument]
    for (i = 0; i < sizeof(mcs_mask); i++)
    ^
    net/mac80211/rate.c:684:10: note: declared here
    u8 mcs_mask[IEEE80211_HT_MCS_MASK_LEN],
    ^

    This can be easily fixed by using the IEEE80211_HT_MCS_MASK_LEN directly
    within the loop condition.

    Signed-off-by: Thierry Reding
    Signed-off-by: Johannes Berg

    Thierry Reding
     
  • Commit 0ca50d12fe46 added a restriction that the address must belong to
    the output interface, so that sctp will use the right interface even
    when using secondary addresses.

    But it breaks IPVS setups, on which people is used to attach VIP
    addresses to loopback interface on real servers. It's preferred to
    attach to the interface actually in use, but it's a very common setup
    and that used to work.

    This patch then saves the first routing good result, even if it would be
    going out through an interface that doesn't have that address. If no
    better hit found, it's then used. This effectively restores the original
    behavior if no better interface could be found.

    Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses")
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • Commit 0ca50d12fe46 failed to release the reference to dst entries that
    it decided to skip.

    Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses")
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

03 Sep, 2015

2 commits

  • Pull networking updates from David Miller:
    "Another merge window, another set of networking changes. I've heard
    rumblings that the lightweight tunnels infrastructure has been voted
    networking change of the year. But what do I know?

    1) Add conntrack support to openvswitch, from Joe Stringer.

    2) Initial support for VRF (Virtual Routing and Forwarding), which
    allows the segmentation of routing paths without using multiple
    devices. There are some semantic kinks to work out still, but
    this is a reasonably strong foundation. From David Ahern.

    3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov.

    4) Ignore route nexthops with a link down state in ipv6, just like
    ipv4. From Andy Gospodarek.

    5) Remove spinlock from fast path of act_gact and act_mirred, from
    Eric Dumazet.

    6) Document the DSA layer, from Florian Fainelli.

    7) Add netconsole support to bcmgenet, systemport, and DSA. Also
    from Florian Fainelli.

    8) Add Mellanox Switch Driver and core infrastructure, from Jiri
    Pirko.

    9) Add support for "light weight tunnels", which allow for
    encapsulation and decapsulation without bearing the overhead of a
    full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
    others.

    10) Add Identifier Locator Addressing support for ipv6, from Tom
    Herbert.

    11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

    12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

    13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

    14) Stop using a zero TX queue length to mean that a device shouldn't
    have a qdisc attached, use an explicit flag instead. From Phil
    Sutter.

    15) Use generic geneve netdevice infrastructure in openvswitch, from
    Pravin B Shelar.

    16) Add infrastructure to avoid re-forwarding a packet in software
    that was already forwarded by a hardware switch. From Scott
    Feldman.

    17) Allow AF_PACKET fanout function to be implemented in a bpf
    program, from Willem de Bruijn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
    netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
    netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
    net: fec: clear receive interrupts before processing a packet
    ipv6: fix exthdrs offload registration in out_rt path
    xen-netback: add support for multicast control
    bgmac: Update fixed_phy_register()
    sock, diag: fix panic in sock_diag_put_filterinfo
    flow_dissector: Use 'const' where possible.
    flow_dissector: Fix function argument ordering dependency
    ixgbe: Resolve "initialized field overwritten" warnings
    ixgbe: Remove bimodal SR-IOV disabling
    ixgbe: Add support for reporting 2.5G link speed
    ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
    ixgbe: support for ethtool set_rxfh
    ixgbe: Avoid needless PHY access on copper phys
    ixgbe: cleanup to use cached mask value
    ixgbe: Remove second instance of lan_id variable
    ixgbe: use kzalloc for allocating one thing
    flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
    ixgbe: Remove unused PCI bus types
    ...

    Linus Torvalds
     
  • Fengguang reported, that some randconfig generated the following linker
    issue with nf_ct_zone_dflt object involved:

    [...]
    CC init/version.o
    LD init/built-in.o
    net/built-in.o: In function `ipv4_conntrack_defrag':
    nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
    net/built-in.o: In function `ipv6_defrag':
    nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
    make: *** [vmlinux] Error 1

    Given that configurations exist where we have a built-in part, which is
    accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
    and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
    module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
    area when netfilter is configured in general.

    Therefore, split the more generic parts into a common header under
    include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
    section that already holds parts related to CONFIG_NF_CONNTRACK in the
    netfilter core. This fixes the issue on my side.

    Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann