21 Nov, 2018

1 commit

  • commit f393808dc64149ccd0e5a8427505ba2974a59854 upstream.

    If there's no entry to drop in the bucket that corresponds to the hash,
    early_drop() should look for it in other buckets. But since it increments
    hash instead of bucket number, it actually looks in the same bucket 8
    times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
    reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
    most cases.

    Fix it by increasing bucket number instead of hash and rename _hash
    to bucket to avoid future confusion.

    Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
    Cc: # v4.7+
    Signed-off-by: Vasily Khoruzhick
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Vasily Khoruzhick
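
The effect described above can be reproduced in user space. The following is a minimal sketch, not the kernel code: reciprocal_scale() is re-implemented with the same arithmetic as the kernel helper, while probe_by_hash(), probe_by_bucket() and NTRIES are hypothetical names chosen for illustration. It counts how many distinct buckets each probing strategy visits in 8 attempts.

```c
#include <assert.h>
#include <stdint.h>

#define NTRIES 8

/* Same arithmetic as the kernel's reciprocal_scale(): maps a 32-bit value
 * onto the range [0, ep_ro) without a division. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Count the distinct values among the n visited bucket numbers. */
static unsigned int count_distinct(const uint32_t *b, unsigned int n)
{
    unsigned int distinct = 0;

    for (unsigned int i = 0; i < n; i++) {
        unsigned int j;

        for (j = 0; j < i; j++)
            if (b[j] == b[i])
                break;
        if (j == i)
            distinct++;
    }
    return distinct;
}

/* Old behaviour: increment the hash, then scale it to a bucket number. */
static unsigned int probe_by_hash(uint32_t hash, uint32_t hsize)
{
    uint32_t visited[NTRIES];

    for (unsigned int i = 0; i < NTRIES; i++)
        visited[i] = reciprocal_scale(hash + i, hsize);
    return count_distinct(visited, NTRIES);
}

/* Fixed behaviour: scale once, then step the bucket number itself. */
static unsigned int probe_by_bucket(uint32_t hash, uint32_t hsize)
{
    uint32_t visited[NTRIES];
    uint32_t bucket = 0;

    assert(hsize >= NTRIES);
    for (unsigned int i = 0; i < NTRIES; i++) {
        if (i == 0)
            bucket = reciprocal_scale(hash, hsize);
        else
            bucket = (bucket + 1) % hsize;
        visited[i] = bucket;
    }
    return count_distinct(visited, NTRIES);
}
```

With hsize = 16384 each bucket covers 2^18 consecutive hash values, so adding 1 to the hash almost never changes the bucket, whereas stepping the bucket number always does.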
     

24 Aug, 2018

1 commit

  • [ Upstream commit 2045cdfa1b40d66f126f3fd05604fc7c754f0022 ]

    Loading the nf_conntrack module with doubled hashsize parameter, i.e.
    modprobe nf_conntrack hashsize=12345 hashsize=12345
    causes NULL-ptr deref.

    If 'hashsize' is specified twice, the nf_conntrack_set_hashsize() function
    will also be called twice.
    The first nf_conntrack_set_hashsize() call will set the
    'nf_conntrack_htable_size' variable:

    nf_conntrack_set_hashsize()
    ...
        /* On boot, we can set this without any fancy locking. */
        if (!nf_conntrack_htable_size)
            return param_set_uint(val, kp);

    But on the second invocation, nf_conntrack_htable_size is already set,
    so nf_conntrack_set_hashsize() will take a different path and call
    the nf_conntrack_hash_resize() function, which will crash on the attempt
    to dereference the 'nf_conntrack_hash' pointer:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
    Call Trace:
    nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
    parse_args+0x1f9/0x5a0
    load_module+0x1281/0x1a50
    __se_sys_finit_module+0xbe/0xf0
    do_syscall_64+0x7c/0x390
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fix this by checking !nf_conntrack_hash instead of
    !nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
    after the module is loaded, so the second invocation of
    nf_conntrack_set_hashsize() won't crash; it will just reinitialize
    nf_conntrack_htable_size again.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
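
The difference between the two checks can be modelled in user space. This is a sketch under stated assumptions: struct ct_state_sketch, the enum and both functions are hypothetical stand-ins, and the NULL-pointer crash is represented by a return code rather than an actual dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two module-level variables involved. */
struct ct_state_sketch {
    unsigned int htable_size; /* stands in for nf_conntrack_htable_size */
    void *hash;               /* stands in for nf_conntrack_hash, NULL until init */
};

enum set_result {
    SET_DIRECT,            /* boot-time path: just store the value */
    SET_RESIZE_OK,         /* resize path with an allocated table */
    SET_RESIZE_NULL_DEREF  /* resize path would walk a NULL table */
};

/* Old check: "size is zero" is treated as "table not allocated yet". */
static enum set_result set_hashsize_old(struct ct_state_sketch *s, unsigned int val)
{
    assert(s != NULL);
    if (!s->htable_size) {
        s->htable_size = val;
        return SET_DIRECT;
    }
    if (!s->hash)
        return SET_RESIZE_NULL_DEREF;
    s->htable_size = val;
    return SET_RESIZE_OK;
}

/* New check: test the table pointer itself. */
static enum set_result set_hashsize_new(struct ct_state_sketch *s, unsigned int val)
{
    assert(s != NULL);
    if (!s->hash) {
        s->htable_size = val;
        return SET_DIRECT;
    }
    s->htable_size = val;
    return SET_RESIZE_OK;
}
```

Calling the old variant twice before the table exists reaches the resize path on the second call, while the new variant simply stores the value again.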
     

07 Sep, 2017

1 commit

  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

04 Sep, 2017

2 commits


25 Aug, 2017

1 commit


02 Aug, 2017

1 commit

  • When a nf_conntrack_l3/4proto parameter is not on the left hand side
    of an assignment, its address is not taken, and it is not passed to a
    function that may modify its fields, then it can be declared as const.

    This change is useful from a documentation point of view, and can
    possibly facilitate making some nf_conntrack_l3/4proto structures const
    subsequently.

    Done with the help of Coccinelle.

    Signed-off-by: Julia Lawall
    Signed-off-by: Pablo Neira Ayuso

    Julia Lawall
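
As a small illustration of the rule stated above, the sketch below uses a hypothetical, heavily trimmed descriptor (struct l4proto_sketch and its fields are assumptions, not the kernel's nf_conntrack_l4proto layout). A function that only reads through its parameter can take it as const, and that in turn allows the descriptor itself to be declared const.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a protocol tracker descriptor. */
struct l4proto_sketch {
    unsigned char l4proto;        /* protocol number, e.g. 6 for TCP */
    unsigned int default_timeout; /* illustrative field only */
};

/* Only reads through the pointer: no assignment to *p, no address taken,
 * not passed on to anything that could modify it. So it can be const. */
static unsigned int proto_timeout(const struct l4proto_sketch *p)
{
    assert(p != NULL);
    return p->default_timeout;
}

/* Once every user takes a const pointer, the object can be const too. */
static const struct l4proto_sketch tcp_sketch = { 6, 120 };
```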
     

01 Aug, 2017

3 commits


26 Jul, 2017

1 commit

  • As we want to remove spin_unlock_wait() and replace it with explicit
    spin_lock()/spin_unlock() calls, we can use this to simplify the
    locking.

    In addition:
    - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
    - The new code avoids the backwards loop.

    Only slightly tested, I did not manage to trigger calls to
    nf_conntrack_all_lock().

    V2: With improved comments, to clearly show how the barriers
    pair.

    Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all buckets")
    Signed-off-by: Manfred Spraul
    Cc:
    Cc: Alan Stern
    Cc: Sasha Levin
    Cc: Pablo Neira Ayuso
    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: Paul E. McKenney

    Manfred Spraul
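
The locking scheme described above can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel implementation: lock_sketch is a toy test-and-set spinlock, and all function and variable names are hypothetical. The point is the pairing of the ACQUIRE load in the per-bucket path with the RELEASE store in the global unlock, and the forward lock/unlock pass over all buckets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_BUCKET_LOCKS 4

/* Toy test-and-set spinlock standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag held; } lock_sketch;

static lock_sketch bucket_locks[NR_BUCKET_LOCKS];
static lock_sketch all_lock;
static atomic_bool locks_all;

static void lock_take(lock_sketch *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

static void lock_drop(lock_sketch *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void locks_setup(void)
{
    for (int i = 0; i < NR_BUCKET_LOCKS; i++)
        atomic_flag_clear(&bucket_locks[i].held);
    atomic_flag_clear(&all_lock.held);
    atomic_init(&locks_all, false);
}

/* Per-bucket lock. Fast path: take the bucket lock and check the flag. */
static void bucket_lock(unsigned int i)
{
    assert(i < NR_BUCKET_LOCKS);
    lock_take(&bucket_locks[i]);
    /* ACQUIRE load, pairs with the RELEASE store in all_lock_drop(). */
    if (!atomic_load_explicit(&locks_all, memory_order_acquire))
        return;
    /* Slow path: a global locker is active, so queue up behind it. */
    lock_drop(&bucket_locks[i]);
    lock_take(&all_lock);
    lock_take(&bucket_locks[i]);
    lock_drop(&all_lock);
}

static void bucket_unlock(unsigned int i)
{
    lock_drop(&bucket_locks[i]);
}

/* Global lock: set the flag, then wait out every current bucket holder
 * with a plain forward pass of lock/unlock pairs. */
static void all_lock_take(void)
{
    lock_take(&all_lock);
    atomic_store(&locks_all, true);
    for (int i = 0; i < NR_BUCKET_LOCKS; i++) {
        lock_take(&bucket_locks[i]);
        lock_drop(&bucket_locks[i]);
    }
}

static void all_lock_drop(void)
{
    /* RELEASE store, pairs with the ACQUIRE load in bucket_lock(). */
    atomic_store_explicit(&locks_all, false, memory_order_release);
    lock_drop(&all_lock);
}
```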
     

24 Jul, 2017

1 commit

  • This patch removes duplicate rcu_read_lock().

    1. IPVS part:

    According to Julian Anastasov's mention, contexts of ipvs are described
    at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary:

    - packet RX/TX: does not need locks because packets come from hooks.
    - sync msg RX: backup server uses RCU locks while registering new
    connections.
    - ip_vs_ctl.c: configuration get/set, RCU locks needed.
    - xt_ipvs.c: It is a netfilter match, running from hook context.

    As a result, rcu_read_lock and rcu_read_unlock can be removed from:

    - ip_vs_core.c: all
    - ip_vs_ctl.c: only from ip_vs_has_real_service
    - ip_vs_ftp.c: all
    - ip_vs_proto_sctp.c: all
    - ip_vs_proto_tcp.c: all
    - ip_vs_proto_udp.c: all
    - ip_vs_xmit.c: all (contains only packet processing)

    2. Netfilter part:

    There are three types of functions that are guaranteed to run under
    rcu_read_lock(). First, functions that are only called by nf_hook():

    - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
    - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
    - match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
    - xt_osf_match_packet().

    Second, functions whose caller already holds the rcu_read_lock():
    - destroy_conntrack(), ctnetlink_conntrack_event().
    - ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

    Third, functions that are a mix of type 1 and type 2.

    These functions are called by nf_hook() and also by
    ordinary functions that already hold the rcu_read_lock():

    - __ctnetlink_glue_build(), ctnetlink_expect_event().
    - ctnetlink_proto_size().

    The affected files are:

    - nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
    - nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
    - nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
    - xt_connlimit.c, xt_hashlimit.c, xt_osf.c

    Detailed calltrace can be found at:
    http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

    Signed-off-by: Taehee Yoo
    Acked-by: Julian Anastasov
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

20 Jun, 2017

1 commit

  • Quoting Joe Stringer:
    If a user loads nf_conntrack_ftp, sends FTP traffic through a network
    namespace, destroys that namespace then unloads the FTP helper module,
    then the kernel will crash.

    Events that lead to the crash:
    1. conntrack is created with ftp helper in netns x
    2. This netns is destroyed
    3. netns destruction is scheduled
    4. netns destruction wq starts, removes netns from global list
    5. ftp helper is unloaded, which resets all helpers of the conntracks
    via for_each_net()

    but because netns is already gone from list the for_each_net() loop
    doesn't include it, therefore all of these conntracks are unaffected.

    6. helper module unload finishes
    7. netns wq invokes destructor for rmmod'ed helper

    CC: "Eric W. Biederman"
    Reported-by: Joe Stringer
    Signed-off-by: Florian Westphal
    Acked-by: David S. Miller
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

29 May, 2017

4 commits

  • We could miss some conntracks when a resize occurs in parallel.

    Avoid this by sampling generation seqcnt and doing a restart if needed.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
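
The sample-and-restart idea can be sketched as follows. This is a simplified user-space model, not the kernel's seqcount API: a plain atomic counter stands in for the generation seqcnt, and iterate_table(), bucket_hook and resize_once() are hypothetical names. The hook exists only so that a caller can simulate a resize happening in the middle of a pass.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Bumped by every (simulated) resize of the table. */
static atomic_uint generation;

/* Hypothetical per-bucket callback. */
typedef void (*bucket_hook)(unsigned int bucket, void *arg);

/* Walk all buckets. If the generation changed while walking, entries may
 * have moved between buckets, so start over. Returns the number of passes. */
static unsigned int iterate_table(unsigned int nbuckets, bucket_hook hook, void *arg)
{
    unsigned int passes = 0;
    unsigned int seq;

    assert(nbuckets > 0);
    do {
        seq = atomic_load_explicit(&generation, memory_order_acquire);
        passes++;
        for (unsigned int b = 0; b < nbuckets; b++)
            if (hook)
                hook(b, arg);
    } while (atomic_load_explicit(&generation, memory_order_acquire) != seq);

    return passes;
}

/* Simulates one resize that lands while bucket 2 is being visited. */
static void resize_once(unsigned int bucket, void *arg)
{
    int *fired = arg;

    if (bucket == 2 && !*fired) {
        *fired = 1;
        atomic_fetch_add(&generation, 1);
    }
}
```

A pass that overlaps the simulated resize is repeated once; an undisturbed pass runs exactly once.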
     
  • sledgehammer to be used on module unload (to remove affected conntracks
    from all namespaces).

    It will also flag all unconfirmed conntracks as dying, i.e. they will
    not be committed to main table.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • nf_ct_iterate_cleanup_net currently calls iter() callback also for
    conntracks on the unconfirmed list, but this is unsafe.

    Accesses to nf_conn are fine, but some users access the extension area
    in the iter() callback, and that only works reliably for confirmed
    conntracks (ct->ext can be reallocated at any time for an unconfirmed
    conntrack).

    The second issue is that there is a short window where a conntrack entry
    is neither on the list nor in the table: To confirm an entry, it is first
    removed from the unconfirmed list, then inserted into the table.

    Fix this by iterating the unconfirmed list first and marking all entries
    as dying, then wait for rcu grace period.

    This makes sure all entries that were about to be confirmed either are
    in the main table, or will be dropped soon.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • There are several places where we needlessly call nf_ct_iterate_cleanup,
    we should instead iterate the full table at module unload time.

    This is a leftover from back when the conntrack table got duplicated
    per net namespace.

    So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net.
    A later patch will then add a non-net variant.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

03 May, 2017

1 commit

  • If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the
    build will fail with:

    net/built-in.o: In function `nf_conntrack_init_start':
    (.text+0x9baf6): undefined reference to `__compiletime_assert_1893'

    or

    ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined!

    Fix this by forcing inlining of total_extension_size().

    Fixes: b3a5db109e0670d6 ("netfilter: conntrack: use u8 for extension sizes again")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Arnd Bergmann
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     

23 Apr, 2017

1 commit


19 Apr, 2017

3 commits

  • If insertion of a new conntrack fails because the table is full, the kernel
    searches the next buckets of the hash slot where the new connection
    was supposed to be inserted for an entry that hasn't seen traffic
    in reply direction (non-assured). If it finds one, that entry is
    dropped and the new connection entry is allocated.

    Allow the conntrack gc worker to also remove *assured* conntracks if
    resources are low.

    Do this by querying the l4 tracker, e.g. tcp connections are now dropped
    if they are no longer established (e.g. in finwait).

    This could be refined further, e.g. by adding 'soft' established timeout
    (i.e., a timeout that is only used once we get close to resource
    exhaustion).

    Cc: Jozsef Kadlecsik
    Signed-off-by: Florian Westphal
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • commit 223b02d923ecd7c84cf9780bb3686f455d279279
    ("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len")
    had to increase size of the extension offsets because total size of the
    extensions had increased to a point where u8 did overflow.

    3 years later we've managed to diet extensions a bit and we no longer
    need u16. Furthermore we can now add a compile-time assertion for this
    problem.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
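
A compile-time check of this kind can be sketched with C11 _Static_assert. The extension structs below are hypothetical placeholders with made-up layouts; only the technique (sum the sizes at compile time and refuse to build if the total no longer fits in a u8) reflects the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extension payloads; the real layouts differ. */
struct ext_helper_sketch { void *helper; uint8_t data[32]; };
struct ext_nat_sketch    { void *masq;   uint32_t seq[6]; };
struct ext_acct_sketch   { uint64_t packets[2]; uint64_t bytes[2]; };

/* Header that stores per-extension offsets and the total length as u8. */
struct ext_hdr_sketch {
    uint8_t offset[3];
    uint8_t len;
};

#define TOTAL_EXT_SIZE (sizeof(struct ext_hdr_sketch) +    \
                        sizeof(struct ext_helper_sketch) + \
                        sizeof(struct ext_nat_sketch) +    \
                        sizeof(struct ext_acct_sketch))

/* Fails the build as soon as the extensions outgrow what a u8 can index. */
_Static_assert(TOTAL_EXT_SIZE <= UINT8_MAX,
               "extension area no longer fits in u8 offsets");

static size_t total_extension_size(void)
{
    return TOTAL_EXT_SIZE;
}
```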
     
  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
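
What "type safety" without an existence guarantee means for a lookup can be sketched like this. It is a user-space model with hypothetical names (obj_sketch, get_unless_zero(), lookup_get()), not kernel code: because the memory may be recycled for another object of the same type, a reader has to take a reference and then re-check the identity of what it found.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object living in a type-safe slab: the memory always holds
 * an obj_sketch, but possibly a different one than the reader looked up. */
struct obj_sketch {
    atomic_int refcnt; /* 0 means the object is being freed */
    int key;
};

/* Take a reference unless the object is already on its way out. */
static bool get_unless_zero(struct obj_sketch *o)
{
    int old = atomic_load(&o->refcnt);

    while (old != 0)
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    return false;
}

static void put_ref(struct obj_sketch *o)
{
    atomic_fetch_sub(&o->refcnt, 1);
}

/* candidate was found by walking a hash chain under the read-side lock. */
static struct obj_sketch *lookup_get(struct obj_sketch *candidate, int key)
{
    assert(candidate != NULL);
    if (!get_unless_zero(candidate))
        return NULL;
    /* The slot may have been freed and reused for another key in the
     * meantime, so the key must be re-validated after taking the ref. */
    if (candidate->key != key) {
        put_ref(candidate);
        return NULL;
    }
    return candidate;
}
```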
     

15 Apr, 2017

1 commit

  • resurrect an old patch from Pablo Neira to remove the untracked objects.

    Currently, there are four possible states of an skb wrt. conntrack.

    1. No conntrack attached, ct is NULL.
    2. Normal (kmem cache allocated) ct attached.
    3. a template (kmalloc'd), not in any hash tables at any point in time
    4. the 'untracked' conntrack, a percpu nf_conn object, tagged via
    IPS_UNTRACKED_BIT in ct->status.

    Untracked is supposed to be identical to case 1. It exists only
    so users can check

    -m conntrack --ctstate UNTRACKED vs.
    -m conntrack --ctstate INVALID

    e.g. attempts to set connmark on INVALID or UNTRACKED conntracks are
    supposed to be no-ops.

    Thus currently we need to check
    ct == NULL || nf_ct_is_untracked(ct)

    in a lot of places in order to avoid altering untracked objects.

    The other consequence of the percpu untracked object is that all
    -j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op
    (inc/dec the untracked conntracks refcount).

    This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to
    make the distinction instead.

    The (few) places that care about packet invalid (ct is NULL) vs.
    packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED,
    but all other places can omit the nf_ct_is_untracked() check.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

07 Apr, 2017

1 commit


24 Mar, 2017

1 commit


14 Mar, 2017

1 commit

  • also mark init_conntrack noinline, in most cases resolve_normal_ct will
    find an existing conntrack entry.

       text    data     bss     dec     hex filename
      16735    5707     176   22618    585a net/netfilter/nf_conntrack_core.o
      16687    5707     176   22570    582a net/netfilter/nf_conntrack_core.o

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

13 Mar, 2017

1 commit

  • Since the nfct and nfctinfo have been combined, the nf_conn structure
    must be at least 8 bytes aligned, as the 3 LSB bits are used for the
    nfctinfo. But there's a fake nf_conn structure to denote untracked
    connections, which is created by a PER_CPU construct. This does not
    guarantee that it will be 8 bytes aligned and can break the logic in
    determining the correct nfctinfo.

    I triggered this on a 32bit machine with the following error:

    BUG: unable to handle kernel NULL pointer dereference at 00000af4
    IP: nf_ct_deliver_cached_events+0x1b/0xfb
    *pdpt = 0000000031962001 *pde = 0000000000000000

    Oops: 0000 [#1] SMP
    [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
    OK ]
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    task: c126ec00 task.stack: c1258000
    EIP: nf_ct_deliver_cached_events+0x1b/0xfb
    EFLAGS: 00010202 CPU: 0
    EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
    ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
    Call Trace:

    ? ipv6_skip_exthdr+0xac/0xcb
    ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
    nf_hook_slow+0x22/0xc7
    nf_hook+0x9a/0xad [ipv6]
    ? ip6t_do_table+0x356/0x379 [ip6_tables]
    ? ip6_fragment+0x9e9/0x9e9 [ipv6]
    ip6_output+0xee/0x107 [ipv6]
    ? ip6_fragment+0x9e9/0x9e9 [ipv6]
    dst_output+0x36/0x4d [ipv6]
    NF_HOOK.constprop.37+0xb2/0xba [ipv6]
    ? icmp6_dst_alloc+0x2c/0xfd [ipv6]
    ? local_bh_enable+0x14/0x14 [ipv6]
    mld_sendpack+0x1c5/0x281 [ipv6]
    ? mark_held_locks+0x40/0x5c
    mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
    call_timer_fn+0x135/0x283
    ? detach_if_pending+0x55/0x55
    ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
    __run_timers+0x111/0x14b
    ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
    run_timer_softirq+0x1c/0x36
    __do_softirq+0x185/0x37c
    ? test_ti_thread_flag.constprop.19+0xd/0xd
    do_softirq_own_stack+0x22/0x28

    irq_exit+0x5a/0xa4
    smp_apic_timer_interrupt+0x2a/0x34
    apic_timer_interrupt+0x37/0x3c

    By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
    alignment, as all cache line sizes are at least 8 bytes.

    Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area")
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Steven Rostedt (VMware)
     

04 Feb, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree, they are:

    1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
    sk_buff so we only access one single cacheline in the conntrack
    hotpath. Patchset from Florian Westphal.

    2) Don't leak pointer to internal structures when exporting x_tables
    ruleset back to userspace, from Willem DeBruijn. This includes new
    helper functions to copy data to userspace such as xt_data_to_user()
    as well as conversions of our ip_tables, ip6_tables and arp_tables
    clients to use it. Not surprisingly, ebtables requires an ad-hoc
    update. There is also a new field in x_tables extensions to indicate
    the amount of bytes that we copy to userspace.

    3) Add nf_log_all_netns sysctl: This new knob allows you to enable
    logging via nf_log infrastructure for all existing netnamespaces.
    Given the effort to provide pernet syslog has been discontinued,
    let's provide a way to restore logging using netfilter kernel logging
    facilities in trusted environments. Patch from Michal Kubecek.

    4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

    5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
    a copy&paste from the original helper, from Florian Westphal.

    6) Reset netfilter state when duplicating packets, also from Florian.

    7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
    nft_meta, from Liping Zhang.

    8) Add missing code to deal with loopback packets from nft_meta when
    used by the netdev family, also from Liping.

    9) Several cleanups on nf_tables, one to remove unnecessary check from
    the netlink control plane path to add table, set and stateful objects
    and code consolidation when unregister chain hooks, from Gao Feng.

    10) Fix harmless reference counter underflow in IPVS that, however,
    results in problems with the introduction of the new refcount_t
    type, from David Windsor.

    11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
    from Davide Caratti.

    12) Missing documentation on nf_tables uapi header, from Liping Zhang.

    13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Feb, 2017

6 commits

  • After this change conntrack operations (lookup, creation, matching from
    ruleset) only access one instead of two sk_buff cache lines.

    This works for normal conntracks because those are allocated from a slab
    that guarantees hw cacheline or 8byte alignment (whatever is larger)
    so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

    Template allocation now does manual address alignment (see previous change)
    on arches that don't have sufficient kmalloc min alignment.

    Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
    this is to avoid undoing the skb_nfct() use when we remove untracked
    conntrack object in the future.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
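
Storing the 3 ctinfo bits in the low bits of an 8-byte-aligned pointer can be sketched as below. The struct and function names are hypothetical user-space stand-ins (the kernel has its own helpers and masks); the sketch only shows why 8-byte alignment is what makes the single-word encoding lossless.

```c
#include <assert.h>
#include <stdint.h>

#define INFOMASK ((uintptr_t)7) /* low 3 bits carry ctinfo */
#define PTRMASK  (~INFOMASK)    /* remaining bits carry the pointer */

/* Stand-in for nf_conn; forced to 8-byte alignment like a slab object. */
struct conn_sketch { _Alignas(8) int id; };

/* Stand-in for the sk_buff field: one word instead of pointer + bits. */
struct skb_sketch { uintptr_t _nfct; };

static void sketch_ct_set(struct skb_sketch *skb, struct conn_sketch *ct,
                          unsigned int ctinfo)
{
    /* A misaligned pointer would have nonzero low bits and be corrupted. */
    assert(((uintptr_t)ct & INFOMASK) == 0);
    assert(ctinfo <= INFOMASK);
    skb->_nfct = (uintptr_t)ct | ctinfo;
}

static struct conn_sketch *sketch_ct_get(const struct skb_sketch *skb,
                                         unsigned int *ctinfo)
{
    *ctinfo = (unsigned int)(skb->_nfct & INFOMASK);
    return (struct conn_sketch *)(skb->_nfct & PTRMASK);
}
```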
     
  • The next change will merge skb->nfct pointer and skb->nfctinfo
    status bits into single skb->_nfct (unsigned long) area.

    For this to work nf_conn addresses must always be aligned at least on
    an 8 byte boundary since we will need the lower 3bits to store nfctinfo.

    Conntrack templates are allocated via kmalloc.
    kbuild test robot reported
    BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
    on v1 of this patchset, so not all platforms meet this requirement.

    Do manual alignment if needed, the alignment offset is stored in the
    nf_conn entry protocol area. This works because templates are not
    handed off to L4 protocol trackers.

    Reported-by: kbuild test robot
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
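
The manual alignment described above can be sketched in user space. Assumptions: struct tmpl_sketch, its padto field, and both helpers are hypothetical; the caller is expected to over-allocate by 7 bytes so that rounding up always stays inside the buffer. The offset is remembered inside the object so that the original allocation address can be recovered when freeing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_MASK ((uintptr_t)7)

/* Hypothetical template object; padto remembers the alignment offset. */
struct tmpl_sketch {
    uint8_t padto;
    uint64_t payload;
};

/* raw must point to at least sizeof(struct tmpl_sketch) + 7 bytes. */
static struct tmpl_sketch *tmpl_place(void *raw)
{
    uintptr_t aligned = ((uintptr_t)raw + ALIGN_MASK) & ~ALIGN_MASK;
    struct tmpl_sketch *t = (struct tmpl_sketch *)aligned;

    assert(aligned - (uintptr_t)raw <= ALIGN_MASK);
    t->padto = (uint8_t)(aligned - (uintptr_t)raw);
    return t;
}

/* Recover the address that must be handed back to the allocator. */
static void *tmpl_raw(struct tmpl_sketch *t)
{
    return (char *)t - t->padto;
}
```

Because malloc() usually returns memory that is already well aligned, the effect is easiest to observe by deliberately starting at an odd offset inside a larger buffer.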
     
  • Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
    This avoids changing code in followup patch that merges skb->nfct and
    skb->nfctinfo into skb->_nfct.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Followup patch renames skb->nfct and changes its type so add a helper to
    avoid intrusive rename change later.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Next patch makes direct skb->nfct access illegal, reduce noise
    in next patch by using accessors we already have.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • It is never accessed for reading and the only places that write to it
    are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

    The conntrack core specifically checks for attached skb->nfct after
    ->error() invocation and returns early in this case.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

19 Jan, 2017

2 commits

  • This further refines the changes made to conntrack gc_worker in
    commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics").

    The main idea of that change was to reduce the scan interval when evictions
    take place.

    However, on the reporters' setup, there are 1-2 million conntrack entries
    in total and roughly 8k new (and closing) connections per second.

    In this case we'll always evict at least one entry per gc cycle and scan
    interval is always at 1 jiffy because of this test:

        } else if (expired_count) {
                gc_work->next_gc_run /= 2U;
                next_run = msecs_to_jiffies(1);

    being true almost all the time.

    Given we scan ~10k entries per run, it's clearly wrong to reduce the interval
    based on a nonzero eviction count; it will only waste cpu cycles since the vast
    majority of conntracks are not timed out.

    Thus only look at the ratio (scanned entries vs. evicted entries) to make
    a decision on whether to reduce or not.

    Because evictor is supposed to only kick in when system turns idle after
    a busy period, pick a high ratio -- this makes it 50%. We thus keep
    the idea of increasing scan rate when it's likely that the table contains many
    expired entries.

    In order to not let timed-out entries hang around for too long
    (important when using event logging, in which case we want to timely
    destroy events), we now scan the full table within at most
    GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all
    timed-out entries sit in same slot.

    I tested this with a vm under synflood (with
    sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

    While flood is ongoing, interval now stays at its max rate
    (GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

    With feedback from Nicolas Dichtel.

    Reported-by: Denys Fedoryshchenko
    Cc: Nicolas Dichtel
    Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
    Signed-off-by: Florian Westphal
    Tested-by: Nicolas Dichtel
    Acked-by: Nicolas Dichtel
    Tested-by: Denys Fedoryshchenko
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
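
The ratio-based decision can be sketched as a pure function. This is an illustration in milliseconds rather than jiffies, and it rests on assumptions: the 50% ratio and the 16 second bound come from the text above, the divisor 128 is inferred from 16 s / 125 ms, and GC_MIN_INTERVAL_MS as well as the function name are hypothetical.

```c
#include <assert.h>

#define GC_MAX_SCAN_MS     16000u /* full table scanned within 16 seconds */
#define GC_MAX_BUCKETS_DIV 128u   /* inferred: 16000 ms / 128 = 125 ms */
#define GC_EVICT_RATIO     50u    /* percent of scanned entries evicted */
#define GC_MIN_INTERVAL_MS 1u     /* hypothetical lower bound */

/* Returns the delay before the next gc run, given the current delay and
 * the outcome of the run that just finished. */
static unsigned int next_gc_interval_ms(unsigned int cur_ms,
                                        unsigned int scanned,
                                        unsigned int expired)
{
    const unsigned int max_ms = GC_MAX_SCAN_MS / GC_MAX_BUCKETS_DIV;
    unsigned int ratio;

    assert(expired <= scanned);
    ratio = scanned ? (unsigned int)((unsigned long long)expired * 100 / scanned) : 0;

    /* Many expired entries: the table is probably full of stale
     * connections, so scan again as soon as possible. */
    if (ratio > GC_EVICT_RATIO)
        return GC_MIN_INTERVAL_MS;

    /* Otherwise back off gradually, up to the maximum interval. */
    cur_ms += GC_MIN_INTERVAL_MS;
    if (cur_ms > max_ms)
        cur_ms = max_ms;
    return cur_ms;
}
```

A single eviction among 10000 scanned entries no longer shortens the interval; only a run in which more than half of the scanned entries had expired does.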
     
  • Instead of breaking loop and instant resched, don't bother checking
    this in first place (the loop calls cond_resched for every bucket anyway).

    Suggested-by: Nicolas Dichtel
    Signed-off-by: Florian Westphal
    Acked-by: Nicolas Dichtel
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64 bit machines and in an endianness optimized timespec
    variant for 32bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    became completely pointless.

    Get rid of the union and just keep ktime_t as simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

15 Nov, 2016

1 commit


10 Nov, 2016

1 commit

  • gcc correctly identified a theoretical uninitialized variable use:

    net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
    net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This could only happen when we 'goto out' before looking up l4proto,
    and then enter the retry, implying that l3proto->get_l4proto()
    returned NF_REPEAT. This does not currently get returned in any
    code path and probably won't ever happen, but is not good to
    rely on.

    Moving the repeat handling up a little should have the same
    behavior as today but avoids the warning by making that case
    impossible to enter.

    [ I have mangled this original patch to remove the check for tmpl, we
    should unconditionally jump back to the repeat label in case we hit
    NF_REPEAT instead. I have also moved the comment that explains this
    where it belongs. --pablo ]

    Fixes: 08733a0cb7de ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann