29 Aug, 2020

1 commit

  • Frontend callback reports EAGAIN to nfnetlink to retry a command, this
    is used to signal that module autoloading is required. Unfortunately,
    nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
    full, so it enters a busy-loop.

    This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
    to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
    since this is always MSG_DONTWAIT in the existing code which is exactly
    what nlmsg_unicast() passes to netlink_unicast() as parameter.

    Fixes: 96518518cc41 ("netfilter: add nftables")
    Reported-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

29 Mar, 2020

1 commit


15 Jan, 2020

1 commit


13 Aug, 2019

1 commit


22 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 Jun, 2019

1 commit


28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

22 Apr, 2019

1 commit

  • setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
    scheduler. skb->tstamp might be "refreshed" using ktime_get_real(),
    but fq expects CLOCK_MONOTONIC.

    This patch removes all places in netfilter that check/set skb->tstamp:

    1. To fix the bogus "start" time seen with conntrack timestamping for
    outgoing packets, never use skb->tstamp and always use current time.
    2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
    as determined by current hook (prerouting, input, forward).
    3. xt_time has to use system clock as well rather than skb->tstamp.
    We could still use skb->tstamp for prerouting/input/foward, but
    I see no advantage to make this conditional.

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Cc: Eric Dumazet
    Reported-by: Michal Soltys
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

20 Dec, 2018

1 commit


09 Nov, 2018

1 commit


13 Sep, 2018

1 commit


11 Sep, 2018

2 commits


18 Jun, 2018

1 commit

  • Three attributes are currently not verified, thus can trigger KMSAN
    warnings such as :

    BUG: KMSAN: uninit-value in __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
    BUG: KMSAN: uninit-value in __fswab32 include/uapi/linux/swab.h:59 [inline]
    BUG: KMSAN: uninit-value in nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
    CPU: 1 PID: 4521 Comm: syz-executor120 Not tainted 4.17.0+ #5
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
    __msan_warning_32+0x70/0xc0 mm/kmsan/kmsan_instr.c:620
    __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
    __fswab32 include/uapi/linux/swab.h:59 [inline]
    nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
    nfnetlink_rcv_msg+0xb2e/0xc80 net/netfilter/nfnetlink.c:212
    netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
    nfnetlink_rcv+0x2fe/0x680 net/netfilter/nfnetlink.c:513
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
    sock_sendmsg_nosec net/socket.c:629 [inline]
    sock_sendmsg net/socket.c:639 [inline]
    ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
    __sys_sendmsg net/socket.c:2155 [inline]
    __do_sys_sendmsg net/socket.c:2164 [inline]
    __se_sys_sendmsg net/socket.c:2162 [inline]
    __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
    do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x43fd59
    RSP: 002b:00007ffde0e30d28 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd59
    RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
    R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401680
    R13: 0000000000401710 R14: 0000000000000000 R15: 0000000000000000

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
    kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
    slab_post_alloc_hook mm/slab.h:446 [inline]
    slab_alloc_node mm/slub.c:2753 [inline]
    __kmalloc_node_track_caller+0xb35/0x11b0 mm/slub.c:4395
    __kmalloc_reserve net/core/skbuff.c:138 [inline]
    __alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
    alloc_skb include/linux/skbuff.h:988 [inline]
    netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
    netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
    sock_sendmsg_nosec net/socket.c:629 [inline]
    sock_sendmsg net/socket.c:639 [inline]
    ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
    __sys_sendmsg net/socket.c:2155 [inline]
    __do_sys_sendmsg net/socket.c:2164 [inline]
    __se_sys_sendmsg net/socket.c:2162 [inline]
    __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
    do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: fdb694a01f1f ("netfilter: Add fail-open support")
    Fixes: 829e17a1a602 ("[NETFILTER]: nfnetlink_queue: allow changing queue length through netlink")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: Pablo Neira Ayuso

    Eric Dumazet
     

07 Jun, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Add Maglev hashing scheduler to IPVS, from Inju Song.

    2) Lots of new TC subsystem tests from Roman Mashak.

    3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

    4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

    5) Add ttl inherit support to vxlan, from Hangbin Liu.

    6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

    7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

    8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

    9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

    10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

    11) Plumb extack down into fib_rules, from Roopa Prabhu.

    12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

    13) Add UDP GSO support, from Willem de Bruijn.

    14) Add documentation for eBPF helpers, from Quentin Monnet.

    15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

    16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

    17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

    18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

    19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

    20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

    21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

    22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

    23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

    24) Add TCP SACK compression, from Eric Dumazet.

    25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

    26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

    27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

    28) Add lots of forwarding selftests, from Petr Machata.

    29) Add generic network device failover driver, from Sridhar Samudrala.

    * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
    strparser: Add __strp_unpause and use it in ktls.
    rxrpc: Fix terminal retransmission connection ID to include the channel
    net: hns3: Optimize PF CMDQ interrupt switching process
    net: hns3: Fix for VF mailbox receiving unknown message
    net: hns3: Fix for VF mailbox cannot receiving PF response
    bnx2x: use the right constant
    Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
    net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
    enic: fix UDP rss bits
    netdev-FAQ: clarify DaveM's position for stable backports
    rtnetlink: validate attributes in do_setlink()
    mlxsw: Add extack messages for port_{un, }split failures
    netdevsim: Add extack error message for devlink reload
    devlink: Add extack to reload and port_{un, }split operations
    net: metrics: add proper netlink validation
    ipmr: fix error path when ipmr_new_table fails
    ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
    net: hns3: remove unused hclgevf_cfg_func_mta_filter
    netfilter: provide udp*_lib_lookup for nf_tproxy
    qed*: Utilize FW 8.37.2.0
    ...

    Linus Torvalds
     

23 May, 2018

1 commit

  • In nfqueue, two consecutive skbuffs may race to create the conntrack
    entry. Hence, the one that loses the race gets dropped due to clash in
    the insertion into the hashes from the nf_conntrack_confirm() path.

    This patch adds a new nf_conntrack_update() function which searches for
    possible clashes and resolve them. NAT mangling for the packet losing
    race is corrected by using the conntrack information that won race.

    In order to avoid direct module dependencies with conntrack and NAT, the
    nf_ct_hook and nf_nat_hook structures are used for this purpose.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

16 May, 2018

1 commit

  • Variants of proc_create{,_data} that directly take a struct seq_operations
    and deal with network namespaces in ->open and ->release. All callers of
    proc_create + seq_open_net converted over, and seq_{open,release}_net are
    removed entirely.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

20 Mar, 2018

1 commit

  • Using pr_() is more concise than printk(KERN_).
    This patch:
    * Replace printks having a log level with the appropriate
    pr_*() macros.
    * Define pr_fmt() to include relevant name.
    * Remove redundant prefixes from pr_*() calls.
    * Indent the code where possible.
    * Remove the useless output messages.
    * Remove periods from messages.

    Signed-off-by: Arushi Singhal
    Signed-off-by: Pablo Neira Ayuso

    Arushi Singhal
     

19 Jan, 2018

1 commit

  • /proc has been ignoring struct file_operations::owner field for 10 years.
    Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba
    ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
    inode->i_fop is initialized with proxy struct file_operations for
    regular files:

    - if (de->proc_fops)
    - inode->i_fop = de->proc_fops;
    + if (de->proc_fops) {
    + if (S_ISREG(inode->i_mode))
    + inode->i_fop = &proc_reg_file_ops;
    + else
    + inode->i_fop = de->proc_fops;
    + }

    VFS stopped pinning module at this point.

    # ipvs
    Acked-by: Julian Anastasov
    Signed-off-by: Alexey Dobriyan
    Acked-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    Alexey Dobriyan
     

09 Jan, 2018

1 commit


20 Nov, 2017

1 commit


25 Oct, 2017

1 commit

  • For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't currently harmful.

    However, for some features it is necessary to instrument reads and
    writes separately, which is not possible with ACCESS_ONCE(). This
    distinction is critical to correct operation.

    It's possible to transform the bulk of kernel code using the Coccinelle
    script below. However, this doesn't handle comments, leaving references
    to ACCESS_ONCE() instances which have been removed. As a preparatory
    step, this patch converts netlink and netfilter code and comments to use
    {READ,WRITE}_ONCE() consistently.

    ----
    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland
    Signed-off-by: Paul E. McKenney
    Cc: David S. Miller
    Cc: Florian Westphal
    Cc: Jozsef Kadlecsik
    Cc: Linus Torvalds
    Cc: Pablo Neira Ayuso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

01 Aug, 2017

1 commit

  • When skb is queued to userspace it leaves softirq/rcu protection.
    skb->nfct (via conntrack extensions such as helper) could then reference
    modules that no longer exist if the conntrack was not yet confirmed.

    nf_ct_iterate_destroy() will set the DYING bit for unconfirmed
    conntracks, we therefore solve this race as follows:

    1. take the queue spinlock.
    2. check if the conntrack is unconfirmed and has dying bit set.
    In this case, we must discard skb while we're still inside
    rcu read-side section.
    3. If nf_ct_iterate_destroy() is called right after the packet is queued
    to userspace, it will be removed from the queue via
    nf_ct_iterate_destroy -> nf_queue_nf_hook_drop.

    When userspace sends the verdict (nfnetlink takes rcu read lock), there
    are two cases to consider:

    1. nf_ct_iterate_destroy() was called while packet was out.
    In this case, skb will have been removed from the queue already
    and no reinject takes place as we won't find a matching entry for the
    packet id.

    2. nf_ct_iterate_destroy() gets called right after verdict callback
    found and removed the skb from queue list.

    In this case, skb->nfct is marked as dying but it is still valid.
    The skb will be dropped either in nf_conntrack_confirm (we don't
    insert DYING conntracks into hash table) or when we try to queue
    the skb again, but either events don't occur before the rcu read lock
    is dropped.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

24 Jul, 2017

1 commit

  • This patch removes duplicate rcu_read_lock().

    1. IPVS part:

    According to Julian Anastasov's mention, contexts of ipvs are described
    at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary:

    - packet RX/TX: does not need locks because packets come from hooks.
    - sync msg RX: backup server uses RCU locks while registering new
    connections.
    - ip_vs_ctl.c: configuration get/set, RCU locks needed.
    - xt_ipvs.c: It is a netfilter match, running from hook context.

    As result, rcu_read_lock and rcu_read_unlock can be removed from:

    - ip_vs_core.c: all
    - ip_vs_ctl.c:
    - only from ip_vs_has_real_service
    - ip_vs_ftp.c: all
    - ip_vs_proto_sctp.c: all
    - ip_vs_proto_tcp.c: all
    - ip_vs_proto_udp.c: all
    - ip_vs_xmit.c: all (contains only packet processing)

    2. Netfilter part:

    There are three types of functions that are guaranteed the rcu_read_lock().
    First, as result, functions are only called by nf_hook():

    - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
    - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
    - match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
    - xt_osf_match_packet().

    Second, functions that caller already held the rcu_read_lock().
    - destroy_conntrack(), ctnetlink_conntrack_event().
    - ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

    Third, functions that are mixed with type1 and type2.

    These functions are called by nf_hook() also these are called by
    ordinary functions that already held the rcu_read_lock():

    - __ctnetlink_glue_build(), ctnetlink_expect_event().
    - ctnetlink_proto_size().

    Applied files are below:

    - nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
    - nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
    - nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
    - xt_connlimit.c, xt_hashlimit.c, xt_osf.c

    Detailed calltrace can be found at:
    http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

    Signed-off-by: Taehee Yoo
    Acked-by: Julian Anastasov
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

30 Jun, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This batch contains connection tracking updates for the cleanup
    iteration path, patches from Florian Westphal:

    X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
    dying bit to let the CPU release them.

    X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
    conntrack from all namespace.

    X) Restart iteration on hashtable resizing, since both may occur at
    the same time.

    X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
    mapping on module removal.

    X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
    module removal, from Liping Zhang.

    X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
    if user requests this, also from Liping.

    X) Add net_ns_barrier() and use it from FTP helper, so make sure
    no concurrent namespace removal happens at the same time while
    the helper module is being removed.

    X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
    module size. Same thing in nf_tables.

    Updates for the nf_tables infrastructure:

    X) Prepare usage of the extended ACK reporting infrastructure for
    nf_tables.

    X) Remove unnecessary forward declaration in nf_tables hash set.

    X) Skip set size estimation if number of element is not specified.

    X) Changes to accomodate a (faster) unresizable hash set implementation,
    for anonymous sets and dynamic size fixed sets with no timeouts.

    X) Faster lookup function for unresizable hash table for 2 and 4
    bytes key.

    And, finally, a bunch of asorted small updates and cleanups:

    X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
    to device events and look up for index from the packet path, this
    is fixing an issue that is present since the very beginning, patch
    from Xin Long.

    X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

    X) Use ebt_invalid_target() whenever possible in the ebtables tree,
    from Gao Feng.

    X) Calm down compilation warning in nf_dup infrastructure, patch from
    stephen hemminger.

    X) Statify functions in nftables rt expression, also from stephen.

    X) Update Makefile to use canonical method to specify nf_tables-objs.
    From Jike Song.

    X) Use nf_conntrack_helpers_register() in amanda and H323.

    X) Space cleanup for ctnetlink, from linzhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jun, 2017

1 commit

  • Pass down struct netlink_ext_ack as parameter to all of our nfnetlink
    subsystem callbacks, so we can work on follow up patches to provide
    finer grain error reporting using the new infrastructure that
    2d4bc93368f5 ("netlink: extended ACK reporting") provides.

    No functional change, just pass down this new object to callbacks.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

16 Jun, 2017

1 commit

  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions (skb_put, __skb_put and pskb_put) return void *
    and remove all the casts across the tree, adding a (u8 *) cast only
    where the unsigned char pointer was used directly, all done with the
    following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    which actually doesn't cover pskb_put since there are only three
    users overall.

    A handful of stragglers were converted manually, notably a macro in
    drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
    instances in net/bluetooth/hci_sock.c. In the former file, I also
    had to fix one whitespace problem spatch introduced.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

01 May, 2017

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. A large bunch of code cleanups, simplify the conntrack extension
    codebase, get rid of the fake conntrack object, speed up netns by
    selective synchronize_net() calls. More specifically, they are:

    1) Check for ct->status bit instead of using nfct_nat() from IPVS and
    Netfilter codebase, patch from Florian Westphal.

    2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

    3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

    4) Introduce nft_is_base_chain() helper function.

    5) Enforce expectation limit from userspace conntrack helper,
    from Gao Feng.

    6) Add nf_ct_remove_expect() helper function, from Gao Feng.

    7) NAT mangle helper function return boolean, from Gao Feng.

    8) ctnetlink_alloc_expect() should only work for conntrack with
    helpers, from Gao Feng.

    9) Add nfnl_msg_type() helper function to nfnetlink to build the
    netlink message type.

    10) Get rid of unnecessary cast on void, from simran singhal.

    11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

    12) Use list_prev_entry() from nf_tables, from simran signhal.

    13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

    14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

    15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

    16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks are already
    guaranteed to run under RCU read side.

    17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

    18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

    19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.

    20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

    21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

    22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.

    23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

    24) Add event mask support to nft_ct, also from Florian.

    25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

    26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

    27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

    28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

    29) Shrink size of nf_conntrack_ecache structure, from Florian.

    30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

    31) Register SYNPROXY hooks on demand, from Florian Westphal.

    32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

    33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

    34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

    35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime time calculation of this. From Florian.

    36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

    37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

    38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

    39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

    40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

    41) Silence stack size warning gcc in 32-bit arch in snmp helper,
    from Florian.

    42) Inconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • nf_unregister_net_hook(s) can avoid a second call to synchronize_net,
    provided there is no nfqueue active in that net namespace (which is
    the common case).

    This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally
    this gets called during netns cleanup so no packets should be queued.

    For the rare case of base chain being unregistered or module removal
    while nfqueue is in use the extra hiccup due to the packet drops isn't
    a big deal.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

14 Apr, 2017

1 commit


08 Apr, 2017

1 commit


07 Apr, 2017

1 commit


29 Mar, 2017

1 commit


26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
    variant for 32bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    become completely pointless.

    Get rid of the union and just keep ktime_t as simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

18 Nov, 2016

1 commit

  • Make struct pernet_operations::id unsigned.

    There are 2 reasons to do so:

    1)
    This field is really an index into an zero based array and
    thus is unsigned entity. Using negative value is out-of-bound
    access by definition.

    2)
    On x86_64 unsigned 32-bit data which are mixed with pointers
    via array indexing or offsets added or subtracted to pointers
    are preffered to signed 32-bit data.

    "int" being used as an array index needs to be sign-extended
    to 64-bit before being used.

    void f(long *p, int i)
    {
    g(p[i]);
    }

    roughly translates to

    movsx rsi, esi
    mov rdi, [rsi+...]
    call g

    MOVSX is 3 byte instruction which isn't necessary if the variable is
    unsigned because x86_64 is zero extending by default.

    Now, there is net_generic() function which, you guessed it right, uses
    "int" as an array index:

    static inline void *net_generic(const struct net *net, int id)
    {
    ...
    ptr = ng->ptr[id - 1];
    ...
    }

    And this function is used a lot, so those sign extensions add up.

    Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
    messing with code generation):

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.
    This is a semmingly random artefact of code generation with register
    allocator being used differently. gcc decides that some variable
    needs to live in new r8+ registers and every access now requires REX
    prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
    used which is longer than [r8]

    However, overall balance is in negative direction:

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
    function old new delta
    nfsd4_lock 3886 3959 +73
    tipc_link_build_proto_msg 1096 1140 +44
    mac80211_hwsim_new_radio 2776 2808 +32
    tipc_mon_rcv 1032 1058 +26
    svcauth_gss_legacy_init 1413 1429 +16
    tipc_bcbase_select_primary 379 392 +13
    nfsd4_exchange_id 1247 1260 +13
    nfsd4_setclientid_confirm 782 793 +11
    ...
    put_client_renew_locked 494 480 -14
    ip_set_sockfn_get 730 716 -14
    geneve_sock_add 829 813 -16
    nfsd4_sequence_done 721 703 -18
    nlmclnt_lookup_host 708 686 -22
    nfsd4_lockt 1085 1063 -22
    nfs_get_client 1077 1050 -27
    tcf_bpf_init 1106 1076 -30
    nfsd4_encode_fattr 5997 5930 -67
    Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

03 Nov, 2016

1 commit


02 Nov, 2016

1 commit