17 May, 2016

1 commit

  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we have had a generic and configurable constant blinding facility on
    our todo list for quite some time to make spraying harder still; this
    is its first implementation, in the works since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code and the JIT is enabled, start offset randomization helps a bit
    to make jumps into a crafted payload harder, but when larger
    programs that cross a page boundary are injected, we again have
    some of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, the chances of
    jumping into such a payload increase. Elena Reshetova recently
    wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where the JIT should be enabled for performance
    reasons, the generated image can be further hardened by blinding
    constants for unprivileged users (bpf_jit_harden == 1), trading off
    some performance for these users, but not for privileged ones. We
    also added the option of blinding for all users (bpf_jit_harden ==
    2), which is quite helpful for testing, f.e. with test_bpf.ko. No
    further hardening levels of the bpf_jit_harden switch are intended;
    the rationale is to keep it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.
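
    As a sketch of how that on/off policy could look in C (the helper
    name and the exact capability check are assumptions on our side, not
    spelled out in this log):

    static bool bpf_jit_blinding_enabled(void)
    {
            if (!bpf_jit_enable)
                    return false;   /* no JIT, nothing to blind */
            if (!bpf_jit_harden)
                    return false;   /* 0: blinding disabled */
            if (bpf_jit_harden == 1 && capable(CAP_SYS_ADMIN))
                    return false;   /* 1: privileged users unblinded */
            return true;            /* 2: blind everyone */
    }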

    This option is for eBPF JITs and will be used in the x86, arm64 and
    s390 JITs without too much effort, and soon in the ppc64 JIT as well,
    so that native eBPF can be blinded just like cBPF to eBPF migrations,
    covering both with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after blinding fails, we fall
    back normally to the interpreter with the non-blinded code. In other
    words, the interpreter doesn't change in any way and operates on eBPF
    code as usual. For this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly
    internal to the JIT and in no way part of the eBPF architecture: just
    as JITs internally make use of some helper registers when emitting
    code, the helper register here sits one abstraction level higher, in
    eBPF bytecode, but still belongs to the JIT phase. Such a helper
    register is needed since, f.e., a manually written program can issue
    loads into all registers of the eBPF architecture.
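
    A minimal sketch of that contract from a JIT's point of view,
    assuming only the two helpers named above (the surrounding structure
    is illustrative, not a verbatim JIT):

    struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
    {
            struct bpf_prog *orig_prog = prog, *tmp;
            bool blinded = false;

            tmp = bpf_jit_blind_constants(prog);
            if (IS_ERR(tmp))        /* blinding failed: keep interpreter */
                    return orig_prog;
            if (tmp != prog) {      /* got a rewritten clone to JIT */
                    prog = tmp;
                    blinded = true;
            }

            /* ... emit native code for prog; if emission fails, return
             * orig_prog so the interpreter runs the non-blinded code ...
             */

            if (blinded)
                    bpf_jit_prog_release_other(prog, prog == orig_prog ?
                                               tmp : orig_prog);
            return prog;
    }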

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <OP> BPF_REG_AX, so the actual operation on the target
    register is translated from a BPF_K into a BPF_X one operating on
    BPF_REG_AX's content. During the rewriting phase when blinding, RND
    is newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to keep translation and
    patching not too complex. The only things required of JITs are to
    call the bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    helper pair and to map BPF_REG_AX onto an unused register.
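
    Illustrated with the kernel's insn macros (the opcode choice here
    and the names K_VAL/RND are ours, not from the patch), a single
    BPF_ADD with an immediate could be rewritten as:

    /* before: */
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, K_VAL)

    /* after, with RND freshly drawn via prandom_u32(): */
    BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, RND ^ K_VAL)
    BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, RND)
    BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_AX)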

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + <x>:
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + <x>:
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, the original constants carrying the payload are
    hidden when blinding is enabled, and the actual operations are
    transformed from constant-based into register-based ones, making
    jumps into constants ineffective. The above extract uses a single
    BPF load instruction over and over, but of course all instructions
    with constants are blinded.

    Performance-wise, JIT with blinding performs a bit slower than JIT
    alone and faster than the interpreter. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. the nmap test case from the suite has averaged timings of 29 ns
    (JIT), 35 ns (+ blinding), and 151 ns (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted the interpreter
    and redirected the blinded eBPF image to the interpreter; here too,
    all tests pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2016

1 commit

  • Devices may have limits on the number of fragments in an skb they
    support. The current codebase uses a constant as the maximum number
    of fragments one skb can hold and use.
    When enabling scatter/gather and running traffic with many small
    messages, the codebase uses the maximum number of fragments and may
    thereby violate the max for certain devices.
    The patch introduces a global variable for the max number of
    fragments.
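
    A sketch of the shape of this change (in the tree the constant is
    MAX_SKB_FRAGS; the helper below is our illustration, not part of
    the patch):

    int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;

    /* can this skb take one more fragment under the tunable cap? */
    static bool skb_can_add_frag(const struct sk_buff *skb)
    {
            return skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags;
    }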

    Signed-off-by: Hans Westgaard Ry
    Reviewed-by: Håkon Bugge
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hans Westgaard Ry
     

21 Mar, 2015

2 commits

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. They
    split functions in half and moved them around, when the actual
    conflict existed solely inside one function, namely
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easy to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
    listen() backlog was limited to 65535.

    It is time to increase the width to allow a much bigger backlog,
    if admins change the /proc/sys/net/core/somaxconn and
    /proc/sys/net/ipv4/tcp_max_syn_backlog default values.
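
    The shape of the change, sketched as a diff (surrounding struct
    sock fields omitted):

    -       unsigned short          sk_ack_backlog;
    -       unsigned short          sk_max_ack_backlog;
    +       u32                     sk_ack_backlog;
    +       u32                     sk_max_ack_backlog;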

    Tested:

    echo 5000000 >/proc/sys/net/core/somaxconn
    echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog

    Ran a SYNFLOOD test against a listener using listen(fd, 5000000)

    myhost~# grep request_sock_TCP /proc/slabinfo
    request_sock_TCP 4185642 4411940 304 13 1 : tunables 54 27 8 : slabdata 339380 339380 0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Mar, 2015

1 commit

  • sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be
    set to incorrect values. Given that 'struct sk_buff' allocates from
    rcvbuf, an incorrectly set buffer length could result in memory
    allocation failures. For example, set them as follows:

    # sysctl net.core.rmem_default=64
    net.core.rmem_default = 64
    # sysctl net.core.wmem_default=64
    net.core.wmem_default = 64
    # ping localhost -s 1024 -i 0 > /dev/null

    This could result in the following failure:

    skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32
    head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:
    kernel BUG at net/core/skbuff.c:102!
    invalid opcode: 0000 [#1] SMP
    ...
    task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000
    RIP: 0010:[] [] skb_put+0xa1/0xb0
    RSP: 0018:ffff88003ae8bc68 EFLAGS: 00010296
    RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000
    RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8
    RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900
    FS: 00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0
    Stack:
    ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d
    ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550
    ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00
    Call Trace:
    [] unix_stream_sendmsg+0x2c4/0x470
    [] sock_write_iter+0x146/0x160
    [] new_sync_write+0x92/0xd0
    [] vfs_write+0xd6/0x180
    [] SyS_write+0x59/0xd0
    [] system_call_fastpath+0x12/0x17
    Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00
    00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 c0 e8 4f a8 0e 00 <0f> 0b
    eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
    RIP [] skb_put+0xa1/0xb0
    RSP
    Kernel panic - not syncing: Fatal exception

    Moreover, the possible minimum is 1, so we can get another kernel panic:
    ...
    BUG: unable to handle kernel paging request at ffff88013caee5c0
    IP: [] __alloc_skb+0x12f/0x1f0
    ...
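
    A sketch of the implied fix: clamp these sysctls with
    proc_dointvec_minmax and a sane floor (SOCK_MIN_RCVBUF exists in
    the tree; wiring it up as .extra1 here is our assumption):

    static int min_rcvbuf = SOCK_MIN_RCVBUF;

    static struct ctl_table net_core_table[] = {
            {
                    .procname       = "rmem_default",
                    .data           = &sysctl_rmem_default,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = &min_rcvbuf,  /* reject tiny values */
            },
            /* ... likewise for wmem_default, rmem_max, wmem_max ... */
    };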

    Signed-off-by: Alexey Kodanev
    Signed-off-by: David S. Miller

    Alexey Kodanev
     

14 Feb, 2015

1 commit

  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.
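
    For example (a minimal sketch; pr_info() is printk at KERN_INFO):

    static void show_cpus(const struct cpumask *mask)
    {
            pr_info("cpus (bitmap): %*pb\n", cpumask_pr_args(mask));
            pr_info("cpus (list):   %*pbl\n", cpumask_pr_args(mask));
    }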

    Signed-off-by: Tejun Heo
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

09 Feb, 2015

2 commits

  • Make sure root user does not try something stupid.

    Also make sure mask field in struct rps_sock_flow_table
    does not share a cache line with the potentially often dirtied
    flow table.

    Signed-off-by: Eric Dumazet
    Fixes: 567e4b79731c ("net: rfs: add hash collision detection")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Receive Flow Steering is a nice solution but suffers from hash
    collisions when a mix of connected and unconnected traffic is
    received on the host and the flow hash table is populated.

    Also, clearing the flow in inet_release() makes RFS not very good
    for short-lived flows, as many packets can follow close()
    (FIN, ACK packets, ...).

    This patch extends the information stored into global hash table
    to not only include cpu number, but upper part of the hash value.

    I use a 32bit value, and dynamically split it in two parts.

    For hosts with less than 64 possible cpus, this gives 6 bits for
    the cpu number and 26 (32-6) bits for the upper part of the hash.

    Since hash bucket selection uses the low-order bits of the hash, we
    have a full hash match if /proc/sys/net/core/rps_sock_flow_entries
    is big enough.

    If the hash found in flow table does not match, we fallback to RPS (if
    it is enabled for the rxqueue).

    This means that a packet for a non-connected flow can avoid the
    IPI through an unrelated/victim CPU.

    This also means we no longer have to clear the table at socket
    close time, and this helps short lived flows performance.
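
    A sketch of that encoding (rps_cpu_mask covers the low cpu bits,
    e.g. 0x3f for up to 64 cpus; variable names are assumed from the
    description):

    /* store: upper hash bits plus the current cpu in one 32-bit entry */
    table->ents[hash & table->mask] =
            (hash & ~rps_cpu_mask) | raw_smp_processor_id();

    /* lookup: a full hash match requires the upper bits to agree */
    ident = table->ents[hash & table->mask];
    if ((ident ^ hash) & ~rps_cpu_mask)
            goto try_rps;           /* collision: fall back to plain RPS */
    next_cpu = ident & rps_cpu_mask;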

    Signed-off-by: Eric Dumazet
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Feb, 2015

1 commit

  • Tx timestamps are looped onto the error queue on top of an skb. This
    mechanism leaks packet headers to processes unless the no-payload
    option SOF_TIMESTAMPING_OPT_TSONLY is set.

    Add a sysctl that optionally drops looped timestamps with data. This
    only affects processes without CAP_NET_RAW.

    The policy is checked when timestamps are generated in the stack.
    It is possible for timestamps with data to be reported after the
    sysctl is set, if these were queued internally earlier.

    No vulnerability is immediately known that exploits knowledge
    gleaned from packet headers, but it may still be preferable to allow
    administrators to lock down this path at the cost of possible
    breakage of legacy applications.

    Signed-off-by: Willem de Bruijn

    ----

    Changes
    (v1 -> v2)
    - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
    (rfc -> v1)
    - document the sysctl in Documentation/sysctl/net.txt
    - fix access control race: read .._OPT_TSONLY only once,
    use same value for permission check and skb generation.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

17 Nov, 2014

1 commit

  • RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40
    or 52 byte RSS key.

    Some drivers use a constant (and well-known) key, some drivers use
    a random key per port, making bonding setups hard to tune. Well-known
    keys increase the attack surface, considering that the number of
    queues is usually a power of two.

    This patch provides infrastructure to help drivers doing the right thing.

    netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
    even if they provide ethtool -X support to let user redefine the key later.

    A new /proc/sys/net/core/netdev_rss_key file can be used to get the
    host RSS key even for drivers not providing ethtool -x support, in
    case some applications want to set up flows precisely to match some
    RX queues.
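
    Driver-side usage is a one-liner; a sketch (the 40-byte size is
    just the common Toeplitz key length):

    u8 rss_key[40];

    netdev_rss_key_fill(rss_key, sizeof(rss_key));
    /* ... then program rss_key into the NIC's RSS key registers ... */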

    Tested:

    myhost:~# cat /proc/sys/net/core/netdev_rss_key
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

    myhost:~# ethtool -x eth0
    RX flow hash indirection table for eth0 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7
    RSS hash key:
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Nov, 2014

1 commit

  • Use the more common dynamic_debug capable net_dbg_ratelimited
    and remove the LIMIT_NETDEBUG macro.

    All messages are still ratelimited.
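
    A before/after sketch of the conversion (the message text is made
    up):

    /* before: */
    LIMIT_NETDEBUG(KERN_DEBUG "UDP: short packet\n");

    /* after: ratelimited and dynamic_debug capable */
    net_dbg_ratelimited("UDP: short packet\n");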

    Some KERN_ uses are changed to KERN_DEBUG.

    This may have some negative impact on messages that were emitted at
    KERN_INFO, which are now not enabled at all unless DEBUG is defined
    or dynamic_debug is enabled. Even so, these messages are now _not_
    emitted by default.

    This also eliminates the use of the net_msg_warn sysctl
    "/proc/sys/net/core/warnings". For backward compatibility,
    the sysctl is not removed, but it has no function. The extern
    declaration of net_msg_warn is removed from sock.h and made
    static in net/core/sysctl_net_core.c

    Miscellanea:

    o Update the sysctl documentation
    o Remove the embedded uses of pr_fmt
    o Coalesce format fragments
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

20 Dec, 2013

1 commit

  • Given that we allocate memory for each cpu, we can do this using
    NUMA affinities, instead of using the NUMA policies of the process
    changing the flow_limit_cpu_bitmap value.

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2013

1 commit

  • The pfifo_fast queue discipline has been used by default for all
    devices. But we have better choices now.

    This patch allows setting the default queueing discipline with
    sysctl. This allows easy use of better queueing disciplines on all
    devices without having to use tc qdisc scripts. It is intended to
    allow an easy path for distributions to make fq_codel or sfq the
    default qdisc.

    This patch also makes pfifo_fast more of a first-class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems that do not use the sysctl
    is unchanged; they still get pfifo_fast.

    Also removes a leftover random '#' in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

03 Aug, 2013

1 commit

  • It's possible to assign an invalid value to the net.core.somaxconn
    sysctl variable, because there are no checks at all.

    The sk_max_ack_backlog field of the sock structure is defined as an
    unsigned short. Therefore, the backlog argument in inet_listen()
    shouldn't exceed USHRT_MAX. The backlog argument in the listen()
    syscall is truncated to the somaxconn value, so the somaxconn value
    shouldn't exceed 65535 (USHRT_MAX).
    Also, negative values of somaxconn are meaningless.

    before:
    $ sysctl -w net.core.somaxconn=256
    net.core.somaxconn = 256
    $ sysctl -w net.core.somaxconn=65536
    net.core.somaxconn = 65536
    $ sysctl -w net.core.somaxconn=-100
    net.core.somaxconn = -100

    after:
    $ sysctl -w net.core.somaxconn=256
    net.core.somaxconn = 256
    $ sysctl -w net.core.somaxconn=65536
    error: "Invalid argument" setting key "net.core.somaxconn"
    $ sysctl -w net.core.somaxconn=-100
    error: "Invalid argument" setting key "net.core.somaxconn"

    Based on a prior patch from Changli Gao.

    Signed-off-by: Roman Gushchin
    Reported-by: Changli Gao
    Suggested-by: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Roman Gushchin
     

26 Jun, 2013

1 commit

  • select/poll busy-poll support.

    Split the sysctl value into two separate ones, one for read and one
    for poll, and updated Documentation/sysctl/net.txt.

    Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
    sk_poll_ll if possible. sock_poll sets this flag in its return value
    to indicate to select/poll when a socket that can busy poll is found.

    When poll/select have nothing to report, call the low-level
    sock_poll again until we are out of time or we find something.

    Once the system call finds something, it stops setting POLL_LL, so it can
    return the result to the user ASAP.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

14 Jun, 2013

1 commit

  • Caught by sparse:
    - __rcu: missing annotation to sd->flow_limit
    - __user: direct access in cpumask_scnprintf

    Also
    - add endline character when printing bitmap if room in buffer
    - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY

    The last item warrants some explanation. The hashtable buckets are
    subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
    to bucket size, since all packets may end up in a single bucket. The
    current (rather arbitrary) history value of 256 happens to match the
    buffer size (u8).

    As a result, with a single flow, the first 128 packets are accepted
    (correct), the second 128 packets dropped (correct) and then the
    history[] array has filled, so that each subsequent new packet
    causes an increment in the bucket for new_flow plus a decrement
    for old_flow: a steady state.

    This is fine if packets are dropped, as the steady state goes away
    as soon as a mix of traffic reappears. But because the 256th packet
    overflowed the bucket to 0, no packets are dropped.

    Instead of explicitly adding an overflow check, this patch changes
    FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
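
    Sketched, the constraint is simply that the history window must stay
    below what a u8 bucket can count (the halved value and the struct
    layout here are our reading, not quoted from the patch):

    #define FLOW_LIMIT_HISTORY      (1 << 7)        /* 128 < 256 (u8) */

    struct sd_flow_limit {
            u64             count;
            unsigned int    num_buckets;
            unsigned int    history_head;
            u16             history[FLOW_LIMIT_HISTORY];
            u8              buckets[];
    };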

    Reported-by: Fengguang Wu
    (first item)

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

13 Jun, 2013

1 commit

  • Reduce the uses of this unnecessary typedef.

    Done via perl script:

    $ git grep --name-only -w ctl_table net | \
    xargs perl -p -i -e '\
    sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
    s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

    Reflow the modified lines that now exceed 80 columns.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

11 Jun, 2013

1 commit

  • Adds an ndo_ll_poll method and the code that supports it.
    This method can be used by low latency applications to busy-poll
    Ethernet device queues directly from the socket code.
    sysctl_net_ll_poll controls how many microseconds to poll.
    Default is zero (disabled).
    Individual protocol support will be added by subsequent patches.
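
    A sketch of the hook's shape (the exact prototype is our assumption
    from the description):

    struct net_device_ops {
            /* ... */
            /* busy-poll one device queue from socket context */
            int (*ndo_ll_poll)(struct napi_struct *napi);
    };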

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jesse Brandeburg
    Signed-off-by: Eliezer Tamir
    Acked-by: Eric Dumazet
    Tested-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

21 May, 2013

1 commit

  • A cpu executing the network receive path sheds packets when its input
    queue grows to netdev_max_backlog. A single high rate flow (such as a
    spoofed source DoS) can exceed a single cpu processing rate and will
    degrade throughput of other flows hashed onto the same cpu.

    This patch adds a more fine grained hashtable. If the netdev backlog
    is above a threshold, IRQ cpus track the ratio of total traffic of
    each flow (using 4096 buckets, configurable). The ratio is measured
    by counting the number of packets per flow over the last 256 packets
    from the source cpu. Any flow that occupies a large fraction of this
    (set at 50%) will see packet drop while above the threshold.
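
    The per-cpu accounting this describes, as a sketch (names assumed;
    locking omitted):

    static bool flow_over_limit(struct sd_flow_limit *fl, u32 hash)
    {
            u32 new_flow = hash & (fl->num_buckets - 1);
            u32 old_flow = fl->history[fl->history_head];

            /* rolling window over the last FLOW_LIMIT_HISTORY packets */
            fl->history[fl->history_head] = new_flow;
            fl->history_head = (fl->history_head + 1) % FLOW_LIMIT_HISTORY;

            if (fl->buckets[old_flow])
                    fl->buckets[old_flow]--;        /* expire oldest */

            /* shed while this flow owns more than half of the window */
            return ++fl->buckets[new_flow] > FLOW_LIMIT_HISTORY / 2;
    }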

    Tested:
    Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
    kernel receive (RPS) on cpu0 and application threads on cpus 2--7
    each handling 20k req/s. Throughput halves when hit with a 400 kpps
    antagonist storm. With this patch applied, antagonist overload is
    dropped and the server processes its complete load.

    The patch is effective when kernel receive processing is the
    bottleneck. The above RPS scenario is an extreme case, but the same
    point is reached with RFS and sufficient kernel processing
    (iptables, packet socket tap, ...).

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

29 Jan, 2013

1 commit

  • I found that if we write a larger than 4GB value to some sysctl
    variables, the sending syscall will hang up forever, because these
    variables are 32 bits wide and such large values make them overflow
    to 0 or negative.

    This patch tries to fix the overflow, or prevent a zero-value setup,
    for the sysctl variables below:

    net.core.wmem_default
    net.core.rmem_default

    net.core.rmem_max
    net.core.wmem_max

    net.ipv4.udp_rmem_min
    net.ipv4.udp_wmem_min

    net.ipv4.tcp_wmem
    net.ipv4.tcp_rmem

    Signed-off-by: Eric Dumazet
    Signed-off-by: Li Yu
    Signed-off-by: David S. Miller

    bingtian.ly@taobao.com
     

19 Nov, 2012

1 commit

  • In preparation for supporting the creation of network namespaces
    by unprivileged users, modify all of the per net sysctl exports
    and refuse to allow them to unprivileged users.

    This makes it safe for unprivileged users in general to access
    per net sysctls, and allows sysctls to be exported to unprivileged
    users on an individual basis as they are deemed safe.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Apr, 2012

5 commits

  • We don't use struct ctl_path anymore so delete the exported constants.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This results in code with less boilerplate that is a bit easier
    to read.

    Additionally stops us from using compatibility code in the sysctl
    core, hastening the day when the compatibility code can be removed.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • On the next line we register the net_core_table in net/core, which
    creates the directory and ensures it exists.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This makes it clearer which sysctls are relative to your current network
    namespace.

    This makes it a little less error prone by not exposing sysctls for the
    initial network namespace in other namespaces.

    This is the same way we handle all of our other network interfaces to
    userspace and I can't honestly remember why we didn't do this for
    sysctls right from the start.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • register_sysctl_rotable never caught on as an interesting way to
    register sysctls. My take on the situation is that what we want are
    sysctls that we can only see in the initial network namespace. What we
    have implemented with register_sysctl_rotable are sysctls that we can
    see in all of the network namespaces and can only change in the initial
    network namespace.

    That is a very silly way to go. Just register the network sysctls
    in the initial network namespace and we don't have any weird special
    cases to deal with.

    The sysctls affected are:
    /proc/sys/net/ipv4/ipfrag_secret_interval
    /proc/sys/net/ipv4/ipfrag_max_dist
    /proc/sys/net/ipv6/ip6frag_secret_interval
    /proc/sys/net/ipv6/mld_max_msf

    I really don't expect anyone will miss them if they can't read them in a
    child user namespace.

    CC: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

24 Feb, 2012

1 commit

  • …_key_slow_[inc|dec]()

    So here's a boot tested patch on top of Jason's series that does
    all the cleanups I talked about and turns jump labels into a
    more intuitive to use facility. It should also address the
    various misconceptions and confusions that surround jump labels.

    Typical usage scenarios:

    #include <linux/static_key.h>

    struct static_key key = STATIC_KEY_INIT_TRUE;

    if (static_key_false(&key))
            do unlikely code
    else
            do likely code

    Or:

    if (static_key_true(&key))
            do likely code
    else
            do unlikely code

    The static key is modified via:

    static_key_slow_inc(&key);
    ...
    static_key_slow_dec(&key);

    The 'slow' prefix makes it abundantly clear that this is an
    expensive operation.

    I've updated all in-kernel code to use this everywhere. Note
    that I (intentionally) have not pushed through the rename
    blindly through to the lowest levels: the actual jump-label
    patching arch facility should be named like that, so we want to
    decouple jump labels from the static-key facility a bit.

    On non-jump-label enabled architectures static keys default to
    likely()/unlikely() branches.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Acked-by: Jason Baron <jbaron@redhat.com>
    Acked-by: Steven Rostedt <rostedt@goodmis.org>
    Cc: a.p.zijlstra@chello.nl
    Cc: mathieu.desnoyers@efficios.com
    Cc: davem@davemloft.net
    Cc: ddaney.cavm@gmail.com
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

    Ingo Molnar
     

18 Nov, 2011

1 commit

  • Most machines don't use RPS/RFS, and pay a fair amount of
    instructions in netif_receive_skb() / netif_rx() / get_rps_cpu()
    just to discover RPS/RFS is not set up.

    Add a jump_label named rps_needed.

    If no device rps_map or global rps_sock_flow_table is set up,
    netif_receive_skb() / netif_rx() execute a single instruction
    instead of many, including conditional jumps.

    jmp +0 (if CONFIG_JUMP_LABEL=y)
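
    Sketched with the later static-key spelling (see the 24 Feb, 2012
    entry below; hook placement is illustrative):

    struct static_key rps_needed __read_mostly;

    int netif_receive_skb(struct sk_buff *skb)
    {
    #ifdef CONFIG_RPS
            if (static_key_false(&rps_needed)) {
                    /* slow path: get_rps_cpu() and cross-cpu enqueue */
            }
    #endif
            return __netif_receive_skb(skb);        /* fast path */
    }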

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 May, 2011

1 commit

  • Ingo Molnar noticed that we have this unnecessary ratelimit.h
    dependency in linux/net.h, which hid compilation problems from
    people doing builds only with CONFIG_NET enabled.

    Move this stuff out to a separate net/net_ratelimit.h file and
    include that in the only two places where this thing is needed.

    Signed-off-by: David S. Miller
    Acked-by: Ingo Molnar

    David S. Miller
     

28 Apr, 2011

1 commit

  • In order to speed up packet filtering, here is an implementation of
    a JIT compiler for x86_64.

    It is disabled by default, and must be enabled by the admin.

    echo 1 >/proc/sys/net/core/bpf_jit_enable

    It uses module_alloc() and module_free() to get memory in the 2GB
    text kernel range, since we call helper functions from the generated
    code.

    EAX : BPF A accumulator
    EBX : BPF X accumulator
    RDI : pointer to skb (first argument given to JIT function)
    RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
    r9d : skb->len - skb->data_len (headlen)
    r8 : skb->data

    To get a trace of the generated code, use:

    echo 2 >/proc/sys/net/core/bpf_jit_enable

    Example of generated code:

    # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

    flen=18 proglen=147 pass=3 image=ffffffffa00b5000
    JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
    JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
    JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
    JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
    JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
    JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
    JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
    JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
    JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
    JIT code: ffffffffa00b5090: c0 c9 c3

    The BPF program is 144 bytes long, so the native program is almost
    the same size ;)

    (000) ldh [12]
    (001) jeq #0x800 jt 2 jf 8
    (002) ld [26]
    (003) and #0xffffff00
    (004) jeq #0xc0a81400 jt 16 jf 5
    (005) ld [30]
    (006) and #0xffffff00
    (007) jeq #0xc0a81400 jt 16 jf 17
    (008) jeq #0x806 jt 10 jf 9
    (009) jeq #0x8035 jt 10 jf 17
    (010) ld [28]
    (011) and #0xffffff00
    (012) jeq #0xc0a81400 jt 16 jf 13
    (013) ld [38]
    (014) and #0xffffff00
    (015) jeq #0xc0a81400 jt 16 jf 17
    (016) ret #65535
    (017) ret #0

    Signed-off-by: Eric Dumazet
    Cc: Arnaldo Carvalho de Melo
    Cc: Ben Hutchings
    Cc: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Oct, 2010

1 commit

  • Add __rcu annotations to :
    (struct netdev_rx_queue)->rps_map
    (struct netdev_rx_queue)->rps_flow_table
    struct rps_sock_flow_table *rps_sock_flow_table;

    And use appropriate rcu primitives.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 May, 2010

1 commit

  • With RPS inclusion, skb timestamping is not consistent in the RX path.

    If netif_receive_skb() is used, it's deferred until after RPS dispatch.

    If netif_rx() is used, it's done before RPS dispatch.

    This can give strange tcpdump timestamp results.

    I think timestamping should be done as soon as possible in the receive
    path, to get meaningful values (i.e. timestamps taken at the time the
    packet was delivered by the NIC driver to our stack), even if NAPI can
    already defer timestamping a bit (RPS can help to reduce the gap).

    Tom Herbert prefers to sample timestamps after RPS dispatch. In case
    sampling is expensive (HPET/acpi_pm on x86), this makes sense.

    Let admins switch from one mode to the other, using a new
    sysctl, /proc/sys/net/core/netdev_tstamp_prequeue.

    Its default value (1) means timestamps are taken as soon as possible,
    before backlog queueing, giving accurate timestamps.

    Setting a 0 value permits sampling timestamps while processing the
    backlog, after RPS dispatch, to lower the load on the pre-RPS cpu.
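
    A sketch of the resulting check at the two entry points (helper
    shape assumed):

    int netdev_tstamp_prequeue __read_mostly = 1;

    static inline void net_timestamp_check(struct sk_buff *skb)
    {
            /* prequeue mode: stamp now, before backlog/RPS queueing */
            if (netdev_tstamp_prequeue && !skb->tstamp.tv64)
                    __net_timestamp(skb);
    }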

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Apr, 2010

1 commit

  • This patch implements receive flow steering (RFS). RFS steers
    received packets for layer 3 and 4 processing to the CPU where
    the application for the corresponding flow is running. RFS is an
    extension of Receive Packet Steering (RPS).

    The basic idea of RFS is that when an application calls recvmsg
    (or sendmsg) the application's running CPU is stored in a hash
    table that is indexed by the connection's rxhash, which is stored in
    the socket structure. The rxhash is passed in skbs received on
    the connection from netif_receive_skb. For each received packet,
    the associated rxhash is used to look up the CPU in the hash table,
    if a valid CPU is set then the packet is steered to that CPU using
    the RPS mechanisms.

    The complication of the simple approach is that it would potentially
    allow OOO packets. If threads are thrashing around CPUs or multiple
    threads are trying to read from the same sockets, a quickly changing
    CPU value in the hash table could cause rampant OOO packets--
    we consider this a non-starter.

    To avoid OOO packets, this solution implements two types of hash
    tables: rps_sock_flow_table and rps_dev_flow_table.

    The rps_sock_flow_table is a global hash table. Each entry is just a CPU
    number and it is populated in recvmsg and sendmsg as described above.
    This table contains the "desired" CPUs for flows.

    rps_dev_flow_table is specific to each device queue. Each entry
    contains a CPU and a tail queue counter. The CPU is the "current"
    CPU for a matching flow. The tail queue counter holds the value
    of a tail queue counter for the associated CPU's backlog queue at
    the time of last enqueue for a flow matching the entry.

    Each backlog queue has a queue head counter which is incremented
    on dequeue, and so a queue tail counter is computed as queue head
    count + queue length. When a packet is enqueued on a backlog queue,
    the current value of the queue tail counter is saved in the hash
    entry of the rps_dev_flow_table.

    And now the trick: when selecting the CPU for RPS (get_rps_cpu),
    the rps_sock_flow table and the rps_dev_flow table for the RX queue
    are consulted. When the desired CPU for the flow (found in the
    rps_sock_flow table) does not match the current CPU (found in the
    rps_dev_flow table), the current CPU is changed to the desired CPU
    if one of the following is true (see the sketch after this list):

    - The current CPU is unset (equal to RPS_NO_CPU)
    - The current CPU is offline
    - The current CPU's queue head counter >= queue tail counter in the
    rps_dev_flow table. This checks if the queue tail has advanced
    beyond the last packet that was enqueued using this table entry.
    This guarantees that all packets queued using this entry have been
    dequeued, thus preserving in order delivery.
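
    A sketch of that selection (structure and field names assumed from
    the description above; the head/tail compare is done with signed
    arithmetic to survive counter wraparound):

    u16 desired = sock_flow_table->ents[hash & sock_flow_table->mask];
    struct rps_dev_flow *rflow =
            &dev_flow_table->flows[hash & dev_flow_table->mask];

    if (desired != RPS_NO_CPU && desired != rflow->cpu &&
        (rflow->cpu == RPS_NO_CPU || !cpu_online(rflow->cpu) ||
         (int)(queue_head(rflow->cpu) - rflow->last_qtail) >= 0))
            rflow->cpu = desired;   /* all earlier packets drained */

    cpu = rflow->cpu;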

    Making each queue have its own rps_dev_flow table has two advantages:
    1) the tail queue counters will be written on each receive, so
    keeping the table local to the interrupting CPU is good for locality.
    2) this allows lockless access to the table -- the CPU number and
    queue tail counter need to be accessed together under mutual
    exclusion from netif_receive_skb; we assume that this is only called
    from device napi_poll, which is non-reentrant.

    This patch implements RFS for TCP and connected UDP sockets.
    It should be usable for other flow oriented protocols.

    There are two configuration parameters for RFS. The
    "rps_flow_entries" kernel init parameter sets the number of
    entries in the rps_sock_flow_table, the per rxqueue sysfs entry
    "rps_flow_cnt" contains the number of entries in the rps_dev_flow
    table for the rxqueue. Both are rounded up to a power of two.

    The obvious benefit of RFS (over just RPS) is that it achieves
    CPU locality between the receive processing for a flow and the
    applications processing; this can result in increased performance
    (higher pps, lower latency).

    The benefits of RFS are dependent on cache hierarchy, application
    load, and other factors. On simple benchmarks, we don't necessarily
    see improvement and sometimes see degradation. However, for more
    complex benchmarks and for applications where cache pressure is
    much higher this technique seems to perform very well.

    Below are some benchmark results which show the potential benefit of
    this patch. The netperf test has 500 instances of the netperf TCP_RR
    test with 1 byte requests and responses. The RPC test is a
    request/response test similar in structure to the netperf RR test,
    with 100 threads on each host, but it does more work in userspace
    than netperf.

    e1000e on 8 core Intel
      No RFS or RPS               104K tps at 30% CPU
      No RFS (best RPS config):   290K tps at 63% CPU
      RFS                         303K tps at 61% CPU

    RPC test        tps     CPU%    50/90/99% usec latency   Latency StdDev
    No RFS/RPS      103K    48%     757/900/3185             4472.35
    RPS only:       174K    73%     415/993/2468             491.66
    RFS             223K    73%     379/651/1382             315.61

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert