26 Dec, 2016

2 commits

  • No point in going through loops and hoops instead of just comparing the
    values.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized
    timespec variant on 32-bit machines. The Y2038 cleanup removed the timespec
    variant and switched everything to scalar nanoseconds. The union remained,
    but became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
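
The two ktime commits above can be illustrated with a minimal sketch. The type and function names below (ktime_old, ktime_new_t and the helpers) are hypothetical stand-ins for illustration, not the kernel's definitions. The sketch only shows why a single-member union adds nothing over a plain 64-bit scalar.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's s64; not taken from kernel headers. */
typedef int64_t s64;

/* Before: the scalar nanosecond value was wrapped in a single-member union. */
union ktime_old {
    s64 tv64;
};

/* After: the time value is just a scalar number of nanoseconds. */
typedef s64 ktime_new_t;

/* With the union, every comparison has to reach into the member. */
static inline bool ktime_old_equal(union ktime_old a, union ktime_old b)
{
    return a.tv64 == b.tv64;
}

/* With the typedef, ordinary operators work directly on the values. */
static inline bool ktime_new_before(ktime_new_t a, ktime_new_t b)
{
    return a < b;
}

static inline ktime_new_t ktime_new_add_ns(ktime_new_t kt, s64 ns)
{
    return kt + ns;
}
```

Once the type is a scalar, helper functions for equality or ordering become optional, which is presumably what the first commit means by "just comparing the values".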
     

25 Dec, 2016

1 commit


24 Dec, 2016

2 commits

  • By setting certain socket options on ipv6 raw sockets, we can confuse the
    length calculation in rawv6_push_pending_frames triggering a BUG_ON.

    RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
    RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282
    RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
    RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
    RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
    R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
    R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

    Call Trace:
    [] ? unmap_page_range+0x693/0x830
    [] inet_sendmsg+0x67/0xa0
    [] sock_sendmsg+0x38/0x50
    [] SYSC_sendto+0xef/0x170
    [] SyS_sendto+0xe/0x10
    [] do_syscall_64+0x50/0xa0
    [] entry_SYSCALL64_slow_path+0x25/0x25

    Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.

    Reproducer:

    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define LEN 504

    int main(int argc, char* argv[])
    {
        int fd;
        int zero = 0;
        char buf[LEN];

        memset(buf, 0, LEN);

        fd = socket(AF_INET6, SOCK_RAW, 7);

        setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
        setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

        sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
    }

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     
  • Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
    the packet. For sockets that have transport headers pulled, transport
    offset can be negative. Use signed comparison to avoid overflow.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Nisar Jagabar
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
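
The signed-comparison fix in the IP(V6)_RECVORIGDSTADDR commit above can be sketched as follows. The function names are hypothetical, and the 4-byte size of the port pair is an assumption of this sketch. The kernel code itself is not reproduced here.

```c
#include <stdbool.h>

/*
 * Buggy shape: the sum is treated as unsigned, so a negative transport
 * offset such as -8 turns (-8 + 4) into a huge positive value and the
 * ports are wrongly reported as lying outside the packet.
 */
static inline bool ports_outside_unsigned(int transport_offset, unsigned int len)
{
    return (unsigned int)(transport_offset + 4) > len;
}

/*
 * Fixed shape: compare as signed integers, so a negative offset
 * (transport header already pulled) still passes the range check.
 */
static inline bool ports_outside_signed(int transport_offset, unsigned int len)
{
    return transport_offset + 4 > (int)len;
}
```

With an offset of -8 and a length of 100, the unsigned variant rejects the packet while the signed variant accepts it. Both variants still reject an offset that really does run past the end.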
     

18 Dec, 2016

2 commits

  • The protocol field is checked when deleting IPv4 routes, but ignored for
    IPv6, which causes problems with routing daemons accidentally deleting
    externally set routes (observed by multiple bird6 users).

    This can be verified using `ip -6 route del proto something`.

    Signed-off-by: Mantas Mikulėnas
    Signed-off-by: David S. Miller

    Mantas M
     
  • A user may call listen without binding an explicit port with the intent
    that the kernel will assign an available port to the socket. In this
    case inet_csk_get_port does a port scan. For such sockets, the user may
    also set soreuseport with the intent of creating more sockets for the
    port that is selected. The problem is that the initial socket being
    opened could inadvertently choose an existing and unrelated port
    number that was already created with soreuseport.

    This patch adds a boolean parameter to inet_bind_conflict that indicates
    whether soreuseport is allowed for the check (in addition to
    sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
    the argument is set to true if an explicit port is being looked up (snum
    argument is nonzero), and is false if port scan is done.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

08 Dec, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains a large Netfilter update for net-next,
    to summarise:

    1) Add support for stateful objects. This series provides a nf_tables
    native alternative to the extended accounting infrastructure for
    nf_tables. Two initial stateful objects are supported: counters and
    quotas. Objects are identified by a user-defined name, you can fetch
    and reset them anytime. You can also use maps to allow fast lookups
    using any arbitrary key combination. More info at:

    http://marc.info/?l=netfilter-devel&m=148029128323837&w=2

    2) On-demand registration of nf_conntrack and defrag hooks per netns.
    Register nf_conntrack hooks if we have a stateful ruleset, ie.
    state-based filtering or NAT. The new nf_conntrack_default_on sysctl
    enables this from newly created netnamespaces. Default behaviour is not
    modified. Patches from Florian Westphal.

    3) Allocate 4k chunks and then use these for x_tables counter allocation
    requests, this improves ruleset load time and also datapath ruleset
    evaluation, patches from Florian Westphal.

    4) Add support for ebpf to the existing x_tables bpf extension.
    From Willem de Bruijn.

    5) Update layer 4 checksum if any of the pseudoheader fields is updated.
    This provides a limited form of 1:1 stateless NAT that makes sense in
    specific scenarios, e.g. load balancing.

    6) Add support to flush sets in nf_tables. This series comes with a new
    set->ops->deactivate_one() indirection given that we have to walk
    over the list of set elements, then deactivate them one by one.
    The existing set->ops->deactivate() performs an element lookup that
    we don't need.

    7) Two patches to avoid cloning packets, thus speed up packet forwarding
    via nft_fwd from ingress. From Florian Westphal.

    8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
    prevent infinite loops, patch from Dwip Banerjee. And one minor
    refactoring from Gao feng.

    9) Revisit recent log support for nf_tables netdev families: One patch
    to ensure that we correctly handle non-ethernet packets. Another
    patch to add missing logger definition for netdev. Patches from
    Liping Zhang.

    10) Three patches for nft_fib, one to address insufficient register
    initialization and another to solve incorrect (although harmless)
    byteswap operation. Moreover update xt_rpfilter and nft_fib to match
    lbcast packets with zeronet as source, eg. DHCP Discover packets
    (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.

    11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
    Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
    been broken in many-cast mode for some little time, let's give them a
    chance by placing them at the same level as other existing protocols.
    Thus, users don't explicitly have to modprobe support for this and
    NAT rules work for them. Some people point to the lack of support in
    SOHO Linux-based routers that make deployment of new protocols harder.
    I guess other middleboxes out there on the Internet are also to blame.
    Anyway, let's see if this has any impact in the midrun.

    12) Skip software SCTP checksum calculation if the NIC comes
    with SCTP checksum offload support. From Davide Caratti.

    13) Initial core factoring to prepare conversion to hook array. Three
    patches from Aaron Conole.

    14) Gao Feng made a wrong conversion to switch in the xt_multiport
    extension in a patch coming in the previous batch. Fix it in this
    batch.

    15) Get vmalloc call in sync with kmalloc flags to avoid a warning
    and likely OOM killer intervention from x_tables. From Marcelo
    Ricardo Leitner.

    16) Update Arturo Borrero's email address in all source code headers.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

07 Dec, 2016

5 commits

  • Actually ntohl and htonl are identical, so this doesn't affect
    anything, but it is conceptually wrong.

    Signed-off-by: Liping Zhang
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • instead of allocating each xt_counter individually, allocate 4k chunks
    and then use these for counter allocation requests.

    This should speed up rule evaluation by increasing data locality,
    also speeds up ruleset loading because we reduce calls to the percpu
    allocator.

    As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
    arches with 64k page size.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Keeps some noise away from a followup patch.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • On SMP we overload the packet counter (unsigned long) to contain
    percpu offset. Hide this from callers and pass xt_counters address
    instead.

    Preparation patch to allocate the percpu counters in page-sized batch
    chunks.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • nf_defrag modules for ipv4 and ipv6 export an empty stub function.
    Any module that needs the defragmentation hooks registered simply 'calls'
    this empty function to create a phony module dependency -- modprobe will
    then load the defrag module too.

    This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
    registration until the functionality is requested within a network namespace
    instead of module load time for all namespaces.

    Hooks are only un-registered on module unload or when a namespace that used
    such defrag functionality exits.

    We have to use struct net for this as the register hooks can be called
    before netns initialization here from the ipv4/ipv6 conntrack module
    init path.

    There is no unregister functionality support, defrag will always be
    active once it was requested inside a net namespace.

    The reason is that defrag has impact on nft and iptables rulesets
    (without defrag we might see fragments).

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
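
The chunked counter allocation described in the xt_counter commit above can be sketched in plain userspace C. All names here (CHUNK_SIZE, struct counter, struct chunk_state, counter_alloc) are hypothetical, and calloc stands in for the percpu allocator. The chunks are never freed, which is acceptable only because this is an illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fixed 4k chunk, deliberately not tied to the page size (assumption: 4096). */
#define CHUNK_SIZE 4096u

/* Hypothetical counter layout used only for this sketch. */
struct counter {
    unsigned long long bytes;
    unsigned long long packets;
};

struct chunk_state {
    char *mem;   /* current chunk, or NULL if a new one is needed */
    size_t off;  /* bytes already handed out from the current chunk */
};

/* Hand out one counter, allocating a new chunk only when required. */
static inline struct counter *counter_alloc(struct chunk_state *st)
{
    struct counter *c;

    if (st->mem == NULL) {
        st->mem = calloc(1, CHUNK_SIZE);
        if (st->mem == NULL)
            return NULL;
        st->off = 0;
    }

    c = (struct counter *)(st->mem + st->off);
    st->off += sizeof(*c);

    /* No room for another counter: force a fresh chunk on the next call. */
    if (st->off + sizeof(*c) > CHUNK_SIZE) {
        st->mem = NULL;
        st->off = 0;
    }
    return c;
}
```

Consecutive counters end up adjacent in memory, which is the data-locality effect the commit message refers to, and the allocator is called once per chunk rather than once per counter.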
     

06 Dec, 2016

2 commits

  • Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.

    It is possible to configure IP interfaces with IPv4-mapped addresses, and
    one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
    to this fix the kernel returned an EINVAL when attempting to add an IPv6
    route with an IPv4-mapped address as a nexthop/gateway.

    RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
    thus in order to support that type of address configuration the kernel
    needs to allow IPv4-mapped addresses as nexthops.

    Signed-off-by: Erik Nordmark
    Signed-off-by: Bob Gilligan
    Signed-off-by: David S. Miller

    Erik Nordmark
     
  • tsq_flags being in the same cache line as sk_wmem_alloc
    makes a lot of sense. Both fields are changed from tcp_wfree()
    and more generally by various TSQ related functions.

    Prior patch made room in struct sock and added sk_tsq_flags,
    this patch deletes tsq_flags from struct tcp_sock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Dec, 2016

7 commits

  • This makes use of nf_ct_netns_get/put added in previous patch.
    We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6
    then implement use-count to track how many users (nft or xtables modules)
    have a dependency on ipv4 and/or ipv6 connection tracking functionality.

    When count reaches zero, the hooks are unregistered.

    This delays activation of connection tracking inside a namespace until
    stateful firewall rule or nat rule gets added.

    This patch breaks backwards compatibility in the sense that connection
    tracking won't be active anymore when the protocol tracker module is
    loaded. This breaks e.g. setups that use ctnetlink for flow accounting and
    the like, without any '-m conntrack' packet filter rules.

    Followup patch restores old behaviour and makes new delayed scheme
    optional via sysctl.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • so that conntrack core will add the needed hooks in this namespace.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • currently aliased to try_module_get/_put.
    Will be changed in next patch when we add functions to make use of ->net
    argument to store usercount per l3proto tracker.

    This is needed to avoid registering the conntrack hooks in all netns and
    later only enable connection tracking in those that need conntrack.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • CONFIG_NF_CT_PROTO_UDPLITE is no longer a tristate. When set to y,
    connection tracking support for the UDPlite protocol is built into
    nf_conntrack.ko.

    footprint test:
    $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
    net/ipv4/netfilter/nf_conntrack_ipv4.ko \
    net/ipv6/netfilter/nf_conntrack_ipv6.ko

    (builtin)|| udplite |  ipv4  |  ipv6  | nf_conntrack
    ---------++---------+--------+--------+--------------
    none     || 432538  | 828755 | 828676 | 6141434
    UDPlite  ||    -    | 829649 | 829362 | 6498204

    Signed-off-by: Davide Caratti
    Signed-off-by: Pablo Neira Ayuso

    Davide Caratti
     
  • CONFIG_NF_CT_PROTO_SCTP is no longer a tristate. When set to y, connection
    tracking support for the SCTP protocol is built into nf_conntrack.ko.

    footprint test:
    $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
    net/ipv4/netfilter/nf_conntrack_ipv4.ko \
    net/ipv6/netfilter/nf_conntrack_ipv6.ko

    (builtin)||  sctp  |  ipv4  |  ipv6  | nf_conntrack
    ---------++--------+--------+--------+--------------
    none     || 498243 | 828755 | 828676 | 6141434
    SCTP     ||   -    | 829254 | 829175 | 6547872

    Signed-off-by: Davide Caratti
    Signed-off-by: Pablo Neira Ayuso

    Davide Caratti
     
  • CONFIG_NF_CT_PROTO_DCCP is no longer a tristate. When set to y, connection
    tracking support for the DCCP protocol is built into nf_conntrack.ko.

    footprint test:
    $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
    net/ipv4/netfilter/nf_conntrack_ipv4.ko \
    net/ipv6/netfilter/nf_conntrack_ipv6.ko

    (builtin)||  dccp  |  ipv4  |  ipv6  | nf_conntrack
    ---------++--------+--------+--------+--------------
    none     || 469140 | 828755 | 828676 | 6141434
    DCCP     ||   -    | 830566 | 829935 | 6533526

    Signed-off-by: Davide Caratti
    Signed-off-by: Pablo Neira Ayuso

    Davide Caratti
     
  • The email address has changed, let's update the copyright statements.

    Signed-off-by: Arturo Borrero Gonzalez
    Signed-off-by: Pablo Neira Ayuso

    Arturo Borrero Gonzalez
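
The use-count scheme from the nf_ct_netns_get/put commits above can be sketched as a simple counter that registers hooks on the first user and unregisters them when the last user goes away. The structure and function names are hypothetical, and locking is omitted, so this is not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical per-namespace state for this sketch. */
struct netns_sketch {
    unsigned int users;      /* how many rules/modules need conntrack */
    bool hooks_registered;   /* stands in for the real hook registration */
};

/* First user in a namespace triggers hook registration. */
static inline void sketch_ct_get(struct netns_sketch *net)
{
    if (net->users++ == 0)
        net->hooks_registered = true;
}

/* When the count drops back to zero, the hooks are unregistered. */
static inline void sketch_ct_put(struct netns_sketch *net)
{
    if (net->users > 0 && --net->users == 0)
        net->hooks_registered = false;
}
```

A namespace that never adds a stateful rule never calls the get function, so in this model connection tracking stays inactive there.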
     

04 Dec, 2016

2 commits

  • Implemented RFC7527 Enhanced DAD.
    IPv6 duplicate address detection can fail if there is some temporary
    loopback of Ethernet frames. RFC7527 solves this by including a random
    nonce in the NS messages used for DAD, and if an NS is received with the
    same nonce it is assumed to be a looped back DAD probe and is ignored.
    RFC7527 is enabled by default. Can be disabled by setting both of
    conf/{all,interface}/enhanced_dad to zero.

    Signed-off-by: Erik Nordmark
    Signed-off-by: Bob Gilligan
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Erik Nordmark
     
  • Couple conflicts resolved here:

    1) In the MACB driver, a bug fix to properly initialize the
    RX tail pointer overlapped with some changes
    to support variable sized rings.

    2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
    overlapping with a reorganization of the driver to support
    ACPI, OF, as well as PCI variants of the chip.

    3) In 'net' we had several probe error path bug fixes to the
    stmmac driver, meanwhile a lot of this code was cleaned up
    and reorganized in 'net-next'.

    4) The cls_flower classifier obtained a helper function in
    'net-next' called __fl_delete() and this overlapped with
    Daniel Borkmann's bug fix to use RCU for object destruction
    in 'net'. It also overlapped with Jiri's change to guard
    the rhashtable_remove_fast() call with a check against
    tc_skip_sw().

    5) In mlx4, a revert bug fix in 'net' overlapped with some
    unrelated changes in 'net-next'.

    6) In geneve, a stale header pointer after pskb_expand_head()
    bug fix in 'net' overlapped with a large reorganization of
    the same code in 'net-next'. Since the 'net-next' code no
    longer had the bug in question, there was nothing to do
    other than to simply take the 'net-next' hunks.

    Signed-off-by: David S. Miller

    David S. Miller
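
The Enhanced DAD check described in the RFC7527 commit above can be sketched as a nonce comparison. The names are hypothetical and the 6-byte nonce length is an assumption of this sketch. The code only models the decision of whether a received NS is our own probe looped back.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed nonce length for this sketch. */
#define NONCE_LEN 6

/* Hypothetical per-address DAD state. */
struct dad_state {
    uint8_t nonce[NONCE_LEN];  /* random nonce sent in our own DAD probe */
    bool nonce_valid;          /* true while a probe with this nonce is pending */
};

/*
 * Returns true if a received NS carries the same nonce as our outstanding
 * probe, in which case it is treated as a looped-back probe and ignored
 * rather than counted as a duplicate address.
 */
static inline bool sketch_dad_is_looped_back(const struct dad_state *st,
                                             const uint8_t *rx_nonce,
                                             size_t rx_len)
{
    if (!st->nonce_valid || rx_nonce == NULL || rx_len != NONCE_LEN)
        return false;
    return memcmp(st->nonce, rx_nonce, NONCE_LEN) == 0;
}
```

An NS without a nonce, or with a different nonce, falls through to normal duplicate handling in this model.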
     

03 Dec, 2016

5 commits

  • Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
    BPF_PROG_TYPE_CGROUP_SKB, programs can be attached to a cgroup and run
    any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
    Currently only sk_bound_dev_if is exported to userspace for modification
    by a bpf program.

    This allows a cgroup to be configured such that AF_INET{6} sockets opened
    by processes are automatically bound to a specific device. In turn, this
    enables the running of programs that do not support SO_BINDTODEVICE in a
    specific VRF context / L3 domain.

    Signed-off-by: David Ahern
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    David Ahern
     
  • segs needs to be checked for being NULL in ipv6_gso_segment() before calling
    skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:

    [ 97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
    [ 97.819112] IP: [] ipv6_gso_segment+0x119/0x2f0
    [ 97.825214] PGD 0 [ 97.827047]
    [ 97.828540] Oops: 0000 [#1] SMP
    [ 97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
    nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
    iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
    ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
    bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
    snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
    snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
    sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
    ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
    broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
    ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
    i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
    [ 97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
    [ 97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
    [ 97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
    [ 97.947720] RIP: 0010:[] [] ipv6_gso_segment+0x119/0x2f0
    [ 97.956251] RSP: 0018:ffff88012fc43a10 EFLAGS: 00010207
    [ 97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
    [ 97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
    [ 97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
    [ 97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
    [ 97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
    [ 97.997198] FS: 0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
    [ 98.005280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
    [ 98.018149] Stack:
    [ 98.020157] 00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
    [ 98.027584] 0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
    [ 98.035010] ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
    [ 98.042437] Call Trace:
    [ 98.044879] [ 98.046803] [] ? tg3_start_xmit+0x84a/0xd60 [tg3]
    [ 98.053156] [] skb_mac_gso_segment+0xb0/0x130
    [ 98.059158] [] __skb_gso_segment+0x73/0x110
    [ 98.064985] [] validate_xmit_skb+0x12d/0x2b0
    [ 98.070899] [] validate_xmit_skb_list+0x42/0x70
    [ 98.077073] [] sch_direct_xmit+0xd0/0x1b0
    [ 98.082726] [] __dev_queue_xmit+0x486/0x690
    [ 98.088554] [] ? cpumask_next_and+0x35/0x50
    [ 98.094380] [] dev_queue_xmit+0x10/0x20
    [ 98.099863] [] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
    [ 98.106907] [] br_forward_finish+0x41/0xc0 [bridge]
    [ 98.113430] [] ? nf_iterate+0x52/0x60
    [ 98.118735] [] ? nf_hook_slow+0x6b/0xc0
    [ 98.124216] [] __br_forward+0x14c/0x1e0 [bridge]
    [ 98.130480] [] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
    [ 98.137785] [] br_forward+0x9d/0xb0 [bridge]
    [ 98.143701] [] br_handle_frame_finish+0x267/0x560 [bridge]
    [ 98.150834] [] br_handle_frame+0x174/0x2f0 [bridge]
    [ 98.157355] [] ? sched_clock+0x9/0x10
    [ 98.162662] [] ? sched_clock_cpu+0x72/0xa0
    [ 98.168403] [] __netif_receive_skb_core+0x1e5/0xa20
    [ 98.174926] [] ? timerqueue_add+0x59/0xb0
    [ 98.180580] [] __netif_receive_skb+0x18/0x60
    [ 98.186494] [] process_backlog+0x95/0x140
    [ 98.192145] [] net_rx_action+0x16d/0x380
    [ 98.197713] [] __do_softirq+0xd1/0x283
    [ 98.203106] [] do_softirq_own_stack+0x1c/0x30
    [ 98.209107] [ 98.211029] [] do_softirq+0x50/0x60
    [ 98.216166] [] netif_rx_ni+0x33/0x80
    [ 98.221386] [] tun_get_user+0x487/0x7f0 [tun]
    [ 98.227388] [] tun_sendmsg+0x4b/0x60 [tun]
    [ 98.233129] [] handle_tx+0x282/0x540 [vhost_net]
    [ 98.239392] [] handle_tx_kick+0x15/0x20 [vhost_net]
    [ 98.245916] [] vhost_worker+0x9e/0xf0 [vhost]
    [ 98.251919] [] ? vhost_umem_alloc+0x40/0x40 [vhost]
    [ 98.258440] [] ? do_syscall_64+0x67/0x180
    [ 98.264094] [] kthread+0xd9/0xf0
    [ 98.268965] [] ? kthread_park+0x60/0x60
    [ 98.274444] [] ret_from_fork+0x25/0x30
    [ 98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
    [ 98.299425] RIP [] ipv6_gso_segment+0x119/0x2f0
    [ 98.305612] RSP
    [ 98.309094] CR2: 00000000000000cc
    [ 98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---

    Signed-off-by: Artem Savkov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Artem Savkov
     
  • jiffies based timestamps allow for easy inference of the number of devices
    behind NAT translators and also make tracking of hosts simpler.

    commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
    added the main infrastructure that is needed for per-connection ts
    randomization, in particular writing/reading the on-wire tcp header
    format takes the offset into account so rest of stack can use normal
    tcp_time_stamp (jiffies).

    So only two items are left:
    - add a tsoffset for request sockets
    - extend the tcp isn generator to also return another 32bit number
    in addition to the ISN.

    Re-use of ISN generator also means timestamps are still monotonically
    increasing for same connection quadruple, i.e. PAWS will still work.

    Includes fixes from Eric Dumazet.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
    ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

    skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
    dst_output() is called. It is no longer necessary to do it for each tunnel.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eli Cooper
    Signed-off-by: David S. Miller

    Eli Cooper
     
  • When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

    where skb_gso_segment() relies on skb->protocol to function properly.

    This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
    fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
    when xfrm is involved.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eli Cooper
    Signed-off-by: David S. Miller

    Eli Cooper
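
The per-connection timestamp offset from the TCP timestamp randomization commit above can be sketched with 32-bit wraparound arithmetic. The helper names are hypothetical. The point is that the rest of the stack keeps using its local clock value while only the on-wire value carries the offset.

```c
#include <stdint.h>

/* Local clock value plus the per-connection offset gives the on-wire value. */
static inline uint32_t sketch_ts_to_wire(uint32_t local_ts, uint32_t tsoffset)
{
    return local_ts + tsoffset;   /* wraps modulo 2^32 */
}

/* Subtracting the same offset from an echoed value recovers the local clock. */
static inline uint32_t sketch_ts_from_wire(uint32_t echoed_ts, uint32_t tsoffset)
{
    return echoed_ts - tsoffset;  /* wraps modulo 2^32 */
}
```

Because the offset is constant for a given connection, differences between on-wire timestamps match differences of the local clock, which is consistent with the commit's remark that PAWS keeps working.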
     

02 Dec, 2016

2 commits

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2016-12-01

    1) Change the error value when someone tries to run 32bit
    userspace on a 64bit host from -ENOTSUPP to the userspace
    exported -EOPNOTSUPP. Fix from Yi Zhao.

    2) On inbound, ESN sequence numbers are already in network
    byte order. So don't try to convert it again, this fixes
    integrity verification for ESN. Fixes from Tobias Brunner.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    This is a large batch of Netfilter fixes for net, they are:

    1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
    structure that allows to have several objects with the same key.
    Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
    expecting a return value similar to memcmp(). Change location of
    the nat_bysource field in the nf_conn structure to avoid zeroing
    this as it breaks interaction with SLAB_DESTROY_BY_RCU and leads
    to crashes. From Florian Westphal.

    2) Don't allow malformed fragments to go through in IPv6, drop them,
    otherwise we hit GPF, patch from Florian Westphal.

    3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

    4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

    5) Two patches from David Ahern to fix netfilter interaction with vrf.

    6) Fix element timeout calculation in nf_tables, we take milliseconds
    from userspace, but we use jiffies from kernelspace. Patch from
    Anders K. Pedersen.

    7) Missing validation length netlink attribute for nft_hash, from
    Laura Garcia.

    8) Fix nf_conntrack_helper documentation, we don't default to off
    anymore for a bit of time so let's get this in sync with the code.

    I know it is late but I think these are important, specifically the NAT
    bits, as they are mostly addressing fallout from recent changes. I also
    read there are chances to have -rc8, if that is the case, that would
    also give us a bit more time to test this.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Dec, 2016

1 commit


30 Nov, 2016

2 commits

  • When handling inbound packets, the two halves of the sequence number
    stored on the skb are already in network order.

    Fixes: 000ae7b2690e ("esp6: Switch to new AEAD interface")
    Signed-off-by: Tobias Brunner
    Acked-by: Herbert Xu
    Signed-off-by: Steffen Klassert

    Tobias Brunner
     
  • Dmitry Vyukov reported GPF in network stack that Andrey traced down to
    negative nh offset in nf_ct_frag6_queue().

    Problem is that all network headers before fragment header are pulled.
    Normal ipv6 reassembly will drop the skb when errors occur further down
    the line.

    netfilter doesn't do this, and instead passed the original fragment
    along. That was also fine back when netfilter ipv6 defrag worked with
    cloned fragments, as the original, pristine fragment was passed on.

    So we either have to undo the pull op, or discard such fragments.
    Since they're malformed after all (e.g. overlapping fragment) it seems
    preferable to just drop them.

    Same for temporary errors -- it doesn't make sense to accept (and
    perhaps forward!) only some fragments of same datagram.

    Fixes: 029f7f3b8701cc7ac ("netfilter: ipv6: nf_defrag: avoid/free clone operations")
    Reported-by: Dmitry Vyukov
    Debugged-by: Andrey Konovalov
    Diagnosed-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

29 Nov, 2016

1 commit

  • Andrey reported the following while fuzzing the kernel with syzkaller:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Modules linked in:
    CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff8800666d4200 task.stack: ffff880067348000
    RIP: 0010:[] []
    icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
    RSP: 0018:ffff88006734f2c0 EFLAGS: 00010206
    RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
    RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
    R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
    R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
    FS: 00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
    Stack:
    ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
    ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
    ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
    Call Trace:
    [] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
    [< inline >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
    [] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
    [] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
    [] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
    ...

    icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
    cases the dst->dev should be preferred for determining the L3 domain
    if the dst has been set on the skb. Fallback to the skb->dev if it has
    not. This covers the case reported here where icmp6_send is invoked on
    Rx before the route lookup.

    Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain")
    Reported-by: Andrey Konovalov
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

28 Nov, 2016

1 commit

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2016-11-25

    1) Fix a refcount leak in vti6.
    From Nicolas Dichtel.

    2) Fix a wrong if statement in xfrm_sk_policy_lookup.
    From Florian Westphal.

    3) The flowcache watermarks are per cpu. Take this into
    account when comparing to the threshold where we
    refuse new allocations. From Miroslav Urbanek.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Nov, 2016

1 commit


26 Nov, 2016

1 commit

  • If the cgroup associated with the receiving socket has eBPF
    programs installed, run them from ip_output(), ip6_output() and
    ip_mc_output(). From mentioned functions we have two socket contexts
    as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
    okfn()."). We explicitly need to use sk instead of skb->sk here,
    since otherwise the same program would run multiple times on egress
    when encap devices are involved, which is not desired in our case.

    eBPF programs used in this context are expected to either return 1 to
    let the packet pass, or != 1 to drop them. The programs have access to
    the skb through bpf_skb_load_bytes(), and the payload starts at the
    network headers (L3).

    Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
    for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
    the feature is unused.

    Signed-off-by: Daniel Mack
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Mack
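
The return-value convention described in the commit above (1 lets the packet pass, anything else drops it) can be sketched as a small mapping function. The function name is hypothetical, and mapping a drop to -EPERM is an assumption of this sketch, not something stated in the commit message.

```c
#include <errno.h>

/*
 * Map the value returned by an attached program to a verdict:
 * 0 means the packet may pass, a negative errno means it is dropped.
 */
static inline int sketch_cgroup_verdict(int prog_ret)
{
    return prog_ret == 1 ? 0 : -EPERM;
}
```

Only the exact value 1 passes, so a program returning 0, 2 or a negative number has the packet dropped.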
     

25 Nov, 2016

2 commits

  • In commits 93821778def10 ("udp: Fix rcv socket locking") and
    f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
    __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
    was forgotten.

    This leads to crashes if UDPlite header is pulled twice, which happens
    starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets
    before queueing")

    Bug found by syzkaller team, thanks a lot guys !

    Note that backlog use in UDP/UDPlite is scheduled to be removed starting
    from linux-4.10, so this patch is only needed up to linux-4.9

    Fixes: 93821778def1 ("udp: Fix rcv socket locking")
    Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Cc: Benjamin LaHaise
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When an ipv6 address has the tentative flag set, it can't be
    used as source for egress traffic, while the associated route,
    if any, can be looked up and even stored into some dst_cache.

    In the latter scenario, the source ipv6 address selected and
    stored in the cache is most probably wrong (e.g. with
    link-local scope) and the entity using the dst_cache will
    experience lack of ipv6 connectivity until said cache is
    cleared or invalidated.

    Overall this may cause lack of connectivity over most IPv6 tunnels
    (comprising geneve and vxlan), if the first egress packet reaches
    the tunnel before DAD is completed for the used ipv6
    address.

    This patch bumps the genid after the IFA_F_TENTATIVE flag
    is cleared, so that the dst_cache will be invalidated on the
    next lookup and ipv6 connectivity restored.

    Fixes: 0c1d70af924b ("net: use dst_cache for vxlan device")
    Fixes: 468dfffcd762 ("geneve: add dst caching support")
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni