25 Feb, 2018

1 commit

  • commit dfec091439bb2acf763497cfc58f2bdfc67c56b7 upstream.

    After commit 3f34cfae1238 ("netfilter: on sockopt() acquire sock lock
    only in the required scope"), the caller of nf_{get/set}sockopt() must
    not hold any lock, but, in such changeset, I forgot to cope with DECnet.

    This commit addresses the issue by moving the nf call outside the lock,
    in the dn_{get,set}sockopt() with the same schema currently used by
    ipv4 and ipv6. Also moves the unhandled sockopts to the end of the main
    switch statements, to improve code readability.
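
    The locking schema described above can be sketched in userspace as
    follows. This is only an illustration under stated assumptions: every
    demo_* name is hypothetical, a boolean flag stands in for the socket
    lock, and demo_nf_setsockopt() stands in for a netfilter handler that
    must be entered with the lock released.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket: a flag models the socket lock. */
struct demo_sock {
	bool locked;
	int opt_value;
};

#define DEMO_OPT_KNOWN 1

void demo_lock(struct demo_sock *sk)    { sk->locked = true; }
void demo_release(struct demo_sock *sk) { sk->locked = false; }

/* Models a netfilter handler: it fails if the caller still holds the lock. */
int demo_nf_setsockopt(struct demo_sock *sk, int optname)
{
	(void)optname;
	return sk->locked ? -EDEADLK : 0;
}

/* Protocol options, handled under the lock. Unhandled options sit at the
 * end of the switch and report -ENOPROTOOPT. */
int demo_setsockopt_locked(struct demo_sock *sk, int optname, int val)
{
	switch (optname) {
	case DEMO_OPT_KNOWN:
		sk->opt_value = val;
		return 0;
	default:
		return -ENOPROTOOPT;
	}
}

int demo_setsockopt(struct demo_sock *sk, int optname, int val)
{
	int err;

	demo_lock(sk);
	err = demo_setsockopt_locked(sk, optname, val);
	demo_release(sk);

	/* Fall back to the netfilter handler only after dropping the lock. */
	if (err == -ENOPROTOOPT)
		err = demo_nf_setsockopt(sk, optname);
	return err;
}
```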

    Reported-by: Petr Vandrovec
    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=198791#c2
    Fixes: 3f34cfae1238 ("netfilter: on sockopt() acquire sock lock only in the required scope")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Sep, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. Basically, updates to the conntrack core, enhancements for
    nf_tables, conversion of netfilter hooks from linked list to array to
    improve memory locality and assorted improvements for the Netfilter
    codebase. More specifically, they are:

    1) Add expectation to hashes after timer initialization to prevent
    access from another CPU that walks on the hashes and calls
    del_timer(), from Florian Westphal.

    2) Don't update nf_tables chain counters from hot path, this is only
    used by the x_tables compatibility layer.

    3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
    Hooks are always guaranteed to run from rcu read side, so remove
    nested rcu_read_lock() where possible. Patch from Taehee Yoo.

    4) nf_tables new ruleset generation notifications include PID and name
    of the process that has updated the ruleset, from Phil Sutter.

    5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
    the nf_family netdev family. Patch from Pablo M. Bermudo.

    6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

    7) Use deferrable workqueue for conntrack garbage collection, to reduce
    power consumption, patch from Subash Abhinov Kasiviswanathan.

    8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
    Westphal.

    9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

    10) Drop references on conntrack removal path when skbuffs have escaped via
    nfqueue, from Florian.

    11) Don't queue packets to nfqueue with dying conntrack, from Florian.

    12) Constify nf_hook_ops structure, from Florian.

    13) Remove needless branch in nf_tables trace code, from Phil Sutter.

    14) Add nla_strdup(), from Phil Sutter.

    15) Raise nf_tables objects name size up to 255 chars, people want to use
    DNS names, so increase this according to what RFC 1035 specifies.
    Patch series from Phil Sutter.

    16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
    registration on demand, suggested by Eric Dumazet, patch from Florian.

    17) Remove unused variables in compat_copy_entry_from_user both in
    ip_tables and arp_tables code. Patch from Taehee Yoo.

    18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

    19) Constify nf_loginfo structure, also from Julia.

    20) Use a single rb root in connlimit, from Taehee Yoo.

    21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

    22) Use audit_log() instead of open-coding it, from Geliang Tang.

    23) Allow to mangle tcp options via nft_exthdr, from Florian.

    24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
    a fix for a miscalculation of the minimal length.

    25) Simplify branch logic in h323 helper, from Nick Desaulniers.

    26) Calculate netlink attribute size for conntrack tuple at compile
    time, from Florian.

    27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
    From Florian.

    28) Remove holes in nf_conntrack_l4proto structure, so it becomes
    smaller. From Florian.

    29) Get rid of print_tuple() indirection for /proc conntrack listing.
    Place all the code in net/netfilter/nf_conntrack_standalone.c.
    Patch from Florian.

    30) Do not build in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
    off. From Florian.

    31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
    Florian.

    32) Fix broken indentation in ebtables extensions, from Colin Ian King.

    33) Fix several harmless sparse warnings, from Florian.

    34) Convert netfilter hook infrastructure to use array for better memory
    locality, joint work done by Florian and Aaron Conole. Moreover, add
    some instrumentation to debug this.

    35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
    per batch, from Florian.

    36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

    37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

    38) Remove unused code in the generic protocol tracker, from Davide
    Caratti.

    I think I will have material for a second Netfilter batch in my queue if
    time allows to make it fit in this merge window.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Aug, 2017

1 commit

  • Florian reported UDP xmit drops that could be root caused to the
    too small neigh limit.

    Current limit is 64 KB, meaning that even a single UDP socket would hit
    it, since its default sk_sndbuf comes from net.core.wmem_default
    (~212992 bytes on 64bit arches).

    Once ARP/ND resolution is in progress, we should allow a few more
    packets to be queued, at least for one producer.

    Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
    limit and either block in sendmsg() or return -EAGAIN.

    Signed-off-by: Eric Dumazet
    Reported-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2017

1 commit

  • This change allows us to later indicate to rtnetlink core that certain
    doit functions should be called without acquiring rtnl_mutex.

    This change should have no effect, we simply replace the last (now
    unused) calcit argument with the new flag.

    Signed-off-by: Florian Westphal
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florian Westphal
     

01 Aug, 2017

1 commit


05 Jul, 2017

1 commit


01 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
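
    The overflow protection can be illustrated with a simplified model.
    The sketch below is not the kernel's refcount_t: it is single-threaded
    and non-atomic, and the demo_* names are hypothetical. It only shows
    the idea of a counter that sticks at a saturation value instead of
    wrapping to zero, so a wrapped count can never trigger a premature
    free.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, non-atomic model of a saturating reference counter. */
struct demo_refcount {
	unsigned int refs;
};

#define DEMO_REFCOUNT_SATURATED UINT_MAX

void demo_refcount_set(struct demo_refcount *r, unsigned int n)
{
	r->refs = n;
}

/* Increment, but stay at the saturation value instead of wrapping to 0. */
void demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs == DEMO_REFCOUNT_SATURATED)
		return;
	r->refs++;
}

/* Decrement and report whether the last reference was dropped.
 * A saturated (or already zero) counter is never reported as free:
 * leaking the object is safer than freeing it while still in use. */
bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	if (r->refs == DEMO_REFCOUNT_SATURATED || r->refs == 0)
		return false;
	return --r->refs == 0;
}
```

    With a plain integer counter, one increment too many at the maximum
    value wraps to zero and the next put frees an object that still has
    users; the saturating variant trades that for a bounded leak.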

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

22 Jun, 2017

1 commit


18 Jun, 2017

2 commits

  • Now that all the components have been changed to release dst based on
    refcnt only and not depend on dst gc anymore, we can remove the
    temporary flag DST_NOGC.

    Note that we also need to remove the DST_NOCACHE check in dst_release()
    and dst_hold_safe() because now all the dst are released based on refcnt
    and behave as DST_NOCACHE.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     
  • struct dn_route is inserted into dn_rt_hash_table but no dst->__refcnt
    is taken.
    This patch makes sure the dn_rt_hash_table's reference to the dst is ref
    counted.

    As the dst is always ref counted properly, we can safely mark
    DST_NOGC flag so dst_release() will release dst based on refcnt only.
    And dst gc is no longer needed and all dst_free() or its related
    function calls should be replaced with dst_release() or
    dst_release_immediate(). And dst_dev_put() is called when removing dst
    from the hash table to release the reference on dst->dev before we lose
    pointer to it.

    Also, correct the logic in dn_dst_check_expire() and dn_dst_gc() to
    check dst->__refcnt to be > 1 to indicate it is referenced by other
    users.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     

17 Jun, 2017

2 commits

  • In the existing dn_route.c code, dn_route_output_slow() takes
    dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
    does not take dst->__refcnt before calling dn_insert_route().
    This makes the whole routing code very buggy.
    In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
    makes the routes inserted by dn_route_output_slow() not able to be
    freed as the refcnt is not released.
    In dn_dst_gc(), dnrt_drop() is called to release rt which could
    potentially cause the dst->__refcnt to be dropped to -1.
    In dn_run_flush(), dst_free() is called to release all the dst. Again,
    it makes the dst inserted by dn_route_output_slow() not able to be
    released and also, it does not wait on the rcu and could potentially
    cause crash in the path where other users still refer to this dst.

    This patch makes sure both input and output path do not take
    dst->__refcnt before calling dn_insert_route() and also makes sure
    dnrt_free()/dst_free() is called when removing dst from the hash table.
    The only difference between those 2 calls is that dnrt_free() waits on
    the rcu while dst_free() does not.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     
  • In the existing dn_route.c code, dn_route_output_slow() takes
    dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
    does not take dst->__refcnt before calling dn_insert_route().
    This makes the whole routing code very buggy.
    In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
    makes the routes inserted by dn_route_output_slow() not able to be
    freed as the refcnt is not released.
    In dn_dst_gc(), dnrt_drop() is called to release rt which could
    potentially cause the dst->__refcnt to be dropped to -1.
    In dn_run_flush(), dst_free() is called to release all the dst. Again,
    it makes the dst inserted by dn_route_output_slow() not able to be
    released and also, it does not wait on the rcu and could potentially
    cause crash in the path where other users still refer to this dst.

    This patch makes sure both input and output path do not take
    dst->__refcnt before calling dn_insert_route() and also makes sure
    dnrt_free()/dst_free() is called when removing dst from the hash table.
    The only difference between those 2 calls is that dnrt_free() waits on
    the rcu while dst_free() does not.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     

16 Jun, 2017

4 commits

  • Joe and Bjørn suggested that it'd be nicer to not have the
    cast in the fairly common case of doing
    *(u8 *)skb_put(skb, 1) = c;

    Add skb_put_u8() for this case, and use it across the code,
    using the following spatch:

    @@
    expression SKB, C, S;
    typedef u8;
    identifier fn = {skb_put};
    fresh identifier fn2 = fn ## "_u8";
    @@
    - *(u8 *)fn(SKB, S) = C;
    + fn2(SKB, C);

    Note that due to the "S", the spatch isn't perfect, it should
    have checked that S is 1, but there's also places that use a
    sizeof expression like sizeof(var) or sizeof(u8) etc. Turns
    out that nobody ever did something like
    *(u8 *)skb_put(skb, 2) = c;

    which would be wrong anyway since the second byte wouldn't be
    initialized.
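
    As a rough userspace illustration of the helper's shape, the sketch
    below uses a hypothetical fixed-size buffer (struct demo_buf) in place
    of struct sk_buff. The NULL return on overflow is an assumption made
    for the example and is not how the kernel helpers report errors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size buffer standing in for struct sk_buff. */
struct demo_buf {
	uint8_t data[64];
	size_t len;
};

/* Reserve len bytes at the tail and return a pointer to them,
 * or NULL if they do not fit (assumption made for this sketch). */
void *demo_put(struct demo_buf *b, size_t len)
{
	void *tail;

	if (len > sizeof(b->data) - b->len)
		return NULL;
	tail = b->data + b->len;
	b->len += len;
	return tail;
}

/* Append a single byte, hiding the cast the caller used to write. */
int demo_put_u8(struct demo_buf *b, uint8_t val)
{
	uint8_t *p = demo_put(b, 1);

	if (!p)
		return -1;
	*p = val;
	return 0;
}
```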

    Suggested-by: Joe Perches
    Suggested-by: Bjørn Mork
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions return void * and remove all the casts across
    the tree, adding a (u8 *) cast only where the unsigned char pointer
    was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

    Note that the last part there converts from push(...)[0] to the
    more idiomatic *(u8 *)push(...).
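
    The effect on callers can be sketched in userspace. The buffer type
    and all demo_* names below are hypothetical; the point is only that a
    void * return converts implicitly to any object pointer type in C, so
    a typed header pointer needs no cast, while direct byte access keeps
    an explicit (uint8_t *) cast.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-byte header to prepend to a packet. */
struct demo_hdr {
	uint8_t type;
	uint8_t flags;
};

/* Hypothetical buffer with headroom: used bytes start at data[head]. */
struct demo_pkt {
	uint8_t data[32];
	size_t head;	/* offset of the first used byte */
	size_t len;	/* number of used bytes */
};

/* Prepend len bytes and return a pointer to the new start,
 * or NULL if there is not enough headroom (assumption for this sketch). */
void *demo_push(struct demo_pkt *p, size_t len)
{
	if (len > p->head)
		return NULL;
	p->head -= len;
	p->len += len;
	return p->data + p->head;
}

int demo_add_header(struct demo_pkt *p, uint8_t type)
{
	/* No cast needed: void * converts implicitly to struct demo_hdr *. */
	struct demo_hdr *h = demo_push(p, sizeof(*h));

	if (!h)
		return -1;
	h->type = type;
	h->flags = 0;
	return 0;
}
```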

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions (skb_put, __skb_put and pskb_put) return void *
    and remove all the casts across the tree, adding a (u8 *) cast only
    where the unsigned char pointer was used directly, all done with the
    following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    which actually doesn't cover pskb_put since there are only three
    users overall.

    A handful of stragglers were converted manually, notably a macro in
    drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
    instances in net/bluetooth/hci_sock.c. In the former file, I also
    had to fix one whitespace problem spatch introduced.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • A common pattern with skb_put() is to just want to memcpy()
    some data into the new space, introduce skb_put_data() for
    this.

    An spatch similar to the one for skb_put_zero() converts many
    of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

    (again, manually post-processed to retain some comments)
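
    The pattern the spatch collapses can be modelled in userspace as
    follows. struct demo_skb and the demo_* functions are hypothetical
    stand-ins, and returning NULL when the data does not fit is an
    assumption made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size buffer standing in for struct sk_buff. */
struct demo_skb {
	uint8_t data[64];
	size_t len;
};

/* Reserve len bytes at the tail; NULL if they do not fit. */
void *demo_skb_put(struct demo_skb *skb, size_t len)
{
	void *tail;

	if (len > sizeof(skb->data) - skb->len)
		return NULL;
	tail = skb->data + skb->len;
	skb->len += len;
	return tail;
}

/* Reserve space and copy src into it in one step, replacing the
 * two-step "put, then memcpy" sequence at each call site. */
void *demo_skb_put_data(struct demo_skb *skb, const void *src, size_t len)
{
	void *tail = demo_skb_put(skb, len);

	if (tail)
		memcpy(tail, src, len);
	return tail;
}
```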

    Reviewed-by: Stephen Hemminger
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

15 Jun, 2017

1 commit


08 Jun, 2017

4 commits

  • DRAM supply shortage and poor memory pressure tracking in TCP
    stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
    limits) and tcp_mem[] quite hazardous.

    TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
    limits being hit, but only tracking number of transitions.

    If TCP stack behavior under stress was perfect :
    1) It would maintain memory usage close to the limit.
    2) Memory pressure state would be entered for short times.

    We certainly prefer 100 events lasting 10ms compared to one event
    lasting 200 seconds.

    This patch adds a new SNMP counter tracking cumulative duration of
    memory pressure events, given in ms units.

    $ cat /proc/sys/net/ipv4/tcp_mem
    3088 4117 6176
    $ grep TCP /proc/net/sockstat
    TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
    $ nstat -n ; sleep 10 ; nstat |grep Pressure
    TcpExtTCPMemoryPressures 1700
    TcpExtTCPMemoryPressuresChrono 5209
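
    The difference between counting transitions and accumulating duration
    can be sketched as below. This is a simplified single-threaded model
    with hypothetical demo_* names and caller-supplied millisecond
    timestamps; it is not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two counters. */
struct demo_pressure {
	bool active;		/* currently under memory pressure? */
	uint64_t entered_ms;	/* timestamp of the last entry */
	uint64_t transitions;	/* number of times pressure was entered */
	uint64_t total_ms;	/* cumulative time spent under pressure */
};

void demo_enter_pressure(struct demo_pressure *p, uint64_t now_ms)
{
	if (p->active)
		return;		/* already under pressure: no new transition */
	p->active = true;
	p->entered_ms = now_ms;
	p->transitions++;
}

void demo_leave_pressure(struct demo_pressure *p, uint64_t now_ms)
{
	if (!p->active)
		return;
	p->total_ms += now_ms - p->entered_ms;
	p->active = false;
}
```

    Two systems with the same transition count can differ widely in
    total_ms, which is the information the duration counter adds.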

    v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
    instructed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Verify that the length of the socket buffer is sufficient to cover the
    nlmsghdr structure before accessing the nlh->nlmsg_len field for further
    input sanitization. If the client only supplies 1-3 bytes of data in
    sk_buff, then nlh->nlmsg_len remains partially uninitialized and
    contains leftover memory from the corresponding kernel allocation.
    Operating on such data may result in indeterminate evaluation of the
    nlmsg_len < sizeof(*nlh) expression.

    The bug was discovered by a runtime instrumentation designed to detect
    use of uninitialized memory in the kernel. The patch prevents this and
    other similar tools (e.g. KMSAN) from flagging this behavior in the future.
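
    The ordering of the checks can be sketched in userspace. The header
    struct below is a hypothetical copy of the nlmsghdr field layout and
    demo_msg_ok() is an illustrative name; the point is that the buffer
    length is compared against the full header size before nlmsg_len is
    read at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical copy of the netlink message header layout. */
struct demo_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

bool demo_msg_ok(const uint8_t *buf, size_t buf_len)
{
	struct demo_nlmsghdr hdr;

	/* First make sure the whole header was actually supplied. */
	if (buf_len < sizeof(hdr))
		return false;
	memcpy(&hdr, buf, sizeof(hdr));

	/* Only now is nlmsg_len fully initialized and safe to evaluate. */
	if (hdr.nlmsg_len < sizeof(hdr) || hdr.nlmsg_len > buf_len)
		return false;
	return true;
}
```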

    Signed-off-by: Mateusz Jurczyk
    Signed-off-by: David S. Miller

    Mateusz Jurczyk
     
  • This reverts commit 85eac2ba35a2dbfbdd5767c7447a4af07444a5b4.

    There is an updated version of this fix which we should
    use instead.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Verify that the length of the socket buffer is sufficient to cover the
    entire nlh->nlmsg_len field before accessing that field for further
    input sanitization. If the client only supplies 1-3 bytes of data in
    sk_buff, then nlh->nlmsg_len remains partially uninitialized and
    contains leftover memory from the corresponding kernel allocation.
    Operating on such data may result in indeterminate evaluation of the
    nlmsg_len < sizeof(*nlh) expression.

    The bug was discovered by a runtime instrumentation designed to detect
    use of uninitialized memory in the kernel. The patch prevents this and
    other similar tools (e.g. KMSAN) from flagging this behavior in the future.

    Signed-off-by: Mateusz Jurczyk
    Signed-off-by: David S. Miller

    Mateusz Jurczyk
     

10 May, 2017

1 commit

  • Pull networking fixes from David Miller:

    1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

    2) cdc_ncm doesn't actually fully zero out the padding area it
    allocates on TX, from Jim Baxter.

    3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

    4) If we randomize TCP timestamps, we have to do it everywhere
    including SYN cookies. From Eric Dumazet.

    5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

    6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from
    Dan Carpenter.

    7) Add missing memory allocation return value check to DSA loop driver,
    from Christophe Jaillet.

    8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
    Kalluru.

    9) Don't inherit MC list from parent inet connection sockets, another
    syzkaller spotted gem. Fix from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
    dccp/tcp: do not inherit mc_list from parent
    qede: Split PF/VF ndos.
    qed: Correct doorbell configuration for !4Kb pages
    qed: Tell QM the number of tasks
    qed: Fix VF removal sequence
    qede: Fix XDP memory leak on unload
    net/mlx4_core: Reduce harmless SRIOV error message to debug level
    net/mlx4_en: Avoid adding steering rules with invalid ring
    net/mlx4_en: Change the error print to debug print
    drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison
    DECnet: Use container_of() for embedded struct
    Revert "ipv4: restore rt->fi for reference counting"
    net: mdio-mux: bcm-iproc: call mdiobus_free() in error path
    net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
    ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
    net: cdc_ncm: Fix TX zero padding
    stmmac: pci: split out common_default_data() helper
    stmmac: pci: RX queue routing configuration
    stmmac: pci: TX and RX queue priority configuration
    stmmac: pci: set default number of rx and tx queues
    ...

    Linus Torvalds
     

09 May, 2017

2 commits

  • Instead of a direct cross-type cast, use container_of() to locate
    the embedded structure, even in the face of future struct layout
    randomization.
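
    Why this matters can be shown with a userspace equivalent of the
    macro. The struct names are hypothetical, and the embedded member is
    deliberately not the first field, which is the situation a direct
    cast silently gets wrong.

```c
#include <stddef.h>

/* Userspace equivalent of the kernel macro, built on offsetof(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded structure. */
struct demo_dst {
	int refcnt;
};

/* Hypothetical outer structure; dst is deliberately not the first
 * member, as could happen if the layout were reordered. */
struct demo_route {
	int flags;
	struct demo_dst dst;
};

struct demo_route *demo_route_from_dst(struct demo_dst *dst)
{
	/* A direct (struct demo_route *)dst cast would be wrong here. */
	return demo_container_of(dst, struct demo_route, dst);
}
```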

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     
  • While examining output from trial builds with -Wformat-security enabled,
    many strings were found that should be defined as "const", or as a char
    array instead of char pointer. This makes some static analysis easier,
    by producing fewer false positives.

    As these are all trivial changes, it seemed best to put them all in a
    single patch rather than chopping them up per maintainer.

    Link: http://lkml.kernel.org/r/20170405214711.GA5711@beast
    Signed-off-by: Kees Cook
    Acked-by: Jes Sorensen [runner.c]
    Cc: Tony Lindgren
    Cc: Russell King
    Cc: "Maciej W. Rozycki"
    Cc: Ralf Baechle
    Cc: Arnd Bergmann
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Viresh Kumar
    Cc: Daniel Vetter
    Cc: Jani Nikula
    Cc: Sean Paul
    Cc: David Airlie
    Cc: Yisen Zhuang
    Cc: Salil Mehta
    Cc: Thomas Bogendoerfer
    Cc: Jiri Slaby
    Cc: Patrice Chotard
    Cc: "David S. Miller"
    Cc: James Hogan
    Cc: Paul Burton
    Cc: Matt Redfearn
    Cc: Paolo Bonzini
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Mugunthan V N
    Cc: Felipe Balbi
    Cc: Jarod Wilson
    Cc: Florian Westphal
    Cc: Antonio Quartulli
    Cc: Dmitry Torokhov
    Cc: Kejian Yan
    Cc: Daode Huang
    Cc: Qianqian Xie
    Cc: Philippe Reynes
    Cc: Colin Ian King
    Cc: Eric Dumazet
    Cc: Christian Gromm
    Cc: Andrey Shvetsov
    Cc: Jason Litzinger
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

01 May, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. A large bunch of code cleanups, simplify the conntrack extension
    codebase, get rid of the fake conntrack object, speed up netns by
    selective synchronize_net() calls. More specifically, they are:

    1) Check for ct->status bit instead of using nfct_nat() from IPVS and
    Netfilter codebase, patch from Florian Westphal.

    2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

    3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

    4) Introduce nft_is_base_chain() helper function.

    5) Enforce expectation limit from userspace conntrack helper,
    from Gao Feng.

    6) Add nf_ct_remove_expect() helper function, from Gao Feng.

    7) NAT mangle helper function return boolean, from Gao Feng.

    8) ctnetlink_alloc_expect() should only work for conntrack with
    helpers, from Gao Feng.

    9) Add nfnl_msg_type() helper function to nfnetlink to build the
    netlink message type.

    10) Get rid of unnecessary cast on void, from simran singhal.

    11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

    12) Use list_prev_entry() from nf_tables, from simran singhal.

    13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

    14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

    15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

    16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks is already
    guaranteed to run under RCU read side.

    17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

    18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

    19) Get rid of unused __ip_set_get_netlink(), from Aaron Conole.

    20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

    21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

    22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.

    23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

    24) Add event mask support to nft_ct, also from Florian.

    25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

    26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

    27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

    28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

    29) Shrink size of nf_conntrack_ecache structure, from Florian.

    30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

    31) Register SYNPROXY hooks on demand, from Florian Westphal.

    32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

    33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

    34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

    35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime calculation of this. From Florian.

    36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

    37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

    38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

    39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

    40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

    41) Silence gcc stack size warning on 32-bit arches in snmp helper,
    from Florian.

    42) Unconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Apr, 2017

1 commit


18 Apr, 2017

1 commit

  • Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
    for doit functions that call it directly.

    This is the first step to using extended error reporting in rtnetlink.
    From here individual subsystems can be updated to set netlink_ext_ack as
    needed.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

14 Apr, 2017

2 commits

  • Pass the new extended ACK reporting struct to all of the generic
    netlink parsing functions. For now, pass NULL in almost all callers
    (except for some in the core.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Add the base infrastructure and UAPI for netlink extended ACK
    reporting. All "manual" calls to netlink_ack() pass NULL for now and
    thus don't get extended ACK reporting.

    Big thanks goes to Pablo Neira Ayuso for not only bringing up the
    whole topic at netconf (again) but also coming up with the nlattr
    passing trick and various other ideas.

    Signed-off-by: Johannes Berg
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Johannes Berg
     

16 Mar, 2017

1 commit


10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set is
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.
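
    The key-doubling idea in steps (1) and (2) can be sketched as a small
    userspace model. Everything prefixed toy_ is hypothetical and only
    illustrates the selection logic; it is not the kernel's lockdep code,
    and TOY_AF_MAX is an arbitrary number chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_AF_MAX 4 /* hypothetical number of address families */

/* Stand-in for a lock class key: its address is its identity. */
struct toy_lock_key { char dummy; };

/* Two parallel key sets, each with one key per address family. */
static struct toy_lock_key toy_user_keys[TOY_AF_MAX];
static struct toy_lock_key toy_kern_keys[TOY_AF_MAX];

struct toy_sock {
    int family;
    bool kern_sock;                 /* models the idea of sk_kern_sock */
    const struct toy_lock_key *key; /* class the socket lock belongs to */
};

/* Pick the key set from the kern flag, then the key from the family. */
void toy_sock_lock_init(struct toy_sock *sk, int family, bool kern)
{
    assert(family >= 0 && family < TOY_AF_MAX);
    sk->family = family;
    sk->kern_sock = kern;
    sk->key = kern ? &toy_kern_keys[family] : &toy_user_keys[family];
}

/* A cloned child inherits the parent's kern setting, hence its key set. */
void toy_sock_clone(const struct toy_sock *parent, struct toy_sock *child)
{
    toy_sock_lock_init(child, parent->family, parent->kern_sock);
}
```

    With this split, a kernel-internal socket and a userspace socket of the
    same family end up in different lock classes, so a dependency recorded
    for one is not attributed to the other.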

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_tcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

08 Mar, 2017

1 commit


02 Mar, 2017

1 commit


25 Dec, 2016

1 commit


18 Dec, 2016

1 commit

  • Prepare to mark sensitive kernel structures for randomization by making
    sure they're using designated initializers. These were identified during
    allyesconfig builds of x86, arm, and arm64, with most initializer fixes
    extracted from grsecurity.

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

15 Nov, 2016

1 commit

  • Similar to commit 14135f30e33c ("inet: fix sleeping inside inet_wait_for_connect()"),
    sk_wait_event() needs to be fixed too: because release_sock() can block,
    it changes the process state back to running after sleeping, which
    breaks the previous prepare_to_wait().

    Switch to the new wait API.

    Cc: Eric Dumazet
    Cc: Peter Zijlstra
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

06 Jul, 2016

1 commit

  • dn_fib_count_nhs() could enter an infinite loop if nhp->rtnh_len == 0
    (i.e. if userspace passes a malformed netlink message).

    Let's use the helpers from net/nexthop.h which take care of all this
    stuff. We can do exactly the same as e.g. fib_count_nexthops() and
    fib_get_nhs() from net/ipv4/fib_semantics.c.
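
    The failure mode is easy to reproduce in a userspace model: a walker
    that advances by a length field taken from the input never makes
    progress when that length is zero. The sketch below uses a hypothetical
    record layout (struct toy_nh) and a hypothetical toy_count_nhs(); it
    shows the kind of bounds check such helpers perform, not the kernel's
    actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variable-length record: len covers header plus payload. */
struct toy_nh {
    unsigned short len;
    unsigned char flags;
    unsigned char hops;
};

/*
 * Count the records in buf. Returns the count, or -1 if a record's
 * length is smaller than its header or runs past the end of the buffer.
 */
int toy_count_nhs(const unsigned char *buf, size_t remaining)
{
    int count = 0;

    assert(buf != NULL);
    while (remaining > 0) {
        struct toy_nh nh;

        if (remaining < sizeof(nh))
            return -1;
        memcpy(&nh, buf, sizeof(nh));
        /* Without this check, len == 0 would loop forever. */
        if (nh.len < sizeof(nh) || nh.len > remaining)
            return -1;
        buf += nh.len;
        remaining -= nh.len;
        count++;
    }
    return count;
}
```

    Rejecting any length below the header size guarantees that every
    iteration consumes at least sizeof(struct toy_nh) bytes, so the loop
    terminates on any input.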

    This fixes the softlockup for me.

    Cc: Thomas Graf
    Signed-off-by: Vegard Nossum
    Signed-off-by: David S. Miller

    Vegard Nossum
     

11 Apr, 2016

1 commit


15 Dec, 2015

1 commit

  • 郭永刚 reported that one could simply crash the kernel as root by
    using a simple program:

    int socket_fd;
    struct sockaddr_in addr;
    addr.sin_port = 0;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_family = 10;

    socket_fd = socket(10,3,0x40000000);
    connect(socket_fd , &addr,16);

    AF_INET and AF_INET6 sockets actually only support 8-bit protocol
    identifiers, and inet_sock's skc_protocol field is sized accordingly.
    Larger protocol identifiers simply have their higher bits cut off, so
    a zero is stored in the protocol fields.
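
    The truncation can be shown with plain integer arithmetic. In the
    sketch below, the toy_ functions and TOY_PROTO_MAX are hypothetical;
    the range check illustrates one way to reject oversized values before
    they are stored and is not claimed to be the exact fix that was
    applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bound: protocol numbers must fit in an 8-bit field. */
#define TOY_PROTO_MAX 256

/* What happens when a wide protocol value lands in an 8-bit field. */
uint8_t toy_store_protocol(int protocol)
{
    return (uint8_t)protocol; /* keeps only the low 8 bits */
}

/* Range check: returns 0 if the value fits, -1 otherwise. */
int toy_check_protocol(int protocol)
{
    if (protocol < 0 || protocol >= TOY_PROTO_MAX)
        return -1;
    return 0;
}

/* Store the protocol only after it has passed the range check. */
int toy_set_protocol(uint8_t *field, int protocol)
{
    assert(field != NULL);
    if (toy_check_protocol(protocol) != 0)
        return -1;
    *field = (uint8_t)protocol;
    return 0;
}
```

    Here toy_store_protocol(0x40000000) yields 0, because the only set bit
    lies above the low 8 bits, whereas toy_set_protocol() refuses the same
    value and leaves the field untouched.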

    This could lead to e.g. a NULL function pointer dereference because, as
    a result of the cut-off, inet_num is zero and we call down to
    inet_autobind, which is NULL for raw sockets.

    kernel: Call Trace:
    kernel: [] ? inet_autobind+0x2e/0x70
    kernel: [] inet_dgram_connect+0x54/0x80
    kernel: [] SYSC_connect+0xd9/0x110
    kernel: [] ? ptrace_notify+0x5b/0x80
    kernel: [] ? syscall_trace_enter_phase2+0x108/0x200
    kernel: [] SyS_connect+0xe/0x10
    kernel: [] tracesys_phase2+0x84/0x89

    I found no particular commit which introduced this problem.

    CVE: CVE-2015-8543
    Cc: Cong Wang
    Reported-by: 郭永刚
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa