28 Oct, 2014

2 commits


24 Oct, 2014

3 commits

  • The kernel should reserve enough room in the skb so that the DONE
    message can always be appended. However, if e.g. a new attribute is
    erroneously not size-accounted for, __nfulnl_send() will still
    try to put the next nlmsg into this full skbuf, causing the skb to be stuck
    forever and blocking delivery of further messages.

    Fix the issue by releasing the skb immediately after an nlmsg_put error
    and WARN() so we can track down the cause of such a size mismatch.

    [ fw@strlen.de: add tailroom/len info to WARN ]

    Signed-off-by: Houcheng Lin
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Houcheng Lin
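
The release-on-failure behaviour described above can be sketched as a small userspace model. This is only an illustration, not the kernel code: `Skb`, `flush`, and `NLMSG_DONE_SIZE` are hypothetical names, and the 16-byte size stands in for a bare netlink header.

```python
import warnings

NLMSG_DONE_SIZE = 16  # assumed size of a bare netlink header


class Skb:
    """Toy stand-in for a socket buffer with a fixed capacity in bytes."""

    def __init__(self, size):
        self.size = size
        self.len = 0

    def tailroom(self):
        return self.size - self.len

    def put(self, nbytes):
        """Reserve nbytes; return False if they do not fit (like a failed nlmsg_put())."""
        if nbytes > self.tailroom():
            return False
        self.len += nbytes
        return True


def flush(skb, deliver):
    """Append the DONE message and deliver; on failure warn and drop the buffer.

    Returns True if the buffer was delivered, False if it was dropped.
    Either way the caller starts over with a fresh buffer, so nothing gets stuck.
    """
    if not skb.put(NLMSG_DONE_SIZE):
        warnings.warn(f"bad skb size: len {skb.len}, tailroom {skb.tailroom()}")
        return False
    deliver(skb)
    return True
```

The point of the design is that a mis-sized buffer costs one batch of log messages plus a warning, instead of blocking every later message.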
     
  • don't try to queue payloads > 0xffff - NLA_HDRLEN, it does not work.
    The nla length includes the size of the nla struct, so anything larger
    results in u16 integer overflow.

    This patch is similar to
    9cefbbc9c8f9abe (netfilter: nfnetlink_queue: cleanup copy_range usage).

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
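
The overflow can be checked with a few lines of arithmetic. The sketch below assumes NLA_HDRLEN is 4 bytes (a 16-bit length plus a 16-bit type); the helper names are made up for illustration.

```python
NLA_HDRLEN = 4    # assumed: u16 length + u16 type
U16_MAX = 0xFFFF


def nla_len_field(payload_len):
    """Value that ends up in the 16-bit nla_len field (header included)."""
    return (NLA_HDRLEN + payload_len) & U16_MAX


def clamp_payload(requested):
    """Largest payload that still fits once the header is counted."""
    return min(requested, U16_MAX - NLA_HDRLEN)
```

A payload of exactly `0xffff - NLA_HDRLEN` bytes still fits; one byte more wraps the length field around to zero.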
     
  • We currently neither account for the nlattr size, nor do we consider
    the size of the trailing NLMSG_DONE when allocating nlmsg skb.

    This can cause nflog to stop working, as __nfulnl_send() retries
    sending forever if it fails to append NLMSG_DONE (which will never
    work if the buffer is not large enough).

    Reported-by: Houcheng Lin
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
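
A rough sketch of the size accounting this commit asks for, assuming 4-byte alignment, a 16-byte netlink message header, and a 4-byte attribute header. It ignores any other per-message headers, and the function names only mirror the kernel helpers loosely.

```python
NLMSG_HDRLEN = 16  # assumed netlink message header size
NLA_HDRLEN = 4     # assumed attribute header size


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def nla_total_size(payload):
    """Attribute header plus payload, padded."""
    return align4(NLA_HDRLEN + payload)


def nlmsg_total_size(payload):
    """Message header plus payload, padded."""
    return align4(NLMSG_HDRLEN + payload)


def buffer_size(attr_payloads):
    """Room for one message carrying the attributes, plus the trailing NLMSG_DONE."""
    body = sum(nla_total_size(p) for p in attr_payloads)
    return nlmsg_total_size(body) + nlmsg_total_size(0)
```

Leaving out either the per-attribute header or the final `nlmsg_total_size(0)` term reproduces the undersized buffer described above.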
     

22 Oct, 2014

3 commits

  • alloc_percpu returns NULL on failure, not a negative error code.

    Fixes: ff3cd7b3c922 ("netfilter: nf_tables: refactor chain statistic routines")
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: Pablo Neira Ayuso

    Sabrina Dubroca
     
  • The ->ip_set_list[] array is initialized in ip_set_net_init() and it
    has ->ip_set_max elements so this check should be >= instead of >
    otherwise we are off by one.

    Signed-off-by: Dan Carpenter
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Dan Carpenter
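
The off-by-one is easy to see in a minimal model where the list length plays the role of `ip_set_max`. The `lookup` helper is hypothetical.

```python
def lookup(ip_set_list, index):
    """Return the entry at index, or None if index is out of range."""
    ip_set_max = len(ip_set_list)
    # Valid indices are 0 .. ip_set_max - 1, so the guard must be '>='.
    # With '>' the value index == ip_set_max would slip through.
    if index >= ip_set_max:
        return None
    return ip_set_list[index]
```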
     
  • When a port that was used to listen for inbound connections gets closed
    and reused for outgoing connections (like rsh ends up doing for stderr
    flow), currently we may reject the SYN/ACK packet for the new connection
    because the tcp_conntracks state table forbids a port to become a client
    while there is still a TIME_WAIT entry in there for it.

    As TCP may expire the TIME_WAIT socket in 60s and conntrack's timeout
    for it is 120s, there is a ~60s window that the application can end up
    opening a port that conntrack will end up blocking.

    This patch fixes this by simply allowing such state transition: if we
    see a SYN, in TIME_WAIT state, on REPLY direction, move it to sSS. Note
    that the rest of the code already handles this situation, more
    specifically in tcp_packet(), first switch clause.

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Marcelo Leitner
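
The change can be pictured as a single entry in a state table, together with the timeout arithmetic from the commit message. This is a sketch only: the tables hold just the one entry under discussion, and the "before" value `sIV` (invalid) is an assumption about the old behaviour rather than something the message states.

```python
# (direction, flag, current state) -> next state
BEFORE = {("REPLY", "SYN", "sTW"): "sIV"}  # assumed: invalid, so the new connection fails
AFTER = {("REPLY", "SYN", "sTW"): "sSS"}   # treated as a fresh SYN_SENT


def next_state(table, direction, flag, state):
    """Look up the next conntrack state for a packet."""
    return table[(direction, flag, state)]


def exposure_window(tcp_time_wait_s, ct_time_wait_s):
    """Seconds in which TCP may reuse the port while conntrack still holds TIME_WAIT."""
    return max(0, ct_time_wait_s - tcp_time_wait_s)
```

With the timeouts quoted above, `exposure_window(60, 120)` gives the roughly 60-second window in which the reused port could be blocked.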
     

21 Oct, 2014

1 commit

  • skb_gso_segment has three possible return values:
    1. a pointer to the first segmented skb
    2. an errno value (IS_ERR())
    3. NULL. This can happen when GSO is used for header verification.

    However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
    and would oops when NULL is returned.

    Note that these call sites should never actually see such a NULL return
    value; all callers mask out the GSO bits in the feature argument.

    However, there have been issues with some protocol handlers erroneously not
    respecting the specified feature mask in some cases.

    It is preferable to get 'have to turn off hw offloading, else slow' reports
    rather than 'kernel crashes'.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
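
The difference between the two checks can be modelled with plain integers standing in for pointers. The sketch assumes a 64-bit unsigned long and the usual convention that the top 4095 values of the address space encode errno values; the Python functions only imitate the kernel macros of the same name.

```python
ULONG_MAX = 2**64 - 1  # assumed 64-bit unsigned long
MAX_ERRNO = 4095


def ERR_PTR(err):
    """Encode a negative errno as a pointer-sized integer."""
    return err & ULONG_MAX


def IS_ERR(ptr):
    """True only for encoded errno values; NULL (0) is not an error here."""
    return ptr >= ULONG_MAX - MAX_ERRNO + 1


def IS_ERR_OR_NULL(ptr):
    """True for NULL as well as for encoded errno values."""
    return ptr == 0 or IS_ERR(ptr)
```

A caller that only tests `IS_ERR()` treats a NULL return as a valid pointer and dereferences it, which is the oops the commit guards against.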
     

20 Oct, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    netfilter fixes for net

    The following patchset contains netfilter fixes for your net tree,
    they are:

    1) Fix missing MODULE_LICENSE() in the new nf_reject_ipv{4,6} modules.

    2) Restrict nat and masq expressions to the nat chain type. Otherwise,
    users may crash their kernel if they attach a nat/masq rule to a non
    nat chain.

    3) Fix hook validation in nft_compat when non-base chains are used.
    Basically, initialize hook_mask to zero.

    4) Make sure you use match/targets in nft_compat from the right chain
    type. The existing validation relies on the table name which can be
    avoided by

    5) Better netlink attribute validation in nft_nat. This expression has
    to reject the configuration when no address and proto configurations
    are specified.

    6) Interpret NFTA_NAT_REG_*_MAX only if NFTA_NAT_REG_*_MIN is set.
    Yet another sanity check to reject incorrect configurations from
    userspace.

    7) Conditional NAT attribute dumping depending on the existing
    configuration.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Oct, 2014

4 commits


14 Oct, 2014

3 commits

  • Set hook_mask to zero for non-base chains, otherwise people may hit
    bogus errors from the xt_check_target() and xt_check_match() when
    validating the uninitialized hook_mask.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • The kernel used to contain two functions for length-delimited,
    case-insensitive string comparison, strnicmp with correct semantics and
    a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
    was renamed to strncasecmp, and strnicmp made into a wrapper for the new
    strncasecmp to avoid breaking existing users.

    To allow the compat wrapper strnicmp to be removed at some point in the
    future, and to avoid the extra indirection cost, do
    s/strnicmp/strncasecmp/g.

    Signed-off-by: Rasmus Villemoes
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • This adds the missing validation code to avoid the use of nat/masq from
    non-nat chains. The validation assumes two possible configuration
    scenarios:

    1) Use of nat from base chain that is not of nat type. Reject this
    configuration from the nft_*_init() path of the expression.

    2) Use of nat from non-base chain. In this case, we have to wait until
    the non-base chain is referenced by at least one base chain via
    jump/goto. This is resolved from the nft_*_validate() path which is
    called from nf_tables_check_loops().

    The user gets an -EOPNOTSUPP in both cases.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
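
The two validation scenarios can be sketched with a toy chain graph. `Chain`, `check_init`, and `check_validate` are hypothetical names; the real checks live in the nft_*_init() and nft_*_validate() paths mentioned above.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Chain:
    name: str
    is_base: bool
    chain_type: str = ""    # only meaningful for base chains
    uses_nat: bool = False  # chain contains a nat/masq rule
    jumps: list = field(default_factory=list)  # jump/goto targets


def check_init(chain):
    """Scenario 1: nat used directly from a base chain that is not of nat type."""
    if chain.is_base and chain.uses_nat and chain.chain_type != "nat":
        return -errno.EOPNOTSUPP
    return 0


def check_validate(base):
    """Scenario 2: nat used in a non-base chain reachable from a non-nat base chain."""
    seen, stack = set(), list(base.jumps)
    while stack:
        chain = stack.pop()
        if chain.name in seen:
            continue
        seen.add(chain.name)
        if chain.uses_nat and base.chain_type != "nat":
            return -errno.EOPNOTSUPP
        stack.extend(chain.jumps)
    return 0
```

The second check can only run once a base chain references the non-base chain, which is why it is deferred to the validation walk rather than done at rule creation.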
     

11 Oct, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net-next

    This batch contains two fixes for what you have in your net-next,
    they are:

    1) Remove nf_send_reset6() from header file. This function now resides
    in the nf_reject_ipv6 module. Reported by Eric Dumazet.

    2) Fix wrong NFT_REJECT_ICMPX_MAX definition and adjust code to fix
    errors reported by Dan Carpenter's static analysis tools.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Oct, 2014

1 commit

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the driver to start processing the new TX queue entries.

    skb->xmit_more means that the generic networking is guaranteed to
    call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling of mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so expect more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF now can load programs via a system call and has an extensive
    testsuite. Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, From
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
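
The xmit_more deferral from item 1 of the pull message can be illustrated with a toy driver that counts doorbell writes. `ToyDriver` and its fields are invented for this sketch and do not correspond to any real driver.

```python
class ToyDriver:
    """Queues packets and rings the doorbell only when no more are coming."""

    def __init__(self):
        self.pending = 0    # queued TX entries not yet announced to the device
        self.doorbells = 0  # number of doorbell writes so far

    def start_xmit(self, packet, xmit_more):
        self.pending += 1
        if not xmit_more:
            # Last packet of the burst: one doorbell covers all pending entries.
            self.doorbells += 1
            self.pending = 0


def send_burst(driver, packets):
    """Send packets back to back, flagging all but the last with xmit_more."""
    for i, packet in enumerate(packets):
        driver.start_xmit(packet, xmit_more=(i < len(packets) - 1))
```

A burst of four packets results in a single doorbell write instead of four, which is where the saving comes from.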
     

08 Oct, 2014

2 commits

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     
  • NFT_REJECT_ICMPX_MAX should be __NFT_REJECT_ICMPX_MAX - 1.

    nft_reject_icmp_code() and nft_reject_icmpv6_code() are called from the
    packet path, so BUG_ON in case we try to access an unknown abstracted
    ICMP code. This should not happen since we already validate this from
    nft_reject_{inet,bridge}_init().

    Fixes: 51b0a5d ("netfilter: nft_reject: introduce icmp code abstraction for inet and bridge")
    Reported-by: Dan Carpenter
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
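
The naming convention behind this fix: the double-underscore name is a sentinel one past the last valid value, and the plain `_MAX` name is the last valid value itself. In the sketch below the numeric values are assumed to follow declaration order starting at zero, and `code_is_valid` is a hypothetical helper.

```python
(NFT_REJECT_ICMPX_NO_ROUTE,
 NFT_REJECT_ICMPX_PORT_UNREACH,
 NFT_REJECT_ICMPX_HOST_UNREACH,
 NFT_REJECT_ICMPX_ADMIN_PROHIBITED,
 __NFT_REJECT_ICMPX_MAX) = range(5)  # sentinel: one past the last real code

NFT_REJECT_ICMPX_MAX = __NFT_REJECT_ICMPX_MAX - 1  # last valid code


def code_is_valid(code):
    """Range check as it would be done when a rule is created."""
    return 0 <= code <= NFT_REJECT_ICMPX_MAX
```

If `_MAX` were off by one, the range check would accept the sentinel and the packet path would later index past the end of the code table.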
     

06 Oct, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains another batch with Netfilter/IPVS updates
    for net-next, they are:

    1) Add abstracted ICMP codes to the nf_tables reject expression. We
    introduce four reasons to reject using ICMP that overlap in IPv4
    and IPv6 from the semantic point of view. This should simplify the
    maintenance of dual stack rule-sets through the inet table.

    2) Move nf_send_reset() functions from header files to per-family
    nf_reject modules, suggested by Patrick McHardy.

    3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
    code now that br_netfilter can be modularized. Convert remaining spots
    in the network stack code.

    4) Use rcu_barrier() in the nf_tables module removal path to ensure that
    we don't leave objects that are still pending to be released via
    call_rcu (that may likely result in a crash).

    5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
    idea was to probe the word size based on the xtables match/target info
    size, but this assumption is wrong when you have to dump the information
    back to userspace.

    6) Allow to filter from prerouting and postrouting in the nf_tables bridge.
    In order to emulate the ebtables NAT chains (which are actually simple
    filter chains with no special semantics), we have to support filtering
    from these hooks too.

    7) Add explicit module dependency between xt_physdev and br_netfilter.
    This provides a way to detect if the user needs br_netfilter from
    the configuration path. This should reduce the breakage of the
    br_netfilter modularization.

    8) Cleanup coding style in ip_vs.h, from Simon Horman.

    9) Fix crash in the recently added nf_tables masq expression. We have
    to register/unregister the notifiers to clean up the conntrack table
    entries from the module init/exit path, not from the rule addition /
    deletion path. From Arturo Borrero.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Oct, 2014

6 commits

  • Conflicts:
    drivers/net/usb/r8152.c
    net/netfilter/nfnetlink.c

    Both r8152 and nfnetlink conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • You can use physdev to match the physical interface enslaved to the
    bridge device. This information is stored in skb->nf_bridge and it is
    set up by br_netfilter. So, this is only available when iptables is
    used from the bridge netfilter path.

    Since 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
    the br_netfilter code is modular. To reduce the impact of this change,
    we can autoload the br_netfilter if the physdev match is used since
    we assume that the users need br_netfilter in place.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This code was based on the wrong assumption that you can probe based
    on the match/target private size that we get from userspace. This
    doesn't work at all when you have to dump the info back to userspace
    since you don't know what word size the userspace utility is using.

    Currently, the extensions that require arch compat are limit match
    and the ebt_mark match/target. The standard targets are not used by
    the nft-xt compat layer, so they are not affected. We can work around
    this limitation with a new revision that uses arch agnostic types.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Make sure the objects have been released before the nf_tables module
    is removed.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • In 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
    the bridge netfilter code has been modularized.

    Use IS_ENABLED instead of ifdef to cover the module case.

    Fixes: 34666d4 ("netfilter: bridge: move br_netfilter out of the core")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This patch introduces the NFT_REJECT_ICMPX_UNREACH type which provides
    an abstraction to the ICMP and ICMPv6 codes that you can use from the
    inet and bridge tables, they are:

    * NFT_REJECT_ICMPX_NO_ROUTE: no route to host - network unreachable
    * NFT_REJECT_ICMPX_PORT_UNREACH: port unreachable
    * NFT_REJECT_ICMPX_HOST_UNREACH: host unreachable
    * NFT_REJECT_ICMPX_ADMIN_PROHIBITED: administratively prohibited

    You can still use the specific codes when restricting the rule to match
    the corresponding layer 3 protocol.

    I decided to not overload the existing NFT_REJECT_ICMP_UNREACH to have
    different semantics depending on the table family and to allow the user
    to specify ICMP family specific codes if they restrict it to the
    corresponding family.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
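
One way to picture the abstraction is as a pair of lookup tables from the four abstract reasons to the family-specific destination-unreachable codes. The numeric codes below are the standard ICMP and ICMPv6 values; the dictionary layout, the short reason names, and the `reject_code` helper are illustrative only and are not the kernel's data structures.

```python
# Abstract reason -> ICMPv4 destination-unreachable code
ICMPX_TO_ICMP = {
    "no_route": 0,           # network unreachable
    "port_unreach": 3,       # port unreachable
    "host_unreach": 1,       # host unreachable
    "admin_prohibited": 13,  # communication administratively prohibited
}

# Abstract reason -> ICMPv6 destination-unreachable code
ICMPX_TO_ICMPV6 = {
    "no_route": 0,           # no route to destination
    "port_unreach": 4,       # port unreachable
    "host_unreach": 3,       # address unreachable
    "admin_prohibited": 1,   # administratively prohibited
}


def reject_code(reason, family):
    """Pick the family-specific code for an abstract reason."""
    table = ICMPX_TO_ICMP if family == "ipv4" else ICMPX_TO_ICMPV6
    return table[reason]
```

A single rule in an inet or bridge table can then name the reason once and leave the per-family code to the lookup.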
     

30 Sep, 2014

1 commit

  • In order to run qdiscs without locking, statistics and estimators
    need to be handled correctly.

    To resolve bstats, make the statistics per cpu. Because this is
    only needed for qdiscs that run without locks, which will not be
    the case for most qdiscs in the near future, only create percpu
    stats when qdiscs set the TCQ_F_CPUSTATS flag.

    Next, because estimators use the bstats to calculate packets per
    second and bytes per second, the estimator code paths are updated
    to use the per cpu statistics.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
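
A minimal model of the per-cpu idea: each CPU updates only its own slot without taking a lock, and readers such as a rate estimator sum across slots. The class and function names are hypothetical.

```python
class PerCpuBstats:
    """Byte and packet counters kept separately for each CPU."""

    def __init__(self, ncpus):
        self.bytes = [0] * ncpus
        self.packets = [0] * ncpus

    def update(self, cpu, nbytes):
        # Only the owning CPU writes its slot, so no lock is needed here.
        self.bytes[cpu] += nbytes
        self.packets[cpu] += 1

    def totals(self):
        """Fold all per-cpu slots into (bytes, packets)."""
        return sum(self.bytes), sum(self.packets)


def rates(prev_totals, cur_totals, interval_s):
    """Bytes per second and packets per second between two snapshots."""
    return tuple((cur - prev) / interval_s
                 for prev, cur in zip(prev_totals, cur_totals))
```

The estimator has to be adapted because it can no longer read a single counter; it needs the folded totals.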
     

29 Sep, 2014

2 commits

  • Given the following iptables ruleset:

    -P FORWARD DROP
    -A FORWARD -p sctp --dport 9 -j ACCEPT
    -A FORWARD -p tcp --dport 80 -j ACCEPT
    -A FORWARD -p tcp -m conntrack -m state ESTABLISHED,RELATED -j ACCEPT

    One would assume that this allows SCTP on port 9 and TCP on port 80.
    Unfortunately, if the SCTP conntrack module is not loaded, this allows
    *all* SCTP communication to pass through, i.e. -p sctp -j ACCEPT,
    which we think is a security issue.

    This is because on the first SCTP packet on port 9, we create a dummy
    "generic l4" conntrack entry without any port information (since
    conntrack doesn't know how to extract this information).

    All subsequent packets that are unknown will then be in established
    state since they will fallback to proto_generic and will match the
    'generic' entry.

    Our originally proposed version [1] completely disabled generic protocol
    tracking, but Jozsef suggests to not track protocols for which a more
    suitable helper is available, hence we now mitigate the issue for in
    tree known ct protocol helpers only, so that at least NAT and direction
    information will still be preserved for others.

    [1] http://www.spinics.net/lists/netfilter-devel/msg33430.html

    Joint work with Daniel Borkmann.

    Signed-off-by: Florian Westphal
    Signed-off-by: Daniel Borkmann
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
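
The problem (not the fix) can be reproduced with a toy connection table in which protocols without a dedicated tracker get a key that carries no port information. Everything here is a simplified assumption: real conntrack tuples also track direction and more, and `ct_key` and `classify` are invented names.

```python
def ct_key(pkt, tracked_protos):
    """Build the lookup key; protocols without a tracker get no ports."""
    base = (pkt["src"], pkt["dst"], pkt["proto"])
    if pkt["proto"] in tracked_protos:
        return base + (pkt["sport"], pkt["dport"])
    return base  # "generic l4" entry


def classify(pkt, table, tracked_protos):
    """Return ESTABLISHED if a matching entry exists, else create one and return NEW."""
    key = ct_key(pkt, tracked_protos)
    if key in table:
        return "ESTABLISHED"
    table.add(key)
    return "NEW"
```

With SCTP missing from `tracked_protos`, a first packet to port 9 creates the generic entry and a later packet to any other port between the same hosts is classified as ESTABLISHED, which the last rule of the ruleset then accepts.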
     
  • We want to know in which cases the user explicitly sets the policy
    options. In that case, we also want to dump back the info.

    Signed-off-by: Arturo Borrero Gonzalez
    Signed-off-by: Pablo Neira Ayuso

    Arturo Borrero
     

27 Sep, 2014

2 commits

  • Pablo Neira Ayuso says:

    ====================
    nf pull request for net

    This series contains netfilter fixes for net, they are:

    1) Fix lockdep splat in nft_hash when releasing sets from the
    rcu_callback context. We don't hold the mutex there anymore.

    2) Remove unnecessary spinlock_bh in the destroy path of the nf_tables
    rbtree set type from rcu_callback context.

    3) Fix another lockdep splat in rhashtable. None of the callers hold
    a mutex when calling rhashtable_destroy.

    4) Fix duplicated error reporting from nfnetlink when aborting and
    replaying a batch.

    5) Fix a Kconfig issue reported by kbuild robot.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Reduce boilerplate code by using __seq_open_private() instead of seq_open()
    in xt_match_open() and xt_target_open().

    Signed-off-by: Rob Jones
    Signed-off-by: Pablo Neira Ayuso

    Rob Jones
     

19 Sep, 2014

2 commits

  • This patch exposes the ruleset generation ID in three ways:

    1) The new command NFT_MSG_GETGEN that exposes the 32-bits ruleset
    generation ID. This ID is incremented in every commit and it
    should be large enough to avoid wraparound problems.

    2) The least significant 16 bits of the generation ID are exposed through
    the nfgenmsg->res_id header field. This allows us to quickly catch
    if the ruleset has changed between two consecutive list dumps from
    different object lists (in this specific case I think wraparound
    is unlikely).

    3) Userspace subscribers may receive notifications of new rule-set
    generation after every commit. This also provides an alternative
    way to monitor the generation ID. If the events are lost, the
    userspace process hits an overrun error, so it knows that it is
    working with a stale ruleset anyway.

    Patrick spotted that rule-set transformations in userspace may take
    quite some time. In that case, it annotates the 32-bits generation ID
    before fetching the rule-set, then:

    1) it compares it to what we obtain after the transformation to
    make sure it is not working with a stale rule-set and no wraparound
    has occurred.

    2) it subscribes to ruleset notifications, so it can watch for new
    generation ID.

    This is complementary to the NLM_F_DUMP_INTR approach, which allows
    us to detect an interference in the middle of a single list dump.
    There is no way to explicitly check that an interference has occurred
    between two list dumps from the kernel, since it doesn't know how
    many lists the userspace client is actually going to dump.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
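
The relationship between the 32-bit generation ID and the 16-bit value carried in the header can be sketched as follows. The helper names are hypothetical; only the bit widths come from the message above.

```python
def next_generation(gen_id):
    """Generation ID after one more commit (32-bit wraparound)."""
    return (gen_id + 1) & 0xFFFFFFFF


def res_id(gen_id):
    """Low 16 bits of the generation ID, as carried in the header field."""
    return gen_id & 0xFFFF


def dumps_look_consistent(res_id_first, res_id_second):
    """Quick check between two consecutive list dumps."""
    return res_id_first == res_id_second


def ruleset_is_stale(gen_before, gen_after):
    """Full 32-bit comparison around a long userspace transformation."""
    return gen_before != gen_after
```

Two generations exactly 65536 commits apart share the same `res_id`, which is why a long-running transformation compares the full 32-bit value instead of relying on the header field.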
     
  • This allows us to access the original content of the batch from
    the commit and the abort paths.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

18 Sep, 2014

4 commits

  • Simon Horman says:

    ====================
    This pull requests makes the following changes:

    * Add simple weighted fail-over scheduler.
    - Unlike other IPVS schedulers this offers fail-over rather than load
    balancing. Connections are directed to the appropriate server based
    solely on highest weight value and server availability.
    - Thanks to Kenny Mathis

    * Support IPv6 real servers in IPv4 virtual-services and vice versa
    - This feature is supported in conjunction with the tunnel (IPIP)
    forwarding mechanism. That is, IPv4 may be forwarded in IPv6 and
    vice versa.
    - The motivation for this is to allow more flexibility in the
    choice of IP version offered by both virtual-servers and
    real-servers as they no longer need to match: An IPv4 connection from an
    end-user may be forwarded to a real-server using IPv6 and vice versa.
    - Further work needs to be done to support this feature in conjunction
    with connection synchronisation. For now such configurations are
    not allowed.
    - This change includes an update to the netlink protocol, adding a new
    destination address family attribute, and the necessary changes
    to plumb this information throughout IPVS.
    - Thanks to Alex Gartrell and Julian Anastasov
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
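
The selection rule of the weighted fail-over scheduler, as described in the pull request, can be sketched in a few lines. The dictionary layout and the `available` flag are assumptions made for this example; how IPVS actually determines availability is not covered here.

```python
def pick_server(servers):
    """Return the available server with the highest weight, or None if none is up."""
    candidates = [s for s in servers if s["available"]]
    return max(candidates, key=lambda s: s["weight"], default=None)
```

Unlike a load-balancing scheduler, every connection goes to the same server until it becomes unavailable, at which point the next-highest weight takes over.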
     
  • Remove the temporary consistency check and add a case statement to only
    allow ipip mixed dests.

    Signed-off-by: Alex Gartrell
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Alex Gartrell
     
  • Use the new address family field cp->daf when printing
    cp->daddr in logs or connection listing.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Alex Gartrell
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Needed to support svc->af != dest->af.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Alex Gartrell
    Signed-off-by: Simon Horman

    Julian Anastasov
     

16 Sep, 2014

1 commit