17 Nov, 2013

2 commits

  • Pull NFS client bugfixes:
    - Stable fix for data corruption when retransmitting O_DIRECT writes
    - Stable fix for a deep recursion/stack overflow bug in rpc_release_client
    - Stable fix for infinite looping when mounting a NFSv4.x volume
    - Fix a typo in the nfs mount option parser
    - Allow pNFS layouts to be compiled into the kernel when NFSv4.1 is

    * tag 'nfs-for-3.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs: fix pnfs Kconfig defaults
    NFS: correctly report misuse of "migration" mount option.
    nfs: don't retry detect_trunking with RPC_AUTH_UNIX more than once
    SUNRPC: Avoid deep recursion in rpc_release_client
    SUNRPC: Fix a data corruption issue when retransmitting RPC calls

    Linus Torvalds
     
  • Pull nfsd changes from Bruce Fields:
    "This includes miscellaneous bugfixes and cleanup and a performance fix
    for write-heavy NFSv4 workloads.

    (The most significant nfsd-relevant change this time is actually in
    the delegation patches that went through Viro, fixing a long-standing
    bug that can cause NFSv4 clients to miss updates made by non-nfs users
    of the filesystem. Those enable some followup nfsd patches which I
    have queued locally, but those can wait till 3.14)"

    * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (24 commits)
    nfsd: export proper maximum file size to the client
    nfsd4: improve write performance with better sendspace reservations
    svcrpc: remove an unnecessary assignment
    sunrpc: comment typo fix
    Revert "nfsd: remove_stid can be incorporated into nfs4_put_delegation"
    nfsd4: fix discarded security labels on setattr
    NFSD: Add support for NFS v4.2 operation checking
    nfsd4: nfsd_shutdown_net needs state lock
    NFSD: Combine decode operations for v4 and v4.1
    nfsd: -EINVAL on invalid anonuid/gid instead of silent failure
    nfsd: return better errors to exportfs
    nfsd: fh_update should error out in unexpected cases
    nfsd4: need to destroy revoked delegations in destroy_client
    nfsd: no need to unhash_stid before free
    nfsd: remove_stid can be incorporated into nfs4_put_delegation
    nfsd: nfs4_open_delegation needs to remove_stid rather than unhash_stid
    nfsd: nfs4_free_stid
    nfsd: fix Kconfig syntax
    sunrpc: trim off EC bytes in GSSAPI v2 unwrap
    gss_krb5: document that we ignore sequence number
    ...

    Linus Torvalds
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

2 commits

  • Pull virtio updates from Rusty Russell:
    "Nothing really exciting: some groundwork for changing virtio endian,
    and some robustness fixes for broken virtio devices, plus minor
    tweaks"

    * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    virtio_scsi: verify if queue is broken after virtqueue_get_buf()
    x86, asmlinkage, lguest: Pass in globals into assembler statement
    virtio: mmio: fix signature checking for BE guests
    virtio_ring: adapt to notify() returning bool
    virtio_net: verify if queue is broken after virtqueue_get_buf()
    virtio_console: verify if queue is broken after virtqueue_get_buf()
    virtio_blk: verify if queue is broken after virtqueue_get_buf()
    virtio_ring: add new function virtqueue_is_broken()
    virtio_test: verify if virtqueue_kick() succeeded
    virtio_net: verify if virtqueue_kick() succeeded
    virtio_ring: let virtqueue_{kick()/notify()} return a bool
    virtio_ring: change host notification API
    virtio_config: remove virtio_config_val
    virtio: use size-based config accessors.
    virtio_config: introduce size-based accessors.
    virtio_ring: plug kmemleak false positive.
    virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM

    Linus Torvalds
     
  • All seq_printf() users that rely on "%n" to calculate the padding size
    are converted to use the seq_setwidth() / seq_pad() pair instead.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Kees Cook
    Cc: Joe Perches
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
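    The idea behind the seq_setwidth()/seq_pad() pair is that the caller records the target field width before printing and pads to it afterwards, so "%n" is no longer needed to learn how much was printed. The userspace model below only illustrates that idea. struct toy_seq and the toy_ functions are hypothetical stand-ins, not the kernel's seq_file API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct seq_file: a fixed buffer plus a fill count. */
struct toy_seq {
	char buf[256];
	size_t count;      /* bytes written so far (excluding the NUL) */
	size_t pad_until;  /* target end of the padded field, set by toy_setwidth() */
};

static void toy_printf(struct toy_seq *m, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(m->buf + m->count, sizeof(m->buf) - m->count, fmt, ap);
	va_end(ap);
	if (n > 0) {
		/* vsnprintf() truncates; advance only by what actually fit. */
		size_t room = sizeof(m->buf) - m->count - 1;
		m->count += ((size_t)n < room) ? (size_t)n : room;
	}
}

/* Record where the padded field must end, instead of capturing "%n" later. */
static void toy_setwidth(struct toy_seq *m, size_t width)
{
	m->pad_until = m->count + width;
}

/* Pad with spaces up to the recorded width, then append the character c. */
static void toy_pad(struct toy_seq *m, char c)
{
	while (m->count < m->pad_until && m->count + 1 < sizeof(m->buf))
		m->buf[m->count++] = ' ';
	m->buf[m->count] = '\0';
	if (c != '\0' && m->count + 1 < sizeof(m->buf)) {
		m->buf[m->count++] = c;
		m->buf[m->count] = '\0';
	}
}
```

    Usage follows the same order as the kernel pair: set the width, print the variable-length part, then pad and terminate the line.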
     

14 Nov, 2013

2 commits

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlock structures; this both
    unearthed bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
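    Lockdep now instruments the kernel's seqcount API (write_seqcount_begin()/write_seqcount_end() for writers, read_seqcount_begin()/read_seqcount_retry() for readers). The pattern itself is sketched below with C11 atomics as a userspace model. All toy_ names are hypothetical. The model assumes writers are already serialized by some other lock and that unsigned int is 32 bits wide. The protected data imitates the two-halves 64-bit counter that u64_stats_sync guards on 32-bit machines.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy sequence counter: even value = stable, odd value = write in progress. */
struct toy_seqcount {
	atomic_uint sequence;
};

/* Reader entry: wait out any active writer and remember the sequence seen. */
static unsigned toy_read_begin(struct toy_seqcount *s)
{
	unsigned seq;

	while ((seq = atomic_load_explicit(&s->sequence, memory_order_acquire)) & 1u)
		;	/* writer active: spin */
	return seq;
}

/* Reader exit: true means a writer ran meanwhile, so the reads must be redone. */
static bool toy_read_retry(struct toy_seqcount *s, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

/* Writers must be serialized against each other by the caller. */
static void toy_write_begin(struct toy_seqcount *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void toy_write_end(struct toy_seqcount *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
}

/* Protected data: a 64-bit counter stored as two 32-bit halves. */
struct toy_stats {
	struct toy_seqcount seq;
	_Atomic uint32_t lo, hi;
};

static void toy_stats_add(struct toy_stats *st, uint64_t delta)
{
	uint64_t v;

	toy_write_begin(&st->seq);
	v = ((uint64_t)atomic_load_explicit(&st->hi, memory_order_relaxed) << 32) |
	    atomic_load_explicit(&st->lo, memory_order_relaxed);
	v += delta;
	atomic_store_explicit(&st->lo, (uint32_t)(v & 0xFFFFFFFFu), memory_order_relaxed);
	atomic_store_explicit(&st->hi, (uint32_t)(v >> 32), memory_order_relaxed);
	toy_write_end(&st->seq);
}

/* Retry until both halves were read without a concurrent writer. */
static uint64_t toy_stats_read(struct toy_stats *st)
{
	unsigned start;
	uint32_t lo, hi;

	do {
		start = toy_read_begin(&st->seq);
		lo = atomic_load_explicit(&st->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&st->hi, memory_order_relaxed);
	} while (toy_read_retry(&st->seq, start));
	return ((uint64_t)hi << 32) | lo;
}
```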
     
  • Signed-off-by: Weng Meiling
    Signed-off-by: J. Bruce Fields

    Weng Meiling
     

13 Nov, 2013

4 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol-aware
    firewall filtering modules; it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various data structures as
    fundamental operations. For example, sets are supported, and
    therefore one could create a set of whitelisted IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize it before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfilter, amongst other things.

    In particular the taus88 generator is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. The main goals are to use RCU
    lookups even for request sockets and to eliminate listening lock
    contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in its fast path, and an RCU
    conversion was started back in 3.11; several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Make segmentation offloads stackable, in particular to allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer now takes care to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.
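    Item 1 describes nftables as a small virtual machine that loads packet or metadata fields, looks them up in sets, and issues verdicts. The toy interpreter below illustrates only that idea. The opcodes, the single register and the packet struct are hypothetical and do not reflect the real nf_tables expression format. A sorted array with binary search stands in for the kernel's set backends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_verdict { TOY_DROP, TOY_ACCEPT, TOY_CONTINUE };

/* Hypothetical packet metadata the VM can load from. */
struct toy_pkt {
	uint32_t saddr;     /* IPv4 source address, host byte order */
	uint32_t iifindex;  /* arriving interface index */
};

enum toy_op { OP_LOAD_SADDR, OP_LOAD_IIF, OP_CMP_EQ, OP_LOOKUP, OP_VERDICT };

struct toy_insn {
	enum toy_op op;
	uint32_t imm;          /* operand for OP_CMP_EQ, verdict for OP_VERDICT */
	const uint32_t *set;   /* OP_LOOKUP: keys sorted in ascending order */
	size_t set_len;
};

static int toy_set_contains(const uint32_t *set, size_t n, uint32_t key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (set[mid] == key)
			return 1;
		if (set[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Run one rule; a failed comparison or lookup ends the rule with TOY_CONTINUE. */
static enum toy_verdict toy_run(const struct toy_insn *prog, size_t len,
				const struct toy_pkt *pkt)
{
	uint32_t reg = 0;  /* single data register */

	for (size_t i = 0; i < len; i++) {
		const struct toy_insn *in = &prog[i];

		switch (in->op) {
		case OP_LOAD_SADDR:
			reg = pkt->saddr;
			break;
		case OP_LOAD_IIF:
			reg = pkt->iifindex;
			break;
		case OP_CMP_EQ:
			if (reg != in->imm)
				return TOY_CONTINUE;
			break;
		case OP_LOOKUP:
			if (!toy_set_contains(in->set, in->set_len, reg))
				return TOY_CONTINUE;
			break;
		case OP_VERDICT:
			return (enum toy_verdict)in->imm;
		}
	}
	return TOY_CONTINUE;
}
```

    A whitelist rule in this model is three instructions: load the source address, look it up in a set, and return ACCEPT on a hit.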

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
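    The random32 commits above upgrade the generator from taus88 to taus113. The four-component combined Tausworthe step (L'Ecuyer's LFSR113 recurrence) is sketched below as a standalone function. The struct and function names are illustrative, not the kernel's rnd_state/prandom_u32_state() interface. The LCG-based seeding is a simplification that is not taken from the kernel. The per-component minimum seeds are the "seeding requirement" that one of the listed commits corrects.

```c
#include <assert.h>
#include <stdint.h>

struct taus113_state {
	uint32_t s1, s2, s3, s4;
};

/* One Tausworthe component step: s = ((s & c) << d) ^ (((s << a) ^ s) >> b). */
static uint32_t taus_step(uint32_t s, unsigned a, unsigned b, uint32_t c, unsigned d)
{
	return ((s & c) << d) ^ (((s << a) ^ s) >> b);
}

/* Advance all four components and combine them with XOR. */
static uint32_t taus113_next(struct taus113_state *st)
{
	st->s1 = taus_step(st->s1,  6, 13, 0xFFFFFFFEu, 18);
	st->s2 = taus_step(st->s2,  2, 27, 0xFFFFFFF8u,  2);
	st->s3 = taus_step(st->s3, 13, 21, 0xFFFFFFF0u,  7);
	st->s4 = taus_step(st->s4,  3, 12, 0xFFFFFF80u, 13);
	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}

/* Each component needs s1 >= 2, s2 >= 8, s3 >= 16, s4 >= 128, so that the
 * bits kept by its mask are not all zero; bump small values above the minimum. */
static uint32_t taus_seed_min(uint32_t x, uint32_t m)
{
	return x < m ? x + m : x;
}

/* Illustrative seeding only: spread one 32-bit seed with an LCG step. */
static void taus113_seed(struct taus113_state *st, uint32_t seed)
{
	uint32_t x = seed;

	x = x * 69069u + 1u;
	st->s1 = taus_seed_min(x, 2u);
	x = x * 69069u + 1u;
	st->s2 = taus_seed_min(x, 8u);
	x = x * 69069u + 1u;
	st->s3 = taus_seed_min(x, 16u);
	x = x * 69069u + 1u;
	st->s4 = taus_seed_min(x, 128u);
}
```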
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • In cases where an rpc client has a parent hierarchy,
    rpc_free_client may end up calling rpc_release_client() on the
    parent, thus recursing back into rpc_free_client. If the hierarchy
    is deep enough, the stack can simply overflow.

    The fix is to have rpc_release_client() loop so that it can take
    care of the parent rpc client hierarchy without needing to
    recurse.

    Reported-by: Jeff Layton
    Reported-by: Weston Andros Adamson
    Reported-by: Bruce Fields
    Link: http://lkml.kernel.org/r/2C73011F-0939-434C-9E4D-13A1EB1403D7@netapp.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Trond Myklebust
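    The fix replaces recursion up the parent chain with a loop. The pattern is sketched below with a hypothetical refcounted node type. It is not the SUNRPC code: it ignores locking and RCU, and it marks the root with a NULL parent, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted client; each child holds a reference on its parent. */
struct toy_clnt {
	int refcount;
	struct toy_clnt *parent;  /* NULL for the root in this sketch */
};

static int toy_freed;  /* number of clients freed, for demonstration */

static struct toy_clnt *toy_new_client(struct toy_clnt *parent)
{
	struct toy_clnt *c = malloc(sizeof(*c));

	if (!c)
		abort();
	c->refcount = 1;           /* the creator's reference */
	c->parent = parent;
	if (parent)
		parent->refcount++;  /* child pins its parent */
	return c;
}

/* Free one client and hand back the parent whose reference it held. */
static struct toy_clnt *toy_free_client(struct toy_clnt *clnt)
{
	struct toy_clnt *parent = clnt->parent;

	free(clnt);
	toy_freed++;
	return parent;
}

/* Iterative release: drop the parent's reference in a loop instead of recursing. */
static void toy_release_client(struct toy_clnt *clnt)
{
	while (clnt != NULL && --clnt->refcount == 0)
		clnt = toy_free_client(clnt);
}
```

    Stack usage stays constant however deep the hierarchy is, because each iteration moves to the parent instead of adding a new frame.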
     
  • Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

12 Nov, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijlstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
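    The NEED_RESCHED folding works roughly as follows, as on x86. The flag is kept inverted in the top bit of preempt_count, so a single "decrement and test for zero" in preempt_enable() answers both questions at once: is preemption enabled again, and is a reschedule pending? The model below uses hypothetical names and ignores per-CPU storage and atomicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stored inverted: bit set means NO reschedule is needed. */
#define TOY_NEED_RESCHED 0x80000000u

/* Nesting depth 0, no reschedule pending. */
static uint32_t toy_preempt_count = TOY_NEED_RESCHED;

static void toy_set_need_resched(void)   { toy_preempt_count &= ~TOY_NEED_RESCHED; }
static void toy_clear_need_resched(void) { toy_preempt_count |= TOY_NEED_RESCHED; }

/* Nesting depth as other code sees it: the folded bit is masked off. */
static uint32_t toy_preempt_depth(void)
{
	return toy_preempt_count & ~TOY_NEED_RESCHED;
}

static void toy_preempt_disable(void)
{
	toy_preempt_count++;
}

/* Returns true exactly when depth reached 0 AND a reschedule is pending. */
static bool toy_preempt_enable(void)
{
	return --toy_preempt_count == 0;
}
```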
     

11 Nov, 2013

4 commits

  • Fix a suspicious RCU dereference warning.

    Cc: Florent Fourcot
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • This is to avoid very silly Kconfig dependencies for modules
    using this routine.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pushing original fragments through causes several problems. For example,
    fragments may not be matched correctly. Take the following
    example:

    On HOSTA do:
    ip6tables -I INPUT -p icmpv6 -j DROP
    ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT

    and on HOSTB do:
    ping6 HOSTA -s2000 (MTU is 1500)

    Incoming echo requests will be filtered out on HOSTA. This issue does
    not occur with packets smaller than the MTU (where fragmentation does not happen).

    As was discussed previously, the only correct solution seems to be to use
    the reassembled skb instead of separate frags. Doing this has positive side
    effects: sk_buff shrinks by one pointer (nfct_reasm), and the reasm
    dances in ipvs and conntrack can be removed.

    Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
    entirely and use code in net/ipv6/reassembly.c instead.

    Signed-off-by: Jiri Pirko
    Acked-by: Julian Anastasov
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • If the reassembled packet would fit into the output device's MTU, it is
    not fragmented according to the original fragment size and is sent as a
    single big packet.

    The second case is when the skb is GSO. In that case, too, fragmentation
    does not happen according to the original fragment size.

    This patch fixes both cases.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

10 Nov, 2013

1 commit

  • With psched_ratecfg_precompute(), tbf can deal with 64-bit rates.
    Add two new attributes so that tc can use them to break the 32-bit
    limit.

    Signed-off-by: Yang Yingliang
    Suggested-by: Sergei Shtylyov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yang Yingliang
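    The new attributes (TCA_TBF_RATE64 and TCA_TBF_PRATE64 in mainline) carry a 64-bit rate alongside the legacy 32-bit bytes-per-second field, which tops out just above 4.29 GB/s. The selection logic is sketched below using a hypothetical struct in place of the parsed netlink attributes. The userspace encoder saturates the legacy field and sends the 64-bit value only when it is needed. That behavior is an assumption about how tc uses the attributes, not a documented contract.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parsed config: legacy 32-bit rate plus optional 64-bit override. */
struct toy_tbf_cfg {
	uint32_t rate32;   /* bytes per second, legacy field */
	bool has_rate64;   /* was the 64-bit attribute present? */
	uint64_t rate64;   /* bytes per second */
};

/* Kernel side: prefer the 64-bit attribute when supplied. */
static uint64_t toy_effective_rate(const struct toy_tbf_cfg *cfg)
{
	return cfg->has_rate64 ? cfg->rate64 : cfg->rate32;
}

/* Userspace side: add the 64-bit attribute only when 32 bits overflow. */
static struct toy_tbf_cfg toy_encode_rate(uint64_t bytes_per_sec)
{
	struct toy_tbf_cfg cfg = { 0 };

	if (bytes_per_sec > UINT32_MAX) {
		cfg.rate32 = UINT32_MAX;  /* saturated value for older kernels */
		cfg.has_rate64 = true;
		cfg.rate64 = bytes_per_sec;
	} else {
		cfg.rate32 = (uint32_t)bytes_per_sec;
	}
	return cfg;
}
```

    For example, 40 Gbit/s is 5,000,000,000 bytes per second, which no longer fits in the legacy field.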
     

09 Nov, 2013

8 commits

  • The following scenario can cause silent data corruption when doing
    NFS writes. It has mainly been observed when doing database writes
    using O_DIRECT.

    1) The RPC client uses sendpage() to do zero-copy of the page data.

    2) Due to networking issues, the reply from the server is delayed,
    and so the RPC client times out.

    3) The client issues a second sendpage of the page data as part of
    an RPC call retransmission.

    4) The reply to the first transmission arrives from the server
    _before_ the client hardware has emptied the TCP socket send
    buffer.

    5) After processing the reply, the RPC state machine rules that
    the call is done, and triggers the completion callbacks.

    6) The application notices the RPC call is done, and reuses the
    pages to store something else (e.g. a new write).

    7) The client NIC drains the TCP socket send buffer. Since the
    page data has now changed, it reads a corrupted version of the
    initial RPC call, and puts it on the wire.

    This patch fixes the problem in the following manner:

    The ordering guarantees of TCP ensure that when the server sends a
    reply, we know that the _first_ transmission has completed. Using
    zero-copy in that situation is therefore safe.
    If a timeout occurs, we send the retransmission using sendmsg()
    (i.e. no zero-copy). We then know that the socket contains a full copy of
    the data, and so it will retransmit a faithful reproduction even if the
    RPC call completes and the application reuses the O_DIRECT buffer in
    the meantime.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
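    The rule the fix adopts is zero-copy on the first transmission and a copying send on any retransmission. It can be illustrated with a toy model in which a queued send either references the caller's page or holds its own snapshot. Everything here is hypothetical and no socket calls are made. The `first` case shows what a zero-copy send would put on the wire if the buffer were reused while the bytes were still queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE 16

/* Hypothetical queued send: a reference to caller memory, or a private copy. */
struct toy_queued {
	const char *ref;      /* zero-copy: points at the caller's buffer */
	char copy[TOY_PAGE];  /* copying send: snapshot taken at queue time */
	bool zero_copy;
};

/* First transmission may be zero-copy; retransmissions copy into the socket. */
static void toy_queue_send(struct toy_queued *q, const char *page, bool is_retransmit)
{
	q->zero_copy = !is_retransmit;
	if (q->zero_copy) {
		q->ref = page;
	} else {
		memcpy(q->copy, page, TOY_PAGE);
		q->ref = NULL;
	}
}

/* What the NIC eventually puts on the wire when it drains the queue. */
static const char *toy_wire_bytes(const struct toy_queued *q)
{
	return q->zero_copy ? q->ref : q->copy;
}
```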
     
  • As RFC 4191 says, the Router Preference and Lifetime values in a
    ::/0 Route Information Option should override the preference and lifetime
    values in the Router Advertisement header. But when the kernel deals with
    a ::/0 Route Information Option, rt6_get_route_info() always returns
    NULL, which means that the override never happens, because those default
    routers were added without the RTF_ROUTEINFO flag in rt6_add_dflt_router().

    To handle that case, we should call rt6_get_dflt_router()
    when the prefix length is 0.

    Signed-off-by: Duan Jiong
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Duan Jiong
     
  • Commit 0628b123c96d ("netfilter: nfnetlink: add batch support and use it
    from nf_tables") introduced a bug leading to various crashes in netlink_ack
    when a netlink message with an invalid nlmsg_len was sent by an unprivileged
    user.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • When trying to delete a table >= 256 using iproute2, the local table
    is deleted instead.
    The table id is specified as a netlink attribute when it needs more than
    8 bits, and iproute2 then sets the table field to RT_TABLE_UNSPEC (0).
    The preconditions for matching the table id in the rule delete code
    don't seem to take the "table id in netlink attribute" case into account,
    so the frh_get_table helper function never gets to do its job when
    matching against the current rule.
    Use the helper function twice instead of peeking at the table value directly.

    Originally reported at: http://bugs.debian.org/724783

    Reported-by: Nicolas HICHER
    Signed-off-by: Andreas Henriksson
    Signed-off-by: David S. Miller

    Andreas Henriksson
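    The table id can live in two places: the 8-bit header field, or a 32-bit netlink attribute, in which case the header field is RT_TABLE_UNSPEC (0). The bug and the fix are modeled below with a helper in the spirit of frh_get_table(). The structs are hypothetical simplifications. The real helper takes a struct fib_rule_hdr and an attribute array. Table 255 is RT_TABLE_LOCAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RT_TABLE_UNSPEC 0
#define TOY_RT_TABLE_LOCAL  255

/* Hypothetical delete request: 8-bit header field plus optional attribute. */
struct toy_rule_req {
	uint8_t hdr_table;
	bool has_attr_table;
	uint32_t attr_table;
};

/* Modeled on frh_get_table(): prefer the attribute, else the header field. */
static uint32_t toy_get_table(const struct toy_rule_req *req)
{
	return req->has_attr_table ? req->attr_table : req->hdr_table;
}

/* Buggy match: the presence test peeks at the header field directly. */
static bool toy_match_buggy(const struct toy_rule_req *req, uint32_t rule_table)
{
	if (req->hdr_table && toy_get_table(req) != rule_table)
		return false;
	return true;  /* header field 0: the table is never compared */
}

/* Fixed match: use the helper for both the presence test and the comparison. */
static bool toy_match_fixed(const struct toy_rule_req *req, uint32_t rule_table)
{
	uint32_t table = toy_get_table(req);

	if (table != TOY_RT_TABLE_UNSPEC && table != rule_table)
		return false;
	return true;
}
```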
     
  • Take ip6_fl_lock before reading and updating
    a label.

    v2: protect only the relevant code

    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Florent Fourcot
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • While the latest RFC 6437 gives no constraints
    on the lifetime of flow labels, the previous RFC 3697
    specified a minimum of 120 seconds before a flow
    label may be reassigned.

    The maximum linger is currently set to 60 seconds,
    which does not allow this configuration without
    the CAP_NET_ADMIN capability.

    This patch increases the maximum linger to 150
    seconds, allowing more flexibility to standard
    users.

    Signed-off-by: Florent Fourcot
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • It is already possible to set/put/renew a label
    with IPV6_FLOWLABEL_MGR and setsockopt. This patch
    adds the possibility to get information about this
    label (current value, time before expiration, etc).

    This helps applications decide whether to renew
    or release the label.

    v2:
    * Add spin_lock to prevent race condition
    * return -ENOENT if no result found
    * check if flr_action is GET

    v3:
    * move the spin_lock to protect only the
    relevant code

    Signed-off-by: Florent Fourcot
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • By moving code around, we avoid :

    1) A reload of iph->ihl (bit field, so needs a mask)

    2) A conditional test (replaced by a conditional mov on x86)
    Fast path loads iph->protocol anyway.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Nov, 2013

12 commits

  • …wireless-next into for-davem

    John W. Linville
     
  • While testing virtio_net and skb_segment() changes, Hannes reported
    that UFO was sending wrong frames.

    It appears this was introduced by a recent commit :
    8c3a897bfab1 ("inet: restore gso for vxlan")

    The old condition to perform IP frag was :

    tunnel = !!skb->encapsulation;
    ...
    if (!tunnel && proto == IPPROTO_UDP) {

    So the new one should be :

    udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
    ...
    if (udpfrag) {

    Initialization of udpfrag must be done before the call
    to ops->callbacks.gso_segment(skb, features), as
    skb_udp_tunnel_segment() clears skb->encapsulation.

    (We want udpfrag to be true for UFO, false for VXLAN)

    With help from Alexei Starovoitov

    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Eric Dumazet
    Cc: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
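    The ordering constraint matters because the tunnel segmentation callback clears skb->encapsulation. If udpfrag is computed after that call, a VXLAN packet, whose outer protocol is UDP, looks exactly like a UFO packet. The toy model below shows the difference. toy_skb and the toy_ functions are hypothetical, not the real sk_buff or inet_gso_segment().

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_IPPROTO_UDP 17

struct toy_skb {
	bool encapsulation;
};

/* Stands in for skb_udp_tunnel_segment(): it clears the encapsulation flag. */
static void toy_tunnel_segment(struct toy_skb *skb)
{
	skb->encapsulation = false;
}

/* Correct order: decide on IP fragmentation before the callback runs. */
static bool toy_udpfrag_fixed(struct toy_skb *skb, int proto)
{
	bool udpfrag = !skb->encapsulation && proto == TOY_IPPROTO_UDP;

	if (skb->encapsulation)
		toy_tunnel_segment(skb);
	return udpfrag;
}

/* Wrong order: the flag is already cleared, so VXLAN is treated as UFO. */
static bool toy_udpfrag_late(struct toy_skb *skb, int proto)
{
	if (skb->encapsulation)
		toy_tunnel_segment(skb);
	return !skb->encapsulation && proto == TOY_IPPROTO_UDP;
}
```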
     
  • Use "@" to refer to parameters in the kernel-doc description. According
    to Documentation/kernel-doc-nano-HOWTO.txt "&" shall be used to refer to
    structures only.

    Signed-off-by: Mathias Krause
    Cc: "David S. Miller"
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • Also remove the warning for fragmented packets -- skb_cow_data() will
    linearize the buffer, removing all fragments.

    Signed-off-by: Mathias Krause
    Cc: Dmitry Tarnyagin
    Cc: "David S. Miller"
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • This function has uses beyond IPsec, so move it to the core skbuff code.
    While doing so, give it some documentation and change its return type to
    'unsigned char *' to be in line with skb_put().

    Signed-off-by: Mathias Krause
    Cc: Steffen Klassert
    Cc: "David S. Miller"
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • Add an operations structure that allows a network interface to export
    the fact that it supports packet forwarding in hardware between
    physical interfaces and other mac layer devices assigned to it (such
    as macvlans). This operations structure can be used by virtual mac
    devices to bypass software switching so that forwarding can be done
    in hardware more efficiently.

    Signed-off-by: John Fastabend
    Signed-off-by: Neil Horman
    CC: Andy Gospodarek
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    John Fastabend
     
  • We recently added a new error path and it needs a dev_put().

    Fixes: 7adac1ec8198 ('6lowpan: Only make 6lowpan links to IEEE802154 devices')
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
     
  • Provide a method for read-only access to the vlan device egress mapping.

    Do this by refactoring vlan_dev_get_egress_qos_mask() so that it now
    receives the skb priority as an argument instead of a pointer to the skb.

    Such an access is needed for the IBoE stack where the control plane
    goes through the network stack. This is an add-on step on top of commit
    d4a968658c "net/route: export symbol ip_tos2prio" which allowed the RDMA-CM
    to use ip_tos2prio.

    Signed-off-by: Eyal Perry
    Signed-off-by: Hadar Hen Zion
    Signed-off-by: David S. Miller

    Eyal Perry
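    Taking the priority value instead of an skb means a control-plane caller with no packet in hand can still query the mapping. The lookup is sketched below as a small hash of chained entries keyed by priority & 0xF, with each entry holding the PCP bits already shifted into TCI position. The names are hypothetical. The sketch does not handle updating an existing entry or the memory barriers the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical egress map entry: skb priority -> VLAN TCI priority bits. */
struct toy_prio_map {
	uint32_t priority;
	uint16_t vlan_qos;          /* PCP already shifted into TCI position (<< 13) */
	struct toy_prio_map *next;
};

struct toy_vlan_dev {
	struct toy_prio_map *egress_map[16];  /* hashed by priority & 0xF */
};

static void toy_set_egress(struct toy_vlan_dev *dev, struct toy_prio_map *mp,
			   uint32_t priority, unsigned pcp)
{
	mp->priority = priority;
	mp->vlan_qos = (uint16_t)((pcp & 0x7u) << 13);
	mp->next = dev->egress_map[priority & 0xF];
	dev->egress_map[priority & 0xF] = mp;
}

/* Takes the priority value, not an skb; returns 0 when no mapping exists. */
static uint16_t toy_get_egress_qos_mask(const struct toy_vlan_dev *dev, uint32_t skprio)
{
	for (const struct toy_prio_map *mp = dev->egress_map[skprio & 0xF]; mp; mp = mp->next)
		if (mp->priority == skprio)
			return mp->vlan_qos;
	return 0;
}
```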
     
  • If appending a received fragment to the pending fragment chain
    in a unicast link fails, the current code tries to force a retransmission
    of the fragment by decrementing the 'next received sequence number'
    field in the link. This is done under the assumption that the failure
    is caused by an out-of-memory situation, an assumption that does
    not hold true after the previous patch in this series.

    A failure to append a fragment can now only be caused by a protocol
    violation by the sending peer, and it must hence be assumed that it
    is either malicious or buggy. Either way, the correct behavior is now
    to reset the link instead of trying to revert its sequence number.
    So, this is what we do in this commit.

    Signed-off-by: Erik Hugne
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • When the first fragment of a long data message is received on a link, a
    reassembly buffer large enough to hold the data from this and all subsequent
    fragments of the message is allocated. The payload of each new fragment is
    copied into this buffer upon arrival. When the last fragment is received, the
    reassembled message is delivered upwards to the port/socket layer.

    Not only is this an inefficient approach, but it may also cause bursts of
    reassembly failures in low memory situations, since we may fail to allocate
    the necessary large buffer in the first place. Furthermore, after 100 subsequent
    such failures the link will be reset, something that in reality aggravates the
    situation.

    To remedy this problem, this patch introduces a different approach. Instead of
    allocating a big reassembly buffer, we now append the arriving fragments
    to a reassembly chain on the link, and deliver the whole chain up to the
    socket layer once the last fragment has been received. This is safe because
    the retransmission layer of a TIPC link always delivers packets in strict
    uninterrupted order, to the reassembly layer as to all other upper layers.
    Hence there can never be more than one fragment chain pending reassembly at
    any given time in a link, and we can trust (but still verify) that the
    fragments will be chained up in the correct order.

    Signed-off-by: Erik Hugne
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
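    Because the link delivers fragments in strict order, reassembly reduces to appending each buffer to a single per-link chain and handing the whole chain up when the last fragment arrives. The sketch below works under that assumption. The types and the FIRST/MIDDLE/LAST tagging are hypothetical simplifications of TIPC's fragment headers. An unexpected fragment is reported as a protocol error, after which a caller would reset the link, in line with the companion patch above.

```c
#include <assert.h>
#include <stddef.h>

enum toy_frag_type { FRAG_FIRST, FRAG_MIDDLE, FRAG_LAST };

struct toy_buf {
	enum toy_frag_type type;
	size_t len;
	struct toy_buf *next;
};

/* Per-link reassembly state: at most one chain is pending at a time. */
struct toy_link {
	struct toy_buf *head, *tail;
};

enum toy_result { TOY_PENDING, TOY_COMPLETE, TOY_PROTO_ERR };

/* Append one fragment; on TOY_COMPLETE, *out receives the whole chain. */
static enum toy_result toy_recv_fragm(struct toy_link *l, struct toy_buf *b,
				      struct toy_buf **out)
{
	b->next = NULL;
	if (b->type == FRAG_FIRST) {
		if (l->head)  /* a chain is already pending: buggy or malicious peer */
			return TOY_PROTO_ERR;
		l->head = l->tail = b;
		return TOY_PENDING;
	}
	if (!l->head)         /* middle or last fragment with no chain started */
		return TOY_PROTO_ERR;
	l->tail->next = b;
	l->tail = b;
	if (b->type == FRAG_LAST) {
		*out = l->head;
		l->head = l->tail = NULL;
		return TOY_COMPLETE;
	}
	return TOY_PENDING;
}

/* Total payload carried by a delivered chain. */
static size_t toy_chain_len(const struct toy_buf *b)
{
	size_t total = 0;

	for (; b; b = b->next)
		total += b->len;
	return total;
}
```

    No large contiguous buffer is ever allocated, so a low-memory situation no longer causes a burst of reassembly failures.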
     
  • When a message fragment is received in a broadcast or unicast link,
    the reception code will append the fragment payload to a big reassembly
    buffer through a call to the function tipc_recv_fragm(). However, after
    the return of that call, the logic goes on and passes the fragment
    buffer to the function tipc_net_route_msg(), which will simply drop it.
    This behavior is a remnant from the now obsolete multi-cluster
    functionality, and has no relevance in the current code base.

    Although currently harmless, this unnecessary call would be fatal
    after applying the next patch in this series, which introduces
    a completely new reassembly algorithm. So we change the code to
    eliminate the redundant call.

    Signed-off-by: Erik Hugne
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    - Changes to the RPC socket code to allow NFSv4 to turn off
    timeout+retry:
    * Detect TCP connection breakage through the "keepalive" mechanism
    - Add client side support for NFSv4.x migration (Chuck Lever)
    - Add support for multiple security flavour arguments to the "sec="
    mount option (Dros Adamson)
    - fs-cache bugfixes from David Howells:
    * Fix an issue whereby caching can be enabled on a file that is
    open for writing
    - More NFSv4 open code stable bugfixes
    - Various Labeled NFS (selinux) bugfixes, including one stable fix
    - Fix buffer overflow checking in the RPCSEC_GSS upcall encoding"

    * tag 'nfs-for-3.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
    NFSv4.2: Remove redundant checks in nfs_setsecurity+nfs4_label_init_security
    NFSv4: Sanity check the server reply in _nfs4_server_capabilities
    NFSv4.2: encode_readdir - only ask for labels when doing readdirplus
    nfs: set security label when revalidating inode
    NFSv4.2: Fix a mismatch between Linux labeled NFS and the NFSv4.2 spec
    NFS: Fix a missing initialisation when reading the SELinux label
    nfs: fix oops when trying to set SELinux label
    nfs: fix inverted test for delegation in nfs4_reclaim_open_state
    SUNRPC: Cleanup xs_destroy()
    SUNRPC: close a rare race in xs_tcp_setup_socket.
    SUNRPC: remove duplicated include from clnt.c
    nfs: use IS_ROOT not DCACHE_DISCONNECTED
    SUNRPC: Fix buffer overflow checking in gss_encode_v0_msg/gss_encode_v1_msg
    SUNRPC: gss_alloc_msg - choose _either_ a v0 message or a v1 message
    SUNRPC: remove an unnecessary if statement
    nfs: Use PTR_ERR_OR_ZERO in 'nfs/nfs4super.c'
    nfs: Use PTR_ERR_OR_ZERO in 'nfs41_callback_up' function
    nfs: Remove useless 'error' assignment
    sunrpc: comment typo fix
    SUNRPC: Add correct rcu_dereference annotation in rpc_clnt_set_transport
    ...

    Linus Torvalds
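The TIPC change above removes a call that handed a fragment, already given to tipc_recv_fragm(), on to tipc_net_route_msg(), which drops it. Below is a minimal C sketch of why that ordering matters. It assumes something the commit does not state: that the new reassembly algorithm keeps fragment buffers on a chain instead of copying their payload. Under that assumption, the receive path must not pass a fragment anywhere else once it has been handed to reassembly. All names below (struct frag, struct reasm, recv_fragm, reasm_collect, make_frag, link_recv) are hypothetical and are not TIPC's actual API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: a fragment buffer and a reassembly chain that takes
 * ownership of fragments instead of copying them. Not TIPC's real API. */
struct frag {
    struct frag *next;
    size_t len;
    bool last;              /* set on the final fragment of a message */
    char data[16];
};

struct reasm {
    struct frag *head, *tail;
    size_t total;           /* bytes accumulated so far */
};

/* Appends 'f' to the chain; from here on the chain owns 'f'.
 * Returns true when the message is complete. */
static bool recv_fragm(struct reasm *r, struct frag *f)
{
    f->next = NULL;
    if (r->tail)
        r->tail->next = f;
    else
        r->head = f;
    r->tail = f;
    r->total += f->len;
    return f->last;
}

/* Copies a completed message into 'out' and frees the chained fragments. */
static size_t reasm_collect(struct reasm *r, char *out, size_t cap)
{
    size_t n = 0;
    struct frag *f = r->head;

    while (f) {
        struct frag *next = f->next;
        assert(n + f->len <= cap);  /* sketch: caller supplies enough room */
        memcpy(out + n, f->data, f->len);
        n += f->len;
        free(f);
        f = next;
    }
    r->head = r->tail = NULL;
    r->total = 0;
    return n;
}

/* Allocates a fragment holding the bytes of 's' (at most 16). */
static struct frag *make_frag(const char *s, bool last)
{
    struct frag *f = calloc(1, sizeof(*f));

    assert(f);
    f->len = strlen(s);
    assert(f->len <= sizeof(f->data));
    memcpy(f->data, s, f->len);
    f->last = last;
    return f;
}

/* Link reception path: once a fragment is handed to reassembly, the caller
 * must not touch it again. Passing it on to a route-or-drop function here,
 * as the old code did, would free a buffer the chain still references. */
static bool link_recv(struct reasm *r, struct frag *f)
{
    return recv_fragm(r, f);    /* no further routing of 'f' */
}
```

For example, feeding "hello, " and then a final "world" fragment through link_recv() and calling reasm_collect() yields the 12-byte message "hello, world". Adding a free-or-route call on the fragment after recv_fragm() returns would instead leave a dangling pointer in the chain.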
     

07 Nov, 2013

1 commit

  • Pull driver core / sysfs patches from Greg KH:
    "Here's the big driver core / sysfs update for 3.13-rc1.

    There's lots of dev_groups updates for different subsystems, as they
    all get slowly migrated over to the safe versions of the attribute
    groups (removing userspace races with the creation of the sysfs
    files.) Also in here are some kobject updates, devres expansions, and
    the first round of Tejun's sysfs reworking to enable it to be used by
    other subsystems as a backend for an in-kernel filesystem.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
    sysfs: rename sysfs_assoc_lock and explain what it's about
    sysfs: use generic_file_llseek() for sysfs_file_operations
    sysfs: return correct error code on unimplemented mmap()
    mdio_bus: convert bus code to use dev_groups
    device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
    sysfs: separate out dup filename warning into a separate function
    sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
    sysfs: remove unused sysfs_get_dentry() prototype
    sysfs: honor bin_attr.attr.ignore_lockdep
    sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
    devres: restore zeroing behavior of devres_alloc()
    sysfs: fix sysfs_write_file for bin file
    input: gameport: convert bus code to use dev_groups
    input: serio: remove bus usage of dev_attrs
    input: serio: use DEVICE_ATTR_RO()
    i2o: convert bus code to use dev_groups
    memstick: convert bus code to use dev_groups
    tifm: convert bus code to use dev_groups
    virtio: convert bus code to use dev_groups
    ipack: convert bus code to use dev_groups
    ...

    Linus Torvalds
     

06 Nov, 2013

2 commits

  • While enabling lockdep on seqlocks, I ran across the warning below
    caused by the ipv6 stats being updated in both irq and non-irq context.

    This patch changes from IP6_INC_STATS_BH to IP6_INC_STATS (suggested
    by Eric Dumazet) to resolve this problem.

    [ 11.120383] =================================
    [ 11.121024] [ INFO: inconsistent lock state ]
    [ 11.121663] 3.12.0-rc1+ #68 Not tainted
    [ 11.122229] ---------------------------------
    [ 11.122867] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    [ 11.123741] init/4483 [HC0[0]:SC1[3]:HE1:SE0] takes:
    [ 11.124505] (&stats->syncp.seq#6){+.?...}, at: [] ndisc_send_ns+0xe2/0x130
    [ 11.125736] {SOFTIRQ-ON-W} state was registered at:
    [ 11.126447] [] __lock_acquire+0x5c7/0x1af0
    [ 11.127222] [] lock_acquire+0x96/0xd0
    [ 11.127925] [] write_seqcount_begin+0x33/0x40
    [ 11.128766] [] ip6_dst_lookup_tail+0x3a3/0x460
    [ 11.129582] [] ip6_dst_lookup_flow+0x2e/0x80
    [ 11.130014] [] ip6_datagram_connect+0x150/0x4e0
    [ 11.130014] [] inet_dgram_connect+0x25/0x70
    [ 11.130014] [] SYSC_connect+0xa1/0xc0
    [ 11.130014] [] SyS_connect+0x11/0x20
    [ 11.130014] [] SyS_socketcall+0x12b/0x300
    [ 11.130014] [] syscall_call+0x7/0xb
    [ 11.130014] irq event stamp: 1184
    [ 11.130014] hardirqs last enabled at (1184): [] local_bh_enable+0x71/0x110
    [ 11.130014] hardirqs last disabled at (1183): [] local_bh_enable+0x3d/0x110
    [ 11.130014] softirqs last enabled at (0): [] copy_process.part.42+0x45d/0x11a0
    [ 11.130014] softirqs last disabled at (1147): [] irq_exit+0xa5/0xb0
    [ 11.130014]
    [ 11.130014] other info that might help us debug this:
    [ 11.130014] Possible unsafe locking scenario:
    [ 11.130014]
    [ 11.130014] CPU0
    [ 11.130014] ----
    [ 11.130014] lock(&stats->syncp.seq#6);
    [ 11.130014] <Interrupt>
    [ 11.130014] lock(&stats->syncp.seq#6);
    [ 11.130014]
    [ 11.130014] *** DEADLOCK ***
    [ 11.130014]
    [ 11.130014] 3 locks held by init/4483:
    [ 11.130014] #0: (rcu_read_lock){.+.+..}, at: [] SyS_setpriority+0x4c/0x620
    [ 11.130014] #1: (((&ifa->dad_timer))){+.-...}, at: [] call_timer_fn+0x0/0xf0
    [ 11.130014] #2: (rcu_read_lock){.+.+..}, at: [] ndisc_send_skb+0x54/0x5d0
    [ 11.130014]
    [ 11.130014] stack backtrace:
    [ 11.130014] CPU: 0 PID: 4483 Comm: init Not tainted 3.12.0-rc1+ #68
    [ 11.130014] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 11.130014] 00000000 00000000 c55e5c10 c1bb0e71 c57128b0 c55e5c4c c1badf79 c1ec1123
    [ 11.130014] c1ec1484 00001183 00000000 00000000 00000001 00000003 00000001 00000000
    [ 11.130014] c1ec1484 00000004 c5712dcc 00000000 c55e5c84 c10de492 00000004 c10755f2
    [ 11.130014] Call Trace:
    [ 11.130014] [] dump_stack+0x4b/0x66
    [ 11.130014] [] print_usage_bug+0x1d3/0x1dd
    [ 11.130014] [] mark_lock+0x282/0x2f0
    [ 11.130014] [] ? kvm_clock_read+0x22/0x30
    [ 11.130014] [] ? check_usage_backwards+0x150/0x150
    [ 11.130014] [] __lock_acquire+0x584/0x1af0
    [ 11.130014] [] ? sched_clock_cpu+0xef/0x190
    [ 11.130014] [] ? mark_held_locks+0x8c/0xf0
    [ 11.130014] [] lock_acquire+0x96/0xd0
    [ 11.130014] [] ? ndisc_send_ns+0xe2/0x130
    [ 11.130014] [] ndisc_send_skb+0x293/0x5d0
    [ 11.130014] [] ? ndisc_send_ns+0xe2/0x130
    [ 11.130014] [] ndisc_send_ns+0xe2/0x130
    [ 11.130014] [] ? mod_timer+0xf2/0x160
    [ 11.130014] [] ? addrconf_dad_timer+0xce/0x150
    [ 11.130014] [] addrconf_dad_timer+0x10a/0x150
    [ 11.130014] [] ? addrconf_dad_completed+0x1c0/0x1c0
    [ 11.130014] [] call_timer_fn+0x73/0xf0
    [ 11.130014] [] ? __internal_add_timer+0xb0/0xb0
    [ 11.130014] [] ? addrconf_dad_completed+0x1c0/0x1c0
    [ 11.130014] [] run_timer_softirq+0x141/0x1e0
    [ 11.130014] [] ? __do_softirq+0x70/0x1b0
    [ 11.130014] [] __do_softirq+0xc0/0x1b0
    [ 11.130014] [] irq_exit+0xa5/0xb0
    [ 11.130014] [] smp_apic_timer_interrupt+0x35/0x50
    [ 11.130014] [] apic_timer_interrupt+0x32/0x38
    [ 11.130014] [] ? SyS_setpriority+0xfd/0x620
    [ 11.130014] [] ? lock_release+0x9/0x240
    [ 11.130014] [] ? SyS_setpriority+0xe7/0x620
    [ 11.130014] [] ? _raw_read_unlock+0x1d/0x30
    [ 11.130014] [] SyS_setpriority+0x111/0x620
    [ 11.130014] [] ? SyS_setpriority+0x4c/0x620
    [ 11.130014] [] syscall_call+0x7/0xb

    Signed-off-by: John Stultz
    Acked-by: Eric Dumazet
    Signed-off-by: Peter Zijlstra
    Cc: Alexey Kuznetsov
    Cc: "David S. Miller"
    Cc: Hideaki YOSHIFUJI
    Cc: James Morris
    Cc: Mathieu Desnoyers
    Cc: Patrick McHardy
    Cc: Steven Rostedt
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/1381186321-4906-5-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     
  • In order to enable lockdep on seqcount/seqlock structures, we
    must explicitly initialize any locks.

    The u64_stats_sync structure uses a seqcount, and thus we need
    to introduce a u64_stats_init() function and use it to initialize
    the structure.

    This unfortunately adds a lot of fairly trivial initialization code
    to a number of drivers. But the benefit of ensuring correctness makes
    this worthwhile.

    Because these changes are required for lockdep to be enabled, and the
    changes are quite trivial, I've not yet split this patch out into 30-some
    separate patches, as I figured it would be better to get the various
    maintainers' thoughts on how to best merge this change along with
    the seqcount lockdep enablement.

    Feedback would be appreciated!

    Signed-off-by: John Stultz
    Acked-by: Julian Anastasov
    Signed-off-by: Peter Zijlstra
    Cc: Alexey Kuznetsov
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Hideaki YOSHIFUJI
    Cc: James Morris
    Cc: Jesse Gross
    Cc: Mathieu Desnoyers
    Cc: "Michael S. Tsirkin"
    Cc: Mirko Lindner
    Cc: Patrick McHardy
    Cc: Roger Luethi
    Cc: Rusty Russell
    Cc: Simon Horman
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Thomas Petazzoni
    Cc: Wensong Zhang
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
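Both 06 Nov patches concern seqcount-protected 64-bit statistics. The C model below is loosely patterned on u64_stats_sync but is not the kernel implementation. It is single-threaded, models no memory barriers, and simulates softirqs with direct calls. Within those limits it illustrates two points. First, the sync object must be explicitly initialized; a hypothetical `initialized` flag stands in for the state lockdep needs. Second, a writer running in softirq context must not interrupt a write section that process context opened with softirqs enabled. All identifiers (struct stats_sync, stats_init, stats_update_begin, process_inc, softirq_inc, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-threaded model of a seqcount-protected counter,
 * loosely patterned on u64_stats_sync. Not the kernel implementation. */
struct stats_sync {
    unsigned seq;           /* even: stable, odd: writer in progress */
    bool initialized;       /* stands in for state lockdep needs */
};

struct ns_stats {
    uint64_t out_msgs;
    struct stats_sync syncp;
};

static void stats_init(struct stats_sync *s)
{
    s->seq = 0;
    s->initialized = true;
}

static void stats_update_begin(struct stats_sync *s)
{
    assert(s->initialized); /* reject use of an uninitialized sync */
    s->seq++;
}

static void stats_update_end(struct stats_sync *s)
{
    s->seq++;
}

static unsigned stats_fetch_begin(const struct stats_sync *s)
{
    return s->seq;
}

/* Retry if a writer was active at fetch_begin (odd) or ran since. */
static bool stats_fetch_retry(const struct stats_sync *s, unsigned start)
{
    return (start & 1) || s->seq != start;
}

/* Reader: loops until it gets a snapshot not overlapped by a writer. */
static uint64_t read_out_msgs(const struct ns_stats *st)
{
    unsigned start;
    uint64_t v;

    do {
        start = stats_fetch_begin(&st->syncp);
        v = st->out_msgs;
    } while (stats_fetch_retry(&st->syncp, start));
    return v;
}

static unsigned seq_seen_by_nested_writer;

/* Writer running from a simulated softirq (e.g. a timer sending an NS). */
static void softirq_inc(struct ns_stats *st)
{
    stats_update_begin(&st->syncp);
    seq_seen_by_nested_writer = st->syncp.seq; /* should be odd here */
    st->out_msgs++;
    stats_update_end(&st->syncp);
}

/* Process-context increment. bh_off models disabling softirqs, so the
 * interrupting update is deferred until the write section is closed. */
static void process_inc(struct ns_stats *st, bool bh_off)
{
    stats_update_begin(&st->syncp);
    if (!bh_off)
        softirq_inc(st);    /* softirq preempts the open write section */
    st->out_msgs++;
    stats_update_end(&st->syncp);
    if (bh_off)
        softirq_inc(st);    /* deferred softirq runs after re-enable */
}
```

With bh_off false, the nested writer observes an even sequence number while the outer update is still open, so a reader sampling at that moment would wrongly conclude that no update is in progress. With bh_off true, the interrupting update only runs after stats_update_end(), and the even/odd invariant holds.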