31 Jul, 2016

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable bugfixes:
    - nfs: don't create zero-length requests

    - several LAYOUTGET bugfixes

    Features:
    - several performance related features

    - more aggressive caching when we can rely on close-to-open
    cache consistency

    - remove serialisation of O_DIRECT reads and writes

    - optimise several code paths to not flush to disk unnecessarily.

    However allow for the idiosyncrasies of pNFS for those layout
    types that need to issue a LAYOUTCOMMIT before the metadata can
    be updated on the server.

    - SUNRPC updates to the client data receive path

    - pNFS/SCSI support RH/Fedora dm-mpath device nodes

    - pNFS files/flexfiles can now use unprivileged ports when
    the generic NFS mount options allow it.

    Bugfixes:
    - Don't use RDMA direct data placement together with data
    integrity or privacy security flavours

    - Remove the RDMA ALLPHYSICAL memory registration mode as
    it has potential security holes.

    - Several layout recall fixes to improve NFSv4.1 protocol
    compliance.

    - Fix an Oops in the pNFS files and flexfiles connection
    setup to the DS

    - Allow retry of operations that used a returned delegation
    stateid

    - Don't mark the inode as revalidated if a LAYOUTCOMMIT is
    outstanding

    - Fix writeback races in nfs4_copy_range() and
    nfs42_proc_deallocate()"

    * tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits)
    pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding
    NFSv4: Clean up lookup of SECINFO_NO_NAME
    NFSv4.2: Fix warning "variable ‘stateids’ set but not used"
    NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’"
    SUNRPC: Fix a compiler warning in fs/nfs/clnt.c
    pNFS: Remove redundant smp_mb() from pnfs_init_lseg()
    pNFS: Cleanup - do layout segment initialisation in one place
    pNFS: Remove redundant stateid invalidation
    pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
    pNFS: Clear the layout metadata if the server changed the layout stateid
    pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
    NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
    pNFS: Do not set plh_return_seq for non-callback related layoutreturns
    pNFS: Ensure layoutreturn acts as a completion for layout callbacks
    pNFS: Fix CB_LAYOUTRECALL stateid verification
    pNFS: Always update the layout barrier seqid on LAYOUTGET
    pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
    pNFS: Clear the layout return tracking on layout reinitialisation
    pNFS: LAYOUTRETURN should only update the stateid if the layout is valid
    nfs: don't create zero-length requests
    ...

    Linus Torvalds
     

30 Jul, 2016

2 commits

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - TPM core and driver updates/fixes
    - IPv6 security labeling (CALIPSO)
    - Lots of Apparmor fixes
    - Seccomp: remove 2-phase API, close hole where ptrace can change
    syscall #"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
    apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
    tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
    tpm: Factor out common startup code
    tpm: use devm_add_action_or_reset
    tpm2_i2c_nuvoton: add irq validity check
    tpm: read burstcount from TPM_STS in one 32-bit transaction
    tpm: fix byte-order for the value read by tpm2_get_tpm_pt
    tpm_tis_core: convert max timeouts from msec to jiffies
    apparmor: fix arg_size computation for when setprocattr is null terminated
    apparmor: fix oops, validate buffer size in apparmor_setprocattr()
    apparmor: do not expose kernel stack
    apparmor: fix module parameters can be changed after policy is locked
    apparmor: fix oops in profile_unpack() when policy_db is not present
    apparmor: don't check for vmalloc_addr if kvzalloc() failed
    apparmor: add missing id bounds check on dfa verification
    apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
    apparmor: use list_next_entry instead of list_entry_next
    apparmor: fix refcount race when finding a child profile
    apparmor: fix ref count leak when profile sha1 hash is read
    apparmor: check that xindex is in trans_table bounds
    ...

    Linus Torvalds
     
  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment, that the resolution of those concerns
    would not be fuse specific, and that sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounter's credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and put up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"
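As a rough illustration of the two core ideas in the pull message above (ids are translated relative to the namespace that owns the superblock, and inodes whose owner does not map become read-only), here is a minimal userspace sketch in C. It is not kernel code: `sketch_user_ns`, `sketch_make_kuid()` and `sketch_may_modify()` are hypothetical names, and a single contiguous id range stands in for the kernel's real mapping tables.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for INVALID_UID. */
#define SKETCH_INVALID_UID ((uint32_t)-1)

/* Hypothetical, simplified namespace: one contiguous id range. */
struct sketch_user_ns {
    uint32_t first;       /* first id as seen inside the namespace */
    uint32_t lower_first; /* kernel-wide id that 'first' maps to */
    uint32_t count;       /* number of consecutive ids mapped */
};

/* Translate a filesystem uid relative to the namespace that owns the
 * superblock. Ids outside the mapped range are reported as invalid. */
static uint32_t sketch_make_kuid(const struct sketch_user_ns *ns, uint32_t uid)
{
    if (uid >= ns->first && uid - ns->first < ns->count)
        return ns->lower_first + (uid - ns->first);
    return SKETCH_INVALID_UID;
}

/* Inodes whose owner did not map may be read but never modified,
 * so that nothing can trigger a write-back of the unknown id. */
static bool sketch_may_modify(uint32_t kuid)
{
    return kuid != SKETCH_INVALID_UID;
}
```

With a namespace mapping ids 0..65535 onto 100000..165535, a file owned by uid 1000 translates to 101000 and stays writable, while a file owned by uid 70000 is reported as invalid and is treated as read-only.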

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

29 Jul, 2016

1 commit

  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.
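The idea of salting a string hash with a base pointer can be sketched in userspace C as follows. This is only an illustration under stated assumptions: it uses a 64-bit FNV-1a style hash rather than the kernel's own hash, and `sketch_salted_hash()` is a hypothetical name. The salt is mixed in before any character is consumed, so the salt-dependent starting state could be computed once per parent.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative FNV-1a style hash; NOT the hash function the kernel uses. */
static uint64_t sketch_salted_hash(const void *salt, const char *name, size_t len)
{
    /* Mix the salt in at the beginning of the hash, not at the end. */
    uint64_t hash = 0xcbf29ce484222325ULL ^ (uint64_t)(uintptr_t)salt;

    for (size_t i = 0; i < len; i++) {
        hash ^= (unsigned char)name[i];
        hash *= 0x100000001b3ULL;
    }
    return hash;
}
```

Passing a `NULL` salt gives a hash that depends on the string alone, while two different parent pointers give different hashes for the same name.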

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

27 Jul, 2016

11 commits


26 Jul, 2016

18 commits

  • After the previous patch, struct tc_action should be enough
    to represent the generic tc action, tcf_common is not necessary
    any more. This patch gets rid of it to make tc action code
    more readable.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • struct tc_action is confusing; currently we use it for two purposes:
    1) Pass in arguments and carry out results from helper functions
    2) A generic representation for tc actions

    The first one is error-prone, since we need to make sure we don't
    miss anything. This patch aims to get rid of this use, by moving
    tc_action into tcf_common, so that they are allocated together
    in hashtable and can be cast'ed easily.

    And together with the following patch, we could really make
    tc_action a generic representation for all tc actions and each
    type of action can inherit from it.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • After a612769774a3 ("udp: prevent bugcheck if filter truncates packet
    too much"), there followed various other fixes for similar cases such
    as f4979fcea7fd ("rose: limit sk_filter trim to payload").

    The latter introduced a new helper sk_filter_trim_cap(), where we can pass
    the trim limit directly to the socket filter handling. Make use of it
    here as well with sizeof(struct udphdr) as lower cap limit and drop the
    extra skb->len test in UDP's input path.
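The effect of a trim cap can be sketched with plain arithmetic. The function below is a hypothetical userspace model, not the kernel helper: it assumes the filter returns the number of bytes to keep (0 meaning "drop"), and that the packet is never trimmed below the cap nor grown beyond its current length.

```c
#include <stddef.h>

/* A UDP header is 8 bytes; this stands in for sizeof(struct udphdr). */
#define SKETCH_UDP_HDR_LEN 8

/* Returns the packet length after filtering, or 0 if the filter
 * rejected the packet. Hypothetical model of a capped trim. */
static size_t sketch_trim_with_cap(size_t pkt_len, size_t filter_len, size_t cap)
{
    size_t keep;

    if (filter_len == 0)
        return 0;                        /* filter asked to drop */

    keep = filter_len > cap ? filter_len : cap;  /* never below the cap */
    return keep < pkt_len ? keep : pkt_len;      /* trimming never grows */
}
```

With the cap set to `SKETCH_UDP_HDR_LEN`, a filter that asks to keep only 4 bytes of a 100-byte datagram still leaves the full 8-byte header in place, which is why a separate length test afterwards becomes unnecessary.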

    Signed-off-by: Daniel Borkmann
    Cc: Willem de Bruijn
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for next expiring timer,
    etc). That's the first major change of the wheel in almost 20
    years since Finn implemented it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     
  • I was seeing a lot of these:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
    Preemption disabled at:[] rhashtable_walk_start+0x46/0x150

    [] preempt_count_add+0x1fb/0x280
    [] _raw_spin_lock+0x12/0x40
    [] console_unlock+0x2f7/0x930
    [] vprintk_emit+0x2fb/0x520
    [] vprintk_default+0x1a/0x20
    [] printk+0x94/0xb0
    [] print_stack_trace+0xe0/0x170
    [] ___might_sleep+0x3be/0x460
    [] __might_sleep+0x90/0x1a0
    [] kmem_cache_alloc+0x153/0x1e0
    [] rhashtable_walk_init+0xfe/0x2d0
    [] sctp_transport_walk_start+0x1e/0x60
    [] sctp_transport_seq_start+0x4d/0x150
    [] seq_read+0x27b/0x1180
    [] proc_reg_read+0xbc/0x180
    [] __vfs_read+0xdb/0x610
    [] vfs_read+0xea/0x2d0
    [] SyS_pread64+0x11b/0x150
    [] do_syscall_64+0x19c/0x410
    [] return_from_SYSCALL_64+0x0/0x6a
    [] 0xffffffffffffffff

    Apparently we always need to call rhashtable_walk_stop(), even when
    rhashtable_walk_start() fails:

    * rhashtable_walk_start - Start a hash table walk
    * @iter: Hash table iterator
    *
    * Start a hash table walk. Note that we take the RCU lock in all
    * cases including when we return an error. So you must always call
    * rhashtable_walk_stop to clean up.

    otherwise we never call rcu_read_unlock() and we get the splat above.
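The locking rule quoted above can be modelled in userspace C. Everything here is a mock: `sketch_walk_start()` and `sketch_walk_stop()` are hypothetical stand-ins for the rhashtable walk calls, and a plain counter stands in for the RCU read lock. The sketch contrasts a caller that returns early on error with one that always pairs start with stop.

```c
#include <errno.h>

static int sketch_rcu_depth; /* stands in for RCU read-lock nesting */

/* Mock start: takes the lock in all cases, even when returning an error. */
static int sketch_walk_start(int fail)
{
    sketch_rcu_depth++;
    return fail ? -EAGAIN : 0;
}

/* Mock stop: releases the lock taken by the matching start. */
static void sketch_walk_stop(void)
{
    sketch_rcu_depth--;
}

/* Buggy pattern: an early return on error skips the stop call. */
static int sketch_seq_start_buggy(int fail)
{
    int err = sketch_walk_start(fail);

    if (err)
        return err; /* lock is leaked here */
    sketch_walk_stop();
    return 0;
}

/* Fixed pattern: stop is called whether or not start succeeded. */
static int sketch_seq_start_fixed(int fail)
{
    int err = sketch_walk_start(fail);

    sketch_walk_stop();
    return err;
}
```

After a failed start, the buggy variant leaves the counter at 1 while the fixed variant returns it to 0, which mirrors the unbalanced `rcu_read_unlock()` behind the splat.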

    Fixes: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
    See-also: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
    See-also: f2dba9c6 ("rhashtable: Introduce rhashtable_walk_*")
    Cc: Xin Long
    Cc: Herbert Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Vegard Nossum
     
  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"
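The fetch-style semantics named in the first items above (return the value from before the operation) can be shown with C11 `<stdatomic.h>`, used here as a userspace analogue rather than the kernel's own atomic API. The wrapper names `sketch_fetch_add()` and `sketch_add_return()` are hypothetical.

```c
#include <stdatomic.h>

/* Fetch-then-op: returns the value the variable held BEFORE the add. */
static int sketch_fetch_add(atomic_int *v, int i)
{
    return atomic_fetch_add(v, i);
}

/* Op-then-return: the new value, derived from the fetched old value. */
static int sketch_add_return(atomic_int *v, int i)
{
    return atomic_fetch_add(v, i) + i;
}
```

Starting from 5, `sketch_fetch_add(&v, 3)` returns 5 and leaves 8 in `v`. The op-then-return form can always be built from the fetch form, but for operations such as `or` and `and` the old value cannot be recovered from the new one, which is one reason a fetch variant is useful.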

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
     
  • I ran into this:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 2 PID: 2012 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff8800b745f2c0 ti: ffff880111740000 task.ti: ffff880111740000
    RIP: 0010:[] [] irttp_connect_request+0x36/0x710
    RSP: 0018:ffff880111747bb8 EFLAGS: 00010286
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000069dd8358
    RDX: 0000000000000009 RSI: 0000000000000027 RDI: 0000000000000048
    RBP: ffff880111747c00 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000069dd8358 R11: 1ffffffff0759723 R12: 0000000000000000
    R13: ffff88011a7e4780 R14: 0000000000000027 R15: 0000000000000000
    FS: 00007fc738404700(0000) GS:ffff88011af00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc737fdfb10 CR3: 0000000118087000 CR4: 00000000000006e0
    Stack:
    0000000000000200 ffff880111747bd8 ffffffff810ee611 ffff880119f1f220
    ffff880119f1f4f8 ffff880119f1f4f0 ffff88011a7e4780 ffff880119f1f232
    ffff880119f1f220 ffff880111747d58 ffffffff82bca542 0000000000000000
    Call Trace:
    [] irda_connect+0x562/0x1190
    [] SYSC_connect+0x202/0x2a0
    [] SyS_connect+0x9/0x10
    [] do_syscall_64+0x19c/0x410
    [] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 41 89 ca 48 89 e5 41 57 41 56 41 55 41 54 41 89 d7 53 48 89 fb 48 83 c7 48 48 89 fa 41 89 f6 48 c1 ea 03 48 83 ec 20 4c 8b 65 10 b6 04 02 84 c0 74 08 84 c0 0f 8e 4c 04 00 00 80 7b 48 00 74
    RIP [] irttp_connect_request+0x36/0x710
    RSP
    ---[ end trace 4cda2588bc055b30 ]---

    The problem is that irda_open_tsap() can fail and leave self->tsap = NULL,
    and then irttp_connect_request() almost immediately dereferences it.
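The commit message does not show the fix itself, so the following is only one plausible shape of it, written as a userspace mock: check the result of the open step and bail out before anything dereferences the pointer. All names (`sketch_sock`, `sketch_open_tsap()`, `sketch_connect()`) are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_tsap { int connected; };
struct sketch_sock { struct sketch_tsap *tsap; };

/* Mock open step: a NULL 'backing' models a failure that leaves
 * self->tsap set to NULL. */
static int sketch_open_tsap(struct sketch_sock *self, struct sketch_tsap *backing)
{
    self->tsap = backing;
    return backing ? 0 : -ENOMEM;
}

/* Connect path: propagate the error instead of using a NULL tsap. */
static int sketch_connect(struct sketch_sock *self, struct sketch_tsap *backing)
{
    int err = sketch_open_tsap(self, backing);

    if (err)
        return err; /* nothing below may touch self->tsap */

    self->tsap->connected = 1;
    return 0;
}
```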

    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Signed-off-by: David S. Miller

    Vegard Nossum
     
  • The head skb for GSO packets won't travel through the inner depths of
    SCTP stack as it doesn't contain any chunks on it. That means skb->sk
    doesn't get set and then when sctp_recvmsg() calls
    sctp_inet6_skb_msgname() on the head_skb it panics, as this last needs
    to check flags at the socket (sp->v4mapped).

    The fix is to initialize skb->sk for the head skb once we are able to do
    it. That is, when the first chunk is processed.

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • Now that the backlog processing is called with BH enabled, we have to
    disable BH before taking the socket lock via bh_lock_sock() otherwise
    it may deadlock:

    sctp_backlog_rcv()
        bh_lock_sock(sk);

        if (sock_owned_by_user(sk)) {
            if (sk_add_backlog(sk, skb, sk->sk_rcvbuf))
                sctp_chunk_free(chunk);
            else
                backloged = 1;
        } else
            sctp_inq_push(inqueue, chunk);

        bh_unlock_sock(sk);

    while sctp_inq_push() was disabling/enabling BH, but enabling BH
    triggers pending softirq, which then may try to re-lock the socket in
    sctp_rcv().

    [ 219.187215]
    [ 219.187217] [] _raw_spin_lock+0x20/0x30
    [ 219.187223] [] sctp_rcv+0x48c/0xba0 [sctp]
    [ 219.187225] [] ? nf_iterate+0x62/0x80
    [ 219.187226] [] ip_local_deliver_finish+0x94/0x1e0
    [ 219.187228] [] ip_local_deliver+0x6f/0xf0
    [ 219.187229] [] ? ip_rcv_finish+0x3b0/0x3b0
    [ 219.187230] [] ip_rcv_finish+0xd8/0x3b0
    [ 219.187232] [] ip_rcv+0x282/0x3a0
    [ 219.187233] [] ? update_curr+0x66/0x180
    [ 219.187235] [] __netif_receive_skb_core+0x524/0xa90
    [ 219.187236] [] ? update_cfs_shares+0x30/0xf0
    [ 219.187237] [] ? __enqueue_entity+0x6c/0x70
    [ 219.187239] [] ? enqueue_entity+0x204/0xdf0
    [ 219.187240] [] __netif_receive_skb+0x18/0x60
    [ 219.187242] [] process_backlog+0x9e/0x140
    [ 219.187243] [] net_rx_action+0x22c/0x370
    [ 219.187245] [] __do_softirq+0x112/0x2e7
    [ 219.187247] [] do_softirq_own_stack+0x1c/0x30
    [ 219.187247]
    [ 219.187248] [] do_softirq.part.14+0x38/0x40
    [ 219.187249] [] __local_bh_enable_ip+0x7d/0x80
    [ 219.187254] [] sctp_inq_push+0x68/0x80 [sctp]
    [ 219.187258] [] sctp_backlog_rcv+0x151/0x1c0 [sctp]
    [ 219.187260] [] __release_sock+0x87/0xf0
    [ 219.187261] [] release_sock+0x30/0xa0
    [ 219.187265] [] sctp_accept+0x17d/0x210 [sctp]
    [ 219.187266] [] ? prepare_to_wait_event+0xf0/0xf0
    [ 219.187268] [] inet_accept+0x3c/0x130
    [ 219.187269] [] SYSC_accept4+0x103/0x210
    [ 219.187271] [] ? _raw_spin_unlock_bh+0x1a/0x20
    [ 219.187272] [] ? release_sock+0x8c/0xa0
    [ 219.187276] [] ? sctp_inet_listen+0x62/0x1b0 [sctp]
    [ 219.187277] [] SyS_accept+0x10/0x20

    Fixes: 860fbbc343bf ("sctp: prepare for socket backlog behavior change")
    Cc: Eric Dumazet
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • The check for a negative error is redundant; remove it and just
    immediately return the return value from the call to
    seq_open_net.
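The cleanup can be illustrated with a mock. The sketch assumes, as the commit message implies, that the callee returns either 0 or a negative error code, in which case the two forms below are equivalent. `sketch_open()` is a hypothetical stand-in for the real call.

```c
#include <errno.h>

/* Hypothetical stand-in for the open call: 0 on success, negative errno
 * on failure (assumed never to return a positive value). */
static int sketch_open(int simulate_error)
{
    return simulate_error ? -ENOMEM : 0;
}

/* Before: a separate check that adds nothing. */
static int sketch_open_before(int simulate_error)
{
    int err = sketch_open(simulate_error);

    if (err < 0)
        return err;
    return 0;
}

/* After: return the callee's result directly. */
static int sketch_open_after(int simulate_error)
{
    return sketch_open(simulate_error);
}
```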

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • Default kernel behavior is to delete IPv6 addresses on link
    down, which entails deletion of the multicast and the
    subnet-router anycast addresses. These deletions do not
    happen with sysctl setting to keep global IPv6 addresses on
    link down, so every link down/up causes an increment of the
    anycast and multicast refcounts. These bogus refcounts may
    stop these addrs from being removed on subsequent calls to
    delete them. The solution is to leave the groups for the
    multicast and subnet anycast on link down for the callflow
    when global IPv6 addresses are kept.

    Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
    Signed-off-by: Mike Manning
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Mike Manning
     
  • Commit 486bdee0134c ("sctp: add support for RPS and RFS")
    saves skb->hash into sk->sk_rxhash so that the inet_* can
    record it to flow table.

    But sctp uses sock_common_recvmsg as .recvmsg instead
    of inet_recvmsg, and sock_common_recvmsg doesn't invoke
    sock_rps_record_flow to record the flow. As a result the
    receiver may never get a chance to record the flow if
    it doesn't send a message or poll the socket.

    So this patch fixes it by using inet_recvmsg as .recvmsg
    in sctp.

    Fixes: 486bdee0134c ("sctp: add support for RPS and RFS")
    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
     
  • Commit 8626c56c8279 ("bridge: fix potential use-after-free when hook
    returns QUEUE or STOLEN verdict") caused LLDP packets arriving through a
    bridge port to be re-injected to the Rx path with skb->dev set to the
    bridge device, but this breaks the lldpad daemon.

    The lldpad daemon opens a packet socket with protocol set to ETH_P_LLDP
    for any valid device on the system, which doesn't include soft
    devices such as bridge and VLAN.

    Since packet sockets (ptype_base) are processed in the Rx path after the
    Rx handler, LLDP packets with skb->dev set to the bridge device never
    reach the lldpad daemon.

    Fix this by making the bridge's Rx handler re-inject LLDP packets with
    RX_HANDLER_PASS, which effectively restores the behaviour prior to the
    mentioned commit.

    This means netfilter will never receive LLDP packets coming through a
    bridge port, as I don't see a way in which we can have okfn() consume
    the packet without breaking existing behaviour. I've already carried out
    a similar fix for STP packets in commit 56fae404fb2c ("bridge: Fix
    incorrect re-injection of STP packets").

    Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Cc: Florian Westphal
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • This patch makes sctp support ipv6 nonlocal bind by adding
    sp->inet.freebind and net->ipv6.sysctl.ip_nonlocal_bind
    checks in sctp_v6_available, mirroring what sctp did to support
    ipv4 nonlocal bind (commit cdac4e077489).

    Reported-by: Shijoe George
    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances destination but also source buffer by the
    written amount of bytes. When we have __output_custom(), this is actually
    wrong since in that case the source buffer points to a non-linear object,
    in our case an skb, which the copy_func helper is supposed to walk.
    Therefore, since this is non-linear we thus need to pass the offset into
    the helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
    i) to pass in whether we should advance source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.
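The offset issue described above can be reproduced in a small userspace model. This is a sketch under stated assumptions, not the perf or BPF code: `sketch_frags` is a hypothetical non-linear source made of two fragments, `sketch_frag_copy()` plays the role of the copy callback, and `sketch_output_custom()` emits the data in fixed-size pieces. The source object is passed to the callback unchanged on every call, and only the offset advances.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical non-linear source: data split across two fragments. */
struct sketch_frags {
    const char *frag[2];
    size_t len[2];
};

typedef void (*sketch_copy_fn)(void *dst, const void *src, size_t off, size_t len);

/* Copy callback: walks the fragments starting at byte offset 'off'. */
static void sketch_frag_copy(void *dst, const void *src, size_t off, size_t len)
{
    const struct sketch_frags *f = src;
    char *out = dst;

    for (int i = 0; i < 2 && len > 0; i++) {
        if (off >= f->len[i]) {
            off -= f->len[i]; /* offset lies beyond this fragment */
            continue;
        }
        size_t n = f->len[i] - off;
        if (n > len)
            n = len;
        memcpy(out, f->frag[i] + off, n);
        out += n;
        len -= n;
        off = 0; /* later fragments are read from their start */
    }
}

/* Emits 'len' bytes in pieces of at most 'chunk' bytes. Advancing the
 * 'src' pointer instead of 'off' would be wrong for a non-linear source. */
static void sketch_output_custom(char *dst, size_t chunk, sketch_copy_fn copy,
                                 const void *src, size_t len)
{
    size_t off = 0;

    if (chunk == 0)
        return;

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        copy(dst + off, src, off, n);
        off += n;
        len -= n;
    }
}
```

Copying the fragments "hello " and "world" with a chunk size of 4 takes three callback invocations, and the second one straddles the fragment boundary. That is exactly the case where the callback needs the running offset.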

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
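
The problem described in the bpf commit above can be illustrated outside the kernel. The sketch below is a user-space model, not the kernel code: `struct frag_src`, `frag_copy()` and `output_custom()` are hypothetical names, and a two-fragment buffer stands in for a non-linear skb. The point it shows is that when a copy is split into several chunks, the source pointer stays fixed and only an offset advances, because the callback has to walk the object itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical non-linear source: logical data split over two fragments. */
struct frag_src {
    const char *frag[2];
    size_t len[2];
};

/* Copy callback: reads len bytes starting at logical byte offset off,
 * walking the fragments as needed. The caller must keep off + len
 * within the total fragment length. */
static void frag_copy(void *dst, const void *src, size_t off, size_t len)
{
    const struct frag_src *s = src;
    char *out = dst;

    for (int i = 0; i < 2 && len > 0; i++) {
        if (off >= s->len[i]) {
            off -= s->len[i];   /* requested range starts in a later fragment */
            continue;
        }
        size_t n = s->len[i] - off;
        if (n > len)
            n = len;
        memcpy(out, s->frag[i] + off, n);
        out += n;
        len -= n;
        off = 0;                /* continue at the start of the next fragment */
    }
}

/* Write len bytes in pieces of at most chunk bytes (chunk must be > 0),
 * the way an output handle with limited room per page would. */
static void output_custom(char *dst, const void *src, size_t len, size_t chunk)
{
    size_t off = 0;

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        /* src is NOT advanced; the offset tells the callback where to read. */
        frag_copy(dst + off, src, off, n);
        off += n;
        len -= n;
    }
}
```

Advancing `src` by the written amount, as is correct for a linear buffer, would here step through the bytes of the descriptor structure instead of the data it describes.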
     
  • gcc-4.9 and higher warn about the newly added NCSI code:

    net/ncsi/ncsi-manage.c: In function 'ncsi_process_next_channel':
    net/ncsi/ncsi-manage.c:1003:2: error: 'old_state' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The warning is a false positive and therefore harmless, but it would be good to
    avoid it anyway. I have determined that the barrier in the spin_unlock_irqsave()
    is what confuses gcc to the point that it cannot track whether the variable
    was unused or not.

    This rearranges the code in a way that makes it obvious to gcc that old_state
    is always initialized at the time of use. Functionally, this should not
    change anything.

    Signed-off-by: Arnd Bergmann
    Acked-by: Gavin Shan
    Signed-off-by: David S. Miller

    Arnd Bergmann
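
The kind of rearrangement described in the NCSI commit above can be sketched generically. This is an illustration of the pattern only, not the actual ncsi-manage.c code: `struct chan` and `take_channel()` are hypothetical, and the locking calls are reduced to comments. In the shape that tends to provoke the warning, `old_state` is assigned inside an `if (nc)` block, the lock is released, and the variable is read after a later, separate `if (!nc)` check. Handling the NULL case first removes the need for the compiler to correlate the two conditions.

```c
#include <stddef.h>

/* Hypothetical channel object; only the state field matters here. */
struct chan {
    int state;
};

/* Returns the previous state of nc and resets it, or -1 if nc is NULL. */
static int take_channel(struct chan *nc)
{
    int old_state;

    /* (lock would be taken here) */
    if (nc == NULL) {
        /* (unlock would happen here) */
        return -1;
    }

    /* Past this point nc is known to be valid, so old_state is
     * unconditionally assigned before any use. */
    old_state = nc->state;
    nc->state = 0;
    /* (unlock would happen here) */

    return old_state;
}
```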
     
  • Change the ageing_time type in br_set_ageing_time() from u32 to what it
    is expected to be, i.e. a clock_t.

    Signed-off-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • br_stp_enable_bridge() does take the br->lock spinlock. Fix its wrongly
    pasted comment and use the same as br_stp_disable_bridge().

    Signed-off-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Vivien Didelot
     

25 Jul, 2016

6 commits

  • Following the work that has been done on offloading classifiers like u32
    and flower, hw offloading of the match-all classifier is now possible if
    the interface supports tc offloading.

    To control the offloading, two tc flags have been introduced: skip_sw and
    skip_hw. Typical usage:

    tc filter add dev eth25 parent ffff: \
    matchall skip_sw \
    action mirred egress mirror \
    dev eth27

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Yotam Gigi
     
  • The matchall classifier matches every packet and allows the user to apply
    actions on it. This filter is very useful in use cases where every packet
    should be matched; for example, packet mirroring (SPAN) can be set up very
    easily using this filter.

    Signed-off-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next,
    they are:

    1) Count pre-established connections as active in "least connection"
    schedulers to avoid overloading backend servers on peak demands,
    from Michal Kubecek via Simon Horman.

    2) Address a race condition when resizing the conntrack table by caching
    the bucket size when fully iterating over the hashtable in these
    three possible scenarios: 1) dump via /proc/net/nf_conntrack,
    2) unlinking userspace helper and 3) unlinking custom conntrack timeout.
    From Liping Zhang.

    3) Revisit early_drop() path to perform lockless traversal on conntrack
    eviction under stress, use del_timer() as synchronization point to
    avoid two CPUs evicting the same entry, from Florian Westphal.

    4) Move NAT hlist_head to nf_conn object, this simplifies the existing
    NAT extension and it doesn't increase size since recent patches to
    align nf_conn, from Florian.

    5) Use rhashtable for the by-source NAT hashtable, also from Florian.

    6) Don't allow --physdev-is-out from OUTPUT chain, just as
    --physdev-out is not allowed either, from Hangbin Liu.

    7) Automagically enable nf_conntrack counters if the user tries to
    match ct bytes/packets from nftables, from Liping Zhang.

    8) Remove possible_net_t fields in nf_tables set objects since we just
    simply pass the net pointer to the backend set type implementations.

    9) Fix possible off-by-one in h323, from Toby DiPasquale.

    10) early_drop() may be called from the ctnetlink path, so we must hold
    the rcu read side lock from there too; this amends Florian's patch #3
    coming in this batch, from Liping Zhang.

    11) Use binary search to validate jump offset in x_tables, this
    addresses the O(n!) validation that was introduced recently to
    resolve security issues with unprivileged namespaces, from Florian.

    12) Fix reference leak to connlabel in error path of nft_ct, from Zhang.

    13) Three updates for nft_log: Fix log prefix leak in error path. Bail
    out on loglevel larger than debug in nft_log and set the new
    NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang.

    14) Allow to filter rule dumps in nf_tables based on table and chain
    names.

    15) Simplify connlabel to always use 128 bits to store labels and
    get rid of unused function in xt_connlabel, from Florian.

    16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack
    helper, by Gao Feng.

    17) Put back x_tables module reference in nft_compat on error, from
    Liping Zhang.

    18) Add a reference count to the x_tables extensions cache in
    nft_compat, so we can remove them when unused and avoid a crash
    if the extensions are rmmod, again from Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
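
Item 11 of the Netfilter summary above mentions validating jump offsets with a binary search. A minimal sketch of that idea follows. It assumes the start offsets of all rules have already been collected into an ascending array; the function name `jump_target_valid()` and the sample offsets in the note below are illustrative, not taken from x_tables.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if target equals one of the n rule start offsets.
 * offsets must be sorted in ascending order. */
static bool jump_target_valid(const unsigned int *offsets, size_t n,
                              unsigned int target)
{
    size_t lo = 0, hi = n;      /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (offsets[mid] == target)
            return true;
        if (offsets[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;               /* target does not start any rule */
}
```

For a table whose rules start at offsets 0, 112, 264 and 416, a jump to 264 is accepted and a jump to 200 is rejected. Each lookup takes O(log n) comparisons instead of a linear scan over all rules.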
     
  • Pull tty/serial driver updates from Greg KH:
    "Here is the big tty and serial driver update for 4.8-rc1.

    Lots of good cleanups from Jiri on a number of vt and other tty
    related things, and the normal driver updates. Full details are in
    the shortlog.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (90 commits)
    tty/serial: atmel: enforce tasklet init and termination sequences
    serial: sh-sci: Stop transfers in sci_shutdown()
    serial: 8250_ingenic: drop #if conditional surrounding earlycon code
    serial: 8250_mtk: drop !defined(MODULE) conditional
    serial: 8250_uniphier: drop !defined(MODULE) conditional
    earlycon: mark earlycon code as __used iif the caller is built-in
    tty/serial/8250: use mctrl_gpio helpers
    serial: mctrl_gpio: enable API usage only for initialized mctrl_gpios struct
    serial: mctrl_gpio: add modem control read routine
    tty/serial/8250: make UART_MCR register access consistent
    serial: 8250_mid: Read RX buffer on RX DMA timeout for DNV
    serial: 8250_dma: Export serial8250_rx_dma_flush()
    dmaengine: hsu: Export hsu_dma_get_status()
    tty: serial: 8250: add CON_CONSDEV to flags
    tty: serial: samsung: add byte-order aware bit functions
    tty: serial: samsung: fixup accessors for endian
    serial: sirf: make fifo functions static
    serial: mps2-uart: make driver explicitly non-modular
    serial: mvebu-uart: free the IRQ in ->shutdown()
    serial/bcm63xx_uart: use correct alias naming
    ...

    Linus Torvalds
     
  • Trond Myklebust
     
  • Trond Myklebust