01 Dec, 2018

1 commit

  • Now that call_rcu()'s callback is not invoked until after bh-disable
    regions of code have completed (in addition to explicitly marked
    RCU read-side critical sections), call_rcu() can be used in place
    of call_rcu_bh(). Similarly, rcu_barrier() can be used in place of
    rcu_barrier_bh() and synchronize_rcu() in place of synchronize_rcu_bh().
    This commit therefore makes these changes.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Pablo Neira Ayuso

    Paul E. McKenney
     

12 Nov, 2018

39 commits

  • nf_flow_offload_gc_step() and nf_flow_table_iterate() are very similar,
    so a lot of duplicate code can be removed.
    After this patch, nf_flow_offload_gc_step() is a simple callback function of
    nf_flow_table_iterate(), like nf_flow_table_do_cleanup().

    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
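
The commit above relies on a common pattern: one generic walker takes a callback, and each task (garbage collection, cleanup) becomes a small callback. The sketch below models that pattern in Python rather than kernel C. `table_iterate`, `gc_step`, and the dict-based flow entries are hypothetical stand-ins, not the netfilter implementation.

```python
def table_iterate(flows, callback, data):
    """Call callback(flow, data) for every entry, tolerating removals."""
    for flow in list(flows):  # walk a snapshot so callbacks may delete entries
        callback(flow, data)


def gc_step(flow, flows):
    """Callback: drop a flow once it is marked as expired."""
    if flow["expired"]:
        flows.remove(flow)


flows = [{"id": 1, "expired": False}, {"id": 2, "expired": True}]
table_iterate(flows, gc_step, flows)
print([f["id"] for f in flows])
```

A second callback (for example one that removes every flow bound to a given device) could reuse `table_iterate` unchanged, which is where the duplicate code goes away.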
     
  • nf_flow_table_iterate() is a local function; make it static.

    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     
  • It is useful to set only a particular range of the conntrack mark while
    leaving the other bits of the value alone, e.g. when updating
    conntrack marks via netlink from userspace.

    For NFQUEUE it was already implemented in commit 534473c6080e
    ("netfilter: ctnetlink: honor CTA_MARK_MASK when setting ctmark").

    This adds the same functionality for the other netlink
    conntrack mark changes.

    Signed-off-by: Andreas Jaggi
    Signed-off-by: Pablo Neira Ayuso

    Andreas Jaggi
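
A masked mark update of the kind described above can be sketched as follows. This is a generic bit-masking model in Python, assuming a 32-bit mark; `set_mark_masked` is a hypothetical name, and the exact expression used in the kernel may differ.

```python
U32_MASK = 0xFFFFFFFF  # the conntrack mark is modelled as a 32-bit value


def set_mark_masked(old_mark, value, mask):
    """Replace only the bits selected by mask; keep all other bits of old_mark."""
    return ((old_mark & ~mask) | (value & mask)) & U32_MASK


old = 0xAABBCCDD
new = set_mark_masked(old, 0x00001234, 0x0000FFFF)
print(hex(new))  # upper 16 bits of the old mark are preserved
```

With a mask of all ones the whole mark is replaced, and with a mask of zero the mark is left untouched.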
     
  • Jozsef Kadlecsik says:

    ====================
    - Introduction of new commands and thus protocol version 7. The
    new commands make it possible to eliminate the getsockopt interface
    of ipset and use solely netlink to communicate with the kernel.
    Due to the strict attribute checking in both user and kernel space,
    a new protocol number was introduced. Both the kernel and userspace
    are fully backward compatible.
    - Make invalid MAC address checks consistent, from Stefano Brivio.
    The patch depends on the next one.
    - Allow matching on destination MAC address for mac and ipmac sets,
    also from Stefano Brivio.
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    drivers/net/phy/marvell.c: In function 'm88e1510_config_init':
    drivers/net/phy/marvell.c:850:7: warning:
    variable 'pause' set but not used [-Wunused-but-set-variable]

    It is not used any more after commit 3c1bcc8614db ("net: ethernet: Convert phydev
    advertize and supported from u32 to link mode")

    Signed-off-by: YueHaibing
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    YueHaibing
     
  • David S. Miller
     
  • Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "One last pull request before heading to Vancouver for LPC, here we have:

    1) Don't forget to free VSI contexts during ice driver unload, from
    Victor Raj.

    2) Don't forget napi delete calls during device remove in ice driver,
    from Dave Ertman.

    3) Don't request VLAN tag insertion of ibmvnic device when SKB
    doesn't have VLAN tags at all.

    4) IPV4 frag handling code has to accommodate the situation where two
    threads try to insert the same fragment into the hash table at the
    same time. From Eric Dumazet.

    5) Relatedly, don't flow separate on protocol ports for fragmented
    frames, also from Eric Dumazet.

    6) Memory leaks in qed driver, from Denis Bolotin.

    7) Correct valid MTU range in smsc95xx driver, from Stefan Wahren.

    8) Validate cls_flower nested policies properly, from Jakub Kicinski.

    9) Clearing of stats counters in mv88e6xxx driver doesn't retain
    important bits in the G1_STATS_OP register causing the chip to
    hang. Fix from Andrew Lunn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
    act_mirred: clear skb->tstamp on redirect
    net: dsa: mv88e6xxx: Fix clearing of stats counters
    tipc: fix link re-establish failure
    net: sched: cls_flower: validate nested enc_opts_policy to avoid warning
    net: mvneta: correct typo
    flow_dissector: do not dissect l4 ports for fragments
    net: qualcomm: rmnet: Fix incorrect assignment of real_dev
    net: aquantia: allow rx checksum offload configuration
    net: aquantia: invalid checksumm offload implementation
    net: aquantia: fixed enable unicast on 32 macvlan
    net: aquantia: fix potential IOMMU fault after driver unbind
    net: aquantia: synchronized flow control between mac/phy
    net: smsc95xx: Fix MTU range
    net: stmmac: Fix RX packet size > 8191
    qed: Fix potential memory corruption
    qed: Fix SPQ entries not returned to pool in error flows
    qed: Fix blocking/unlimited SPQ entries leak
    qed: Fix memory/entry leak in qed_init_sp_request()
    inet: frags: better deal with smp races
    net: hns3: bugfix for not checking return value
    ...

    Linus Torvalds
     
  • …masahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - fix build errors in binrpm-pkg and bindeb-pkg targets

    - fix false positive matches in merge_config.sh

    - fix build version mismatch in deb-pkg target

    - fix dtbs_install handling in (bin)deb-pkg target

    - revert a commit that allows setlocalversion to write to source tree

    * tag 'kbuild-fixes-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    builddeb: Fix inclusion of dtbs in debian package
    Revert "scripts/setlocalversion: git: Make -dirty check more robust"
    kbuild: deb-pkg: fix too low build version number
    kconfig: merge_config: avoid false positive matches from comment lines
    kbuild: deb-pkg: fix bindeb-pkg breakage when O= is used
    kbuild: rpm-pkg: fix binrpm-pkg breakage when O= is used

    Linus Torvalds
     
  • Pull btrfs fixes from David Sterba:
    "Several fixes to recent release (4.19, fixes tagged for stable) and
    other fixes"

    * tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    Btrfs: fix missing delayed iputs on unmount
    Btrfs: fix data corruption due to cloning of eof block
    Btrfs: fix infinite loop on inode eviction after deduplication of eof block
    Btrfs: fix deadlock on tree root leaf when finding free extent
    btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
    btrfs: tree-checker: Fix misleading group system information
    Btrfs: fix missing data checksums after a ranged fsync (msync)
    btrfs: fix pinned underflow after transaction aborted
    Btrfs: fix cur_offset in the error case for nocow

    Linus Torvalds
     
  • Pull ext4 fixes from Ted Ts'o:
    "A large number of ext4 bug fixes, mostly buffer and memory leaks on
    error return cleanup paths"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: missing !bh check in ext4_xattr_inode_write()
    ext4: fix buffer leak in __ext4_read_dirblock() on error path
    ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
    ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
    ext4: release bs.bh before re-using in ext4_xattr_block_find()
    ext4: fix buffer leak in ext4_xattr_get_block() on error path
    ext4: fix possible leak of s_journal_flag_rwsem in error path
    ext4: fix possible leak of sbi->s_group_desc_leak in error path
    ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
    ext4: avoid possible double brelse() in add_new_gdb() on error path
    ext4: avoid buffer leak in ext4_orphan_add() after prior errors
    ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
    ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
    ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
    ext4: add missing brelse() update_backups()'s error path
    ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
    ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
    ext4: avoid potential extra brelse in setup_new_flex_group_blocks()

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "A set of x86 fixes:

    - Cure the LDT remapping to user space on 5 level paging which ended
    up in the KASLR space

    - Remove LDT mapping before freeing the LDT pages

    - Make NFIT MCE handling more robust

    - Unbreak the VSMP build by removing the dependency on paravirt ops

    - Support broken PIT emulation on Microsoft Hyper-V

    - Don't trace vmware_sched_clock() to avoid tracer recursion

    - Remove -pipe from KBUILD CFLAGS which breaks clang and is also
    slower on GCC

    - Trivial coding style and typo fixes"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/cpu/vmware: Do not trace vmware_sched_clock()
    x86/vsmp: Remove dependency on pv_irq_ops
    x86/ldt: Remove unused variable in map_ldt_struct()
    x86/ldt: Unmap PTEs for the slot before freeing LDT pages
    x86/mm: Move LDT remap out of KASLR region on 5-level paging
    acpi/nfit, x86/mce: Validate a MCE's address before using it
    acpi/nfit, x86/mce: Handle only uncorrectable machine checks
    x86/build: Remove -pipe from KBUILD_CFLAGS
    x86/hyper-v: Fix indentation in hv_do_fast_hypercall16()
    Documentation/x86: Fix typo in zero-page.txt
    x86/hyper-v: Enable PIT shutdown quirk
    clockevents/drivers/i8253: Add support for PIT shutdown quirk

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A bunch of perf tooling fixes:

    - Make the Intel PT SQL viewer more robust

    - Make the Intel PT debug log more useful

    - Support weak groups in perf record so it's behaving the same way as
    perf stat

    - Display the LBR stats in callchain entries properly in perf top

    - Handle different PMU names with common prefix properly in perf
    stat

    - Start syscall augmenting in perf trace. Preparation for
    architecture independent eBPF instrumentation of syscalls.

    - Fix build breakage in JVMTI perf lib

    - Fix arm64 tools build failure wrt smp_load_{acquire,release}"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf tools: Do not zero sample_id_all for group members
    perf tools: Fix undefined symbol scnprintf in libperf-jvmti.so
    perf beauty: Use SRCARCH, ARCH=x86_64 must map to "x86" to find the headers
    perf intel-pt: Add MTC and CYC timestamps to debug log
    perf intel-pt: Add more event information to debug log
    perf scripts python: exported-sql-viewer.py: Fix table find when table re-ordered
    perf scripts python: exported-sql-viewer.py: Add help window
    perf scripts python: exported-sql-viewer.py: Add Selected branches report
    perf scripts python: exported-sql-viewer.py: Fall back to /usr/local/lib/libxed.so
    perf top: Display the LBR stats in callchain entry
    perf stat: Handle different PMU names with common prefix
    perf record: Support weak groups
    perf evlist: Move perf_evsel__reset_weak_group into evlist
    perf augmented_syscalls: Start collecting pathnames in the BPF program
    perf trace: Fix setting of augmented payload when using eBPF + raw_syscalls
    perf trace: When augmenting raw_syscalls plug raw_syscalls:sys_exit too
    perf examples bpf: Start augmenting raw_syscalls:sys_{start,exit}
    tools headers barrier: Fix arm64 tools build failure wrt smp_load_{acquire,release}

    Linus Torvalds
     
  • Pull timer fix from Thomas Gleixner:
    "Just the removal of a redundant call into the sched deadline overrun
    check"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    posix-cpu-timers: Remove useless call to check_dl_overrun()

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Two small scheduler fixes:

    - Take hotplug lock in sched_init_smp(). Technically not really
    required, but lockdep will complain otherwise.

    - Trivial comment fix in sched/fair"

    * 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Fix a comment in task_numa_fault()
    sched/core: Take the hotplug lock in sched_init_smp()

    Linus Torvalds
     
  • Pull locking build fix from Thomas Gleixner:
    "A single fix for a build fail with CONFIG_PROFILE_ALL_BRANCHES=y in
    the qspinlock code"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/qspinlock: Fix compile error

    Linus Torvalds
     
  • Pull core fixes from Thomas Gleixner:
    "A couple of fixlets for the core:

    - Kernel doc function documentation fixes

    - Missing prototypes for weak watchdog functions"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    resource/docs: Complete kernel-doc style function documentation
    watchdog/core: Add missing prototypes for weak functions
    resource/docs: Fix new kernel-doc warnings

    Linus Torvalds
     
  • The PCI vendor ID of U.S. Robotics isn't defined in pci_ids.h so far;
    only the ISDN driver w6692 has a private definition. Move the definition
    to pci_ids.h and use it in the r8169 driver too.

    Signed-off-by: Heiner Kallweit
    Signed-off-by: David S. Miller

    Heiner Kallweit
     
  • Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute")

    After EDT adoption, it became easier to implement DCTCP-like CE marking.

    In many cases, queues are not building in the network fabric but on
    the hosts themselves.

    If packets leaving fq missed their Earliest Departure Time by XXX usec,
    we mark them with ECN CE. This gives feedback (after one RTT) to
    the sender to slow down and find a better operating mode.

    Example :

    tc qd replace dev eth0 root fq ce_threshold 2.5ms

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
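
The marking rule described above can be sketched as a single comparison. This is a Python model, not the sch_fq code; `should_mark_ce` and its parameter names are hypothetical, and times are assumed to be in nanoseconds on a common clock.

```python
NSEC_PER_MSEC = 1_000_000


def should_mark_ce(now_ns, departure_time_ns, ce_threshold_ns):
    """True when a packet leaves later than its Earliest Departure Time
    by more than the configured threshold."""
    lateness_ns = now_ns - departure_time_ns  # negative if the packet is early
    return lateness_ns > ce_threshold_ns


threshold = 2_500_000  # 2.5 ms, matching the tc example above
print(should_mark_ce(10 * NSEC_PER_MSEC, 7 * NSEC_PER_MSEC, threshold))  # 3 ms late
print(should_mark_ce(10 * NSEC_PER_MSEC, 9 * NSEC_PER_MSEC, threshold))  # 1 ms late
```

Only the first packet, which missed its departure time by more than the threshold, would be marked.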
     
  • FQ pacing guarantees that paced packets queued by one flow do not
    add head-of-line blocking for other flows.

    After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
    since this maps to 16 skbs at most in qdisc or device queues.
    (or slightly more if some drivers lower {gso_max_segs|size})

    We still can queue at most 1 ms worth of traffic (this can be scaled
    by wifi drivers if they need to)

    Tested:

    # ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
    tx-usecs: 16
    tx-frames: 16
    # tc qdisc replace dev eth0 root fq
    # for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done

    Before patch:
    27711
    26118
    27107
    27377
    27712
    27388
    27340
    27117
    27278
    27509

    After patch:
    37434
    36949
    36658
    36998
    37711
    37291
    37605
    36659
    36544
    37349

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
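
The "16 skbs" figure in the commit above is a simple division. The sketch below assumes a GSO size limit of 64 KB per skb; that value is an assumption made for illustration, since the commit message does not state it, and `max_skbs` is a hypothetical helper.

```python
def max_skbs(limit_output_bytes, gso_max_size):
    """Upper bound on queued skbs if every skb carries gso_max_size bytes."""
    return -(-limit_output_bytes // gso_max_size)  # ceiling division


ONE_MB = 1 << 20
print(max_skbs(ONE_MB, 64 * 1024))  # assumed 64 KB GSO size
print(max_skbs(ONE_MB, 32 * 1024))  # a driver that lowers the GSO size
```

Halving the GSO size doubles the bound, which is the "slightly more" case the commit mentions for drivers that lower gso_max_segs or gso_max_size.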
     
  • Eric Dumazet says:

    ====================
    tcp: tso defer improvements

    This series makes tcp_tso_should_defer() a bit smarter :

    1) MSG_EOR gives a hint to TCP to not defer some skbs

    2) Second patch takes into account that head tstamp
    can be in the future.

    3) Third patch uses existing high resolution state variables
    to have a more precise heuristic.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • tcp_tso_should_defer() first heuristic is to not defer
    if last send is "old enough".

    Its current implementation uses jiffies, which have low granularity.

    TSO autodefer performance should not rely on kernel HZ :/

    After EDT conversion, we have state variables in nanoseconds that
    can allow us to properly implement the heuristic.

    This patch increases TSO chunk sizes on medium rate flows,
    especially when receivers do not use GRO or similar aggregation.

    It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
    behavior more uniform.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
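
To illustrate why a jiffies-based "old enough" test depends on HZ, the sketch below compares a tick-based check with a nanosecond-based one. It is a Python model under stated assumptions: the function names are hypothetical, and the 1 ms threshold is chosen for illustration rather than taken from the commit.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_MSEC = 1_000_000


def to_jiffies(t_ns, hz):
    """Quantize a nanosecond timestamp to kernel ticks at the given HZ."""
    return t_ns * hz // NSEC_PER_SEC


def old_enough_jiffies(now_ns, last_send_ns, hz):
    """Tick-based test: true once at least one tick boundary has passed."""
    return to_jiffies(now_ns, hz) - to_jiffies(last_send_ns, hz) > 0


def old_enough_ns(now_ns, last_send_ns, threshold_ns=NSEC_PER_MSEC):
    """Nanosecond-based test, independent of HZ (threshold is an assumption)."""
    return now_ns - last_send_ns > threshold_ns


now, last = 3 * NSEC_PER_MSEC, 0  # last send was 3 ms ago
for hz in (100, 250, 1000):
    print(hz, old_enough_jiffies(now, last, hz), old_enough_ns(now, last))
```

For the same 3 ms gap, the tick-based answer changes with HZ while the nanosecond-based answer does not.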
     
  • tcp_tso_should_defer() last step tries to check if the probable
    next ACK packet is coming in less than half rtt.

    Problem is that the head->tstamp might be in the future,
    so we need to use signed arithmetic to avoid overflows.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
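
The overflow mentioned above can be reproduced with a small model. Python integers do not wrap, so the sketch emulates 64-bit unsigned and signed interpretation explicitly; `as_u64` and `as_s64` are hypothetical helpers, and the timestamps are made-up values.

```python
U64_MASK = (1 << 64) - 1


def as_u64(x):
    """Interpret x modulo 2**64, as an unsigned 64-bit register would."""
    return x & U64_MASK


def as_s64(x):
    """Interpret the low 64 bits of x as a two's-complement signed value."""
    x &= U64_MASK
    return x - (1 << 64) if x >= (1 << 63) else x


now_ns = 1_000_000
head_tstamp_ns = 1_000_500  # head skb is scheduled 500 ns in the future

unsigned_age = as_u64(now_ns - head_tstamp_ns)  # wraps to a huge value
signed_age = as_s64(now_ns - head_tstamp_ns)    # stays a small negative number
print(unsigned_age, signed_age)
```

Any comparison against a threshold gives the wrong answer on the wrapped unsigned value, while the signed value compares as expected.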
     
  • Applications using MSG_EOR are giving a strong hint to TCP stack :

    Subsequent sendmsg() can not append more bytes to skbs having
    the EOR mark.

    Do not try to TSO defer such skbs, there is really no hope.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A bitwise operation is a little faster.
    So replace after() with a test of the flag FLAG_SND_UNA_ADVANCED, as it is
    already set before.

    In addition, there's another similar improvement in tcp_cwnd_reduction().

    Cc: Joe Perches
    Suggested-by: Eric Dumazet
    Signed-off-by: Yafang Shao
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yafang Shao
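
The idea above is that a wrap-safe sequence comparison only needs to be done once, and later code can test a cached flag bit instead. The sketch below models this in Python; `after` mirrors the usual 32-bit wrap-safe comparison, while `process_ack` and the flag value are illustrative assumptions rather than the tcp_input.c code.

```python
U32_MASK = 0xFFFFFFFF
FLAG_SND_UNA_ADVANCED = 0x400  # illustrative bit value


def after(seq1, seq2):
    """Wrap-safe test that 32-bit sequence number seq1 is later than seq2."""
    diff = (seq2 - seq1) & U32_MASK
    return diff >= 0x80000000  # sign bit set means (s32)(seq2 - seq1) < 0


def process_ack(prior_snd_una, ack):
    """Compute the flag once; later checks only test the bit."""
    flag = 0
    if after(ack, prior_snd_una):
        flag |= FLAG_SND_UNA_ADVANCED
    return flag


flag = process_ack(prior_snd_una=0xFFFFFFF0, ack=0x00000005)  # ACK across a wrap
print(bool(flag & FLAG_SND_UNA_ADVANCED))
```

Once the flag is set, `flag & FLAG_SND_UNA_ADVANCED` carries the same information as repeating `after(ack, prior_snd_una)`.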
     
  • If sch_fq is used at ingress, skbs that might have been
    timestamped by net_timestamp_set() if a packet capture
    is requesting timestamps could be delayed by an arbitrary
    amount of time, since sch_fq time base is MONOTONIC.

    Fix this problem by moving code from sch_netem.c to act_mirred.c.

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The mv88e6161 would sometimes fail to probe with a timeout waiting for
    the switch to complete an operation. This operation is supposed to
    clear the statistics counters. However, due to a read/modify/write,
    without the needed mask, the operation actually carried out was more
    random, with invalid parameters, resulting in the switch not
    responding. We need to preserve the histogram mode bits, so apply a
    mask to keep them.

    Reported-by: Chris Healy
    Fixes: 40cff8fca9e3 ("net: dsa: mv88e6xxx: Fix stats histogram mode")
    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
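
The read/modify/write problem described above can be shown with a small register model. The bit positions and names below (`HIST_MODE_MASK`, `OP_BUSY`, `OP_FLUSH_ALL`) are illustrative assumptions, not values taken from the driver or the datasheet.

```python
HIST_MODE_MASK = 0x0C00  # assumed position of the histogram mode bits
OP_BUSY = 0x8000         # assumed "start operation" bit
OP_FLUSH_ALL = 0x1000    # assumed "clear all counters" opcode


def flush_cmd_unmasked(reg):
    """Buggy variant: ORs the command into whatever the register held."""
    return reg | OP_BUSY | OP_FLUSH_ALL


def flush_cmd_masked(reg):
    """Fixed variant: keep only the histogram mode bits, then add the command."""
    return (reg & HIST_MODE_MASK) | OP_BUSY | OP_FLUSH_ALL


stale = 0x4C05  # histogram bits plus leftovers from an earlier operation
print(hex(flush_cmd_unmasked(stale)), hex(flush_cmd_masked(stale)))
```

The unmasked variant sends a command that still contains the stale opcode and port bits, while the masked variant sends a clean flush with the histogram mode preserved.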
     
  • Andrew Lunn says:

    ====================
    net: dsa: mv88e6xxx: Support more SERDES interfaces

    Currently the SERDES interfaces for ports 9 and 10 on the mv88e6390x
    are supported, allowing up to 10G. However, when unused, these SERDES
    interfaces can be used by some of the lower ports for 1000Base-X.

    The tricky bit here is ordering. The SERDES have to become free from
    ports 9 or 10 before they can be used with lower ports. Normally, this
    would happen only when these ports would be configured up, which is
    too late. So at probe time, defaulting ports 9 and 10 to 1000BaseX
    frees them for use with lower ports. If they are actually needed, they
    will be taken back when ports 9 and 10 go up.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The 6390X family has 8 SERDES interfaces. When ports 9 and 10 are not
    using all their SERDES interfaces, the unused ones can be assigned to
    ports 2-8. Add support for interrupts from SERDES interfaces connected
    to these lower ports.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The 6390X family has 8 SERDES interfaces. This allows ports 9 and 10
    to support up to 10Gbps using 4 SERDES interfaces. However, when lower
    speeds are used, which need fewer SERDES interfaces, the unused SERDES
    interfaces can be used by ports 2-8.

    The hardware defaults to ports 9 and 10 having all 4 SERDES interfaces
    assigned to them. This only gets changed when the interface is
    configured after what the SFP supports has been determined, or the 10G
    PHY completes auto-neg.

    For hardware designs which limit ports 9 and 10 to one or two SERDES
    interfaces, and place SFPs on the lower interfaces, this is too
    late. Those ports with SFP should not wait until ports 9/10 are up in
    order to get access to the SERDES interface. So change the default
    configuration when the driver is initialised. Configure ports 9 and 10
    to 1000BaseX, so they use a single SERDES interface, freeing up the
    others. They can steal them back if they need them.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The X family variants support additional ports modes, for 10G
    operation, which the non-X variants don't have. Add a port_set_cmode()
    for non-X variants to enforce this.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Move .port_set_cmode next to .port_get_cmode.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Andrew Lunn says:

    ====================
    net: phy: convert advertise and supported to linkmode

    This is the last part in converting phylib to make use of a Linux
    bitmap, not a u32, to represent link modes. This will allow support
    for PHYs > 1Gbps, which need to use link modes represented by a bit >
    31.

    A number of MAC and PHY drivers need changes to support this. However,
    the previous two patchsets reduced the number somewhat: the helpers
    which were introduced have been modified instead of the actual
    drivers.

    The follow on patches then make use of the extra bits, adding support
    for more link modes.

    Given how invasive this change is, I expect the build is broken for
    some architectures I did not test. I will fix up the breakage as fast
    as I can.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Now that 2.5G and 5G can be represented in phydev->advertising and
    phydev->lp_advertising, add these two link modes as possible
    resolutions to auto negotiation.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Now that PHYs and MACs can support masks wider than 32 bits, add link
    modes which are > 31 to the PHY settings table.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Add missing markup for function parameters

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Convert phy drivers to report the link partner advertised modes using
    a linkmode bitmap. This allows them to report the higher speeds which
    don't fit in a u32.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • There are a few MAC/PHY combinations which now support > 1Gbps. These
    may need to make use of link modes with bits > 31. Thus their
    supported PHY features or advertised features cannot be implemented
    using the current bitmap in a u32. Convert to using a linkmode bitmap,
    which can support all the link modes of current devices, and is
    future-proof as more modes are added.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Both states aren't used. Most likely they result from an idea that
    never materialized. So remove them.

    Signed-off-by: Heiner Kallweit
    Signed-off-by: David S. Miller

    Heiner Kallweit