05 Jun, 2014

2 commits

  • Pull percpu fix from Tejun Heo:
    "It is very late but this is an important percpu-refcount fix from
    Sebastian Ott.

    The problem is that percpu_ref_*() used __this_cpu_*() instead of
    this_cpu_*(). The difference between the two is that the latter is
    atomic on the local cpu while the former is not. this_cpu_inc() is
    guaranteed to increment the percpu counter on the cpu that the
    operation is executed on without any synchronization; however,
    __this_cpu_inc() doesn't and if the local cpu invokes the function
    from different contexts (e.g. process and irq) of the same CPU, it's
    not guaranteed to actually increment as it may be implemented as rmw.

    This bug existed from the get-go but it hasn't been noticed earlier
    probably because on x86 __this_cpu_inc() is equivalent to
    this_cpu_inc() as both get translated into single instruction;
    however, s390 uses the generic rmw implementation and gets affected by
    the bug. Kudos to Sebastian and Heiko for diagnosing it.

    The change is very low risk and fixes a critical issue on the affected
    architectures, so I think it's a good candidate for inclusion although
    it's very late in the devel cycle. On the other hand, this has been
    broken since v3.11, so backporting it through -stable post -rc1 won't
    be the end of the world.

    I'll ping Christoph whether __this_cpu_*() ops can be better annotated
    so that it can trigger lockdep warning when used from multiple
    contexts"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu-refcount: fix usage of this_cpu_ops

    Linus Torvalds
     
  • The percpu-refcount infrastructure uses the underscore variants of
    this_cpu_ops in order to modify percpu reference counters.
    (e.g. __this_cpu_inc()).

    However the underscore variants do not atomically update the percpu
    variable, instead they may be implemented using read-modify-write
    semantics (more than one instruction). Therefore it is only safe to
    use the underscore variant if the context is always the same (process,
    softirq, or hardirq). Otherwise it is possible to lose updates.

    This problem is something that Sebastian has seen within the aio
    subsystem which uses percpu refcounters both in process and softirq
    context leading to reference counts that never dropped to zero, even
    though the number of "get" and "put" calls matched.

    Fix this by using the non-underscore this_cpu_ops variant which
    provides correct per cpu atomic semantics and fixes the corrupted
    reference counts.

    Cc: Kent Overstreet
    Cc: # v3.11+
    Reported-by: Sebastian Ott
    Signed-off-by: Heiko Carstens
    Signed-off-by: Tejun Heo
    References: http://lkml.kernel.org/g/alpine.LFD.2.11.1406041540520.21183@denkbrett

    Sebastian Ott
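
The lost update described in this commit can be reproduced in plain user-space C. This is a minimal sketch, not kernel code: `counter`, `rmw_inc()` and `irq_handler()` are hypothetical names, and the "interrupt" is simulated by a callback that runs between the load and the store of a read-modify-write increment.

```c
#include <stddef.h>

/* Hypothetical stand-in for one CPU's slot of a percpu counter. */
static long counter;

/*
 * Non-atomic read-modify-write increment, the way a generic
 * __this_cpu_inc() may be implemented. The optional callback plays the
 * role of an interrupt arriving between the load and the store.
 */
void rmw_inc(void (*irq)(void))
{
    long tmp = counter;   /* load */

    if (irq)
        irq();            /* "interrupt" runs on the same CPU here */
    counter = tmp + 1;    /* store overwrites whatever irq() wrote */
}

/* The simulated interrupt handler increments the same counter. */
void irq_handler(void)
{
    rmw_inc(NULL);
}

/* Two increments, the second one nested inside the first. */
long run_interrupted(void)
{
    counter = 0;
    rmw_inc(irq_handler);
    return counter;
}

/* Two increments, one after the other. */
long run_sequential(void)
{
    counter = 0;
    rmw_inc(NULL);
    irq_handler();
    return counter;
}
```

With the nested call, two "get" operations leave the counter at 1 instead of 2, which is the kind of imbalance that keeps a reference count from ever reaching zero. An increment that completes as one indivisible step with respect to the local CPU, which is what the message says this_cpu_inc() guarantees, leaves no such window.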
     

04 Jun, 2014

4 commits

  • Pull intel pstate fixes from Rafael Wysocki:
    "Final power management fixes for 3.15

    - Taking non-idle time into account when calculating core busy time
    was a mistake and led to a performance regression. Since the
    problem it was supposed to address is now taken care of in a
    different way, we don't need to do it any more, so drop the
    non-idle time tracking from intel_pstate. Dirk Brandewie.

    - Changing to fixed point math throughout the busy calculation
    introduced rounding errors that adversely affect the accuracy of
    intel_pstate's computations. Fix from Dirk Brandewie.

    - The PID controller algorithm used by intel_pstate assumes that the
    time interval between two adjacent samples will always be the same
    which is not the case for deferrable timers (used by intel_pstate)
    when the system is idle. This leads to inaccurate predictions and
    artificially increases convergence times for the minimum P-state.
    Fix from Dirk Brandewie.

    - intel_pstate carries out computations using 32-bit variables that
    may overflow for large enough values of APERF/MPERF. Switch to
    using 64-bit variables for computations, from Doug Smythies"

    * tag 'pm-3.15-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    intel_pstate: Improve initial busy calculation
    intel_pstate: add sample time scaling
    intel_pstate: Correct rounding in busy calculation
    intel_pstate: Remove C0 tracking

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "All fairly small: radeon stability and a panic path fix.

    Mostly radeon fixes, suspend/resume fix, stability on the CIK
    chipsets, along with a locking check avoidance patch for panic times
    regression"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    drm/radeon: use the CP DMA on CIK
    drm/radeon: sync page table updates
    drm/radeon: fix vm buffer size estimation
    drm/crtc-helper: skip locking checks in panicking path
    drm/radeon/dpm: resume fixes for some systems

    Linus Torvalds
     
  • The first one is a one liner fixing a stupid typo in the VM handling code and is only relevant if you play with one of the VM defines.

    The other two switch CIK to use the CPDMA instead of the SDMA for buffer moves, as it turned out the SDMA is still sometimes not 100% reliable.

    * 'drm-fixes-3.15' of git://people.freedesktop.org/~deathsimple/linux:
    drm/radeon: use the CP DMA on CIK
    drm/radeon: sync page table updates
    drm/radeon: fix vm buffer size estimation

    Dave Airlie
     
  • Pull sound fixes from Takashi Iwai:
    "A few additions of HD-audio fixups for ALC260 and AD1986A codecs. All
    marked as stable fixes.

    The fixes are pretty local and they are old machines, so quite safe to
    apply"

    * tag 'sound-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: hda/realtek - Fix COEF widget NID for ALC260 replacer fixup
    ALSA: hda/realtek - Correction of fixup codes for PB V7900 laptop
    ALSA: hda/analog - Fix silent output on ASUS A8JN

    Linus Torvalds
     

03 Jun, 2014

18 commits

  • There is still one residue of sysfs remaining: the sb_magic
    SYSFS_MAGIC. However this should be kernfs user specific,
    so this patch moves it out. Kernfs users should specify their
    magic number while mounting.

    Signed-off-by: Jianyu Zhan
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Pull networking fixes from David Miller:

    1) Unbreak zebra and other netlink apps, from Eric W Biederman.

    2) Some new qmi_wwan device IDs, from Aleksander Morgado.

    3) Fix info leak in DCB netlink handler of qlcnic driver, from Dan
    Carpenter.

    4) inet_getid() and ipv6_select_ident() do not generate monotonically
    increasing ID numbers, fix from Eric Dumazet.

    5) Fix memory leak in __sk_prepare_filter(), from Leon Yu.

    6) Netlink leftover bytes warning message is user triggerable, rate
    limit it. From Michal Schmidt.

    7) Fix non-linear SKB panic in ipvs, from Peter Christensen.

    8) Congestion window undo needs to be performed even if only never
    retransmitted data is SACK'd, fix from Yuchung Cheng.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
    net: filter: fix possible memory leak in __sk_prepare_filter()
    net: ec_bhf: Add runtime dependencies
    tcp: fix cwnd undo on DSACK in F-RTO
    netlink: Only check file credentials for implicit destinations
    ipheth: Add support for iPad 2 and iPad 3
    team: fix mtu setting
    net: fix inet_getid() and ipv6_select_ident() bugs
    net: qmi_wwan: interface #11 in Sierra Wireless MC73xx is not QMI
    net: qmi_wwan: add additional Sierra Wireless QMI devices
    bridge: Prevent insertion of FDB entry with disallowed vlan
    netlink: rate-limit leftover bytes warning and print process name
    bridge: notify user space after fdb update
    net: qmi_wwan: add Netgear AirCard 341U
    net: fix wrong mac_len calculation for vlans
    batman-adv: fix NULL pointer dereferences
    net/mlx4_core: Reset RoCE VF gids when guest driver goes down
    emac: aggregation of v1-2 PLB errors for IER register
    emac: add missing support of 10mbit in emac/rgmii
    can: only rename enabled led triggers when changing the netdev name
    ipvs: Fix panic due to non-linear skb
    ...

    Linus Torvalds
     
  • __sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
    rework/optimize internal BPF interpreter's instruction set) so that it should
    have uncharged memory once things went wrong. However that work isn't complete.
    Error is handled only in __sk_migrate_filter() while memory can still leak in
    the error path right after sk_chk_filter().

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Leon Yu
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Leon Yu
     
  • Pull two md bugfixes from Neil Brown:
    "Two md bugfixes for possible corruption when restarting reshape

    If a raid5/6 reshape is restarted (after stopping and re-assembling
    the array) and the array is marked read-only (or read-auto), then the
    reshape will appear to complete immediately, without actually moving
    anything around. This can result in corruption.

    There are two patches which do much the same thing in different
    places. They are separate because one is an older bug and so can be
    applied to more -stable kernels"

    * tag 'md/3.15-fixes' of git://neil.brown.name/md:
    md: always set MD_RECOVERY_INTR when interrupting a reshape thread.
    md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".

    Linus Torvalds
     
  • The ec_bhf driver is specific to the Beckhoff CX embedded PC series.
    These are based on Intel x86 CPU. So we can add a dependency on
    X86, with COMPILE_TEST as an alternative to still allow for broader
    build-testing.

    Signed-off-by: Jean Delvare
    Cc: Darek Marcinkiewicz
    Cc: David S. Miller
    Signed-off-by: David S. Miller

    Jean Delvare
     
  • Queued trim only works for some users with MU05 firmware. Revert to
    blacklisting all firmware versions.

    Introduced by commit d121f7d0cbb8 ("libata: Update queued trim blacklist
    for M5x0 drives") which this effectively reverts, while retaining the
    blacklisting of M550.

    See

    https://bugzilla.kernel.org/show_bug.cgi?id=71371

    for reports of trouble with MU05 firmware.

    Signed-off-by: Martin K. Petersen
    Cc: Alan Cox
    Cc: Tejun Heo
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Martin K. Petersen
     
  • Pull x86 fix from Peter Anvin:
    "A single quite small patch that managed to get overlooked earlier, to
    prevent a user space triggerable oops on systems without HPET"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET

    Linus Torvalds
     
  • Pull USB fixes from Greg KH:
    "Here are some fixes for 3.15-rc8 that resolve a number of tiny USB
    issues that have been reported, and there are some new device ids as
    well.

    All have been tested in linux-next"

    * tag 'usb-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    xhci: delete endpoints from bandwidth list before freeing whole device
    usb: pci-quirks: Prevent Sony VAIO t-series from switching usb ports
    USB: cdc-wdm: properly include types.h
    usb: cdc-wdm: export cdc-wdm uapi header
    USB: serial: option: add support for Novatel E371 PCIe card
    USB: ftdi_sio: add NovaTech OrionLXm product ID
    USB: io_ti: fix firmware download on big-endian machines (part 2)
    USB: Avoid runtime suspend loops for HCDs that can't handle suspend/resume

    Linus Torvalds
     
  • Pull staging driver fixes from Greg KH:
    "Here are some staging driver fixes for 3.15.

    Three are for the speakup drivers (one fixes a regression caused in
    3.15-rc, and the other two resolve a tty issue found by Ben Hutchings)
    The comedi and r8192e_pci driver fixes also resolve reported issues"

    * tag 'staging-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
    staging: r8192e_pci: fix htons error
    Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt
    Staging: speakup: Move pasting into a work item
    staging: comedi: ni_daq_700: add mux settling delay
    speakup: fix incorrect perms on speakup_acntsa.c

    Linus Torvalds
     
  • This bug was discovered through a recent F-RTO issue on the tcpm list
    https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html

    The bug is that currently F-RTO does not use DSACK to undo cwnd in
    certain cases: upon receiving an ACK after the RTO retransmission in
    F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
    the sender only calls tcp_try_undo_loss() if some never retransmitted
    data is sacked (FLAG_ORIG_DATA_SACKED).

    The correct behavior is to unconditionally call tcp_try_undo_loss so
    the DSACK information is used properly to undo the cwnd reduction.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • It was possible to get a setuid root or setcap executable to write to
    its stdout or stderr (which has been made a netlink socket) and
    inadvertently reconfigure the networking stack.

    To prevent this we check that both the creator of the socket and
    the current application have permission to reconfigure the network
    stack.

    Unfortunately this breaks Zebra which always uses sendto/sendmsg
    and creates its socket without any privileges.

    To keep Zebra working don't bother checking if the creator of the
    socket has privilege when a destination address is specified. Instead
    rely exclusively on the privileges of the sender of the socket.

    Note from Andy: This is exactly Eric's code except for some comment
    clarifications and formatting fixes. Neither I nor, I think, anyone
    else is thrilled with this approach, but I'm hesitant to wait on a
    better fix since 3.15 is almost here.

    Note to stable maintainers: This is a mess. An earlier series of
    patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
    but they did so in a way that breaks Zebra. The offending series
    includes:

    commit aa4cf9452f469f16cea8c96283b641b4576d4a7b
    Author: Eric W. Biederman
    Date: Wed Apr 23 14:28:03 2014 -0700

    net: Add variants of capable for use on netlink messages

    If a given kernel version is missing that series of fixes, it's
    probably worth backporting it and this patch. If that series is
    present, then this fix is critical if you care about Zebra.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Andy Lutomirski
    Signed-off-by: David S. Miller

    Eric W. Biederman
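
The rule described in this commit can be written as a single boolean expression. The following is a sketch under assumptions, not the kernel function: `nl_capable_sketch()` and its three parameters are hypothetical names that only model the decision the message describes.

```c
#include <stdbool.h>

/*
 * has_dst:        the message was sent with an explicit destination
 *                 address (sendto()/sendmsg()), not a plain write()
 * opener_capable: whoever created the socket had the capability
 * sender_capable: the task sending the message has the capability
 *
 * The creator's privilege is only consulted for implicit destinations;
 * the sender's privilege is always required.
 */
bool nl_capable_sketch(bool has_dst, bool opener_capable, bool sender_capable)
{
    return (has_dst || opener_capable) && sender_capable;
}
```

Two cases show the intent. Zebra creates its socket unprivileged but later sends with a destination address while privileged, so the message is accepted. A setuid binary tricked into write()-ing to a netlink socket that an unprivileged user opened has no explicit destination and an unprivileged creator, so the message is rejected.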
     
  • Each iPad model has a different product id; this patch adds support for iPad 2
    (pid 0x12a2) and iPad 3 (pid 0x12a6). Note that iPad 2 must be jailbroken and a
    third-party app must be used for tethering to work. On iPad 3, tethering works
    out of the box (assuming your ISP is nice).

    Signed-off-by: Kristian Evensen
    Signed-off-by: David S. Miller

    Kristian Evensen
     
  • Currently it is not possible to set the mtu of a team device which has a
    port enslaved to it. The reason is that when team_change_mtu() calls
    dev_set_mtu() for the port device, the notifier for the NETDEV_PRECHANGEMTU
    event is called and team_device_event() returns NOTIFY_BAD, forbidding
    the change. So fix this by returning NOTIFY_DONE here in case team is
    changing mtu in team_change_mtu().

    Introduced-by: 3d249d4c "net: introduce ethernet teaming device"
    Signed-off-by: Jiri Pirko
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Jiri Pirko
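
One way to picture this fix is a flag that the notifier consults before vetoing the change. The sketch below is a user-space model under assumptions: the struct, the flag `mtu_change_allowed`, and all function names are hypothetical and only mirror the roles of team_change_mtu(), dev_set_mtu() and team_device_event() named in the message.

```c
#include <stdbool.h>

enum notify_result { SKETCH_NOTIFY_DONE, SKETCH_NOTIFY_BAD };

struct team_sketch {
    bool mtu_change_allowed;  /* true only while the team updates its ports */
    int port_mtu;
};

/* Notifier: veto an MTU change on a port unless the team initiated it. */
enum notify_result port_prechange_mtu(const struct team_sketch *t)
{
    return t->mtu_change_allowed ? SKETCH_NOTIFY_DONE : SKETCH_NOTIFY_BAD;
}

/* Models setting the MTU on the port: the notifier is asked first. */
int port_set_mtu(struct team_sketch *t, int mtu)
{
    if (port_prechange_mtu(t) == SKETCH_NOTIFY_BAD)
        return -1;
    t->port_mtu = mtu;
    return 0;
}

/* Models the team-level MTU change: open the gate around the update. */
int team_change_mtu_sketch(struct team_sketch *t, int mtu)
{
    int err;

    t->mtu_change_allowed = true;
    err = port_set_mtu(t, mtu);
    t->mtu_change_allowed = false;
    return err;
}
```

A direct change on the enslaved port is still refused, while the same change made through the team succeeds, which is the behavior the commit message asks for.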
     
  • I noticed we were sending wrong IPv4 ID in TCP flows when MTU discovery
    is disabled.
    Note how GSO/TSO packets do not have monotonically incrementing ID.

    06:37:41.575531 IP (id 14227, proto: TCP (6), length: 4396)
    06:37:41.575534 IP (id 14272, proto: TCP (6), length: 65212)
    06:37:41.575544 IP (id 14312, proto: TCP (6), length: 57972)
    06:37:41.575678 IP (id 14317, proto: TCP (6), length: 7292)
    06:37:41.575683 IP (id 14361, proto: TCP (6), length: 63764)

    It appears I introduced this bug in linux-3.1.

    inet_getid() must return the old value of peer->ip_id_count,
    not the new one.

    Let's revert this part, and remove the prevention of
    a null identification field in IPv6 Fragment Extension Header,
    which is dubious and not even done properly.

    Fixes: 87c48fa3b463 ("ipv6: make fragment identifications less predictable")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
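
The difference between returning the old and the new counter value can be shown with C11 atomics. This is a sketch, not the kernel's inet_getid(): `ip_id_count` here is a plain global and `getid_sketch()` is a hypothetical name. The point is that a caller reserving `more + 1` consecutive IDs must receive the first ID of its range, which is the value the counter held before the addition.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-peer ID counter. */
static atomic_int ip_id_count;

/*
 * Reserve (more + 1) consecutive IDs and return the first one.
 * atomic_fetch_add() returns the value held before the addition,
 * so consecutive callers receive adjacent, non-overlapping ranges.
 */
int getid_sketch(int more)
{
    more++;
    return atomic_fetch_add(&ip_id_count, more);
}
```

Starting from zero, a caller that needs 45 IDs for one large GSO packet gets 0 and owns 0..44, and the next caller gets 45. Returning the new value instead would hand the first caller 45, the same ID the next caller's range begins with.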
     
  • This interface is unusable, as the cdc-wdm character device doesn't reply to
    any QMI command. Also, the out-of-tree Sierra Wireless GobiNet driver fully
    skips it.

    Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • A set of new VID/PIDs retrieved from the out-of-tree GobiNet/GobiSerial
    Sierra Wireless drivers.

    Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • br_handle_local_finish() is allowing us to insert an FDB entry with
    disallowed vlan. For example, when port 1 and 2 are communicating in
    vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
    interfere with their communication by spoofed src mac address with
    vlan id 10.

    Note: Even if it is judged that a frame should not be learned, it should
    not be dropped, because it is destined not for the forwarding layer but
    for a higher layer. See IEEE 802.1Q-2011 8.13.10.

    Signed-off-by: Toshiaki Makita
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • Any process is able to send netlink messages with leftover bytes.
    Make the warning rate-limited to prevent too much log spam.

    The warning is supposed to help find userspace bugs, so print the
    triggering command name to implicate the buggy program.

    [v2: Use pr_warn_ratelimited instead of printk_ratelimited.]

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller

    Michal Schmidt
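
The idea behind rate-limiting a user-triggerable warning can be sketched as a burst-per-interval counter. This is an illustration only and not how pr_warn_ratelimited() is implemented: the struct, its fields and `ratelimit_ok()` are hypothetical, and time is passed in as a plain number so the behavior is deterministic.

```c
#include <stdbool.h>

struct ratelimit_sketch {
    unsigned long interval;  /* window length, in arbitrary time units */
    unsigned int burst;      /* messages allowed per window */
    unsigned long begin;     /* start of the current window */
    unsigned int printed;    /* messages emitted in the current window */
};

/* Returns true if a message may be emitted at time 'now'. */
bool ratelimit_ok(struct ratelimit_sketch *rs, unsigned long now)
{
    if (now - rs->begin >= rs->interval) {
        /* A new window starts: reset the budget. */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    return false;            /* over budget: suppress the message */
}
```

With an interval of 5 and a burst of 2, messages at times 0 and 1 pass, one at time 2 is suppressed, and the budget is available again at time 5.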
     

02 Jun, 2014

16 commits

  • The conversion to a fixup table for Replacer model with ALC260 in
    commit 20f7d928 took the wrong widget NID for COEF setups. Namely,
    NID 0x1a should have been used instead of NID 0x20, which is the
    common node for all Realtek codecs but ALC260.

    Fixes: 20f7d928fa6e ('ALSA: hda/realtek - Replace ALC260 model=replacer with the auto-parser')
    Cc: [v3.4+]
    Signed-off-by: Takashi Iwai

    Takashi Iwai
     
  • Correction of wrong fixup entries added in commit ca8f0424 to replace
    static model quirk for PB V7900 laptop (will model).

    [note: the removal of ALC260_FIXUP_HP_PIN_0F chain is also needed as a
    part of the fix; otherwise the pin is set up wrongly as a headphone,
    and user-space (PulseAudio) may be wrongly trying to detect the jack
    state -- tiwai]

    Fixes: ca8f04247eaa ('ALSA: hda/realtek - Add the fixup codes for ALC260 model=will')
    Signed-off-by: Ronan Marquet
    Cc: [v3.4+]
    Signed-off-by: Takashi Iwai

    Ronan Marquet
     
  • This change makes the busy calculation using 64 bit math which prevents
    overflow for large values of aperf/mperf.

    Cc: 3.14+ # 3.14+
    Signed-off-by: Doug Smythies
    Signed-off-by: Dirk Brandewie
    Signed-off-by: Rafael J. Wysocki

    Doug Smythies
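
A small example shows why 32-bit intermediates overflow here. The formula below (aperf * 100 / mperf) is a simplified assumption chosen for illustration and is not the driver's exact computation; the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Busy percentage with a 32-bit intermediate. aperf * 100 wraps around
 * once aperf exceeds UINT32_MAX / 100 (about 42.9 million).
 */
uint32_t busy_pct_32(uint32_t aperf, uint32_t mperf)
{
    return aperf * 100u / mperf;
}

/* Same computation with the product widened to 64 bits first. */
uint32_t busy_pct_64(uint32_t aperf, uint32_t mperf)
{
    return (uint32_t)((uint64_t)aperf * 100u / mperf);
}
```

For aperf = 50,000,000 and mperf = 100,000,000 the expected answer is 50. The 64-bit version returns 50, while the 32-bit version returns 7, because 5,000,000,000 wraps to 705,032,704 before the division.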
     
  • The PID assumes that samples are of equal time, which is not true for
    deferrable timers when the system goes idle. This causes the
    PID to take a long time to converge to the min P state and depending
    on the pattern of the idle load can make the P state appear stuck.

    The hold-off value of three sample times before using the scaling is
    to give a grace period for applications that have high performance
    requirements and spend a lot of time idle. The poster child for this
    behavior is the ffmpeg benchmark in the Phoronix test suite.

    Cc: 3.14+ # 3.14+
    Signed-off-by: Dirk Brandewie
    Signed-off-by: Rafael J. Wysocki

    Dirk Brandewie
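
The scaling with its hold-off of three sample times can be sketched in integer arithmetic. The function and parameter names are hypothetical, and the driver's fixed-point helpers are replaced by a plain 64-bit multiply and divide for readability.

```c
#include <stdint.h>

/*
 * busy_pct:    busy value computed for the sample
 * sample_us:   nominal sample period
 * duration_us: time that actually elapsed since the previous sample
 *
 * If the deferrable timer fired much later than expected (more than
 * three nominal periods), scale the busy value by sample/duration so a
 * long idle gap is not treated like one normal-length sample.
 */
uint32_t scale_busy(uint32_t busy_pct, uint32_t sample_us, uint32_t duration_us)
{
    if ((uint64_t)duration_us > (uint64_t)3 * sample_us)
        return (uint32_t)((uint64_t)busy_pct * sample_us / duration_us);
    return busy_pct;
}
```

With a 10 ms nominal period, a sample taken after 10 ms or 30 ms is left alone, while a sample taken after 100 ms has its busy value reduced from 80 to 8, which pushes the PID toward the minimum P state faster.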
     
  • Changing to fixed point math throughout the busy calculation in
    commit e66c1768 (Change busy calculation to use fixed point
    math.) introduced some inaccuracies by rounding the busy value at two
    points in the calculation. This change removes those roundings and moves
    the rounding to the output of the PID where the calculations are
    complete and the value is returned as an integer.

    Fixes: e66c17683746 (intel_pstate: Change busy calculation to use fixed point math.)
    Reported-by: Doug Smythies
    Cc: 3.14+ # 3.14+
    Signed-off-by: Dirk Brandewie
    Signed-off-by: Rafael J. Wysocki

    Dirk Brandewie
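
The effect of rounding at intermediate points can be seen with a tiny fixed-point example. Everything here is an assumption made for illustration: the 8 fractional bits, the helper definitions, and the two-step formula (a percentage multiplied by a P-state ratio) are not taken from the driver.

```c
#include <stdint.h>

#define FRAC_BITS 8

static int64_t int_tofp(int64_t x) { return x << FRAC_BITS; }
static int64_t mul_fp(int64_t a, int64_t b) { return (a * b) >> FRAC_BITS; }
static int64_t div_fp(int64_t a, int64_t b) { return (a << FRAC_BITS) / b; }

/* Convert fixed point to the nearest integer. */
static int64_t fp_round(int64_t x)
{
    return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
}

/* Rounds both intermediate values to integers before combining them. */
int64_t busy_rounded_early(int64_t aperf, int64_t mperf,
                           int64_t max_pstate, int64_t cur_pstate)
{
    int64_t core_pct = fp_round(div_fp(int_tofp(aperf * 100), int_tofp(mperf)));
    int64_t ratio = fp_round(div_fp(int_tofp(max_pstate), int_tofp(cur_pstate)));

    return core_pct * ratio;
}

/* Keeps full fixed-point precision and rounds once at the end. */
int64_t busy_rounded_late(int64_t aperf, int64_t mperf,
                          int64_t max_pstate, int64_t cur_pstate)
{
    int64_t core_pct = div_fp(int_tofp(aperf * 100), int_tofp(mperf));
    int64_t ratio = div_fp(int_tofp(max_pstate), int_tofp(cur_pstate));

    return fp_round(mul_fp(core_pct, ratio));
}
```

For aperf/mperf = 2/3 and a P-state ratio of 34/16 the exact result is about 141.67. Rounding early gives 67 * 2 = 134, while rounding once at the end gives 142.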
     
  • Commit fcb6a15c (intel_pstate: Take core C0 time into account for core
    busy calculation) introduced a regression referenced below. The issue
    with "lockup" after suspend that this commit was addressing is now dealt
    with in the suspend path.

    Fixes: fcb6a15c2e7e (intel_pstate: Take core C0 time into account for core busy calculation)
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=66581
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=75121
    Reported-by: Doug Smythies
    Cc: 3.14+ # 3.14+
    Signed-off-by: Dirk Brandewie
    Signed-off-by: Rafael J. Wysocki

    Dirk Brandewie
     
  • The SDMA sometimes doesn't seem to work reliably.

    Signed-off-by: Christian König
    Cc: stable@vger.kernel.org

    Christian König
     
  • Only necessary if we don't use the same engine for buffer moves and table updates.

    Signed-off-by: Christian König

    Christian König
     
  • Only relevant if we got VM_BLOCK_SIZE>9, but better safe than sorry.

    Signed-off-by: Christian König

    Christian König
     
  • There have been a number of incidents recently where customers running KVM have
    reported that VM hosts on different Hypervisors are unreachable. Based on
    pcap traces we found that the bridge was broadcasting the ARP request out
    onto the network. However some NICs have an inbuilt switch which on occasion
    was broadcasting the VM's ARP request back through the physical NIC on the
    Hypervisor. This resulted in the bridge changing ports and incorrectly learning
    that the VM's mac address was external. As a result the ARP reply was directed
    back onto the external network and the VM never updated its ARP cache. This patch
    will notify the bridge command after an fdb has been updated, to identify such
    port toggling.

    Signed-off-by: Jon Maxwell
    Reviewed-by: Jiri Pirko
    Acked-by: Toshiaki Makita
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • Skip locking checks in drm_helper_*_in_use() if they are called in panicking
    path. See similar code in drm_warn_on_modeset_not_all_locked().

    After panic information has been output, these WARN_ONs go off outputting a lot
    of lines and scrolling the panic information out of the screen. Here is a
    partial call trace showing how execution reaches them:

    ? drm_helper_crtc_in_use()
    ? __drm_helper_disable_unused_functions()
    ? several *_set_config functions
    ? drm_fb_helper_restore_fbdev_mode()

    Reviewed-by: Daniel Vetter
    Signed-off-by: Sergei Antonov
    Signed-off-by: Dave Airlie

    Sergei Antonov
     
  • Setting the power state prior to restoring the display
    hardware leads to blank screens on some systems. Drop
    the power state set from dpm resume. The power state
    will get set as part of the mode set sequence. Also
    add an explicit power state set after mode set resume
    to cover PX and headless systems.

    bug:
    https://bugzilla.kernel.org/show_bug.cgi?id=76761

    Signed-off-by: Alex Deucher
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie

    Alex Deucher
     
  • Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • After 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol") skb->mac_len is used as a start of the
    calculation in skb_network_protocol() but that is not always correct. If
    skb->protocol == 8021Q/AD, usually the vlan header is already inserted
    in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
    dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
    destination mac address (skb->data + VLAN_HLEN) as a type in
    skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
    off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
    skb_network_protocol() returns a type from inside the packet and
    offset == 22. Also make vlan_depth unsigned as suggested before.
    As suggested by Eric Dumazet, move the while() loop in the if() so we
    can avoid additional testing in fast path.

    Here are a few netperf tests + debug printk's to illustrate:
    cat netperf.tso-on.reorder-on.bugged
    - Vlan -> device (reorder on, default, this case is okay)
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.00 7111.54
    [ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
    skb->mac_len 0 vlan_depth 0 type 0x800

    - Vlan -> device (reorder off, bad)
    cat netperf.tso-on.reorder-off.bugged
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.00 241.35
    [ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 4 type 0x5301
    0x5301 are the last two bytes of the destination mac.

    And if we stop TSO, we may get even the following:
    [ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 18 vlan_depth 22 type 0xb84
    Because mac_len already accounts for VLAN_HLEN.

    After the fix:
    cat netperf.tso-on.reorder-off.fixed
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    87380 16384 16384 10.01 5001.46
    [ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 18 type 0x800

    CC: Vlad Yasevich
    CC: Eric Dumazet
    CC: Daniel Borkman
    CC: David S. Miller

    Fixes: 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
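
The depth calculation discussed in this commit can be illustrated by walking VLAN tags in a raw frame. This is a user-space sketch, not skb_network_protocol(): the function name is hypothetical, the frame is a plain byte buffer starting at the destination MAC, and the walk always starts at the Ethernet header length, never at a precomputed mac_len.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN      14      /* dst MAC + src MAC + EtherType */
#define VLAN_HLEN     4       /* 2-byte TCI + 2-byte encapsulated type */
#define ETH_P_8021Q   0x8100
#define ETH_P_8021AD  0x88A8

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Returns the first non-VLAN EtherType and stores the offset of the
 * network header in *depth. Returns 0 if the frame is too short.
 */
uint16_t network_protocol_sketch(const uint8_t *frame, size_t len,
                                 unsigned int *depth)
{
    unsigned int off = ETH_HLEN;
    uint16_t type;

    if (len < ETH_HLEN)
        return 0;
    type = read_be16(frame + ETH_HLEN - 2);
    while (type == ETH_P_8021Q || type == ETH_P_8021AD) {
        if (len < off + VLAN_HLEN)
            return 0;
        /* The encapsulated type follows the 2-byte TCI. */
        type = read_be16(frame + off + 2);
        off += VLAN_HLEN;
    }
    *depth = off;
    return type;
}
```

For a frame carrying one 802.1Q tag in front of an IPv4 packet this yields type 0x800 at depth 18, the same values as the "after the fix" trace in the message. An untagged IPv4 frame yields depth 14.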
     
  • Linus Torvalds
     
  • Pull powerpc fix from Ben Herrenschmidt:
    "Here's just one trivial patch to wire up sys_renameat2 which I seem to
    have completely missed so far.

    (My test build scripts fwd me warnings but miss the ones generated for
    missing syscalls)"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: Wire renameat2() syscall

    Linus Torvalds