13 Jan, 2021

4 commits

  • [ Upstream commit 4ae2bb81649dc03dfc95875f02126b14b773f7ab ]

    Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Antoine Tenart
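
The locking rule in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: `fake_dev`, `cfg_trylock()` and `rxqs_show()` are hypothetical names, and a C11 `atomic_flag` stands in for the rtnl lock. It reads the traffic class count and the map under one lock, and asks the caller to retry when the lock is busy.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the rtnl lock: one global try-lock. */
static atomic_flag cfg_lock = ATOMIC_FLAG_INIT;

/* Hypothetical, much-reduced device: one rxqs bitmap per traffic class. */
struct fake_dev {
    int num_tc;
    unsigned long rxqs_map[8];
};

static int cfg_trylock(void) { return !atomic_flag_test_and_set(&cfg_lock); }
static void cfg_unlock(void) { atomic_flag_clear(&cfg_lock); }

/*
 * Format the bitmap for one traffic class. Returns the number of bytes
 * written, -EAGAIN if the lock is busy (the caller should retry), or
 * -EINVAL for a traffic class outside [0, num_tc).
 */
static int rxqs_show(struct fake_dev *dev, int tc, char *buf, size_t len)
{
    int ret;

    if (!cfg_trylock())
        return -EAGAIN;

    /* num_tc and the map are read under the same lock, so they agree. */
    if (tc < 0 || tc >= dev->num_tc)
        ret = -EINVAL;
    else
        ret = snprintf(buf, len, "%lx\n", dev->rxqs_map[tc]);

    cfg_unlock();
    return ret;
}
```

Returning an error instead of blocking keeps a reader from sleeping on a lock that the writer side may hold for a long time; whether that trade-off suits a given caller is a design choice.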
     
  • [ Upstream commit 2d57b4f142e0b03e854612b8e28978935414bced ]

    Two race conditions can be triggered when storing xps rxqs, resulting in
    various oops and invalid memory accesses:

    1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

    - netif_set_xps_queue uses dev->num_tc as one of the parameters to
    compute the size of new_dev_maps when allocating it. dev->num_tc is
    also used to access the map, and the compiler may generate code to
    retrieve this field multiple times in the function.

    - netdev_set_num_tc sets dev->num_tc.

    If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
    is set to a higher value through netdev_set_num_tc, later accesses to
    new_dev_maps in netif_set_xps_queue could lead to accessing memory
    outside of new_dev_maps, triggering an oops.

    2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

    2.1. netdev_set_num_tc starts by resetting the xps queues;
    dev->num_tc isn't updated yet.

    2.2. netif_set_xps_queue is called, setting up the map with the
    *old* dev->num_tc.

    2.3. netdev_set_num_tc updates dev->num_tc.

    2.4. Later accesses to the map lead to out-of-bounds accesses and
    oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) while writing to
    xps_rxqs in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_rxqs_store.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Antoine Tenart
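
The out-of-bounds access described in the commit above comes down to index arithmetic. The sketch below is a simplified model, not the kernel code: it assumes the map holds one slot per (id, traffic class) pair and that a slot is addressed as `id * num_tc + tc`. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of map slots at allocation time (simplified model). */
static size_t xps_slot_count(int num_tc, int nr_ids)
{
    return (size_t)num_tc * (size_t)nr_ids;
}

/* Slot used for queue/CPU `id` and traffic class `tc` (simplified model). */
static size_t xps_slot_index(int id, int num_tc, int tc)
{
    return (size_t)id * (size_t)num_tc + (size_t)tc;
}

/*
 * True if an index computed with `num_tc_now` still fits a map that was
 * allocated while the device had `num_tc_at_alloc` traffic classes.
 */
static bool xps_index_fits(int id, int tc, int nr_ids,
                           int num_tc_at_alloc, int num_tc_now)
{
    return xps_slot_index(id, num_tc_now, tc) <
           xps_slot_count(num_tc_at_alloc, nr_ids);
}
```

With 4 ids and a map sized for 1 traffic class there are 4 slots. If the traffic class count grows to 4 before the map is indexed, id 3 lands on slot 12, well past the end of the allocation.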
     
  • [ Upstream commit fb25038586d0064123e393cadf1fadd70a9df97a ]

    Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Antoine Tenart
     
  • [ Upstream commit 1ad58225dba3f2f598d2c6daed4323f24547168f ]

    Two race conditions can be triggered when storing xps cpus, resulting in
    various oops and invalid memory accesses:

    1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

    - netif_set_xps_queue uses dev->num_tc as one of the parameters to
    compute the size of new_dev_maps when allocating it. dev->num_tc is
    also used to access the map, and the compiler may generate code to
    retrieve this field multiple times in the function.

    - netdev_set_num_tc sets dev->num_tc.

    If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
    is set to a higher value through netdev_set_num_tc, later accesses to
    new_dev_maps in netif_set_xps_queue could lead to accessing memory
    outside of new_dev_maps, triggering an oops.

    2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

    2.1. netdev_set_num_tc starts by resetting the xps queues;
    dev->num_tc isn't updated yet.

    2.2. netif_set_xps_queue is called, setting up the map with the
    *old* dev->num_tc.

    2.3. netdev_set_num_tc updates dev->num_tc.

    2.4. Later accesses to the map lead to out-of-bounds accesses and
    oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) while writing to
    xps_cpus in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_cpus_store.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Antoine Tenart
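
The fix in the commit above makes the two writers mutually exclusive. A minimal userspace sketch of that idea follows, assuming a spin lock built on C11 `atomic_flag` as a stand-in for the rtnl lock; `fake_dev`, `fake_set_num_tc()` and `fake_set_xps()` are hypothetical and far simpler than the kernel functions they mimic.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the rtnl lock. */
static atomic_flag cfg_lock = ATOMIC_FLAG_INIT;

static void cfg_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&cfg_lock, memory_order_acquire))
        ; /* spin; a real lock would sleep instead */
}

static void cfg_lock_release(void)
{
    atomic_flag_clear_explicit(&cfg_lock, memory_order_release);
}

struct fake_dev {
    int num_tc;          /* assumed to be >= 1 in this model */
    int nr_ids;
    size_t map_slots;    /* slots the current map was sized for */
    unsigned long *map;  /* NULL when no map is installed */
};

/* Writer 1: change the traffic class count; the old map is dropped first. */
static void fake_set_num_tc(struct fake_dev *dev, int num_tc)
{
    cfg_lock_acquire();
    free(dev->map);
    dev->map = NULL;
    dev->map_slots = 0;
    dev->num_tc = num_tc;
    cfg_lock_release();
}

/* Writer 2: store a value for (id, tc), sizing the map from num_tc. */
static int fake_set_xps(struct fake_dev *dev, int id, int tc, unsigned long val)
{
    int ret = 0;

    cfg_lock_acquire();
    if (id < 0 || id >= dev->nr_ids || tc < 0 || tc >= dev->num_tc) {
        ret = -EINVAL;
        goto out;
    }
    if (!dev->map) {
        dev->map_slots = (size_t)dev->num_tc * (size_t)dev->nr_ids;
        dev->map = calloc(dev->map_slots, sizeof(*dev->map));
        if (!dev->map) {
            dev->map_slots = 0;
            ret = -ENOMEM;
            goto out;
        }
    }
    /* num_tc cannot change between sizing and indexing: the lock is held. */
    dev->map[(size_t)id * (size_t)dev->num_tc + (size_t)tc] = val;
out:
    cfg_lock_release();
    return ret;
}
```

Because both writers hold the same lock for their whole critical section, the map size and the index are always derived from the same `num_tc` value.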
     

02 Oct, 2020

1 commit

  • Fix the following warnings:
    [net/core/net-sysfs.c:1161]: (warning) %u in format string (no. 1)
    requires 'unsigned int' but the argument type is 'int'.
    [net/core/net-sysfs.c:1162]: (warning) %u in format string (no. 1)
    requires 'unsigned int' but the argument type is 'int'.

    Reported-by: Hulk Robot
    Signed-off-by: Ye Bin
    Signed-off-by: David S. Miller

    Ye Bin
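
The warnings above are about a conversion specifier that does not match its argument type. As a small sketch, assuming the value being printed is a plain `int` (the function name is hypothetical), the matching specifier is `%d`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* `tc` is a signed int, so the specifier has to be %d rather than %u. */
static int format_traffic_class(char *buf, size_t len, int tc)
{
    return snprintf(buf, len, "%d\n", tc);
}
```

The other way to silence such a warning is to change the variable to `unsigned int`; which one is right depends on whether negative values are meaningful.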
     

13 Aug, 2020

1 commit

  • We must accept an empty mask in store_rps_map(), or we are not able
    to disable RPS on a queue.

    Fixes: 07bbecb34106 ("net: Restrict receive packets queuing to housekeeping CPUs")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Cc: Alex Belits
    Cc: Nitesh Narayan Lal
    Cc: Peter Zijlstra (Intel)
    Reviewed-by: Maciej Żenczykowski
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Nitesh Narayan Lal
    Signed-off-by: David S. Miller

    Eric Dumazet
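
The rule in the commit above can be sketched with plain 64-bit bitmasks. This is an assumption-laden model: `filter_rps_mask()` is a hypothetical helper, and CPU sets are reduced to one `uint64_t` each. An empty request is accepted as "disable RPS"; a non-empty request is restricted to housekeeping CPUs and rejected only if nothing is left.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 and stores the effective mask in *out, or -EINVAL when a
 * non-empty request contains no housekeeping CPU at all.
 */
static int filter_rps_mask(uint64_t requested, uint64_t housekeeping,
                           uint64_t *out)
{
    uint64_t mask = requested;

    if (mask != 0) {
        mask &= housekeeping;
        if (mask == 0)
            return -EINVAL;
    }

    *out = mask; /* 0 here means "RPS disabled on this queue" */
    return 0;
}
```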
     

04 Aug, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Improve uclamp performance by using a static key for the fast path

    - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)

    - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the
    values become larger. This is now replaced with more precise
    arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

    - Improve the deadline scheduler, such as making it capacity aware

    - Improve frequency-invariant scheduling

    - Misc cleanups in energy/power aware scheduling

    - Add sched_update_nr_running tracepoint to track changes to nr_running

    - Documentation additions and updates

    - Misc cleanups and smaller fixes

    * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
    sched/doc: Document capacity aware scheduling
    sched: Document arch_scale_*_capacity()
    arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
    Documentation/sysctl: Document uclamp sysctl knobs
    sched/uclamp: Add a new sysctl to control RT default boost value
    sched/uclamp: Fix a deadlock when enabling uclamp static key
    sched: Remove duplicated tick_nohz_full_enabled() check
    sched: Fix a typo in a comment
    sched/uclamp: Remove unnecessary mutex_init()
    arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
    sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
    arch_topology, sched/core: Cleanup thermal pressure definition
    trace/events/sched.h: fix duplicated word
    linux/sched/mm.h: drop duplicated words in comments
    smp: Fix a potential usage of stale nr_cpus
    sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
    sched: nohz: stop passing around unused "ticks" parameter.
    sched: Better document ttwu()
    sched: Add a tracepoint to track rq->nr_running
    ...

    Linus Torvalds
     

22 Jul, 2020

1 commit


08 Jul, 2020

1 commit

  • With the existing implementation of store_rps_map(), packets are queued
    in the receive path on the backlog queues of other CPUs irrespective of
    whether they are isolated or not. This could add a latency overhead to
    any RT workload that is running on the same CPU.

    Ensure that store_rps_map() only uses available housekeeping CPUs for
    storing the rps_map.

    Signed-off-by: Alex Belits
    Signed-off-by: Nitesh Narayan Lal
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200625223443.2684-4-nitesh@redhat.com

    Alex Belits
     

16 May, 2020

1 commit

  • The assumption that a device node is associated either with the
    netdev's device, or the parent of that device, does not hold for all
    drivers. E.g. Freescale's DPAA has two layers of platform devices
    above the netdev. Instead, recursively walk up the tree from the
    netdev, allowing any parent to match against the sought after node.

    Signed-off-by: Tobias Waldekranz
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Tobias Waldekranz
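
The parent walk described in the commit above can be sketched as a loop over a parent chain. `fake_device` is a hypothetical two-field reduction of a device structure; only the traversal is the point here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of a device: its firmware node and its parent. */
struct fake_device {
    const void *of_node;
    struct fake_device *parent;
};

/* True if `dev` or any of its ancestors is associated with `node`. */
static bool dev_or_ancestor_matches(const struct fake_device *dev,
                                    const void *node)
{
    for (; dev; dev = dev->parent) {
        if (dev->of_node == node)
            return true;
    }
    return false;
}
```

With a chain of netdev, platform device, platform device, a node attached two levels up is still found, which a check limited to the device and its direct parent would miss.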
     

24 Apr, 2020

2 commits

  • gro_flush_timeout and napi_defer_hard_irqs can be read
    from napi_complete_done() while other cpus write the value,
    without explicit synchronization.

    Use READ_ONCE()/WRITE_ONCE() to annotate the races.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
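
As a rough userspace approximation of the annotation in the commit above, the macros below force a single volatile access per use. They are simplified stand-ins written with the GNU C `__typeof__` extension, not the kernel's definitions, and `fake_netdev` is a hypothetical structure.

```c
#include <assert.h>

/* Simplified approximations; the kernel's macros do more than this. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct fake_netdev {
    unsigned long gro_flush_timeout;
    int napi_defer_hard_irqs;
};

/* Writer side, e.g. a sysfs store handler. */
static void store_gro_flush_timeout(struct fake_netdev *dev, unsigned long val)
{
    WRITE_ONCE(dev->gro_flush_timeout, val);
}

/* Reader side: load the value exactly once, then use the local copy. */
static unsigned long load_gro_flush_timeout(struct fake_netdev *dev)
{
    return READ_ONCE(dev->gro_flush_timeout);
}
```

The annotation does not add ordering or locking. It documents that the access is intentionally racy and keeps the compiler from tearing it or repeating it.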
     
  • Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
    we added the ability to arm one high resolution timer, that we used
    to keep not-complete packets in GRO engine a bit longer, hoping that further
    frames might be added to them.

    Since then, we added the napi_complete_done() interface, and commit
    364b6055738b ("net: busy-poll: return busypolling status to drivers")
    allowed drivers to avoid re-arming NIC interrupts if we made a promise
    that their NAPI poll() handler would be called in the near future.

    This infrastructure can be leveraged, thanks to a new device parameter,
    which allows arming the napi hrtimer instead of re-arming the device
    hard IRQ.

    We have noticed that on some servers with 32 RX queues or more, the chit-chat
    between the NIC and the host caused by IRQ delivery and re-arming could hurt
    throughput by ~20% on 100Gbit NIC.

    In contrast, hrtimers are using local (percpu) resources and might have lower
    cost.

    The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
    as gro_flush_timeout (/sys/class/net/ethX/).

    By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.

    This patch does not change the prior behavior of gro_flush_timeout
    if used alone: NIC hard irqs should be rearmed as before.

    One concrete usage can be:

    echo 20000 >/sys/class/net/eth1/gro_flush_timeout
    echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs

    If at least one packet is retired, then we will reset napi counter
    to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
    of the queue.

    On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
    avoidance was only possible if napi->poll() was exhausting its budget
    and not calling napi_complete_done().

    This feature also can be used to work around some non-optimal NIC irq
    coalescing strategies.

    Having the ability to insert XX usec delays between each napi->poll()
    can increase cache efficiency, since we increase batch sizes.

    It also keeps the serving cpus from idling too long, reducing tail latencies.

    Co-developed-by: Luigi Rizzo
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
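
The counter behaviour described in the commit above can be modelled in a few lines. This is a simplified sketch, not the kernel's napi_complete_done(): `model_napi` and `model_complete_done()` are hypothetical, GRO state is ignored, and the return value only says whether the driver would re-arm the hard IRQ.

```c
#include <assert.h>
#include <stdbool.h>

struct model_napi {
    unsigned long gro_flush_timeout; /* ns; 0 disables the timer */
    int napi_defer_hard_irqs;        /* tunable: polls to defer */
    int defer_hard_irqs_count;       /* remaining deferred polls */
};

/* Returns true if the hard IRQ should be re-armed after this poll. */
static bool model_complete_done(struct model_napi *n, int work_done)
{
    bool rearm_irq = true;

    /* Any retired packet refills the defer counter. */
    if (work_done)
        n->defer_hard_irqs_count = n->napi_defer_hard_irqs;

    /* While the counter is positive, rely on the timer instead. */
    if (n->defer_hard_irqs_count > 0) {
        n->defer_hard_irqs_count--;
        if (n->gro_flush_timeout)
            rearm_irq = false;
    }

    return rearm_irq;
}
```

With both tunables at zero the model always re-arms the IRQ, matching the statement that the defaults keep the prior behavior.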
     

21 Apr, 2020

1 commit


10 Apr, 2020

1 commit


27 Feb, 2020

2 commits

  • Add a function to change the owner of the queue entries for a network device
    when it is moved between network namespaces.

    Currently, when moving network devices between network namespaces the
    ownership of the corresponding queue sysfs entries is not changed. This leads
    to problems when tools try to operate on the corresponding sysfs files. Fix
    this.

    Signed-off-by: Christian Brauner
    Signed-off-by: David S. Miller

    Christian Brauner
     
  • Add a function to change the owner of a network device when it is moved
    between network namespaces.

    Currently, when moving network devices between network namespaces the
    ownership of the corresponding sysfs entries is not changed. This leads
    to problems when tools try to operate on the corresponding sysfs files.
    This leads to a bug whereby a network device that is created in a
    network namespace owned by a user namespace will have its corresponding
    sysfs entry owned by the root user of the corresponding user namespace.
    If such a network device has to be moved back to the host network
    namespace the permissions will still be set to the user namespaces. This
    means unprivileged users can e.g. trigger uevents for such incorrectly
    owned devices. They can also modify the settings of the device itself.
    Both of these things are unwanted.

    For example, workloads will create network devices in the host network
    namespace. Other tools will then proceed to move such devices between
    network namespaces owned by other user namespaces. While the ownership
    of the device itself is updated in
    net/core/dev.c:dev_change_net_namespace() the corresponding sysfs
    entry for the device is not:

    drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 .
    drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
    drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
    drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
    drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics
    lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
    -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
    -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent

    However, if a device is created directly in the network namespace then
    the device's sysfs permissions will be correctly updated:

    drwxr-xr-x 5 root root 0 Jan 25 18:12 .
    drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
    -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type
    -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len
    -r--r--r-- 1 root root 4096 Jan 25 18:12 address
    -r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier
    -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes
    -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count
    -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count
    -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id
    -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port
    -r--r--r-- 1 root root 4096 Jan 25 18:12 dormant
    -r--r--r-- 1 root root 4096 Jan 25 18:12 duplex
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 flags
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias
    -r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex
    -r--r--r-- 1 root root 4096 Jan 25 18:12 iflink
    -r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu
    -r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group
    -r--r--r-- 1 root root 4096 Jan 25 18:12 operstate
    -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id
    -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name
    -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id
    drwxr-xr-x 2 root root 0 Jan 25 18:12 power
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down
    drwxr-xr-x 4 root root 0 Jan 25 18:12 queues
    -r--r--r-- 1 root root 4096 Jan 25 18:12 speed
    drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics
    lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len
    -r--r--r-- 1 root root 4096 Jan 25 18:12 type
    -rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent

    Now, when creating a network device in a network namespace owned by a
    user namespace and moving it to the host the permissions will be set to
    the id that the user namespace root user has been mapped to on the host
    leading to all sorts of permission issues:

    458752
    drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 .
    drwxr-xr-x 9 root root 0 Jan 25 18:08 ..
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id
    drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down
    drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed
    drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics
    lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len
    -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type
    -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent

    Signed-off-by: Christian Brauner
    Signed-off-by: David S. Miller

    Christian Brauner
     

18 Dec, 2019

1 commit

  • dev_hold() must always be called in rx_queue_add_kobject().
    Otherwise the usage count drops below 0 if kobject_init_and_add()
    fails.

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Reported-by: syzbot
    Cc: Tetsuo Handa
    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: Jouni Hogander
    Signed-off-by: David S. Miller

    Jouni Hogander
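
The reference counting rule in the commit above can be modelled with plain counters. Everything here is hypothetical (`fake_netdev`, `fake_queue` and the `fake_*` helpers); the only behaviours taken from the commit messages in this log are that the init-and-add step holds a reference even when it fails, that the release callback drops the device reference, and that kobject_put() belongs in the error path.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_netdev { int refcnt; };
struct fake_queue  { struct fake_netdev *dev; int kobj_refs; };

static void fake_dev_hold(struct fake_netdev *dev) { dev->refcnt++; }
static void fake_dev_put(struct fake_netdev *dev)  { dev->refcnt--; }

/* Release callback: runs on the last kobject put, always puts the device. */
static void fake_queue_release(struct fake_queue *q) { fake_dev_put(q->dev); }

static void fake_kobject_put(struct fake_queue *q)
{
    if (--q->kobj_refs == 0)
        fake_queue_release(q);
}

/* Models init-and-add: a reference is held even on failure. */
static int fake_kobject_init_and_add(struct fake_queue *q, bool fail)
{
    q->kobj_refs = 1;
    return fail ? -1 : 0;
}

static int fake_queue_add_kobject(struct fake_netdev *dev,
                                  struct fake_queue *q, bool fail)
{
    int err;

    q->dev = dev;
    fake_dev_hold(dev); /* taken first, so release always has a ref to drop */

    err = fake_kobject_init_and_add(q, fail);
    if (err) {
        fake_kobject_put(q); /* error path only: frees and runs release */
        return err;
    }
    return 0;
}
```

If the hold were taken only after a successful add, the failure path would still run the release callback and the device count would end up one below where it started.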
     

07 Dec, 2019

1 commit

  • dev_hold() must always be called in netdev_queue_add_kobject().
    Otherwise the usage count drops below 0 if kobject_init_and_add()
    fails.

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Reported-by: Hulk Robot
    Cc: Tetsuo Handa
    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: David S. Miller

    Jouni Hogander
     

21 Nov, 2019

2 commits

  • kobject_put() should only be called in the error path.

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Signed-off-by: Eric Dumazet
    Cc: Jouni Hogander
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • kobject_init_and_add takes a reference even when it fails. The
    caller has to give it up in its error handling; otherwise the memory
    allocated by kobject_init_and_add is never freed. Originally found
    by Syzkaller:

    BUG: memory leak
    unreferenced object 0xffff8880679f8b08 (size 8):
    comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s)
    hex dump (first 8 bytes):
    72 78 2d 30 00 36 20 d4 rx-0.6 .
    backtrace:
    [] __kmalloc_track_caller+0x16e/0x290
    [] kvasprintf+0xb1/0x140
    [] kvasprintf_const+0x56/0x160
    [] kobject_set_name_vargs+0x5b/0x140
    [] kobject_init_and_add+0xd8/0x170
    [] net_rx_queue_update_kobjects+0x152/0x560
    [] netdev_register_kobject+0x210/0x380
    [] register_netdevice+0xa1b/0xf00
    [] __tun_chr_ioctl+0x20d5/0x3dd0
    [] tun_chr_ioctl+0x2f/0x40
    [] do_vfs_ioctl+0x1c7/0x1510
    [] ksys_ioctl+0x99/0xb0
    [] __x64_sys_ioctl+0x78/0xb0
    [] do_syscall_64+0x16f/0x580
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [] 0xffffffffffffffff

    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: Jouni Hogander
    Signed-off-by: David S. Miller

    Jouni Hogander
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

2 commits

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

    2) Add fib_sync_mem to control the amount of dirty memory we allow to
    queue up between synchronize RCU calls, from David Ahern.

    3) Make flow classifier more lockless, from Vlad Buslov.

    4) Add PHY downshift support to aquantia driver, from Heiner
    Kallweit.

    5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
    contention on SLAB spinlocks in heavy RPC workloads.

    6) Partial GSO offload support in XFRM, from Boris Pismenny.

    7) Add fast link down support to ethtool, from Heiner Kallweit.

    8) Use siphash for IP ID generator, from Eric Dumazet.

    9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
    entries, from David Ahern.

    10) Move skb->xmit_more into a per-cpu variable, from Florian
    Westphal.

    11) Improve eBPF verifier speed and increase maximum program size,
    from Alexei Starovoitov.

    12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
    spinlocks. From Neil Brown.

    13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

    14) Improve link partner cap detection in generic PHY code, from
    Heiner Kallweit.

    15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
    Maguire.

    16) Remove SKB list implementation assumptions in SCTP, yours truly.

    17) Various cleanups, optimizations, and simplifications in r8169
    driver. From Heiner Kallweit.

    18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

    19) Switch PHY drivers over to use dynamic feature detection, from
    Heiner Kallweit.

    20) Support flow steering without masking in dpaa2-eth, from Ioana
    Ciocoi.

    21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
    Pirko.

    22) Increase the strict parsing of current and future netlink
    attributes, also export such policies to userspace. From Johannes
    Berg.

    23) Allow DSA tag drivers to be modular, from Andrew Lunn.

    24) Remove legacy DSA probing support, also from Andrew Lunn.

    25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
    Haabendal.

    26) Add a generic tracepoint for TX queue timeouts to ease debugging,
    from Cong Wang.

    27) More indirect call optimizations, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
    cxgb4: Fix error path in cxgb4_init_module
    net: phy: improve pause mode reporting in phy_print_status
    dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
    net: macb: Change interrupt and napi enable order in open
    net: ll_temac: Improve error message on error IRQ
    net/sched: remove block pointer from common offload structure
    net: ethernet: support of_get_mac_address new ERR_PTR error
    net: usb: smsc: fix warning reported by kbuild test robot
    staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
    net: dsa: support of_get_mac_address new ERR_PTR error
    net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
    vrf: sit mtu should not be updated when vrf netdev is the link
    net: dsa: Fix error cleanup path in dsa_init_module
    l2tp: Fix possible NULL pointer dereference
    taprio: add null check on sched_nest to avoid potential null pointer dereference
    net: mvpp2: cls: fix less than zero check on a u32 variable
    net_sched: sch_fq: handle non connected flows
    net_sched: sch_fq: do not assume EDT packets are ordered
    net: hns3: use devm_kcalloc when allocating desc_cb
    net: hns3: some cleanup for struct hns3_enet_ring
    ...

    Linus Torvalds
     
  • Pull driver core/kobject updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.2-rc1

    There are a number of ACPI patches in here as well, as Rafael said
    they should go through this tree due to the driver core changes they
    required. They have all been acked by the ACPI developers.

    There are also a number of small subsystem-specific changes in here,
    due to some changes to the kobject core code. Those too have all been
    acked by the various subsystem maintainers.

    As for content, it's pretty boring outside of the ACPI changes:
    - spdx cleanups
    - kobject documentation updates
    - default attribute groups for kobjects
    - other minor kobject/driver core fixes

    All have been in linux-next for a while with no reported issues"

    * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
    kobject: clean up the kobject add documentation a bit more
    kobject: Fix kernel-doc comment first line
    kobject: Remove docstring reference to kset
    firmware_loader: Fix a typo ("syfs" -> "sysfs")
    kobject: fix dereference before null check on kobj
    Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
    init/config: Do not select BUILD_BIN2C for IKCONFIG
    Provide in-kernel headers to make extending kernel easier
    kobject: Improve doc clarity kobject_init_and_add()
    kobject: Improve docs for kobject_add/del
    driver core: platform: Fix the usage of platform device name(pdev->name)
    livepatch: Replace klp_ktype_patch's default_attrs with groups
    cpufreq: schedutil: Replace default_attrs field with groups
    padata: Replace padata_attr_type default_attrs field with groups
    irqdesc: Replace irq_kobj_type's default_attrs field with groups
    net-sysfs: Replace ktype default_attrs field with groups
    block: Replace all ktype default_attrs with groups
    samples/kobject: Replace foo_ktype's default_attrs field with groups
    kobject: Add support for default attribute groups to kobj_type
    driver core: Postpone DMA tear-down until after devres release for probe failure
    ...

    Linus Torvalds
     

26 Apr, 2019

1 commit

  • The kobj_type default_attrs field is being replaced by the
    default_groups field. Replace the default_attrs fields in rx_queue_ktype
    and netdev_queue_ktype with default_groups. Use the ATTRIBUTE_GROUPS
    macro to create rx_queue_default_groups and netdev_queue_default_groups.

    This patch was tested by verifying that the sysfs files for the
    attributes in the default groups were created.

    Signed-off-by: Kimberly Brown
    Signed-off-by: Greg Kroah-Hartman

    Kimberly Brown
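
The attribute-group pattern named in the commit above can be imitated in userspace. The structures and the `FAKE_ATTRIBUTE_GROUPS` macro below are hypothetical reductions written for illustration, and the two attribute names are only examples. The macro turns a NULL-terminated `<name>_attrs` array into a `<name>_group` and a NULL-terminated `<name>_groups` array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fake_attribute { const char *name; };
struct fake_attribute_group { struct fake_attribute **attrs; };

/* Builds <name>_group and <name>_groups[] from an existing <name>_attrs[]. */
#define FAKE_ATTRIBUTE_GROUPS(_name)                              \
static const struct fake_attribute_group _name##_group = {       \
    .attrs = _name##_attrs,                                       \
};                                                                \
static const struct fake_attribute_group *_name##_groups[] = {   \
    &_name##_group,                                               \
    NULL,                                                         \
}

static struct fake_attribute rps_cpus_attr = { .name = "rps_cpus" };
static struct fake_attribute rps_flow_cnt_attr = { .name = "rps_flow_cnt" };

static struct fake_attribute *rx_queue_default_attrs[] = {
    &rps_cpus_attr,
    &rps_flow_cnt_attr,
    NULL,
};
FAKE_ATTRIBUTE_GROUPS(rx_queue_default);

/* Count every attribute reachable through a NULL-terminated group list. */
static size_t count_default_attrs(const struct fake_attribute_group **groups)
{
    size_t n = 0;

    for (; *groups; groups++)
        for (struct fake_attribute **a = (*groups)->attrs; *a; a++)
            n++;
    return n;
}
```

A type that points at the groups array, rather than at a flat attribute list, can later gain more groups without changing the consumer.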
     

18 Apr, 2019

1 commit


16 Apr, 2019

1 commit

  • This reverts commit 6b70fc94afd165342876e53fc4b2f7d085009945.

    The reverted bugfix will cause another issue.
    Reported by syzbot+6024817a931b2830bc93@syzkaller.appspotmail.com.
    See https://syzkaller.appspot.com/x/log.txt?x=1737671b200000 for
    details.

    Signed-off-by: Wang Hai
    Acked-by: Andy Shevchenko
    Signed-off-by: David S. Miller

    Wang Hai
     

28 Mar, 2019

1 commit


24 Mar, 2019

1 commit


22 Mar, 2019

1 commit

  • When registering struct net_device, it will call
    register_netdevice ->
    netdev_register_kobject ->
    device_initialize(dev);
    dev_set_name(dev, "%s", ndev->name)
    device_add(dev)
    register_queue_kobjects(ndev)

    In netdev_register_kobject(), if device_add(dev) or
    register_queue_kobjects(ndev) fails, register_netdevice()
    will return an error, causing netdev_freemem(ndev) to be
    called to free the net_device. However, put_device(&dev->dev)->..->
    kobject_cleanup() won't be called, resulting in a memory leak.

    syzkaller reported this:
    BUG: memory leak
    unreferenced object 0xffff8881f4fad168 (size 8):
    comm "syz-executor.0", pid 3575, jiffies 4294778002 (age 20.134s)
    hex dump (first 8 bytes):
    77 70 61 6e 30 00 ff ff wpan0...
    backtrace:
    [] kstrdup_const+0x3d/0x50 mm/util.c:73
    [] kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
    [] kobject_set_name_vargs+0x55/0x130 lib/kobject.c:281
    [] dev_set_name+0xbb/0xf0 drivers/base/core.c:1915
    [] netdev_register_kobject+0xc0/0x410 net/core/net-sysfs.c:1727
    [] register_netdevice+0xa51/0xeb0 net/core/dev.c:8711
    [] cfg802154_update_iface_num.isra.2+0x13/0x90 [ieee802154]
    [] ieee802154_llsec_fill_key_id+0x1d5/0x570 [ieee802154]
    [] 0xffffffffc1500e0e
    [] platform_drv_probe+0xc6/0x180 drivers/base/platform.c:614
    [] really_probe+0x491/0x7c0 drivers/base/dd.c:509
    [] driver_probe_device+0xdc/0x240 drivers/base/dd.c:671
    [] device_driver_attach+0xf2/0x130 drivers/base/dd.c:945
    [] __driver_attach+0x10e/0x210 drivers/base/dd.c:1022
    [] bus_for_each_dev+0x154/0x1e0 drivers/base/bus.c:304
    [] bus_add_driver+0x427/0x5e0 drivers/base/bus.c:645

    Reported-by: Hulk Robot
    Fixes: 1fa5ae857bb1 ("driver core: get rid of struct device's bus_id string array")
    Signed-off-by: Wang Hai
    Reviewed-by: Andy Shevchenko
    Reviewed-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Wang Hai
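
The leak described in the commit above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code or the actual patch: `toy_dev`, `toy_register` and the other `toy_*` names are hypothetical stand-ins for `struct device`, `netdev_register_kobject()` and friends. It assumes only the general kobject rule that memory attached to an initialized object (here, the name) is released when the last reference is dropped, so freeing the container alone leaves the name behind.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct device / struct kobject:
 * a reference count plus a separately allocated name. */
struct toy_dev {
    int refcount;
    char *name;
};

/* Number of name strings currently allocated (leak detector). */
static int toy_live_names;

static void toy_dev_init(struct toy_dev *d)
{
    d->refcount = 1;
    d->name = NULL;
}

static int toy_dev_set_name(struct toy_dev *d, const char *name)
{
    size_t len = strlen(name) + 1;

    d->name = malloc(len);
    if (!d->name)
        return -1;
    memcpy(d->name, name, len);
    toy_live_names++;
    return 0;
}

/* Dropping the last reference is the only place the name is freed. */
static void toy_dev_put(struct toy_dev *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0 && d->name) {
        free(d->name);
        d->name = NULL;
        toy_live_names--;
    }
}

/* Registration whose "add" step can be forced to fail.
 * With put_on_error == 0 the error path returns without dropping
 * the reference, so the caller frees only the container. */
static int toy_register(struct toy_dev *d, const char *name,
                        int add_fails, int put_on_error)
{
    toy_dev_init(d);
    if (toy_dev_set_name(d, name))
        return -1;
    if (add_fails) {
        if (put_on_error)
            toy_dev_put(d);
        return -1;
    }
    return 0;
}
```

Calling `toy_register(&d, "wpan0", 1, 0)` leaves `toy_live_names` at 1, which mirrors the 8-byte "wpan0" object in the syzkaller report; passing `put_on_error = 1` brings the count back to 0.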
     

20 Mar, 2019

1 commit

    In netdev_queue_add_kobject and rx_queue_add_kobject,
    if sysfs_create_group fails, kobject_put will call
    netdev_queue_release to decrease the dev refcount, even though
    dev_hold has not been called. So we will see this while
    unregistering dev:

    unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1

    Reported-by: Hulk Robot
    Fixes: d0d668371679 ("net: don't decrement kobj reference count on init failure")
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
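
The imbalance in the commit above comes down to a release callback that drops a reference the setup path has not yet taken. Here is a minimal userspace sketch under that reading; `toy_netdev`, `toy_queue` and `toy_queue_add` are hypothetical names, and the `hold_early` switch is only one possible way to restore the balance in this toy model, not a description of the actual kernel change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and a queue kobject. */
struct toy_netdev {
    int refcnt;
};

struct toy_queue {
    struct toy_netdev *dev;
};

/* Release callback: always drops one device reference. */
static void toy_queue_release(struct toy_queue *q)
{
    q->dev->refcnt--;
}

/* Adds a queue object. When group_fails is set, the error path runs
 * the release callback, as kobject_put() would.
 * hold_early selects whether the device reference is taken before
 * or after the step that can fail. */
static int toy_queue_add(struct toy_queue *q, struct toy_netdev *dev,
                         int group_fails, int hold_early)
{
    assert(dev != NULL);
    q->dev = dev;
    if (hold_early)
        dev->refcnt++;
    if (group_fails) {
        toy_queue_release(q);
        return -1;
    }
    if (!hold_early)
        dev->refcnt++;
    return 0;
}
```

With `hold_early = 0` and a forced failure, a device that starts at zero ends at -1, the same value as in the "Usage count = -1" message quoted above. With `hold_early = 1` the failure path leaves the count unchanged.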
     

05 Mar, 2019

2 commits


04 Mar, 2019

1 commit

    syzkaller reported this:
    BUG: memory leak
    unreferenced object 0xffff88837a71a500 (size 256):
    comm "syz-executor.2", pid 9770, jiffies 4297825125 (age 17.843s)
    hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    ff ff ff ff ff ff ff ff 20 c0 ef 86 ff ff ff ff ........ .......
    backtrace:
    [] netdev_register_kobject+0x124/0x2e0 net/core/net-sysfs.c:1751
    [] register_netdevice+0xcc1/0x1270 net/core/dev.c:8516
    [] tun_set_iff drivers/net/tun.c:2649 [inline]
    [] __tun_chr_ioctl+0x2218/0x3d20 drivers/net/tun.c:2883
    [] vfs_ioctl fs/ioctl.c:46 [inline]
    [] do_vfs_ioctl+0x1a5/0x10e0 fs/ioctl.c:690
    [] ksys_ioctl+0x89/0xa0 fs/ioctl.c:705
    [] __do_sys_ioctl fs/ioctl.c:712 [inline]
    [] __se_sys_ioctl fs/ioctl.c:710 [inline]
    [] __x64_sys_ioctl+0x74/0xb0 fs/ioctl.c:710
    [] do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    register_queue_kobjects should call kset_unregister to free
    'dev->queues_kset' in its error path; otherwise it will cause a
    memory leak.

    Reported-by: Hulk Robot
    Fixes: 1d24eb4815d1 ("xps: Transmit Packet Steering")
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     

07 Feb, 2019

2 commits

  • Now that we have a dedicated NDO for getting a port's parent ID, get rid
    of SWITCHDEV_ATTR_ID_PORT_PARENT_ID and convert all callers to use the
    NDO exclusively. This is a preliminary change to getting rid of
    switchdev_ops eventually.

    Signed-off-by: Florian Fainelli
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • In preparation for getting rid of switchdev_ops, create a dedicated NDO
    operation for getting the port's parent identifier. There are
    essentially two classes of drivers that need to implement getting the
    port's parent ID: VF/PF drivers with a built-in switch, and
    pure switchdev drivers such as mlxsw, ocelot, dsa etc.

    We introduce a helper function: dev_get_port_parent_id() which supports
    recursion into the lower devices to obtain the first port's parent ID.

    Convert the bridge, core and ipv4 multicast routing code to check for
    such ndo_get_port_parent_id() and call the helper function when valid
    before falling back to switchdev_port_attr_get(). This will allow us to
    convert all relevant drivers in one go instead of having to implement
    both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
    then get rid of switchdev_port_attr_get().

    Acked-by: Jiri Pirko
    Signed-off-by: Florian Fainelli
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Florian Fainelli
     

07 Dec, 2018

1 commit

  • In order to pass extack together with NETDEV_PRE_UP notifications, it's
    necessary to route the extack to __dev_open() from diverse (possibly
    indirect) callers. One prominent API through which the notification is
    invoked is dev_change_flags().

    Therefore extend dev_change_flags() with an extra extack argument and
    update all users. Most of the calls end up just encoding NULL, but
    several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available.

    Since the function declaration line is changed anyway, name the other
    function arguments to placate checkpatch.

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Petr Machata
     

10 Aug, 2018

1 commit

  • The definition of static_key_slow_inc() has cpus_read_lock in place. In the
    virtio_net driver, XPS queues are initialized after setting the queue:cpu
    affinity in virtnet_set_affinity() which is already protected within
    cpus_read_lock. Lockdep prints a warning when we are trying to acquire
    cpus_read_lock when it is already held.

    This patch adds an ability to call __netif_set_xps_queue under
    cpus_read_lock().
    Acked-by: Jason Wang

    ============================================
    WARNING: possible recursive locking detected
    4.18.0-rc3-next-20180703+ #1 Not tainted
    --------------------------------------------
    swapper/0/1 is trying to acquire lock:
    00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

    but task is already holding lock:
    00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(cpu_hotplug_lock.rw_sem);
    lock(cpu_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    3 locks held by swapper/0/1:
    #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
    #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
    #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

    v2: move cpus_read_lock() out of __netif_set_xps_queue()

    Cc: "Nambiar, Amritha"
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

    Signed-off-by: Andrei Vagin

    Signed-off-by: David S. Miller

    Andrei Vagin
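
The change described above follows a common pattern: keep an inner function that assumes the lock is already held, and a wrapper that takes the lock for callers that do not hold it. The sketch below models that pattern in userspace with a depth counter instead of a real lock. All `toy_*` names are hypothetical, and the counter is an assumption made to keep the example self-contained; it is not how `cpus_read_lock()` or lockdep work.

```c
#include <assert.h>

/* Toy lock: tracks nesting depth so that a second acquisition by
 * the same path becomes visible in toy_max_depth. */
static int toy_lock_depth;
static int toy_max_depth;

static void toy_lock(void)
{
    toy_lock_depth++;
    if (toy_lock_depth > toy_max_depth)
        toy_max_depth = toy_lock_depth;
}

static void toy_unlock(void)
{
    assert(toy_lock_depth > 0);
    toy_lock_depth--;
}

/* Inner variant: the caller must already hold the lock. */
static int toy_set_xps_locked(int *map, int value)
{
    assert(toy_lock_depth > 0);
    *map = value;
    return 0;
}

/* Wrapper for callers that do not hold the lock. */
static int toy_set_xps(int *map, int value)
{
    int ret;

    toy_lock();
    ret = toy_set_xps_locked(map, value);
    toy_unlock();
    return ret;
}

/* A caller that already holds the lock for its own work (loosely
 * modelled on the affinity setup path in the commit message) uses
 * the inner variant, so the lock is never taken twice. */
static int toy_set_affinity(int *map, int value)
{
    int ret;

    toy_lock();
    ret = toy_set_xps_locked(map, value);
    toy_unlock();
    return ret;
}
```

If `toy_set_affinity()` called `toy_set_xps()` instead of the inner variant, `toy_max_depth` would reach 2, which is the toy equivalent of the recursive acquisition that lockdep warns about.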
     

21 Jul, 2018

3 commits

  • Make net_ns_get_ownership() reusable by networking code outside of core.
    This is useful, for example, to allow bridge related sysfs files to be
    owned by container root.

    Add a function comment since this is a potentially dangerous function to
    use given the way that kobject_get_ownership() works by initializing uid
    and gid before calling .get_ownership().

    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Tyler Hicks
     
  • When creating various objects in /sys/class/net/... make sure that they
    belong to container's owner instead of global root (if they belong to a
    container/namespace).

    Co-Developed-by: Tyler Hicks
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     
  • An upcoming change will allow container root to open some /sys/class/net
    files for writing. The tx_maxrate attribute can result in changes
    to actual hardware devices so err on the side of caution by requiring
    CAP_NET_ADMIN in the init namespace in the corresponding attribute store
    operation.

    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Tyler Hicks