11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming the fields
    that hold port identifiers to portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.
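
    For example, in-kernel netlink users now refer to the sender's port id
    roughly as shown below (field name per this rename; the usage line is
    illustrative only):

        NETLINK_CB(skb).portid    /* formerly NETLINK_CB(skb).pid */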

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

09 Sep, 2012

2 commits

  • This patch defines netlink_kernel_create as a wrapper function of
    __netlink_kernel_create to hide the struct module *me parameter
    (which seems to be THIS_MODULE in all existing netlink subsystems).
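
    A rough sketch of what such a wrapper looks like (simplified; the exact
    placement and form in the tree may differ):

        static inline struct sock *
        netlink_kernel_create(struct net *net, int unit,
                              struct netlink_kernel_cfg *cfg)
        {
                return __netlink_kernel_create(net, unit, THIS_MODULE, cfg);
        }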

    Suggested by David S. Miller.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Replace netlink_set_nonroot with a new `flags' field in
    struct netlink_kernel_cfg that is passed to netlink_kernel_create.

    This patch also renames NL_NONROOT_* to NL_CFG_F_NONROOT_* since
    now the flags field in nl_table is generic (so we can add more
    flags if needed in the future).

    Also adjust all callers in the net-next tree to use these flags
    instead of netlink_set_nonroot.
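
    An illustrative caller after this change (a sketch only; the unit, the
    handler name and the exact create signature here are assumptions):

        struct netlink_kernel_cfg cfg = {
                .groups = 1,
                .flags  = NL_CFG_F_NONROOT_RECV,
                .input  = my_netlink_rcv,   /* hypothetical input handler */
        };

        struct sock *sk = netlink_kernel_create(net, NETLINK_USERSOCK, &cfg);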

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

08 Sep, 2012

2 commits

  • Since the route cache deletion (commit 89aef8921bfbac22f), delay is no
    longer used. Remove it.

    Signed-off-by: Nicolas Dichtel
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Passing uids and gids on NETLINK_CB from a process in one user
    namespace to a process in another user namespace can result in the
    wrong uid or gid being presented to userspace. Avoid that problem by
    passing kuids and kgids instead.

    - Define struct scm_creds, for use in scm_cookie and netlink_skb_parms,
    which holds uid and gid information as kuid_t and kgid_t (see the sketch
    after this list).

    - Modify scm_set_cred to fill out scm_creds by hand instead of using
    cred_to_ucred to fill out struct ucred. This conversion ensures
    userspace does not get incorrect uid or gid values to look at.

    - Modify scm_recv to convert from struct scm_creds to struct ucred
    before copying credential values to userspace.

    - Modify __scm_send to populate struct scm_creds in the scm_cookie,
    instead of just copying struct ucred from userspace.

    - Modify netlink_sendmsg to copy scm_creds instead of struct ucred
    into the NETLINK_CB.
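
    A sketch of the new structure as described above (the exact field layout
    may differ in the actual patch):

        struct scm_creds {
                u32     pid;
                kuid_t  uid;
                kgid_t  gid;
        };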

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

06 Sep, 2012

3 commits

  • When adding a blackhole or a prohibit route, it was handled like a classic
    route. Moreover, it was only possible to add this kind of route by specifying
    an interface.

    Bug already reported here:
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498

    Before the patch:
    $ ip route add blackhole 2001::1/128
    RTNETLINK answers: No such device
    $ ip route add blackhole 2001::1/128 dev eth0
    $ ip -6 route | grep 2001
    2001::1 dev eth0 metric 1024

    After:
    $ ip route add blackhole 2001::1/128
    $ ip -6 route | grep 2001
    blackhole 2001::1 dev lo metric 1024 error -22

    v2: wrong patch
    v3: add a field fc_type in struct fib6_config to store RTN_* type

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • It seems we need to provide the ability for stacked devices
    to use a specific lock_class_key for sch->busylock.

    We could instead default l2tpeth tx_queue_len to 0 (no qdisc), but
    a user might use a qdisc anyway.

    (So the same fix is probably needed on other non-LLTX stacked drivers.)
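
    One plausible shape for such an annotation, sketched from the description
    above (the qdisc_tx_busylock field name and the init hook used here are
    assumptions and may not match the actual patch):

        /* give this stacked device's qdisc busylock its own lockdep class */
        static struct lock_class_key l2tp_eth_tx_busylock;

        static int l2tp_eth_dev_init(struct net_device *dev)
        {
                dev->qdisc_tx_busylock = &l2tp_eth_tx_busylock;
                return 0;
        }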

    Noticed while stressing an L2TPv3 setup:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc3+ #788 Not tainted
    -------------------------------------------------------
    netperf/4660 is trying to acquire lock:
    (l2tpsock){+.-...}, at: [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

    but task is already holding lock:
    (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&sch->busylock)->rlock){+.-...}:
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock_irqsave+0x4c/0x60
    [] __wake_up+0x32/0x70
    [] tty_wakeup+0x3e/0x80
    [] pty_write+0x73/0x80
    [] tty_put_char+0x3c/0x40
    [] process_echoes+0x142/0x330
    [] n_tty_receive_buf+0x8fb/0x1230
    [] flush_to_ldisc+0x142/0x1c0
    [] process_one_work+0x198/0x760
    [] worker_thread+0x186/0x4b0
    [] kthread+0x93/0xa0
    [] kernel_thread_helper+0x4/0x10

    -> #0 (l2tpsock){+.-...}:
    [] __lock_acquire+0x1628/0x1b10
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock+0x41/0x50
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ip_finish_output+0x3d0/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] tcp_transmit_skb+0x402/0xa60
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] sock_sendmsg+0xdc/0xf0
    [] sys_sendto+0xfe/0x130
    [] system_call_fastpath+0x16/0x1b
    Possible unsafe locking scenario:

    CPU0                                 CPU1
    ----                                 ----
    lock(&(&sch->busylock)->rlock);
                                         lock(l2tpsock);
                                         lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);

    *** DEADLOCK ***

    5 locks held by netperf/4660:
    #0: (sk_lock-AF_INET){+.+.+.}, at: [] tcp_sendmsg+0x2c/0x1040
    #1: (rcu_read_lock){.+.+..}, at: [] ip_queue_xmit+0x0/0x680
    #2: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x135/0x890
    #3: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0xe00
    #4: (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    stack backtrace:
    Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
    Call Trace:
    [] print_circular_bug+0x1fb/0x20c
    [] __lock_acquire+0x1628/0x1b10
    [] ? check_usage+0x9b/0x4d0
    [] ? __lock_acquire+0x2e4/0x1b10
    [] lock_acquire+0x90/0x200
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] _raw_spin_lock+0x41/0x50
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] ? dev_hard_start_xmit+0x5e/0xa70
    [] ? dev_queue_xmit+0x141/0xe00
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ? dev_hard_start_xmit+0xa70/0xa70
    [] ip_finish_output+0x3d0/0x890
    [] ? ip_finish_output+0x135/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] ? ip_local_out+0xa0/0xa0
    [] tcp_transmit_skb+0x402/0xa60
    [] ? tcp_md5_do_lookup+0x18e/0x1a0
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] ? inet_create+0x6b0/0x6b0
    [] ? sock_update_classid+0xc2/0x3b0
    [] ? sock_update_classid+0x130/0x3b0
    [] sock_sendmsg+0xdc/0xf0
    [] ? fget_light+0x3f9/0x4f0
    [] sys_sendto+0xfe/0x130
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x30/0x50
    [] ? finish_task_switch+0x83/0xf0
    [] ? finish_task_switch+0x46/0xf0
    [] ? sysret_check+0x1b/0x56
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Add support for the genl family "tcp_metrics". No locking
    is changed, only that now we can unlink and delete
    entries after a grace period. We implement get/del for a
    single entry and dump to support show/flush filtering
    in user space. A del without an address attribute causes a
    flush of all addresses, sadly under genl_mutex.

    v2:
    - remove rcu_assign_pointer as suggested by Eric Dumazet,
    it is not needed because there are no other writes under lock
    - move the flushing code in tcp_metrics_flush_all

    v3:
    - remove synchronize_rcu on flush as suggested by Eric Dumazet
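
    For reference, such a generic netlink family is typically declared along
    these lines (a simplified sketch; everything except the family name is an
    assumption here, not taken from the patch):

        static struct genl_family tcp_metrics_nl_family = {
                .id       = GENL_ID_GENERATE,
                .hdrsize  = 0,
                .name     = "tcp_metrics",
                .version  = 1,
                .maxattr  = TCP_METRICS_ATTR_MAX,
                .netnsok  = true,
        };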

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

04 Sep, 2012

2 commits

  • David S. Miller
     
  • Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
    in addition to Recovery state. Retire the current rate-halving in CWR.
    When losses are detected via ACKs in CWR state, the sender enters Recovery
    state but the cwnd reduction continues and does not restart.

    Rename and refactor cwnd reduction functions since both CWR and Recovery
    use the same algorithm:
    tcp_init_cwnd_reduction() is new and initializes the reduction state variables.
    tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery().
    tcp_ends_cwnd_reduction() is previously tcp_complete_cwr().

    The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
    and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
    parameter, flag, in tcp_cwnd_reduction() is also removed.
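
    For reference, the core of the PRR update run on each ACK during the
    reduction looks roughly like this (paraphrasing the PRR specification;
    the variable names follow the spec, not the kernel implementation):

        prr_delivered += newly_acked;     /* data newly ACKed/SACKed */
        if (pipe > ssthresh)              /* above target: reduce proportionally */
                sndcnt = DIV_ROUND_UP(prr_delivered * ssthresh, RecoverFS)
                         - prr_out;
        else                              /* at/below target: grow carefully */
                sndcnt = min(ssthresh - pipe,
                             max(prr_delivered - prr_out, newly_acked) + mss);
        cwnd = pipe + sndcnt;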

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

03 Sep, 2012

2 commits


01 Sep, 2012

6 commits

  • This patch builds on top of the previous patch to add the support
    for TFO listeners. This includes -

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well
    as a request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler to use the timer directly
    off the TFO socket rather than the listener socket. Note that TFO
    server side will not retransmit anything other than SYN-ACK until
    the 3WHS is completed.

    The patch also contains an important function
    "reqsk_fastopen_remove()" to manage the somewhat complex relation
    between a listener, its request_sock, and the corresponding child
    socket. See the comment above the function for the details.

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • This patch adds all the necessary data structure and support
    functions to implement TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. a new TCP_FASTOPEN socket option an application must call to
    supply the max backlog allowed, in order to enable TFO on its listener
    (see the usage sketch after this list).

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ack processing purposes. It is
    non-NULL iff the 3WHS has not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purposes, i.e., to maintain qlen against max_qlen
    as part of the defense against IP spoofing attacks.

    3. various data structures and functions, many in tcp_fastopen.c, to
    support server side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
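
    A minimal usage sketch for the listener-side socket option described in
    point 1 (userspace side; the fallback define is only needed with older
    headers, and the backlog value is arbitrary):

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>

        #ifndef TCP_FASTOPEN
        #define TCP_FASTOPEN 23
        #endif

        /* enable TCP Fast Open on an already-bound listening socket */
        static int enable_tfo(int listen_fd)
        {
                int qlen = 5;   /* max number of pending TFO requests */

                return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                                  &qlen, sizeof(qlen));
        }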

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • This patch removes bus_id from the mdio platform data. The reason for
    removing bus_id is that the stmmac mdio bus_id is always the same as the
    stmmac bus_id, so there is no point in passing it in a separate variable.
    Also, the stmmac ethernet driver connects to the phy using the bus_id
    passed in its platform data.
    So, having a single bus_id is much simpler.

    Signed-off-by: Srinivas Kandagatla
    Signed-off-by: David S. Miller

    Srinivas Kandagatla
     
  • Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
    the passive open side") changed the initRTO from 3secs to 1sec in
    accordance to RFC6298 (former RFC2988bis). This reduced the time till
    the last SYN retransmission packet gets sent from 93secs to 31secs.

    RFC1122 states that the retransmission should be done for at least 3
    minutes, but this seems to be quite high.

    "However, the values of R1 and R2 may be different for SYN
    and data segments. In particular, R2 for a SYN segment MUST
    be set large enough to provide retransmission of the segment
    for at least 3 minutes. The application can close the
    connection (i.e., give up on the open attempt) sooner, of
    course."

    This patch increases TCP_SYN_RETRIES to 6,
    providing a retransmission window of 63secs.
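
    (For reference: with the 1 second initial RTO and exponential backoff, each
    SYN retransmission is sent 1, 2, 4, 8, 16 and 32 seconds after the previous
    one, so the last of 5 retries goes out 1+2+4+8+16 = 31 seconds after the
    initial SYN, and the last of 6 retries 1+2+4+8+16+32 = 63 seconds after it.)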

    The comments for SYN and SYNACK retries have also been updated to
    describe the current settings. The same goes for the documentation file
    "Documentation/networking/ip-sysctl.txt".

    Signed-off-by: Alexander Bergmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Bergmann
     
  • Merge the 'net' tree to get the recent set of netfilter bug fixes in
    order to assist with some merge hassles Pablo is going to have to deal
    with for upcoming changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • David S. Miller
     

31 Aug, 2012

3 commits

  • Existing code assumes that del_timer returns true for alive conntrack
    entries. However, this is not true if reliable events are enabled.
    In that case, del_timer may return true for entries that were
    just inserted into the dying list. Note that packets / ctnetlink may
    hold references to conntrack entries that were just inserted into that
    list.

    This patch fixes the issue by adding an independent timer for
    event delivery. This increases the size of the ecache extension.
    Still we can revisit this later and use variable size extensions
    to allocate this area on demand.

    Tested-by: Oliver Smith
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Commit c3def943c7117d42caaed3478731ea7c3c87190e added support for the
    new PCI IDs of the 57840 board, but failed to change the obsolete value
    in 'pci_ids.h'.
    This patch does so, allowing such devices to be probed.

    Signed-off-by: Yuval Mintz
    Signed-off-by: Eilon Greenstein
    Signed-off-by: David S. Miller

    Yuval Mintz
     
  • This patch adds dummy functions to of_mdio.h, so that drivers need not
    ifdef their code with CONFIG_OF.
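
    The usual shape of such stubs, using of_mdiobus_register as an example
    (an illustrative sketch; the stub's return value is an assumption):

        #if defined(CONFIG_OF)
        int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np);
        #else
        static inline int of_mdiobus_register(struct mii_bus *mdio,
                                              struct device_node *np)
        {
                /* behave as if OF MDIO support is simply absent */
                return -ENOSYS;
        }
        #endif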

    Signed-off-by: Srinivas Kandagatla
    Signed-off-by: David S. Miller

    Srinivas Kandagatla
     

30 Aug, 2012

8 commits

  • Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Add IPv6 support to the SIP NAT helper. There are no functional differences
    to IPv4 NAT, just different formats for addresses.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header
    checksums for IPv6 NAT.
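
    Presumably this mirrors the existing 4-byte helper; a declaration sketch
    (the exact signature may differ):

        void inet_proto_csum_replace16(__sum16 *sum, struct sk_buff *skb,
                                       const __be32 *from, const __be32 *to,
                                       int pseudohdr);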

    Signed-off-by: Patrick McHardy
    Acked-by: David S. Miller

    Patrick McHardy
     
  • Convert the IPv4 NAT implementation to a protocol independent core and
    address family specific modules.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • For mangling IPv6 packets the protocol header offset needs to be known
    by the NAT packet mangling functions. Add a so far unused protoff argument
    and convert the conntrack and NAT helpers to use it in preparation for
    IPv6 NAT.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • The IPv6 conntrack fragmentation currently has a couple of shortcomings.
    Fragments are collected in PREROUTING/OUTPUT, are defragmented, the
    defragmented packet is then passed to conntrack, the resulting conntrack
    information is attached to each original fragment and the fragments then
    continue their way through the stack.

    Helper invocation occurs in the POSTROUTING hook, at which point only
    the original fragments are available. The result of this is that
    fragmented packets are never passed to helpers.

    This patch improves the situation in the following way:

    - If a reassembled packet belongs to a connection that has a helper
    assigned, the reassembled packet is passed through the stack instead
    of the original fragments.

    - During defragmentation, the largest received fragment size is stored.
    On output, the packet is refragmented if required. If the largest
    received fragment size exceeds the outgoing MTU, a "packet too big"
    message is generated, thus behaving as if the original fragments
    were passed through the stack from an outside point of view.

    - The ipv6_helper() hook function can't receive fragments anymore for
    connections using a helper, so it is switched to use ipv6_skip_exthdr()
    instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the
    reassembled packets are passed to connection tracking helpers.

    The result of this is that we can properly track fragmented packets, but
    still generate ICMPv6 Packet too big messages where we would have before.

    This patch is also required as a precondition for IPv6 NAT, where NAT
    helpers might enlarge packets up to a point that they require
    fragmentation. In that case we can't generate Packet too big messages
    since the proper MTU can't be calculated in all cases (e.g. when
    changing the textual representation of a variable number of addresses),
    so the packet is transparently fragmented iff the original packet or
    fragments would have fit the outgoing MTU.

    IPVS parts by Jesper Dangaard Brouer.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

27 Aug, 2012

1 commit

  • IPv4 conntrack defragments incoming packets at the PRE_ROUTING hook and
    (in the case of forwarded packets) refragments them at POST_ROUTING
    independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
    the local route without caring about the original fragment sizes,
    thereby breaking PMTUD.

    This patch fixes this by keeping track of the largest received fragment
    with IP_DF set and generating an ICMP fragmentation required error during
    refragmentation if that size exceeds the MTU.

    Signed-off-by: Patrick McHardy
    Acked-by: Eric Dumazet
    Acked-by: David S. Miller

    Patrick McHardy
     

25 Aug, 2012

5 commits

  • This is an initial merge in of Eric Biederman's work to start adding
    user namespace support to the networking.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • John W. Linville says:

    ====================
    This is a batch of updates intended for 3.7. The bulk of it is
    mac80211 changes, including some mesh work from Thomas Pederson and
    some multi-channel work from Johannes. A variety of driver updates
    and other bits are scattered in there as well.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • also, remove unused vlan_info definition from header

    CC: Patrick McHardy
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • The operstate of a device is initially IF_OPER_UNKNOWN and is updated
    asynchronously by linkwatch after each change of carrier state
    reported by the driver. The default carrier state of a net device is
    on, and this will never be changed on drivers that do not support
    carrier detection, thus the operstate remains IF_OPER_UNKNOWN.

    For devices that do support carrier detection, the driver must set the
    carrier state to off initially, then poll the hardware state when the
    device is opened. However, we must not activate linkwatch for a
    unregistered device, and commit b473001 ('net: Do not fire linkwatch
    events until the device is registered.') ensured that we don't. But
    this means that the operstate for many devices that support carrier
    detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.

    The same issue exists with the dormant state.

    The proper initialisation sequence, avoiding a race with opening of
    the device, is:

    rtnl_lock();
    rc = register_netdevice(dev);
    if (rc)
            goto out_unlock;
    netif_carrier_off(dev); /* or netif_dormant_on(dev) */
    rtnl_unlock();

    but it seems silly that this should have to be repeated in so many
    drivers. Further, the operstate seen immediately after opening the
    device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
    linkwatch.

    Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
    to fix this by setting the operstate synchronously, but it was
    reverted as it could lead to deadlock.

    This initialises the operstate synchronously at registration time
    only.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • …wireless-next into for-davem

    John W. Linville
     

24 Aug, 2012

1 commit

  • This patch fixes a broken build due to a missing header:
    ...
    CC net/ipv4/proc.o
    In file included from include/net/net_namespace.h:15,
    from net/ipv4/proc.c:35:
    include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type
    ...

    The netns_packet lock was changed by a recent patch from a spinlock to a
    mutex, so the included header needs to be changed from linux/spinlock.h
    to linux/mutex.h as well.

    See commit 0fa7fa98dbcc2789409ed24e885485e645803d7f
    ("packet: Protect packet sk list with mutex (v2)").
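
    With the fix, include/net/netns/packet.h would look roughly like this
    (a sketch):

        #include <linux/rculist.h>
        #include <linux/mutex.h>

        struct netns_packet {
                struct mutex            sklist_lock;
                struct hlist_head       sklist;
        };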

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

23 Aug, 2012

4 commits

  • John W. Linville
     
  • Change since v1:

    * Fixed inuse counters access spotted by Eric

    In patch eea68e2f (packet: Report socket mclist info via diag module) I've
    introduced a "scheduling while atomic" problem in the packet diag module --
    the socket list is traversed under rcu_read_lock(), while the sk mclist
    access performed under it requires the rtnl lock (i.e. a mutex) to be taken.

    [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
    [152363.820573] 4 locks held by crtools/12517:
    [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
    [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
    [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
    [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

    A similar thing was then re-introduced by further packet diag patches (fanout
    mutex and pgvec mutex for rings) :(

    Apart from being terribly sorry for the above, I propose to change the packet
    sk list protection from spinlock to mutex. This lock currently protects two
    modifications:

    * sklist
    * prot inuse counters

    The sklist modifications can simply be reprotected with the mutex since they
    already occur in a sleeping context. The inuse counters modifications are
    trickier -- the __this_cpu_* helpers are used inside, thus requiring the
    caller to handle the potential context issues itself. Since packet sockets'
    counters are modified in only two places (packet_create and packet_release)
    we only need to protect the context
    from being preempted. BH disabling is not required in this case.
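
    A sketch of the resulting pattern in packet_create() (simplified): the list
    update goes under the new mutex, while the per-cpu inuse counter only needs
    preemption disabled.

        mutex_lock(&net->packet.sklist_lock);
        sk_add_node_rcu(sk, &net->packet.sklist);
        mutex_unlock(&net->packet.sklist_lock);

        preempt_disable();
        sock_prot_inuse_add(net, &packet_proto, 1);
        preempt_enable();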

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The helper functions which translate IEEE MDIO Manageable Device (MMD)
    Energy-Efficient Ethernet (EEE) registers 3.20, 7.60 and 7.61 to and from
    the comparable ethtool supported/advertised settings will be needed by
    drivers other than those in PHYLIB (e.g. e1000e in a follow-on patch).

    In the same fashion as similar translation functions in linux/mii.h, move
    these functions from the PHYLIB core to the linux/mdio.h header file so the
    code will not have to be duplicated in each driver needing MMD-to-ethtool
    (and vice-versa) translations. The function and some variable names have
    been renamed to be more descriptive.
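
    The moved helpers end up with prototypes roughly along these lines (a
    sketch; the renamed identifiers may differ in detail):

        u32 mmd_eee_cap_to_ethtool_sup_t(u16 eee_cap);
        u32 mmd_eee_adv_to_ethtool_adv_t(u16 eee_adv);
        u16 ethtool_adv_to_mmd_eee_adv_t(u32 adv);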

    Not tested on the only hardware that currently calls the related functions,
    stmmac, because I don't have access to any. Has been compile tested and
    the translations have been tested on a locally modified version of e1000e.

    Signed-off-by: Bruce Allan
    Cc: Giuseppe Cavallaro
    Signed-off-by: David S. Miller

    Allan, Bruce W
     
  • I noticed an extra one-second delay in device dismantle, tracked down to
    a call to dst_dev_event() while some call_rcu() callbacks are still in RCU
    queues.

    These call_rcu() were posted by rt_free(struct rtable *rt) calls.

    We then wait a little (one second) in netdev_wait_allrefs() before
    kicking NETDEV_UNREGISTER again.

    As the call_rcu() are now completed, dst_dev_event() can do the needed
    device swap on busy dst.

    To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
    after a rcu_barrier(), but outside of RTNL lock.

    Use NETDEV_UNREGISTER_FINAL with care !

    Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

    Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after the
    IP cache removal.
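
    An illustrative sketch of a notifier reacting to the new event (the
    dst_flush_dev() helper here is hypothetical, used only to show the shape
    of the handler):

        static int dst_dev_event(struct notifier_block *this,
                                 unsigned long event, void *ptr)
        {
                struct net_device *dev = ptr;

                switch (event) {
                case NETDEV_UNREGISTER_FINAL:
                case NETDEV_DOWN:
                        /* the rcu_barrier() has run, so pending rt_free()
                         * callbacks are done and busy dsts can be moved off
                         * the dying device safely */
                        dst_flush_dev(dev);     /* hypothetical helper */
                        break;
                }
                return NOTIFY_DONE;
        }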

    With help from Gao feng

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Signed-off-by: David S. Miller

    Eric Dumazet