17 Jun, 2010

9 commits

  • scm_send occasionally allocates state in the scm_cookie, so I have
    modified netlink_sendmsg to guarantee that when scm_send succeeds,
    scm_destroy will be called to free that state (the pattern is sketched
    below).

    Signed-off-by: Eric W. Biederman
    Reviewed-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
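
    A minimal sketch of the resulting send-path pattern (simplified;
    do_the_actual_send() is a hypothetical stand-in for the real netlink
    queueing logic):

        static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
                                   struct msghdr *msg, size_t len)
        {
                struct scm_cookie scm;
                int err;

                err = scm_send(sock, msg, &scm);
                if (err < 0)
                        return err;

                err = do_the_actual_send(sock, msg, len);   /* hypothetical */

                scm_destroy(&scm);      /* frees any state scm_send allocated */
                return err;
        }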
     
  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connections where the processes on either side are in different pid and
    user namespaces (a getsockopt-time sketch follows below).

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
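
    A rough sketch of the getsockopt(SO_PEERCRED) side, assuming the peer
    identity is kept in fields such as sk->sk_peer_pid / sk->sk_peer_cred and
    translated with the cred_to_ucred()-style helper from the next entry
    (the wrapper function here is purely illustrative):

        static int sock_peercred_to_user(struct sock *sk, char __user *optval,
                                         int len)
        {
                struct ucred peercred;

                if (len > sizeof(peercred))
                        len = sizeof(peercred);
                /* translate pid/uid/gid into the *caller's* namespaces */
                cred_to_ucred(sk->sk_peer_pid, sk->sk_peer_cred, &peercred);
                if (copy_to_user(optval, &peercred, len))
                        return -EFAULT;
                return len;
        }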
     
  • To keep the coming code clear, and to allow both the sock code and the
    scm code to share the logic, introduce a function to translate from
    struct cred to struct ucred (sketched below).

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
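
    A sketch of such a helper (field and helper names are approximate;
    user_ns_map_uid/_gid are the namespace-aware translations described in
    the next entry):

        void cred_to_ucred(struct pid *pid, const struct cred *cred,
                           struct ucred *ucred)
        {
                ucred->pid = pid_vnr(pid);      /* pid as seen by the caller */
                ucred->uid = ucred->gid = -1;
                if (cred) {
                        struct user_namespace *ns = current_user_ns();

                        ucred->uid = user_ns_map_uid(ns, cred, cred->euid);
                        ucred->gid = user_ns_map_gid(ns, cred, cred->egid);
                }
        }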
     
  • Define what happens when we view a uid from one user_namespace
    in another user_namespace.

    - If the user namespaces are the same, no mapping is necessary.

    - For most cases of difference, use overflowuid and overflowgid,
    the uid and gid currently used for 16-bit APIs when we have a 32-bit uid
    that does not fit in 16 bits. Effectively the situation is the same:
    we want to return a uid or gid that is not assigned to any user.

    - For the case when we happen to be mapping the uid or gid of the
    creator of the target user namespace, use uid 0 and gid 0, as confusing
    that user with root is not a problem.

    A sketch of the resulting mapping helper follows below.

    Signed-off-by: Eric W. Biederman
    Acked-by: Serge E. Hallyn
    Signed-off-by: David S. Miller

    Eric W. Biederman
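
    A sketch of these rules as a mapping helper (simplified to check only the
    immediate creator of the target namespace; field names are approximate):

        uid_t user_ns_map_uid(struct user_namespace *to,
                              const struct cred *cred, uid_t uid)
        {
                /* same namespace: no translation needed */
                if (to == cred->user->user_ns)
                        return uid;

                /* the creator of the target namespace maps to root there */
                if (cred->user == to->creator)
                        return 0;

                /* no useful relationship: return a uid assigned to nobody */
                return overflowuid;
        }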
     
  • Reorder the fields in scm_cookie so they pack better on 64-bit (see the
    packing illustration below).

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
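
    A generic illustration of this kind of repacking (not the actual
    scm_cookie layout): on 64-bit, grouping members by size removes padding
    holes.

        struct before {         /* 24 bytes: 4 + 4 (pad) + 8 + 4 + 4 (pad) */
                u32     small_a;
                void    *big;
                u32     small_b;
        };

        struct after {          /* 16 bytes: 8 + 4 + 4, no padding */
                void    *big;
                u32     small_a;
                u32     small_b;
        };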
     
  • Changed the driver version number to 5.0.4

    Signed-off-by: Anirban Chakraborty
    Signed-off-by: David S. Miller

    Anirban Chakraborty
     
  • The driver was not properly detecting the presence of the firmware's NIC
    partitioning capability. Now it checks the eswitch bit in the FW
    capabilities register and accordingly sets the driver mode as NPAR
    capable or not.

    Signed-off-by: Anirban Chakraborty
    Signed-off-by: David S. Miller

    Anirban Chakraborty
     
  • Discard the ACK if we find options that do not match the current sysctl
    settings (a sketch of the check follows below).

    Previously it was possible to create a connection with sack, wscale,
    etc. enabled even if the feature was disabled via sysctl.

    Also remove an unneeded call to tcp_sack_reset() in
    cookie_check_timestamp: Both call sites (cookie_v4_check,
    cookie_v6_check) zero "struct tcp_options_received", hand it to
    tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack)
    and then call cookie_check_timestamp().

    Even if num_sacks/dsacks were changed, the structure is allocated on
    the stack and after cookie_check_timestamp returns only a few selected
    members are copied to the inet_request_sock.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
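
    A sketch of the kind of check added (the helper is illustrative; the
    sysctl and option-field names follow the kernel of that era):

        static bool cookie_options_allowed(const struct tcp_options_received *opt)
        {
                if (opt->saw_tstamp && !sysctl_tcp_timestamps)
                        return false;
                if (opt->sack_ok && !sysctl_tcp_sack)
                        return false;
                if (opt->wscale_ok && !sysctl_tcp_window_scaling)
                        return false;
                return true;    /* otherwise the ACK may be accepted */
        }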
     
  • The addition of an rcu_head to struct inet_peer added 16 bytes on 64-bit
    arches.

    That's a bit unfortunate, since the old size was exactly 64 bytes.

    This can be solved by using a union between the rcu_head and the four
    fields that are normally used only when a refcount is taken on an
    inet_peer: the rcu_head is used only when refcnt=-1, right before the
    structure is freed (see the layout sketch below).

    Add an inet_peer_refcheck() function to check this assertion for a while.

    We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
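
    A trimmed sketch of the resulting layout (other members omitted):

        struct inet_peer {
                /* ... AVL pointers, peer address, etc. ... */
                atomic_t        refcnt;
                union {
                        struct {                        /* valid while refcnt >= 0 */
                                atomic_t        rid;            /* frag reception counter */
                                atomic_t        ip_id_count;    /* IP ID for the next packet */
                                __u32           tcp_ts;
                                __u32           tcp_ts_stamp;
                        };
                        struct rcu_head rcu;            /* used once refcnt == -1 */
                };
        };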
     

16 Jun, 2010

29 commits

  • Based upon a report by Stephen Rothwell.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Follow-up to commit aa1039e73cc2 (inetpeer: RCU conversion).

    Unused inet_peer entries have a null refcnt.

    Using atomic_inc_not_zero() in RCU lookups is not going to work for
    them, so the slow path is taken.

    Fix this by using -1 instead of 0 as the marker for deleted entries
    (see the sketch below).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
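
    With -1 reserved for deleted entries, the lockless path can take a
    reference like this (the helper name is illustrative):

        /* returns non-zero if a reference was taken, 0 if the entry is dead */
        static int inet_peer_tryref(struct inet_peer *p)
        {
                /* add 1 unless refcnt is -1, i.e. the entry awaits RCU freeing */
                return atomic_add_unless(&p->refcnt, 1, -1);
        }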
     
  • Now that RCU debugging checks for matching rcu_dereference calls
    and rcu_read_lock, we need to use the correct primitives or face
    nasty warnings (illustrated below).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
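
    A generic illustration of the matching these checks enforce (gp stands
    for some RCU-protected pointer; this is not the bridge code itself):

        rcu_read_lock();
        p = rcu_dereference(gp);        /* OK: matches rcu_read_lock() */
        rcu_read_unlock();

        rcu_read_lock_bh();
        p = rcu_dereference_bh(gp);     /* _bh flavour under rcu_read_lock_bh() */
        rcu_read_unlock_bh();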
     
  • The version of br_netpoll_send_skb used when netpoll is off is
    missing a const thus causing a warning.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • In old kernels, NET_SKB_PAD was defined to 16.

    Then commit d6301d3dd1c2 (net: Increase default NET_SKB_PAD to 32), and
    commit 18e8c134f4e9 (net: Increase NET_SKB_PAD to 64 bytes) increased it
    to 64.

    While the first patch was governed by network stack needs, the second was
    driven more by performance issues on current hardware. The real intent was
    to align data on a cache line boundary.

    So use max(32, L1_CACHE_BYTES) instead of 64, to be more generic (the
    resulting definition is shown below).

    Remove the microblaze and powerpc private NET_SKB_PAD definitions.

    Thanks to Alexander Duyck and David Miller for their comments.

    Suggested-by: David Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
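
    The resulting definition is roughly:

        #ifndef NET_SKB_PAD
        #define NET_SKB_PAD     max(32, L1_CACHE_BYTES)
        #endif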
     
  • Third param (work) is unused, remove it.

    Remove __inline__ and inline qualifiers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of doing one atomic operation per frag, we can factorize them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When syncookies are in effect, req->iif is left uninitialized.
    In case of e.g. link-local addresses the route lookup then fails
    and no syn-ack is sent.

    Rearrange things so ->iif is also initialized in the syncookie case.

    want_cookie can only be true when the isn was zero, thus move the want_cookie
    check into the "!isn" branch.

    Cc: Glenn Griffin
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • ndo_get_stats still returns struct net_device_stats *; there is
    no struct net_device_stats64.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • SKBs hold onto resources that can't be held indefinitely, such as TCP
    socket references and netfilter conntrack state. So if a packet is left
    in TX ring for a long time, there might be a TCP socket that cannot be
    closed and freed up.

    The current Blackfin EMAC driver only reclaims and frees used TX skbs
    during future transfers. The problem is that a future transfer may not
    come soon. This patch starts a timer after a transfer to reclaim and
    free the skbs (sketched below). There is nearly no performance drop
    with this patch.

    TX interrupt is not enabled because of a strange behavior of the Blackfin EMAC.
    If EMAC TX transfer control is turned on, endless TX interrupts are triggered
    no matter if TX DMA is enabled or not. Since DMA walks down the ring automatically,
    TX transfer control can't be turned off in the middle. The only way is to disable
    TX interrupt completely.

    Signed-off-by: Sonic Zhang
    Signed-off-by: David S. Miller

    Sonic Zhang
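
    A sketch of the approach with approximate names (tx_reclaim_skb() stands
    for the driver routine that frees already-transmitted skbs; the timer is
    assumed to be set up once with setup_timer() in probe/open):

        #define TX_RECLAIM_PERIOD (HZ / 5)      /* illustrative interval */

        static void tx_reclaim_timeout(unsigned long data)
        {
                struct bfin_mac_local *lp = (struct bfin_mac_local *)data;

                tx_reclaim_skb(lp);     /* free skbs the DMA is done with */
        }

        /* at the end of ndo_start_xmit(), after handing the frame to DMA: */
        mod_timer(&lp->tx_reclaim_timer, jiffies + TX_RECLAIM_PERIOD);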
     
  • inetpeer currently uses an AVL tree protected by an rwlock.

    It's possible to make most lookups use RCU:

    1) Add a struct rcu_head to struct inet_peer

    2) Add a lookup_rcu_bh() helper to perform a lockless and opportunistic
    lookup. This is a normal function, not a macro like lookup(). A sketch
    of it follows below.

    3) Add a limit to the number of links followed by lookup_rcu_bh(). This
    is needed in case we fall into a loop.

    4) Add an smp_wmb() in link_to_pool() right before the node insert.

    5) Make unlink_from_pool() use atomic_cmpxchg() to make sure it can take
    the last reference to an inet_peer, since lockless readers could increase
    the refcount even while we hold peers.lock.

    6) Delay struct inet_peer freeing until after an RCU grace period so that
    lookup_rcu_bh() cannot crash.

    7) inet_getpeer() first attempts a lockless lookup.
    Note that this lookup can fail even if the target is in the AVL tree,
    because a concurrent writer can leave the tree in a transiently
    inconsistent form. If this attempt fails, the lock is taken and a regular
    lookup is performed again.

    8) Convert peers.lock from an rwlock to a spinlock.

    9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because the
    rcu_head adds 16 bytes on 64-bit arches, doubling the effective size
    (64 -> 128 bytes).
    In a future patch it should be possible to revert this part, by putting
    the rcu field in a union to share space with rid, ip_id_count, tcp_ts &
    tcp_ts_stamp, since these fields are manipulated only with refcnt > 0.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
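
    A condensed sketch of the lockless lookup (peers.root, peer_avl_empty and
    PEER_MAXDEPTH follow the existing inetpeer code; the refcount test already
    uses the -1 marker from the follow-up fix noted earlier in this digest):

        static struct inet_peer *lookup_rcu_bh(__be32 daddr)
        {
                struct inet_peer *u = rcu_dereference_bh(peers.root);
                int count = 0;

                while (u != peer_avl_empty) {
                        if (daddr == u->v4daddr) {
                                /* entry queued for deletion (refcnt == -1)?
                                 * then pretend we did not find it */
                                if (!atomic_add_unless(&u->refcnt, 1, -1))
                                        u = NULL;
                                return u;
                        }
                        if ((__force __u32)daddr < (__force __u32)u->v4daddr)
                                u = rcu_dereference_bh(u->avl_left);
                        else
                                u = rcu_dereference_bh(u->avl_right);
                        /* bound the walk in case a concurrent writer briefly
                         * leaves the tree looping */
                        if (++count == PEER_MAXDEPTH)
                                break;
                }
                return NULL;
        }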
     
  • Fix the code that handles the error case when cnic_cm_abort() cannot
    proceed normally. We cannot just set csk->state; we must go through
    cnic_ready_to_close() to handle all the conditions. We also add an
    error return code to cnic_cm_abort().

    Signed-off-by: Michael Chan
    Signed-off-by: Eddie Wai
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Combine the RESET_RECEIVED and RESET_COMP logic and fix a race condition
    between these 2 events and cnic_cm_close(). In particular, we need to
    test_and_clear_bit(SK_F_OFFLD_COMPLETE, &csk->flags) before we
    update csk->state.

    Signed-off-by: Michael Chan
    Signed-off-by: Eddie Wai
    Signed-off-by: David S. Miller

    Michael Chan
     
  • Move chip-specific code to the respective chip's ->close_conn() functions
    for better code organization.

    Signed-off-by: Michael Chan
    Signed-off-by: Eddie Wai
    Signed-off-by: David S. Miller

    Michael Chan
     
  • So that bnx2i can handle the error condition immediately and not have to
    wait for a timeout.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     
  • This change corrects issues where macvlan was not correctly triggering
    promiscuous mode on ixgbe due to the filters not being correctly set. It
    also corrects the fact that VF rar filters were being overwritten when the
    PF was reset.

    CC: Shirley Ma
    Signed-off-by: Alexander Duyck
    Tested-by: Emil Tantilov
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • David S. Miller
     
  • Unify the TCP flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH,
    TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR (values listed below).
    The TCPCB_FLAG_* macros are replaced with the corresponding TCPHDR_*.

    Signed-off-by: Changli Gao
    ----
    include/net/tcp.h | 24 ++++++-------
    net/ipv4/tcp.c | 8 ++--
    net/ipv4/tcp_input.c | 2 -
    net/ipv4/tcp_output.c | 59 ++++++++++++++++-----------------
    net/netfilter/nf_conntrack_proto_tcp.c | 32 ++++++-----------
    net/netfilter/xt_TCPMSS.c | 4 --
    6 files changed, 58 insertions(+), 71 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
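
    For reference, the unified macros follow the on-the-wire TCP flag bit
    positions:

        #define TCPHDR_FIN 0x01
        #define TCPHDR_SYN 0x02
        #define TCPHDR_RST 0x04
        #define TCPHDR_PSH 0x08
        #define TCPHDR_ACK 0x10
        #define TCPHDR_URG 0x20
        #define TCPHDR_ECE 0x40
        #define TCPHDR_CWR 0x80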
     
  • Register the net_bridge_port pointer as the rx_handler data pointer. As
    br_port is removed from struct net_device, another netdev priv_flag is
    added to indicate that the device serves as a bridge port. RCU-managed
    pointers are now also correctly dereferenced in br_fdb.c and in the
    netfilter parts.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Register the macvlan_port pointer as the rx_handler data pointer. As
    macvlan_port is removed from struct net_device, another netdev priv_flag
    is added to indicate that the device serves as a macvlan port.

    Signed-off-by: Jiri Pirko
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Add the possibility to register an rx_handler data pointer along with an
    rx_handler (usage sketched below).

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
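
    A usage sketch (my_port and my_rx_handler are illustrative; the handler
    prototype and the rx_handler_data field follow the API of that era):

        static struct sk_buff *my_rx_handler(struct sk_buff *skb)
        {
                struct my_port *port = rcu_dereference(skb->dev->rx_handler_data);

                /* ... demux using the per-port state ... */
                return skb;             /* or NULL if the skb was consumed */
        }

        /* registration, called under RTNL: */
        err = netdev_rx_handler_register(dev, my_rx_handler, port);

        /* teardown, also under RTNL: */
        netdev_rx_handler_unregister(dev);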
     
  • There are multiple problems with the newly added netpoll support:

    1) Use-after-free on each netpoll packet.
    2) Invoking unsafe code on netpoll/IRQ path.
    3) Breaks when netpoll is enabled on the underlying device.

    This patch fixes all of these problems. In particular, we now
    allocate proper netpoll structures for each underlying device.

    We only allow netpoll to be enabled on the bridge when all the
    devices underneath it support netpoll. Once it is enabled, we
    do not allow non-netpoll devices to join the bridge (until netpoll
    is disabled again).

    This allows us to do away with the npinfo juggling that caused
    problem number 1.

    Incidentally this patch fixes number 2 by bypassing unsafe code
    such as multicast snooping and netfilter.

    Reported-by: Qianfeng Zhang
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch adds the helper netpoll_tx_running for use within
    ndo_start_xmit. It returns non-zero if ndo_start_xmit is being
    invoked by netpoll, and zero otherwise.

    This is currently implemented by simply looking at the hardirq
    count: for all non-netpoll uses of ndo_start_xmit, IRQs must be
    enabled, while netpoll always disables IRQs before calling
    ndo_start_xmit (a sketch follows below).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
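
    A plausible implementation consistent with that description (hedged; the
    actual helper may differ):

        static inline int netpoll_tx_running(struct net_device *dev)
        {
                /* netpoll invokes ndo_start_xmit with IRQs disabled; every
                 * other caller must have them enabled */
                return irqs_disabled();
        }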
     
  • This patch adds the functions __netpoll_setup/__netpoll_cleanup,
    which are designed to be called recursively through ndo_netpoll_setup
    (a usage sketch follows below).

    They must be called with RTNL held, and the caller must initialise
    np->dev and ensure that it has a valid reference count.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
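
    A usage sketch under those rules (assuming the single-argument form this
    patch introduces; error handling trimmed):

        rtnl_lock();
        np->dev = dev;          /* caller initialises np->dev ...            */
        dev_hold(dev);          /* ... and guarantees a valid reference      */
        err = __netpoll_setup(np);
        if (err)
                dev_put(dev);
        rtnl_unlock();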
     
  • This patch adds ndo_netpoll_setup as the initialisation primitive
    to complement ndo_netpoll_cleanup.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • As it stands, netpoll_setup and netpoll_cleanup have no locking
    protection whatsoever. So chaos ensues if two entities try to
    perform them on the same device.

    This patch adds RTNL to the equation. The code has been rearranged so
    that bits that do not need RTNL protection are now moved to the top of
    netpoll_setup.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The use of RCU in netpoll is incorrect in a number of places:

    1) The initial setting is lacking a write barrier.
    2) The synchronize_rcu is in the wrong place.
    3) Read barriers are missing.
    4) Some places are even missing rcu_read_lock.
    5) npinfo is zeroed after freeing.

    This patch fixes those issues. As most users are in BH context,
    this also converts the RCU usage to the BH variant.

    Signed-off-by: Herbert Xu
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Now that netpoll always zaps npinfo we no longer need to do it
    in bridge.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Since we have to NULL npinfo regardless of whether there is an
    ndo_netpoll_cleanup, it makes sense to do this unconditionally
    in netpoll_cleanup rather than have every driver do it itself.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 Jun, 2010

2 commits

  • Conflicts:
    include/net/netfilter/xt_rateest.h
    net/bridge/br_netfilter.c
    net/netfilter/nf_conntrack_core.c

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • This patch implements an idletimer Xtables target that can be used to
    identify when interfaces have been idle for a certain period of time.

    Timers are identified by labels and are created when a rule is set with a new
    label. The rules also take a timeout value (in seconds) as an option. If
    more than one rule uses the same timer label, the timer will be restarted
    whenever any of the rules get a hit.

    One entry for each timer is created in sysfs. This attribute contains the
    time remaining until the timer expires. The attributes are located under
    the xt_idletimer class:

    /sys/class/xt_idletimer/timers/

    When the timer expires, the target module sends a sysfs notification to
    userspace, which can then decide what to do (e.g. disconnect to save
    power).

    Cc: Timo Teras
    Signed-off-by: Luciano Coelho
    Signed-off-by: Patrick McHardy

    Luciano Coelho