25 Dec, 2012

1 commit

  • Sedat reported the following commit caused a regression:

    commit 9650388b5c56578fdccc79c57a8c82fb92b8e7f1
    Author: Eric Dumazet
    Date: Fri Dec 21 07:32:10 2012 +0000

    ipv4: arp: fix a lockdep splat in arp_solicit

    This is because the 6th parameter of arp_send() needs to be NULL
    for the broadcast case; the above commit changed it to an all-zero
    array by mistake.
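
    A minimal userspace sketch of why NULL and an all-zero array are not
    interchangeable here. It assumes the destination hardware address falls
    back to the device broadcast address only when the pointer is NULL. The
    names sketch_dev, sketch_pick_dest and ETH_ALEN_SKETCH are hypothetical
    and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN_SKETCH 6  /* hypothetical stand-in for the kernel's ETH_ALEN */

/* Hypothetical, simplified device: only the broadcast address matters here. */
struct sketch_dev {
    unsigned char broadcast[ETH_ALEN_SKETCH];
};

/*
 * Pick the link-layer destination for an outgoing ARP frame.
 * NULL means "no known destination", so fall back to broadcast.
 * A non-NULL pointer is used as-is, even if it points at all zeroes.
 */
static void sketch_pick_dest(const struct sketch_dev *dev,
                             const unsigned char *dest_hw,
                             unsigned char out[ETH_ALEN_SKETCH])
{
    assert(dev != NULL && out != NULL);
    if (dest_hw == NULL)
        dest_hw = dev->broadcast;
    memcpy(out, dest_hw, ETH_ALEN_SKETCH);
}
```

    With a NULL dest_hw the frame goes to the broadcast address; with a
    zeroed array it would be addressed to 00:00:00:00:00:00 instead.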

    Reported-by: Sedat Dilek
    Tested-by: Sedat Dilek
    Cc: Sedat Dilek
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Julian Anastasov
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
     

22 Dec, 2012

1 commit

  • Yan Burman reported the following lockdep warning:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.7.0+ #24 Not tainted
    ---------------------------------------------
    swapper/1/0 is trying to acquire lock:
    (&n->lock){++--..}, at: [] __neigh_event_send+0x2e/0x2f0

    but task is already holding lock:
    (&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&n->lock);
    lock(&n->lock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by swapper/1/0:
    #0: (((&n->timer))){+.-...}, at: [] call_timer_fn+0x0/0x1c0
    #1: (&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280
    #2: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0x5d0
    #3: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x13e/0x640

    stack backtrace:
    Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
    Call Trace:
    [] validate_chain+0xdcc/0x11f0
    [] ? __lock_acquire+0x440/0xc30
    [] ? kmem_cache_free+0xe5/0x1c0
    [] __lock_acquire+0x440/0xc30
    [] ? inet_getpeer+0x40/0x600
    [] ? __lock_acquire+0x440/0xc30
    [] ? __neigh_event_send+0x2e/0x2f0
    [] lock_acquire+0x95/0x140
    [] ? __neigh_event_send+0x2e/0x2f0
    [] ? __lock_acquire+0x440/0xc30
    [] _raw_write_lock_bh+0x3b/0x50
    [] ? __neigh_event_send+0x2e/0x2f0
    [] __neigh_event_send+0x2e/0x2f0
    [] neigh_resolve_output+0x16b/0x270
    [] ip_finish_output+0x34d/0x640
    [] ? ip_finish_output+0x13e/0x640
    [] ? vxlan_xmit+0x556/0xbec [vxlan]
    [] ip_output+0x80/0xf0
    [] ip_local_out+0x28/0x80
    [] vxlan_xmit+0x66a/0xbec [vxlan]
    [] ? vxlan_xmit+0x556/0xbec [vxlan]
    [] ? skb_gso_segment+0x2b0/0x2b0
    [] ? _raw_spin_unlock_irqrestore+0x65/0x80
    [] ? dev_queue_xmit_nit+0x207/0x270
    [] dev_hard_start_xmit+0x298/0x5d0
    [] dev_queue_xmit+0x2f3/0x5d0
    [] ? dev_hard_start_xmit+0x5d0/0x5d0
    [] arp_xmit+0x58/0x60
    [] arp_send+0x3b/0x40
    [] arp_solicit+0x204/0x280
    [] ? neigh_add+0x310/0x310
    [] neigh_probe+0x45/0x70
    [] neigh_timer_handler+0x1a0/0x2a0
    [] call_timer_fn+0x7f/0x1c0
    [] ? detach_if_pending+0x120/0x120
    [] run_timer_softirq+0x238/0x2b0
    [] ? neigh_add+0x310/0x310
    [] __do_softirq+0x101/0x280
    [] call_softirq+0x1c/0x30
    [] do_softirq+0x85/0xc0
    [] irq_exit+0x9e/0xc0
    [] smp_apic_timer_interrupt+0x68/0xa0
    [] apic_timer_interrupt+0x6f/0x80
    [] ? mwait_idle+0xa4/0x1c0
    [] ? mwait_idle+0x9b/0x1c0
    [] cpu_idle+0x89/0xe0
    [] start_secondary+0x1b2/0x1b6

    The bug is in arp_solicit(), which releases the neigh lock only after
    arp_send(). In the vxlan case, we eventually need to write-lock a neigh
    lock later.

    It's a false positive, but we can get rid of it without lockdep
    annotations.

    We can instead use neigh_ha_snapshot() helper.

    Reported-by: Yan Burman
    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace, to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
    CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device where
    restrictions make no difference, or the network device is a hardware NIC
    that has been explicitly moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.

    Allow creating raw sockets.
    Allow the SIOCSARP ioctl to control the arp cache.
    Allow the SIOCSIFFLAGS ioctl to allow setting network device flags.
    Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
    Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
    Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
    Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
    Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting gre tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipip tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipsec virtual tunnel interfaces.

    Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
    MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
    sockets.

    Allow setting and receiving IPOPT_CIPSO, IPOPT_SEC, IPOPT_SID and
    arbitrary ip options.

    Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
    Allow setting the IP_TRANSPARENT ipv4 socket option.
    Allow setting the TCP_REPAIR socket option.
    Allow setting the TCP_CONGESTION socket option.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

19 Sep, 2012

1 commit


27 Jul, 2012

1 commit

  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jul, 2012

2 commits

  • In order to allow prefixed routes, we have to adjust how rt_gateway
    is set and interpreted.

    The new interpretation is:

    1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

    2) rt_gateway != 0, destination requires a nexthop gateway

    Abstract the fetching of the proper nexthop value using a new
    inline helper, rt_nexthop(), as suggested by Joe Perches.
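
    A minimal userspace sketch of the interpretation above. It assumes
    addresses are plain 32-bit values, and it uses a hypothetical
    sketch_rtable struct rather than the kernel's struct rtable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the routing entry; only the gateway is modeled. */
struct sketch_rtable {
    uint32_t rt_gateway;  /* 0 means the destination is on-link */
};

/*
 * Return the nexthop for a packet: the gateway if one is set,
 * otherwise the packet's own destination address.
 */
static inline uint32_t sketch_rt_nexthop(const struct sketch_rtable *rt,
                                         uint32_t daddr)
{
    assert(rt != NULL);
    if (rt->rt_gateway)
        return rt->rt_gateway;
    return daddr;
}
```

    Callers then ask for the nexthop instead of reading rt_gateway directly,
    so the on-link case no longer needs a gateway field set to the
    destination address.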

    Signed-off-by: David S. Miller
    Tested-by: Vijay Subramanian

    David S. Miller
     
  • The "noref" argument to ip_route_input_common() is now always ignored
    because we do not cache routes, and in that case we must always grab
    a reference to the resulting 'dst'.

    Signed-off-by: David S. Miller

    David Miller
     

28 Jun, 2012

2 commits

  • This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

    This change has several unwanted side effects:

    1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
    thus never create a real cached route.

    2) All TCP traffic will use DST_NOCACHE and never use the routing
    cache at all.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • DDoS synflood attacks hit the IP route cache badly.

    On typical machines, this cache is allowed to hold up to 8 million dst
    entries, 256 bytes each, for a total of 2GB of memory.

    rt_garbage_collect() triggers and tries to clean things up.

    Eventually the route cache is disabled, but the machine is under fire and
    might OOM and crash.

    This patch exploits the new TCP early demux, to set a nocache
    boolean in case incoming TCP frame is for a not yet ESTABLISHED or
    TIMEWAIT socket.

    This 'nocache' boolean is then used when the dst entry is not found in
    the route cache, to create an unhashed dst entry (DST_NOCACHE).

    SYN-cookie ACKs use a similar mechanism (ipv4: tcp: dont cache
    output dst for syncookies), so after this patch, a machine is able to
    absorb a DDoS synflood attack without polluting its IP route cache.

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jun, 2012

1 commit

  • Routing of 127/8 is traditionally forbidden: we consider
    packets from that address block martian when routing and do
    not process corresponding ARP requests.

    This is a sane default but renders a huge address space
    practically unusable.

    The RFC states that no address within the 127/8 block should
    ever appear on any network anywhere, but it does not forbid
    the use of such addresses outside of the loopback device in
    particular, for example to address a pool of virtual guests
    behind a load balancer.

    This patch adds a new interface option 'route_localnet'
    enabling routing of the 127/8 address block and processing
    of ARP requests on a specific interface.

    Note that for the feature to work, the default local route
    covering 127/8 dev lo needs to be removed.

    Example:
    $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
    $ ip route del 127.0.0.0/8 dev lo table local
    $ ip addr add 127.1.0.1/16 dev eth0
    $ ip route flush cache

    V2: Fix invalid check to auto flush cache (thanks davem)

    Signed-off-by: Thomas Graf
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Thomas Graf
     

16 May, 2012

3 commits


16 Apr, 2012

1 commit


29 Mar, 2012

1 commit


17 Mar, 2012

1 commit

  • I found recently that the arp_process function, which handles all of our
    received arp frames, is using the IPV4_DEVCONF_ALL macro to check the state
    of the arp_accept flag. This seems wrong, as it implies that either none or
    all of the network interfaces accept gratuitous arps. This patch corrects
    that, allowing per-interface arp_accept configuration to deviate from the
    all setting. Note this also brings us into line with the way the arp_filter
    setting is handled during arp_process execution.

    Tested this myself on my home network, and confirmed it works as expected.

    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

11 Feb, 2012

1 commit

  • Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed
    the behavior of arp proxy to send arp replies back out on the interface
    the request came in on, even if the private VLAN feature is disabled.

    Previously we checked rt->dst.dev != skb->dev in both scenarios: when
    proxy arp is enabled for the netdevice, and also when individual proxy
    neighbour entries have been added.

    This patch adds the check back for the pneigh_lookup() scenario.

    Signed-off-by: Thomas Graf
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Thomas Graf
     

29 Dec, 2011

1 commit


06 Dec, 2011

1 commit


01 Dec, 2011

2 commits


19 Nov, 2011

1 commit

  • ipv4: Remove all uses of LL_ALLOCATED_SPACE

    The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
    alignment to the sum of needed_headroom and needed_tailroom. As
    the amount that is then reserved for head room is needed_headroom
    with alignment, this means that the tail room left may be too small.

    This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv4
    with the macro LL_RESERVED_SPACE and direct reference to
    needed_tailroom.

    This also fixes the problem with needed_headroom changing between
    allocating the skb and reserving the head room.
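
    The shortfall can be checked with a little arithmetic. The sketch below
    assumes both macros round down to a 16-byte boundary (HH_DATA_MOD) and
    then add 16, which is how they are believed to have been defined at the
    time. The sketch_* helper names are hypothetical.

```c
#include <assert.h>

#define SKETCH_HH_DATA_MOD 16u  /* assumed alignment unit */

/* Assumed shape of LL_RESERVED_SPACE: aligned head room only. */
static unsigned int sketch_ll_reserved_space(unsigned int hard_header_len,
                                             unsigned int needed_headroom)
{
    return ((hard_header_len + needed_headroom) &
            ~(SKETCH_HH_DATA_MOD - 1)) + SKETCH_HH_DATA_MOD;
}

/* Assumed shape of LL_ALLOCATED_SPACE: alignment applied to head + tail sum. */
static unsigned int sketch_ll_allocated_space(unsigned int hard_header_len,
                                              unsigned int needed_headroom,
                                              unsigned int needed_tailroom)
{
    return ((hard_header_len + needed_headroom + needed_tailroom) &
            ~(SKETCH_HH_DATA_MOD - 1)) + SKETCH_HH_DATA_MOD;
}

/* Tail room left when the old macro sizes the link-layer part of the buffer. */
static unsigned int sketch_tail_left_old(unsigned int hard_header_len,
                                         unsigned int needed_headroom,
                                         unsigned int needed_tailroom)
{
    unsigned int alloc = sketch_ll_allocated_space(hard_header_len,
                                                   needed_headroom,
                                                   needed_tailroom);
    unsigned int reserved = sketch_ll_reserved_space(hard_header_len,
                                                     needed_headroom);
    assert(alloc >= reserved);
    return alloc - reserved;
}

/* Replacement sizing: aligned head room plus needed_tailroom taken directly. */
static unsigned int sketch_alloc_new(unsigned int hard_header_len,
                                     unsigned int needed_headroom,
                                     unsigned int needed_tailroom)
{
    return sketch_ll_reserved_space(hard_header_len, needed_headroom) +
           needed_tailroom;
}
```

    Under these assumptions, a device with hard_header_len 16, no extra head
    room and needed_tailroom 4 gets 32 bytes from the old macro, all of which
    are reserved as head room, so no tail room is left. The replacement
    sizing gives 36 bytes and keeps the 4 bytes of tail room.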

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

14 Nov, 2011

1 commit

  • On Wednesday 09 November 2011 at 16:21 -0500, David Miller wrote:
    > From: David Miller
    > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
    >
    > > From: Eric Dumazet
    > > Date: Wed, 09 Nov 2011 12:14:09 +0100
    > >
    > >> unres_qlen is the number of frames we are able to queue per unresolved
    > >> neighbour. Its default value (3) was never changed and is responsible
    > >> for strange drops, especially if IP fragments are used, or multiple
    > >> sessions start in parallel. Even a single tcp flow can hit this limit.
    > > ...
    > >
    > > Ok, I've applied this, let's see what happens :-)
    >
    > Early answer, build fails.
    >
    > Please test build this patch with DECNET enabled and resubmit. The
    > decnet neigh layer still refers to the removed ->queue_len member.
    >
    > Thanks.

    Ouch, this was fixed on one machine yesterday, but not the other one I
    used this morning, sorry.

    [PATCH V5 net-next] neigh: new unresolved queue limits

    unres_qlen is the number of frames we are able to queue per unresolved
    neighbour. Its default value (3) was never changed and is responsible
    for strange drops, especially if IP fragments are used, or multiple
    sessions start in parallel. Even a single tcp flow can hit this limit.

    $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
    PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
    8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms

    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Jul, 2011

1 commit


17 Jul, 2011

2 commits


13 Jul, 2011

1 commit

  • Get rid of all of the useless and costly indirection
    by doing the neigh hash table lookup directly inside
    of the neighbour binding.

    Rename from arp_bind_neighbour to rt_bind_neighbour.

    Use new helpers {__,}ipv4_neigh_lookup()

    In rt_bind_neighbour() get rid of useless tests which
    are never true in the context this function is called,
    namely dev is never NULL and the dst->neighbour is
    always NULL.

    Signed-off-by: David S. Miller

    David Miller
     

11 Jul, 2011

1 commit


30 Mar, 2011

1 commit

  • My commit 6d55cb91a0020ac0 (gre: fix hard header destination
    address checking) broke multicast.

    The reason is that ip_gre used to get ipgre_header() calls with
    zero destination if we have NOARP or multicast destination. Instead
    the actual target was decided at ipgre_tunnel_xmit() time based on
    per-protocol dissection.

    Instead of allowing the "abuse" of ->header() calls with invalid
    destination, this creates multicast mappings for ip_gre. This also
    fixes "ip neigh show nud noarp" to display the proper multicast
    mappings used by the gre device.

    Reported-by: Doug Kehn
    Signed-off-by: Timo Teräs
    Acked-by: Doug Kehn
    Signed-off-by: David S. Miller

    Timo Teräs
     

13 Mar, 2011

1 commit


03 Mar, 2011

1 commit


25 Jan, 2011

1 commit

  • Commit 941666c2e3e0 "net: RCU conversion of dev_getbyhwaddr() and
    arp_ioctl()" introduced a regression, reported by Jamie Heilman.
    "arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
    in pneigh_lookup()

    Removing RTNL requirement from arp_ioctl() was a mistake, just revert
    that part.

    Reported-by: Jamie Heilman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jan, 2011

1 commit

  • IPv4 over firewire needs to be able to remove ARP entries
    from the ARP cache that belong to nodes that are removed, because
    IPv4 over firewire uses ARP packets for private information
    about nodes.

    This information becomes invalid as soon as a node drops
    off the bus and, when it reconnects, it's only possible
    to start talking to it after it has responded to an ARP packet.
    But the ARP cache prevents such packets from being sent.

    Signed-off-by: Maxim Levitsky
    Signed-off-by: David S. Miller

    Maxim Levitsky
     

09 Dec, 2010

1 commit

  • On Sunday 05 December 2010 at 09:19 +0100, Eric Dumazet wrote:

    > Hmm..
    >
    > If somebody can explain why RTNL is held in arp_ioctl() (and therefore
    > in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
    > that your patch can be applied.
    >
    > Right now it is not good, because RTNL won't necessarily be held when you
    > are going to call arp_invalidate() ?

    While doing this analysis, I found a refcount bug in llc, I'll send a
    patch for net-2.6

    Meanwhile, here is the patch for net-next-2.6

    Your patch then can be applied after mine.

    Thanks

    [PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()

    dev_getbyhwaddr() was called under RTNL.

    Rename it to dev_getbyhwaddr_rcu() and change all its callers to now use
    RCU locking instead of RTNL.

    Change arp_ioctl() to use RCU instead of RTNL locking.

    Note: this fixes a dev refcount bug in llc

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Dec, 2010

1 commit

  • Only when dont_send is 0, arp_filter() is consulted, so we can simply
    assign the return value of arp_filter() to dont_send instead.
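
    A small sketch of the equivalence, assuming the filter only ever returns
    0 or 1. The two functions model the conditional before and after the
    change. Their names are hypothetical, and filter_result stands in for
    the return value of arp_filter().

```c
#include <assert.h>

/* Old shape: test the flag, then set it to 1 if the filter rejects. */
static int sketch_dont_send_before(int dont_send, int filter_result)
{
    assert(filter_result == 0 || filter_result == 1);
    if (!dont_send && filter_result)
        dont_send = 1;
    return dont_send;
}

/* New shape: when the flag is still 0, take the filter result directly. */
static int sketch_dont_send_after(int dont_send, int filter_result)
{
    assert(filter_result == 0 || filter_result == 1);
    if (!dont_send)
        dont_send = filter_result;
    return dont_send;
}
```

    Both forms agree for every combination of inputs. An already-set
    dont_send is left untouched in either case.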

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller

    Changli Gao
     

18 Nov, 2010

1 commit


12 Oct, 2010

1 commit

  • Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
    dirtying neighbour in stress situation (many different flows / dsts)

    Dirtying takes place because of read_lock(&n->lock) and n->used writes.

    Switching to a seqlock, and writing n->used only on jiffies changes
    permits less dirtying.
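
    The reader side of such a scheme can be sketched in userspace as follows.
    This is a simplified sequence-counter illustration using C11 atomics, not
    the kernel's seqlock API. It assumes writers are serialized elsewhere and
    glosses over the memory-ordering and data-race details a production
    implementation has to handle. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_HA_LEN 6

/* Hypothetical neighbour entry: a sequence counter guarding the address. */
struct sketch_neigh {
    atomic_uint seq;                    /* even: stable, odd: write in progress */
    unsigned char ha[SKETCH_HA_LEN];
};

/* Writer: bump the counter to odd, update, bump it back to even. */
static void sketch_ha_write(struct sketch_neigh *n, const unsigned char *ha)
{
    assert(n != NULL && ha != NULL);
    atomic_fetch_add(&n->seq, 1);
    memcpy(n->ha, ha, SKETCH_HA_LEN);
    atomic_fetch_add(&n->seq, 1);
}

/*
 * Reader: copy the address without writing to the shared entry, and
 * retry if a writer was active or the counter moved during the copy.
 */
static void sketch_ha_snapshot(struct sketch_neigh *n, unsigned char *dst)
{
    unsigned int start;

    assert(n != NULL && dst != NULL);
    do {
        start = atomic_load(&n->seq);
        memcpy(dst, n->ha, SKETCH_HA_LEN);
    } while ((start & 1u) || atomic_load(&n->seq) != start);
}
```

    The point of the design is that readers only load from the entry, so
    many flows reading the same neighbour do not keep dirtying its cache
    line the way read_lock() does.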

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Oct, 2010

1 commit

  • David

    This is the first step for RCU conversion of neigh code.

    Next patches will convert hash_buckets[] and "struct neighbour" to RCU
    protected objects.

    Thanks

    [PATCH net-next] net neigh: RCU conversion of neigh hash table

    Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
    neigh_table", a new structure is defined :

    struct neigh_hash_table {
            struct neighbour **hash_buckets;
            unsigned int hash_mask;
            __u32 hash_rnd;
            struct rcu_head rcu;
    };

    And "struct neigh_table" has an RCU protected pointer to such a
    neigh_hash_table.

    This means the signature of the (*hash)() function changed: we need to add a
    third parameter with the actual hash_rnd value, since this is no
    longer a neigh_table field.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2010

1 commit


24 Sep, 2010

1 commit