01 Aug, 2012

6 commits

  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same PF_MEMALLOC reserve logic.

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. On diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the form
    factor or maintenance costs do not allow the use of disks, and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PF_MEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.
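
    As a rough sketch (paraphrased, not the verbatim kernel code), the
    helper has this shape:

        /* Return the address_space an IO against this page should
         * target: the backing swap file for swap cache pages,
         * otherwise the page's own mapping. __page_file_mapping() is
         * the out-of-line helper that resolves the swap file's mapping.
         */
        static inline struct address_space *page_file_mapping(struct page *page)
        {
                if (unlikely(PageSwapCache(page)))
                        return __page_file_mapping(page);
                return page->mapping;
        }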

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.
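
    As a sketch, the hooks have roughly this shape in the merged series
    (signatures paraphrased; consult include/linux/fs.h of that era for
    the authoritative form):

        struct address_space_operations {
                /* ... existing operations ... */

                /* swapfile support: pin swapfile metadata on activation */
                int (*swap_activate)(struct swap_info_struct *sis,
                                     struct file *file, sector_t *span);
                void (*swap_deactivate)(struct file *file);
        };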

    Patch 5 notes that patch 3 is bolting
    filesystem-specific-swapfile-support onto the side and that
    the default handlers have different information from what
    is available to the filesystem. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.
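
    The helper that landed for this is get_kernel_pages(); a paraphrased
    declaration:

        /* Translate a kvec of kernel addresses into pinned struct
         * pages suitable for IO; returns the number of pages pinned.
         */
        int get_kernel_pages(const struct kvec *kiov, int nr_segs,
                             int write, struct page **pages);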

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device is
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens are reclaimed to
    avoid accounting errors until the bug is fixed.
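
    The exemption itself is small; paraphrased from the patch, the
    receive-side accounting check becomes:

        static inline bool
        sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
        {
                if (!sk_has_account(sk))
                        return true;
                /* pfmemalloc skbs may overshoot the rmem limit so that
                 * SOCK_MEMALLOC sockets keep making progress while the
                 * system is reclaiming memory.
                 */
                return size <= sk->sk_forward_alloc ||
                       __sk_mem_schedule(sk, size, SK_MEM_RECV) ||
                       skb_pfmemalloc(skb);
        }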

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to make sure pfmemalloc packets receive all memory needed to
    proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
    This is limited to a subset of protocols that are expected to be used for
    writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
    expected to communicate with userspace processes which could be paged out.
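
    Schematically, the receive path wraps the processing of such skbs
    like this (paraphrased; handle_skb() is a hypothetical stand-in for
    the real per-skb processing function):

        if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
                unsigned long pflags = current->flags;

                /* Let any allocations triggered while processing this
                 * reclaim-related skb dip into the reserves.
                 */
                current->flags |= PF_MEMALLOC;
                ret = handle_skb(skb);
                tsk_restore_flags(current, pflags, PF_MEMALLOC);
        } else {
                ret = handle_skb(skb);
        }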

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [jslaby@suse.cz: Lock imbalance fix]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.
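
    The drop is a two-line check early in the socket input path;
    paraphrased from the series, sk_filter() gains:

        /* An skb allocated from the emergency reserves is only allowed
         * through to sockets that are themselves helping to free
         * memory (SOCK_MEMALLOC); for everyone else it is dropped.
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;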

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
    for their allocations. These sockets will be able to go below watermarks
    and allocate from the emergency reserve. Such sockets are to be used to
    service the VM (iow. to swap over). They must be handled kernel side;
    exposing such a socket to user-space is a bug.

    There is a risk that the reserves will be depleted, so for now the
    administrator is responsible for increasing min_free_kbytes as necessary
    to prevent deadlock for their workloads.
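
    Paraphrased from the patch, the tagging helper looks roughly like:

        int sk_set_memalloc(struct sock *sk)
        {
                sock_set_flag(sk, SOCK_MEMALLOC);     /* mark the socket */
                sk->sk_allocation |= __GFP_MEMALLOC;  /* allow reserve use */
                static_key_slow_inc(&memalloc_socks);
                return 0;
        }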

    [a.p.zijlstra@chello.nl: Original patches]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce sk_gfp_atomic(), a function that allows sock-specific flags
    to be injected into each sock-related allocation. It is only used on
    allocation paths that may be required for writing pages back to network
    storage.
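
    Paraphrased, the helper simply folds the socket's memalloc bit into
    an otherwise ordinary atomic allocation mask:

        static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask)
        {
                /* GFP_ATOMIC, plus __GFP_MEMALLOC if this socket has
                 * been flagged for emergency-reserve use.
                 */
                return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC);
        }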

    [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Jul, 2012

2 commits

  • These are the missing IPv6 bits for the infrastructure added in commit
    41063e9dd1195 (ipv4: Early TCP socket demux).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

25 Jul, 2012

1 commit

  • Pull networking changes from David S Miller:

    1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache. Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making. Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought to be simple to fix at this
    point. Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

    2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

    3) TCP SYN/ACK performance tweaks from Eric Dumazet.

    4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

    5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

    6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

    7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

    8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

    9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

    10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer. Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

    11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

    12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

    13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data; see the sketch after this list.

    14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket. The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too. From Eric
    Dumazet.

    15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.
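
    As a rough userspace illustration of item 13 (a hypothetical snippet;
    names and error handling are illustrative, and the client side must be
    enabled via the net.ipv4.tcp_fastopen sysctl), MSG_FASTOPEN lets the
    first write double as the connect, placing the data in the SYN:

        #include <stddef.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <unistd.h>

        int tfo_send(const struct sockaddr_in *srv, const char *buf, size_t len)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                if (fd < 0)
                        return -1;
                /* connect() + first write in a single call; falls back
                 * to a regular handshake when no TFO cookie is cached.
                 */
                if (sendto(fd, buf, len, MSG_FASTOPEN,
                           (const struct sockaddr *)srv, sizeof(*srv)) < 0) {
                        close(fd);
                        return -1;
                }
                return fd;
        }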

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
    genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
    r8169: revert "add byte queue limit support".
    ipv4: Change rt->rt_iif encoding.
    net: Make skb->skb_iif always track skb->dev
    ipv4: Prepare for change of rt->rt_iif encoding.
    ipv4: Remove all RTCF_DIRECTSRC handliing.
    ipv4: Really ignore ICMP address requests/replies.
    decnet: Don't set RTCF_DIRECTSRC.
    net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
    ipv4: Remove redundant assignment
    rds: set correct msg_namelen
    openvswitch: potential NULL deref in sample()
    tcp: dont drop MTU reduction indications
    bnx2x: Add new 57840 device IDs
    tcp: avoid oops in tcp_metrics and reset tcpm_stamp
    niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
    niu: Fix to check for dma mapping errors.
    net: Fix references to out-of-scope variables in put_cmsg_compat()
    net: ethernet: davinci_emac: add pm_runtime support
    net: ethernet: davinci_emac: Remove unnecessary #include
    ...

    Linus Torvalds
     

24 Jul, 2012

3 commits

  • On input packet processing, rt->rt_iif will be zero if we should
    use skb->dev->ifindex.

    Since we access rt->rt_iif consistently via inet_iif(), that is
    the only spot whose interpretation has to be adjusted.
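
    Paraphrased, the adjusted helper reads:

        static inline int inet_iif(const struct sk_buff *skb)
        {
                int iif = skb_rtable(skb)->rt_iif;

                /* zero means "use the device the skb arrived on" */
                return iif ? iif : skb->skb_iif;
        }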

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the
    ipv4-specific code, and rightly so, since route caching for ipv6 is
    pointless at the moment: it is not inspected in the ipv6 input paths
    yet.

    Also, remove the unlikely on dst->obsolete; all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in the making, but with
    Miklos getting through a really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking a preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."
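
    For reference, the new hook's shape is roughly the following
    (paraphrased; fs/namei.c:atomic_open() documents the authoritative
    contract):

        /* Returns <0 on error, 0 on success, and 1 for the "deal with
         * it yourself" case, where the caller finishes the open.
         */
        int (*atomic_open)(struct inode *dir, struct dentry *dentry,
                           struct file *file, unsigned open_flag,
                           umode_t create_mode, int *opened);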

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

23 Jul, 2012

5 commits

  • ICMP messages generated in the output path when the frame length is
    bigger than the MTU are actually lost because the socket is owned by
    the user (doing the xmit).

    One example is the ipgre_tunnel_xmit() calling
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

    We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on
    sockets hitting dst_allfrag).

    The problem with that fix is that it relied on retransmit timers, so
    short tcp sessions paid too high a latency price.

    This patch uses the tcp_release_cb() infrastructure so that MTU
    reduction messages (ICMP messages) are not lost, and no extra delay
    is added in TCP transmits.
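
    Paraphrased, the ICMP handler now defers rather than drops when the
    socket is owned:

        tp->mtu_info = info;
        if (!sock_owned_by_user(sk)) {
                tcp_v4_mtu_reduced(sk);
        } else {
                /* remember the event; tcp_release_cb() will run the
                 * handler when the owner releases the socket lock
                 */
                if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
                                      &tp->tsq_flags))
                        sock_hold(sk);
        }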

    Reported-by: Maciej Żenczykowski
    Diagnosed-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Cc: Nandita Dukkipati
    Cc: Tom Herbert
    Cc: Tore Anderson
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The ipv4 routing cache is non-deterministic, performance wise, and is
    subject to reasonably easy to launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables. The former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high volume sites can see
    hit rates in the routing cache of only ~10%.

    The general flow of this patch series is that first the routing cache
    is removed. We build a completely new rtable entry every lookup
    request.

    Next we make some simplifications due to the fact that removing the
    routing cache causes several members of struct rtable to become no
    longer necessary.

    Then we need to make some amends such that we can legally cache
    pre-constructed routes in the FIB nexthops. Firstly, we need to
    invalidate routes which are hit with nexthop exceptions. Secondly we
    have to change the semantics of rt->rt_gateway such that zero means
    that the destination is on-link and non-zero otherwise.

    Now that the preparations are ready, we start caching precomputed
    routes in the FIB nexthops. Output and input routes need different
    kinds of care when determining if we can legally do such caching or
    not. The details are in the commit log messages for those changes.

    The patch series then winds down with some more struct rtable
    simplifications and other tidy ups that remove unnecessary overhead.

    On a SPARC-T3 output route lookups are ~876 cycles. Input route
    lookups are ~1169 cycles with rpfilter disabled, and about ~1468
    cycles with rpfilter enabled.

    These measurements were taken with the kbench_mod test module in the
    net_test_tools GIT tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

    That GIT tree also includes a udpflood tester tool and stresses
    route lookups on packet output.

    For example, on the same SPARC-T3 system we can run:

    time ./udpflood -l 10000000 10.2.2.11

    with routing cache:
    real 1m21.955s user 0m6.530s sys 1m15.390s

    without routing cache:
    real 1m31.678s user 0m6.520s sys 1m25.140s

    Performance undoubtedly can easily be improved further.

    For example fib_table_lookup() performs a lot of excessive
    computations with all the masking and shifting, some of it
    conditionalized to deal with edge cases.

    Also, Eric's no-ref optimization for input route lookups can be
    re-instated for the FIB nexthop caching code path. I would be really
    pleased if someone would work on that.

    In fact anyone suitably motivated can just fire up perf on the loading
    of the test net_test_tools benchmark kernel module. I spend much of
    my time going:

    bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
    bash# perf report

    Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
    Hutchings, and others.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • recursion in __scm_destroy() will be cut by delaying final fput()

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example, in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set, but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value, which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • I've seen several attempts recently made to do quick failover of sctp transports
    by reducing various retransmit timers and counters. While it's possible to
    implement a faster failover on multihomed sctp associations, it's not
    particularly robust, in that it can lead to unneeded retransmits, as well as
    false connection failures due to intermittent latency on a network.

    Instead, let's implement the new IETF quick failover draft found here:
    http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

    This will let the sctp stack identify transports that have had a small number of
    errors and quickly avoid using them until their reliability can be
    re-established. I've tested this out on two virt guests connected via multiple
    isolated virt networks and believe it's in compliance with the above draft and
    works well.

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: Sridhar Samudrala
    CC: "David S. Miller"
    CC: linux-sctp@vger.kernel.org
    CC: joe@perches.com
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

21 Jul, 2012

21 commits

  • We were using a special key "0" for all loopback and point-to-point
    device neigh lookups under ipv4, but we wouldn't use that special
    key for the neigh creation.

    So basically we'd make a new neigh at each and every lookup :-)

    This special case to use only one neigh for these device types
    is of dubious value, so just remove it entirely.

    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     
  • It's not really needed.

    We only grabbed a reference to the fib_info for the sake of fib_info
    local metrics.

    However, fib_info objects are freed using RCU, as are therefore their
    private metrics (if any).

    We would have triggered a route cache flush if we eliminated a
    reference to a fib_info object in the routing tables.

    Therefore, any existing cached routes will first check and see that
    they have been invalidated before an errant reference to these
    metric values would occur.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • That is this value's only use, as a boolean to indicate whether
    a route is an input route or not.

    So implement it that way, using a u16 gap present in the struct
    already.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Never actually used.

    It was being set on output routes to the original OIF specified in the
    flow key used for the lookup.

    Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of
    the flowi4_oif and flowi4_iif values, thanks to feedback from Julian
    Anastasov.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Caching input routes is slightly simpler than output routes, since we
    don't need to be concerned with nexthop exceptions (locally destined
    and routed packets never trigger PMTU events or redirects that will
    be processed by us).

    However, we have to elide caching for the DIRECTSRC and non-zero itag
    cases.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If we have an output route that lacks nexthop exceptions, we can cache
    it in the FIB info nexthop.

    Such routes will have DST_HOST cleared because such routes refer to a
    family of destinations, rather than just one.

    The sequence of the handling of exceptions during route lookup is
    adjusted to make the logic work properly.

    Before we allocate the route, we lookup the exception.

    Then we know if we will cache this route or not, and therefore whether
    DST_HOST should be set on the allocated route.

    Then we use DST_HOST to key off whether we should store the resulting
    route, during rt_set_nexthop(), in the FIB nexthop cache.

    With help from Eric Dumazet.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Mark them obsolete so there will be a re-lookup to fetch the
    FIB nexthop exception info.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add a big comment explaining how the field works, and use defines
    instead of magic constants for the values assigned to it.

    Suggested by Joe Perches.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In order to allow prefixed routes, we have to adjust how rt_gateway
    is set and interpreted.

    The new interpretation is:

    1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

    2) rt_gateway != 0, destination requires a nexthop gateway

    Abstract the fetching of the proper nexthop value using a new
    inline helper, rt_nexthop(), as suggested by Joe Perches.
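
    Paraphrased, the helper embodies the new semantics directly:

        static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
        {
                if (rt->rt_gateway)
                        return rt->rt_gateway;  /* via a gateway */
                return daddr;                   /* on-link */
        }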

    Signed-off-by: David S. Miller
    Tested-by: Vijay Subramanian

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David Miller
     
  • Signed-off-by: David S. Miller

    David Miller
     
  • They are always used in contexts where they can be reconstituted,
    or where the finally resolved rt->rt_{src,dst} is semantically
    equivalent.

    Signed-off-by: David S. Miller

    David Miller
     
  • The "noref" argument to ip_route_input_common() is now always ignored
    because we do not cache routes, and in that case we must always grab
    a reference to the resulting 'dst'.

    Signed-off-by: David S. Miller

    David Miller
     
  • The ipv4 routing cache is non-deterministic, performance wise, and is
    subject to reasonably easy to launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables. The former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high volume sites can see
    hit rates in the routing cache of only ~10%.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • As this is going to be used not only by bonding.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Also cut out unused function parameters and possible err in return
    value.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
    The modern TCP stack depends heavily on tcp_write_timer() having a small
    latency, but the current implementation doesn't quite meet that
    expectation.

    When a timer fires but finds the socket is owned by the user, it rearms
    itself for an additional delay, hoping the next run will be more
    successful.

    tcp_write_timer(), for example, uses a 50ms delay for the next try, and
    that defeats many attempts to get predictable TCP behavior in terms of
    latency.

    Use the recently introduced tcp_release_cb(), so that the user owning
    the socket will call various handlers right before socket release.

    This will permit us to post a followup patch to address the
    tcp_tso_should_defer() syndrome (some deferred packets have to wait
    for the RTO timer to be transmitted, while cwnd should allow us to
    send them sooner).
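
    Paraphrased, the timer now defers instead of rearming when it finds
    the socket owned:

        bh_lock_sock(sk);
        if (!sock_owned_by_user(sk)) {
                tcp_write_timer_handler(sk);
        } else {
                /* delegate the work to tcp_release_cb(), which runs
                 * as soon as the owner releases the socket lock
                 */
                if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED,
                                      &tcp_sk(sk)->tsq_flags))
                        sock_hold(sk);
        }
        bh_unlock_sock(sk);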

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Nandita Dukkipati
    Cc: H.K. Jerry Chu
    Cc: John Heffner
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not
    guaranteed to be a power of two.

    Use hash_32() instead of a custom hash.

    tcpmhash_entries should be an unsigned int.
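
    Illustrative only (key is a hypothetical input): masking a hash into
    the table requires a power-of-two size, hence the roundup; hash_32()
    then mixes the key down to log2(size) bits:

        unsigned int size  = roundup_pow_of_two(tcpmhash_entries);
        unsigned int shift = ilog2(size);         /* log2(size) bits */
        unsigned int slot  = hash_32(key, shift); /* slot in 0..size-1 */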

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • …wireless-next into for-davem

    John W. Linville
     

20 Jul, 2012

2 commits