20 Aug, 2012

2 commits

  • Fix kernel-doc warning:

    Warning(net/core/dev.c:5745): No description found for parameter 'dev'

    Signed-off-by: Randy Dunlap
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Randy Dunlap
     
  • If a packet is emitted on one socket in one group of fanout sockets,
    it is transmitted again. It is thus read again on one of the sockets
of the fanout group. This results in a loop for software that
    generates packets when receiving one.
    This retransmission is not the intended behavior: a fanout group
    must behave like a single socket. The packet should not be
    transmitted on a socket if it originates from a socket belonging
    to the same fanout group.

    This patch fixes the issue by changing the transmission check to
    take the fanout group into account, as sketched below.

    Reported-by: Aleksandr Kotov
    Signed-off-by: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Leblond
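
    A minimal sketch of such a check on the transmit-to-taps path (the
    shape follows the af_packet fanout code; treat the exact names and
    signatures as assumptions):

        static bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
        {
                /* not a packet socket, or skb has no originating socket */
                if (!ptype->af_packet_priv || !skb->sk)
                        return false;

                /* fanout sockets supply an id_match() callback that returns
                 * true when skb->sk belongs to the same fanout group */
                if (ptype->id_match)
                        return ptype->id_match(ptype, skb->sk);

                return (struct sock *)ptype->af_packet_priv == skb->sk;
        }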
     

17 Aug, 2012

3 commits

  • A race exists where creating cgroups and also updating the priomap
    may result in losing a priomap update. This is because priomap
    writers are not protected by rtnl_lock.

    Move priority writer into rtnl_lock()/rtnl_unlock().

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
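
    A sketch of the locking change described above (write_priomap() stands
    in for the cgroup write handler; the body is an assumption):

        static int write_priomap(/* cgroup, device, new priority ... */)
        {
                rtnl_lock();    /* serialize with other priomap writers */
                /* ... update the per-device priomap ... */
                rtnl_unlock();
                return 0;
        }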
     
  • A socket fd passed in a SCM_RIGHTS datagram was not getting
    updated with the new task's cgrp prioidx. This leaves IO on
    the socket tagged with the old task's priority.

    To fix this, add a check in the scm recvmsg path to update the
    sock cgrp prioidx with the new task's value.

    Thanks to Al Viro for catching this.

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Add lock to prevent a race with a file closing and also remove
    useless and ugly sscanf code. The extra code was never needed
    and the case it supposedly protected against is in fact handled
    correctly by sock_from_file, as pointed out by Al Viro.

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     

15 Aug, 2012

6 commits

  • napi->poll() needs IRQs enabled, so we have to re-enable IRQs before
    calling it.

    Cc: David Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • Without this patch, I can't get netconsole logs remotely over
    vlan. The reason is probably that we don't handle vlan tags in
    either the netpoll tx or rx path.

    I am not sure if I use these vlan functions correctly; at
    least this patch works.

    Cc: Benjamin LaHaise
    Cc: Patrick McHardy
    Cc: David Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • This patch fixes several problems in the call path of
    netpoll_send_skb_on_dev():

    1. Disable IRQ's before calling netpoll_send_skb_on_dev().

    2. All the callees of netpoll_send_skb_on_dev() should use
    rcu_dereference_bh() to dereference ->npinfo.

    3. Rename arp_reply() to netpoll_arp_reply(), the former is too generic.

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • In __netpoll_rx(), it dereferences ->npinfo without rcu_dereference_bh(),
    this patch fixes it by using the 'npinfo' passed from netpoll_rx()
    where it is already dereferenced with rcu_dereference_bh().

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
    may be called with read_lock() held too, so we should make them
    non-blocking by moving the cleanup and kfree() into call_rcu_bh()
    callbacks, as sketched below.

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
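
    A sketch of the deferred-free pattern described above (the rcu_head
    field and callback name are assumptions):

        static void rcu_cleanup_netpoll_info(struct rcu_head *rcu)
        {
                kfree(container_of(rcu, struct netpoll_info, rcu));
        }

        /* instead of blocking for a grace period while read_lock() is held: */
        call_rcu_bh(&npinfo->rcu, rcu_cleanup_netpoll_info);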
     
  • slave_enable_netpoll() and __netpoll_setup() may be called
    with read_lock() held, so they should use GFP_ATOMIC to allocate
    memory. Eric suggested passing gfp flags to __netpoll_setup().

    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Reported-by: Dan Carpenter
    Signed-off-by: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
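
    A sketch of threading the gfp flags through, per the description
    above (the exact prototype and body are assumptions):

        int __netpoll_setup(struct netpoll *np, struct net_device *ndev, gfp_t gfp)
        {
                struct netpoll_info *npinfo;

                /* GFP_ATOMIC when the caller holds read_lock(), else GFP_KERNEL */
                npinfo = kmalloc(sizeof(*npinfo), gfp);
                if (!npinfo)
                        return -ENOMEM;
                /* ... */
        }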
     

09 Aug, 2012

2 commits

  • Do not leak memory by overwriting the pointer with a potentially NULL realloc return value.

    Found by Linux Driver Verification project (linuxtesting.org).

    Signed-off-by: Alexey Khoroshilov
    Signed-off-by: David S. Miller

    Alexey Khoroshilov
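
    The hazard generalizes beyond the kernel; in plain C the safe pattern
    keeps the old pointer until realloc() succeeds (set_alias() here is a
    hypothetical example, not the patched function):

        #include <stdlib.h>
        #include <string.h>

        char *set_alias(char *alias, const char *name, size_t len)
        {
                char *new_alias = realloc(alias, len + 1);

                if (!new_alias)
                        return NULL;    /* 'alias' is untouched and still owned */
                memcpy(new_alias, name, len);
                new_alias[len] = '\0';
                return new_alias;
        }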
     
  • While investigating network performance problems, I found this little
    gem:

    $ nm -v vmlinux | grep -1 dst_default_metrics
    ffffffff82736540 b busy.46605
    ffffffff82736560 B dst_default_metrics
    ffffffff82736598 b dst_busy_list

    Apparently, declaring a const array without an initializer puts it in
    the (writable) bss section, in the middle of possibly often-dirtied
    cache lines.

    Since we really want dst_default_metrics to be const, to avoid any
    possible false sharing and catch any buggy writes, I force a null
    initializer:

    ffffffff818a4c20 R dst_default_metrics

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
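
    The effect is easy to reproduce in userspace; compile this and compare
    the nm output (section letters can vary by toolchain):

        /* tentative (uninitialized) const definition: typically lands in
         * .bss or common -- i.e. writable memory ('b'/'B'/'C' in nm) */
        const unsigned long metrics_uninit[4];

        /* an explicit initializer forces it into .rodata ('R' in nm) */
        const unsigned long metrics_init[4] = { 0 };

        int main(void) { return 0; }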
     

02 Aug, 2012

2 commits

  • Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to
    limit the size of TSO skbs. This avoids the need to fall back to
    software GSO for local TCP senders.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • A peer (or local user) may cause TCP to use a nominal MSS of as little
    as 88 (actual MSS of 76 with timestamps). Given that we have a
    sufficiently prodigious local sender and the peer ACKs quickly enough,
    it is nevertheless possible to grow the window for such a connection
    to the point that we will try to send just under 64K at once. This
    results in a single skb that expands to 861 segments.

    In some drivers with TSO support, such an skb will require hundreds of
    DMA descriptors; a substantial fraction of a TX ring or even more than
    a full ring. The TX queue selected for the skb may stall and trigger
    the TX watchdog repeatedly (since the problem skb will be retried
    after the TX reset). This particularly affects sfc, for which the
    issue is designated as CVE-2012-3412.

    Therefore:
    1. Add the field net_device::gso_max_segs holding the device-specific
    limit.
    2. In netif_skb_features(), if the number of segments is too high then
    mask out GSO features to force fall back to software GSO.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
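
    Per the description, the added check in netif_skb_features() amounts
    to (a sketch):

        /* too many segments for the device: force software GSO */
        if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
                features &= ~NETIF_F_GSO_MASK;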
     

01 Aug, 2012

8 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: fix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • Pull random subsystem patches from Ted Ts'o:
    "This patch series contains a major revamp of how we collect entropy
    from interrupts for /dev/random and /dev/urandom.

    The goal is to addresses weaknesses discussed in the paper "Mining
    your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
    by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
    which will be published in the Proceedings of the 21st Usenix Security
    Symposium, August 2012. (See https://factorable.net for more
    information and an extended version of the paper.)"

    Fix up trivial conflicts due to nearby changes in
    drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
    random: mix in architectural randomness in extract_buf()
    dmi: Feed DMI table to /dev/random driver
    random: Add comment to random_initialize()
    random: final removal of IRQF_SAMPLE_RANDOM
    um: remove IRQF_SAMPLE_RANDOM which is now a no-op
    sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
    board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
    isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
    uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
    drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
    xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
    n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
    i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
    input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
    mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
    ...

    Linus Torvalds
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same PF_MEMALLOC reserve logic.

    When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the form
    factor or maintenance costs do not allow the use of disks, and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 bolts filesystem-specific swapfile
    support onto the side and that the default handlers have
    different information available to them than the filesystem
    does. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device was
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this does happen, a warning is generated and the tokens are reclaimed
    to avoid accounting errors until the bug is fixed.

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to make sure pfmemalloc packets receive all memory needed to
    proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
    This is limited to a subset of protocols that are expected to be used for
    writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
    expected to communicate with userspace processes which could be paged out.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [jslaby@suse.cz: Lock imbalance fix]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
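
    A sketch of the receive-path shape this describes (helper names follow
    the series; the exact placement is an assumption):

        if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
                unsigned long pflags = current->flags;

                /* process reserve-backed skbs with access to the reserves,
                 * so delivery to SOCK_MEMALLOC sockets cannot stall */
                current->flags |= PF_MEMALLOC;
                ret = __netif_receive_skb(skb);
                tsk_restore_flags(current, pflags, PF_MEMALLOC);
        } else {
                ret = __netif_receive_skb(skb);
        }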
     
  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
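
    The drop described above happens early in socket input; the check is
    roughly (a sketch):

        /* an skb from the pfmemalloc reserves may only be consumed by a
         * socket that is itself helping reclaim (SOCK_MEMALLOC) */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;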
     
  • Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
    for their allocations. These sockets will be able to go below watermarks
    and allocate from the emergency reserve. Such sockets are to be used to
    service the VM (in other words, to swap over). They must be handled
    kernel side; exposing such a socket to user-space is a bug.

    There is a risk that the reserves will be depleted, so for now the
    administrator is responsible for increasing min_free_kbytes as necessary
    to prevent deadlock for their workloads.

    [a.p.zijlstra@chello.nl: Original patches]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
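
    A sketch of tagging such a socket (the helper name follows the series;
    the body is an assumption):

        int sk_set_memalloc(struct sock *sk)
        {
                sock_set_flag(sk, SOCK_MEMALLOC);       /* may go below watermarks */
                sk->sk_allocation |= __GFP_MEMALLOC;    /* allocate from reserves */
                return 0;
        }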
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
    to solve a race but added a problem at device/fib dismantle time:

    We really want to call dst_free() as soon as possible, even if sockets
    still have dst entries in their cache.
    dst_release() calls in free_fib_info_rcu() are not welcome.

    The root of the problem is that, now that we also cache output routes
    (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
    rt_free(), because output route lookups are done in process context.

    This is based on feedback and an initial patch from David Miller (adding
    another call_rcu_bh() call in fib, but it appears that was not the right
    fix).

    I left the inet_sk_rx_dst_set() helper and added __rcu attributes
    to nh_rth_output and nh_rth_input to better document what is going on in
    this code.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    Crashes happen on TCP workloads if routes are added/deleted at the
    same time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    Instead we need regular dst refcounting (dst_release()) and must make
    sure dst_release() is aware of RCU grace periods:

    Add a DST_RCU_FREE flag so that dst_release() respects an RCU grace
    period before dst destruction for cached dsts.

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we don't increase a zero refcount (on a dst currently
    waiting out an RCU grace period before destruction).

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
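
    A sketch of the helper as described (the refcount field name and exact
    body are assumptions):

        void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
        {
                struct dst_entry *dst = skb_dst(skb);

                /* refuse a dst already waiting out its RCU grace period */
                if (dst && atomic_inc_not_zero(&dst->__refcnt))
                        sk->sk_rx_dst = dst;
        }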
     

30 Jul, 2012

1 commit

  • When userspace uses RTM_GETROUTE to dump the route table, an already
    expired route entry always yields an 'expires' value (2147157)
    calculated based on INT_MAX.

    The reason for this problem is the following statement:

    rt->dst.expires - jiffies < INT_MAX

    gcc promotes both operands of '<' to unsigned long, so a negative
    difference wraps to a huge positive value, the comparison is always
    false, and the reported value is always clamped based on INT_MAX.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
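
    The promotion bug is easy to demonstrate in plain C:

        #include <stdio.h>
        #include <limits.h>

        int main(void)
        {
                unsigned long jiffies = 1000000;        /* "now" */
                unsigned long expires = 999000;         /* already expired */

                /* the subtraction is done in unsigned long: -1000 wraps to
                 * a huge positive value, so the comparison is always false */
                if (expires - jiffies < INT_MAX)
                        printf("normal expires\n");
                else
                        printf("clamped to INT_MAX: diff=%lu\n",
                               expires - jiffies);
                return 0;
        }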
     

28 Jul, 2012

1 commit

  • When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI
    flags are handled specially. Function dev_change_flags sets IFF_PROMISC and
    IFF_ALLMULTI bits in dev->gflags according to the passed value, but
    do_setlink passes the result of rtnl_dev_combine_flags, which takes
    those bits from dev->flags.

    This can easily be triggered by doing:

    tcpdump -i eth0 &
    ip l s up eth0

    ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with
    IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set
    IFF_PROMISC in gflags.

    Reported-by: Max Matveev
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

25 Jul, 2012

2 commits

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     
  • Pull networking changes from David S Miller:

    1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache. Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making. Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought to be simple to fix at this
    point. Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

    2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

    3) TCP SYN/ACK performance tweaks from Eric Dumazet.

    4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

    5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

    6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

    7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

    8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

    9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

    10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer. Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

    11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

    12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

    13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

    14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket. The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too. From Eric
    Dumazet.

    15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
    genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
    r8169: revert "add byte queue limit support".
    ipv4: Change rt->rt_iif encoding.
    net: Make skb->skb_iif always track skb->dev
    ipv4: Prepare for change of rt->rt_iif encoding.
    ipv4: Remove all RTCF_DIRECTSRC handliing.
    ipv4: Really ignore ICMP address requests/replies.
    decnet: Don't set RTCF_DIRECTSRC.
    net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
    ipv4: Remove redundant assignment
    rds: set correct msg_namelen
    openvswitch: potential NULL deref in sample()
    tcp: dont drop MTU reduction indications
    bnx2x: Add new 57840 device IDs
    tcp: avoid oops in tcp_metrics and reset tcpm_stamp
    niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
    niu: Fix to check for dma mapping errors.
    net: Fix references to out-of-scope variables in put_cmsg_compat()
    net: ethernet: davinci_emac: add pm_runtime support
    net: ethernet: davinci_emac: Remove unnecessary #include
    ...

    Linus Torvalds
     

24 Jul, 2012

2 commits

  • Make it follow device decapsulation, from things such as VLAN and
    bonding.

    The stuff that actually cares about pre-demuxed device pointers is
    handled by the "orig_dev" variable in __netif_receive_skb(). And
    the only consumer of that is the po->origdev feature of AF_PACKET
    sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in making, but with
    Miklos getting through really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

23 Jul, 2012

7 commits

  • The ipv4 routing cache is non-deterministic, performance-wise, and is
    subject to reasonably easy-to-launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables, and the former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high-volume sites can see
    hit rates in the routing cache of only ~10%.

    The general flow of this patch series is that first the routing cache
    is removed. We build a completely new rtable entry every lookup
    request.

    Next we make some simplifications due to the fact that removing the
    routing cache causes several members of struct rtable to become no
    longer necessary.

    Then we need to make some amends such that we can legally cache
    pre-constructed routes in the FIB nexthops. Firstly, we need to
    invalidate routes which are hit with nexthop exceptions. Secondly we
    have to change the semantics of rt->rt_gateway such that zero means
    that the destination is on-link and non-zero otherwise.

    Now that the preparations are ready, we start caching precomputed
    routes in the FIB nexthops. Output and input routes need different
    kinds of care when determining if we can legally do such caching or
    not. The details are in the commit log messages for those changes.

    The patch series then winds down with some more struct rtable
    simplifications and other tidy ups that remove unnecessary overhead.

    On a SPARC-T3 output route lookups are ~876 cycles. Input route
    lookups are ~1169 cycles with rpfilter disabled, and about ~1468
    cycles with rpfilter enabled.

    These measurements were taken with the kbench_mod test module in the
    net_test_tools GIT tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

    That GIT tree also includes a udpflood tester tool and stresses
    route lookups on packet output.

    For example, on the same SPARC-T3 system we can run:

    time ./udpflood -l 10000000 10.2.2.11

    with routing cache:
    real 1m21.955s user 0m6.530s sys 1m15.390s

    without routing cache:
    real 1m31.678s user 0m6.520s sys 1m25.140s

    Performance undoubtedly can easily be improved further.

    For example fib_table_lookup() performs a lot of excessive
    computations with all the masking and shifting, some of it
    conditionalized to deal with edge cases.

    Also, Eric's no-ref optimization for input route lookups can be
    re-instated for the FIB nexthop caching code path. I would be really
    pleased if someone would work on that.

    In fact, anyone suitably motivated can just fire up perf on the loading
    of the net_test_tools benchmark kernel module. I spend much of
    my time going:

    bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
    bash# perf report

    Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
    Hutchings, and others.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Recursion in __scm_destroy() will be cut by delaying the final fput().

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example, in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set, but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value, which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Export skb_copy_ubufs so that modules can orphan frags.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Zero copy packets are normally sent to the outside network, but
    bridging, tun etc. might loop them back to the host networking stack.
    If this happens, destructors will never be called, so orphan the
    frags immediately on receive.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Reduce code duplication a bit using the new helper.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
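
    The new helper referred to here appears to be skb_orphan_frags();
    roughly (a sketch based on the surrounding commits):

        /* copy userspace (zerocopy) frags so their destructor can run,
         * but only when the skb actually carries zerocopy frags */
        static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
        {
                if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
                        return 0;
                return skb_copy_ubufs(skb, gfp_mask);
        }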
     
  • Commit 76ff5cc91935c51fcf1a6a99ffa28b97a6e7a884
    (rtnl: allow to specify number of rx and tx queues
    on device creation) added a reference to the net_device
    structure's 'num_rx_queues' member in

    net/core/rtnetlink.c:rtnl_fill_ifinfo()

    However, the definition for 'num_rx_queues' is surrounded
    by an '#ifdef CONFIG_RPS' while the new reference to it is
    not. This causes a compile error when CONFIG_RPS is not
    defined.

    Fix the compile error by surrounding the new reference to
    'num_rx_queues' by an '#ifdef CONFIG_RPS'.

    CC: Jiri Pirko
    Signed-off-by: Mark A. Greer
    Signed-off-by: David S. Miller

    Mark A. Greer
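
    The fix mirrors the guard around the definition (a sketch; the
    attribute name follows the referenced commit):

        #ifdef CONFIG_RPS
                if (nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues))
                        goto nla_put_failure;
        #endif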
     

21 Jul, 2012

3 commits