22 Aug, 2012

3 commits

  • Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
    memory in __ip_select_ident().

    It turns out that __ip_make_skb() called ip_select_ident() before
    properly initializing iph->daddr.

    This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
    cached route inetpeer.)

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131

    Reported-by: Christian Casteyde
    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and
    TCP."), inet_csk_route_child_sock() is called instead of
    inet_csk_route_req().

    However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
    ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
    Thus, inside inet_csk_route_child_sock() opt is always NULL and the
    SRR-options are not respected anymore.
    Packets sent by the server won't have the correct destination-IP.

    This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
    inside inet_csk_route_child_sock().

    Reported-by: Luca Boccassi
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     
  • Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
    added a bug leading to the following trace:

    [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
    [ 2866.131726]
    [ 2866.132188] =========================
    [ 2866.132281] [ BUG: held lock freed! ]
    [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
    [ 2866.132281] -------------------------
    [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
    [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
    [ 2866.132281] 4 locks held by kworker/0:1/652:
    [ 2866.132281] #0: (rpciod){.+.+.+}, at: [] process_one_work+0x1de/0x47f
    [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [] process_one_work+0x1de/0x47f
    [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
    [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [] run_timer_softirq+0x1ad/0x35f
    [ 2866.132281]
    [ 2866.132281] stack backtrace:
    [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
    [ 2866.132281] Call Trace:
    [ 2866.132281] [] debug_check_no_locks_freed+0x112/0x159
    [ 2866.132281] [] ? __sk_free+0xfd/0x114
    [ 2866.132281] [] kmem_cache_free+0x6b/0x13a
    [ 2866.132281] [] __sk_free+0xfd/0x114
    [ 2866.132281] [] sk_free+0x1c/0x1e
    [ 2866.132281] [] tcp_write_timer+0x51/0x56
    [ 2866.132281] [] run_timer_softirq+0x218/0x35f
    [ 2866.132281] [] ? run_timer_softirq+0x1ad/0x35f
    [ 2866.132281] [] ? rb_commit+0x58/0x85
    [ 2866.132281] [] ? tcp_write_timer_handler+0x148/0x148
    [ 2866.132281] [] __do_softirq+0xcb/0x1f9
    [ 2866.132281] [] ? _raw_spin_unlock+0x29/0x2e
    [ 2866.132281] [] call_softirq+0x1c/0x30
    [ 2866.132281] [] do_softirq+0x4a/0xa6
    [ 2866.132281] [] irq_exit+0x51/0xad
    [ 2866.132281] [] do_IRQ+0x9d/0xb4
    [ 2866.132281] [] common_interrupt+0x6f/0x6f
    [ 2866.132281] [] ? sched_clock_cpu+0x58/0xd1
    [ 2866.132281] [] ? _raw_spin_unlock_irqrestore+0x4c/0x56
    [ 2866.132281] [] mod_timer+0x178/0x1a9
    [ 2866.132281] [] sk_reset_timer+0x19/0x26
    [ 2866.132281] [] tcp_rearm_rto+0x99/0xa4
    [ 2866.132281] [] tcp_event_new_data_sent+0x6e/0x70
    [ 2866.132281] [] tcp_write_xmit+0x7de/0x8e4
    [ 2866.132281] [] ? __alloc_skb+0xa0/0x1a1
    [ 2866.132281] [] __tcp_push_pending_frames+0x2e/0x8a
    [ 2866.132281] [] tcp_sendmsg+0xb32/0xcc6
    [ 2866.132281] [] inet_sendmsg+0xaa/0xd5
    [ 2866.132281] [] ? inet_autobind+0x5f/0x5f
    [ 2866.132281] [] ? trace_clock_local+0x9/0xb
    [ 2866.132281] [] sock_sendmsg+0xa3/0xc4
    [ 2866.132281] [] ? rb_reserve_next_event+0x26f/0x2d5
    [ 2866.132281] [] ? native_sched_clock+0x29/0x6f
    [ 2866.132281] [] ? sched_clock+0x9/0xd
    [ 2866.132281] [] ? trace_clock_local+0x9/0xb
    [ 2866.132281] [] kernel_sendmsg+0x37/0x43
    [ 2866.132281] [] xs_send_kvec+0x77/0x80
    [ 2866.132281] [] xs_sendpages+0x6f/0x1a0
    [ 2866.132281] [] ? try_to_del_timer_sync+0x55/0x61
    [ 2866.132281] [] xs_tcp_send_request+0x55/0xf1
    [ 2866.132281] [] xprt_transmit+0x89/0x1db
    [ 2866.132281] [] ? call_connect+0x3c/0x3c
    [ 2866.132281] [] call_transmit+0x1c5/0x20e
    [ 2866.132281] [] __rpc_execute+0x6f/0x225
    [ 2866.132281] [] ? call_connect+0x3c/0x3c
    [ 2866.132281] [] rpc_async_schedule+0x28/0x34
    [ 2866.132281] [] process_one_work+0x24d/0x47f
    [ 2866.132281] [] ? process_one_work+0x1de/0x47f
    [ 2866.132281] [] ? __rpc_execute+0x225/0x225
    [ 2866.132281] [] worker_thread+0x236/0x317
    [ 2866.132281] [] ? process_scheduled_works+0x2f/0x2f
    [ 2866.132281] [] kthread+0x9a/0xa2
    [ 2866.132281] [] kernel_thread_helper+0x4/0x10
    [ 2866.132281] [] ? retint_restore_args+0x13/0x13
    [ 2866.132281] [] ? __init_kthread_worker+0x5a/0x5a
    [ 2866.132281] [] ? gs_change+0x13/0x13
    [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
    [ 2866.309689] =============================================================================
    [ 2866.310254] BUG TCP (Not tainted): Object already free
    [ 2866.310254] -----------------------------------------------------------------------------
    [ 2866.310254]

    The bug comes from the fact that the timer set in sk_reset_timer() can run
    before we actually do the sock_hold(). The socket refcount reaches zero and
    we free the socket too soon.

    The timer handler is not allowed to reduce the socket refcnt if the socket
    is owned by the user, or we need to change the sk_reset_timer()
    implementation.

    We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
    or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.

    Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
    was used instead of TCP_DELACK_TIMER_DEFERRED.

    For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
    even if not fired from a timer.

    Reported-by: Fengguang Wu
    Tested-by: Fengguang Wu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
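
    The refcount rule described above can be modelled in user space. This is
    only a sketch under stated assumptions: struct demo_sock, its fields and
    both functions are hypothetical stand-ins, not the kernel's structures.
    It shows a timer callback that, when the socket is owned by the user,
    takes a reference for the deferred work before dropping the timer's own
    reference, so the callback never lowers the count on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit, modelling one of the *_DEFERRED bits in tsq_flags. */
#define DEMO_WRITE_TIMER_DEFERRED 0x1u

/* Hypothetical stand-in for a socket; not the kernel's struct sock. */
struct demo_sock {
    int refcnt;
    bool owned_by_user;
    unsigned int deferred;  /* models tsq_flags */
    int handled;            /* how many times the timer work really ran */
};

/* Timer callback. The timer owns one reference, dropped at the end. */
void demo_write_timer(struct demo_sock *sk)
{
    assert(sk != NULL);
    if (!sk->owned_by_user) {
        sk->handled++;
    } else if (!(sk->deferred & DEMO_WRITE_TIMER_DEFERRED)) {
        /* Defer the work and pin the socket until the work runs. */
        sk->deferred |= DEMO_WRITE_TIMER_DEFERRED;
        sk->refcnt++;
    }
    sk->refcnt--;  /* the reference owned by the timer */
}

/* Runs when the user releases the socket lock. */
void demo_release_cb(struct demo_sock *sk)
{
    assert(sk != NULL);
    sk->owned_by_user = false;
    if (sk->deferred & DEMO_WRITE_TIMER_DEFERRED) {
        sk->deferred &= ~DEMO_WRITE_TIMER_DEFERRED;
        sk->handled++;
        sk->refcnt--;  /* the reference taken when the work was deferred */
    }
}
```

    In this model, a timer that fires before its own reference has been
    taken leaves the count unchanged while the socket is owned by the user;
    the deferred reference is dropped only after the work has run.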
     

20 Aug, 2012

2 commits

  • This commit removes the sk_rx_dst_set calls from
    tcp_create_openreq_child(), because at that point the icsk_af_ops
    field of ipv6_mapped TCP sockets has not been set to its proper final
    value.

    Instead, to make sure we get the right sk_rx_dst_set variant
    appropriate for the address family of the new connection, we have
    tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
    shortly after the call to tcp_create_openreq_child() returns.

    This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
    with the new approach.

    Signed-off-by: Neal Cardwell
    Reported-by: Artem Savkov
    Cc: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Pablo Neira Ayuso says:

    ====================
    The following five patches contain fixes for 3.6-rc, they are:

    * Two fixes for message parsing in the SIP conntrack helper, from
    Patrick McHardy.

    * One fix for the SIP helper introduced in the user-space cthelper
    infrastructure, from Patrick McHardy.

    * fix missing appropriate locking while modifying one conntrack entry
    from the nfqueue integration code, from myself.

    * fix possible access to uninitialized timer in the nf_conntrack
    expectation infrastructure, from myself.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

15 Aug, 2012

1 commit

  • Commit caacf05e5ad1abf causes a big drop in UDP loopback performance.
    The cause of the regression is that we do not cache the local output
    routes. Each time we send a datagram from an unconnected UDP socket,
    the kernel allocates a dst_entry and adds it to the rt_uncached_list.
    This creates lock contention on the rt_uncached_lock.

    Reported-by: Alex Shi
    Signed-off-by: Yan, Zheng
    Signed-off-by: David S. Miller

    Yan, Zheng
     

11 Aug, 2012

1 commit

  • ip_send_skb() can send orphaned skb, so we must pass the net pointer to
    avoid possible NULL dereference in error path.

    Bug added by commit 3a7c384ffd57 (ipv4: tcp: unicast_sock should not
    land outside of TCP stack)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2012

4 commits

  • Via-headers are parsed beginning at the first character after the Via-address.
    When the address is translated first and its length decreases, the offset to
    start parsing at is incorrect and header parameters might be missed.

    Update the offset after translating the Via-address to fix this.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • Within SIP messages IPv6 addresses are enclosed in square brackets in most
    cases, with the exception of the "received=" header parameter. Currently
    the helper fails to parse enclosed addresses.

    This patch:

    - changes the SIP address parsing function to enforce square brackets
    when required, and accept them when not required but present, as
    recommended by RFC 5118.

    - adds a new SDP address parsing function that never accepts square
    brackets since SDP doesn't use them.

    With these changes, the SIP helper correctly parses all test messages
    from RFC 5118 (Session Initiation Protocol (SIP) Torture Test Messages
    for Internet Protocol Version 6 (IPv6)).

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
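
    The bracket rule above can be sketched as a small user-space parser. This
    is an illustration only: demo_parse_sip_addr6() and its parameters are
    hypothetical and are not the helper's actual functions. It rejects an
    unbracketed address when brackets are required, and accepts brackets when
    they are present but optional. It assumes a POSIX system for inet_pton().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Parse an IPv6 address at the start of 's'.
 * If 'brackets_required' is true, only the "[addr]" form is accepted.
 * If it is false, both "addr" and "[addr]" are accepted.
 * Returns the number of characters consumed, or 0 on failure.
 */
size_t demo_parse_sip_addr6(const char *s, bool brackets_required,
                            struct in6_addr *out)
{
    char buf[INET6_ADDRSTRLEN];
    bool bracketed;
    size_t n;

    assert(s != NULL && out != NULL);
    bracketed = (s[0] == '[');
    if (brackets_required && !bracketed)
        return 0;
    if (bracketed)
        s++;

    /* Address characters: hex digits, ':' and '.' (embedded IPv4). */
    n = strspn(s, "0123456789abcdefABCDEF:.");
    if (n == 0 || n >= sizeof(buf))
        return 0;
    memcpy(buf, s, n);
    buf[n] = '\0';
    if (inet_pton(AF_INET6, buf, out) != 1)
        return 0;

    if (bracketed) {
        if (s[n] != ']')
            return 0;       /* opening bracket without a closing one */
        return n + 2;       /* address plus both brackets */
    }
    return n;
}
```

    An SDP variant in the same style would simply fail whenever the input
    starts with '[', since SDP does not use brackets.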
     
  • commit 5d299f3d3c8a2fb (net: ipv6: fix TCP early demux) added a
    regression for ipv6_mapped case.

    [ 67.422369] SELinux: initialized (dev autofs, type autofs), uses
    genfs_contexts
    [ 67.449678] SELinux: initialized (dev autofs, type autofs), uses
    genfs_contexts
    [ 92.631060] BUG: unable to handle kernel NULL pointer dereference at
    (null)
    [ 92.631435] IP: [< (null)>] (null)
    [ 92.631645] PGD 0
    [ 92.631846] Oops: 0010 [#1] SMP
    [ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
    dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
    parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
    snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
    snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
    snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
    uhci_hcd
    [ 92.634294] CPU 0
    [ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
    [ 92.634294] RIP: 0010:[] [< (null)>]
    (null)
    [ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282
    [ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
    0000000000000000
    [ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
    ffff88024827ad00
    [ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
    0000000000000000
    [ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
    ffff880254735380
    [ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
    7fffffffffff0218
    [ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
    knlGS:0000000000000000
    [ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
    00000000000007f0
    [ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    0000000000000000
    [ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
    0000000000000400
    [ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
    task ffff880254b8cac0)
    [ 92.634294] Stack:
    [ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
    ffff880245fc7d68
    [ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8
    ffff880245fc7d18
    [ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002
    00000000000000ff
    [ 92.634294] Call Trace:
    [ 92.634294] [] ? tcp_finish_connect+0x2c/0xfa
    [ 92.634294] [] tcp_rcv_state_process+0x2b6/0x9c6
    [ 92.634294] [] ? sched_clock_cpu+0xc3/0xd1
    [ 92.634294] [] ? local_clock+0x2b/0x3c
    [ 92.634294] [] tcp_v4_do_rcv+0x63a/0x670
    [ 92.634294] [] release_sock+0x128/0x1bd
    [ 92.634294] [] __inet_stream_connect+0x1b1/0x352
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? wake_up_bit+0x25/0x25
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? inet_stream_connect+0x22/0x4b
    [ 92.634294] [] inet_stream_connect+0x33/0x4b
    [ 92.634294] [] sys_connect+0x78/0x9e
    [ 92.634294] [] ? sysret_check+0x1b/0x56
    [ 92.634294] [] ? __audit_syscall_entry+0x195/0x1c8
    [ 92.634294] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 92.634294] [] system_call_fastpath+0x16/0x1b
    [ 92.634294] Code: Bad RIP value.
    [ 92.634294] RIP [< (null)>] (null)
    [ 92.634294] RSP
    [ 92.634294] CR2: 0000000000000000
    [ 92.648982] ---[ end trace 24e2bed94314c8d9 ]---
    [ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt

    Fix this using inet_sk_rx_dst_set(), and export this function in case
    IPv6 is modular.

    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit be9f4a44e7d41cee (ipv4: tcp: remove per net tcp_sock) added a
    selinux regression, reported and bisected by John Stultz

    selinux_ip_postroute_compat() expects to find a valid sk->sk_security
    pointer, but this field is NULL for unicast_sock.

    It turns out that unicast_sock are really temporary stuff to be able
    to reuse part of IP stack (ip_append_data()/ip_push_pending_frames())

    Fact is that frames sent by ip_send_unicast_reply() should be orphaned
    to not fool LSM.

    Note IPv6 never had this problem, as tcp_v6_send_response() doesn't use a
    fake socket at all. I'll probably implement tcp_v4_send_response() to
    remove these unicast_sock in linux-3.7.

    Reported-by: John Stultz
    Bisected-by: John Stultz
    Signed-off-by: Eric Dumazet
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: "Serge E. Hallyn"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Aug, 2012

2 commits

  • We currently leak all tcp metrics at struct net dismantle time.

    tcp_net_metrics_exit() frees the hash table, we must first
    iterate it to free all metrics.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
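
    The teardown order described above (walk every bucket and free its
    entries, then free the table itself) can be sketched with a generic
    chained hash table. The demo_* names are hypothetical and this is not the
    kernel's tcp metrics code; it only illustrates why freeing the bucket
    array alone leaks the entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical entry and table types for illustration only. */
struct demo_metric {
    struct demo_metric *next;
};

struct demo_table {
    struct demo_metric **buckets;
    size_t nbuckets;
};

int demo_table_init(struct demo_table *t, size_t nbuckets)
{
    assert(t != NULL);
    if (nbuckets == 0)
        return -1;
    t->buckets = calloc(nbuckets, sizeof(*t->buckets));
    if (t->buckets == NULL)
        return -1;
    t->nbuckets = nbuckets;
    return 0;
}

int demo_table_add(struct demo_table *t, size_t hash)
{
    struct demo_metric *m;
    size_t slot;

    assert(t != NULL && t->buckets != NULL);
    m = malloc(sizeof(*m));
    if (m == NULL)
        return -1;
    slot = hash % t->nbuckets;
    m->next = t->buckets[slot];
    t->buckets[slot] = m;
    return 0;
}

/* Free every chained entry first, then the bucket array. */
size_t demo_table_exit(struct demo_table *t)
{
    size_t i, freed = 0;

    assert(t != NULL);
    for (i = 0; i < t->nbuckets; i++) {
        struct demo_metric *m = t->buckets[i];

        while (m != NULL) {
            struct demo_metric *next = m->next;

            free(m);
            freed++;
            m = next;
        }
    }
    free(t->buckets);
    t->buckets = NULL;
    t->nbuckets = 0;
    return freed;  /* number of entries released */
}
```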
     
  • After IP route cache removal, I believe rcu_bh() has very little use and
    we should remove this RCU variant, since it adds some cycles in fast
    path.

    Anyway, the call_rcu_bh() use in fib_trie is obviously wrong, since
    some users only assert rcu_read_lock().

    Signed-off-by: Eric Dumazet
    Cc: "Paul E. McKenney"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Aug, 2012

3 commits


02 Aug, 2012

2 commits


01 Aug, 2012

8 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: gix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are when blade servers are used as part of a cluster where the
    form factor or maintenance costs do not allow the use of disks and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 is bolting
    filesystem-specific-swapfile-support onto the side and that
    the default handlers have different information to what
    is available to the filesystem. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device is
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens reclaimed to avoid
    accounting errors until the bug is fixed.

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
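
    The exemption described under "This patch" can be illustrated with a toy
    admission check. This is a simplified model, not the kernel's memory
    accounting: struct demo_sock, demo_rmem_schedule() and the byte counters
    are hypothetical. It shows only the decision that a socket flagged as
    memalloc is admitted even when the global receive limit is exceeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket with a flag modelling SOCK_MEMALLOC. */
struct demo_sock {
    bool memalloc;
};

/*
 * Decide whether 'size' more bytes may be buffered for 'sk'.
 * 'global_rmem' is the amount currently buffered by all sockets and
 * 'global_limit' is the global receive limit (both hypothetical counters).
 */
bool demo_rmem_schedule(const struct demo_sock *sk, long global_rmem,
                        long global_limit, long size)
{
    assert(sk != NULL);
    if (sk->memalloc)
        return true;  /* exempt: needed to make forward progress */
    return global_rmem + size <= global_limit;
}
```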
     
  • Introduce sk_gfp_atomic(), which allows injecting sock-specific
    flags into each sock-related allocation. It is only used on allocation
    paths that may be required for writing pages back to network storage.

    [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
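
    A user-space sketch of the idea follows. The flag values and the demo_*
    names are placeholders chosen for illustration, not the kernel's
    definitions. It assumes the helper combines an atomic allocation mask
    with whatever memalloc bit the socket's own allocation mask carries.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int demo_gfp_t;

/* Placeholder bit values; the real kernel flags differ. */
#define DEMO_GFP_ATOMIC   0x20u
#define DEMO_GFP_KERNEL   0xd0u
#define DEMO_GFP_MEMALLOC 0x2000u

/* Hypothetical socket carrying a per-socket allocation mask. */
struct demo_sock {
    demo_gfp_t allocation;
};

/* Atomic allocation flags plus the socket's memalloc bit, if it is set. */
demo_gfp_t demo_sk_gfp_atomic(const struct demo_sock *sk)
{
    assert(sk != NULL);
    return DEMO_GFP_ATOMIC | (sk->allocation & DEMO_GFP_MEMALLOC);
}
```

    Ordinary sockets get plain atomic flags; only a socket whose mask carries
    the memalloc bit passes that bit on to its allocations.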
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When a device is unregistered, we have to purge all of the
    references to it that may exist in the entire system.

    If a route is uncached, we currently have no way of accomplishing
    this.

    So create a global list that is scanned when a network device goes
    down. This mirrors the logic in net/core/dst.c's dst_ifdown().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Input path is mostly run under RCU and doesn't touch dst refcnt.

    But the output path on forwarding or UDP workloads hits the dst
    refcount hard, and we have a lot of false sharing, for example
    in ipv4_mtu() when reading rt->rt_pmtu.

    Using a percpu cache for nh_rth_output gives a nice performance
    increase at a small cost.

    24 udpflood test on my 24 cpu machine (dummy0 output device)
    (each process sends 1.000.000 udp frames, 24 processes are started)

    before : 5.24 s
    after : 2.06 s
    For reference, time on linux-3.5 : 6.60 s

    Signed-off-by: Eric Dumazet
    Tested-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
    to solve a race but added a problem at device/fib dismantle time :

    We really want to call dst_free() as soon as possible, even if sockets
    still have dst in their cache.
    dst_release() calls in free_fib_info_rcu() are not welcomed.

    Root of the problem was that now we also cache output routes (in
    nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
    rt_free(), because output route lookups are done in process context.

    Based on feedback and initial patch from David Miller (adding another
    call_rcu_bh() call in fib, but it appears it was not the right fix)

    I left the inet_sk_rx_dst_set() helper and added __rcu attributes
    to nh_rth_output and nh_rth_input to better document what is going on in
    this code.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2012

3 commits

  • After IP route cache removal, rt_cache_rebuild_count is no longer
    used.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    crashes happen on tcp workloads if routes are added/deleted at the same
    time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    We need instead regular dst refcounting (dst_release()) and make
    sure dst_release() is aware of RCU grace periods :

    Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
    before dst destruction for cached dst

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we don't increase a zero refcount (on a dst currently
    waiting an rcu grace period before destruction)

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
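
    The "increment unless zero" pattern mentioned above can be sketched with
    C11 atomics. This is a user-space illustration, not the kernel's
    atomic_inc_not_zero(); struct demo_dst and demo_inc_not_zero() are
    hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted dst entry. */
struct demo_dst {
    atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero. A zero count
 * means the object is already waiting for destruction, so the caller
 * must not cache a pointer to it.
 */
bool demo_inc_not_zero(struct demo_dst *dst)
{
    int old;

    assert(dst != NULL);
    old = atomic_load(&dst->refcnt);
    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

    A caller that gets false back simply skips caching the entry instead of
    resurrecting an object that is about to be freed.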
     
  • early_demux() handlers should be called in RCU context, and as we
    use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
    before dst use (skb_dst(skb)) or release (skb_dst_drop(skb)).

    Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
    ->early_demux() are confusing and not needed :

    Protocol handlers are already in an RCU read lock section.
    (__netif_receive_skb() does the rcu_read_lock() )

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Jul, 2012

2 commits


28 Jul, 2012

3 commits

  • Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
    added support for receive offloading to IOAT dma engine if available.

    The code in tcp_rcv_established() tries to perform early DMA copy if
    applicable. It however does so without checking whether the userspace
    task is actually expecting the data in the buffer.

    This is not a problem under normal circumstances, but there is a corner
    case where this doesn't work -- and that's when MSG_TRUNC flag to
    recvmsg() is used.

    If the IOAT dma engine is not used, the code properly checks whether
    there is a valid ucopy.task and the socket is owned by userspace, but
    misses the check in the dmaengine case.

    This problem can be observed trivially in practice -- for example 'tbench'
    is a good reproducer, as it makes heavy use of MSG_TRUNC. On systems
    utilizing IOAT, you will soon find tbench waiting indefinitely in
    sk_wait_data(), as the data has already been early-copied in
    tcp_rcv_established() using the dma engine.

    This patch introduces the same check we are performing in the simple
    iovec copy case to the IOAT case as well. It fixes the indefinite
    recvmsg(MSG_TRUNC) hangs.

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     
  • commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.)
    invalidated TCP early demux, because rx_dst_ifindex is not properly
    initialized and checked.

    Also remove the use of inet_iif(skb) in favor of skb->skb_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But
    patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75) didn't check for
    negative values. If a user assigns -1 to it, the option is set successfully
    and the socket waits for 4294967295 milliseconds. This patch adds a negative
    value check to avoid this issue.

    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
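
    A minimal sketch of such a check, assuming the option value arrives as a
    signed int and is stored in an unsigned field. struct demo_sock and
    demo_set_user_timeout() are hypothetical; the real setsockopt path is
    not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-socket timeout field. */
struct demo_sock {
    unsigned int user_timeout_ms;
};

/*
 * Reject negative values before they are converted to unsigned, where
 * -1 would silently become a very large timeout.
 */
int demo_set_user_timeout(struct demo_sock *sk, int val)
{
    assert(sk != NULL);
    if (val < 0)
        return -EINVAL;
    sk->user_timeout_ms = (unsigned int)val;
    return 0;
}
```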
     

27 Jul, 2012

2 commits

  • This is the IPv6 missing bits for infrastructure added in commit
    41063e9dd1195 (ipv4: Early TCP socket demux.)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

26 Jul, 2012

1 commit

  • commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
    introduced rt_cache_valid() helper. It unfortunately doesn't check if
    route is expired before caching it.

    I noticed sk_setup_caps() was constantly called on a tcp workload.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Jul, 2012

1 commit

  • 1) Remove an unneeded pskb_may_pull() in tcp_v4_early_demux()
    and fix a potential bug if skb->head was reallocated
    (iph & th pointers were not reloaded)

    TCP stack will pull/check headers anyway.

    2) must reload iph in ip_rcv_finish() after early_demux()
    call since skb->head might have changed.

    3) skb->dev->ifindex can be now replaced by skb->skb_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
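
    The pointer-reload rule in items 1) and 2) can be shown with a plain
    buffer that may move when it grows. This is a user-space analogy, not skb
    code: struct demo_skb, demo_grow() and demo_hdr() are hypothetical. Any
    header pointer derived from the old buffer has to be recomputed after a
    call that may reallocate it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer with a header at a fixed offset from 'head'. */
struct demo_skb {
    unsigned char *head;
    size_t len;
    size_t hdr_off;
};

/* Grow the buffer. The data may move, so old pointers become stale. */
int demo_grow(struct demo_skb *skb, size_t new_len)
{
    unsigned char *p;

    assert(skb != NULL);
    p = realloc(skb->head, new_len);
    if (p == NULL)
        return -1;
    if (new_len > skb->len)
        memset(p + skb->len, 0, new_len - skb->len);
    skb->head = p;
    skb->len = new_len;
    return 0;
}

/* Always derive the header pointer from the current 'head'. */
unsigned char *demo_hdr(const struct demo_skb *skb)
{
    assert(skb != NULL && skb->head != NULL);
    return skb->head + skb->hdr_off;
}
```

    Callers keep the offset, not the pointer, across demo_grow() and call
    demo_hdr() again afterwards.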
     

24 Jul, 2012

2 commits

  • On input packet processing, rt->rt_iif will be zero if we should
    use skb->dev->ifindex.

    Since we access rt->rt_iif consistently via inet_iif(), that is
    the only spot whose interpretation has to be adjusted.

    Signed-off-by: David S. Miller

    David S. Miller
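
    The encoding described above can be sketched as a lookup with a fallback.
    The structures and demo_inet_iif() are hypothetical simplifications, not
    the kernel's definitions: a zero rt_iif means "use the index of the
    device the packet arrived on".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified route and packet types. */
struct demo_route {
    int rt_iif;       /* 0 means: use the receiving device's index */
};

struct demo_skb {
    const struct demo_route *rt;
    int dev_ifindex;  /* index of the device the packet arrived on */
};

/* Return the input interface index under the zero-means-device rule. */
int demo_inet_iif(const struct demo_skb *skb)
{
    assert(skb != NULL && skb->rt != NULL);
    if (skb->rt->rt_iif != 0)
        return skb->rt->rt_iif;
    return skb->dev_ifindex;
}
```

    Because every reader goes through the one accessor, changing the
    encoding only requires changing that accessor.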
     
  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the ipv4
    specific code, and as well it should since doing the route caching for
    ipv6 is pointless at the moment since it is not inspected in the ipv6
    input paths yet.

    Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller