08 Nov, 2012

1 commit

  • Due to a NULL dereference, the following patch is causing oops
    in normal trafic condition:

    commit c0de08d04215031d68fa13af36f347a6cfa252ca
    Author: Eric Leblond
    Date:   Thu Aug 16 22:02:58 2012 +0000

        af_packet: don't emit packet on orig fanout group

    This buggy patch was a feature fix and has reached most stable
    branches.

    When skb->sk is NULL and when packet fanout is used, there is a
    crash in match_fanout_group where skb->sk is accessed.
    This patch fixes the issue by returning false as soon as the
    socket is NULL: this correspond to the wanted behavior because
    the kernel as to resend the skb to all the listening socket in
    this case.

    Signed-off-by: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Leblond
     

04 Nov, 2012

1 commit

  • Change the dflt fdb dump handler to use RTM_NEWNEIGH to
    be compatible with bridge dump routines.

    The dump reply from the network driver handlers should
    match the reply from bridge handler. The fact they were
    not in the ixgbe case was effectively a bug. This patch
    resolves it.

    Applications that were not checking the nlmsg type will
    continue to work. And now applications that do check
    the type will work as expected.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

23 Oct, 2012

1 commit

  • Mike Kazantsev found 3.5 kernels and beyond were leaking memory,
    and tracked the faulty commit to a1c7fff7e18f59e ("net:
    netdev_alloc_skb() use build_skb()")

    While this commit seems fine, it uncovered a bug introduced
    in commit bad43ca8325 ("net: introduce skb_try_coalesce()), in function
    kfree_skb_partial()"):

    If head is stolen, we free the sk_buff,
    without removing references on secpath (skb->sp).

    So IPsec + IP defrag/reassembly (using skb coalescing), or
    TCP coalescing could leak secpath objects.

    Fix this bug by calling skb_release_head_state(skb) to properly
    release all possible references to linked objects.

    Reported-by: Mike Kazantsev
    Signed-off-by: Eric Dumazet
    Bisected-by: Mike Kazantsev
    Tested-by: Mike Kazantsev
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2012

2 commits


11 Oct, 2012

5 commits


09 Oct, 2012

2 commits

  • 6a32e4f9dd9219261f8856f817e6655114cfec2f made the vlan code skip marking
    vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
    there was an rx_handler, as the rx_handler could cause the frame to be received
    on a different (virtual) vlan-capable interface where that vlan might be
    configured.

    As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
    frames for unknown vlans to be delivered to the protocol stack as if they had
    been received untagged.

    For example, if an ipv6 router advertisement that's tagged for a locally not
    configured vlan is received on an interface with macvlan interfaces attached,
    macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
    macvlan interfaces, which caused it to be passed to the protocol stack, leading
    to ipv6 addresses for the announced prefix being configured even though those
    are completely unusable on the underlying interface.

    The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
    rx_handler, if there is one, sees the frame unchanged, but afterwards,
    before the frame is delivered to the protocol stack, it gets marked whether
    there is an rx_handler or not.

    Signed-off-by: Florian Zumbiehl
    Signed-off-by: David S. Miller

    Florian Zumbiehl
     
  • Current GRO can hold packets in gro_list for almost unlimited
    time, in case napi->poll() handler consumes its budget over and over.

    In this case, napi_complete()/napi_gro_flush() are not called.

    Another problem is that gro_list is flushed in non friendly way :
    We scan the list and complete packets in the reverse order.
    (youngest packets first, oldest packets last)
    This defeats priorities that sender could have cooked.

    Since GRO currently only store TCP packets, we dont really notice the
    bug because of retransmits, but this behavior can add unexpected
    latencies, particularly on mice flows clamped by elephant flows.

    This patch makes sure no packet can stay more than 1 ms in queue, and
    only in stress situations.

    It also complete packets in the right order to minimize latencies.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Cc: Jesse Gross
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2012

2 commits

  • Before accessing skb first fragment, better make sure there
    is one.

    This is probably not needed for old kernels, since an ethernet frame
    cannot contain only an ethernet header, but the recent GRO addition
    to tunnels makes this patch needed.

    Also skb_gro_reset_offset() can be static, it actually allows
    compiler to inline it.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The retry loop in neigh_resolve_output() and neigh_connected_output()
    call dev_hard_header() with out reseting the skb to network_header.
    This causes the retry to fail with skb_under_panic. The fix is to
    reset the network_header within the retry loop.

    Signed-off-by: Ramesh Nagappa
    Reviewed-by: Shawn Lu
    Reviewed-by: Robert Coulson
    Reviewed-by: Billie Alsup
    Signed-off-by: David S. Miller

    ramesh.nagappa@gmail.com
     

07 Oct, 2012

1 commit

  • Over time, skb recycling infrastructure got litle interest and
    many bugs. Generic rx path skb allocation is now using page
    fragments for efficient GRO / TCP coalescing, and recyling
    a tx skb for rx path is not worth the pain.

    Last identified bug is that fat skbs can be recycled
    and it can endup using high order pages after few iterations.

    With help from Maxime Bizon, who pointed out that commit
    87151b8689d (net: allow pskb_expand_head() to get maximum tailroom)
    introduced this regression for recycled skbs.

    Instead of fixing this bug, lets remove skb recycling.

    Drivers wanting really hot skbs should use build_skb() anyway,
    to allocate/populate sk_buff right before netif_receive_skb()

    Signed-off-by: Eric Dumazet
    Cc: Maxime Bizon
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2012

6 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increate the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values are far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privlige checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user names to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/git logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarhcy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netpioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]_work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

02 Oct, 2012

4 commits

  • Later changes need to be able to refer to neighbour attributes
    when doing fdb_add.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     
  • This adds a new include file (include/net/gro_cells.h), to bring GRO
    (Generic Receive Offload) capability to tunnels, in a modular way.

    Because tunnels receive path is lockless, and GRO adds a serialization
    using a napi_struct, I chose to add an array of up to
    DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi queue devices wont be
    slowed down because of GRO layer.

    skb_get_rx_queue() is used as selector.

    In the future, we might add optional fanout capabilities, using rxhash
    for example.

    With help from Ben Hutchings who reminded me
    netif_get_num_default_rss_queues() function.

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit ec47ea824774(skb: Add inline helper for getting the skb end offset from
    head) introduces this helper function, skb_end_offset(),
    we should make use of it.

    Signed-off-by: Weiping Pan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Weiping Pan
     
  • Pull driver core merge from Greg Kroah-Hartman:
    "Here is the big driver core update for 3.7-rc1.

    A number of firmware_class.c updates (as you saw a month or so ago),
    and some hyper-v updates and some printk fixes as well. All patches
    that are outside of the drivers/base area have been acked by the
    respective maintainers, and have all been in the linux-next tree for a
    while.

    Signed-off-by: Greg Kroah-Hartman "

    * tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
    memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
    device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
    memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
    Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
    Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
    Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
    device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
    dev: Add dev_vprintk_emit and dev_printk_emit
    netdev_printk/netif_printk: Remove a superfluous logging colon
    netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
    dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
    driver-core: Shut up dev_dbg_reatelimited() without DEBUG
    tools/hv: Parse /etc/os-release
    tools/hv: Check for read/write errors
    tools/hv: Fix exit() error code
    tools/hv: Fix file handle leak
    Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
    Tools: hv: Rename the function kvp_get_ip_address()
    Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
    Tools: hv: Add an example script to configure an interface
    ...

    Linus Torvalds
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Sep, 2012

2 commits

  • We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

    - Better filling of space (the ending hole overhead is less an issue)

    - Less calls to page allocator or accesses to page->_count

    - Could allow struct skb_shared_info futures changes without major
    performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Benjamin LaHaise
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • It seems sk_init() has no value today and even does strange things :

    # grep . /proc/sys/net/core/?mem_*
    /proc/sys/net/core/rmem_default:212992
    /proc/sys/net/core/rmem_max:131071
    /proc/sys/net/core/wmem_default:212992
    /proc/sys/net/core/wmem_max:131071

    We can remove it completely.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Shan Wei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2012

2 commits

  • simplifies a bunch of callers...

    Signed-off-by: Al Viro

    Al Viro
     
  • iterates through the opened files in given descriptor table,
    calling a supplied function; we stop once non-zero is returned.
    Callback gets struct file *, descriptor number and const void *
    argument passed to iterator. It is called with files->file_lock
    held, so it is not allowed to block.

    tty_io, netprio_cgroup and selinux flush_unauthorized_files()
    converted to its use.

    Signed-off-by: Al Viro

    Al Viro
     

25 Sep, 2012

3 commits

  • Its possible to use RAW sockets to get a crash in
    tcp_set_keepalive() / sk_reset_timer()

    Fix is to make sure socket is a SOCK_STREAM one.

    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • SKF_AD_ALU_XOR_X has been added a while ago, but as an 'ancillary'
    operation that is invoked through a negative offset in K within BPF
    load operations. Since BPF_MOD has recently been added, BPF_XOR should
    also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
    might not be an option since this is exposed to user space.

    Signed-off-by: Daniel Borkmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    Its done to increase probability of coalescing small write() into
    single segments in skbs still in write queue (not yet sent)

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page

    Its also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, thats order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits on other network devices, since 8x less frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    Its possible some SG enabled hardware cant cope with bigger fragments,
    but their ndo_start_xmit() should already handle this, splitting a
    fragment in sub fragments, since some arches have PAGE_SIZE=65536

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2012

1 commit

  • A change in a series of VLAN-related changes appears to have
    inadvertently disabled the use of the scatter gather feature of
    network cards for transmission of non-IP ethernet protocols like ATA
    over Ethernet (AoE). Below is a reference to the commit that
    introduces a "harmonize_features" function that turns off scatter
    gather when the NIC does not support hardware checksumming for the
    ethernet protocol of an sk buff.

    commit f01a5236bd4b140198fbcc550f085e8361fd73fa
    Author: Jesse Gross
    Date: Sun Jan 9 06:23:31 2011 +0000

    net offloading: Generalize netif_get_vlan_features().

    The can_checksum_protocol function is not equipped to consider a
    protocol that does not require checksumming. Calling it for a
    protocol that requires no checksum is inappropriate.

    The patch below has harmonize_features call can_checksum_protocol when
    the protocol needs a checksum, so that the network layer is not forced
    to perform unnecessary skb linearization on the transmission of AoE
    packets. Unnecessary linearization results in decreased performance
    and increased memory pressure, as reported here:

    http://www.spinics.net/lists/linux-mm/msg15184.html

    The problem has probably not been widely experienced yet, because
    only recently has the kernel.org-distributed aoe driver acquired the
    ability to use payloads of over a page in size, with the patchset
    recently included in the mm tree:

    https://lkml.org/lkml/2012/8/28/140

    The coraid.com-distributed aoe driver already could use payloads of
    greater than a page in size, but its users generally do not use the
    newest kernels.

    Signed-off-by: Ed Cashin
    Signed-off-by: David S. Miller

    Ed Cashin
     

20 Sep, 2012

6 commits

  • It should be the skb which is not cloned

    Signed-off-by: Li RongQing
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Li RongQing
     
  • In netpoll tx path, we miss the chance of calling ->ndo_select_queue(),
    thus could cause problems when bonding is involved.

    This patch makes dev_pick_tx() extern (and rename it to netdev_pick_tx())
    to let netpoll call it in netpoll_send_skb_on_dev().

    Reported-by: Sylvain Munaut
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Tested-by: Sylvain Munaut
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • The internal functions for add/deleting addresses don't change
    their argument.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     
  • Instead of forcing device drivers to provide empty ethtool_ops or tweak
    net/core/ethtool.c again, we could provide a generic ethtool_ops.

    This occurred to me when I wanted to add GSO support to GRE tunnels.
    ethtool -k support should be generic for all drivers.

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Maciej Żenczykowski
    Reviewed-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When moving a nic from net namespace A to net namespace B,
    in dev_change_net_namesapce,we call __dev_get_by_name to
    decide if the netns B has the device has the same name.

    if the netns B already has the same named device,we call
    dev_get_valid_name to try to get a valid name for this nic in
    the netns B,but net_device->nd_net still point to netns A now.

    this patch fix it.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     
  • dev_queue_xmit_nit() should be called right before ndo_start_xmit()
    calls or we might give wrong packet contents to taps users :

    Packet checksum can be changed, or packet can be linearized or
    segmented, and segments partially sent for the later case.

    Also a memory allocation can fail and packet never really hit the
    driver entry point.

    Reported-by: Jamie Gloudon
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet