13 Apr, 2012

1 commit

  • Pull networking fixes from David Miller:

    1) Fix bluetooth userland regression reported by Keith Packard, from
    Gustavo Padovan.

    2) Revert ath9k PS idle change, from Sujith Manoharan.

    3) Correct default TCP memory limits (again), from Eric Dumazet.

    4) Fix tcp_rcv_rtt_update() accidental use of unscaled RTT, from Neal
    Cardwell.

    5) We made a facility for layers like wireless to say how much tailroom
    they need in the SKB for link layer stuff such as wireless
    encryption etc., but TCP works hard to fill every SKB out to the end
    defeating this specification.

    This leads to every TCP packet getting reallocated by the wireless
    code in order to have the right amount of tailroom available.

    Fix TCP to only fill SKBs out to the real amount of data area it
    asked for during the allocation, this way it won't eat into the
    slack added for the device's tailroom needs.

    Reported by Marc Merlin and fixed by Eric Dumazet.

    6) Leaks, endian bugs, and new device IDs in bluetooth from Santosh
    Nayak, João Paulo Rechi Vita, Cho, Yu-Chen, Andrei Emeltchenko,
    AceLan Kao, and Andrei Emeltchenko.

    7) OOPS on tty_close fix in bluetooth's hci_ldisc from Johan Hovold.

    8) netfilter erroneously scales TCP window twice, fix from Changli Gao.

    9) Memleak fix in wext-core from Julia Lawall.

    10) Consistently handle invalid TCP packets in ipv4 vs. ipv6 conntrack,
    from Jozsef Kadlecsik.

    11) Validate IP header length properly in netfilter conntrack's
    ipv4_get_l4proto().

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
    NFC: Fix the LLCP Tx fragmentation loop
    rtlwifi: Add missing DMA buffer unmapping for PCI drivers
    rtlwifi: Preallocate USB read buffers and eliminate kalloc in read routine
    tcp: avoid order-1 allocations on wifi and tx path
    net: allow pskb_expand_head() to get maximum tailroom
    bridge: Do not send queries on multicast group leaves
    MAINTAINERS: Mark NATSEMI driver as orphan'd.
    tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample
    tcp: restore correct limit
    Revert "ath9k: fix going to full-sleep on PS idle"
    rt2x00: Fix rfkill_polling register function.
    bcma: fix build error on MIPS; implicit pcibios_enable_device
    netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net
    netfilter: nf_ct_ipv4: packets with wrong ihl are invalid
    netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently
    net/wireless/wext-core.c: add missing kfree
    rtlwifi: Fix oops on rate-control failure
    mac80211: Convert WARN_ON to WARN_ON_ONCE
    rtlwifi: rtl8192de: Fix firmware initialization
    nl80211: ensure interface is up in various APIs
    ...

    Linus Torvalds
     

11 Apr, 2012

4 commits

  • Marc Merlin reported many order-1 allocation failures in the TX path on
    his wireless setup, which don't make any sense with an MTU=1500 network
    and non-SG-capable hardware.

    After investigation, it turns out TCP uses sk_stream_alloc_skb() and,
    by convention, used skb_tailroom(skb) to know how many bytes of data
    payload could be put in this skb (for non-SG-capable devices).

    Note: these skbs used kmalloc-4096 (MTU=1500 + MAX_HEADER +
    sizeof(struct skb_shared_info) being above 2048).

    Later, the mac80211 layer needs to add some bytes at the tail of the skb
    (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is
    available, has to call pskb_expand_head() and request order-1
    allocations.

    This patch changes sk_stream_alloc_skb() so that only
    sk->sk_prot->max_header bytes of headroom are reserved, and uses a new
    skb field, avail_size, to hold the data payload limit.

    This way, order-0 allocations done by the TCP stack can leave more than
    2 KB of tailroom, and no further allocation is performed in the mac80211
    layer (or any layer needing some tailroom).

    avail_size is unioned with mark/dropcount, since mark will be set later
    in IP stack for output packets. Therefore, skb size is unchanged.

    Reported-by: Marc MERLIN
    Tested-by: Marc MERLIN
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
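
    The tailroom arithmetic in the commit above can be sketched in user-space
    C. This is only a model under stated assumptions: the headroom (256) and
    shared-info footprint (320) used in the usage note, the power-of-two size
    classes, and every name prefixed with model_ are hypothetical and are not
    kernel code. Only the 18-byte encryption tailroom comes from the commit
    text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Tailroom mac80211 wants for encryption, as quoted in the commit text. */
#define MODEL_ENCRYPT_TAILROOM 18

/* Round a request up to a power-of-two size class (a simplified stand-in
 * for the kmalloc-N caches; hypothetical helper). */
static size_t slab_size_class(size_t request)
{
    size_t cls = 32;
    while (cls < request)
        cls <<= 1;
    return cls;
}

/* Hypothetical, heavily reduced model of an skb data area. */
struct model_skb {
    size_t data_area;   /* usable bytes: slab size minus shared-info footprint */
    size_t headroom;    /* bytes reserved in front of the payload */
    size_t len;         /* payload bytes queued so far */
    size_t avail_size;  /* payload limit recorded at allocation time */
};

static size_t model_tailroom(const struct model_skb *skb)
{
    return skb->data_area - skb->headroom - skb->len;
}

/* Old convention: keep adding payload until tailroom reaches zero. */
static void fill_old(struct model_skb *skb)
{
    skb->len += model_tailroom(skb);
}

/* New convention: stop at the limit stored in avail_size. */
static void fill_new(struct model_skb *skb)
{
    if (skb->len < skb->avail_size)
        skb->len = skb->avail_size;
}

/* A later layer must reallocate if it cannot append its trailer. */
static bool needs_expand(const struct model_skb *skb)
{
    return model_tailroom(skb) < MODEL_ENCRYPT_TAILROOM;
}
```

    With the assumed sizes, a request of 1500 + 256 + 320 bytes lands in the
    4096-byte class. Filling to zero tailroom forces a later expansion, while
    stopping at an avail_size of 1448 bytes leaves the slack untouched.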
     
  • Pull dmaengine fixes from Dan Williams:

    1/ regression fix for Xen as it now trips over a broken assumption
    about the dma address size on 32-bit builds

    2/ new quirk for netdma to ignore dma channels that cannot meet
    netdma alignment requirements

    3/ fixes for two long standing issues in ioatdma (ring size overflow)
    and iop-adma (potential stack corruption)

    * tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
    netdma: adding alignment check for NETDMA ops
    ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata
    ioat: ring size variables need to be 32bit to avoid overflow
    iop-adma: Corrected array overflow in RAID6 Xscale(R) test.
    ioat: fix size of 'completion' for Xen

    Linus Torvalds
     
  • Fix a code path in tcp_rcv_rtt_update() that was comparing scaled and
    unscaled RTT samples.

    The intent in the code was to only use the 'm' measurement if it was a
    new minimum. However, since 'm' had not yet been shifted left 3 bits
    but 'new_sample' had, this comparison would nearly always succeed,
    leading us to erroneously set our receive-side RTT estimate to the 'm'
    sample when that sample could be nearly 8x too high to use.

    The overall effect is to often cause the receive-side RTT estimate to
    be significantly too large (up to 40% too large for brief periods in
    my tests).

    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
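
    The scaled versus unscaled comparison described above can be illustrated
    with a small user-space sketch. The function names are hypothetical and
    the logic is reduced to the "new minimum" branch only; it assumes, as the
    commit text states, that the stored estimate is kept shifted left by 3
    bits.

```c
#include <stdint.h>

/* Buggy variant: compares the raw sample m against an estimate that is
 * already scaled by 8, so the test passes for almost any sample. */
static uint32_t update_min_buggy(uint32_t est_scaled, uint32_t m)
{
    if (m < est_scaled)
        est_scaled = m << 3;
    return est_scaled;
}

/* Fixed variant: scale the sample first, then compare like with like. */
static uint32_t update_min_fixed(uint32_t est_scaled, uint32_t m)
{
    uint32_t m_scaled = m << 3;

    if (m_scaled < est_scaled)
        est_scaled = m_scaled;
    return est_scaled;
}
```

    For an estimate of 10 units (stored as 80) and a sample of 40, the buggy
    variant replaces the estimate with the larger sample, while the fixed
    variant keeps the existing minimum.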
     
  • Commit c43b874d5d714f (tcp: properly initialize tcp memory limits) tried
    to fix a regression added in commits 4acb4190 & 3dc43e3,
    but still got it wrong.

    The result is that machines with a low amount of memory have a too-small
    tcp_rmem[2] value and slow TCP receives: the per-socket limit is 1/1024
    of memory instead of 1/128 as in old kernels, so the receive window is
    capped to small values.

    Fix this to match the comment and previous behavior.

    Signed-off-by: Eric Dumazet
    Cc: Jason Wang
    Cc: Glauber Costa
    Signed-off-by: David S. Miller

    Eric Dumazet
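
    To make the 1/128 versus 1/1024 difference concrete, here is a minimal
    arithmetic sketch. The function name and the floor and cap parameters are
    assumptions for illustration; the kernel's actual initialization is not
    reproduced here.

```c
#include <stdint.h>

/* Per-socket receive limit as a power-of-two fraction of memory, clamped
 * between a floor and a cap (both illustrative parameters).
 * shift = 7 gives 1/128 of memory, shift = 10 gives 1/1024. */
static uint64_t rmem_max(uint64_t mem_bytes, unsigned shift,
                         uint64_t floor_bytes, uint64_t cap_bytes)
{
    uint64_t limit = mem_bytes >> shift;

    if (limit > cap_bytes)
        limit = cap_bytes;
    if (limit < floor_bytes)
        limit = floor_bytes;
    return limit;
}
```

    On a 256 MB machine, a 1/128 share is 2 MB while a 1/1024 share is only
    256 KB, which is why low-memory machines ended up with a small receive
    window.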
     

10 Apr, 2012

2 commits

  • It was reported that the Linux kernel sometimes logs:

    klogd: [2629147.402413] kernel BUG at net/netfilter/nf_conntrack_proto_tcp.c:447!
    klogd: [1072212.887368] kernel BUG at net/netfilter/nf_conntrack_proto_tcp.c:392

    ipv4_get_l4proto() in nf_conntrack_l3proto_ipv4.c and tcp_error() in
    nf_conntrack_proto_tcp.c should catch malformed packets, so the errors
    at the indicated lines - TCP options parsing - should not happen.
    However, tcp_error() relies on the "dataoff" offset to the TCP header,
    calculated by ipv4_get_l4proto(). But ipv4_get_l4proto() does not check
    for bogus ihl values in IPv4 packets, which can then slip through
    tcp_error() and get caught in the TCP options parsing routines.

    The patch fixes ipv4_get_l4proto() by invalidating packets with a bogus
    ihl value.

    The patch closes netfilter bugzilla id 771.

    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Jozsef Kadlecsik
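
    The kind of check the patch describes can be sketched as a stand-alone
    function. This is not the kernel's ipv4_get_l4proto(); the function name,
    return convention, and raw-buffer interface are assumptions chosen to keep
    the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_IPV4_MIN_HDR 20   /* size of an IPv4 header without options */

/* Return the offset of the transport header inside pkt, or -1 when the
 * ihl field is bogus: shorter than a minimal header, or pointing past
 * the end of the packet. */
static int model_ipv4_l4_offset(const uint8_t *pkt, size_t len)
{
    size_t hdr_len;

    if (len < MODEL_IPV4_MIN_HDR)
        return -1;

    /* ihl is the low nibble of the first byte, counted in 32-bit words. */
    hdr_len = (size_t)(pkt[0] & 0x0f) * 4;
    if (hdr_len < MODEL_IPV4_MIN_HDR || hdr_len > len)
        return -1;

    return (int)hdr_len;
}
```

    A caller that only parses TCP options after a non-negative return never
    starts from an offset derived from a malformed ihl.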
     
  • IPv6 conntrack marked invalid packets as INVALID and let the user
    drop those by an explicit rule, while IPv4 conntrack dropped such
    packets itself.

    IPv4 conntrack is changed so that it marks such packets as INVALID and
    lets the user drop them.

    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Jozsef Kadlecsik
     

06 Apr, 2012

2 commits

  • commit 2f533844242 (tcp: allow splice() to build full TSO packets) added
    a regression for splice() calls using SPLICE_F_MORE.

    We need to call tcp_push() at the end of the last page processed in
    tcp_sendpages(), or else transmits can be deferred and future sends
    stall.

    Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
    with different semantics.

    For all sendpage() providers, it's a transparent change. Only
    sock_sendpage() and tcp_sendpages() can differentiate the two different
    flags provided by pipe_to_sendpage().

    Reported-by: Tom Herbert
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: H.K. Jerry Chu
    Cc: Maciej Żenczykowski
    Cc: Mahesh Bandewar
    Cc: Ilpo Järvinen
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
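
    The distinction between the two flags can be modelled as follows. The
    flag values and function names are hypothetical (the MODEL_ prefix marks
    them as such), and the sketch assumes that the splice side sets the
    "not last" flag whenever more pages of the same request remain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; these are not the kernel's values. */
#define MODEL_MSG_MORE             0x1u
#define MODEL_MSG_SENDPAGE_NOTLAST 0x2u

/* Flags for one page of a splice request: the user's "more" hint is kept
 * separate from the internal "more pages of this request follow" marker. */
static unsigned model_page_flags(bool user_more, size_t len, size_t total_len)
{
    unsigned flags = user_more ? MODEL_MSG_MORE : 0;

    if (len < total_len)
        flags |= MODEL_MSG_SENDPAGE_NOTLAST;
    return flags;
}

/* Push pending data once the last page of the request has been queued,
 * even if the user asked for MSG_MORE-like behaviour. */
static bool model_should_push(unsigned flags)
{
    return !(flags & MODEL_MSG_SENDPAGE_NOTLAST);
}
```

    In this model a request made with the user's "more" hint still triggers
    a push after its final page, which is the behaviour the regression fix
    restores.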
     
  • This is the fallout from adding a memcpy alignment workaround for certain
    IOATDMA hardware. NetDMA will only use DMA engines that can handle
    byte-aligned ops.

    Acked-by: David S. Miller
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
     

04 Apr, 2012

1 commit

  • vmsplice()/splice(pipe, socket) calls do_tcp_sendpages() one page at a
    time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).

    The call to tcp_push() at the end of do_tcp_sendpages() forces an
    immediate xmit when the pipe is not already filled, and tso_fragment()
    tries to split these skbs into MSS multiples.

    4096 bytes are usually split into an skb with 2 MSS and a remaining
    sub-MSS skb (assuming MTU=1500).

    This makes slow start suboptimal because many small frames are sent to
    qdisc/driver layers instead of big ones (constrained by cwnd and packets
    in flight of course)

    In fact, applications using sendmsg() (adding an additional memory copy)
    instead of vmsplice()/splice()/sendfile() are a bit faster because of
    this anomaly, especially if serving small files in environments with
    large initial [c]wnd.

    Call tcp_push() only if MSG_MORE is not set in the flags parameter.

    This bit is automatically provided by splice() internals for every page
    except the last, or on all pages if the user specified the SPLICE_F_MORE
    splice() flag.

    In some workloads, this can reduce number of sent logical packets by an
    order of magnitude, making zero-copy TCP actually faster than
    one-copy :)

    Reported-by: Tom Herbert
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: H.K. Jerry Chu
    Cc: Maciej Żenczykowski
    Cc: Mahesh Bandewar
    Cc: Ilpo Järvinen
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
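
    The "2 MSS plus a sub-MSS remainder" split mentioned above is plain
    integer arithmetic, sketched here. The struct and function names are
    hypothetical, and the MSS of 1448 bytes used in the usage note is an
    assumption (MTU=1500 with TCP timestamps enabled).

```c
/* Result of cutting a byte count into full-MSS segments. */
struct model_split {
    unsigned full_segments;  /* number of segments carrying exactly one MSS */
    unsigned remainder;      /* bytes left over in a final sub-MSS segment */
};

/* Split `bytes` of queued payload into MSS-sized pieces. */
static struct model_split model_split_by_mss(unsigned bytes, unsigned mss)
{
    struct model_split s;

    s.full_segments = bytes / mss;
    s.remainder = bytes % mss;
    return s;
}
```

    Pushing after every 4096-byte page yields two full segments and a
    1200-byte remainder each time. Deferring the push across sixteen pages
    (65536 bytes) yields 45 full segments and a single 376-byte remainder.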
     

03 Apr, 2012

1 commit

  • Pull networking fixes from David Miller:

    1) Provide device string properly for USB i2400m wimax devices, also
    don't OOPS when providing firmware string. From Phil Sutter.

    2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.

    3) Add another device ID to USB zaurus driver, from Guan Xin.

    4) Loop index start in pool vector iterator is wrong causing MAC to not
    get configured in bnx2x driver, fix from Dmitry Kravkov.

    5) EQL driver assumes HZ=100, fix from Eric Dumazet.

    6) Now that skb_add_rx_frag() can specify the truesize increment
    separately, do so in f_phonet and cdc_phonet, also from Eric
    Dumazet.

    7) virtio_net accidentally uses net_ratelimit() not only on the kernel
    warning but also the statistic bump, fix from Rick Jones.

    8) ip_route_input_mc() uses fixed init_net namespace, oops, use
    dev_net(dev) instead. Fix from Benjamin LaHaise.

    9) dev_forward_skb() needs to clear the incoming interface index of the
    SKB so that it looks like a new incoming packet, also from Benjamin
    LaHaise.

    10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
    5GHZ, fix from Stanislav Yakovlev.

    11) Missing kmalloc() return value checks in orinoco, from Santosh
    Nayak.

    12) ath9k doesn't check for HT capabilities in the right way, it is
    checking ht_supported instead of the ATH9K_HW_CAP_HT flag. Fix from
    Sujith Manoharan.

    13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
    instructions, from Feiran Zhuang.

    14) Avoid infinite loop in GARP code when registering sysfs entries.
    From David Ward.

    15) rose protocol uses memcpy instead of memcmp in a device address
    comparison, oops. Fix from Daniel Borkmann.

    16) Fix build of lpc_eth due to dev_hw_addr_random() interface being
    renamed to eth_hw_addr_random(). From Roland Stigge.

    17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
    that ipv4 does. Fix from Shmulik Ladkani.

    18) via-rhine has an inverted bit test, causing suspend/resume
    regressions. Fix from Andreas Mohr.

    19) RIONET assumes 4K page size, fix from Akinobu Mita.

    20) Initialization of imask register in sky2 is buggy, because bits are
    "or'd" into an uninitialized local variable. Fix from Lino
    Sanfilippo.

    21) Fix FCOE checksum offload handling, from Yi Zou.

    22) Fix VLAN processing regression in e1000, from Jiri Pirko.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
    sky2: dont overwrite settings for PHY Quick link
    tg3: Fix 5717 serdes powerdown problem
    net: usb: cdc_eem: fix mtu
    net: sh_eth: fix endian check for architecture independent
    usb/rtl8150 : Remove duplicated definitions
    rionet: fix page allocation order of rionet_active
    via-rhine: fix wait-bit inversion.
    ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
    net: lpc_eth: Fix rename of dev_hw_addr_random
    net/netfilter/nfnetlink_acct.c: use linux/atomic.h
    rose_dev: fix memcpy-bug in rose_set_mac_address
    Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
    net/garp: avoid infinite loop if attribute already exists
    x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
    bonding: emit event when bonding changes MAC
    mac80211: fix oper channel timestamp updation
    ath9k: Use HW HT capabilites properly
    MAINTAINERS: adding maintainer for ipw2x00
    net: orinoco: add error handling for failed kmalloc().
    net/wireless: ipw2x00: fix a typo in wiphy struct initilization
    ...

    Linus Torvalds
     

29 Mar, 2012

2 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuitous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This is already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it..

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Remove all #inclusions of asm/system.h preparatory to splitting and killing
    it. Performed with the following command:

    perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

    Signed-off-by: David Howells

    David Howells
     

28 Mar, 2012

1 commit

  • When using multicast over a local bridge feeding a number of LXC guests
    using veth, the LXC guests are unable to get a response from other guests
    when pinging 224.0.0.1. Multicast packets did not appear to be getting
    delivered to the network namespaces of the guest hosts, and further
    inspection showed that the incoming route was pointing to the loopback
    device of the host, not the guest. This led to the wrong network namespace
    being picked up by sockets (like ICMP). Fix this by using the correct
    network namespace when creating the inbound route entry.

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    Benjamin LaHaise
     

23 Mar, 2012

2 commits

  • The following patch aimed to resolve an issue where secondary, tertiary,
    etc. addresses added to bond interfaces could overwrite the
    bond->master_ip and vlan_ip values.

    commit 917fbdb32f37e9a93b00bb12ee83532982982df3
    Author: Henrik Saavedra Persson
    Date: Wed Nov 23 23:37:15 2011 +0000

    bonding: only use primary address for ARP

    That patch was good because it prevented bonds using ARP monitoring from
    sending frames with an invalid source IP address. Unfortunately, it
    didn't always work as expected.

    When using an ioctl (like ifconfig does) to set the IP address and
    netmask, 2 separate ioctls are actually called to set the IP and netmask
    if the mask chosen doesn't match the standard mask for that class of
    address. The first ioctl did not have a mask that matched the one in
    the primary address and would still cause the device address to be
    overwritten. The second ioctl that was called to set the mask would
    then be detected as secondary and ignored, but the damage was already done.

    This was not an issue when using an application that used netlink
    sockets as the setting of IP and netmask came down at once. The
    inconsistent behavior between those two interfaces was something that
    needed to be resolved.

    While I was thinking about how I wanted to resolve this, Ralf Zeidler
    came with a patch that resolved this on a RHEL kernel by keeping a full
    shadow of the entries in dev->ifa_list for the bonding device and vlan
    devices in the bonding driver. I didn't like the duplication of the
    list as I want to see the 'bonding' struct and code shrink rather than
    grow, but liked the general idea.

    As the Subject indicates this patch drops the master_ip and vlan_ip
    elements from the 'bonding' and 'vlan_entry' structs, respectively.
    This can be done because a device's address-list is now traversed to
    determine the optimal source IP address for ARP requests and for checks
    to see if the bonding device has a particular IP address. This code
    could have all been contained inside the bonding driver, but it made more
    sense to me to EXPORT and call inet_confirm_addr since it did exactly
    what was needed.

    I tested this and a backported patch and everything works as expected.
    Ralf also helped with verification of the backported patch.

    Thanks to Ralf for all his help on this.

    v2: Whitespace and organizational changes based on suggestions from Jay
    Vosburgh and Dave Miller.

    v3: Fixup incorrect usage of rcu_read_unlock based on Dave Miller's
    suggestion.

    Signed-off-by: Andy Gospodarek
    CC: Ralf Zeidler
    Signed-off-by: David S. Miller

    Andy Gospodarek
     
  • It used to be an int, and it got changed to a bool parameter at least
    7 years ago. It happens that NF_DROP and NF_ACCEPT are 0 and 1, so
    this works, but it's unclear, and the check that it's in range is not
    required.

    Reported-by: Dan Carpenter
    Signed-off-by: Rusty Russell
    Signed-off-by: David S. Miller

    Rusty Russell
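
    A small sketch of why relying on the numeric coincidence is unclear. The
    MODEL_ constants mirror the verdict values (drop is 0, accept is 1); the
    encoding of a verdict as -verdict - 1 and both function names are
    assumptions made for this illustration, not a quotation of the patched
    code.

```c
#include <stdbool.h>

#define MODEL_NF_DROP   0
#define MODEL_NF_ACCEPT 1

/* Implicit form: works only because the bool happens to share the
 * numeric values of the two verdicts. */
static int model_verdict_implicit(bool forward)
{
    return -(int)forward - 1;
}

/* Explicit form: names the verdict, so no range check is needed and the
 * intent is visible at the call site. */
static int model_verdict_explicit(bool forward)
{
    return forward ? -MODEL_NF_ACCEPT - 1 : -MODEL_NF_DROP - 1;
}
```

    Both forms produce the same values for true and false; the explicit one
    simply stops depending on that coincidence.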
     

21 Mar, 2012

4 commits

  • Pull networking merge from David Miller:
    "1) Move ixgbe driver over to purely page based buffering on receive.
    From Alexander Duyck.

    2) Add receive packet steering support to e1000e, from Bruce Allan.

    3) Convert TCP MD5 support over to RCU, from Eric Dumazet.

    4) Reduce cpu usage in handling out-of-order TCP packets on modern
    systems, also from Eric Dumazet.

    5) Support the IP{,V6}_UNICAST_IF socket options, making the wine
    folks happy, from Erich Hoover.

    6) Support VLAN trunking from guests in hyperv driver, from Haiyang
    Zhang.

    7) Support byte-queue-limits in r8169, from Igor Maravic.

    8) Outline code intended for IP_RECVTOS in IP_PKTOPTIONS existed but
    was never properly implemented, Jiri Benc fixed that.

    9) 64-bit statistics support in r8169 and 8139too, from Junchang Wang.

    10) Support kernel side dump filtering by ctmark in netfilter
    ctnetlink, from Pablo Neira Ayuso.

    11) Support byte-queue-limits in gianfar driver, from Paul Gortmaker.

    12) Add new peek socket options to assist with socket migration, from
    Pavel Emelyanov.

    13) Add sch_plug packet scheduler whose queue is controlled by
    userland daemons using explicit freeze and release commands. From
    Shriram Rajagopalan.

    14) Fix FCOE checksum offload handling on transmit, from Yi Zou."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1846 commits)
    Fix pppol2tp getsockname()
    Remove printk from rds_sendmsg
    ipv6: fix incorrent ipv6 ipsec packet fragment
    cpsw: Hook up default ndo_change_mtu.
    net: qmi_wwan: fix build error due to cdc-wdm dependecy
    netdev: driver: ethernet: Add TI CPSW driver
    netdev: driver: ethernet: add cpsw address lookup engine support
    phy: add am79c874 PHY support
    mlx4_core: fix race on comm channel
    bonding: send igmp report for its master
    fs_enet: Add MPC5125 FEC support and PHY interface selection
    net: bpf_jit: fix BPF_S_LDX_B_MSH compilation
    net: update the usage of CHECKSUM_UNNECESSARY
    fcoe: use CHECKSUM_UNNECESSARY instead of CHECKSUM_PARTIAL on tx
    net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso
    ixgbe: Fix issues with SR-IOV loopback when flow control is disabled
    net/hyperv: Fix the code handling tx busy
    ixgbe: fix namespace issues when FCoE/DCB is not enabled
    rtlwifi: Remove unused ETH_ADDR_LEN defines
    igbvf: Use ETH_ALEN
    ...

    Fix up fairly trivial conflicts in drivers/isdn/gigaset/interface.c and
    drivers/net/usb/{Kconfig,qmi_wwan.c} as per David.

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Out of the 8 commits, one fixes a long-standing locking issue around
    tasklist walking and others are cleanups."

    * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
    cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
    cgroup: remove cgroup_subsys argument from callbacks
    cgroup: remove extra calls to find_existing_css_set
    cgroup: replace tasklist_lock with rcu_read_lock
    cgroup: simplify double-check locking in cgroup_attach_proc
    cgroup: move struct cgroup_pidlist out from the header file
    cgroup: remove cgroup_attach_task_current_cg()

    Linus Torvalds
     
  • Pull perf events changes for v3.4 from Ingo Molnar:

    - New "hardware based branch profiling" feature both on the kernel and
    the tooling side, on CPUs that support it. (modern x86 Intel CPUs
    with the 'LBR' hardware feature currently.)

    This new feature is basically a sophisticated 'magnifying glass' for
    branch execution - something that is pretty difficult to extract from
    regular, function histogram centric profiles.

    The simplest mode is activated via 'perf record -b', and the result
    looks like this in perf report:

    $ perf record -b any_call,u -e cycles:u branchy

    $ perf report -b --sort=symbol
    52.34% [.] main [.] f1
    24.04% [.] f1 [.] f3
    23.60% [.] f1 [.] f2
    0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow
    0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn
    0.01% [k] _IO_vfprintf_internal [k] strchrnul
    0.01% [k] __printf [k] _IO_vfprintf_internal
    0.01% [k] main [k] __printf

    This output shows from/to branch columns and shows the highest
    percentage (from,to) jump combinations - i.e. the most likely taken
    branches in the system. "branches" can also include function calls
    and any other synchronous and asynchronous transitions of the
    instruction pointer that are not 'next instruction' - such as system
    calls, traps, interrupts, etc.

    This feature comes with (hopefully intuitive) flat ascii and TUI
    support in perf report.

    - Various 'perf annotate' visual improvements for us assembly junkies.
    It will now recognize function calls in the TUI and by hitting enter
    you can follow the call (recursively) and back, amongst other
    improvements.

    - Multiple threads/processes recording support in perf record, perf
    stat, perf top - which is activated via a comma-list of PIDs:

    perf top -p 21483,21485
    perf stat -p 21483,21485 -ddd
    perf record -p 21483,21485

    - Support for per UID views, via the --uid parameter to perf top, perf
    report, etc. For example 'perf top --uid mingo' will only show the
    tasks that I am running, excluding other users, root, etc.

    - Jump label restructurings and improvements - this includes the
    factoring out of the (hopefully much clearer) include/linux/static_key.h
    generic facility:

    struct static_key key = STATIC_KEY_INIT_FALSE;

    ...

    if (static_key_false(&key))
            do unlikely code
    else
            do likely code

    ...
    static_key_slow_inc(&key);
    ...
    static_key_slow_dec(&key);
    ...

    The static_key_false() branch will be generated into the code with as
    little impact to the likely code path as possible. The
    static_key_slow_*() APIs flip the branch via live kernel code patching.

    This facility can now be used more widely within the kernel to
    micro-optimize hot branches whose likelihood matches the static-key
    usage and fast/slow cost patterns.

    - SW function tracer improvements: perf support and filtering support.

    - Various hardenings of the perf.data ABI, to make older perf.data's
    smoother on newer tool versions, to make new features integrate more
    smoothly, to support cross-endian recording/analyzing workflows
    better, etc.

    - Restructuring of the kprobes code, the splitting out of 'optprobes',
    and a corner case bugfix.

    - Allow the tracing of kernel console output (printk).

    - Improvements/fixes to user-space RDPMC support, allowing user-space
    self-profiling code to extract PMU counts without performing any
    system calls, while playing nice with the kernel side.

    - 'perf bench' improvements

    - ... and lots of internal restructurings, cleanups and fixes that made
    these features possible. And, as usual this list is incomplete as
    there were also lots of other improvements

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
    perf report: Fix annotate double quit issue in branch view mode
    perf report: Remove duplicate annotate choice in branch view mode
    perf/x86: Prettify pmu config literals
    perf report: Enable TUI in branch view mode
    perf report: Auto-detect branch stack sampling mode
    perf record: Add HEADER_BRANCH_STACK tag
    perf record: Provide default branch stack sampling mode option
    perf tools: Make perf able to read files from older ABIs
    perf tools: Fix ABI compatibility bug in print_event_desc()
    perf tools: Enable reading of perf.data files from different ABI rev
    perf: Add ABI reference sizes
    perf report: Add support for taken branch sampling
    perf record: Add support for sampling taken branch
    perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
    x86/kprobes: Split out optprobe related code to kprobes-opt.c
    x86/kprobes: Fix a bug which can modify kernel code permanently
    x86/kprobes: Fix instruction recovery on optimized path
    perf: Add callback to flush branch_stack on context switch
    perf: Disable PERF_SAMPLE_BRANCH_* when not supported
    perf/x86: Add LBR software filter support for Intel CPUs
    ...

    Linus Torvalds
     
  • Pull RCU changes for v3.4 from Ingo Molnar. The major features of this
    series are:

    - making RCU more aggressive about entering dyntick-idle mode in order
    to improve energy efficiency

    - converting a few more call_rcu()s to kfree_rcu()s

    - applying a number of rcutree fixes and cleanups to rcutiny

    - removing CONFIG_SMP #ifdefs from treercu

    - allowing RCU CPU stall times to be set via sysfs

    - adding CPU-stall capability to rcutorture

    - adding more RCU-abuse diagnostics

    - updating documentation

    - fixing yet more issues located by the still-ongoing top-to-bottom
    inspection of RCU, this time with a special focus on the CPU-hotplug
    code path.

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    rcu: Stop spurious warnings from synchronize_sched_expedited
    rcu: Hold off RCU_FAST_NO_HZ after timer posted
    rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
    rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
    rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
    rcu: Remove redundant check for rcu_head misalignment
    PTR_ERR should be called before its argument is cleared.
    rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
    rcu: Trace only after NULL-pointer check
    rcu: Call out dangers of expedited RCU primitives
    rcu: Rework detection of use of RCU by offline CPUs
    lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
    rcu: No interrupt disabling for rcu_prepare_for_idle()
    rcu: Move synchronize_sched_expedited() to rcutree.c
    rcu: Check for illegal use of RCU from offlined CPUs
    rcu: Update stall-warning documentation
    rcu: Add CPU-stall capability to rcutorture
    rcu: Make documentation give more realistic rcutorture duration
    rcutorture: Permit holding off CPU-hotplug operations during boot
    rcu: Print scheduling-clock information on RCU CPU stall-warning messages
    ...

    Linus Torvalds
     

20 Mar, 2012

2 commits

  • With receive window sizes increasing while the speed of light has not
    improved much, the out-of-order queue can contain a huge number of skbs
    waiting to be moved to receive_queue once missing packets fill the holes.

    Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
    sk_buff)) to store regular (MTU
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: H.K. Jerry Chu
    Cc: Tom Herbert
    Cc: Ilpo Järvinen
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Split tcp_data_queue() in two parts for better readability.

    tcp_data_queue_ofo() is responsible for queueing incoming skb into out
    of order queue.

    Change code layout so that the skb_set_owner_r() is performed only if
    skb is not dropped.

    This is a preliminary patch before "reduce out_of_order memory use"
    following patch.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: H.K. Jerry Chu
    Cc: Tom Herbert
    Cc: Ilpo Järvinen
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Mar, 2012

1 commit


17 Mar, 2012

1 commit

  • I found recently that the arp_process function, which handles all of our received
    arp frames, is using the IPV4_DEVCONF_ALL macro to check the state of the arp_accept
    flag. This seems wrong, as it implies that either none or all of the network
    interfaces accept gratuitous arps. This patch corrects that, allowing
    per-interface arp_accept configuration to deviate from the all setting. Note
    this also brings us into line with the way the arp_filter setting is handled
    during arp_process execution.

    Tested this myself on my home network, and confirmed it works as expected.
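
    The difference between the two checks can be sketched in userspace C. This is a toy model, not the kernel code: struct devconf_model and both helper functions are hypothetical names, and the OR of the "all" and per-interface values is an assumption about how the per-interface lookup combines the two settings.

```c
#include <stdbool.h>

/* Toy model of the IPv4 devconf layout: one global "all" block plus one
 * block per interface. Hypothetical names, not the kernel's structures. */
struct devconf_model {
    bool arp_accept;
};

/* Behaviour described as wrong in the commit text: only the "all" value is
 * consulted, so every interface gets the same answer. */
static bool arp_accept_all_only(const struct devconf_model *all,
                                const struct devconf_model *iface)
{
    (void)iface; /* the per-interface setting is ignored */
    return all->arp_accept;
}

/* Behaviour after the change, under the assumption that the "all" and
 * per-interface values are ORed together. */
static bool arp_accept_per_iface(const struct devconf_model *all,
                                 const struct devconf_model *iface)
{
    return all->arp_accept || iface->arp_accept;
}
```

    With "all" disabled and a single interface enabled, the first helper rejects gratuitous ARP everywhere, while the second accepts it on that interface only.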

    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

13 Mar, 2012

2 commits


12 Mar, 2012

2 commits

  • Use a more current kernel messaging style.

    Convert a printk block to print_hex_dump.
    Coalesce formats, align arguments.
    Use %s, __func__ instead of embedding function names.

    Some messages that were prefixed with _close are
    now prefixed with _fini. Some ah4 and esp messages
    are now not prefixed with "ip ".

    The intent of this patch is to later add something like
    #define pr_fmt(fmt) "IPv4: " fmt.
    to standardize the output messages.

    Text size is trivially reduced. (x86-32 allyesconfig)

    $ size net/ipv4/built-in.o*
    text data bss dec hex filename
    887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
    887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old
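
    The pr_fmt idea mentioned above can be tried out in userspace. The sketch below is not kernel code: pr_info_buf() and log_close_failure() are hypothetical stand-ins that format into a buffer with snprintf() instead of calling printk(). It relies on the GNU ##__VA_ARGS__ extension, which gcc and clang accept.

```c
#include <stdio.h>

/* Prefix applied to every message; it must be defined before the logging
 * macro is used. */
#define pr_fmt(fmt) "IPv4: " fmt

/* Hypothetical userspace stand-in for pr_info(): formats into a buffer.
 * String-literal concatenation glues the prefix onto the format. */
#define pr_info_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

static int log_close_failure(char *buf, size_t size, int err)
{
    /* "%s" with __func__ replaces a function name embedded in the format,
     * so the message stays correct if the function is renamed. */
    return pr_info_buf(buf, size, "%s: failed with %d\n", __func__, err);
}
```

    Because the prefix lives in one macro, changing it for every message in a file is a one-line edit.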

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
    added a serious regression on synflood handling.

    Simon Kirby discovered that a successfully established connection took
    20 seconds to become responsive.

    In my tests, I discovered that xmit frames were lost, and needed ~4
    retransmits and a socket dst rebuild before being really sent.

    For a syncookie-initiated connection, we use a different path to
    initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

    As ip_queue_xmit() now depends on inet flow being setup, fix this by
    copying the temp flowi4 we use in cookie_v4_check().

    Reported-by: Simon Kirby
    Bisected-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Mar, 2012

3 commits


08 Mar, 2012

8 commits


07 Mar, 2012

1 commit

  • This commit fixes tcp_shift_skb_data() so that it does not shift
    SACKed data below snd_una.

    This fixes an issue whose symptoms exactly match reports showing
    tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
    net/ipv4/tcp_input.c:3418" thread on netdev).

    Since 2008 (832d11c5cd076abc0aa1eaf7be96c81d1a59ce41)
    tcp_shift_skb_data() had been shifting SACKed ranges that were below
    snd_una. It checked that the *end* of the skb it was about to shift
    from was above snd_una, but did not check that the end of the actual
    shifted range was above snd_una; this commit adds that check.

    Shifting SACKed ranges below snd_una is problematic because for such
    ranges tcp_sacktag_one() short-circuits: it does not declare anything
    as SACKed and does not increase sacked_out.

    Before the fixes in commits cc9a672ee522d4805495b98680f4a3db5d0a0af9
    and daef52bab1fd26e24e8e9578f8fb33ba1d0cb412, shifting SACKed ranges
    below snd_una happened to work because tcp_shifted_skb() was always
    (incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq
    tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence
    tcp_sacktag_one() never short-circuited and always increased
    tp->sacked_out in this case.

    After those two fixes, my testing has verified that shifting SACKed
    ranges below snd_una could cause tp->sacked_out to go negative with
    the following sequence of events:

    (1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
    then shifts a prefix of that skb that is below snd_una

    (2) tcp_shifted_skb() increments the packet count of the
    already-SACKed prev sk_buff

    (3) tcp_sacktag_one() sees the end of the new SACKed range is below
    snd_una, so it short-circuits and doesn't increase tp->sacked_out

    (4) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
    decrements tp->sacked_out by this "inflated" pcount that was
    missing a matching increase in tp->sacked_out, and hence
    tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which, when cast
    to s32, is negative.

    (5) this leads to the warnings seen in the recent "WARNING: at
    net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
    tcp_input.c:3418 WARN_ON((int)tp->sacked_out < 0);

    More generally, I think this bug can be tickled in some cases where
    two or more ACKs from the receiver are lost and then a DSACK arrives
    that is immediately above an existing SACKed skb in the write queue.

    This fix changes tcp_shift_skb_data() to abort this sequence at step
    (1) in the scenario above by noticing that the bytes are below snd_una
    and not shifting them.
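
    The counter underflow and the added check can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel code: seq_after() mirrors the usual wraparound-safe sequence comparison, while may_shift() and sacked_out_after_ack() are hypothetical helpers written only to illustrate the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" test on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Check sketched from the commit text: shift only when the end of the range
 * being shifted (start_seq + len) is above snd_una. */
static bool may_shift(uint32_t start_seq, uint32_t len, uint32_t snd_una)
{
    return seq_after(start_seq + len, snd_una);
}

/* Models the counter arithmetic: the packet count is always subtracted when
 * the skb is ACKed, but it was only added beforehand if the SACK tagging
 * did not short-circuit ("counted" is true). */
static int32_t sacked_out_after_ack(uint32_t sacked_out, uint32_t pcount,
                                    bool counted)
{
    if (counted)
        sacked_out += pcount; /* matching increase happened */
    sacked_out -= pcount;     /* decrease on ACK; may wrap past zero */
    /* Converting 0xFFFFFFFF to int32_t is implementation-defined in C, but
     * yields a negative value on common two's-complement platforms. */
    return (int32_t)sacked_out;
}
```

    With snd_una at 2000, a 100-byte range starting at 1000 is rejected by may_shift(), while a 200-byte range starting at 1900 is allowed because it ends above snd_una.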

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell