07 Aug, 2014

2 commits

  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.
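
    A minimal sketch of the converted helper, assuming a caller that
    already holds a pointer to the existing node (the function and variable
    names here are illustrative):

        #include <linux/list.h>

        static void insert_after(struct hlist_node *existing,
                                 struct hlist_node *new_node)
        {
            /* New item first, insertion point second, matching the other
             * list helpers; the old hlist_add_after() took the arguments
             * the other way around.
             */
            hlist_add_behind(new_node, existing);
        }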

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • Pull networking updates from David Miller:
    "Highlights:

    1) Steady transitioning of the BPF infrastructure to a generic spot so
    all kernel subsystems can make use of it, from Alexei Starovoitov.

    2) SFC driver supports busy polling, from Alexandre Rames.

    3) Take advantage of hash table in UDP multicast delivery, from David
    Held.

    4) Lighten locking, in particular by getting rid of the LRU lists, in
    inet frag handling. From Florian Westphal.

    5) Add support for various RFC6458 control messages in SCTP, from
    Geir Ola Vaagland.

    6) Allow filtering of bridge forwarding database dumps by device, from
    Jamal Hadi Salim.

    7) virtio-net also now supports busy polling, from Jason Wang.

    8) Some low level optimization tweaks in pktgen from Jesper Dangaard
    Brouer.

    9) Add support for ipv6 address generation modes, so that userland
    can have some input into the process. From Jiri Pirko.

    10) Consolidate common TCP connection request code in ipv4 and ipv6,
    from Octavian Purdila.

    11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

    12) Generic resizable RCU hash table, with initial users in netlink and
    nftables. From Thomas Graf.

    13) Maintain a name assignment type so that userspace can see where a
    network device name came from (enumerated by kernel, assigned
    explicitly by userspace, etc.) From Tom Gundersen.

    14) Automatic flow label generation on transmit in ipv6, from Tom
    Herbert.

    15) New packet timestamping facilities from Willem de Bruijn, meant to
    assist in measuring latencies going into/out-of the packet
    scheduler, latency from TCP data transmission to ACK, etc"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
    cxgb4 : Disable recursive mailbox commands when enabling vi
    net: reduce USB network driver config options.
    tg3: Modify tg3_tso_bug() to handle multiple TX rings
    amd-xgbe: Perform phy connect/disconnect at dev open/stop
    amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
    net: sun4i-emac: fix memory leak on bad packet
    sctp: fix possible seqlock deadlock in sctp_packet_transmit()
    Revert "net: phy: Set the driver when registering an MDIO bus device"
    cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
    team: Simplify return path of team_newlink
    bridge: Update outdated comment on promiscuous mode
    net-timestamp: ACK timestamp for bytestreams
    net-timestamp: TCP timestamping
    net-timestamp: SCHED timestamp on entering packet scheduler
    net-timestamp: add key to disambiguate concurrent datagrams
    net-timestamp: move timestamp flags out of sk_flags
    net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
    cxgb4i : Move stray CPL definitions to cxgb4 driver
    tcp: reduce spurious retransmits due to transient SACK reneging
    qlcnic: Initialize dcbnl_ops before register_netdev
    ...

    Linus Torvalds
     

06 Aug, 2014

13 commits

  • Pull security subsystem updates from James Morris:
    "In this release:

    - PKCS#7 parser for the key management subsystem from David Howells
    - appoint Kees Cook as seccomp maintainer
    - bugfixes and general maintenance across the subsystem"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
    X.509: Need to export x509_request_asymmetric_key()
    netlabel: shorter names for the NetLabel catmap funcs/structs
    netlabel: fix the catmap walking functions
    netlabel: fix the horribly broken catmap functions
    netlabel: fix a problem when setting bits below the previously lowest bit
    PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
    tpm: simplify code by using %*phN specifier
    tpm: Provide a generic means to override the chip returned timeouts
    tpm: missing tpm_chip_put in tpm_get_random()
    tpm: Properly clean sysfs entries in error path
    tpm: Add missing tpm_do_selftest to ST33 I2C driver
    PKCS#7: Use x509_request_asymmetric_key()
    Revert "selinux: fix the default socket labeling in sock_graft()"
    X.509: x509_request_asymmetric_keys() doesn't need string length arguments
    PKCS#7: fix sparse non static symbol warning
    KEYS: revert encrypted key change
    ima: add support for measuring and appraising firmware
    firmware_class: perform new LSM checks
    security: introduce kernel_fw_from_file hook
    PKCS#7: Missing inclusion of linux/err.h
    ...

    Linus Torvalds
     
  • Conflicts:
    drivers/net/Makefile
    net/ipv6/sysctl_net_ipv6.c

    Two ipv6_table_template[] additions overlap, so the index
    of the ipv6_table[x] assignments needed to be adjusted.

    In the drivers/net/Makefile case, we've gotten rid of the garbage
    whereby we had to list every single USB networking driver in the
    top-level Makefile; now there is just one "USB_NETWORKING" option that
    guards everything.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Dave reported the following splat, caused by improper use of
    IP_INC_STATS_BH() in process context.

    BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
    caller is __this_cpu_preempt_check+0x13/0x20
    CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
    ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
    0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
    ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] check_preemption_disabled+0xfa/0x100
    [] __this_cpu_preempt_check+0x13/0x20
    [] sctp_packet_transmit+0x692/0x710 [sctp]
    [] sctp_outq_flush+0x2a2/0xc30 [sctp]
    [] ? mark_held_locks+0x7c/0xb0
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] sctp_outq_uncork+0x1a/0x20 [sctp]
    [] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
    [] sctp_do_sm+0xdb/0x330 [sctp]
    [] ? preempt_count_sub+0xab/0x100
    [] ? sctp_cname+0x70/0x70 [sctp]
    [] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
    [] sctp_sendmsg+0x88f/0xe30 [sctp]
    [] ? lock_release_holdtime.part.28+0x9a/0x160
    [] ? put_lock_stats.isra.27+0xe/0x30
    [] inet_sendmsg+0x104/0x220
    [] ? inet_sendmsg+0x5/0x220
    [] sock_sendmsg+0x9e/0xe0
    [] ? might_fault+0xb9/0xc0
    [] ? might_fault+0x5e/0xc0
    [] SYSC_sendto+0x124/0x1c0
    [] ? syscall_trace_enter+0x250/0x330
    [] SyS_sendto+0xe/0x10
    [] tracesys+0xdd/0xe2

    This is a followup of commits f1d8cba61c3c4b ("inet: fix possible
    seqlock deadlocks") and 7f88c6b23afbd315 ("ipv6: fix possible seqlock
    deadlock in ip6_finish_output2")
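
    A hedged sketch of the kind of change involved: in process context the
    plain SNMP accessor has to be used instead of the _BH variant, which
    assumes softirqs are already disabled (the counter shown is
    illustrative):

        /* process context, e.g. the sendmsg()/transmit path */
        IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);

        /* softirq context, e.g. the receive path, may keep the _BH form */
        IP_INC_STATS_BH(net, IPSTATS_MIB_OUTNOROUTES);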

    Signed-off-by: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Reported-by: Dave Jones
    Acked-by: Neil Horman
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Antonio Quartulli says:

    ====================
    pull request: batman-adv 2014-08-05

    this is a pull request intended for net-next/linux-3.17 (yeah..it's really
    late).

    Patches 1, 2 and 4 are really minor changes:
    - kmalloc_array() is substituted for kmalloc() where possible (as
    suggested by checkpatch);
    - net_ratelimit() is now used properly and the "suppressed" message is
    no longer printed when not needed;
    - the internal version number has been increased to reflect our current version.

    Patch 3 instead introduces a change in the metric computation function
    by increasing the penalty applied at each mesh hop from 15/255 (~6%) to
    30/255 (~11%). This change was made by Simon Wunderlich after observing
    a performance improvement in several networks when using the new value.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Now that bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
    unnecessary operation.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
    in the send() call is acknowledged. It implements the feature for TCP.

    The timestamp is generated when the TCP socket cumulative ACK is moved
    beyond the tracked seqno for the first time. The feature ignores SACK
    and FACK, because those acknowledge the specific byte, but not
    necessarily the entire contents of the buffer up to that byte.
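
    A minimal userspace sketch of requesting these timestamps; the flag
    names come from linux/net_tstamp.h, while the helper itself is
    illustrative:

        #include <sys/socket.h>
        #include <linux/net_tstamp.h>

        /* ask for a software timestamp when the last byte of each send()
         * is cumulatively ACKed, tagged with a per-send counter */
        static int enable_ack_timestamps(int fd)
        {
            int val = SOF_TIMESTAMPING_TX_ACK |
                      SOF_TIMESTAMPING_SOFTWARE |
                      SOF_TIMESTAMPING_OPT_ID;

            return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                              &val, sizeof(val));
        }

    The generated timestamps are then read back from the socket error
    queue with recvmsg(..., MSG_ERRQUEUE).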

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • TCP timestamping extends SO_TIMESTAMPING to bytestreams.

    Bytestreams do not have a 1:1 relationship between send() buffers and
    network packets. The feature interprets a send call on a bytestream as
    a request for a timestamp for the last byte in that send() buffer.

    The choice corresponds to a request for a timestamp when all bytes in
    the buffer have been sent. That assumption depends on in-order kernel
    transmission. This is the common case. That said, it is possible to
    construct a traffic shaping tree that would result in reordering.
    The guarantee is strong, then, but not ironclad.

    This implementation supports send and sendpages (splice). GSO replaces
    one large packet with multiple smaller packets. This patch also copies
    the option into the correct smaller packet.

    This patch does not yet support timestamping on data in an initial TCP
    Fast Open SYN, because that takes a very different data path.

    If ID generation in ee_data is enabled, bytestream timestamps return a
    byte offset, instead of the packet counter for datagrams.

    The implementation supports a single timestamp per packet. It silently
    replaces requests for previous timestamps. To avoid missing tstamps,
    flush the tcp queue by disabling Nagle, cork and autocork. Missing
    tstamps can be detected by offset when the ee_data ID is enabled.

    Implementation details:

    - On GSO, the timestamping code can be included in the main loop. I
    moved it into its own loop to reduce the impact on the common case
    to a single branch.

    - To avoid leaking the absolute seqno to userspace, the offset
    returned in ee_data must always be relative. It is an offset between
    an skb and sk field. The first is always set (also for GSO & ACK).
    The second must also never be uninitialized. Only allow the ID
    option on sockets in the ESTABLISHED state, for which the seqno
    is available. Never reset it to zero (instead, move it to the
    current seqno when reenabling the option).

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Kernel transmit latency is often incurred in the packet scheduler.
    Introduce a new timestamp on transmission just before entering the
    scheduler. When data travels through multiple devices (bonding,
    tunneling, ...) each device will export an individual timestamp.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Datagrams timestamped on transmission can coexist in the kernel stack
    and be reordered in packet scheduling. When reading looped datagrams
    from the socket error queue it is not always possible to uniquely
    correlate looped data with the original send() call (for application
    level retransmits). Even if possible, it may be expensive and complex,
    requiring packet inspection.

    Introduce a data-independent ID mechanism to associate timestamps with
    send calls. Pass an ID alongside the timestamp in field ee_data of
    sock_extended_err.

    The ID is a simple 32 bit unsigned int that is associated with the
    socket and incremented on each send() call for which software tx
    timestamp generation is enabled.

    The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
    avoid changing ee_data for existing applications that expect it 0.
    The counter is reset each time the flag is reenabled. Reenabling
    does not change the ID of already submitted data. It is possible
    to receive out of order IDs if the timestamp stream is not quiesced
    first.
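
    A hedged userspace sketch of matching a looped timestamp back to its
    send() call via ee_data (error handling trimmed; the cmsg walk follows
    the usual SO_TIMESTAMPING error-queue pattern):

        #include <errno.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <linux/errqueue.h>

        /* return the counter (byte offset for TCP) carried in ee_data;
         * requires SOF_TIMESTAMPING_OPT_ID to be set on the socket */
        static long read_tstamp_id(int fd)
        {
            char data[256], ctrl[512];
            struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
            struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };
            struct cmsghdr *cm;

            if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                return -1;

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) {
                    struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

                    if (serr->ee_errno == ENOMSG &&
                        serr->ee_origin == SO_EE_ORIGIN_TIMESTAMPING)
                        return serr->ee_data;
                }
            }
            return -1;
        }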

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • sk_flags is reaching its limit. New timestamping options will not fit.
    Move all of them into a new field sk->sk_tsflags.

    An added benefit is that this removes the boilerplate code to convert
    between SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in
    getsockopt/setsockopt.

    SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
    timestamp logic (netstamp_needed). That can be simplified and this
    last key removed, but that is left for a separate patch.

    Signed-off-by: Willem de Bruijn

    ----

    The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
    though that scatters tstamp fields throughout the struct.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Applications that request kernel tx timestamps with SO_TIMESTAMPING
    read timestamps as recvmsg() ancillary data. The response is defined
    implicitly as timespec[3].

    1) define struct scm_timestamping explicitly and

    2) add support for new tstamp types. On tx, scm_timestamping always
    accompanies a sock_extended_err. Define previously unused field
    ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
    define the existing behavior.

    The reception path is not modified. On rx, no struct similar to
    sock_extended_err is passed along with SCM_TIMESTAMPING.
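
    For reference, the explicit layout this series adds to linux/errqueue.h
    looks roughly like this (a sketch; the header is authoritative):

        struct scm_timestamping {
            struct timespec ts[3];  /* ts[0] software, ts[2] raw hardware,
                                     * ts[1] deprecated */
        };

        /* on tx, ee_info of the accompanying sock_extended_err carries one
         * of these to say what ts[0] means */
        enum {
            SCM_TSTAMP_SND,     /* driver passed the skb to the NIC */
            SCM_TSTAMP_SCHED,   /* data entered the packet scheduler */
            SCM_TSTAMP_ACK,     /* data was acknowledged by the peer */
        };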

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • This commit reduces spurious retransmits due to apparent SACK reneging
    by only reacting to SACK reneging that persists for a short delay.

    When a sequence space hole at snd_una is filled, some TCP receivers
    send a series of ACKs as they apparently scan their out-of-order queue
    and cumulatively ACK all the packets that have now been consecutively
    received. This is essentially misbehavior B in "Misbehaviors in TCP
    SACK generation" ACM SIGCOMM Computer Communication Review, April
    2011, so we suspect that this is from several common OSes (Windows
    2000, Windows Server 2003, Windows XP). However, this issue has also
    been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
    into spurious retransmissions by lack of timestamps?" from March 2014,
    where the receiver was thought to be a BSD box.

    Since snd_una would temporarily be adjacent to a previously SACKed
    range in these scenarios, this receiver behavior triggered the Linux
    SACK reneging code path in the sender. This led the sender to clear
    the SACK scoreboard, enter CA_Loss, and spuriously retransmit
    (potentially) every packet from the entire write queue at line rate
    just a few milliseconds before the ACK for each packet arrives at the
    sender.

    To avoid such situations, now when a sender sees apparent reneging it
    does not yet retransmit, but rather adjusts the RTO timer to give the
    receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
    that will restore sanity to the SACK scoreboard. If the reneging
    persists until this RTO then, as before, we clear the SACK scoreboard
    and enter CA_Loss.

    A 10ms delay tolerates a receiver sending such a stream of ACKs at
    56Kbit/sec. And to allow for receivers with slower or more congested
    paths, we wait for at least RTT/2.

    We validated the resulting max(RTT/2, 10ms) delay formula with a mix
    of North American and South American Google web server traffic, and
    found that for ACKs displaying transient reneging:

    (1) 90% of inter-ACK delays were less than 10ms
    (2) 99% of inter-ACK delays were less than RTT/2

    In tests on Google web servers this commit reduced reneging events by
    75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
    any measurable impact on latency for user HTTP and SPDY requests.
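
    A sketch of the delay computation described above, written against the
    3.17-era tcp_sock fields (srtt_us stores 8 times the smoothed RTT in
    microseconds, so a shift by 4 yields RTT/2); treat it as illustrative
    rather than the exact upstream hunk:

        /* give the receiver time to finish its ACK burst before treating
         * the apparent reneging as real; never wait less than 10ms */
        unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                  msecs_to_jiffies(10));

        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);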

    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • …inville/wireless-next

    Conflicts:
    net/6lowpan/iphc.c

    Minor conflicts in iphc.c were changes overlapping with some
    style cleanups.

    John W. Linville says:

    ====================
    Please pull this last(?) batch of wireless changes intended for the
    3.17 stream...

    For the NFC bits, Samuel says:

    "This is a rather quiet one, we have:

    - A new driver from ST Microelectronics for their NCI ST21NFCB,
    including device tree support.

    - p2p support for the ST21NFCA driver

    - A few fixes and enhancements for the NFC digital layer"

    For the Atheros bits, Kalle says:

    "Michal and Janusz did some important RX aggregation fixes, basically we
    were missing RX reordering altogether. The 10.1 firmware doesn't support
    Ad-Hoc mode and Michal fixed ath10k so that it doesn't advertise Ad-Hoc
    support with that firmware. Also he implemented a workaround for a KVM
    issue."

    For the Bluetooth bits, Gustavo and Johan say:

    "To quote Gustavo from his previous request:

    'Some last minute fixes for -next. We have a fix for a use after free in
    RFCOMM, another fix to an issue with ADV_DIRECT_IND and one for ADV_IND with
    auto-connection handling. Last, we added support for reading the codec and
    MWS setting for controllers that support these features.'

    Additionally there are fixes to LE scanning, an update to conform to the 4.1
    core specification as well as fixes for tracking the page scan state. All
    of these fixes are important for 3.17."

    And,

    "We've got:

    - 6lowpan fixes/cleanups
    - A couple crash fixes, one for the Marvell HCI driver and another in LE SMP.
    - Fix for an incorrect connected state check
    - Fix for the bondable requirement during pairing (an issue which had
    crept in because of using "pairable" when in fact the actual meaning
    was "bondable"; these have different meanings in Bluetooth)"

    Along with those are some late-breaking hardware support patches in
    brcmfmac and b43 as well as a stray ath9k patch.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     

05 Aug, 2014

10 commits

  • Signed-off-by: Simon Wunderlich
    Signed-off-by: Antonio Quartulli

    Simon Wunderlich
     
  • The default hop penalty is currently set to 15, which is applied as-is
    for multi-interface devices (e.g. dual band APs). Single band devices
    will still use an effective penalty of 30 (hop penalty + wifi
    penalty).

    After receiving reports of overly long paths in mesh networks with dual
    band APs, which were fixed by increasing the hop penalty, we'd like to
    increase that default value as well.
    We've evaluated that increase in a handful of medium sized mesh
    networks (5-20 nodes) with single and dual band devices, with changes
    for the better (shorter routes, higher throughput) or no change at all.

    This patch changes the hop penalty to 30, which will give an effective
    penalty of 60 on single band devices (hop penalty + wifi penalty).
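
    For context, the penalty is applied multiplicatively to the TQ metric
    at every hop; a hedged sketch of that computation (simplified from the
    batman-adv helper, so treat the details as illustrative):

        /* TQ is in the range 0..255; each hop scales it down by
         * penalty/255, so a penalty of 30 is roughly an 11% cut per hop */
        static unsigned int apply_hop_penalty(unsigned int tq, unsigned int penalty)
        {
            return (tq * (255 - penalty)) / 255;
        }

        /* e.g. tq = 255, penalty = 30: 225 after one hop, 198 after two */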

    Signed-off-by: Simon Wunderlich
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Simon Wunderlich
     
  • This patch removes unnecessary logspam which resulted from superfluous
    calls to net_ratelimit(). With the supplied patch, net_ratelimit() is
    called after the loglevel has been checked.

    Signed-off-by: André Gaul
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    André Gaul
     
  • batadv_frag_insert_packet was unable to handle out-of-order packets because it
    dropped them directly. This is caused by the way the fragmentation list is
    checked for the correct place to insert a fragmentation entry.

    The fragmentation code keeps the fragments in lists. The fragmentation entries
    are kept in descending order of sequence number. The list is traversed and each
    entry is compared with the new fragment. If the current entry has a smaller
    sequence number than the new fragment then the new one has to be inserted
    before the current entry. This ensures that the list is still in descending
    order.

    An out-of-order packet with a smaller sequence number than all entries in the
    list still has to be added to the end of the list. The used hlist has no
    information about the last entry in the list inside hlist_head and thus the
    last entry has to be calculated differently. Currently the code assumes that
    the iterator variable of hlist_for_each_entry can be used for this purpose
    after hlist_for_each_entry has finished. This is obviously wrong because the
    iterator variable is always NULL once the list has been completely traversed.

    Instead the information about the last entry has to be stored in a different
    variable.

    This problem was introduced in 610bfc6bc99bc83680d190ebc69359a05fc7f605
    ("batman-adv: Receive fragmented packets and merge").
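
    A hedged sketch of the shape of the fix: remember the tail in an
    explicit variable instead of relying on the exhausted iterator ('frag'
    is the new fragment being inserted; the element type and field names
    are illustrative, not the actual batman-adv code):

        struct frag_elem {                      /* hypothetical element type */
            struct hlist_node list;
            u16 seqno;
        };

        struct frag_elem *curr, *last = NULL;

        hlist_for_each_entry(curr, head, list) {
            last = curr;                        /* track the real tail */
            if (curr->seqno < frag->seqno) {
                hlist_add_before(&frag->list, &curr->list);
                return;
            }
        }

        /* smaller than every entry in the list: 'curr' is NULL here,
         * but 'last' still points at the final element */
        if (last)
            hlist_add_behind(&frag->list, &last->list);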

    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     
  • With the netlink_lookup() conversion to RCU, we need to use the appropriate
    RCU dereference in netlink_seq_socket_idx() and netlink_seq_next().

    Reported-by: Sasha Levin
    Signed-off-by: Eric Dumazet
    Fixes: e341694e3eb57fc ("netlink: Convert netlink_lookup() to use RCU protected hash table")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pull tty / serial driver update from Greg KH:
    "Here's the big tty / serial driver update for 3.17-rc1.

    Nothing major, just a number of fixes and new features for different
    serial drivers, and some more tty core fixes and documentation of the
    tty locks.

    All of these have been in linux-next for a while"

    * tag 'tty-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (82 commits)
    tty/n_gsm.c: fix a memory leak in gsmld_open
    pch_uart: don't hardcode PCI slot to get DMA device
    tty: n_gsm, use setup_timer
    Revert "ARC: [arcfpga] stdout-path now suffices for earlycon/console"
    serial: sc16is7xx: Correct initialization of s->clk
    serial: 8250_dw: Add support for deferred probing
    serial: 8250_dw: Add optional reset control support
    serial: st-asc: Fix overflow in baudrate calculation
    serial: st-asc: Don't call BUG in asc_console_setup()
    tty: serial: msm: Make of_device_id array const
    tty/n_gsm.c: get gsm->num after gsm_activate_mux
    serial/core: Fix too big allocation for attribute member
    drivers/tty/serial: use correct type for dma_map/unmap
    serial: altera_jtaguart: Fix putchar function passed to uart_console_write()
    serial/uart/8250: Add tunable RX interrupt trigger I/F of FIFO buffers
    Serial: allow port drivers to have a default attribute group
    tty: kgdb_nmi: Automatically manage tty enable
    serial: altera_jtaguart: Adopt uart_console_write()
    serial: samsung: improve code clarity by defining a variable
    serial: samsung: correct the case and default order in switch
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds
     
  • tcpm_key is an array inside struct tcp_md5sig, so there is no need to check it
    against NULL.

    Signed-off-by: Dmitry Popov
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Dmitry Popov
     
  • Commit 6cbdceeb1cb12c7d620161925a8c3e81daadb2e4 ("bridge: Dump vlan
    information from a bridge port") introduced a comment in an attempt to
    explain the code logic. The comment is unfinished, so it confuses more
    than it explains; remove it.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

04 Aug, 2014

1 commit


03 Aug, 2014

14 commits

  • The sizing of the hash table and the practice of requiring a lookup
    to retrieve the pprev to be stored in the element cookie before the
    deletion of an entry are left intact.

    Signed-off-by: Thomas Graf
    Acked-by: Patrick McHardy
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Heavy Netlink users such as Open vSwitch spend a considerable amount of
    time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
    RCU relieves the lock contention.

    Makes use of the new resizable hash table to avoid locking on the
    lookup.

    The hash table will grow if entries exceeds 75% of table size up to a
    total table size of 64K. It will automatically shrink if usage falls
    below 30%.

    Also splits nl_table_lock into a separate mutex to protect hash table
    mutations and allow synchronize_rcu() to sleep while waiting for readers
    during expansion and shrinking.

    Before:
    9.16% kpktgend_0 [openvswitch] [k] masked_flow_lookup
    6.42% kpktgend_0 [pktgen] [k] mod_cur_headers
    6.26% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    6.23% kpktgend_0 [kernel.kallsyms] [k] memset
    4.79% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup
    4.37% kpktgend_0 [kernel.kallsyms] [k] memcpy
    3.60% kpktgend_0 [openvswitch] [k] ovs_flow_extract
    2.69% kpktgend_0 [kernel.kallsyms] [k] jhash2

    After:
    15.26% kpktgend_0 [openvswitch] [k] masked_flow_lookup
    8.12% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    7.92% kpktgend_0 [pktgen] [k] mod_cur_headers
    5.11% kpktgend_0 [kernel.kallsyms] [k] memset
    4.11% kpktgend_0 [openvswitch] [k] ovs_flow_extract
    4.06% kpktgend_0 [kernel.kallsyms] [k] _raw_spin_lock
    3.90% kpktgend_0 [kernel.kallsyms] [k] jhash2
    [...]
    0.67% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup
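
    A hedged sketch of the kind of table setup this enables for netlink,
    using the 3.17-era rhashtable API (field names as in the initial
    lib/rhashtable.c; the exact values and helpers should be treated as
    illustrative):

        struct rhashtable_params ht_params = {
            .head_offset     = offsetof(struct netlink_sock, node),
            .key_offset      = offsetof(struct netlink_sock, portid),
            .key_len         = sizeof(u32),
            .hashfn          = jhash,
            .max_shift       = 16,                  /* cap at 64K buckets */
            .grow_decision   = rht_grow_above_75,   /* grow past 75% load */
            .shrink_decision = rht_shrink_below_30, /* shrink below 30% */
        };

        err = rhashtable_init(&nl_table[i].hash, &ht_params);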

    Signed-off-by: Thomas Graf
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS fixes for net

    The following patchset contains Netfilter/IPVS fixes for your net tree,
    they are:

    1) Maintain all DSCP and ECN bits for IPv6 tun forwarding. This
    resolves an inconsistency between IPv4 and IPv6 behaviour.
    Patch from Alex Gartrell via Simon Horman.

    2) Fix unnoticeable blink in xt_LED when the led-always-blink option is
    used, from Jiri Prchal.

    3) Add missing return in nft_del_setelem(), otherwise this results in a
    double call of nft_data_uninit() in the nf_tables code, from Thomas Graf.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Fixes: e110861f86094cd ("net: add a sysctl to reflect the fwmark on replies")
    Cc: Lorenzo Colitti
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Use kmem_cache to allocate/free inet_frag_queue objects, since they're all
    the same size per inet_frags user and are allocated/freed in high volumes,
    making this a perfect case for kmem_cache.
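
    A minimal sketch of the pattern (the cache name and struct field are
    close to what the patch adds, but treat them as illustrative):

        /* one cache per inet_frags user, created at init time */
        f->frags_cachep = kmem_cache_create("inet_frag_queue", f->qsize,
                                            0, 0, NULL);

        /* allocating and freeing queue entries then becomes */
        q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC);
        /* ... */
        kmem_cache_free(f->frags_cachep, q);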

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Now that we have INET_FRAG_EVICTED we might as well use it to stop
    sending icmp messages in the "frag_expire" functions instead of
    stripping INET_FRAG_FIRST_IN from their flags when evicting.
    Also fix the comment style in ip6_expire_frag_queue().

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Fix a couple of functions' declaration alignments.

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • The last_in field has been used to store various flags besides the
    first/last-fragment-in bits, so give it a more descriptive name: flags.

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Softirqs are already disabled, so there is no need to do it again; let's be
    consistent and use the IP6_INC_STATS_BH variant.

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • ip_local_deliver_finish() already takes rcu_read_lock/unlock, so the
    additional rcu_read_lock/unlock here is unnecessary.

    See the stack below:
    ip_local_deliver_finish
      -> icmp_rcv
        -> icmp_socket_deliver

    Suggested-by: Hannes Frederic Sowa
    Signed-off-by: Duan Jiong
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Duan Jiong
     
  • clean up names related to socket filtering and bpf in the following way:
    - everything that deals with sockets keeps 'sk_*' prefix
    - everything that is pure BPF is changed to 'bpf_*' prefix

    split 'struct sk_filter' into

        struct sk_filter {
            atomic_t                refcnt;
            struct rcu_head         rcu;
            struct bpf_prog         *prog;
        };

    and

        struct bpf_prog {
            u32                     jited:1,
                                    len:31;
            struct sock_fprog_kern  *orig_prog;
            unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                const struct bpf_insn *filter);
            union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
            };
        };
    so that 'struct bpf_prog' can be used independently of sockets; this cleans up
    the 'unattached' bpf use cases.

    split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

    __sk_filter_release(struct sk_filter *) gains
    __bpf_prog_release(struct bpf_prog *) helper function

    also perform related renames for the functions that work
    with 'struct bpf_prog *', since they're along the same lines:

    sk_filter_size -> bpf_prog_size
    sk_filter_select_runtime -> bpf_prog_select_runtime
    sk_filter_free -> bpf_prog_free
    sk_unattached_filter_create -> bpf_prog_create
    sk_unattached_filter_destroy -> bpf_prog_destroy
    sk_store_orig_filter -> bpf_prog_store_orig_filter
    sk_release_orig_filter -> bpf_release_orig_filter
    __sk_migrate_filter -> bpf_migrate_filter
    __sk_prepare_filter -> bpf_prepare_filter

    API for attaching classic BPF to a socket stays the same:
    sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
    and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
    which is used by sockets, tun, af_packet

    API for 'unattached' BPF programs becomes:
    bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
    and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
    which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
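
    A hedged sketch of the 'unattached' usage after the rename (the classic
    filter program shown is illustrative):

        struct sock_filter insns[] = {
            BPF_STMT(BPF_RET | BPF_K, 0xffff),  /* accept the whole packet */
        };
        struct sock_fprog_kern fprog = {
            .len    = ARRAY_SIZE(insns),
            .filter = insns,
        };
        struct bpf_prog *prog;
        unsigned int res;
        int err;

        err = bpf_prog_create(&prog, &fprog);   /* was sk_unattached_filter_create */
        if (err)
            return err;

        res = BPF_PROG_RUN(prog, skb);          /* run the program against an skb */

        bpf_prog_destroy(prog);                 /* was sk_unattached_filter_destroy */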

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • to indicate that this function converts classic BPF into eBPF and is not
    related to sockets

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • trivial rename to indicate that this function performs classic BPF checking

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • trivial rename to better match the semantics of the macro

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov