14 Aug, 2014

27 commits

  • Greg Kroah-Hartman
     
  • commit 110dc24ad2ae4e9b94b08632fe1eb2fcdff83045 upstream.

    The addition of direct formatting of log items into the CIL
    linear buffer added the alignment restriction that the start of each
    vector must be 64-bit aligned. Hence padding was added in
    xlog_finish_iovec() to round up the vector length to ensure the next
    vector started with the correct alignment.

    This padding adds a small number of otherwise unused bytes to the
    size of the linear buffer. The issue is that we
    then use the linear buffer size to determine the log space used by
    the log item, and this includes the unused space. Hence when we
    account for space used by the log item, it's more than is actually
    written into the iclogs, and hence we slowly leak this space.

    This results in log hangs when reserving space, with threads getting
    stuck with these stack traces:

    Call Trace:
    [] schedule+0x29/0x70
    [] xlog_grant_head_wait+0xa2/0x1a0
    [] xlog_grant_head_check+0xbd/0x140
    [] xfs_log_reserve+0x103/0x220
    [] xfs_trans_reserve+0x2f5/0x310
    .....

    These 4 bytes are significant. Brian Foster did all the hard work in
    tracking down a reproducible leak to inode chunk allocation (it went
    away with the ikeep mount option). His rough numbers were that
    creating 50,000 inodes leaked 11 log blocks. This turns out to be
    roughly 800 inode chunks or 1600 inode cluster buffers. That
    works out at roughly 4 bytes per cluster buffer logged, and at that
    I started looking for a 4 byte leak in the buffer logging code.

    What I found was that a struct xfs_buf_log_format structure for an
    inode cluster buffer is 28 bytes in length. This gets rounded up to
    32 bytes, but the vector length remains 28 bytes. Hence the CIL
    ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
    for that vector, rather than the 28 bytes that are actually written into the log.

    The fix for this problem is to separately track the bytes used by
    the log vectors in the item and use that instead of the buffer
    length when accounting for the log space that will be used by the
    formatted log item.

    Again, thanks to Brian Foster for doing all the hard work and long
    hours to isolate this leak and make finding the bug relatively
    simple.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Cc: Bill
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • [ Upstream commit 093758e3daede29cb4ce6aedb111becf9d4bfc57 ]

    This commit is guesswork, but it seems to make sense to drop this
    break, as otherwise the following line is never executed and becomes
    dead code. And that following line actually saves the result of a
    local calculation through the pointer given as a function argument. So the
    proposed change makes sense if this code as a whole makes sense (but I
    am unable to analyze it as a whole).

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Utkin
     
  • [ Upstream commit 4ec1b01029b4facb651b8ef70bc20a4be4cebc63 ]

    The LDC handshake could have been asynchronously triggered
    after ldc_bind() enables the ldc_rx() receive interrupt-handler
    (and thus intercepts incoming control packets)
    and before vio_port_up() calls ldc_connect(). If that is the case,
    ldc_connect() should return 0 and let the state-machine
    progress.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Karl Volz
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sowmini Varadhan
     
  • [ Upstream commit fe418231b195c205701c0cc550a03f6c9758fd9e ]

    Fix detection of BREAK on sunsab serial console: BREAK detection was only
    performed when there were also serial characters received simultaneously.
    To handle all BREAKs correctly, the check for BREAK and the corresponding
    call to uart_handle_break() must also be done if count == 0, therefore
    duplicate this code fragment and pull it out of the loop over the received
    characters.

    Patch applies to 3.16-rc6.

    Signed-off-by: Christopher Alexander Tobias Schulze
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christopher Alexander Tobias Schulze
     
  • [ Upstream commit 5cdceab3d5e02eb69ea0f5d8fa9181800baf6f77 ]

    Fix regression in bbc i2c temperature and fan control on some Sun systems
    that causes the driver to refuse to load due to the bbc_i2c_bussel resource not
    being present on the (second) i2c bus where the temperature sensors and fan
    control are located. (The check for the number of resources was removed when
    the driver was ported to a pure OF driver in mid 2008.)

    Signed-off-by: Christopher Alexander Tobias Schulze
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christopher Alexander Tobias Schulze
     
  • [ Upstream commit 4ca9a23765da3260058db3431faf5b4efd8cf926 ]

    Based almost entirely upon a patch by Christopher Alexander Tobias
    Schulze.

    In commit db64fe02258f1507e13fe5212a989922323685ce ("mm: rewrite vmap
    layer") lazy VMAP tlb flushing was added to the vmalloc layer. This
    causes problems on sparc64.

    Sparc64 has two VMAP mapped regions and they are not contiguous with
    each other. First we have the malloc mapping area, then another
    unrelated region, then the vmalloc region.

    This "another unrelated region" is where the firmware is mapped.

    If the lazy TLB flushing logic in the vmalloc code triggers after
    we've had both a module unload and a vfree or similar, it will pass an
    address range that goes from somewhere inside the malloc region to
    somewhere inside the vmalloc region, and thus covering the
    openfirmware area entirely.

    The sparc64 kernel learns about openfirmware's dynamic mappings in
    this region early in the boot, and then services TLB misses in this
    area. But openfirmware has some locked TLB entries which are not
    mentioned in those dynamic mappings and we should thus not disturb
    them.

    These huge lazy TLB flush ranges cause those openfirmware locked TLB
    entries to be removed, resulting in all kinds of problems including
    hard hangs and crashes during reboot/reset.

    Besides causing problems like this, such huge TLB flush ranges are
    also incredibly inefficient. A plea has been made with the author of
    the VMAP lazy TLB flushing code, but for now we'll put a safety guard
    into our flush_tlb_kernel_range() implementation.

    Since the implementation has become non-trivial, stop defining it as a
    macro and instead make it a function in a C source file.

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 18f38132528c3e603c66ea464727b29e9bbcb91b ]

    The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
    only be called when the PTE being installed will be accessible by the user.

    This is not true for code paths originating from remove_migration_pte().

    There are dire consequences for placing a non-valid PTE into the TSB. The TLB
    miss framework assumes that when a TSB entry matches we can just load it into
    the TLB and return from the TLB miss trap.

    So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
    and over, never satisfying the miss.

    Just exit early from update_mmu_cache() and friends in this situation.

    Based upon a report and patch from Christopher Alexander Tobias Schulze.

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 26053926feb1c16ade9c30bc7443bf28d829d08e ]

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit 757efd32d5ce31f67193cc0e6a56e4dffcc42fb1 ]

    Dave reported the following splat, caused by improper use of
    IP_INC_STATS_BH() in process context.

    BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
    caller is __this_cpu_preempt_check+0x13/0x20
    CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
    ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
    0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
    ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] check_preemption_disabled+0xfa/0x100
    [] __this_cpu_preempt_check+0x13/0x20
    [] sctp_packet_transmit+0x692/0x710 [sctp]
    [] sctp_outq_flush+0x2a2/0xc30 [sctp]
    [] ? mark_held_locks+0x7c/0xb0
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] sctp_outq_uncork+0x1a/0x20 [sctp]
    [] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
    [] sctp_do_sm+0xdb/0x330 [sctp]
    [] ? preempt_count_sub+0xab/0x100
    [] ? sctp_cname+0x70/0x70 [sctp]
    [] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
    [] sctp_sendmsg+0x88f/0xe30 [sctp]
    [] ? lock_release_holdtime.part.28+0x9a/0x160
    [] ? put_lock_stats.isra.27+0xe/0x30
    [] inet_sendmsg+0x104/0x220
    [] ? inet_sendmsg+0x5/0x220
    [] sock_sendmsg+0x9e/0xe0
    [] ? might_fault+0xb9/0xc0
    [] ? might_fault+0x5e/0xc0
    [] SYSC_sendto+0x124/0x1c0
    [] ? syscall_trace_enter+0x250/0x330
    [] SyS_sendto+0xe/0x10
    [] tracesys+0xdd/0xe2

    This is a follow-up to commits f1d8cba61c3c4b ("inet: fix possible
    seqlock deadlocks") and 7f88c6b23afbd315 ("ipv6: fix possible seqlock
    deadlock in ip6_finish_output2")

    Signed-off-by: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Reported-by: Dave Jones
    Acked-by: Neil Horman
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit d9124268d84a836f14a6ead54ff9d8eee4c43be5 ]

    batadv_frag_insert_packet was unable to handle out-of-order packets because it
    dropped them directly. This is caused by the way the fragmentation list is
    checked for the correct place to insert a fragmentation entry.

    The fragmentation code keeps the fragments in lists. The fragmentation entries
    are kept in descending order of sequence number. The list is traversed and each
    entry is compared with the new fragment. If the current entry has a smaller
    sequence number than the new fragment then the new one has to be inserted
    before the current entry. This ensures that the list is still in descending
    order.

    An out-of-order packet with a smaller sequence number than all entries in the
    list still has to be added to the end of the list. The hlist used here keeps no
    information about the last entry inside hlist_head, and thus the
    last entry has to be determined differently. Currently the code assumes that
    the iterator variable of hlist_for_each_entry can be used for this purpose
    after hlist_for_each_entry has finished. This is wrong because the
    iterator variable is always NULL once the list has been completely traversed.

    Instead the information about the last entry has to be stored in a different
    variable.

    This problem was introduced in 610bfc6bc99bc83680d190ebc69359a05fc7f605
    ("batman-adv: Receive fragmented packets and merge").

    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli
    Signed-off-by: Greg Kroah-Hartman

    Sven Eckelmann
     
  • [ Upstream commit 06ebb06d49486676272a3c030bfeef4bd969a8e6 ]

    Check for cases when the caller requests 0 bytes instead of running off
    and dereferencing potentially invalid iovecs.

    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
     
  • [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ]

    When performing segmentation, the mac_len value is copied right
    out of the original skb. However, this value is not always set correctly
    (like when the packet is VLAN-tagged) and we'll end up copying a bad
    value.

    One way to demonstrate this is to configure a VM which tags
    packets internally and turn off VLAN acceleration on the forwarding
    bridge port. The packets show up corrupt like this:
    16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
    (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
    0x0000: 8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
    0x0010: 0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
    0x0020: 29e3 01c9 f871 0000 0101 080a 000a e833 )....q.........3
    0x0030: 000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
    0x0040: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
    0x0050: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
    0x0060: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
    ...

    This also leads to awful throughput as GSO packets are dropped and
    cause retransmissions.

    The solution is to set the mac_len using the values already available
    in the new skb. We've already adjusted all of the header offsets, so we
    might as well correctly figure out the mac_len using skb_reset_mac_len().
    After this change, packets are segmented correctly and performance
    is restored.

    CC: Eric Dumazet
    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vlad Yasevich
     
  • [ Upstream commit 081e83a78db9b0ae1f5eabc2dedecc865f509b98 ]

    Macvlan devices do not initialize vlan_features. As a result,
    any vlan devices configured on top of macvlans perform very poorly.
    Initialize vlan_features based on the vlan features of the lower-level
    device.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vlad Yasevich
     
  • [ Upstream commit 1be9a950c646c9092fb3618197f7b6bfb50e82aa ]

    Jason reported an oops caused by SCTP on his ARM machine with
    SCTP authentication enabled:

    Internal error: Oops: 17 [#1] ARM
    CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1
    task: c6eefa40 ti: c6f52000 task.ti: c6f52000
    PC is at sctp_auth_calculate_hmac+0xc4/0x10c
    LR is at sg_init_table+0x20/0x38
    pc : [] lr : [] psr: 40000013
    sp : c6f538e8 ip : 00000000 fp : c6f53924
    r10: c6f50d80 r9 : 00000000 r8 : 00010000
    r7 : 00000000 r6 : c7be4000 r5 : 00000000 r4 : c6f56254
    r3 : c00c8170 r2 : 00000001 r1 : 00000008 r0 : c6f1e660
    Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
    Control: 0005397f Table: 06f28000 DAC: 00000015
    Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
    Stack: (0xc6f538e8 to 0xc6f54000)
    [...]
    Backtrace:
    [] (sctp_auth_calculate_hmac+0x0/0x10c) from [] (sctp_packet_transmit+0x33c/0x5c8)
    [] (sctp_packet_transmit+0x0/0x5c8) from [] (sctp_outq_flush+0x7fc/0x844)
    [] (sctp_outq_flush+0x0/0x844) from [] (sctp_outq_uncork+0x24/0x28)
    [] (sctp_outq_uncork+0x0/0x28) from [] (sctp_side_effects+0x1134/0x1220)
    [] (sctp_side_effects+0x0/0x1220) from [] (sctp_do_sm+0xac/0xd4)
    [] (sctp_do_sm+0x0/0xd4) from [] (sctp_assoc_bh_rcv+0x118/0x160)
    [] (sctp_assoc_bh_rcv+0x0/0x160) from [] (sctp_inq_push+0x6c/0x74)
    [] (sctp_inq_push+0x0/0x74) from [] (sctp_rcv+0x7d8/0x888)

    While we already had various kind of bugs in that area
    ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
    we/peer is AUTH capable") and b14878ccb7fa ("net: sctp: cache
    auth_enable per endpoint"), this one is a bit of a different
    kind.

    Giving a bit more background on why SCTP authentication is
    needed can be found in RFC4895:

    SCTP uses 32-bit verification tags to protect itself against
    blind attackers. These values are not changed during the
    lifetime of an SCTP association.

    Looking at new SCTP extensions, there is the need to have a
    method of proving that an SCTP chunk(s) was really sent by
    the original peer that started the association and not by a
    malicious attacker.

    To cause this bug, we're triggering an INIT collision between
    peers; normal SCTP handshake where both sides intent to
    authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
    parameters that are being negotiated among peers:

    ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
    <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
    -------------------- COOKIE-ECHO -------------------->
    <-------------------- COOKIE-ACK ---------------------

    ...

    Since such collisions can also happen with verification tags,
    the RFC4895 for AUTH rather vaguely says under section 6.1:

    In case of INIT collision, the rules governing the handling
    of this Random Number follow the same pattern as those for
    the Verification Tag, as explained in Section 5.2.4 of
    RFC 2960 [5]. Therefore, each endpoint knows its own Random
    Number and the peer's Random Number after the association
    has been established.

    In RFC2960, section 5.2.4, we're eventually hitting Action B:

    B) In this case, both sides may be attempting to start an
    association at about the same time but the peer endpoint
    started its INIT after responding to the local endpoint's
    INIT. Thus it may have picked a new Verification Tag not
    being aware of the previous Tag it had sent this endpoint.
    The endpoint should stay in or enter the ESTABLISHED
    state but it MUST update its peer's Verification Tag from
    the State Cookie, stop any init or cookie timers that may be
    running and send a COOKIE ACK.

    In other words, the handling of the Random parameter is the
    same as behavior for the Verification Tag as described in
    Action B of section 5.2.4.

    Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
    case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
    side effect interpreter, and in fact it properly copies over
    peer_{random, hmacs, chunks} parameters from the newly created
    association to update the existing one.

    Also, the old asoc_shared_key is being released and based on
    the new params, sctp_auth_asoc_init_active_key() updated.
    However, the issue observed in this case is that the previous
    asoc->peer.auth_capable was 0, and has *not* been updated, so
    that instead of creating a new secret, we're doing an early
    return from the function sctp_auth_asoc_init_active_key()
    leaving asoc->asoc_shared_key as NULL. However, we now have to
    authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).

    That in fact causes the server side when responding with ...

    active_key_id is still inherited from the
    endpoint, and the same as encoded into the chunk, it uses
    asoc->asoc_shared_key, which is still NULL, as an asoc_key
    and dereferences it in ...

    crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)

    ... causing an oops. All this happens because sctp_make_cookie_ack()
    called with the *new* association has the peer.auth_capable=1
    and therefore marks the chunk with auth=1 after checking
    sctp_auth_send_cid(), but it is *actually* sent later on over
    the then *updated* association's transport that didn't initialize
    its shared key due to peer.auth_capable=0. Since control chunks
    in that case are not sent by the temporary association which
    are scheduled for deletion, they are issued for xmit via
    SCTP_CMD_REPLY in the interpreter with the context of the
    *updated* association. peer.auth_capable was 0 in the updated
    association (which went from COOKIE_WAIT into ESTABLISHED state),
    since all previous processing that performed sctp_process_init()
    was being done on temporary associations, that we eventually
    throw away each time.

    The correct fix is to update to the new peer.auth_capable
    value as well in the collision case via sctp_assoc_update(),
    so that in case the collision migrated from 0 -> 1,
    sctp_auth_asoc_init_active_key() can properly recalculate
    the secret. This therefore fixes the observed server panic.

    Fixes: 730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing")
    Reported-by: Jason Gunthorpe
    Signed-off-by: Daniel Borkmann
    Tested-by: Jason Gunthorpe
    Cc: Vlad Yasevich
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit c36c9d50cc6af5c5bfcc195f21b73f55520c15f9 ]

    The recent commit "e29aa33 bna: Enable Multi Buffer RX" is causing
    a performance regression. It does not properly update the 'cmpl' pointer
    at the end of the loop in the NAPI handler bnad_cq_process(). The result is
    that only one packet per NAPI schedule is processed.

    Signed-off-by: Ivan Vecera
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ivan Vecera
     
  • [ Upstream commit 1f74e613ded11517db90b2bd57e9464d9e0fb161 ]

    In vegas we do a multiplication of the cwnd and the rtt. This
    may overflow, and thus the result is stored in a u64. However, we first
    need to cast the cwnd so that the multiplication is actually done in
    64-bit arithmetic.

    Then, we need to use do_div() to allow this to be used on 32-bit arches.

    Cc: Stephen Hemminger
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: David Laight
    Cc: Doug Leith
    Fixes: 8d3a564da34e (tcp: tcp_vegas cong avoid fix)
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christoph Paasch
     
  • [ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ]

    In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
    Jedidiah describe ways exploiting linux IP identifier generation to
    infer whether two machines are exchanging packets.

    With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
    changed IP id generation, but this does not really prevent this
    side-channel technique.

    This patch adds a random amount of perturbation so that IP identifiers
    for a given destination [1] are no longer monotonically increasing after
    an idle period.

    Note that prandom_u32_max(1) returns 0, so if the generator is used at most
    once per jiffy, this patch inserts no hole in the ID sequence and does not
    increase collision probability.

    This is jiffies based, so in the worst case (HZ=1000), the id can
    rollover after ~65 seconds of idle time, which should be fine.

    We also change the hash used in __ip_select_ident() to hash not only
    on daddr, but also on saddr and protocol, so that ICMP probes cannot be
    used to infer information for other protocols.

    For IPv6, saddr is added into the hash as well, but not nexthdr.

    If I ping the patched target, we can see the IDs are now hard to predict.

    21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
    21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

    21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
    21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

    21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
    21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

    [1] TCP sessions use a per-flow ID generator that is not changed by this patch.

    Signed-off-by: Eric Dumazet
    Reported-by: Jeffrey Knockel
    Reported-by: Jedidiah R. Crandall
    Cc: Willy Tarreau
    Cc: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ]

    Ideally, we would need to generate the IP ID using a per-destination
    generator.

    Linux kernels used the inet_peer cache for this purpose, but this had a huge
    cost on servers with MTU discovery disabled.

    1) each inet_peer struct consumes 192 bytes

    2) inetpeer cache uses a binary tree of inet_peer structs,
    with a nominal size of ~66000 elements under load.

    3) lookups in this tree are hitting a lot of cache lines, as tree depth
    is about 20.

    4) If server deals with many tcp flows, we have a high probability of
    not finding the inet_peer, allocating a fresh one, inserting it in
    the tree with same initial ip_id_count, (cf secure_ip_id())

    5) We garbage collect inet_peer aggressively.

    IP ID generation does not have to be 'perfect'.

    The goal is to avoid duplicates over a short period of time, so that
    reassembly units have a chance to complete reassembly of fragments
    belonging to one message before receiving other fragments with a
    recycled ID.

    We simply use an array of generators, and a Jenkins hash using the dst IP
    as a key.

    ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
    belongs (it is only used from this file)

    secure_ip_id() and secure_ipv6_id() no longer are needed.

    Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
    unnecessary decrement/increment of the number of segments.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 45a07695bc64b3ab5d6d2215f9677e5b8c05a7d0 ]

    In veno we do a multiplication of the cwnd and the rtt. This
    may overflow, and thus the result is stored in a u64. However, we first
    need to cast the cwnd so that the multiplication is actually done in
    64-bit arithmetic.

    A first attempt at fixing 76f1017757aa0 ([TCP]: TCP Veno congestion
    control) was made by 159131149c2 (tcp: Overflow bug in Vegas), but it
    failed to add the required cast in tcp_veno_cong_avoid().

    Fixes: 76f1017757aa0 ([TCP]: TCP Veno congestion control)
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christoph Paasch
     
  • [ Upstream commit 95cb5745983c222867cc9ac593aebb2ad67d72c0 ]

    Ipv4 tunnels created with "local any remote $ip" didn't work properly since
    7d442fab0 (ipv4: Cache dst in tunnels). 99% of packets sent via those tunnels
    had src addr = 0.0.0.0. That was because only dst_entry was cached, although
    fl4.saddr has to be cached too. Every time ip_tunnel_xmit used cached dst_entry
    (tunnel_rtable_get returned non-NULL), fl4.saddr was initialized with
    tnl_params->saddr (= 0 in our case), and wasn't changed until iptunnel_xmit().

    This patch adds saddr to ip_tunnel->dst_cache, fixing this issue.

    Reported-by: Sergey Popov
    Signed-off-by: Dmitry Popov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Popov
     
  • [ Upstream commit d92f5dec6325079c550889883af51db1b77d5623 ]

    Commit 87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()")
    moved the call to phy_scan_fixups() in phy_init_hw() after a software
    reset is performed.

    By the time phy_init_hw() is called in phy_device_register(), no driver
    has been bound to this PHY yet, so all the checks in phy_init_hw()
    against the PHY driver and the PHY driver's config_init function will
    return 0. We will therefore never call phy_scan_fixups() as we should.

    Fix this by calling phy_scan_fixups() and checking its return value to
    restore the intended functionality.

    This broke PHY drivers which do register an early PHY fixup callback to
    intercept the PHY probing and do things like changing the 32-bit unique
    PHY identifier when a pseudo-PHY address has been used, as well as
    board-specific PHY fixups that need to be applied during driver probe
    time.

    Reported-by: Hauke Merthens
    Reported-by: Jonas Gorski
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     
  • [ Upstream commit 40eea803c6b2cfaab092f053248cbeab3f368412 ]

    Sasha's report:
    > While fuzzing with trinity inside a KVM tools guest running the latest -next
    > kernel with the KASAN patchset, I've stumbled on the following spew:
    >
    > [ 4448.949424] ==================================================================
    > [ 4448.951737] AddressSanitizer: user-memory-access on address 0
    > [ 4448.952988] Read of size 2 by thread T19638:
    > [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
    > [ 4448.956823] ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
    > [ 4448.958233] ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
    > [ 4448.959552] 0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
    > [ 4448.961266] Call Trace:
    > [ 4448.963158] dump_stack (lib/dump_stack.c:52)
    > [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
    > [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
    > [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
    > [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
    > [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
    > [ 4448.970103] sock_sendmsg (net/socket.c:654)
    > [ 4448.971584] ? might_fault (mm/memory.c:3741)
    > [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
    > [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
    > [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
    > [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
    > [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
    > [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
    > [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
    > [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
    > [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
    > [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
    > [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
    > [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
    > [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
    > [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
    > [ 4448.988929] ==================================================================

    This report means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.

    After this report there was no usual "Unable to handle kernel NULL pointer dereference"
    message, and this gave me a clue that address 0 is mapped and contains a valid
    socket address structure.

    This bug was introduced in f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
    (net: rework recvmsg handler msg_name and msg_namelen logic).
    Commit message states that:
    "Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
    non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
    affect sendto as it would bail out earlier while trying to copy-in the
    address."
    But in fact this affects sendto when address 0 is mapped and contains
    a socket address structure. In such a case the address copy-in will succeed,
    the verify_iovec() function will successfully exit with msg->msg_namelen > 0
    and msg->msg_name == NULL.

    This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.

    Cc: Hannes Frederic Sowa
    Cc: Eric Dumazet
    Cc:
    Reported-by: Sasha Levin
    Signed-off-by: Andrey Ryabinin
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • [ Upstream commit fe26566d8a05151ba1dce75081f6270f73ec4ae1 ]

    When a TSO packet is transmitted, an additional BD without a mapping is
    used to describe the packet. This BD needs special handling in tx
    completion.

    kernel: Call Trace:
    kernel: [] dump_stack+0x19/0x1b
    kernel: [] warn_slowpath_common+0x61/0x80
    kernel: [] warn_slowpath_fmt+0x5c/0x80
    kernel: [] ? find_iova+0x4d/0x90
    kernel: [] intel_unmap_page.part.36+0x142/0x160
    kernel: [] intel_unmap_page+0x26/0x30
    kernel: [] bnx2x_free_tx_pkt+0x157/0x2b0 [bnx2x]
    kernel: [] bnx2x_tx_int+0xac/0x220 [bnx2x]
    kernel: [] ? read_tsc+0x9/0x20
    kernel: [] bnx2x_poll+0xbb/0x3c0 [bnx2x]
    kernel: [] net_rx_action+0x15a/0x250
    kernel: [] __do_softirq+0xf7/0x290
    kernel: [] call_softirq+0x1c/0x30
    kernel: [] do_softirq+0x55/0x90
    kernel: [] irq_exit+0x115/0x120
    kernel: [] do_IRQ+0x58/0xf0
    kernel: [] common_interrupt+0x6d/0x6d
    kernel: [] ? clockevents_notify+0x127/0x140
    kernel: [] ? cpuidle_enter_state+0x4f/0xc0
    kernel: [] cpuidle_idle_call+0xc5/0x200
    kernel: [] arch_cpu_idle+0xe/0x30
    kernel: [] cpu_startup_entry+0xf5/0x290
    kernel: [] start_secondary+0x265/0x27b
    kernel: ---[ end trace 11aa7726f18d7e80 ]---

    Fixes: a848ade408b ("bnx2x: add CSUM and TSO support for encapsulation protocols")
    Reported-by: Yulong Pei
    Cc: Michal Schmidt
    Signed-off-by: Dmitry Kravkov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Kravkov
     
  • [ Upstream commit a0e5ef53aac8e5049f9344857d8ec5237d31e58b ]

    The SPI check introduced in ea9884b3acf3311c8a11db67bfab21773f6f82ba
    was intended for IPComp SAs but actually prevented AH SAs from getting
    installed (depending on the SPI).

    Fixes: ea9884b3acf3 ("xfrm: check user specified spi for IPComp")
    Cc: Fan Du
    Signed-off-by: Tobias Brunner
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Tobias Brunner
     
  • [ Upstream commit b7eea4545ea775df957460f58eb56085a8892856 ]

    xfrm_lookup must return a dst_entry with a refcount for the caller.
    Git commit 1a1ccc96abb ("xfrm: Remove caching of xfrm_policy_sk_bundles")
    removed this refcount for the socket policy case accidentally.
    This patch restores it and sets DST_NOCACHE flag to make sure
    that the dst_entry is freed when the refcount becomes null.
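    A minimal stand-alone sketch of the restored behavior, with toy stand-ins for the dst_entry, dst_hold() and the DST_NOCACHE flag (none of these are the kernel's real definitions):

```c
#include <assert.h>

/* Toy flag value; the real DST_NOCACHE lives in include/net/dst.h. */
#define DST_NOCACHE 0x1

struct dst {
    int refcnt;
    int flags;
};

/* Sketch of the restored behavior: before handing the socket-policy
 * bundle to the caller, take a reference (as dst_hold() would) and
 * mark it DST_NOCACHE so it is freed once that reference is dropped. */
static struct dst *return_sk_bundle(struct dst *d)
{
    d->refcnt++;             /* caller now owns a reference */
    d->flags |= DST_NOCACHE; /* free when the refcount reaches zero */
    return d;
}
```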

    Fixes: 1a1ccc96abb ("xfrm: Remove caching of xfrm_policy_sk_bundles")
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Steffen Klassert
     
  • [ Upstream commit 474ea9cafc459976827a477f2c30eaf6313cb7c1 ]

    Packets shorter than ETH_ZLEN were not padded with zeroes, hence leaking
    potentially sensitive information. This bug has been present since the
    driver got accepted in commit 1c1008c793fa46703a2fee469f4235e1c7984333
    ("net: bcmgenet: add main driver file").
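    In the kernel this kind of fix typically zero-pads the frame before transmission (e.g. via skb_padto()); a buffer-level sketch of the idea, with illustrative names:

```c
#include <assert.h>
#include <string.h>

#define ETH_ZLEN 60 /* minimum Ethernet frame length, excluding FCS */

/* Sketch: pad a short frame to ETH_ZLEN with zeroes so stale buffer
 * contents are never transmitted. buf must have room for ETH_ZLEN
 * bytes; returns the (possibly grown) frame length. */
static int pad_short_frame(unsigned char *buf, int len)
{
    if (len < ETH_ZLEN) {
        memset(buf + len, 0, ETH_ZLEN - len); /* zero, don't leak */
        len = ETH_ZLEN;
    }
    return len;
}
```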

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     

08 Aug, 2014

13 commits

  • Greg Kroah-Hartman
     
  • commit 8762e5092828c4dc0f49da5a47a644c670df77f3 upstream.

    init_espfix_ap() is currently off by one level when informing the hypervisor
    that the allocated pages will be used for the ministacks' page tables.

    The most immediate effect of this on a PV guest is that if
    'stack_page = __get_free_page()' returns a non-zeroed-out page the hypervisor
    will refuse to use it for a page table (which it shouldn't be anyway). This will
    result in warnings by both Xen and Linux.

    More importantly, a subsequent write to that page (again, by a PV guest) is
    likely to result in a fatal page fault.

    Signed-off-by: Boris Ostrovsky
    Link: http://lkml.kernel.org/r/1404926298-5565-1-git-send-email-boris.ostrovsky@oracle.com
    Reviewed-by: Konrad Rzeszutek Wilk
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit c75b53af2f0043aff500af0a6f878497bef41bca upstream.

    I use btree from 3.14-rc2 in my own module. When the btree module is
    removed, a warning arises:

    kmem_cache_destroy btree_node: Slab cache still has objects
    CPU: 13 PID: 9150 Comm: rmmod Tainted: GF O 3.14.0-rc2 #1
    Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013
    Call Trace:
    dump_stack+0x49/0x5d
    kmem_cache_destroy+0xcf/0xe0
    btree_module_exit+0x10/0x12 [btree]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    The cause is that it doesn't release the last btree node when height = 1
    and fill = 1.
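    The shape of the leak and of the fix can be sketched with toy stand-ins for the btree head and its node mempool (names are illustrative, not lib/btree.c's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Count frees so the test can observe whether the node was released. */
static int freed_nodes;

struct toy_head {
    void *node; /* last remaining node when height = 1, fill = 1 */
};

static void toy_free(void *node)
{
    if (node) {
        free(node);
        freed_nodes++;
    }
}

/* Sketch of the fix: on destroy, release the root node that the
 * removal path leaves behind, instead of only clearing the pointer. */
static void toy_btree_destroy(struct toy_head *head)
{
    toy_free(head->node); /* this free was the missing piece */
    head->node = NULL;
}
```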

    [akpm@linux-foundation.org: remove unneeded test of NULL]
    Signed-off-by: Minfei Huang
    Cc: Joern Engel
    Cc: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minfei Huang
     
  • commit 3cf521f7dc87c031617fd47e4b7aa2593c2f3daf upstream.

    The l2tp [get|set]sockopt() code has fallen back to the UDP functions
    for socket option levels != SOL_PPPOL2TP since day one, but that has
    never actually worked, since the l2tp socket isn't an inet socket.

    As David Miller points out:

    "If we wanted this to work, it'd have to look up the tunnel and then
    use tunnel->sk, but I wonder how useful that would be"

    Since this can never have worked, nobody could possibly have depended on
    that functionality; just remove the broken code and return -EINVAL.
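    The resulting behavior amounts to a simple option-level check (the SOL_PPPOL2TP value below matches the uapi headers, but treat the helper itself as an illustrative sketch):

```c
#include <assert.h>
#include <errno.h>

#define SOL_PPPOL2TP 273 /* from the uapi socket headers */

/* Sketch of the fix: instead of falling through to UDP handlers that
 * assume an inet socket, reject any option level other than
 * SOL_PPPOL2TP outright. */
static int pppol2tp_sockopt_level_check(int level)
{
    if (level != SOL_PPPOL2TP)
        return -EINVAL; /* previously fell through to the broken UDP path */
    return 0;
}
```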

    Reported-by: Sasha Levin
    Acked-by: James Chapman
    Acked-by: David Miller
    Cc: Phil Turnbull
    Cc: Vegard Nossum
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
     
  • commit 17290231df16eeee5dfc198dbf5ee4b419996dcd upstream.

    There are two FIXMEs in the double exception handler 'for the extremely
    unlikely case'. This case gets hit by gcc during a kernel build once every
    few hours, resulting in an unrecoverable exception condition.

    Provide missing fixup routine to handle this case. Double exception
    literals now need 8 more bytes, add them to the linker script.

    Also replace bbsi instructions with bbsi.l as we're branching depending
    on the 8th and 7th LSB-based bits of the exception address.

    This may be tested by adding the explicit DTLB invalidation to window
    overflow handlers, like the following:

    --- a/arch/xtensa/kernel/vectors.S
    +++ b/arch/xtensa/kernel/vectors.S
    @@ -592,6 +592,14 @@ ENDPROC(_WindowUnderflow4)
     ENTRY_ALIGN64(_WindowOverflow8)

         s32e   a0, a9, -16
    +    bbsi.l a9, 31, 1f
    +    rsr    a0, ccount
    +    bbsi.l a0, 4, 1f
    +    pdtlb  a0, a9
    +    idtlb  a0
    +    movi   a0, 9
    +    idtlb  a0
    +1:
         l32e   a0, a1, -12
         s32e   a2, a9, -8
         s32e   a1, a9, -12

    Signed-off-by: Max Filippov
    Signed-off-by: Greg Kroah-Hartman

    Max Filippov
     
  • commit ea9f9274bf4337ba7cbab241c780487651642d63 upstream.

    Remove xen_enable_nmi() to fix a 64-bit guest crash when registering
    the NMI callback on Xen 3.1 and earlier.

    It's not needed since the NMI callback is set by a set_trap_table
    hypercall (in xen_load_idt() or xen_write_idt_entry()).

    It's also broken since it only sets the current VCPU's callback.

    Signed-off-by: David Vrabel
    Reported-by: Vitaly Kuznetsov
    Tested-by: Vitaly Kuznetsov
    Cc: Steven Noonan
    Signed-off-by: Greg Kroah-Hartman

    David Vrabel
     
  • commit 724cb06fa9b1e1ffd98188275543fdb3b8eaca4f upstream.

    commit c675949ec58ca50d5a3ae3c757892f1560f6e896
    ("drm/i915: do not setup backlight if not available according to VBT")

    caused a regression on the HP Chromebook 14 (with Celeron 2955U CPU),
    which has a misconfigured VBT. Apply quirk to ignore the VBT backlight
    presence check during backlight setup.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=79813
    Signed-off-by: Scot Doyle
    Tested-by: Stefan Nagy
    Cc: Jani Nikula
    Cc: Daniel Vetter
    Signed-off-by: Daniel Vetter
    Signed-off-by: Greg Kroah-Hartman

    Scot Doyle
     
  • commit 6cff1f6ad4c615319c1a146b2aa0af1043c5e9f5 upstream.

    WARNING: CPU: 0 PID: 929 at /home/apw/COD/linux/kernel/irq/handle.c:147 handle_irq_event_percpu+0x1d1/0x1e0()
    irq 17 handler device_intr+0x0/0xa80 [vt6655_stage] enabled interrupts

    Using spin_lock_irqsave appears to fix this.

    Signed-off-by: Malcolm Priestley
    Signed-off-by: Greg Kroah-Hartman

    Malcolm Priestley
     
  • commit e120fb459693bbc1ac3eabdd65c3659d7cfbfd2a upstream.

    After clarification from the hardware team it was found that
    this 1.8V PHY supply can't be switched OFF when the SoC is active.

    Since the PHY IPs don't contain isolation logic built into the design to
    allow the power rail to be switched off, there is a very high risk
    to IP reliability, plus additional leakage paths which can result in
    additional power consumption.

    The only scenario where this rail can be switched off is as part of the
    power-on reset sequencing, but it needs to be kept always-on during operation.

    This patch is required for proper functionality of USB, SATA
    and PCIe on DRA7-evm.

    CC: Rajendra Nayak
    CC: Tero Kristo
    Signed-off-by: Roger Quadros
    Signed-off-by: Tony Lindgren
    Signed-off-by: Greg Kroah-Hartman

    Roger Quadros
     
  • commit 6d2b6170c8914c6c69256b687651fb16d7ec3e18 upstream.

    Fix the broken check for calling sys_fallocate() on an active swapfile,
    introduced by commit 0790b31b69374ddadefe ("fs: disallow all fallocate
    operation on active swapfile").

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 23d9cec07c589276561c13b180577c0b87930140 upstream.

    The DRA74/72 control module pins have a weak pull up and pull down.
    This is configured by bit offset 17: if BIT(17) is 1, a pull up is
    selected, else a pull down is selected.

    However, this pull resistor is applied based on BIT(16) -
    PULLUDENABLE - if BIT(16) is *0*, then the pull as defined in BIT(17) is
    applied, else no weak pulls are applied. We defined this in reverse.

    Reference: Table 18-5 (Description of the pad configuration register
    bits) in Technical Reference Manual Revision (DRA74x revision Q:
    SPRUHI2Q Revised June 2014 and DRA72x revision F: SPRUHP2F - Revised
    June 2014)
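    The corrected bit semantics can be sketched as small helpers over a pad configuration value (the macro and function names below are illustrative, chosen to mirror the TRM description rather than the kernel's headers):

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Per the TRM: BIT(16) = 0 enables the weak pull, and BIT(17) then
 * selects its direction (1 = pull-up, 0 = pull-down). */
#define PULL_DIS BIT(16) /* set to disable weak pulls entirely */
#define PULL_UP  BIT(17) /* select pull-up when pulls are enabled */

static uint32_t pad_pull_up(uint32_t conf)
{
    return (conf & ~PULL_DIS) | PULL_UP; /* pulls on, direction up */
}

static uint32_t pad_pull_down(uint32_t conf)
{
    return conf & ~(PULL_DIS | PULL_UP); /* pulls on, direction down */
}

static uint32_t pad_pull_none(uint32_t conf)
{
    return conf | PULL_DIS; /* no weak pulls applied */
}
```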

    Fixes: 6e58b8f1daaf1a ("ARM: dts: DRA7: Add the dts files for dra7 SoC and dra7-evm board")
    Signed-off-by: Nishanth Menon
    Tested-by: Felipe Balbi
    Acked-by: Felipe Balbi
    Signed-off-by: Tony Lindgren
    Signed-off-by: Greg Kroah-Hartman

    Nishanth Menon
     
  • commit 7209a75d2009dbf7745e2fd354abf25c3deb3ca3 upstream.

    This moves the espfix64 logic into native_iret. To make this work,
    it gets rid of the native patch for INTERRUPT_RETURN:
    INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.

    This changes the 16-bit SS behavior on Xen from OOPSing to leaking
    some bits of the Xen hypervisor's RSP (I think).

    [ hpa: this is a nonzero cost on native, but probably not enough to
    measure. Xen needs to fix this in their own code, probably doing
    something equivalent to espfix64. ]

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 34273f41d57ee8d854dcd2a1d754cbb546cb548f upstream.

    Embedded systems, which may be very memory-size-sensitive, are
    extremely unlikely to ever encounter any 16-bit software, so make it
    a CONFIG_EXPERT option to turn off support for any 16-bit software
    whatsoever.

    Signed-off-by: H. Peter Anvin
    Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin