10 Mar, 2014

3 commits

  • security_xfrm_policy_alloc can be called in atomic context so the
    allocation should be done with GFP_ATOMIC. Add an argument to let the
    callers choose the appropriate way. In order to do so a gfp argument
    needs to be added to the method xfrm_policy_alloc_security in struct
    security_operations and to the internal function
    selinux_xfrm_alloc_user. After that switch to GFP_ATOMIC in the atomic
    callers and leave GFP_KERNEL as before for the rest.
    The path that needed the gfp argument addition is:
    security_xfrm_policy_alloc -> security_ops.xfrm_policy_alloc_security ->
    all users of xfrm_policy_alloc_security (e.g. selinux_xfrm_policy_alloc) ->
    selinux_xfrm_alloc_user (here the allocation used to be GFP_KERNEL only)

    Now adding a gfp argument to selinux_xfrm_alloc_user requires us to also
    add it to security_context_to_sid which is used inside and prior to this
    patch did only GFP_KERNEL allocation. So add gfp argument to
    security_context_to_sid and adjust all of its callers as well.

    CC: Paul Moore
    CC: Dave Jones
    CC: Steffen Klassert
    CC: Fan Du
    CC: David S. Miller
    CC: LSM list
    CC: SELinux list

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Paul Moore
    Signed-off-by: Steffen Klassert

    Nikolay Aleksandrov
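
    The plumbing described above can be sketched in user space as follows. This is only a model under stated assumptions: `enum alloc_mode`, `model_alloc`, `model_alloc_user` and `model_policy_alloc` are hypothetical stand-ins for gfp_t and the kernel call chain, not the kernel functions themselves. The point is that the lowest-level allocator takes its mode from the top-level caller instead of hard-coding one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for gfp_t: only the distinction between
 * "may sleep" and "must not sleep" matters for this sketch. */
enum alloc_mode { ALLOC_MAY_SLEEP, ALLOC_ATOMIC };

/* Counts allocations that were allowed to sleep. */
static int sleeping_allocs;

/* Lowest level: the only place that actually allocates. */
static void *model_alloc(size_t size, enum alloc_mode mode)
{
    assert(size > 0);
    if (mode == ALLOC_MAY_SLEEP)
        sleeping_allocs++;
    return malloc(size);
}

/* Middle level: forwards the caller's mode instead of choosing one. */
static void *model_alloc_user(size_t ctx_len, enum alloc_mode mode)
{
    return model_alloc(ctx_len + 1, mode);
}

/* Top level: each caller states the context it runs in. */
static void *model_policy_alloc(size_t ctx_len, enum alloc_mode mode)
{
    return model_alloc_user(ctx_len, mode);
}
```

    A caller running in atomic context would pass ALLOC_ATOMIC, and every other caller would keep ALLOC_MAY_SLEEP, mirroring the GFP_ATOMIC / GFP_KERNEL split in the commit.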
     
  • There's a kmalloc with GFP_KERNEL in a helper
    (pfkey_sadb2xfrm_user_sec_ctx) used in pfkey_compile_policy which is
    called under rcu_read_lock. Adjust pfkey_sadb2xfrm_user_sec_ctx to have
    a gfp argument and adjust the users.

    CC: Dave Jones
    CC: Steffen Klassert
    CC: Fan Du
    CC: David S. Miller

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: Steffen Klassert

    Nikolay Aleksandrov
     
  • The pci shutdown handler added in:

    bnx2: Add pci shutdown handler
    commit 25bfb1dd4ba3b2d9a49ce9d9b0cd7be1840e15ed

    created a shutdown sequence without chip reset if the device was
    never brought up. This can cause the firmware to shut down the PHY
    prematurely and cause MMIO read cycles to be unresponsive. On some
    systems, it may generate an NMI in the bnx2's pci shutdown handler.

    The fix is to tell the firmware not to shut down the PHY if there was
    no prior chip reset.

    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Michael Chan
     

07 Mar, 2014

19 commits

  • DST_NOCOUNT should only be used if an authorized user adds routes
    locally. For routes which are added on behalf of router
    advertisements this flag must not be used, as it allows an unlimited
    number of routes to be added remotely.

    Signed-off-by: Sabrina Dubroca
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     
  • Currently, the PF call to pci_enable_sriov from the PF probe function
    stalls for 10 seconds times the number of VFs probed on the host. This
    happens because the only way for such VFs to determine whether the PF
    initialization has finished is to attempt a reset on the
    comm-channel and wait for it to time out (after 10s).

    The PF probe function is called from a kernel workqueue, and therefore
    during that time, rcu lock is being held and kernel's workqueue is
    stalled. This blocks other processes that try to use the workqueue
    or rcu lock. For example, interface renaming which is calling
    rcu_synchronize is blocked, and timed out by systemd.

    Changed mlx4_init_slave() to allow a VF probed on the host to immediately
    detect that the PF is not ready, and return EPROBE_DEFER instantly.

    Only when the PF finishes the initialization, allow such VFs to
    access the comm channel.

    This issue and fix are relevant only for probed VFs on the hypervisor.
    There is no way to pass this information to a VM until the comm channel is
    ready, so in a VM, if the PF is not ready, the first command will time out
    after 10 seconds and return EPROBE_DEFER.

    Signed-off-by: Amir Vadai
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Amir Vadai
     
  • Fix a regression introduced by [1]. outbox was accessed instead of
    outbox->buf. Typo was copy-pasted to [2] and [3].

    [1] - cc1ade9 mlx4_core: Disable memory windows for virtual functions
    [2] - 4de6580 mlx4_core: Add support for steerable IB UD QPs
    [3] - 7ffdf72 net/mlx4_core: Add basic support for TCP/IP offloads under
    tunneling

    Signed-off-by: Or Gerlitz
    Signed-off-by: Amir Vadai
    Signed-off-by: David S. Miller

    Amir Vadai
     
  • We didn't correctly check cases where the value for lp_interval is not
    within the legal range due to a missing table terminator.

    This would let userspace trigger a kernel panic by specifying a value out
    of range:

    echo -1 > /sys/devices/virtual/net/bond0/bonding/lp_interval

    Introduced by commit 4325b374f84 ("bonding: convert lp_interval to use
    the new option API").

    Acked-by: Nikolay Aleksandrov
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
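
    Why a missing terminator matters can be seen in a small sketch of a sentinel-terminated lookup table. The struct layout, the names `opt_value`, `interval_values` and `value_in_table`, and the accepted range are all assumptions made for illustration; they are not the bonding driver's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical option-value entry; a NULL name marks the end of a table. */
struct opt_value {
    const char *name;
    long min;
    long max;
};

/* The last entry is the terminator. Without it, a lookup that walks
 * until name == NULL would read past the end of the array. */
static const struct opt_value interval_values[] = {
    { "valid_range", 1, INT_MAX },   /* assumed range, for illustration */
    { NULL, 0, 0 },
};

/* Returns 1 if val falls inside any range in the table, 0 otherwise. */
static int value_in_table(const struct opt_value *tbl, long val)
{
    assert(tbl != NULL);
    for (; tbl->name != NULL; tbl++) {
        if (val >= tbl->min && val <= tbl->max)
            return 1;
    }
    return 0;
}
```

    With the terminator in place, an out-of-range value such as -1 is rejected after the walk reaches the NULL entry.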
     
  • Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS
    offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds).

    This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS
    header.

    The issue was detected and the fix verified by running the MS HCK Offload
    LSO test on top of QEMU Windows guests, as this test sends IPv6 packets with
    IPPROTO_DSTOPTS.

    Signed-off-by: Anton Nayshtut
    Signed-off-by: David S. Miller

    Anton Nayshtut
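
    The intended control flow of an init function that registers two handlers can be sketched as below. This is a user-space model, not the kernel code: `add_offload`, `del_offload`, `exthdrs_offload_init` and the `registered` array are hypothetical, and the commit message does not spell out the exact faulty line. The sketch only shows that the function must continue to the second registration when the first one succeeds, and unwind the first if the second fails.

```c
#include <assert.h>

#define MAX_PROTO 256
#define PROTO_ROUTING 43   /* value of IPPROTO_ROUTING */
#define PROTO_DSTOPTS 60   /* value of IPPROTO_DSTOPTS */

/* Hypothetical registry: 1 if a handler is registered for the protocol. */
static int registered[MAX_PROTO];

static int add_offload(int proto)
{
    assert(proto >= 0 && proto < MAX_PROTO);
    if (registered[proto])
        return -1;          /* slot already taken */
    registered[proto] = 1;
    return 0;
}

static void del_offload(int proto)
{
    assert(proto >= 0 && proto < MAX_PROTO);
    registered[proto] = 0;
}

static int exthdrs_offload_init(void)
{
    int ret;

    ret = add_offload(PROTO_ROUTING);
    if (ret)                /* leave early only on failure */
        goto out;

    ret = add_offload(PROTO_DSTOPTS);
    if (ret)
        goto out_rt;        /* undo the first registration */

out:
    return ret;

out_rt:
    del_offload(PROTO_ROUTING);
    goto out;
}
```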
     
  • The code to load a MAC address into a u64 for passing to the
    hypervisor via a register is broken on little endian.

    Create a helper function called ibmveth_encode_mac_addr
    which does the right thing in both big and little endian.

    We were storing the MAC address in a long in struct ibmveth_adapter.
    It's never used so remove it - we don't need another place in the
    driver where we create endian issues with MAC addresses.

    Signed-off-by: Anton Blanchard
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Anton Blanchard
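
    An endian-independent encoding of this kind can be sketched as follows. The function name `encode_mac_addr` and the use of uint64_t are choices made for this example and are not taken from the driver; the idea is simply that building the value with shifts depends on the byte order in the array, never on the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6

/* Pack the six MAC bytes into the low 48 bits of a 64-bit value,
 * first byte most significant. Shifts operate on values, not on
 * memory layout, so the result is the same on any host. */
static uint64_t encode_mac_addr(const uint8_t *mac)
{
    uint64_t encoded = 0;

    assert(mac != NULL);
    for (int i = 0; i < MAC_LEN; i++)
        encoded = (encoded << 8) | mac[i];
    return encoded;
}
```

    By contrast, copying the bytes into an integer with memcpy or a pointer cast gives a result that depends on host endianness.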
     
  • The unix socket code is using the result of csum_partial to
    hash into a lookup table:

    unix_hash_fold(csum_partial(sunaddr, len, 0));

    csum_partial is only guaranteed to produce something that can be
    folded into a checksum, as its prototype explains:

    * returns a 32-bit number suitable for feeding into itself
    * or csum_tcpudp_magic

    The 32bit value should not be used directly.

    Depending on the alignment, the ppc64 csum_partial will return
    different 32bit partial checksums that will fold into the same
    16bit checksum.

    This difference causes the following testcase (courtesy of
    Gustavo) to sometimes fail:

    #include <sys/socket.h>
    #include <stdio.h>

    int main()
    {
        int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        int i = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4);

        struct sockaddr addr;
        addr.sa_family = AF_LOCAL;
        bind(fd, &addr, 2);

        listen(fd, 128);

        struct sockaddr_storage ss;
        socklen_t sslen = (socklen_t)sizeof(ss);
        getsockname(fd, (struct sockaddr*)&ss, &sslen);

        fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
            perror(NULL);
            return 1;
        }
        printf("OK\n");
        return 0;
    }

    As suggested by davem, fix this by using csum_fold to fold the
    partial 32bit checksum into a 16bit checksum before using it.

    Signed-off-by: Anton Blanchard
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Anton Blanchard
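
    The folding step can be illustrated with a short sketch. `fold32` is a hypothetical helper written for this example; the kernel's csum_fold additionally complements the result, which this sketch leaves out because only the folding matters here. Two different 32-bit partial sums that represent the same one's-complement checksum fold to the same 16-bit value, which is why the folded value is safe to hash on and the raw 32-bit value is not.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement partial sum into 16 bits by adding
 * the upper half into the lower half. Two rounds are enough to absorb
 * the carry for any 32-bit input. */
static uint16_t fold32(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    assert(sum <= 0xffffu);
    return (uint16_t)sum;
}
```

    For example, the partial sums 0x00010002 and 0x00020001 differ as 32-bit values but fold to the same 16-bit result.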
     
  • The original documentation was very unclear.

    The code fix is presumably related to the formerly unclear
    documentation: SOCK_TIMESTAMPING_RX_SOFTWARE has no effect on
    __sock_recv_timestamp's behavior, so calling __sock_recv_ts_and_drops
    from sock_recv_ts_and_drops if only SOCK_TIMESTAMPING_RX_SOFTWARE is
    set is pointless. This should have no user-observable effect.

    Signed-off-by: Andy Lutomirski
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Andrew Lutomirski
     
  • With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the
    settings[] "array subscript is above array bounds", I think because idx is
    a signed integer and if the caller supplied idx < 0, we pass the guard but
    still reference out of bounds.

    Fix this by making idx unsigned here and elsewhere.

    Signed-off-by: Bjorn Helgaas
    Acked-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Bjorn Helgaas
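
    The effect of an unsigned index can be sketched as below. The table contents and the names `settings`, `NUM_SETTINGS` and `find_valid` are invented for this example and only loosely follow the shape described in the commit message. With an unsigned idx, a single upper-bound comparison covers every out-of-range value, so there is no negative case that could slip past the guard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical settings table: one feature bit per entry. */
struct setting { unsigned int feature; };

static const struct setting settings[] = {
    { 0x8 }, { 0x4 }, { 0x2 }, { 0x1 },
};
#define NUM_SETTINGS (sizeof(settings) / sizeof(settings[0]))

/* Starting at idx, return the index of the first entry whose feature
 * bit is set in features, or the last index if none matches. */
static unsigned int find_valid(unsigned int idx, unsigned int features)
{
    assert(features != 0);
    while (idx < NUM_SETTINGS && !(settings[idx].feature & features))
        idx++;
    return idx < NUM_SETTINGS ? idx : (unsigned int)NUM_SETTINGS - 1;
}
```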
     
  • Quoting Alexander Aring:
    While fragmentation and unloading of 6lowpan module I got this kernel Oops
    after few seconds:

    BUG: unable to handle kernel paging request at f88bbc30
    [..]
    Modules linked in: ipv6 [last unloaded: 6lowpan]
    Call Trace:
    [] ? call_timer_fn+0x54/0xb3
    [] ? process_timeout+0xa/0xa
    [] run_timer_softirq+0x140/0x15f

    Problem is that incomplete frags are still around after unload; when
    their frag expire timer fires, we get crash.

    When a netns is removed (also done when unloading module), inet_frag
    calls the evictor with 'force' argument to purge remaining frags.

    The evictor loop terminates when accounted memory ('work') drops to 0
    or the lru-list becomes empty. However, the mem accounting is done
    via percpu counters and may not be accurate, i.e. loop may terminate
    prematurely.

    Alter evictor to only stop once the lru list is empty when force is
    requested.

    Reported-by: Phoebe Buckheister
    Reported-by: Alexander Aring
    Tested-by: Alexander Aring
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
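
    The changed loop condition can be modelled in a few lines. Everything here is a simplification under stated assumptions: `struct frag`, `struct frag_list` and `evict` are hypothetical, each entry is assumed to cost one unit of 'work', and entries are only unlinked rather than freed. The sketch shows that with force set, the possibly inaccurate counter no longer ends the loop; only an empty list does.

```c
#include <assert.h>
#include <stddef.h>

struct frag { struct frag *next; };

struct frag_list {
    struct frag *head;
    long approx_mem;   /* may lag behind reality, like a per-CPU counter */
};

/* Unlink entries from the list and return how many were removed.
 * Without force, stop as soon as the accounted work reaches zero.
 * With force, keep going until the list is empty. */
static int evict(struct frag_list *l, int force)
{
    int evicted = 0;
    long work;

    assert(l != NULL);
    work = l->approx_mem;
    while (l->head != NULL) {
        if (!force && work <= 0)
            break;
        l->head = l->head->next;   /* unlink; real code would also free */
        work -= 1;
        evicted++;
    }
    return evicted;
}
```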
     
  • Erik Hugne says:

    ====================
    tipc: refcount and memory leak fixes

    v3: Remove error logging from data path completely. Rebased on top of
    latest net merge.

    v2: Drop specific -ENOMEM logging in patch #1 (tipc: allow connection
    shutdown callback to be invoked in advance) And add a general error
    message if an internal server tries to send a message on a
    closed/nonexisting connection.

    In addition to the fix for refcount leak and memory leak during
    module removal, we also fix a problem where the topology server
    listening socket was unexpectedly closed. We also eliminate an
    unnecessary context switch during accept()/recvmsg() for nonblocking
    sockets.

    It might be good to include this patchset in stable as well. After the
    v3 rebase on latest merge from net all patches apply cleanly on that
    tree.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Failure to schedule a TIPC tasklet with tipc_k_signal because the
    tasklet handler is disabled is not an error. It means TIPC is
    currently in the process of shutting down. We remove the error
    logging in this case.

    Signed-off-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • When the TIPC module is removed, the tasklet handler is disabled
    before all other subsystems. This will cause lingering publications
    in the name table because the node_down tasklets responsible to
    clean up publications from an unreachable node will never run.
    When the name table is shut down, these publications are detected
    and an error message is logged:
    tipc: nametbl_stop(): orphaned hash chain detected
    This is actually a memory leak, introduced with commit
    993b858e37b3120ee76d9957a901cca22312ffaa ("tipc: correct the order
    of stopping services at rmmod")

    Instead of just logging an error and leaking memory, we free
    the orphaned entries during nametable shutdown.

    Signed-off-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • When a topology server subscriber is disconnected, the associated
    connection id is set to zero. A check vs zero is then done in the
    subscription timeout function to see if the subscriber has been
    shut down. This is unnecessary, because all subscription timers
    will be cancelled when a subscriber terminates. Setting the
    connection id to zero is actually harmful because id zero is the
    identity of the topology server listening socket, and can cause a
    race that leads to this socket being closed instead.

    Signed-off-by: Erik Hugne
    Acked-by: Ying Xue
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • When messages are received via a tipc socket in non-blocking mode,
    schedule_timeout() is called in tipc_wait_for_rcvmsg(); that is,
    the receiving process is rescheduled once even though the
    timeout value passed to schedule_timeout() is 0. The same issue
    exists in accept()/wait_for_accept(). To avoid this unnecessary
    process switch, we only call schedule_timeout() if the timeout
    value is non-zero.

    Signed-off-by: Ying Xue
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • When tipc_conn_sendmsg() calls tipc_conn_lookup() to query a
    connection instance, its reference count value is increased if
    it's found. But subsequently if it's found that the connection is
    closed, the work of sending message is not queued into its server
    send workqueue, and the connection reference count is not decreased.
    This will cause a reference count leak. To reproduce this problem,
    an application would need to open and close topology server
    connections with high intensity.

    We fix this by immediately decrementing the connection reference
    count if a send fails due to the connection being closed.

    Signed-off-by: Ying Xue
    Acked-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
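
    The balanced get/put pattern behind this fix can be sketched as follows. `struct conn`, `conn_lookup`, `conn_put` and `conn_sendmsg` are hypothetical names for a user-space model, and the assumption that successfully queued work keeps the reference until it runs belongs to this sketch, not to the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct conn {
    int refcount;
    int connected;
    int queued;      /* number of send requests handed to the worker */
};

/* Lookup takes a reference on behalf of the caller. */
static struct conn *conn_lookup(struct conn *c)
{
    assert(c != NULL);
    c->refcount++;
    return c;
}

static void conn_put(struct conn *c)
{
    assert(c->refcount > 0);
    c->refcount--;
}

/* Every path that does not hand the reference on must drop it. */
static int conn_sendmsg(struct conn *entry)
{
    struct conn *c = conn_lookup(entry);

    if (!c->connected) {
        conn_put(c);       /* without this put, the reference leaks */
        return -1;
    }
    c->queued++;           /* assumed: the queued work now owns the ref */
    return 0;
}
```

    After a failed send on a closed connection the reference count is back at its previous value, so repeated open/close cycles no longer accumulate references.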
     
  • Currently connection shutdown callback function is called when
    connection instance is released in tipc_conn_kref_release(), and
    receiving packets and sending packets are running in different
    threads. Even if connection is closed by the thread of receiving
    packets, its shutdown callback may not be called immediately as
    the connection reference count is non-zero at that moment. So,
    although the connection is shut down by the thread of receiving
    packets, the thread of sending packets doesn't know it. Before
    its shutdown callback is invoked to tell the sending thread its
    connection has been closed, the sending thread may deliver
    messages by tipc_conn_sendmsg(), this is why the following error
    information appears:

    "Sending subscription event failed, no memory"

    To eliminate it, allow the connection shutdown callback to
    be called before the connection id is removed in tipc_close_conn(),
    so that the sending thread learns in time that its socket is
    closed and stops sending messages to it. We also
    remove the "Sending XXX failed..." error reporting for topology
    and config services.

    Signed-off-by: Ying Xue
    Signed-off-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • As pppol2tp_recv() never queues up packets to plain L2TP sockets,
    pppol2tp_recvmsg() never returns data to userspace, thus making
    the recv*() system calls unusable.

    Instead of dropping packets when the L2TP socket isn't bound to a PPP
    channel, this patch adds them to its reception queue.

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • Commit e0d4435f "l2tp: Update PPP-over-L2TP driver to work over L2TPv3"
    broke the PPPOL2TP_SO_SENDSEQ setsockopt. The L2TP header length was
    previously computed by pppol2tp_l2t_header_len() before each call to
    l2tp_xmit_skb(). Now that header length is retrieved from the hdr_len
    session field, this field must be updated every time the L2TP header
    format is modified, or l2tp_xmit_skb() won't push the right amount of
    data for the L2TP header.

    This patch uses l2tp_session_set_header_len() to adjust hdr_len every
    time sequencing is (de)activated from userspace (either by the
    PPPOL2TP_SO_SENDSEQ setsockopt or the L2TP_ATTR_SEND_SEQ netlink
    attribute).

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
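
    Keeping a cached header length in sync with the option that changes it can be sketched like this. The struct, the function names and the two size constants are assumptions for illustration only; the sketch does not reproduce the L2TP header layout. What it shows is that every setter that alters the header format also recomputes the cached length that the transmit path reads.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_HDR_LEN   6   /* assumed base header size */
#define SEQ_FIELDS_LEN 4   /* assumed size of the optional sequence fields */

struct session {
    int send_seq;   /* 1 if sequence numbers are sent */
    int hdr_len;    /* cached; read by the transmit path */
};

/* Recompute the cached length from the current options. */
static void session_set_header_len(struct session *s)
{
    assert(s != NULL);
    s->hdr_len = BASE_HDR_LEN + (s->send_seq ? SEQ_FIELDS_LEN : 0);
}

/* Toggling sequencing changes the header format, so the cached
 * length is refreshed in the same place. */
static void session_set_send_seq(struct session *s, int on)
{
    s->send_seq = on ? 1 : 0;
    session_set_header_len(s);
}
```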
     

06 Mar, 2014

8 commits

  • It moves the state setting for query into rndis_filter_receive_response().
    All callbacks including query-complete and status-callback are synchronized
    by channel->inbound_lock. This prevents a potential race between them.

    Signed-off-by: Haiyang Zhang
    Signed-off-by: David S. Miller

    Haiyang Zhang
     
  • When allocating RX buffers a fixed size is used, while freeing is based
    on actually received bytes, resulting in the following kernel warning
    when CONFIG_DMA_API_DEBUG is enabled:
    WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:1051 check_unmap+0x258/0x894()
    macb e000b000.ethernet: DMA-API: device driver frees DMA memory with different size [device address=0x000000002d170040] [map size=1536 bytes] [unmap size=60 bytes]
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc3-xilinx-00220-g49f84081ce4f #65
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x7c/0xc8)
    [] (dump_stack) from [] (warn_slowpath_common+0x60/0x84)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x2c/0x3c)
    [] (warn_slowpath_fmt) from [] (check_unmap+0x258/0x894)
    [] (check_unmap) from [] (debug_dma_unmap_page+0x64/0x70)
    [] (debug_dma_unmap_page) from [] (gem_rx+0x118/0x170)
    [] (gem_rx) from [] (macb_poll+0x24/0x94)
    [] (macb_poll) from [] (net_rx_action+0x6c/0x188)
    [] (net_rx_action) from [] (__do_softirq+0x108/0x280)
    [] (__do_softirq) from [] (irq_exit+0x84/0xf8)
    [] (irq_exit) from [] (handle_IRQ+0x68/0x8c)
    [] (handle_IRQ) from [] (gic_handle_irq+0x3c/0x60)
    [] (gic_handle_irq) from [] (__irq_svc+0x44/0x78)
    Exception stack(0xc056df20 to 0xc056df68)
    df20: 00000001 c0577430 00000000 c0577430 04ce8e0d 00000002 edfce238 00000000
    df40: 04e20f78 00000002 c05981f4 00000000 00000008 c056df68 c0064008 c02d7658
    df60: 20000013 ffffffff
    [] (__irq_svc) from [] (cpuidle_enter_state+0x54/0xf8)
    [] (cpuidle_enter_state) from [] (cpuidle_idle_call+0xe0/0x138)
    [] (cpuidle_idle_call) from [] (arch_cpu_idle+0x8/0x3c)
    [] (arch_cpu_idle) from [] (cpu_startup_entry+0xbc/0x124)
    [] (cpu_startup_entry) from [] (start_kernel+0x350/0x3b0)
    ---[ end trace d5fdc38641bd3a11 ]---
    Mapped at:
    [] debug_dma_map_page+0x48/0x11c
    [] gem_rx_refill+0x154/0x1f8
    [] macb_open+0x270/0x3e0
    [] __dev_open+0x7c/0xfc
    [] __dev_change_flags+0x8c/0x140

    Fix this by passing to the unmap function the same size that was
    passed when mapping the memory.

    Signed-off-by: Soren Brinkmann
    Reviewed-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Soren Brinkmann
     
  • With CONFIG_DMA_API_DEBUG enabled the following warning is printed:
    WARNING: CPU: 0 PID: 619 at lib/dma-debug.c:1101 check_unmap+0x758/0x894()
    macb e000b000.ethernet: DMA-API: device driver failed to check map error[device address=0x000000002d171c02] [size=322 bytes] [mapped as single]
    Modules linked in:
    CPU: 0 PID: 619 Comm: udhcpc Not tainted 3.14.0-rc3-xilinx-00219-gd158fc7f36a2 #63
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x7c/0xc8)
    [] (dump_stack) from [] (warn_slowpath_common+0x60/0x84)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x2c/0x3c)
    [] (warn_slowpath_fmt) from [] (check_unmap+0x758/0x894)
    [] (check_unmap) from [] (debug_dma_unmap_page+0x64/0x70)
    [] (debug_dma_unmap_page) from [] (macb_interrupt+0x1f8/0x2dc)
    [] (macb_interrupt) from [] (handle_irq_event_percpu+0x2c/0x178)
    [] (handle_irq_event_percpu) from [] (handle_irq_event+0x3c/0x5c)
    [] (handle_irq_event) from [] (handle_fasteoi_irq+0xb8/0x100)
    [] (handle_fasteoi_irq) from [] (generic_handle_irq+0x20/0x30)
    [] (generic_handle_irq) from [] (handle_IRQ+0x64/0x8c)
    [] (handle_IRQ) from [] (gic_handle_irq+0x3c/0x60)
    [] (gic_handle_irq) from [] (__irq_svc+0x44/0x78)
    Exception stack(0xed197f60 to 0xed197fa8)
    7f60: 00000134 60000013 bd94362e bd94362e be96b37c 00000014 fffffd72 00000122
    7f80: c000ebe4 ed196000 00000000 00000011 c032c0d8 ed197fa8 c0064008 c000ea20
    7fa0: 60000013 ffffffff
    [] (__irq_svc) from [] (ret_fast_syscall+0x0/0x48)
    ---[ end trace 478f921d0d542d1e ]---
    Mapped at:
    [] debug_dma_map_page+0x48/0x11c
    [] macb_start_xmit+0x184/0x2a8
    [] dev_hard_start_xmit+0x334/0x470
    [] sch_direct_xmit+0x78/0x2f8
    [] __dev_queue_xmit+0x318/0x708

    due to missing checks of the dma mapping. Add the appropriate checks to fix
    this.

    Signed-off-by: Soren Brinkmann
    Reviewed-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Soren Brinkmann
     
  • While working on ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to
    verify if we/peer is AUTH capable"), we noticed that there's a skb
    memory leakage in the error path.

    Running the same reproducer as in ec0223ec48a9 and by unconditionally
    jumping to the error label (to simulate an error condition) in
    sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about
    the unfreed chunk->auth_chunk skb clone:

    Unreferenced object 0xffff8800b8f3a000 (size 256):
    comm "softirq", pid 0, jiffies 4294769856 (age 110.757s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00 ..u^..X.........
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc+0xc8/0x210
    [] skb_clone+0x49/0xb0
    [] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp]
    [] sctp_inq_push+0x4c/0x70 [sctp]
    [] sctp_rcv+0x82e/0x9a0 [sctp]
    [] ip_local_deliver_finish+0xa8/0x210
    [] nf_reinject+0xbf/0x180
    [] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue]
    [] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink]
    [] netlink_rcv_skb+0xa9/0xc0
    [] nfnetlink_rcv+0x23f/0x408 [nfnetlink]
    [] netlink_unicast+0x168/0x250
    [] netlink_sendmsg+0x2e1/0x3f0
    [] sock_sendmsg+0x8b/0xc0
    [] ___sys_sendmsg+0x369/0x380

    What happens is that commit bbd0d59809f9 clones the skb containing
    the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case
    that an endpoint requires COOKIE-ECHO chunks to be authenticated:

    ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->

    auth_chunk, we could hit the "goto nomem_init" path from
    an error condition and thus leave the cloned skb around w/o
    freeing it.

    The fix is to centrally free such clones in sctp_chunk_destroy()
    handler that is invoked from sctp_chunk_free() after all refs have
    dropped; and also move both kfree_skb(chunk->auth_chunk) there,
    so that chunk->auth_chunk is either NULL (since sctp_chunkify()
    allocs new chunks through kmem_cache_zalloc()) or non-NULL with
    a valid skb pointer. chunk->skb and chunk->auth_chunk are the
    only skbs in the sctp_chunk structure that need to be handled.

    While at it, we should use consume_skb() for both. It is the same
    as dev_kfree_skb() but more appropriately named as we are not
    a device but a protocol. Also, this effectively replaces the
    kfree_skb() in both invocations with consume_skb(). The functions
    are the same except that kfree_skb() assumes that the frame was
    being dropped after a failure (e.g. for tools like drop monitor);
    usage of consume_skb() seems more appropriate in function
    sctp_chunk_destroy() though.

    Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk")
    Signed-off-by: Daniel Borkmann
    Cc: Vlad Yasevich
    Cc: Neil Horman
    Acked-by: Vlad Yasevich
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • There are known issues with switching the drivers between ECM mode and
    vendor mode. The interrupt transfer may become abnormal. The hardware
    may die if you change the configuration without unloading the current
    driver first, because all the control transfers of the current driver
    would fail after the command that switches the configuration.

    Although using the ECM driver and the vendor driver independently is
    fine, changing from one driver to the other by switching the
    configuration may cause problems. Additionally, the vendor mode driver
    is now more powerful than the ECM driver. Thus, disable the ECM mode
    driver, and let r8152 set the configuration to vendor mode and reset
    the device automatically.

    Signed-off-by: Hayes Wang
    Signed-off-by: David S. Miller

    hayeswang
     
  • In a kexec scenario, we failed to load the mlx4 driver in the
    second kernel because the ownership bit was still held by the first
    kernel, which had not released it correctly.

    The patch adds a shutdown() interface so that the ownership can
    be released correctly in the first kernel. It also helps avoid
    EEH errors during the boot stage of the second kernel caused
    by undesired traffic, which can't be handled by hardware during
    that stage on the Power platform.

    Signed-off-by: Gavin Shan
    Tested-by: Wei Yang
    Signed-off-by: David S. Miller

    Gavin Shan
     
  • MLD queries are supposed to have an IPv6 link-local source address
    according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch
    adds a sanity check to ignore MLD queries that do not.

    Without this check, such malformed MLD queries can result in a
    denial of service: The queries are ignored by any MLD listener,
    therefore they will not respond with an MLD report. However,
    without this patch these malformed MLD queries would enable the
    snooping part in the bridge code, potentially shutting down the
    corresponding ports towards these hosts for multicast traffic as the
    bridge did not learn about these listeners.

    Reported-by: Jan Stancek
    Signed-off-by: Linus Lüssing
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Linus Lüssing
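
    A source-address sanity check of this kind can be sketched as follows. The function names and the plain 16-byte array representation are choices made for this example and are not the bridge code. The check itself follows from the definition of IPv6 link-local unicast addresses as the prefix fe80::/10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An IPv6 address is link-local unicast if it lies in fe80::/10,
 * i.e. its first ten bits are 1111 1110 10. */
static int ipv6_is_link_local(const uint8_t addr[16])
{
    assert(addr != NULL);
    return addr[0] == 0xfe && (addr[1] & 0xc0) == 0x80;
}

/* Hypothetical policy hook: accept a query only if its source
 * address is link-local, ignore it otherwise. */
static int mld_query_acceptable(const uint8_t saddr[16])
{
    return ipv6_is_link_local(saddr);
}
```

    A query sent from the unspecified address (::) or from a global address would be ignored under this check.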
     
  • I stumbled upon this very serious bug while hunting for another one,
    it's a very subtle race condition between inet_frag_evictor,
    inet_frag_intern and the IPv4/6 frag_queue and expire functions
    (basically the users of inet_frag_kill/inet_frag_put).

    What happens is that after a fragment has been added to the hash chain
    but before it's been added to the lru_list (inet_frag_lru_add) in
    inet_frag_intern, it may get deleted (either by an expired timer if
    the system load is high or the timer sufficiently low, or by the
    frag_queue function for different reasons). Once it is then added to
    the lru_list, it's a matter of time for the evictor to get to a piece
    of memory which has been freed, leading to a number of different bugs
    depending on what's left there.

    I've been able to trigger this on both IPv4 and IPv6 (which is normal
    as the frag code is the same), but it's been much more difficult to
    trigger on IPv4 due to the protocol differences about how fragments
    are treated.

    The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
    in a RR bond, so the same flow can be seen on multiple cards at the
    same time. Then I used multiple instances of ping/ping6 to generate
    fragmented packets and flood the machines with them while running
    other processes to load the attacked machine.

    *It is very important to have the _same flow_ coming in on multiple CPUs
    concurrently. Usually the attacked machine would die in less than 30
    minutes, if configured properly to have many evictor calls and timeouts
    it could happen in 10 minutes or so.

    An important point to make is that any caller (frag_queue or timer) of
    inet_frag_kill will remove both the timer refcount and the
    original/guarding refcount thus removing everything that's keeping the
    frag from being freed at the next inet_frag_put. All of this could
    happen before the frag was ever added to the LRU list, then it gets
    added and the evictor uses a freed fragment.

    An example for IPv6 would be if a fragment is being added and is at
    the stage of being inserted in the hash after the hash lock is
    released, but before inet_frag_lru_add executes (or is able to obtain
    the lru lock) another overlapping fragment for the same flow arrives
    at a different CPU which finds it in the hash, but since it's
    overlapping it drops it invoking inet_frag_kill and thus removing all
    guarding refcounts, and afterwards freeing it by invoking
    inet_frag_put which removes the last refcount added previously by
    inet_frag_find, then inet_frag_lru_add gets executed by
    inet_frag_intern and we have a freed fragment in the lru_list.

    The fix is simple, just move the lru_add under the hash chain locked
    region so when a removing function is called it'll have to wait for
    the fragment to be added to the lru_list, and then it'll remove it (it
    works because the hash chain removal is done before the lru_list one
    and there's no window between the two list adds when the frag can get
    dropped). With this fix applied I couldn't kill the same machine in 24
    hours with the same setup.

    Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
    rwlock")

    CC: Florian Westphal
    CC: Jesper Dangaard Brouer
    CC: David S. Miller

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

05 Mar, 2014

3 commits

  • Pull networking fixes from David Miller:

    1) Fix memory leak in ieee80211_prep_connection(), sta_info leaked on
    error. From Eytan Lifshitz.

    2) Unintentional switch case fallthrough in nft_reject_inet_eval(),
    from Patrick McHardy.

    3) Must check if payload length is a power of 2 in
    nft_payload_select_ops(), from Nikolay Aleksandrov.

    4) Fix mis-checksumming in xen-netfront driver, ip_hdr() is not in the
    correct place when we invoke skb_checksum_setup(). From Wei Liu.

    5) TUN driver should not advertise HW vlan offload features in
    vlan_features. Fix from Fernando Luis Vazquez Cao.

    6) IPV6_VTI needs to select NET_IP_TUNNEL to avoid build errors, fix
    from Steffen Klassert.

    7) Add missing locking in xfrm_migrate_state_find(), we must hold the
    per-namespace xfrm_state_lock while traversing the lists. Fix from
    Steffen Klassert.

    8) Missing locking in ath9k driver, access to tid->sched must be done
    under ath_txq_lock(). Fix from Stanislaw Gruszka.

    9) Fix two bugs in TCP fastopen. First respect the size argument given
    to tcp_sendmsg() in the fastopen path, and secondly prevent
    tcp_send_syn_data() from potentially using order-5 allocations.
    From Eric Dumazet.

    10) Fix handling of default neigh garbage collection params, from Jiri
    Pirko.

    11) Fix cwnd bloat and over-inflation of RTT when transmit segmentation
    is in use. From Eric Dumazet.

    12) Missing initialization of Realtek r8169 driver's statistics
    seqlocks. Fix from Kyle McMartin.

    13) Fix RTNL assertion failures in 802.3ad and AB ARP monitor of bonding
    driver, from Ding Tianhong.

    14) Bonding slave release race can cause divide by zero, fix from
    Nikolay Aleksandrov.

    15) Overzealous return from neigh_periodic_work() causes reachability
    time to not be computed. Fix from Duan Jiong.

    16) Fix regression in ipv6_find_hdr(), it should not return -ENOENT when
    a specific target is specified and found. From Hans Schillstrom.

    17) Fix VLAN tag stripping regression in BNA driver, from Ivan Vecera.

    18) Tail loss probe can calculate bogus RTTs due to missing packet
    marking on retransmit. Fix from Yuchung Cheng.

    19) We cannot do skb_dst_drop() in iptunnel_pull_header() because
    multicast loopback detection in later code paths needs access to
    skb_rtable(). Fix from Xin Long.

    20) The macvlan driver regresses in that it propagates lower device
    offload support disables into itself, causing severe slowdowns when
    running over a bridge. Provide the software offloads always on
    macvlan devices to deal with this and the regression is gone. From
    Vlad Yasevich.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
    macvlan: Add support for 'always_on' offload features
    net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable
    ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer
    net: cpsw: fix cpdma rx descriptor leak on down interface
    be2net: isolate TX workarounds not applicable to Skyhawk-R
    be2net: Fix skb double free in be_xmit_workarounds() failure path
    be2net: clear promiscuous bits in adapter->flags while disabling promiscuous mode
    be2net: Fix to reset transparent vlan tagging
    qlcnic: dcb: a couple off by one bugs
    tcp: fix bogus RTT on special retransmission
    hsr: off by one sanity check in hsr_register_frame_in()
    can: remove CAN FD compatibility for CAN 2.0 sockets
    can: flexcan: factor out soft reset into separate function
    can: flexcan: flexcan_remove(): add missing netif_napi_del()
    can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze
    can: flexcan: factor out transceiver {en,dis}able into separate functions
    can: flexcan: fix transition from and to low power mode in chip_{en,dis}able
    can: flexcan: flexcan_open(): fix error path if flexcan_chip_start() fails
    can: flexcan: fix shutdown: first disable chip, then all interrupts
    USB AX88179/178A: Support D-Link DUB-1312
    ...

    Linus Torvalds
     
  • Pull regulator fixes from Mark Brown:
    "A couple of fixes here which ensure that regulators using the core
    support for GPIO enables work in all cases by ensuring that helpers
    are used consistently rather than open coding in places and hence not
    having GPIO support in some of them"

    * tag 'regulator-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
    regulator: core: Replace direct ops->disable usage
    regulator: core: Replace direct ops->enable usage

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton.

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
    mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness
    mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
    MAINTAINERS: add and correct types of some "T:" entries
    MAINTAINERS: use tab for separator
    rapidio/tsi721: fix tasklet termination in dma channel release
    hfsplus: fix remount issue
    zram: avoid null access when fail to alloc meta
    sh: prefix sh-specific "CCR" and "CCR2" by "SH_"
    ocfs2: fix quota file corruption
    drivers/rtc/rtc-s3c.c: fix incorrect way of save/restore of S3C2410_TICNT for TYPE_S3C64XX
    kallsyms: fix absolute addresses for kASLR
    scripts/gen_initramfs_list.sh: fix flags for initramfs LZ4 compression
    mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking
    memcg: reparent charges of children before processing parent
    memcg: fix endless loop in __mem_cgroup_iter_next()
    lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlock
    dma debug: account for cachelines and read-only mappings in overlap tracking
    mm: close PageTail race
    MAINTAINERS: EDAC: add Mauro and Borislav as interim patch collectors

    Linus Torvalds
     

04 Mar, 2014

7 commits

  • Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix; we can clean up later.
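
    The failure mode described above can be illustrated with a toy model.
    This is only a sketch under stated assumptions, not the page allocator
    code: toy_zone, toy_alloc() and TOY_BATCH are hypothetical names, and
    the "exempt" flag stands in for the behaviour after this fix.

```c
#include <stdbool.h>

#define TOY_BATCH 2  /* hypothetical per-zone fairness batch size */

struct toy_zone {
    int batch;       /* remaining fair-share allocations */
    int free_pages;  /* pages actually available */
};

/*
 * Toy allocator. "thisnode" stands in for a GFP_THISNODE request and
 * "exempt" selects the behaviour after the fix (no fairness accounting
 * for such requests).
 */
static bool toy_alloc(struct toy_zone *z, bool thisnode, bool exempt)
{
    bool fair = !(thisnode && exempt);

    if (fair && z->batch <= 0) {
        if (thisnode)
            return false;       /* batch is never reset on this path */
        z->batch = TOY_BATCH;   /* other allocations reset the batch */
    }
    if (z->free_pages <= 0)
        return false;
    z->free_pages--;
    if (fair)
        z->batch--;
    return true;
}
```

    In this model, a run of non-exempt "thisnode" requests starts failing
    once the batch is used up, even though free pages remain. Exempt
    requests only fail when the zone is really empty.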

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When doing some NUMA tests on powerpc, I triggered an oops. It is
    caused by the use of page->_last_cpupid, which should be initialized to
    "-1 & LAST_CPUPID_MASK" rather than "-1". Otherwise, task_numa_fault()
    misses the check (last_cpupid == (-1 & LAST_CPUPID_MASK)), and this
    finally causes an oops in task_numa_group(), since the number of online
    CPUs is smaller than the number of possible CPUs. This happens with
    CONFIG_SPARSE_VMEMMAP disabled.

    Call trace:

    SMP NR_CPUS=64 NUMA PowerNV
    Modules linked in:
    CPU: 24 PID: 804 Comm: systemd-udevd Not tainted 3.13.0-rc1+ #32
    task: c000001e2746aa80 ti: c000001e32c50000 task.ti: c000001e32c50000
    REGS: c000001e32c53510 TRAP: 0300 Not tainted (3.13.0-rc1+)
    MSR: 9000000000009032 CR: 28024424 XER: 20000000
    CFAR: c000000000009324 DAR: 7265717569726857 DSISR: 40000000 SOFTE: 1
    NIP .task_numa_fault+0x1470/0x2370
    LR .task_numa_fault+0x1468/0x2370
    Call Trace:
    .task_numa_fault+0x1468/0x2370 (unreliable)
    .do_numa_page+0x480/0x4a0
    .handle_mm_fault+0x4ec/0xc90
    .do_page_fault+0x3a8/0x890
    handle_page_fault+0x10/0x30
    Instruction dump:
    3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb 3863af78 3c82fefb
    3884b138 48d9cfd5 60000000 e93f0100 7d2907b4 5529063e 7d2a07b4
    ---[ end trace 15f2510da5ae07cf ]---
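
    Why the unmasked "-1" defeats the check can be shown with a small
    userspace sketch. The field width below is an assumption chosen for
    illustration (the kernel derives the real width from its configuration),
    and toy_page, reset_buggy(), reset_fixed() and cpupid_is_unset() are
    hypothetical names, not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical width, for illustration only. */
#define LAST_CPUPID_SHIFT 16
#define LAST_CPUPID_MASK  ((1 << LAST_CPUPID_SHIFT) - 1)

/* Stand-in for struct page with a separate, int-sized _last_cpupid field. */
struct toy_page {
    int _last_cpupid;
};

/* Initialization as it was: the stored value is a plain -1. */
static void reset_buggy(struct toy_page *p)
{
    p->_last_cpupid = -1;
}

/* Initialization as described in this message: -1 limited to the mask. */
static void reset_fixed(struct toy_page *p)
{
    p->_last_cpupid = -1 & LAST_CPUPID_MASK;
}

/* The comparison quoted above: "unset" means equal to the masked -1. */
static bool cpupid_is_unset(const struct toy_page *p)
{
    return p->_last_cpupid == (-1 & LAST_CPUPID_MASK);
}
```

    With the plain -1 the comparison is false, so an unset value is treated
    as a real cpupid. With the masked value the comparison is true.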

    Signed-off-by: Liu Ping Fan
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Ping Fan
     
  • Tree location entries should start with the appropriate type.

    Add git to some, hg to another.

    Neaten tree type description.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Convert whitespace to single tab for separators.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • This patch is a modification of the patch originally proposed by
    Xiaotian Feng: https://lkml.org/lkml/2012/11/5/413
    This new version disables DMA channel interrupts and ensures that the
    tasklet will not be scheduled again before calling tasklet_kill().

    Unfortunately the updated patch was not released at that time due to
    planned rework of Tsi721 mport driver to use threaded interrupts (which
    has yet to happen). Recently the issue was reported again:
    https://lkml.org/lkml/2014/2/19/762.

    Description from Xiaotian's original patch:

    "Some drivers use tasklet_disable in device remove/release process,
    tasklet_disable will inc tasklet->count and return. If the tasklet is
    not handled yet under some softirq pressure, the tasklet will be
    placed on the tasklet_vec, never have a chance to be executed. This
    might lead to a heavy loaded ksoftirqd, wakeup with pending_softirq,
    but tasklet is disabled. tasklet_kill should be used in this case."

    This patch is applicable to kernel versions starting from v3.5.

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Cc: Xiaotian Feng
    Reviewed-by: Thomas Gleixner
    Cc: Mike Galbraith
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • The current implementation of the HFS+ driver has a small issue with the
    remount option. For example, you are unable to remount from RO mode
    into RW mode by means of the command "mount -o remount,rw /dev/loop0
    /mnt/hfsplus". Trying to execute the following sequence of commands
    results in error messages:

    mount /dev/loop0 /mnt/hfsplus
    mount -o remount,ro /dev/loop0 /mnt/hfsplus
    mount -o remount,rw /dev/loop0 /mnt/hfsplus

    mount: you must specify the filesystem type

    mount -t hfsplus -o remount,rw /dev/loop0 /mnt/hfsplus

    mount: /mnt/hfsplus not mounted or bad option

    The reason for this issue is a failure of the mount syscall:

    mount("/dev/loop0", "/mnt/hfsplus", 0x2282a60, MS_MGC_VAL|MS_REMOUNT, NULL) = -1 EINVAL (Invalid argument)

    Namely, the hfsplus_parse_options_remount() method receives an empty
    "input" argument and returns false in that case. As a result,
    hfsplus_remount() returns the -EINVAL error code.

    This patch fixes the issue by returning true for the case of an empty
    "input" argument in the hfsplus_parse_options_remount() method.

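    The change in behaviour can be sketched in userspace as follows. This
    is not the driver code: toy_parse_options_remount() and toy_remount()
    are hypothetical stand-ins, the option handling is reduced to a single
    "force" keyword, and the empty_is_ok flag exists only to show the old
    and new behaviour side by side.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a remount option parser.
 * empty_is_ok == false models the old behaviour (empty input is an error),
 * empty_is_ok == true models the behaviour after the fix.
 */
static bool toy_parse_options_remount(const char *input, bool *force,
                                      bool empty_is_ok)
{
    if (!input)
        return empty_is_ok;

    /* Simplified parsing: only look for a "force" keyword. */
    *force = strstr(input, "force") != NULL;
    return true;
}

/* Hypothetical remount entry point: a parse failure becomes -EINVAL. */
static int toy_remount(const char *data, bool empty_is_ok)
{
    bool force = false;

    if (!toy_parse_options_remount(data, &force, empty_is_ok))
        return -EINVAL;
    return 0;
}
```

    A remount without option data (NULL, as in the syscall trace above)
    returns -EINVAL in the old mode and 0 in the new one.
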
    Signed-off-by: Vyacheslav Dubeyko
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • zram_meta_alloc() can fail, so the caller should check its return
    value. Otherwise, the system will hang.
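
    The pattern of checking the allocation result can be sketched in
    userspace as below. The names toy_meta, toy_meta_alloc(),
    toy_meta_free() and toy_disksize_store() are hypothetical, malloc() and
    calloc() stand in for the kernel allocators, and a request for zero
    entries is used here as an artificial way to trigger the failure path.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical metadata object standing in for the real one. */
struct toy_meta {
    unsigned char *table;
};

/* Returns NULL on failure, like an allocator that can fail. */
static struct toy_meta *toy_meta_alloc(size_t entries)
{
    struct toy_meta *meta;

    if (entries == 0)
        return NULL;            /* artificial failure for the sketch */

    meta = malloc(sizeof(*meta));
    if (!meta)
        return NULL;

    meta->table = calloc(entries, 1);
    if (!meta->table) {
        free(meta);
        return NULL;
    }
    return meta;
}

static void toy_meta_free(struct toy_meta *meta)
{
    if (meta) {
        free(meta->table);
        free(meta);
    }
}

/* Caller: check the result before using it and report -ENOMEM. */
static int toy_disksize_store(size_t entries, struct toy_meta **out)
{
    struct toy_meta *meta = toy_meta_alloc(entries);

    if (!meta)
        return -ENOMEM;
    *out = meta;
    return 0;
}
```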

    Signed-off-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim