23 Oct, 2009

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    move virtrng_remove to .devexit.text
    move virtballoon_remove to .devexit.text
    virtio_blk: Revert serial number support
    virtio: let header files include virtio_ids.h
    virtio_blk: revert QUEUE_FLAG_VIRT addition

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits)
    niu: VLAN_ETH_HLEN should be used to make sure that the whole MAC header was copied to the head buffer in the Vlan packets case
    KS8851: Fix ks8851_set_rx_mode() for IFF_MULTICAST
    KS8851: Fix MAC address write order
    KS8851: Add soft reset at probe time
    net: fix section mismatch in fec.c
    net: Fix struct inet_timewait_sock bitfield annotation
    tcp: Try to catch MSG_PEEK bug
    net: Fix IP_MULTICAST_IF
    bluetooth: static lock key fix
    bluetooth: scheduling while atomic bug fix
    tcp: fix TCP_DEFER_ACCEPT retrans calculation
    tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
    tcp: accept socket after TCP_DEFER_ACCEPT period
    Revert "tcp: fix tcp_defer_accept to consider the timeout"
    AF_UNIX: Fix deadlock on connecting to shutdown socket
    ethoc: clear only pending irqs
    ethoc: inline regs access
    vmxnet3: use dev_dbg, fix build for CONFIG_BLOCK=n
    virtio_net: use dev_kfree_skb_any() in free_old_xmit_skbs()
    be2net: fix support for PCI hot plug
    ...

    Linus Torvalds
     

22 Oct, 2009

16 commits

  • The function virtrng_remove is used only wrapped by __devexit_p so define
    it using __devexit.

    Signed-off-by: Uwe Kleine-König
    Acked-by: Sam Ravnborg
    Cc: Rusty Russell
    Cc: Michael S. Tsirkin
    Acked-by: Christian Borntraeger
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Rusty Russell

    Uwe Kleine-König
     
  • The function virtballoon_remove is used only wrapped by __devexit_p so
    define it using __devexit.

    Signed-off-by: Uwe Kleine-König
    Acked-by: Sam Ravnborg
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Rusty Russell

    Uwe Kleine-König
     
  • This reverts "Add serial number support for virtio_blk, V4a".

    Turns out that virtio_pci, lguest and s/390 all have an 8 bit limit
    on virtio config space, so no one could ever use this.

    This is coming back later in a cleaner form.

    Signed-off-by: Rusty Russell
    Cc: john cooper
    Cc: Jens Axboe

    Rusty Russell
     
  • Rusty,

    commit 3ca4f5ca73057a617f9444a91022d7127041970a
    virtio: add virtio IDs file
    moved all device IDs into a single file. While the change itself is
    a very good one, it can break userspace applications. For example
    if a userspace tool wanted to get the ID of virtio_net it used to
    include virtio_net.h. This no longer works, since virtio_net.h
    does not include virtio_ids.h.
    This patch moves all "#include <linux/virtio_ids.h>" from the C
    files into the header files, making the header files compatible with
    the old ones.

    In addition, this patch exports virtio_ids.h to userspace.

    CC: Fernando Luis Vazquez Cao
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Rusty Russell

    Christian Borntraeger
     
  • It seems like the addition of QUEUE_FLAG_VIRT causes major performance
    regressions for Fedora users:

    https://bugzilla.redhat.com/show_bug.cgi?id=509383
    https://bugzilla.redhat.com/show_bug.cgi?id=505695

    While I can't reproduce those extreme regressions myself, I think the flag
    is wrong.

    Rationale:

    QUEUE_FLAG_VIRT expands to QUEUE_FLAG_NONROT, which causes the queue
    to be unplugged immediately. This is not good behaviour for at least
    qemu and kvm, where we do have significant overhead for every
    I/O operation. Even with all the latest speedups (native AIO,
    MSI support, zero copy) we can only get native speed for up to 128kb
    I/O requests, and we are already down to 66% of native performance for 4kb
    requests even on my laptop running the Intel X25-M SSD for which the
    QUEUE_FLAG_NONROT was designed.
    If we ever get virtio-blk overhead low enough that this flag makes
    sense it should only be set based on a feature flag set by the host.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Rusty Russell

    Christoph Hellwig
     
  • …ied to the head buffer in the Vlan packets case

    Signed-off-by: Joyce Yu <joyce.yu@sun.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Joyce Yu
     
  • * 'for-linus' of git://git.infradead.org/users/eparis/notify:
    dnotify: ignore FS_EVENT_ON_CHILD
    inotify: fix coalesce duplicate events into a single event in special case
    inotify: deprecate the inotify kernel interface
    fsnotify: do not set group for a mark before it is on the i_list

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: hp_sdc_rtc - fix test in hp_sdc_rtc_read_rt()
    Input: atkbd - consolidate force release quirks for volume keys
    Input: logips2pp - model 73 is actually TrackMan FX
    Input: i8042 - add Sony Vaio VGN-FZ240E to the nomux list
    Input: fix locking issue in /proc/bus/input/ handlers
    Input: atkbd - postpone restoring LED/repeat rate at resume
    Input: atkbd - restore resetting LED state at startup
    Input: i8042 - make pnp_data_busted variable boolean instead of int
    Input: synaptics - add another Protege M300 to rate blacklist

    Linus Torvalds
     
  • * 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: Prevent kvm_init from corrupting debugfs structures
    KVM: MMU: fix pointer cast
    KVM: use proper hrtimer function to retrieve expiration time

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
    dm snapshot: allow chunk size to be less than page size
    dm snapshot: use unsigned integer chunk size
    dm snapshot: lock snapshot while supplying status
    dm exception store: fix failed set_chunk_size error path
    dm snapshot: require non zero chunk size by end of ctr
    dm: dec_pending needs locking to save error value
    dm: add missing del_gendisk to alloc_dev error path
    dm log: userspace fix incorrect luid cast in userspace_ctr
    dm snapshot: free exception store on init failure
    dm snapshot: sort by chunk size to fix race

    Linus Torvalds
     
  • Increase TEST_SUSPEND_SECONDS to 10 so the warning in
    suspend_test_finish() doesn't annoy the users of slower systems so much.

    Also, make the warning print the suspend-resume cycle time, so that we
    know why the warning actually triggered.

    Patch prepared during the hacking session at the Kernel Summit in Tokyo.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • This fixes a compile bug introduced in

    6ef297f (ARM: 5720/1: Move MMCI header to amba include dir)

    That commit moved arch/arm/include/asm/mach/mmc.h to
    include/linux/amba/mmci.h. Just removing the include was enough.

    Signed-off-by: Uwe Kleine-König
    Acked-by: Linus Walleij
    Acked-by: Nicolas Ferre
    Acked-by: Bill Gatliff
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Pierre Ossman
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • * 'sh/for-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
    sh: Kill off stray HAVE_FTRACE_SYSCALLS reference.
    sh: Remove BKL from landisk gio.
    sh: disabled cache handling fix.
    sh: Fix up single page flushing to use PAGE_SIZE.

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: aesni-intel - Fix irq_fpu_usable usage
    crypto: padlock-sha - Fix stack alignment

    Linus Torvalds
     
  • Fix a (small) memory leak in one of the error paths of the NFS mount
    options parsing code.

    Regression introduced in 2.6.30 by commit a67d18f (NFS: load the
    rpc/rdma transport module automatically).

    Reported-by: Yinghai Lu
    Reported-by: Pekka Enberg
    Signed-off-by: Ingo Molnar
    Signed-off-by: Trond Myklebust
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • This patch fixes a NULL pointer dereference in pipe_rdwr_open() which
    generates the stack trace:

    > Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
    > [] pipe_rdwr_open+0x35/0x70
    > [] __dentry_open+0x13c/0x230
    > [] do_filp_open+0x2d/0x40
    > [] do_sys_open+0x5a/0x100
    > [] sysenter_do_call+0x1b/0x67

    The failure mode is triggered by an attempt to open an anonymous
    pipe via /proc/pid/fd/* as exemplified by this script:

    =============================================================
    while : ; do
        { echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
        PID=$!
        OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
              { read PID REST ; echo $PID; } )
        OUT="${OUT%% *}"
        DELAY=$((RANDOM * 1000 / 32768))
        usleep $((DELAY * 1000 + RANDOM % 1000 ))
        echo n > /proc/$OUT/fd/1 # Trigger defect
    done
    =============================================================

    Note that the failure window is quite small and I could only
    reliably reproduce the defect by inserting a small delay
    in pipe_rdwr_open(). For example:

    static int
    pipe_rdwr_open(struct inode *inode, struct file *filp)
    {
            msleep(100);
            mutex_lock(&inode->i_mutex);

    Although the defect was observed in pipe_rdwr_open(), I think it
    makes sense to replicate the change through all the pipe_*_open()
    functions.

    The core of the change is to verify that inode->i_pipe has not
    been released before attempting to manipulate it. If inode->i_pipe
    is no longer present, return ENOENT to indicate so.

    The comment about potentially using atomic_t for i_pipe->readers
    and i_pipe->writers has also been removed because it is no longer
    relevant in this context. The inode->i_mutex lock must be used so
    that inode->i_pipe can be dealt with correctly.
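
    The check described above can be sketched in userspace as follows. This
    is only an illustration under stated assumptions, not the patch itself:
    the sketch_* types and function are hypothetical stand-ins for the kernel
    structures, and the i_mutex locking is omitted to keep the example
    single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pipe and inode structures. */
struct sketch_pipe {
    int readers;
    int writers;
};

struct sketch_inode {
    struct sketch_pipe *i_pipe;   /* NULL once the pipe has been released */
};

/*
 * Open the pipe for reading and writing. In the kernel this body would run
 * with inode->i_mutex held; the lock is left out of this sketch.
 */
static int sketch_pipe_rdwr_open(struct sketch_inode *inode)
{
    int ret = -ENOENT;            /* default: the pipe is already gone */

    assert(inode != NULL);
    if (inode->i_pipe) {          /* pipe still present: safe to touch it */
        ret = 0;
        inode->i_pipe->readers++;
        inode->i_pipe->writers++;
    }
    return ret;
}
```

    Calling sketch_pipe_rdwr_open() on an inode whose i_pipe is NULL returns
    -ENOENT instead of dereferencing the missing pipe.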

    Signed-off-by: Earl Chew
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Earl Chew
     

21 Oct, 2009

5 commits

  • In ks8851_set_rx_mode() the case handling IFF_MULTICAST was also setting
    the RXCR1_AE bit by accident. This meant that all unicast frames were
    being accepted by the device. Remove RXCR1_AE from this case.

    Note, RXCR1_AE was also masking a problem with setting the MAC address
    properly, so needs to be applied after fixing the MAC write order.

    Fixes a bug reported by Doong, Ping of Micrel. This version of the
    patch avoids setting RXCR1_ME for all cases.

    Signed-off-by: Ben Dooks
    Signed-off-by: David S. Miller

    Ben Dooks
     
  • The MAC address register was being written in the wrong order, so add
    a new address macro to convert mac-address byte to register address and
    a ks8851_wrreg8() function to write each byte without having to worry
    about any difficult byte swapping.

    Fixes a bug reported by Doong, Ping of Micrel.

    Signed-off-by: Ben Dooks
    Signed-off-by: David S. Miller

    Ben Dooks
     
  • Issue a full soft reset at probe time.

    This was reported by Doong Ping of Micrel, but with no explanation of why this
    is necessary or what bug it is fixing. Add it as it does not seem to hurt
    the current driver and ensures that the device is in a known state when we
    start setting it up.

    Signed-off-by: Ben Dooks
    Signed-off-by: David S. Miller

    Ben Dooks
     
  • fec_enet_init is called by both fec_probe and fec_resume, so it
    shouldn't be marked as __init.

    Signed-off-by: Steven King
    Signed-off-by: David S. Miller

    Steven King
     
  • Mask off FS_EVENT_ON_CHILD in dnotify_handle_event(). Otherwise, when there
    is more than one watch on a directory and dnotify_should_send_event()
    succeeds, events with FS_EVENT_ON_CHILD set will trigger all watches and cause
    spurious events.

    This case was overlooked in commit e42e2773.

    #define _GNU_SOURCE

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <string.h>

    /* Signal handler for the DN_CREATE watch. */
    static void create_event(int s, siginfo_t* si, void* p)
    {
            printf("create\n");
    }

    /* Signal handler for the DN_DELETE watch. */
    static void delete_event(int s, siginfo_t* si, void* p)
    {
            printf("delete\n");
    }

    int main (void) {
            struct sigaction action;
            char *tmpdir, *file;
            int fd1, fd2;

            sigemptyset (&action.sa_mask);
            action.sa_flags = SA_SIGINFO;

            action.sa_sigaction = create_event;
            sigaction (SIGRTMIN + 0, &action, NULL);

            action.sa_sigaction = delete_event;
            sigaction (SIGRTMIN + 1, &action, NULL);

    # define TMPDIR "/tmp/test.XXXXXX"
            tmpdir = malloc(strlen(TMPDIR) + 1);
            strcpy(tmpdir, TMPDIR);
            mkdtemp(tmpdir);

    # define TMPFILE "/file"
            file = malloc(strlen(tmpdir) + strlen(TMPFILE) + 1);
            /* TMPFILE already starts with '/', so no extra separator is needed */
            sprintf(file, "%s%s", tmpdir, TMPFILE);

            /* First watch on the directory: create events only */
            fd1 = open (tmpdir, O_RDONLY);
            fcntl(fd1, F_SETSIG, SIGRTMIN);
            fcntl(fd1, F_NOTIFY, DN_MULTISHOT | DN_CREATE);

            /* Second watch on the same directory: delete events only */
            fd2 = open (tmpdir, O_RDONLY);
            fcntl(fd2, F_SETSIG, SIGRTMIN + 1);
            fcntl(fd2, F_NOTIFY, DN_MULTISHOT | DN_DELETE);

            if (fork()) {
                    /* This triggers a create event */
                    creat(file, 0600);
                    /* This triggers a create and delete event (!) */
                    unlink(file);
            } else {
                    sleep(1);
                    rmdir(tmpdir);
            }

            return 0;
    }

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Eric Paris

    Andreas Gruenbacher
     

20 Oct, 2009

10 commits

  • commit 9e337b0f (net: annotate inet_timewait_sock bitfields)
    added 4/8 bytes in struct inet_timewait_sock.

    Fix this by declaring tw_ipv6_offset in the 'flags' bitfield.
    The 14-bit hole is named tw_pad to make it clearly apparent.
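
    A minimal sketch of the layout idea, assuming a 32-bit unsigned int and a
    16-bit offset field. The struct name, the helper, and the two one-bit
    flags are hypothetical; only tw_pad and tw_ipv6_offset are names taken
    from the message above.

```c
#include <assert.h>

/* One 32-bit flags word: 1 + 1 + 14 + 16 bits. */
struct sketch_tw_flags {
    unsigned int flag_a         : 1,   /* illustrative one-bit flag */
                 flag_b         : 1,   /* illustrative one-bit flag */
                 tw_pad         : 14,  /* the hole, named so it is visible */
                 tw_ipv6_offset : 16;  /* assumed 16-bit offset */
};

/* Store an offset, checking that it fits in the 16-bit field. */
static void sketch_set_ipv6_offset(struct sketch_tw_flags *f, unsigned int off)
{
    assert(f && off <= 0xFFFFu);
    f->tw_ipv6_offset = off;
}
```

    On common ABIs the whole struct occupies a single unsigned int, so moving
    the offset into the bitfield word costs no extra space.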

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch tries to print out more information when we hit the
    MSG_PEEK bug in tcp_recvmsg. It's been around since at least
    2005 and it's about time that we finally fix it.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • When renaming kernel_fpu_using to irq_fpu_usable, the semantics of the
    function changed too, from measuring whether the kernel is using the FPU,
    that is, the FPU is NOT available, to measuring whether the FPU is usable,
    that is, the FPU is available.

    But the usage of irq_fpu_usable in aesni-intel_glue.c is not changed
    accordingly. This patch fixes this.

    Signed-off-by: Huang Ying
    Signed-off-by: Herbert Xu

    Huang Ying
     
  • ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

    This function should be called only with RTNL or dev_base_lock held, or a reader
    could see a corrupt hash chain and eventually enter an endless loop.

    Fix is to call dev_get_by_index()/dev_put().

    If this happens to be performance critical, we could define a new dev_exist_by_index()
    function to avoid touching dev refcount.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When shutting down a ppp connection, a lockdep warning about a non-static
    key is triggered because the lock is not initialized properly
    at that time.

    Fix by changing the lock/skb_queue_head init order.

    [ 94.339261] INFO: trying to register non-static key.
    [ 94.342509] the code is fine but needs lockdep annotation.
    [ 94.342509] turning off the locking correctness validator.
    [ 94.342509] Pid: 0, comm: swapper Not tainted 2.6.31-mm1 #2
    [ 94.342509] Call Trace:
    [ 94.342509] [] register_lock_class+0x58/0x241
    [ 94.342509] [] ? __lock_acquire+0xb57/0xb73
    [ 94.342509] [] __lock_acquire+0xac/0xb73
    [ 94.342509] [] ? lock_release_non_nested+0x17b/0x1de
    [ 94.342509] [] lock_acquire+0x67/0x84
    [ 94.342509] [] ? skb_dequeue+0x15/0x41
    [ 94.342509] [] _spin_lock_irqsave+0x2f/0x3f
    [ 94.342509] [] ? skb_dequeue+0x15/0x41
    [ 94.342509] [] skb_dequeue+0x15/0x41
    [ 94.342509] [] ? _read_unlock+0x1d/0x20
    [ 94.342509] [] skb_queue_purge+0x14/0x1b
    [ 94.342509] [] l2cap_recv_frame+0xea1/0x115a [l2cap]
    [ 94.342509] [] ? __lock_acquire+0xb57/0xb73
    [ 94.342509] [] ? mark_lock+0x1e/0x1c7
    [ 94.342509] [] ? hci_rx_task+0xd2/0x1bc [bluetooth]
    [ 94.342509] [] l2cap_recv_acldata+0xb1/0x1c6 [l2cap]
    [ 94.342509] [] hci_rx_task+0x106/0x1bc [bluetooth]
    [ 94.342509] [] ? l2cap_recv_acldata+0x0/0x1c6 [l2cap]
    [ 94.342509] [] tasklet_action+0x69/0xc1
    [ 94.342509] [] __do_softirq+0x94/0x11e
    [ 94.342509] [] do_softirq+0x36/0x5a
    [ 94.342509] [] irq_exit+0x35/0x68
    [ 94.342509] [] do_IRQ+0x72/0x89
    [ 94.342509] [] common_interrupt+0x2e/0x34
    [ 94.342509] [] ? pm_qos_add_requirement+0x63/0x9d
    [ 94.342509] [] ? acpi_idle_enter_bm+0x209/0x238
    [ 94.342509] [] cpuidle_idle_call+0x5c/0x94
    [ 94.342509] [] cpu_idle+0x4e/0x6f
    [ 94.342509] [] rest_init+0x53/0x55
    [ 94.342509] [] start_kernel+0x2f0/0x2f5
    [ 94.342509] [] i386_start_kernel+0x91/0x96

    Reported-by: Oliver Hartkopp
    Signed-off-by: Dave Young
    Tested-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Dave Young
     
  • Due to driver core changes, dev_set_drvdata will call kzalloc, which must
    be called from a context that may sleep, but hci_conn_add is called in
    atomic context.

    As with dev_set_name, move dev_set_drvdata to the work queue function.

    The oops is as follows:

    Oct 2 17:41:59 darkstar kernel: [ 438.001341] BUG: sleeping function called from invalid context at mm/slqb.c:1546
    Oct 2 17:41:59 darkstar kernel: [ 438.001345] in_atomic(): 1, irqs_disabled(): 0, pid: 2133, name: sdptool
    Oct 2 17:41:59 darkstar kernel: [ 438.001348] 2 locks held by sdptool/2133:
    Oct 2 17:41:59 darkstar kernel: [ 438.001350] #0: (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.+.}, at: [] lock_sock+0xa/0xc [l2cap]
    Oct 2 17:41:59 darkstar kernel: [ 438.001360] #1: (&hdev->lock){+.-.+.}, at: [] l2cap_sock_connect+0x103/0x26b [l2cap]
    Oct 2 17:41:59 darkstar kernel: [ 438.001371] Pid: 2133, comm: sdptool Not tainted 2.6.31-mm1 #2
    Oct 2 17:41:59 darkstar kernel: [ 438.001373] Call Trace:
    Oct 2 17:41:59 darkstar kernel: [ 438.001381] [] __might_sleep+0xde/0xe5
    Oct 2 17:41:59 darkstar kernel: [ 438.001386] [] __kmalloc+0x4a/0x15a
    Oct 2 17:41:59 darkstar kernel: [ 438.001392] [] ? kzalloc+0xb/0xd
    Oct 2 17:41:59 darkstar kernel: [ 438.001396] [] kzalloc+0xb/0xd
    Oct 2 17:41:59 darkstar kernel: [ 438.001400] [] device_private_init+0x15/0x3d
    Oct 2 17:41:59 darkstar kernel: [ 438.001405] [] dev_set_drvdata+0x18/0x26
    Oct 2 17:41:59 darkstar kernel: [ 438.001414] [] hci_conn_init_sysfs+0x40/0xd9 [bluetooth]
    Oct 2 17:41:59 darkstar kernel: [ 438.001422] [] ? hci_conn_add+0x128/0x186 [bluetooth]
    Oct 2 17:41:59 darkstar kernel: [ 438.001429] [] hci_conn_add+0x177/0x186 [bluetooth]
    Oct 2 17:41:59 darkstar kernel: [ 438.001437] [] hci_connect+0x3c/0xfb [bluetooth]
    Oct 2 17:41:59 darkstar kernel: [ 438.001442] [] l2cap_sock_connect+0x174/0x26b [l2cap]
    Oct 2 17:41:59 darkstar kernel: [ 438.001448] [] sys_connect+0x60/0x7a
    Oct 2 17:41:59 darkstar kernel: [ 438.001453] [] ? lock_release_non_nested+0x84/0x1de
    Oct 2 17:41:59 darkstar kernel: [ 438.001458] [] ? might_fault+0x47/0x81
    Oct 2 17:41:59 darkstar kernel: [ 438.001462] [] ? might_fault+0x47/0x81
    Oct 2 17:41:59 darkstar kernel: [ 438.001468] [] ? __copy_from_user_ll+0x11/0xce
    Oct 2 17:41:59 darkstar kernel: [ 438.001472] [] sys_socketcall+0x82/0x17b
    Oct 2 17:41:59 darkstar kernel: [ 438.001477] [] syscall_call+0x7/0xb

    Signed-off-by: Dave Young
    Signed-off-by: David S. Miller

    Dave Young
     
  • Fix TCP_DEFER_ACCEPT conversion between seconds and
    retransmission to match the TCP SYN-ACK retransmission periods
    because the time is converted to such retransmissions. The old
    algorithm selects one more retransmission in some cases. Allow
    up to 255 retransmissions.
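
    One way such a conversion can work is sketched below. This is an
    illustration, not necessarily the code in the patch: it assumes an
    initial SYN-ACK timeout of 3 seconds that doubles on every
    retransmission up to a cap of 120 seconds, and the function and macro
    names are hypothetical.

```c
#include <assert.h>

/* Assumed values, in seconds: initial SYN-ACK timeout and maximum RTO. */
#define SKETCH_TIMEOUT_INIT 3
#define SKETCH_RTO_MAX      120

/*
 * Convert a deferring period in seconds to a number of SYN-ACK
 * transmissions whose cumulative timeouts cover that period.
 * The result is capped at 255.
 */
static unsigned char sketch_secs_to_retrans(int seconds, int timeout, int rto_max)
{
    unsigned char res = 0;

    assert(timeout > 0 && rto_max >= timeout);
    if (seconds > 0) {
        int period = timeout;   /* time covered by the first timeout */

        res = 1;
        while (seconds > period && res < 255) {
            res++;
            timeout <<= 1;      /* exponential backoff */
            if (timeout > rto_max)
                timeout = rto_max;
            period += timeout;
        }
    }
    return res;
}
```

    Under these assumptions a request of 1 to 3 seconds maps to one
    transmission, 4 to 9 seconds to two, and 10 to 21 seconds to three.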

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
    users to not retransmit SYN-ACKs during the deferring period if
    ACK from client was received. The goal is to reduce traffic
    during the deferring period. When the period is finished
    we continue with sending SYN-ACKs (at least one) but this time
    any traffic from client will change the request to established
    socket allowing application to terminate it properly.
    Also, do not drop acked request if sending of SYN-ACK fails.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Willy Tarreau and many other folks in recent years
    were concerned about what happens when the TCP_DEFER_ACCEPT period
    expires for clients which sent an ACK packet. They prefer clients
    that actively resend ACK on our SYN-ACK retransmissions to be
    converted from open requests to sockets and queued to the
    listener for accepting after the deferring period is finished.
    Then application server can decide to wait longer for data
    or to properly terminate the connection with FIN if read()
    returns EAGAIN which is an indication for accepting after
    the deferring period. This change still can have side effects
    for applications that expect always to see data on the accepted
    socket. Others can be prepared to work in both modes (with or
    without TCP_DEFER_ACCEPT period) and their data processing can
    ignore the read=EAGAIN notification and allocate resources for
    clients which proved to have no data to send during the deferring
    period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
    as a timeout) to wait for data will notice clients that didn't
    send data for 3 seconds but that still resend ACKs.
    Thanks to Willy Tarreau for the initial idea and to
    Eric Dumazet for the review and testing the change.
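
    From the application side, the EAGAIN case described above could be
    detected roughly as follows. This is a sketch under assumptions: it uses
    recv() with MSG_PEEK | MSG_DONTWAIT rather than a plain read() on a
    non-blocking socket, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

enum sketch_first_read {
    SKETCH_HAS_DATA,      /* the client has already sent something */
    SKETCH_NO_DATA_YET,   /* EAGAIN: accepted after the deferring period */
    SKETCH_PEER_CLOSED,   /* orderly shutdown by the client */
    SKETCH_ERROR          /* any other failure */
};

/* Probe a freshly accepted socket without blocking or consuming data. */
static enum sketch_first_read sketch_probe_first_read(int fd)
{
    char byte;
    ssize_t n;

    assert(fd >= 0);
    n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n > 0)
        return SKETCH_HAS_DATA;
    if (n == 0)
        return SKETCH_PEER_CLOSED;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return SKETCH_NO_DATA_YET;
    return SKETCH_ERROR;
}
```

    A server could wait longer on SKETCH_NO_DATA_YET or close the connection
    with FIN, which are the two choices the message above describes.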

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • This reverts commit 6d01a026b7d3009a418326bdcf313503a314f1ea.

    Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
    with a more correct way to deal with this.

    Signed-off-by: David S. Miller

    David S. Miller
     

19 Oct, 2009

6 commits

  • I found a deadlock bug in UNIX domain sockets that lets non-root users
    mount a DoS attack against the local machine.

    How to reproduce:
    1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstract
    namespace(*), and shutdown(2) it.
    2. Repeat connect(2)ing to the listening socket from the other sockets
    until the connection backlog is full.
    3. connect(2) takes the CPU forever. If every core is taken, the
    system hangs.

    PoC code: (Run as many times as cores on SMP machines.)

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
            int ret;
            int csd;
            int lsd;
            struct sockaddr_un sun;

            /* make an abstract name address (*) */
            memset(&sun, 0, sizeof(sun));
            sun.sun_family = PF_UNIX;
            sprintf(&sun.sun_path[1], "%d", getpid());

            /* create the listening socket and shutdown */
            lsd = socket(AF_UNIX, SOCK_STREAM, 0);
            bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
            listen(lsd, 1);
            shutdown(lsd, SHUT_RDWR);

            /* connect loop */
            alarm(15); /* forcibly exit the loop after 15 sec */
            for (;;) {
                    csd = socket(AF_UNIX, SOCK_STREAM, 0);
                    ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
                    if (-1 == ret) {
                            perror("connect()");
                            break;
                    }
                    puts("Connection OK");
            }
            return 0;
    }

    (*) Make sun_path[0] = 0 to use the abstract namespace.
    If a file-based socket is used, the system doesn't deadlock because
    of context switches in the file system layer.

    Why this happens:
    Error checks between unix_socket_connect() and unix_wait_for_peer() are
    inconsistent. The former calls the latter to wait until the backlog is
    processed. Although the latter returns without doing anything when the
    socket is shut down, the former doesn't check the shutdown state and
    just retries calling the latter forever.

    Patch:
    The patch below adds a shutdown check to unix_socket_connect(), so
    connect(2) to the shut-down socket will return -ECONNREFUSED.
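
    The retry loop and the added check can be modelled in a few lines. This
    is a simplified sketch, not the kernel code: the sketch_* names are
    hypothetical, the wait is reduced to a bounded spin, and -EAGAIN stands
    in for "would have retried forever".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the listening peer's state. */
struct sketch_listener {
    bool shut_down;     /* shutdown(2) was called on the listener */
    int  backlog_len;   /* connections currently queued */
    int  backlog_max;   /* backlog limit */
};

#define SKETCH_MAX_SPINS 1000

/*
 * Returns 0 when the connection is queued, -ECONNREFUSED when the listener
 * is shut down and the check is enabled, and -EAGAIN when the loop gave up
 * after SKETCH_MAX_SPINS retries.
 */
static int sketch_connect(struct sketch_listener *peer, bool check_shutdown)
{
    int spins;

    assert(peer);
    for (spins = 0; spins < SKETCH_MAX_SPINS; spins++) {
        if (check_shutdown && peer->shut_down)
            return -ECONNREFUSED;       /* the added check */
        if (peer->backlog_len < peer->backlog_max) {
            peer->backlog_len++;
            return 0;
        }
        /*
         * Backlog full: the wait helper returns at once for a shut-down
         * peer, and nothing drains the backlog, so we just retry.
         */
    }
    return -EAGAIN;
}
```

    With a full backlog on a shut-down listener, the unchecked variant spins
    until the artificial limit, while the checked variant fails on the first
    pass.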

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Masanori Yoshida
    Signed-off-by: David S. Miller

    Tomoki Sekiyama
     
  • This patch fixes the problem of dropped packets due to lost
    interrupt requests. We should only clear what was pending at the
    moment we read the irq source reg.
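
    The difference between the two acknowledge strategies can be shown with
    a small model. It assumes an interrupt source register whose bits are
    cleared by the acknowledge step; the sketch_* names are hypothetical and
    the late_irq parameter simulates an interrupt raised between the read
    and the acknowledge.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device with a single interrupt source register. */
struct sketch_dev {
    uint32_t int_source;
};

static uint32_t sketch_read_source(const struct sketch_dev *d)
{
    return d->int_source;
}

/* Acknowledge (clear) the bits set in mask. */
static void sketch_ack(struct sketch_dev *d, uint32_t mask)
{
    d->int_source &= ~mask;
}

/*
 * Read the pending bits, let a simulated late interrupt arrive, then
 * acknowledge either everything (ack_all != 0) or only what was read.
 * Returns the bits that were seen and will be handled.
 */
static uint32_t sketch_handle_irq(struct sketch_dev *d, uint32_t late_irq, int ack_all)
{
    uint32_t pending;

    assert(d);
    pending = sketch_read_source(d);
    d->int_source |= late_irq;          /* hardware raises another irq */
    sketch_ack(d, ack_all ? UINT32_MAX : pending);
    return pending;
}
```

    When only the pending bits are acknowledged, the late interrupt stays
    set in the register and is handled on the next pass; acknowledging
    everything silently drops it.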

    Signed-off-by: Thomas Chou
    Signed-off-by: David S. Miller

    Thomas Chou
     
  • Signed-off-by: Thomas Chou
    Signed-off-by: David S. Miller

    Thomas Chou
     
  • If we rename a dir entry, like this:

    rename("/tmp/ino7UrgoJ.rename1", "/tmp/ino7UrgoJ.rename2")
    rename("/tmp/ino7UrgoJ.rename2", "/tmp/ino7UrgoJ")

    The duplicate events should be coalesced into a single event. But those two
    events are not coalesced, due to a bad check in
    event_compare(): it cannot match the two NULL inodes as the same event.

    Signed-off-by: Wei Yongjun
    Signed-off-by: Eric Paris

    Wei Yongjun
     
  • In 2.6.33 there will be no users of the inotify interface. Mark it for
    removal as fsnotify is more generic and is easier to use.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • fsnotify_add_mark is supposed to add a mark to the g_list and i_list and to
    set the group and inode for the mark. fsnotify_destroy_mark_by_entry uses
    the fact that ->group != NULL to know if this group should be destroyed or
    if it's already been done.

    But fsnotify_add_mark sets the group and inode before it actually adds the
    mark to the i_list and g_list. This can result in a race in inotify; it
    requires 3 threads.

    task 1: sys_inotify_add_watch("file")
      inotify_update_watch()
      inotify_new_watch()
      inotify_add_to_idr()
        ^--- returns wd = [a]
    task 2: sys_inotify_add_watch("file")
      inotify_update_watch()
      inotify_new_watch()
      inotify_add_to_idr()
      fsnotify_add_mark()
        ^--- returns wd = [b]
      returns to userspace;
    task 3: sys_inotify_rm_watch([a])
      inotify_idr_find([a])
        ^--- gives us the pointer from task 1
    task 1:
      fsnotify_add_mark()
        ^--- this is going to set the mark->group and mark->inode fields, but will
             return -EEXIST because of the race with [b].
    task 3:
      fsnotify_destroy_mark()
        ^--- since ->group != NULL we call back
             into inotify_freeing_mark() which calls
             inotify_remove_from_idr([a])

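    since fsnotify_add_mark() failed we call:
    inotify_remove_from_idr([a])
    ^--- but [a] has already been removed from the idr by the
    sys_inotify_rm_watch([a]) path above.

    The fix is to not set the mark's group until we are sure the mark is
    on the inode and fsnotify_add_mark will return success.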

    Signed-off-by: Eric Paris

    Eric Paris
     

18 Oct, 2009

1 commit