28 Jul, 2008

1 commit

  • Piss-poor sysctl registration API strikes again, film at 11...

    What we really need is _pathname_ required to be present in already
    registered table, so that kernel could warn about bad order. That's the
    next target for sysctl stuff (and generally saner and more explicit
    order of initialization of ipv[46] internals wouldn't hurt either).

    For the time being, here are full fixups required by ..._rotable()
    stuff; we make per-net sysctl sets descendants of "ro" one and make sure
    that sufficient skeleton is there before we start registering per-net
    sysctls.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

27 Jul, 2008

13 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
    [PATCH] fix RLIM_NOFILE handling
    [PATCH] get rid of corner case in dup3() entirely
    [PATCH] remove remaining namei_{32,64}.h crap
    [PATCH] get rid of indirect users of namei.h
    [PATCH] get rid of __user_path_lookup_open
    [PATCH] f_count may wrap around
    [PATCH] dup3 fix
    [PATCH] don't pass nameidata to __ncp_lookup_validate()
    [PATCH] don't pass nameidata to gfs2_lookupi()
    [PATCH] new (local) helper: user_path_parent()
    [PATCH] sanitize __user_walk_fd() et.al.
    [PATCH] preparation to __user_walk_fd cleanup
    [PATCH] kill nameidata passing to permission(), rename to inode_permission()
    [PATCH] take noexec checks to very few callers that care
    Re: [PATCH 3/6] vfs: open_exec cleanup
    [patch 4/4] vfs: immutable inode checking cleanup
    [patch 3/4] fat: dont call notify_change
    [patch 2/4] vfs: utimes cleanup
    [patch 1/4] vfs: utimes: move owner check into inode_change_ok()
    [PATCH] vfs: use kstrdup() and check failing allocation
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    netns: fix ip_rt_frag_needed rt_is_expired
    netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
    netfilter: fix double-free and use-after free
    netfilter: arptables in netns for real
    netfilter: ip{,6}tables_security: fix future section mismatch
    selinux: use nf_register_hooks()
    netfilter: ebtables: use nf_register_hooks()
    Revert "pkt_sched: sch_sfq: dump a real number of flows"
    qeth: use dev->ml_priv instead of dev->priv
    syncookies: Make sure ECN is disabled
    net: drop unused BUG_TRAP()
    net: convert BUG_TRAP to generic WARN_ON
    drivers/net: convert BUG_TRAP to generic WARN_ON

    Linus Torvalds
     
  • make it atomic_long_t; while we are at it, get rid of useless checks in affs,
    hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

    Signed-off-by: Al Viro

    Al Viro
     
  • Massage ipv4 initialization - make sure that net.ipv4 appears as
    non-per-net-namespace before it shows up in per-net-namespace sysctls.
    That's the only change outside of sysctl.c needed to get sane ordering
    rules and data structures for sysctls (esp. for procfs side of that
    mess).

    Signed-off-by: Al Viro

    Al Viro
     
  • New object: set of sysctls [currently - root and per-net-ns].
    Contains: pointer to parent set, list of tables and "should I see this set?"
    method (->is_seen(set)).
    Current lists of tables are subsumed by that; net-ns contains such a beast.
    ->lookup() for ctl_table_root returns pointer to ctl_table_set instead of
    that to ->list of that ctl_table_set.

    [folded compile fixes by rdd for configs without sysctl]

    Signed-off-by: Al Viro

    Al Viro
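
The entry above describes the new "set of sysctls" object only in words. A minimal user-space sketch of that shape follows, assuming a much simplified layout: `struct ctl_set`, `never_seen` and `ctl_set_has` are hypothetical names for illustration, not the kernel's definitions, and the lookup rule (search the set, then its ancestors, skipping hidden sets) is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of a "set of sysctls"; not kernel code. */
struct ctl_set {
    struct ctl_set *parent;                  /* NULL for the root set */
    const char *const *tables;               /* NULL-terminated list of names */
    int (*is_seen)(const struct ctl_set *);  /* NULL means always visible */
};

/* Example visibility method: a set the current caller must not see. */
int never_seen(const struct ctl_set *set)
{
    (void)set;
    return 0;
}

/*
 * Returns 1 if `name` is found in `set` or in one of its ancestors,
 * skipping any set whose is_seen() method reports it as hidden.
 */
int ctl_set_has(const struct ctl_set *set, const char *name)
{
    const struct ctl_set *s;
    size_t i;

    for (s = set; s != NULL; s = s->parent) {
        if (s->is_seen != NULL && !s->is_seen(s))
            continue;
        for (i = 0; s->tables != NULL && s->tables[i] != NULL; i++)
            if (strcmp(s->tables[i], name) == 0)
                return 1;
    }
    return 0;
}
```

In this model a per-net set whose parent is the root set resolves names from both, while the root set alone does not see the per-net names.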
     
  • Running recent kernels, and using a particular vpn gateway, I've been
    having to edit my mails down to get them accepted by the smtp server.

    Git bisect led to commit e84f84f276473dcc673f360e8ff3203148bdf0e2 -
    netns: place rt_genid into struct net. The conversion from a != test
    to rt_is_expired() put one negative too many: and now my mail works.

    Signed-off-by: Hugh Dickins
    Acked-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Hugh Dickins
     
  • As Linus points out, "ct->ext" and "new" are always equal, avoid unnecessary
    dereferences and use "new" directly.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • As suggested by Patrick McHardy, introduce a __krealloc() that doesn't
    free the original buffer to fix a double-free and use-after-free bug
    introduced by me in netfilter that uses RCU.

    Reported-by: Patrick McHardy
    Signed-off-by: Pekka Enberg
    Tested-by: Dieter Ries
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pekka Enberg
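
The idea behind a resize that does not free the original buffer can be sketched in user space. This is an illustration only, assuming plain malloc/free in place of the kernel allocator; `resize_keep_old` is a hypothetical name and it always allocates, which the kernel function may not do.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space analogue of a "realloc that does not free":
 * returns a new buffer of new_size bytes holding a copy of the old
 * contents and leaves the old buffer untouched. The caller frees the
 * old buffer later, once no reader can still be using it.
 */
void *resize_keep_old(const void *old, size_t old_size, size_t new_size)
{
    void *p = malloc(new_size);

    if (p == NULL)
        return NULL;  /* the old buffer is still valid on failure */
    if (old != NULL)
        memcpy(p, old, old_size < new_size ? old_size : new_size);
    return p;
}
```

Because the old pointer stays valid after the call, a concurrent reader that still holds it never touches freed memory, and the single deferred free avoids the double free.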
     
  • IN, FORWARD -- grab netns from in device, OUT -- from out device.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Currently not visible, because NET_NS is mutually exclusive with SYSFS
    which is required by SECURITY.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexers. Nobody uses this "feature", nor does anybody use
    the passed kmem cache in a non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
    architecture does:

    This enables us to cleanly fix the Calgary IOMMU issue that some devices
    are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).

    I think that per-device dma_mapping_ops support would be also helpful for
    KVM people to support PCI passthrough but Andi thinks that this makes it
    difficult to support the PCI passthrough (see the above thread). So I
    CC'ed this to KVM camp. Comments are appreciated.

    A pointer to dma_mapping_ops is added to struct dev_archdata. If the
    pointer is non-NULL, DMA operations in asm/dma-mapping.h use it. If it's
    NULL, the system-wide dma_ops pointer is used as before.

    If it's useful for KVM people, I plan to implement a mechanism to register
    a hook called when a new pci (or dma capable) device is created (it works
    with hot plugging). It enables IOMMUs to set up an appropriate
    dma_mapping_ops per device.

    The major obstacle is that dma_mapping_error doesn't take a pointer to the
    device unlike other DMA operations. So x86 can't have dma_mapping_ops per
    device. Note all the POWER IOMMUs use the same dma_mapping_error function
    so this is not a problem for POWER but x86 IOMMUs use different
    dma_mapping_error functions.

    The first patch adds the device argument to dma_mapping_error. The patch
    is trivial but large since it touches lots of drivers and dma-mapping.h in
    all the architectures.

    This patch:

    dma_mapping_error() doesn't take a pointer to the device unlike other DMA
    operations. So we can't have dma_mapping_ops per device.

    Note that POWER already has dma_mapping_ops per device but all the POWER
    IOMMUs use the same dma_mapping_error function. x86 IOMMUs use the device
    argument.

    [akpm@linux-foundation.org: fix sge]
    [akpm@linux-foundation.org: fix svc_rdma]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: fix bnx2x]
    [akpm@linux-foundation.org: fix s2io]
    [akpm@linux-foundation.org: fix pasemi_mac]
    [akpm@linux-foundation.org: fix sdhci]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: fix sparc]
    [akpm@linux-foundation.org: fix ibmvscsi]
    Signed-off-by: FUJITA Tomonori
    Cc: Muli Ben-Yehuda
    Cc: Andi Kleen
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
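
The fallback rule described above (use the per-device ops pointer when it is set, otherwise the system-wide one) and the reason dma_mapping_error needs a device argument can be modelled in a few lines. This is a simplified user-space sketch: the structures are stand-ins, and the two error cookies (0 and ~0UL) are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct device;

struct dma_ops {
    /* Returns nonzero if addr is this implementation's error cookie. */
    int (*mapping_error)(struct device *dev, unsigned long addr);
};

struct device {
    const struct dma_ops *dma_ops;  /* NULL means "use the system default" */
};

/* System-wide default implementation. */
int default_mapping_error(struct device *dev, unsigned long addr)
{
    (void)dev;
    return addr == 0;
}

/* A second IOMMU that reports errors with a different cookie. */
int other_mapping_error(struct device *dev, unsigned long addr)
{
    (void)dev;
    return addr == ~0UL;
}

const struct dma_ops default_ops = { default_mapping_error };
const struct dma_ops other_ops = { other_mapping_error };

/* Pick the per-device ops if set, else fall back to the global ones. */
const struct dma_ops *get_dma_ops(const struct device *dev)
{
    if (dev != NULL && dev->dma_ops != NULL)
        return dev->dma_ops;
    return &default_ops;
}

/* With the device argument, the right implementation can be chosen. */
int dma_mapping_error(struct device *dev, unsigned long addr)
{
    return get_dma_ops(dev)->mapping_error(dev, addr);
}
```

Without the device argument there would be no way to tell which of the two error checks applies to a given address.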
     

26 Jul, 2008

7 commits

  • This reverts commit f867e6af94239a04ec23aeec2fcda5aa58e41db7.

    Based upon discussions between Jarek and Patrick McHardy,
    the field being set is more a config parameter than a
    statistic. And we should add a true statistic to provide
    this information if we really want it.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • ecn_ok is not initialized when a connection is established by cookies.
    The cookie syn-ack never sets ECN, so ecn_ok must be set to 0.

    Spotted using ns-3/network simulation cradle simulator and valgrind.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Removes legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better to automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(), though some might actually be
    promoted to BUG_ON(); I left that to the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    ipsec: ipcomp - Decompress into frags if necessary
    ipsec: ipcomp - Merge IPComp implementations
    pkt_sched: Fix locking in shutdown_scheduler_queue()

    Linus Torvalds
     
  • Extend the permission check for networking sysctl's to allow modification
    when current process has CAP_NET_ADMIN capability and is not root. This
    version uses the until now unused permissions hook to override the mode
    value for /proc/sys/net if accessed by a user with capabilities.

    Found while working with Quagga. It is impossible to turn forwarding
    on/off through the command interface because Quagga uses the secure coding
    practice of dropping privileges during initialization and only raising via
    capabilities when necessary. Since the daemon has reset real/effective
    uid after initialization, all attempts to access /proc/sys/net variables
    will fail.

    Signed-off-by: Stephen Hemminger
    Acked-by: "Eric W. Biederman"
    Cc: Chris Wright
    Cc: Alexey Dobriyan
    Cc: Andrew Morgan
    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
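
One plausible way for a permissions hook to "override the mode value" is to treat a caller that holds the capability as if it were the owner. The sketch below is an assumption made for illustration, not the code from the patch: `net_ctl_mode` is a hypothetical name and `has_net_admin` stands in for the capability check.

```c
/*
 * Hypothetical model of a permissions hook: a caller holding the
 * capability is treated like the owner, so the owner bits are copied
 * into the group and other positions. Otherwise the mode is unchanged.
 */
unsigned int net_ctl_mode(unsigned int mode, int has_net_admin)
{
    if (has_net_admin) {
        unsigned int owner = (mode >> 6) & 7;

        return (owner << 6) | (owner << 3) | owner;
    }
    return mode;
}
```

In this model a 0644 entry is seen as 0666 by a caller holding the capability and stays 0644 for everyone else, so a daemon that dropped its uid can still write the entry.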
     
  • All ratelimit users use the same jiffies and burst params, so some messages
    (callbacks) will be lost.

    For example:
    a calls printk_ratelimit(5 * HZ, 1)
    b calls printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b
    will be suppressed.

    - rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks for
    hints from andrew.

    - Add WARN_ON_RATELIMIT, update rcupreempt.h

    - remove __printk_ratelimit

    - use __ratelimit in net_ratelimit

    Signed-off-by: Dave Young
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
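
The a/b example above goes away once each caller owns its rate-limit state. A minimal user-space sketch of that idea, assuming a plain tick counter instead of jiffies; the field names and the `ratelimit` function are hypothetical and do not reproduce the kernel's __ratelimit:

```c
/* Hypothetical user-space model; time is a plain tick counter. */
struct ratelimit_state {
    unsigned long interval;  /* window length in ticks */
    int burst;               /* calls allowed per window */
    int printed;             /* calls allowed so far in this window */
    unsigned long begin;     /* tick at which the current window opened */
    int started;             /* 0 until the first call */
};

/* Returns 1 if the caller may proceed at time `now`, 0 if suppressed. */
int ratelimit(struct ratelimit_state *rs, unsigned long now)
{
    if (!rs->started || now - rs->begin >= rs->interval) {
        rs->started = 1;     /* open a new window */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    return 0;
}
```

Two callers with separate states no longer suppress each other; each is limited only by its own window and burst.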
     
  • All uses of list_for_each_rcu() can be profitably replaced by the
    easier-to-use list_for_each_entry_rcu(). This patch makes this change for
    networking, in preparation for removing the list_for_each_rcu() API
    entirely.

    Acked-by: David S. Miller
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

25 Jul, 2008

9 commits

  • When decompressing extremely large packets, allocating them through
    kmalloc is prone to failure. Therefore it's better to use page
    frags instead.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch merges the IPv4/IPv6 IPComp implementations since most
    of the code is identical. As a result future enhancements will no
    longer need to be duplicated.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Qdisc locks need to be held with BH disabled.

    Tested-by: Ingo Molnar
    Signed-off-by: David S. Miller

    David S. Miller
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    pkt_sched: sch_sfq: dump a real number of flows
    atm: [fore200e] use MODULE_FIRMWARE() and other suggested cleanups
    netfilter: make security table depend on NETFILTER_ADVANCED
    tcp: Clear probes_out more aggressively in tcp_ack().
    e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
    net: Update entry in af_family_clock_key_strings
    netdev: Remove warning from __netif_schedule().
    sky2: don't stop queue on shutdown

    Linus Torvalds
     
  • This patch adds tests that ensure the boundary conditions for the various
    constants introduced in the previous patches are met. No code is generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch introduces support for the SOCK_NONBLOCK flag in socket,
    socketpair, and paccept. To do this the internal function sock_attach_fd
    gets an additional parameter which it uses to set the appropriate flag for
    the file descriptor.

    Given that in modern, scalable programs almost all socket connections are
    non-blocking, and given the minimal additional cost of the new functionality,
    I see no reason not to add this code.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_NONBLOCK O_NONBLOCK

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    return NULL;
    }

    int
    main (void)
    {
    int fd;
    fd = socket (PF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
    puts ("socket(0) failed");
    return 1;
    }
    int fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    puts ("socket(0) set non-blocking mode");
    return 1;
    }
    close (fd);

    fd = socket (PF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
    if (fd == -1)
    {
    puts ("socket(SOCK_NONBLOCK) failed");
    return 1;
    }
    fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("socket(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (fd);

    int fds[2];
    if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
    puts ("socketpair(0) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    printf ("socketpair(0) set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, fds) == -1)
    {
    puts ("socketpair(SOCK_NONBLOCK) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    printf ("socketpair(SOCK_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, NULL) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if (fl & O_NONBLOCK)
    {
    puts ("paccept(0) set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);

    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_NONBLOCK);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_NONBLOCK) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("paccept(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);
    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • Some platforms do not have support to restore the signal mask in the
    return path from a syscall. For those platforms syscalls like pselect are
    not defined at all. This is, I think, not a good choice for paccept()
    since paccept() adds more value on top of accept() than just the signal
    mask handling.

    Therefore this patch defines a scaled down version of the sys_paccept
    function for those platforms. It returns -EINVAL in case the signal mask
    is non-NULL but behaves the same otherwise.

    Note that I explicitly included . I saw that it is
    currently included but indirectly two levels down. There is too much risk
    in relying on this. The header might change and then suddenly the
    function definition would change without anyone immediately noticing.

    Signed-off-by: Ulrich Drepper
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch is by far the most complex in the series. It adds a new syscall
    paccept. This syscall differs from accept in that it adds (at the userlevel)
    two additional parameters:

    - a signal mask
    - a flags value

    The flags parameter can be used to set flags like SOCK_CLOEXEC. This is
    implemented here as well. Some people argued that this is a property which
    should be inherited from the file descriptor for the server but this is against
    POSIX. Additionally, we really want the signal mask parameter as well
    (similar to pselect, ppoll, etc). So an interface change is inevitable.

    The flag value is the same as for socket and socketpair. I think diverging
    here will only create confusion. Similar to the filesystem interfaces where
    the use of the O_* constants differs, it is acceptable here.

    The signal mask is handled as for pselect etc. The mask is temporarily
    installed for the thread and removed before the call returns. I modeled the
    code after pselect. If there is a problem it's likely also in pselect.

    For architectures which use socketcall I maintained this interface instead of
    adding a system call. The symmetry shouldn't be broken.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
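    #include <errno.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>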

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_CLOEXEC O_CLOEXEC

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    sleep (2);
    pthread_kill ((pthread_t) arg, SIGUSR1);

    return NULL;
    }

    static void
    handler (int s)
    {
    }

    int
    main (void)
    {
    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, (void *) pthread_self ()) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    int coe = fcntl (s2, F_GETFD);
    if (coe & FD_CLOEXEC)
    {
    puts ("paccept(0) set close-on-exec-flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_CLOEXEC) failed");
    return 1;
    }

    coe = fcntl (s2, F_GETFD);
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    struct sigaction sa;
    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset (&sa.sa_mask);
    sigaction (SIGUSR1, &sa, NULL);

    sigset_t ss;
    pthread_sigmask (SIG_SETMASK, NULL, &ss);
    sigaddset (&ss, SIGUSR1);
    pthread_sigmask (SIG_SETMASK, &ss, NULL);

    sigdelset (&ss, SIGUSR1);
    alarm (4);
    pthread_barrier_wait (&b);

    errno = 0 ;
    s2 = paccept (s, NULL, 0, &ss, 0);
    if (s2 != -1 || errno != EINTR)
    {
    puts ("paccept did not fail with EINTR");
    return 1;
    }

    close (s);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: make it compile]
    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Cc: Roland McGrath
    Cc: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
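
The signal mask handling described above ("temporarily installed for the thread and removed before the call returns") can be imitated in user space. The sketch below is an illustration under stated assumptions: `call_with_sigmask` and `sigusr1_blocked` are hypothetical helpers, and sigprocmask is used so that the example suits a single-threaded program. Unlike the syscall, this version cannot make the mask swap atomic with the blocking call, which is the reason interfaces such as pselect take the mask themselves.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical helper: run fn(arg) with `mask` installed as the signal
 * mask, then put the previous mask back. A NULL mask means "leave the
 * current mask alone".
 */
int call_with_sigmask(const sigset_t *mask, int (*fn)(void *), void *arg)
{
    sigset_t saved;
    int ret;

    if (mask == NULL)
        return fn(arg);
    if (sigprocmask(SIG_SETMASK, mask, &saved) != 0)
        return -1;
    ret = fn(arg);
    sigprocmask(SIG_SETMASK, &saved, NULL);  /* restore before returning */
    return ret;
}

/* Probe used for demonstration: 1 if SIGUSR1 is blocked right now. */
int sigusr1_blocked(void *unused)
{
    sigset_t cur;

    (void)unused;
    sigprocmask(SIG_SETMASK, NULL, &cur);    /* query only, no change */
    return sigismember(&cur, SIGUSR1);
}
```

Calling the probe through the helper with an empty mask reports SIGUSR1 as unblocked during the call, while the caller's own mask is unchanged afterwards.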
     
  • This patch adds support for flag values which are ORed to the type passed
    to socket and socketpair. The additional code is minimal. The flag
    values in this implementation can and must match the O_* flags. This
    avoids overhead in the conversion.

    The internal functions sock_alloc_fd and sock_map_fd get a new parameter
    and all callers are changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #define PORT 57392

    /* For Linux these must be the same. */
    #define SOCK_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd;
    fd = socket (PF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
    puts ("socket(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("socket(0) set close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = socket (PF_INET, SOCK_STREAM|SOCK_CLOEXEC, 0);
    if (fd == -1)
    {
    puts ("socket(SOCK_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("socket(SOCK_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (fd);

    int fds[2];
    if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
    puts ("socketpair(0) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    coe = fcntl (fds[i], F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    printf ("socketpair(0) set close-on-exec flag for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, fds) == -1)
    {
    puts ("socketpair(SOCK_CLOEXEC) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    coe = fcntl (fds[i], F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    printf ("socketpair(SOCK_CLOEXEC) does not set close-on-exec flag for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

24 Jul, 2008

7 commits

  • Dump the "flows" number according to the number of active flows
    instead of repeating the "limit".

    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     
  • * 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
    NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
    cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
    NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
    NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
    cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
    cpumask: Provide a generic set of CPUMASK_ALLOC macros
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
    cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
    cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
    cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
    Revert "cpumask: introduce new APIs"
    cpumask: make for_each_cpu_mask a bit smaller
    net: Pass reference to cpumask variable in net/sunrpc/svc.c
    ...

    Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually

    Linus Torvalds
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • This is based upon an excellent bug report from Eric Dumazet.

    tcp_ack() should clear ->icsk_probes_out even if there are packets
    outstanding. Otherwise if we get a sequence of ACKs while we do have
    packets outstanding over and over again, we'll never clear the
    probes_out value and eventually think the connection is too sick and
    we'll reset it.

    This appears to be some "optimization" added to tcp_ack() in the 2.4.x
    timeframe. In 2.2.x, probes_out is pretty much always cleared by
    tcp_ack().

    Here is Eric's original report:

    ----------------------------------------
    Apparently, we can in some situations reset TCP connections in a couple of seconds when some frames are lost.

    In order to reproduce the problem, please try the following program on linux-2.6.25.*

    Setup some iptables rules to allow two frames per second sent on loopback interface to tcp destination port 12000

    iptables -N SLOWLO
    iptables -A SLOWLO -m hashlimit --hashlimit 2 --hashlimit-burst 1 --hashlimit-mode dstip --hashlimit-name slow2 -j ACCEPT
    iptables -A SLOWLO -j DROP

    iptables -A OUTPUT -o lo -p tcp --dport 12000 -j SLOWLO

    Then run the attached program and see the output :

    # ./loop
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,1)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,3)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,5)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,7)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,9)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,11)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,201ms,13)
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,188ms,15)
    write(): Connection timed out
    wrote 890 bytes but was interrupted after 9 seconds
    ESTAB 0 0 127.0.0.1:12000 127.0.0.1:54455
    Exiting read() because no data available (4000 ms timeout).
    read 860 bytes

    While this TCP session makes progress (sending frames with 50 bytes of payload every 500 ms), the Linux TCP stack decides to reset it once tcp_retries2 is reached (default value: 15)

    tcpdump :

    15:30:28.856695 IP 127.0.0.1.56554 > 127.0.0.1.12000: S 33788768:33788768(0) win 32792
    15:30:28.856711 IP 127.0.0.1.12000 > 127.0.0.1.56554: S 33899253:33899253(0) ack 33788769 win 32792
    15:30:29.356947 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 1:61(60) ack 1 win 257
    15:30:29.356966 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 61 win 257
    15:30:29.866415 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 61:111(50) ack 1 win 257
    15:30:29.866427 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 111 win 257
    15:30:30.366516 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 111:161(50) ack 1 win 257
    15:30:30.366527 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 161 win 257
    15:30:30.876196 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 161:211(50) ack 1 win 257
    15:30:30.876207 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 211 win 257
    15:30:31.376282 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 211:261(50) ack 1 win 257
    15:30:31.376290 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 261 win 257
    15:30:31.885619 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 261:311(50) ack 1 win 257
    15:30:31.885631 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 311 win 257
    15:30:32.385705 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 311:361(50) ack 1 win 257
    15:30:32.385715 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 361 win 257
    15:30:32.895249 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 361:411(50) ack 1 win 257
    15:30:32.895266 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 411 win 257
    15:30:33.395341 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 411:461(50) ack 1 win 257
    15:30:33.395351 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 461 win 257
    15:30:33.918085 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 461:511(50) ack 1 win 257
    15:30:33.918096 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 511 win 257
    15:30:34.418163 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 511:561(50) ack 1 win 257
    15:30:34.418172 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 561 win 257
    15:30:34.927685 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 561:611(50) ack 1 win 257
    15:30:34.927698 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 611 win 257
    15:30:35.427757 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 611:661(50) ack 1 win 257
    15:30:35.427766 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 661 win 257
    15:30:35.937359 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 661:711(50) ack 1 win 257
    15:30:35.937376 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 711 win 257
    15:30:36.437451 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 711:761(50) ack 1 win 257
    15:30:36.437464 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 761 win 257
    15:30:36.947022 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 761:811(50) ack 1 win 257
    15:30:36.947039 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 811 win 257
    15:30:37.447135 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 811:861(50) ack 1 win 257
    15:30:37.447203 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 861 win 257
    15:30:41.448171 IP 127.0.0.1.12000 > 127.0.0.1.56554: F 1:1(0) ack 861 win 257
    15:30:41.448189 IP 127.0.0.1.56554 > 127.0.0.1.12000: R 33789629:33789629(0) win 0

    Source of program :

    /*
     * Small producer/consumer program.
     * Sets up a listener on 127.0.0.1:12000
     * Forks a child
     * Child connects to 127.0.0.1, and sends 10 bytes on this tcp socket every 100 ms
     * Father accepts the connection, and reads all data
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <time.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int port = 12000;
    char buffer[4096];
    int main(int argc, char *argv[])
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in socket_address;
        time_t t0, t1;
        int on = 1, sfd, res;
        unsigned long total = 0;
        socklen_t alen = sizeof(socket_address);
        pid_t pid;

        time(&t0);
        socket_address.sin_family = AF_INET;
        socket_address.sin_port = htons(port);
        socket_address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        if (lfd == -1) {
            perror("socket()");
            return 1;
        }
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(int));
        if (bind(lfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
            perror("bind");
            close(lfd);
            return 1;
        }
        if (listen(lfd, 1) == -1) {
            perror("listen()");
            close(lfd);
            return 1;
        }
        pid = fork();
        if (pid == 0) {
            int i, cfd = socket(AF_INET, SOCK_STREAM, 0);
            close(lfd);
            if (connect(cfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                perror("connect()");
                return 1;
            }
            for (i = 0 ; ;) {
                res = write(cfd, "blablabla\n", 10);
                if (res > 0)
                    total += res;
                else if (res == -1) {
                    perror("write()");
                    break;
                } else
                    break;
                usleep(100000);
                if (++i == 10) {
                    system("ss -on dst 127.0.0.1:12000");
                    i = 0;
                }
            }
            time(&t1);
            fprintf(stderr, "wrote %lu bytes but was interrupted after %g seconds\n", total, difftime(t1, t0));
            system("ss -on | grep 127.0.0.1:12000");
            close(cfd);
            return 0;
        }
        sfd = accept(lfd, (struct sockaddr *)&socket_address, &alen);
        if (sfd == -1) {
            perror("accept");
            return 1;
        }
        close(lfd);
        while (1) {
            struct pollfd pfd[1];
            pfd[0].fd = sfd;
            pfd[0].events = POLLIN;
            if (poll(pfd, 1, 4000) == 0) {
                fprintf(stderr, "Exiting read() because no data available (4000 ms timeout).\n");
                break;
            }
            res = read(sfd, buffer, sizeof(buffer));
            if (res > 0)
                total += res;
            else if (res == 0)
                break;
            else
                perror("read()");
        }
        fprintf(stderr, "read %lu bytes\n", total);
        close(sfd);
        return 0;
    }
    ----------------------------------------

    Signed-off-by: David S. Miller

    David S. Miller
     
  • During the merge of the CAN subsystem, af_family_clock_key_strings[]
    was added to sock.c in commit
    443aef0eddfa44c158d1b94ebb431a70638fcab4
    (lockdep: fixup sk_callback_lock annotation). This trivial patch adds
    the missing name for address family 29 (AF_CAN).

    Signed-off-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Oliver Hartkopp
     
  • It isn't helping anything and we aren't going to be able to change all
    the drivers that do queue wakeups in strange situations.

    Just letting a noop_qdisc get scheduled will work because when
    qdisc_run() executes via net_tx_work() it will simply find no packets
    pending when it makes the ->dequeue() call in qdisc_restart.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (24 commits)
    I/OAT: I/OAT version 3.0 support
    I/OAT: tcp_dma_copybreak default value dependent on I/OAT version
    I/OAT: Add watchdog/reset functionality to ioatdma
    iop_adma: cleanup iop_chan_xor_slot_count
    iop_adma: document how to calculate the minimum descriptor pool size
    iop_adma: directly reclaim descriptors on allocation failure
    async_tx: make async_tx_test_ack a boolean routine
    async_tx: remove depend_tx from async_tx_sync_epilog
    async_tx: export async_tx_quiesce
    async_tx: fix handling of the "out of descriptor" condition in async_xor
    async_tx: ensure the xor destination buffer remains dma-mapped
    async_tx: list_for_each_entry_rcu() cleanup
    dmaengine: Driver for the Synopsys DesignWare DMA controller
    dmaengine: Add slave DMA interface
    dmaengine: add DMA_COMPL_SKIP_{SRC,DEST}_UNMAP flags to control dma unmap
    dmaengine: Add dma_client parameter to device_alloc_chan_resources
    dmatest: Simple DMA memcpy test client
    dmaengine: DMA engine driver for Marvell XOR engine
    iop-adma: fix platform driver hotplug/coldplug
    dmaengine: track the number of clients using a channel
    ...

    Fixed up conflict in drivers/dca/dca-sysfs.c manually

    Linus Torvalds
     

23 Jul, 2008

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
    ipw2200: Call netif_*_queue() interfaces properly.
    netxen: Needs to include linux/vmalloc.h
    [netdrvr] atl1d: fix !CONFIG_PM build
    r6040: rework init_one error handling
    r6040: bump release number to 0.18
    r6040: handle RX fifo full and no descriptor interrupts
    r6040: change the default waiting time
    r6040: use definitions for magic values in descriptor status
    r6040: completely rework the RX path
    r6040: call napi_disable when putting down the interface and set lp->dev accordingly.
    mv643xx_eth: fix NETPOLL build
    r6040: rework the RX buffers allocation routine
    r6040: fix scheduling while atomic in r6040_tx_timeout
    r6040: fix null pointer access and tx timeouts
    r6040: prefix all functions with r6040
    rndis_host: support WM6 devices as modems
    at91_ether: use netstats in net_device structure
    sfc: Create one RX queue and interrupt per CPU package by default
    sfc: Use a separate workqueue for resets
    sfc: I2C adapter initialisation fixes
    ...

    Linus Torvalds
     
  • I/OAT DMA performance tuning showed different optimal values of
    tcp_dma_copybreak for different I/OAT versions (4096 for 1.2 and 2048
    for 2.0). This patch lets ioatdma driver set tcp_dma_copybreak value
    according to these results.

    [dan.j.williams@intel.com: remove some ifdefs]
    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Maciej Sosnowski
     
  • Change icmp6_dst_gc to return the one value the caller cares about rather
    than using call by reference.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger