16 Dec, 2008

1 commit

  • When a cgroup is removed, it's unlinked from its parent's children list,
    but not actually freed until the last dentry on it is released (at which
    point cgrp->root->number_of_cgroups is decremented).

    Currently rebind_subsystems checks for the top cgroup's child list being
    empty in order to rebind subsystems into or out of a hierarchy - this can
    result in the set of subsystems bound to a hierarchy being changed while
    the hierarchy still contains a removed-but-not-freed cgroup.

    The simplest fix for this is to forbid remounts that change the set of
    subsystems on a hierarchy that has removed-but-not-freed cgroups. This
    bug can be reproduced via:

    mkdir /mnt/cg
    mount -t cgroup -o ns,freezer cgroup /mnt/cg
    mkdir /mnt/cg/foo
    sleep 1h < /mnt/cg/foo &
    rmdir /mnt/cg/foo
    mount -t cgroup -o remount,ns,devices,freezer cgroup /mnt/cg
    kill $!

    The above will only cause an oops in -mm, not in mainline, but the bug
    can cause a memory leak in mainline (and even an oops).
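
    The committed check is not quoted in this log; a minimal sketch of the
    idea, using the number_of_cgroups field mentioned above (placement in
    rebind_subsystems() is an assumption):

        /* refuse to change the subsystem set while removed-but-not-freed
         * cgroups still hold the count above the root cgroup itself */
        if (root->number_of_cgroups > 1)
            return -EBUSY;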

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

15 Dec, 2008

1 commit

  • This reverts commit 5b7dba4ff834259a5623e03a565748704a8fe449, which
    caused a regression in hibernate, reported and bisected by Fabio
    Comolli.

    This revert fixes

    http://bugzilla.kernel.org/show_bug.cgi?id=12155
    http://bugzilla.kernel.org/show_bug.cgi?id=12149

    Bisected-by: Fabio Comolli
    Requested-by: Rafael J. Wysocki
    Acked-by: Dave Kleikamp
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Dec, 2008

4 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: CPU remove deadlock fix

    Linus Torvalds
     
  • Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
    test in 2.6.7! Bad bad bug, good good find.

    The i_mmap_writable count must be incremented for VM_SHARED (just as
    i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
    when dup_mmap() copies the vma for fork: it has its own more optimal
    version of __vma_link_file(), and I missed this out. So the count
    was later going down to 0 (dangerous) when one end unmapped, then
    wrapping negative (inefficient) when the other end unmapped.

    The only impact on x86 would have been that setting a mandatory lock on
    a file which has at some time been opened O_RDWR and mapped MAP_SHARED
    (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
    when it should succeed, or succeed when it should fail.

    But those architectures which rely on flush_dcache_page() to flush
    userspace modifications back into the page before the kernel reads it,
    may in some cases have skipped the flush after such a fork - though any
    repetitive test will soon wrap the count negative, in which case it will
    flush_dcache_page() unnecessarily.

    The fix would be a two-liner, but a mapping variable is added and a
    comment moved along the way.
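
    A sketch of the core of that fix in dup_mmap()'s file path, as described
    above (the surrounding copy loop is elided):

        if (file) {
            struct address_space *mapping = file->f_mapping;

            if (tmp->vm_flags & VM_DENYWRITE)
                atomic_dec(&file->f_path.dentry->d_inode->i_writecount);
            spin_lock(&mapping->i_mmap_lock);
            /* the missing piece: count writably shared mappings */
            if (tmp->vm_flags & VM_SHARED)
                mapping->i_mmap_writable++;
            /* ... link tmp into mapping->i_mmap as before ... */
            spin_unlock(&mapping->i_mmap_lock);
        }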

    Reported-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
    to my commit 966c8c12dc9e77f931e2281ba25d2f0244b06949 ("sprint_symbol():
    use less stack"), which exposed a bug in slub's list_locations():
    kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
    beyond the end of the page provided.

    The 100-byte slop which list_locations() allows at the end of the page
    looks roughly enough for all the other stuff it might print after the
    symbol before it checks again: so break out of the loop KSYM_SYMBOL_LEN
    earlier than before.

    Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need
    KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where
    it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them.
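
    For illustration, the sizing fix at such a call site looks roughly like
    this (a sketch, not one of the exact call sites):

        char sym[KSYM_SYMBOL_LEN];  /* was KSYM_NAME_LEN: too small, since
                                     * sprint_symbol() may also append the
                                     * module name, offset and size */

        sprint_symbol(sym, address);
        seq_printf(m, "%s\n", sym);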

    [akpm@linux-foundation.org: ftrace.h needs module.h]
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Miles Lane
    Acked-by: Pekka Enberg
    Acked-by: Steven Rostedt
    Acked-by: Frederic Weisbecker
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Running kmemtraced, which uses splice() on relayfs, causes a hard lock on
    x86-64 SMP. As described by Tom Zanussi:

    It looks like you hit the same problem as described here:

    commit 8191ecd1d14c6914c660dfa007154860a7908857

    splice: fix infinite loop in generic_file_splice_read()

    relay uses the same loop but it never got noticed or fixed.
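
    The pattern of that earlier fix, transplanted to relay's loop, looks
    roughly like this (a sketch; subbuf_splice_actor() is relay's existing
    helper, other details elided):

        while (len) {
            ret = subbuf_splice_actor(in, ppos, pipe, len, flags,
                                      &nonpad_ret);
            if (ret < 0)
                break;
            else if (!ret) {
                /* no progress: stop instead of spinning forever */
                if (flags & SPLICE_F_NONBLOCK)
                    ret = -EAGAIN;
                break;
            }
            /* ... advance *ppos, len and spliced as before ... */
        }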

    Cc: Mathieu Desnoyers
    Tested-by: Pekka Enberg
    Signed-off-by: Tom Zanussi
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     

10 Dec, 2008

1 commit

  • Impact: fix possible deadlock in CPU hot-remove path

    This patch fixes a possible deadlock scenario in the CPU remove path.
    migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
    with the lock held. Then one of the tasks on the migration queue ends up
    calling tg_shares_up which then also tries to acquire the same rq->lock.

    [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
    [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
    [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
    [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
    [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
    [c000000058eab640] c00000000007ada4 .complete+0x54/0x80
    [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
    [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
    [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
    [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
    [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
    [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
    [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
    [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
    [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
    [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
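
    The fix is to pull the waiters off the queue under rq->lock but issue
    the completions only after the lock is dropped; a sketch based on the
    description above (details elided):

        struct migration_req *req;
        LIST_HEAD(list);

        spin_lock_irq(&rq->lock);
        list_splice_init(&rq->migration_queue, &list);
        spin_unlock_irq(&rq->lock);

        /* safe now: complete() may wake tasks that re-take rq->lock */
        while (!list_empty(&list)) {
            req = list_entry(list.next, struct migration_req, list);
            list_del_init(&req->list);
            complete(&req->done);
        }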

    Signed-off-by: Brian King
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Brian King
     

09 Dec, 2008

4 commits


05 Dec, 2008

3 commits


04 Dec, 2008

2 commits

  • It had been put there to mark the call of blkdev_put() that needed the
    proper argument propagated to it; a later patch in the same series did
    just that.

    Signed-off-by: Al Viro

    Al Viro
     
  • Impact: fix time warp bug

    Alex Shi, along with Yanmin Zhang have been noticing occasional time
    inconsistencies recently. Through their great diagnosis, they found that
    the xtime_nsec value used in update_wall_time was occasionally going
    negative. After looking through the code for awhile, I realized we have
    the possibility for an underflow when three conditions are met in
    update_wall_time():

    1) We have accumulated a second's worth of nanoseconds, so we
    incremented xtime.tv_sec and appropriately decremented xtime_nsec.
    (This doesn't cause xtime_nsec to go negative, but it can cause it
    to be small.)

    2) The remaining offset value is large, but just slightly less than
    cycle_interval.

    3) clocksource_adjust() is speeding up the clock, causing a
    corrective amount (compensating for the increase in the multiplier
    being multiplied against the unaccumulated offset value) to be
    subtracted from xtime_nsec.

    This can cause xtime_nsec to underflow.

    Unfortunately, since we notify the NTP subsystem via second_overflow()
    whenever we accumulate a full second, and this affects the error
    accumulation that has already occurred, we cannot simply revert the
    accumulated second from xtime nor move the second accumulation to after
    the clocksource_adjust call without a change in behavior.

    This leaves us with (at least) two options:

    1) Simply return from clocksource_adjust() without making a change if we
    notice the adjustment would cause xtime_nsec to go negative.

    This would work, but I'm concerned that if a large adjustment was needed
    (due to the error being large), it may be possible to get stuck with an
    ever increasing error that becomes too large to correct (since it may
    always force xtime_nsec negative). This may just be paranoia on my part.

    2) Catch xtime_nsec if it is negative, then add back the amount by
    which it went negative to both xtime_nsec and the error.

    This second method is consistent with how we've handled earlier rounding
    issues, and also has the benefit that the error being added is always in
    the opposite direction and always equal to or smaller than the
    correction being applied. So the risk of a corner case where things get
    out of control is lessened.
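
    A sketch of option 2 (xtime_nsec as referenced above; the error and
    shift field names are assumptions based on the clocksource code):

        /* if xtime_nsec went negative, push it back up to zero and
         * charge the difference to the error term, keeping the
         * correction one-sided */
        if ((s64)clock->xtime_nsec < 0) {
            s64 neg = -(s64)clock->xtime_nsec;
            clock->xtime_nsec = 0;
            clock->error += neg << (NTP_SCALE_SHIFT - clock->shift);
        }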

    This patch fixes bug 11970, as tested by Yanmin Zhang
    http://bugzilla.kernel.org/show_bug.cgi?id=11970

    Reported-by: alex.shi@intel.com
    Signed-off-by: John Stultz
    Acked-by: "Zhang, Yanmin"
    Tested-by: "Zhang, Yanmin"
    Signed-off-by: Ingo Molnar

    john stultz
     

03 Dec, 2008

1 commit


02 Dec, 2008

2 commits

  • The description for 'D' was missing in the comment... (causing me a
    minute of WTF followed by looking at more of the code)

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • It has been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large amount
    of kernel memory, well within its maximum number of fds. To solve this
    problem, default limits are now imposed, and /proc based configuration
    has been introduced. A new directory has been created, named
    /proc/sys/fs/epoll/, and inside it there are two configuration points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by
    epoll to store "watches" to 1/32 of the amount of low RAM. As an
    example, a 256MB 32-bit machine will have "max_user_watches" set to
    roughly 90000. That should be enough to not break existing heavy epoll
    users. The default value for "max_user_instances" is set to 128, which
    should be enough too.

    This also changes the userspace-visible behavior, because a new error
    code can now come out of EPOLL_CTL_ADD (-ENOSPC). The EMFILE from
    epoll_create() was already listed, so that should be ok.
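
    A sketch of how the new watch limit is enforced at EPOLL_CTL_ADD time
    (field and variable names are assumptions):

        /* in ep_insert(): fail with -ENOSPC once the calling user has
         * used up the configured number of watches */
        if (unlikely(atomic_read(&ep->user->epoll_watches) >=
                     max_user_watches))
            return -ENOSPC;
        atomic_inc(&ep->user->epoll_watches);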

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Dec, 2008

6 commits


30 Nov, 2008

2 commits

  • Regarding the bug addressed in:

    4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

    Linus points out that the fix is not complete:

    > There's nothing that keeps gcc from deciding not to reload
    > rq->nr_running.
    >
    > Of course, in _practice_, I don't think gcc ever will (if it decides
    > that it will spill, gcc is likely going to decide that it will
    > literally spill the local variable to the stack rather than decide to
    > reload off the pointer), but it's a valid compiler optimization, and
    > it even has a name (rematerialization).
    >
    > So I suspect that your patch does fix the bug, but it still leaves the
    > fairly unlikely _potential_ for it to re-appear at some point.
    >
    > We have ACCESS_ONCE() as a macro to guarantee that the compiler
    > doesn't rematerialize a pointer access. That also would clarify
    > the fact that we access something unsafe outside a lock.

    So make sure our nr_running value is immutable and cannot change
    after we check it for nonzero.
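
    With that, the function reads roughly as follows (a sketch):

        static unsigned long cpu_avg_load_per_task(int cpu)
        {
            struct rq *rq = cpu_rq(cpu);
            /* read nr_running exactly once; the compiler may not
             * rematerialize the load from rq->nr_running */
            unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

            if (nr_running)
                rq->avg_load_per_task = rq->load.weight / nr_running;
            else
                rq->avg_load_per_task = 0;

            return rq->avg_load_per_task;
        }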

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This warning:

    kernel/cpuset.c: In function ‘generate_sched_domains’:
    kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function

    triggers because GCC does not recognize that ndoms stays uninitialized
    only if doms is NULL - but that flow is covered at the end of
    generate_sched_domains().

    Help out GCC by initializing this variable to 0. (that's prudent anyway)
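
    In code, the change is just (a sketch):

        int ndoms = 0;  /* stays 0 only when doms == NULL, a path that is
                         * handled at the end of generate_sched_domains() */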

    Also, this function needs a splitup and code-flow simplification: at
    160 lines it is clearly too long.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Nov, 2008

2 commits

  • Impact: fix divide by zero crash in scheduler rebalance irq

    While testing the branch profiler, I hit this crash:

    divide error: 0000 [#1] PREEMPT SMP
    [...]
    RIP: 0010:[] [] cpu_avg_load_per_task+0x50/0x7f
    [...]
    Call Trace:
    [] find_busiest_group+0x3e5/0xcaa
    [] rebalance_domains+0x2da/0xa21
    [] ? find_next_bit+0x1b2/0x1e6
    [] run_rebalance_domains+0x112/0x19f
    [] __do_softirq+0xa8/0x232
    [] call_softirq+0x1c/0x3e
    [] do_softirq+0x94/0x1cd
    [] irq_exit+0x6b/0x10e
    [] smp_apic_timer_interrupt+0xd3/0xff
    [] apic_timer_interrupt+0x13/0x20

    The code for cpu_avg_load_per_task has:

        if (rq->nr_running)
            rq->avg_load_per_task = rq->load.weight / rq->nr_running;

    The runqueue lock is not held here, and there is nothing that prevents
    rq->nr_running from going to zero after it passes the if condition.

    The branch profiler simply made the race window bigger.

    This patch saves off the rq->nr_running to a local variable and uses that
    for both the condition and the division.
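
    That is, roughly (a sketch; the ACCESS_ONCE() follow-up above hardens
    it further):

        unsigned long nr_running = rq->nr_running;  /* snapshot once */

        if (nr_running)
            rq->avg_load_per_task = rq->load.weight / nr_running;
        else
            rq->avg_load_per_task = 0;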

    Signed-off-by: Steven Rostedt
    Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Impact: prevent unnecessary stack recursion

    If the resched flag was set before we entered, then don't reschedule.

    Signed-off-by: Lai Jiangshan
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     

24 Nov, 2008

2 commits

  • Since CLOCK_PROCESS_CPUTIME_ID is in fact translated to -6, the switch
    statement in cpu_clock_sample_group() must first mask off the irrelevant
    bits, similar to cpu_clock_sample().
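
    A sketch of the change (CPUCLOCK_WHICH() is the existing helper that
    masks a clockid down to its clock index):

        /* was: switch (which_clock), which never matched for the
         * already-encoded CLOCK_PROCESS_CPUTIME_ID value of -6 */
        switch (CPUCLOCK_WHICH(which_clock)) {
        default:
            return -EINVAL;
        case CPUCLOCK_PROF:
            /* ... as before ... */
            break;
        }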

    Signed-off-by: Petr Tesarik
    Signed-off-by: Thomas Gleixner

    --
    posix-cpu-timers.c | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

    Petr Tesarik
     
  • Impact: fix mmiotrace overrun tracing

    When ftrace framework moved to use the ring buffer facility, the buffer
    overrun detection was broken after 2.6.27 by commit

    | commit 3928a8a2d98081d1bc3c0a84a2d70e29b90ecf1c
    | Author: Steven Rostedt
    | Date: Mon Sep 29 23:02:41 2008 -0400
    |
    | ftrace: make work with new ring buffer
    |
    | This patch ports ftrace over to the new ring buffer.

    The detection is now fixed by using the ring buffer API.

    When mmiotrace detects a buffer overrun, it will report the number of
    lost events. People reading an mmiotrace log must know if something was
    missed, otherwise the data may not make sense.
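
    A sketch of the detection using the ring buffer API (assuming the
    ring_buffer_overruns() accessor; the surrounding mmiotrace bookkeeping
    is elided):

        unsigned long lost = 0;
        unsigned long over = ring_buffer_overruns(iter->tr->buffer);

        if (over > prev_overruns)
            lost = over - prev_overruns;  /* events dropped since the
                                           * last check: report these */
        prev_overruns = over;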

    Signed-off-by: Pekka Paalanen
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Pekka Paalanen
     

23 Nov, 2008

1 commit


21 Nov, 2008

3 commits

  • Impact: prettify /proc/lockdep_info

    It just feels odd that not all lines of the lockdep info are aligned.

    Signed-off-by: Li Zefan
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • Impact: make output of stack_trace complete if buffer overruns

    When the read buffer overruns, the output of stack_trace isn't
    complete.

    When printing records with seq_printf in t_show, if the read buffer is
    overrun by the current record, then this record won't be printed to
    user space through the read buffer; it is simply dropped by this
    printing pass.

    On the next printing, t_start should return the "*pos"th record, which
    is the one dropped by the previous pass, but it just returns the
    (m->private + *pos)th record.

    Here we use the more sane method of implementing seq_operations that
    can be found elsewhere in kernel code. Thus we needn't initialize
    m->private.
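
    A sketch of that conventional pattern against a plain array (names
    assumed): ->start() returns the *pos'th record itself, so a record
    dropped because of an overrun is simply re-emitted on the next read:

        static void *t_start(struct seq_file *m, loff_t *pos)
        {
            return (*pos < nr_entries) ? &entries[*pos] : NULL;
        }

        static void *t_next(struct seq_file *m, void *v, loff_t *pos)
        {
            (*pos)++;
            return t_start(m, pos);
        }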

    About testing: it's not easy to overrun the read buffer, but we can use
    seq_printf to print extra padding bytes in t_show; then it's easy to
    check whether or not records are lost.

    This commit has been tested under both the overrun and non-overrun
    conditions.

    Signed-off-by: Liming Wang
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Liming Wang
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ftrace: fix dyn ftrace filter selection
    ftrace: make filtered functions effective on setting
    ftrace: fix set_ftrace_filter
    trace: introduce missing mutex_unlock()
    tracing: kernel/trace/trace.c: introduce missing kfree()

    Linus Torvalds
     

20 Nov, 2008

5 commits

  • Try this, and you'll get oops immediately:
    # cd Documentation/accounting/
    # gcc -o getdelays getdelays.c
    # mount -t cgroup -o debug xxx /mnt
    # ./getdelays -C /mnt/tasks

    This happens because a normal file's dentry->d_fsdata is a pointer to a
    struct cftype, not a struct cgroup.

    After the patch, it returns EINVAL if we try to get cgroupstats
    from a normal file.
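
    A minimal sketch of the validation (the exact check in the patch may
    differ):

        /* only cgroup directories carry a struct cgroup in d_fsdata;
         * for a control file it is a struct cftype, so reject those */
        if (!S_ISDIR(dentry->d_inode->i_mode))
            return -EINVAL;
        cgrp = dentry->d_fsdata;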

    Cc: Balbir Singh
    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • sprint_symbol(), itself used when dumping stacks, has been wasting 128
    bytes of stack: lookup the symbol directly into the buffer supplied by the
    caller, instead of using a locally declared namebuf.

    I believe the name != buffer strcpy() is obsolete: the design here dates
    from when module symbol lookup pointed into a supposedly const but sadly
    volatile table; nowadays it copies, but an uncalled strcpy() looks better
    here than the risk of a recursive BUG_ON().
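
    The resulting function, roughly (a sketch following the description
    above):

        int sprint_symbol(char *buffer, unsigned long address)
        {
            char *modname;
            const char *name;
            unsigned long offset, size;
            int len;

            /* resolve straight into the caller's buffer: no more
             * 128-byte namebuf on the stack */
            name = kallsyms_lookup(address, &size, &offset, &modname,
                                   buffer);
            if (!name)
                return sprintf(buffer, "0x%lx", address);

            if (name != buffer)
                strcpy(buffer, name);
            len = strlen(buffer);

            if (modname)
                len += sprintf(buffer + len, "+%#lx/%#lx [%s]",
                               offset, size, modname);
            else
                len += sprintf(buffer + len, "+%#lx/%#lx", offset, size);
            return len;
        }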

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As Balbir pointed out, memcg's pre_destroy handler has potential deadlock.

    It has the following lock sequence:

        cgroup_mutex (cgroup_rmdir)
          -> pre_destroy -> mem_cgroup_pre_destroy -> force_empty
            -> cpu_hotplug.lock (lru_add_drain_all ->
                                 schedule_work ->
                                 get_online_cpus)

    But cpuset has the following:

        cpu_hotplug.lock (call notifier)
          -> cgroup_mutex (within notifier)

    So this lock ordering must be fixed.

    Considering how pre_destroy works, it's not necessary to hold
    cgroup_mutex while calling it.

    As a side effect, we no longer have to wait on this mutex while memcg's
    force_empty runs (which can take a long time when there are tons of
    pages).
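
    A sketch of the reordering in the rmdir path (iteration helper names
    assumed):

        /* run each subsystem's ->pre_destroy() before cgroup_mutex is
         * taken, so it may sleep on cpu_hotplug.lock without inverting
         * the lock order described above */
        for_each_subsys(cgrp->root, ss)
            if (ss->pre_destroy)
                ss->pre_destroy(ss, cgrp);

        mutex_lock(&cgroup_mutex);
        /* ... the rest of cgroup_rmdir() ... */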

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changes. That is wrong because
    N_ONLINE just means the node has a pgdat; when a node has (or gains)
    memory, we use N_HIGH_MEMORY. So we should invoke the update function
    after node_states[N_HIGH_MEMORY] changes, just as the introducing
    commit intended.

    This patch fixes it, using a memory hotplug notifier instead of a
    direct call to cpuset_track_online_nodes(), as sketched below.
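
    A sketch of the notifier form (hotplug_memory_notifier() is the
    existing registration helper; the priority value and body details are
    assumptions):

        static int cpuset_track_online_nodes(struct notifier_block *self,
                                             unsigned long action,
                                             void *arg)
        {
            cgroup_lock();
            switch (action) {
            case MEM_ONLINE:
            case MEM_OFFLINE:
                /* follow the nodes that actually have memory */
                top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
                break;
            }
            cgroup_unlock();
            return NOTIFY_OK;
        }

        /* registered at init time: */
        hotplug_memory_notifier(cpuset_track_online_nodes, 10);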

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), eventfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html ("Secure File Descriptor
    Handling", Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional
    fcntl(F_GETFL) and fcntl(F_SETFL) calls to achieve the same result.)

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    #if USE_SOCKETCALL
        long args[6];
    #endif

        printf("Calling accept4(): flags = %x", flags);
        if (flags != 0) {
            printf(" (");
            if (flags & SOCK_CLOEXEC)
                printf("SOCK_CLOEXEC");
            if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
                printf(" ");
            if (flags & SOCK_NONBLOCK)
                printf("SOCK_NONBLOCK");
            printf(")");
        }
        printf("\n");

    #if USE_SOCKETCALL
        args[0] = fd;
        args[1] = (long) sockaddr;
        args[2] = (long) addrlen;
        args[3] = flags;

        return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
        return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
            int closeonexec_flag, int nonblock_flag)
    {
        int connfd, acceptfd;
        int fdf, flf, fdf_pass, flf_pass;
        struct sockaddr_in claddr;
        socklen_t addrlen;

        printf("=======================================\n");

        connfd = socket(AF_INET, SOCK_STREAM, 0);
        if (connfd == -1)
            die("socket");
        if (connect(connfd, (struct sockaddr *) conn_addr,
                    sizeof(struct sockaddr_in)) == -1)
            die("connect");

        addrlen = sizeof(struct sockaddr_in);
        acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
                           closeonexec_flag | nonblock_flag);
        if (acceptfd == -1) {
            perror("accept4()");
            close(connfd);
            return 0;
        }

        fdf = fcntl(acceptfd, F_GETFD);
        if (fdf == -1)
            die("fcntl:F_GETFD");
        fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
                   ((closeonexec_flag & SOCK_CLOEXEC) != 0);
        printf("Close-on-exec flag is %sset (%s); ",
               (fdf & FD_CLOEXEC) ? "" : "not ",
               fdf_pass ? "OK" : "failed");

        flf = fcntl(acceptfd, F_GETFL);
        if (flf == -1)
            die("fcntl:F_GETFL");
        flf_pass = ((flf & O_NONBLOCK) != 0) ==
                   ((nonblock_flag & SOCK_NONBLOCK) != 0);
        printf("nonblock flag is %sset (%s)\n",
               (flf & O_NONBLOCK) ? "" : "not ",
               flf_pass ? "OK" : "failed");

        close(acceptfd);
        close(connfd);

        printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
        return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
        struct sockaddr_in svaddr;
        int lfd;
        int optval;

        memset(&svaddr, 0, sizeof(struct sockaddr_in));
        svaddr.sin_family = AF_INET;
        svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
        svaddr.sin_port = htons(port_num);

        lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd == -1)
            die("socket");

        optval = 1;
        if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                       sizeof(optval)) == -1)
            die("setsockopt");

        if (bind(lfd, (struct sockaddr *) &svaddr,
                 sizeof(struct sockaddr_in)) == -1)
            die("bind");

        if (listen(lfd, 5) == -1)
            die("listen");

        return lfd;
    }

    int
    main(int argc, char *argv[])
    {
        struct sockaddr_in conn_addr;
        int lfd;
        int port_num;
        int passed;

        passed = 1;

        port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

        memset(&conn_addr, 0, sizeof(struct sockaddr_in));
        conn_addr.sin_family = AF_INET;
        conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        conn_addr.sin_port = htons(port_num);

        lfd = create_listening_socket(port_num);

        if (!do_test(lfd, &conn_addr, 0, 0))
            passed = 0;
        if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
            passed = 0;
        if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
            passed = 0;
        if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
            passed = 0;

        close(lfd);

        exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper