01 Aug, 2008

2 commits


27 Jul, 2008

3 commits

  • * dup2() should return -EBADF on exceeded sysctl_nr_open
    * dup() should *not* return -EINVAL even if you have rlimit set to 0;
    it should get -EMFILE instead.

    Check for orig_start exceeding rlimit taken to sys_fcntl().
    Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
    Consequently, remaining checks for rlimit are taken to expand_files().

    Signed-off-by: Al Viro

    Al Viro
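
The error codes named in the commit above can be observed from userspace. What follows is a minimal sketch, assuming a current Linux kernel and glibc; `probe_dup_errors` and `struct dup_probe` are hypothetical names introduced for illustration and are not part of the patch. It lowers the RLIMIT_NOFILE soft limit to 0, then records the errno from dup() and from dup2() to a target beyond that limit.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* errno values observed by the probe, or 0 if the call succeeded. */
struct dup_probe {
    int dup_errno;   /* from dup() with the soft limit set to 0 */
    int dup2_errno;  /* from dup2() to a target beyond the limit */
};

/* Hypothetical helper: returns 0 if the probe ran, -1 if it could not. */
int probe_dup_errors(struct dup_probe *out)
{
    struct rlimit saved, zero;
    int fd = open("/dev/null", O_RDONLY);

    if (fd == -1)
        return -1;
    if (getrlimit(RLIMIT_NOFILE, &saved) == -1) {
        close(fd);
        return -1;
    }
    zero = saved;
    zero.rlim_cur = 0;  /* lower only the soft limit so it can be restored */
    if (setrlimit(RLIMIT_NOFILE, &zero) == -1) {
        close(fd);
        return -1;
    }

    int r = dup(fd);
    out->dup_errno = (r == -1) ? errno : 0;
    if (r != -1)
        close(r);

    r = dup2(fd, fd + 1);
    out->dup2_errno = (r == -1) ? errno : 0;
    if (r != -1)
        close(r);

    setrlimit(RLIMIT_NOFILE, &saved);  /* restore the original soft limit */
    close(fd);
    return 0;
}
```

On a kernel carrying this change one would expect `dup_errno` to be EMFILE and `dup2_errno` to be EBADF.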
     
  • Since Ulrich is OK with getting rid of dup3(fd, fd, flags) completely,
    to hell the damn thing goes. Corner case for dup2() is handled in
    sys_dup2() (complete with -EBADF if dup2(fd, fd) is called with fd
    that is not open), the rest is done in dup3().

    Signed-off-by: Al Viro

    Al Viro
     
  • Al Viro noticed one corner case in the new dup3() code. The dup2()
    function, as a special case, handles dup-ing to the same file
    descriptor. In this case the current dup3() code does nothing at
    all, i.e., it ignores the flags parameter. This shouldn't happen;
    the close-on-exec flag should be set if requested.

    In case the O_CLOEXEC bit in the flags parameter is not set, the
    dup3() function should behave in this respect identically to dup2().
    This means dup3(fd, fd, 0) should not actively reset the c-o-e
    flag.

    The patch below implements this minor change.

    [AV: credits to Artur Grabowski for bringing that up as potential subtle point
    in dup2() behaviour]

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Al Viro

    Ulrich Drepper
     

25 Jul, 2008

1 commit

  • This patch adds the new dup3 syscall. It extends the old dup2 syscall by one
    parameter which is meant to hold a flag value. Support for the O_CLOEXEC flag
    is added in this patch.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_dup3
    # ifdef __x86_64__
    # define __NR_dup3 292
    # elif defined __i386__
    # define __NR_dup3 330
    # else
    # error "need __NR_dup3"
    # endif
    #endif

    int
    main (void)
    {
      /* Without flags, dup3 must leave close-on-exec clear.  */
      int fd = syscall (__NR_dup3, 1, 4, 0);
      if (fd == -1)
        {
          puts ("dup3(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("dup3(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      /* With O_CLOEXEC, the new descriptor must have close-on-exec set.  */
      fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC);
      if (fd == -1)
        {
          puts ("dup3(O_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("dup3(O_CLOEXEC) did not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

03 Jul, 2008

1 commit


02 May, 2008

1 commit


25 Apr, 2008

1 commit

  • * 'file' argument is unused; lose it.
    * move setting flags from the caller (dupfd()) to locate_fd();
    pass cloexec flag as new argument. Note that files_fdtable()
    that used to be in dupfd() isn't needed in the place in
    locate_fd() where the moved code ends up - we know that ->file_lock
    hadn't been dropped since the last time we calculated fdt because
    we can get there only if expand_files() returns 0 and it doesn't
    drop/reacquire in that case.
    * move getting/dropping ->file_lock into locate_fd(). Now the caller
    doesn't need to do anything with files_struct *files anymore and
    we can move that inside locate_fd() as well, killing the
    struct files_struct * argument.

    At that point locate_fd() is extremely similar to get_unused_fd_flags()
    and the next patches will merge those two.

    Signed-off-by: Al Viro

    Al Viro
     

09 Feb, 2008

2 commits

  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

20 Oct, 2007

1 commit

  • This is the largest patch in the set. Make all (I hope) the places where
    the pid is shown to or obtained from the user operate on the virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
      or the pid's global nr, obtained with the pid_nr() call;
    - when seeking the task from kernel code with the stored id one
      should use the find_task_by_pid() call, which works with global pids;
    - when showing a pid's numerical value to the user the virtual one
      should be used; however, when one shows a task's pid outside this
      task's namespace the global one is to be used;
    - when getting the pid from userspace one needs to consider it
      the virtual one and use the appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

17 Oct, 2007

1 commit

  • One more small change to extend the availability of creation of file
    descriptors with FD_CLOEXEC set. Adding a new command to fcntl() requires
    no new system call and the overall impact on code size is minimal.

    If this patch gets accepted we will also add this change to the next
    revision of the POSIX spec.

    To test the patch, use the following little program. Adjust the value of
    F_DUPFD_CLOEXEC appropriately.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #ifndef F_DUPFD_CLOEXEC
    # define F_DUPFD_CLOEXEC 12
    #endif

    int
    main (int argc, char *argv[])
    {
      if (argc > 1)
        {
          /* Re-executed copy: descriptor 3 must have been closed by exec.  */
          if (fcntl (3, F_GETFD) == 0)
            {
              puts ("descriptor not closed");
              exit (1);
            }
          if (errno != EBADF)
            {
              puts ("error not EBADF");
              exit (1);
            }

          exit (0);
        }
      /* First run: duplicate stdout with close-on-exec set, then re-exec.  */
      int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
      if (fd == -1 && errno == EINVAL)
        {
          puts ("F_DUPFD_CLOEXEC not supported");
          return 0;
        }
      if (fd != 3)
        {
          puts ("program called with descriptors other than 0,1,2");
          return 1;
        }

      execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
      puts ("execl failed");
      return 1;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Cc: Kyle McMartin
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

18 Jul, 2007

1 commit

  • Introduce is_owner_or_cap() macro in fs.h, and convert over relevant
    users to it. This is done because we want to avoid bugs in the future
    where we check for only effective fsuid of the current task against a
    file's owning uid, without simultaneously checking for CAP_FOWNER as
    well, thus violating its semantics.
    [ XFS uses special macros and structures, and in general looked ...
    untouchable, so we leave it alone -- but it has been looked over. ]

    The (current->fsuid != inode->i_uid) check in generic_permission() and
    exec_permission_lite() is left alone, because those operations are
    covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations
    falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone.

    Signed-off-by: Satyam Sharma
    Cc: Al Viro
    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Satyam Sharma
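
The combined check the commit describes can be modelled in a few lines. This is a simplified userspace sketch, not the kernel macro itself: `struct model_task`, `struct model_inode` and `model_is_owner_or_cap` are stand-ins invented here, with the capability reduced to a boolean.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Simplified stand-ins for the kernel objects named in the commit text. */
struct model_task  { uid_t fsuid; bool cap_fowner; };
struct model_inode { uid_t i_uid; };

/* Ownership match OR CAP_FOWNER, evaluated together so that neither
 * half of the test can be forgotten at a call site. */
static inline bool model_is_owner_or_cap(const struct model_task *t,
                                         const struct model_inode *inode)
{
    return t->fsuid == inode->i_uid || t->cap_fowner;
}
```

A caller that compared only `fsuid` against `i_uid` would wrongly refuse a non-owner task that holds CAP_FOWNER; routing every such check through one helper avoids that class of bug.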
     

11 Dec, 2006

1 commit

  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
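
The layout change and the two transitional #define's can be pictured as follows. This is a reduced model that compiles in userspace, assuming only the members the commit message mentions; the real struct file has many more fields.

```c
/* Opaque here; only pointers to these types are needed. */
struct dentry;
struct vfsmount;

/* The (mount, dentry) pair that struct file now embeds. */
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

/* Reduced model of struct file: every other member is omitted. */
struct file {
    struct path f_path;
};

/* Transitional aliases so that old f_dentry / f_vfsmnt users still build. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt
```

With the aliases in place, `filp->f_dentry` and `filp->f_path.dentry` name the same storage, so callers outside fs/ can be converted gradually.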
     

08 Dec, 2006

2 commits

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
      quilt add $file
      sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
      mv /tmp/$$ $file
      quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Oct, 2006

2 commits

  • This has been needed for a long time, but now with the advent of a
    reference counted struct pid there are real consequences for getting this
    wrong.

    Someone, I think it was Oleg Nesterov, pointed out that this construct was
    missing locking when I introduced struct pid. After taking time to review
    the locking construct already present I figured out which lock needs to be
    taken. The other paths that access f_owner.pid take either the f_owner
    read or the write lock.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • File handles can be requested to send sigio and sigurg to processes. By
    tracking the destination processes using struct pid instead of pid_t we make
    the interface safe from all potential pid wrap around problems.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

02 Apr, 2006

1 commit


27 Mar, 2006

1 commit

  • I discovered while hunting with oprofile on an SMP platform that dentry lookups were
    slowed down because d_hash_mask, d_hash_shift and dentry_hashtable were in
    a cache line that contained inodes_stat. So each time inodes_stat is
    changed by a cpu, other cpus have to refill their cache line.

    This patch moves some variables to the __read_mostly section, in order to
    avoid false sharing. RCU dentry lookups can go full speed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

23 Mar, 2006

1 commit

  • 1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32-bit
    platforms, lowering kmalloc() allocated space by 50%.

    2) Reduce the size of (files_struct), using a special 32-bit (or
    64-bit) embedded_fd_set, instead of a 1024-bit fd_set for the
    close_on_exec_init and open_fds_init fields. This saves some RAM (248
    bytes per task), as most tasks don't open more than 32 files. D-Cache
    footprint for such tasks is also reduced to the minimum.

    3) Reduce the size of the allocated fdset. Currently two full pages are
    allocated, that is 32768 bits on x86 for example, which is way too much. The
    minimum is now L1_CACHE_BYTES.

    UP and SMP should benefit from this patch, because most tasks will touch
    only one cache line when they open()/close() stdin/stdout/stderr (0/1/2),
    (next_fd, close_on_exec_init, open_fds_init and fd_array[0 .. 2] being in the
    same cache line).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
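
The 248-byte figure in point 2 follows from the two embedded sets, close_on_exec_init and open_fds_init, each shrinking from 1024 bits to 32 bits. A small sketch of that arithmetic, with `fdset_bytes_saved` as a hypothetical helper name:

```c
#include <stddef.h>

#define BITS_PER_BYTE 8

/* Bytes saved when n_sets bitmaps shrink from old_bits to new_bits each. */
size_t fdset_bytes_saved(size_t old_bits, size_t new_bits, size_t n_sets)
{
    return n_sets * (old_bits - new_bits) / BITS_PER_BYTE;
}
```

`fdset_bytes_saved(1024, 32, 2)` gives 2 * 992 / 8 = 248, the number quoted for 32-bit platforms; with 64-bit embedded sets the same formula gives 240.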
     

04 Feb, 2006

1 commit

  • There is code in setfl() which attempts to preserve the O_APPEND flag on
    IS_APPEND files... however IS_APPEND files could also be opened O_RDONLY
    and in that case setfl() should not require O_APPEND...

    coreutils 5.93 tail -f attempts to set O_NONBLOCK even on regular files...
    unfortunately if you try this on an append-only log file the result is
    this:

    fcntl64(3, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
    fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = -1 EPERM (Operation not permitted)

    I offer up the patch below as one way of fixing the problem... i've tested
    it fixes the problem with tail -f but haven't really tested beyond that.

    (I also reported the coreutils bug upstream... it shouldn't fail imho...)

    Signed-off-by: dean gaudet
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    dean gaudet
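
The difference between the strict rule and a relaxed one can be modelled as two predicates. This is a sketch of one way to express the idea, not necessarily the exact code of the patch; the function names and the boolean standing in for IS_APPEND(inode) are assumptions made here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Strict rule: any F_SETFL argument lacking O_APPEND is refused on an
 * append-only inode, even if the file was opened read-only. */
int setfl_check_old(int old_flags, int new_flags, bool inode_append_only)
{
    (void)old_flags;
    if (!(new_flags & O_APPEND) && inode_append_only)
        return -EPERM;
    return 0;
}

/* Relaxed rule: refuse only when the O_APPEND bit would actually change. */
int setfl_check_new(int old_flags, int new_flags, bool inode_append_only)
{
    if (((old_flags ^ new_flags) & O_APPEND) && inode_append_only)
        return -EPERM;
    return 0;
}
```

With the tail -f case from the trace above (O_RDONLY becoming O_RDONLY|O_NONBLOCK on an append-only file), the strict predicate returns -EPERM while the relaxed one returns 0; clearing O_APPEND on a descriptor that has it set is still refused.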
     

15 Jan, 2006

1 commit


12 Jan, 2006

1 commit


09 Jan, 2006

1 commit

  • The only user of send_sigio_to_task() already holds tasklist_lock, so it is
    better not to send the signal via send_group_sig_info() (which takes
    tasklist recursively) but use group_send_sig_info().

    The same change in send_sigurg()->send_sigurg_to_task().

    Signed-off-by: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

10 Sep, 2005

3 commits

  • With the use of RCU in files structure, the look-up of files using fds can now
    be lock-free. The lookup is protected by rcu_read_lock()/rcu_read_unlock().
    This patch changes the readers to use lock-free lookup.

    Signed-off-by: Maneesh Soni
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     
  • Patch to eliminate struct files_struct.file_lock spinlock on the reader side
    and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
    updates to the fdtable are done by allocating a new fdtable structure and
    setting files->fdt to point to the new structure. The fdtable structure is
    protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
    are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
    global list of fdtable to be freed is not scalable, so we use a per-cpu list.
    If keventd is already handling the current cpu's work, we use a timer to defer
    queueing of that work.

    Since the last publication, this patch has been re-written to avoid using
    explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
    primitives instead. This required that the fd information is kept in a
    separate structure (fdtable) and updated atomically.

    Signed-off-by: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
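
The publish-by-pointer idea can be sketched in userspace with C11 atomics. This is only an analogy: a release store stands in for rcu_assign_pointer(), an acquire load for rcu_dereference(), and deferred freeing of the old table (the job of RCU callbacks and keventd in the patch) is left to the caller. All type and function names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Size and array live together, so one pointer load sees both consistently. */
struct fdtable_model {
    unsigned int max_fds;
    void **fd;               /* max_fds slots of "file" pointers */
};

struct files_model {
    _Atomic(struct fdtable_model *) fdt;
};

/* Reader: lock-free lookup through a single acquire load of the table. */
void *lookup_fd(struct files_model *files, unsigned int fd)
{
    struct fdtable_model *fdt =
        atomic_load_explicit(&files->fdt, memory_order_acquire);

    return (fdt && fd < fdt->max_fds) ? fdt->fd[fd] : NULL;
}

/* Writer (assumed to be serialized by a lock not shown here): build a
 * larger copy, then publish it with one release store. The old table is
 * handed back through old_out; it must not be freed until no reader can
 * still be using it. Returns 0 on success, -1 on failure. */
int grow_fdtable(struct files_model *files, unsigned int new_max,
                 struct fdtable_model **old_out)
{
    struct fdtable_model *old =
        atomic_load_explicit(&files->fdt, memory_order_relaxed);
    unsigned int old_max = old ? old->max_fds : 0;
    struct fdtable_model *nt;

    if (new_max < old_max)
        return -1;
    nt = malloc(sizeof(*nt));
    if (!nt)
        return -1;
    nt->fd = calloc(new_max ? new_max : 1, sizeof(*nt->fd));
    if (!nt->fd) {
        free(nt);
        return -1;
    }
    if (old_max)
        memcpy(nt->fd, old->fd, old_max * sizeof(*nt->fd));
    nt->max_fds = new_max;

    atomic_store_explicit(&files->fdt, nt, memory_order_release);
    *old_out = old;
    return 0;
}
```

Because readers never see a new array paired with an old size (or the reverse), no per-field memory barriers are needed, which is the motivation the commit gives for moving the fd information into a separate structure.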
     
  • In order for the RCU to work, the file table array, sets and their sizes must
    be updated atomically. Instead of ensuring this through too many memory
    barriers, we put the arrays and their sizes in a separate structure. This
    patch takes the first step of putting the file table elements in a separate
    structure fdtable that is embedded within files_struct. It also changes all
    the users to refer to the file table using files_fdtable() macro. Subsequent
    application of RCU becomes easier after this.

    Signed-off-by: Dipankar Sarma
    Signed-Off-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     

28 Jul, 2005

1 commit

  • I believe that there is a problem with the handling of POSIX locks, which
    the attached patch should address.

    The problem appears to be a race between fcntl(2) and close(2). A
    multithreaded application could close a file descriptor at the same time as
    it is trying to acquire a lock using the same file descriptor. I would
    suggest that that multithreaded application is not providing the proper
    synchronization for itself, but the OS should still behave correctly.

    SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
    a file descriptor is closed, that all POSIX locks on the file, owned by the
    process which closed the file descriptor, should be released.

    The trick here is when those locks are released. The current code releases
    all locks which exist when close is processing, but any locks in progress
    are handled when the last reference to the open file is released.

    There are three cases to consider.

    One is the simple case, a multithreaded (mt) process has a file open and
    races to close it and acquire a lock on it. In this case, the close will
    release one reference to the open file and when the fcntl is done, it will
    release the other reference. For this situation, no locks should exist on
    the file when both the close and fcntl operations are done. The current
    system will handle this case because the last reference to the open file is
    being released.

    The second case is when the mt process has dup(2)'d the file descriptor.
    The close will release one reference to the file and the fcntl, when done,
    will release another, but there will still be at least one more reference
    to the open file. One could argue that the existence of a lock on the file
    after the close has completed is okay, because it was acquired after the
    close operation and there is still a way for the application to release the
    lock on the file, using an existing file descriptor.

    The third case is when the mt process has forked, after opening the file
    and either before or after becoming an mt process. In this case, each
    process would hold a reference to the open file. For each process, this
    degenerates to first case above. However, the lock continues to exist
    until both processes have released their references to the open file. This
    lock could block other lock requests.

    The changes to release the lock when the last reference to the open file
    is released aren't quite right because they would allow the lock to exist as long as
    there was a reference to the open file. This is too long.

    The new proposed solution is to add support in the fcntl code path to
    detect a race with close and then to release the lock which was just
    acquired when such a race is detected. This causes locks to be released
    in a timely fashion and for the system to conform to the POSIX semantic
    specification.

    This was tested by instrumenting a kernel to detect the handling locks and
    then running a program which generates case #3 above. A dangling lock
    could be reliably generated. When the changes to detect the close/fcntl
    race were added, a dangling lock could no longer be generated.

    Cc: Matthew Wilcox
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Staubach
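
The POSIX rule underlying this fix (closing any descriptor for a file drops all of the process's locks on that file) can be observed without reproducing the race itself. The sketch below assumes Linux, a writable /tmp on a local filesystem, and hypothetical helper names; it does not exercise the close/fcntl race the patch addresses.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Asks, from a forked child, whether a write lock on fd would conflict
 * with a lock held by another process (here, the parent).
 * Returns 1 if a lock is held, 0 if not, -1 on error. */
static int locked_by_parent(int fd)
{
    pid_t pid = fork();
    int status;

    if (pid == -1)
        return -1;
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

        if (fcntl(fd, F_GETLK, &fl) == -1)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status) ||
        WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}

/* Locks a temporary file through fd1, then closes a second descriptor
 * for the same file. Stores the lock state seen before and after the
 * close. Returns 0 if the experiment ran, -1 otherwise. */
int lock_dropped_on_close(int *before, int *after)
{
    char path[] = "/tmp/posix-lock-XXXXXX";
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd1 = mkstemp(path);
    int fd2;

    if (fd1 == -1)
        return -1;
    fd2 = open(path, O_RDWR);
    unlink(path);
    if (fd2 == -1 || fcntl(fd1, F_SETLK, &fl) == -1) {
        if (fd2 != -1)
            close(fd2);
        close(fd1);
        return -1;
    }

    *before = locked_by_parent(fd1);
    close(fd2);               /* not the descriptor used to take the lock */
    *after = locked_by_parent(fd1);
    close(fd1);
    return 0;
}
```

One would expect `*before` to be 1 and `*after` to be 0: the lock disappears even though the descriptor it was taken through is still open, which is why a lock acquired concurrently with a close needs special handling.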
     

01 May, 2005

1 commit


17 Apr, 2005

2 commits

  • A question on the sigwaitinfo-based IO mechanism in multithreaded applications.

    I am trying to use RT signals instead of SIGIO to notify me of IO events
    in a multithreaded application. I noticed that there was
    some discussion on lkml during November 1999 under the subject
    "Signal driven IO". In that thread I noticed that RT signals
    were being delivered to the worker thread. I am running a 2.6.10 kernel and
    I am trying to use the very same mechanism, and I find that only SIGIO is
    propagated to the worker threads, while RT signals are propagated only to
    the main thread and not to the worker threads where I actually want them to be
    propagated. On further inspection I found that the following patch,
    which I have attached, solves the problem.

    I am not sure if this is a bug or feature in the kernel.

    Roland McGrath said:

    This relates only to fcntl F_SETSIG, which is a Linux extension. So there is
    no POSIX issue. When changing various things like the normal SIGIO signalling
    to do group signals, I was concerned strictly with the POSIX semantics and
    generally avoided touching things in the domain of Linux inventions. That's
    why I didn't change this when I changed the call right next to it. There is
    no reason I can see that F_SETSIG-requested signals shouldn't use a group
    signal like normal SIGIO does. I'm happy to ACK this patch, there is nothing
    wrong with its change to the semantics in my book. But neither POSIX nor I
    care a whit what F_SETSIG does.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharath Ramesh
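
For readers unfamiliar with the Linux-specific F_SETSIG mentioned above, here is a minimal sketch of selecting a real-time signal for a descriptor. The helper name is hypothetical, and the fallback values 10 and 11 are the generic Linux numbers, supplied only because glibc exposes these constants solely under _GNU_SOURCE.

```c
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Fallbacks for builds without _GNU_SOURCE (generic Linux values). */
#ifndef F_SETSIG
# define F_SETSIG 10
#endif
#ifndef F_GETSIG
# define F_GETSIG 11
#endif

/* Selects signo as the signal sent for IO events on fd and returns the
 * value read back with F_GETSIG, or -1 if either fcntl() call fails. */
int request_rt_signal(int fd, int signo)
{
    if (fcntl(fd, F_SETSIG, signo) == -1)
        return -1;
    return fcntl(fd, F_GETSIG);
}
```

Signals are only generated once O_ASYNC is also enabled with F_SETFL and an owner is set with F_SETOWN; which thread receives the signal is the question this commit addresses.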
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds