24 Mar, 2011

1 commit


15 Mar, 2011

1 commit

  • New flag for open(2) - O_PATH. Semantics:
    * pathname is resolved, but the file itself is _NOT_ opened
      as far as filesystem is concerned.
    * almost all operations on the resulting descriptors shall
      fail with -EBADF. Exceptions are:
      1) operations on descriptors themselves (i.e.
         close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
         fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
         fcntl(fd, F_SETFD, ...))
      2) fcntl(fd, F_GETFL), for a common non-destructive way to
         check if descriptor is open
      3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
         points of pathname resolution
    * closing such descriptor does *NOT* affect dnotify or
      posix locks.
    * permissions are checked as usual along the way to file;
      no permission checks are applied to the file itself. Of course,
      giving such thing to syscall will result in permission checks (at
      the moment it means checking that starting point of ...at() is
      a directory and caller has exec permissions on it).

    fget() and fget_light() return NULL on such descriptors; use of
    fget_raw() and fget_raw_light() is needed to get them. That protects
    existing code from dealing with those things.

    There are two things still missing (they come in the next commits):
    one is handling of symlinks (right now we refuse to open them that
    way; see the next commit for semantics related to those) and another
    is descriptor passing via SCM_RIGHTS datagrams.
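
    A rough user-space illustration of the semantics listed above, not part
    of the patch. The helper name check_o_path() is made up for this sketch,
    and the fallback value for O_PATH is an assumption that matches x86 and
    ARM but not every architecture.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_PATH in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000        /* assumption: x86/ARM value, used only if libc hides it */
#endif

/* Hypothetical helper: returns 0 if an O_PATH descriptor for the directory
 * `dir` behaves as described above, -1 otherwise. */
int check_o_path(const char *dir)
{
    char byte;
    int ok = 1;
    int fd;
    int dfd = open(dir, O_PATH | O_DIRECTORY);

    if (dfd < 0)
        return -1;

    /* I/O on the descriptor itself is refused with EBADF. */
    errno = 0;
    if (read(dfd, &byte, 1) != -1 || errno != EBADF)
        ok = 0;

    /* Operations on the descriptor (not the file) still work. */
    if (fcntl(dfd, F_GETFD) == -1)
        ok = 0;
    if (fcntl(dfd, F_GETFL) == -1)
        ok = 0;

    /* The descriptor can be the starting point of an ...at() lookup. */
    fd = openat(dfd, ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        ok = 0;
    else
        close(fd);

    close(dfd);
    return ok ? 0 : -1;
}
```

    Calling check_o_path("/") on a kernel with O_PATH support would be
    expected to return 0.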

    Signed-off-by: Al Viro

    Al Viro
     

03 Feb, 2011

1 commit

  • FMODE_EXEC is a constant of type fmode_t but was used with normal integer
    constants. This results in the following warnings from sparse. Fix it
    using the new macro __FMODE_EXEC.

    fs/exec.c:116:58: warning: restricted fmode_t degrades to integer
    fs/exec.c:689:58: warning: restricted fmode_t degrades to integer
    fs/fcntl.c:777:9: warning: restricted fmode_t degrades to integer

    Signed-off-by: Namhyung Kim
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

28 Oct, 2010

2 commits

  • In commit f7347ce4ee7c ("fasync: re-organize fasync entry insertion to
    allow it under a spinlock") Arnd took an earlier patch of mine that had
    the comment about the FASYNC flag above the wrong function.

    When the fasync_add_entry() function was split to introduce the new
    fasync_insert_entry() helper function, the code that actually cares
    about the FASYNC bit moved to that new helper.

    So just move the comment to the right point.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • You currently cannot use "fasync_helper()" in an atomic environment to
    insert a new fasync entry, because it will need to allocate the new
    "struct fasync_struct".

    Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
    the process of being converted from the BKL to a spinlock.

    In order to fix this, this abstracts out the actual fasync list
    insertion and the fasync allocations into functions of their own, and
    teaches fs/locks.c to pre-allocate the fasync_struct entry. That way
    the actual list insertion can happen while holding the required
    spinlock.
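
    The split described above (allocate first, insert under the lock, free
    the spare if it was not needed) can be sketched in user space. This is
    only an analogy: the names below are invented, and a C11 atomic_flag
    stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct entry {
    int key;
    int value;
    struct entry *next;
};

static struct entry *list_head;                   /* protected by list_lock */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;  /* stand-in for a spinlock */

static void list_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&list_lock, memory_order_acquire))
        ;                                         /* spin */
}

static void list_lock_release(void)
{
    atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Runs entirely under the lock and never allocates. Returns the existing
 * entry if `key` was already present (the caller then frees `fresh`), or
 * NULL if `fresh` was linked into the list. */
static struct entry *insert_entry(int key, int value, struct entry *fresh)
{
    struct entry *e, *found = NULL;

    list_lock_acquire();
    for (e = list_head; e; e = e->next) {
        if (e->key == key) {
            e->value = value;
            found = e;
            break;
        }
    }
    if (!found) {
        fresh->key = key;
        fresh->value = value;
        fresh->next = list_head;
        list_head = fresh;
    }
    list_lock_release();
    return found;
}

/* Allocation happens before the lock is taken, where blocking is allowed. */
int add_entry(int key, int value)
{
    struct entry *fresh = malloc(sizeof(*fresh));

    if (!fresh)
        return -1;
    if (insert_entry(key, value, fresh))
        free(fresh);              /* an existing entry was updated instead */
    return 0;
}

/* Counts the entries currently on the list. */
int count_entries(void)
{
    int n = 0;
    struct entry *e;

    list_lock_acquire();
    for (e = list_head; e; e = e->next)
        n++;
    list_lock_release();
    return n;
}
```

    A caller that already holds the lock could allocate `fresh` up front and
    call insert_entry() directly, which is the shape the commit gives to
    fs/locks.c.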

    Signed-off-by: Linus Torvalds
    [bfields@redhat.com: rebase on top of my changes to Arnd's patch]
    Tested-by: J. Bruce Fields
    Signed-off-by: Arnd Bergmann

    Linus Torvalds
     

10 Sep, 2010

1 commit

  • O_NONBLOCK on parisc has a dual value:

    #define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */

    It is caught by the O_* bits uniqueness check and leads to a parisc
    compile error. The fix is to take O_NONBLOCK out of that check.

    Signed-off-by: Wu Fengguang
    Signed-off-by: James Bottomley
    Cc: Jamie Lokier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
     

11 Aug, 2010

1 commit

  • The O_* bit numbers are defined in 20+ arch/*, and can silently overlap.
    Add a compile time check to ensure the uniqueness as suggested by David
    Miller.
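
    One way such a compile-time check can be written is to count the bits in
    the OR of all flags and compare with the number of flags. The sketch
    below is not the kernel's code: the macro names are invented, it uses
    C11 _Static_assert, and it assumes each listed flag is a single bit.

```c
#include <fcntl.h>

/* Constant-expression population count, usable inside _Static_assert. */
#define BIT_AT(x, n)  (((unsigned int)(x) >> (n)) & 1u)
#define HWEIGHT4(x)   (BIT_AT(x, 0) + BIT_AT(x, 1) + BIT_AT(x, 2) + BIT_AT(x, 3))
#define HWEIGHT8(x)   (HWEIGHT4(x) + HWEIGHT4((unsigned int)(x) >> 4))
#define HWEIGHT16(x)  (HWEIGHT8(x) + HWEIGHT8((unsigned int)(x) >> 8))
#define HWEIGHT32(x)  (HWEIGHT16(x) + HWEIGHT16((unsigned int)(x) >> 16))

/* Five flags, each assumed to occupy exactly one bit. */
#define CHECKED_FLAGS (O_APPEND | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC)

/* If two flags overlapped, the OR would have fewer than five bits set
 * and compilation would stop here. */
_Static_assert(HWEIGHT32(CHECKED_FLAGS) == 5, "two O_* flags share a bit");

/* Exposes the computed bit count at run time, for inspection. */
int checked_flag_bits(void)
{
    return HWEIGHT32(CHECKED_FLAGS);
}
```

    A flag that is deliberately defined as two bits would trip a check of
    this kind and has to be left out of the list.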

    Signed-off-by: Wu Fengguang
    Cc: David Miller
    Cc: Stephen Rothwell
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Eric Paris
    Cc: Roland Dreier
    Cc: Jamie Lokier
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

30 Jun, 2010

1 commit

  • Fix a lockdep-splat-causing regression introduced by commit 989a2979205d
    ("fasync: RCU and fine grained locking").

    kill_fasync() can be called from both process and hard-irq context, so
    fa_lock must be taken with IRQs disabled.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16230

    Reported-by: Sergey Senozhatsky
    Reported-by: Dominik Brodowski
    Tested-by: Dominik Brodowski
    Cc: Maciej Rutecki
    Acked-by: Eric Dumazet
    Cc: Paul E. McKenney
    Cc: Lai Jiangshan
    Cc: "David S. Miller"
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Jun, 2010

1 commit

  • copy_to_user() returns the number of bytes remaining, but we want to
    return -EFAULT.

        ret = fcntl(fd, F_SETOWN_EX, NULL);

    With the original code ret would be 8 here.

    V2: Takuya Yoshikawa pointed out a similar issue in f_getown_ex()
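
    The bug pattern can be shown with a user-space mock. Everything below is
    hypothetical: mock_copy_to_user() only imitates the return convention of
    copy_to_user() (bytes NOT copied), with a NULL destination standing in
    for a faulting user pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct owner_ex {               /* stand-in for struct f_owner_ex */
    int type;
    int pid;
};

/* Mock: returns the number of bytes that could not be copied, 0 on success. */
static unsigned long mock_copy_to_user(void *dst, const void *src,
                                       unsigned long n)
{
    if (dst == NULL)
        return n;
    memcpy(dst, src, n);
    return 0;
}

/* Buggy shape: the "bytes remaining" count leaks out as the return value. */
long getown_buggy(struct owner_ex *uptr)
{
    struct owner_ex owner = { 0, 1234 };

    return (long)mock_copy_to_user(uptr, &owner, sizeof(owner));
}

/* Fixed shape: any nonzero remainder is mapped to -EFAULT. */
long getown_fixed(struct owner_ex *uptr)
{
    struct owner_ex owner = { 0, 1234 };

    if (mock_copy_to_user(uptr, &owner, sizeof(owner)))
        return -EFAULT;
    return 0;
}
```

    With a NULL pointer, getown_buggy() returns sizeof(struct owner_ex),
    while getown_fixed() returns -EFAULT.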

    Signed-off-by: Dan Carpenter
    Signed-off-by: Al Viro

    Dan Carpenter
     

22 May, 2010

2 commits


22 Apr, 2010

1 commit

  • kill_fasync() uses a central rwlock, candidate for RCU conversion, to
    avoid cache line ping pongs on SMP.

    fasync_remove_entry() and fasync_add_entry() can disable IRQs for a short
    section instead of during the whole list scan.

    Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
    fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
    doesn't need its own implementation and can use fasync_helper(), to
    reduce code size and complexity.

    We can remove __kill_fasync() direct use in net/socket.c, and rename it
    to kill_fasync_rcu().

    Signed-off-by: Eric Dumazet
    Cc: Paul E. McKenney
    Cc: Lai Jiangshan
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
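
    The idea of fetching a limit exactly once can be sketched in user space
    as follows. LOAD_ONCE_UL() is an invented stand-in for ACCESS_ONCE(), and
    soft_limit is a made-up variable; the volatile cast only forces a single
    load and is not a general substitute for proper synchronization.

```c
/* Invented stand-in for the kernel's ACCESS_ONCE(): force one load. */
#define LOAD_ONCE_UL(x) (*(volatile unsigned long *)&(x))

unsigned long soft_limit = 1024;   /* imagine another thread may rewrite this */

/* Returns how many slots remain above `index`, or 0 if `index` is already
 * at or beyond the limit. The check and the subtraction use the same
 * snapshot, so they cannot disagree even if soft_limit changes meanwhile. */
unsigned long slots_above(unsigned long index)
{
    unsigned long limit = LOAD_ONCE_UL(soft_limit);

    if (index >= limit)
        return 0;
    return limit - index;
}
```

    Reading soft_limit twice instead (once for the check, once for the
    subtraction) is the pattern the commit guards against.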

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

08 Feb, 2010

1 commit

  • This reverts commit 703625118069 ("tty: fix race in tty_fasync") and
    commit b04da8bfdfbb ("fnctl: f_modown should call write_lock_irqsave/
    restore") that tried to fix up some of the fallout but was incomplete.

    It turns out that we really cannot hold 'tty->ctrl_lock' over calling
    __f_setown, because not only did that cause problems with interrupt
    disables (which the second commit fixed), it also causes a potential
    ABBA deadlock due to lock ordering.

    Thanks to Tetsuo Handa for following up on the issue, and running
    lockdep to show the problem. It goes roughly like this:

    - f_getown gets filp->f_owner.lock for reading without interrupts
      disabled, so an interrupt that happens while that lock is held can
      cause a lockdep chain from f_owner.lock -> sighand->siglock.

    - at the same time, the tty->ctrl_lock -> f_owner.lock chain that
      commit 703625118069 introduced, together with the pre-existing
      sighand->siglock -> tty->ctrl_lock chain means that we have a lock
      dependency the other way too.

    So instead of extending tty->ctrl_lock over the whole __f_setown() call,
    we now just take a reference to the 'pid' structure while holding the
    lock, and then release it after having done the __f_setown. That still
    guarantees that 'struct pid' won't go away from under us, which is all
    we really ever needed.
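
    The pattern in the last paragraph (pin the object with a reference while
    holding the lock, drop the lock, make the call, then drop the reference)
    can be sketched in user space. All names are invented, and a C11
    atomic_flag stands in for tty->ctrl_lock; this is an analogy, not the
    kernel code.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct pinned {
    atomic_int refs;
    int id;
};

static atomic_flag ctrl_lock = ATOMIC_FLAG_INIT;  /* stand-in for ctrl_lock */
static struct pinned *current_obj;                /* protected by ctrl_lock */

static void ctrl_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&ctrl_lock, memory_order_acquire))
        ;                                         /* spin */
}

static void ctrl_lock_release(void)
{
    atomic_flag_clear_explicit(&ctrl_lock, memory_order_release);
}

static void pinned_put(struct pinned *p)
{
    if (atomic_fetch_sub(&p->refs, 1) == 1)       /* dropped the last reference */
        free(p);
}

/* A callee that takes the same lock: it would hang if the caller held it. */
static int callee_taking_lock(int id)
{
    ctrl_lock_acquire();
    ctrl_lock_release();
    return id;
}

/* Take a reference under the lock, then call out with the lock released. */
int call_with_pinned_object(void)
{
    struct pinned *p;
    int ret;

    ctrl_lock_acquire();
    p = current_obj;
    if (p)
        atomic_fetch_add(&p->refs, 1);
    ctrl_lock_release();

    if (!p)
        return -1;
    ret = callee_taking_lock(p->id);   /* object cannot vanish: we hold a ref */
    pinned_put(p);
    return ret;
}

/* Small driver: publishes an object with id 42 and runs the call. */
int pin_demo(void)
{
    struct pinned *p = malloc(sizeof(*p));
    int ret;

    if (!p)
        return -1;
    atomic_init(&p->refs, 1);
    p->id = 42;

    ctrl_lock_acquire();
    current_obj = p;
    ctrl_lock_release();

    ret = call_with_pinned_object();

    ctrl_lock_acquire();
    current_obj = NULL;
    ctrl_lock_release();
    pinned_put(p);
    return ret;
}
```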

    Reported-and-tested-by: Tetsuo Handa
    Acked-by: Greg Kroah-Hartman
    Acked-by: Américo Wang
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jan, 2010

1 commit

  • Commit 703625118069f9f8960d356676662d3db5a9d116 exposed that f_modown()
    should call write_lock_irqsave instead of just write_lock_irq, because
    a caller could have a spinlock held and it would not be good to
    re-enable interrupts.

    Cc: Eric W. Biederman
    Cc: Al Viro
    Cc: Alan Cox
    Cc: Tavis Ormandy
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

17 Dec, 2009

1 commit

  • Yes, the add and remove cases do share the same basic loop and the
    locking, but the compiler can inline and then CSE some of the end result
    anyway. And splitting it up makes the code way easier to follow,
    and makes it clearer exactly what the semantics are.

    In particular, we must make sure that the FASYNC flag in file->f_flags
    exactly matches the state of "is this file on any fasync list", since
    not only is that flag visible to user space (F_GETFL), but we also use
    that flag to check whether we need to remove any fasync entries on file
    close.

    We got that wrong for the case of a mixed use of file locking (which
    tries to remove any fasync entries for file leases) and fasync.

    Splitting the function up also makes it possible to do some future
    optimizations without making the function even messier. In particular,
    since the FASYNC flag has to match the state of "is this on a list", we
    can do the following future optimizations:

    - on remove, we don't even need to get the locks and traverse the list
      if FASYNC isn't set, since we can know a priori that there is no
      point (this is effectively the same optimization that we already do
      in __fput() wrt removing fasync on file close)

    - on add, we can use the FASYNC flag to decide whether we are changing
      an existing entry or need to allocate a new one.

    but this is just the cleanup + fix for the FASYNC flag.
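
    The user-space visibility of the flag mentioned above can be checked
    with a short sketch. The helper name is invented; it uses a pipe as the
    file and O_ASYNC, which is the user-space name for the FASYNC bit.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: sets and clears O_ASYNC on a pipe end and checks
 * that F_GETFL reflects each change. Returns 0 on success, -1 otherwise. */
int async_flag_roundtrip(void)
{
    int p[2];
    int fl, ok;

    if (pipe(p) == -1)
        return -1;

    fl = fcntl(p[0], F_GETFL);
    ok = (fl != -1) && !(fl & O_ASYNC);

    /* Turn asynchronous notification on: the flag becomes visible. */
    ok = ok && fcntl(p[0], F_SETFL, fl | O_ASYNC) == 0;
    ok = ok && (fcntl(p[0], F_GETFL) & O_ASYNC);

    /* Turn it off again: the flag disappears. */
    ok = ok && fcntl(p[0], F_SETFL, fl) == 0;
    ok = ok && !(fcntl(p[0], F_GETFL) & O_ASYNC);

    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```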

    Acked-by: Al Viro
    Tested-by: Tavis Ormandy
    Cc: Jeff Dike
    Cc: Matt Mackall
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Nov, 2009

1 commit

  • This is for consistency with various ioctl() operations that include the
    suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
    used with setpriority() and getpriority(). Also, using PGRP instead of
    GID avoids confusion with the common abbreviation of "group ID".

    I'm fine with anything that makes it more consistent, and if PGRP is what
    is the predominant abbreviation then I see no need to further confuse
    matters by adding a third one.

    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Sep, 2009

2 commits

  • In order to direct the SIGIO signal to a particular thread of a
    multi-threaded application we cannot, as the manpage suggests, put a
    TID into the regular fcntl(F_SETOWN) call. It will still be sent to the
    whole process of which that thread is part.

    Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.

    The need to direct SIGIO comes from self-monitoring profiling such as with
    perf-counters. Perf-counters uses SIGIO to notify that new sample data is
    available. If the signal is delivered to the same task that generated the
    new sample it can augment that data by inspecting the task's user-space
    state right after it returns from the kernel. This is esp. convenient
    for interpreted or virtual machine driven environments.

    Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
    as argument:

        struct f_owner_ex {
                int   type;
                pid_t pid;
        };

    Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.
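
    A minimal user-space sketch of the interface, assuming a libc that
    exposes it. The helper name is invented, the thread ID is obtained with
    syscall(SYS_gettid), and the fallback definitions (15, 16, 0) are
    assumptions taken from the generic Linux values, used only if the libc
    headers hide the real ones.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes F_SETOWN_EX only with this */
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef F_SETOWN_EX             /* fallback; values are an assumption */
#define F_SETOWN_EX 15
#define F_GETOWN_EX 16
#define F_OWNER_TID 0
struct f_owner_ex {
    int type;
    pid_t pid;
};
#endif

/* Hypothetical helper: makes the calling thread the SIGIO owner of a pipe
 * end and reads the setting back. Returns 0 if it round-trips, else -1. */
int own_sigio_by_thread(void)
{
    struct f_owner_ex want, got;
    int p[2];
    int ok;

    if (pipe(p) == -1)
        return -1;

    want.type = F_OWNER_TID;
    want.pid = (pid_t)syscall(SYS_gettid);

    ok = fcntl(p[0], F_SETOWN_EX, &want) == 0;
    ok = ok && fcntl(p[0], F_GETOWN_EX, &got) == 0;
    ok = ok && got.type == F_OWNER_TID && got.pid == want.pid;

    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

    To actually receive SIGIO, the caller would additionally enable O_ASYNC
    on the descriptor with F_SETFL.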

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Tested-by: stephane eranian
    Cc: Michael Kerrisk
    Cc: Roland McGrath
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • group_send_sig_info()->check_kill_permission() assumes that current is the
    sender and uses current_cred().

    This is not true in send_sigio_to_task() case. From the security pov the
    sender is not current, but the task which did fcntl(F_SETOWN), that is why
    we have sigio_perm() which uses the right creds to check.

    Fortunately, send_sigio() always sends either SEND_SIG_PRIV or
    SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But
    still it would be tidier to avoid this bogus security check and save a
    couple of cycles.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Jul, 2009

1 commit

  • * Remove smp_lock.h from files which don't need it (including some headers!)
    * Add smp_lock.h to files which do need it
    * Make smp_lock.h include conditional in hardirq.h
    It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

    This will make hardirq.h inclusion cheaper for every PREEMPT=n config
    (which includes allmodconfig/allyesconfig, BTW)

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

17 Jun, 2009

2 commits

  • send_sigio_to_task() reads fown->signum several times, we can race with
    F_SETSIG which changes ->signum lockless. In theory, this can fool
    security checks or we can call group_send_sig_info() with the wrong
    ->si_signo which does not match "int sig".

    Change the code to cache ->signum.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Shift current_cred() from __f_setown() to f_modown(). This reduces
    the number of arguments and saves 48 bytes from fs/fcntl.o.

    [ Note: this doesn't clear euid/uid when pid is set to NULL. But if
    f_owner.pid == NULL we never use f_owner.uid/euid. Otherwise we'd
    have a bug anyway: we must not send signals if pid was reset to NULL. ]

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 May, 2009

1 commit

  • The return value of dup2 when oldfd == newfd and the fd isn't valid is
    not getting properly sign extended. We end up with 4294967287 instead
    of -EBADF.

    I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory
    (2.6.29-rc5), and Ubuntu 9.04 (2.6.28).

    This patch uses a signed int for the error value so it is properly
    extended.

    Commit 6c5d0512a091480c9f981162227fdb1c9d70e555 introduced this
    regression.
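
    The arithmetic behind the number quoted above can be reproduced in plain
    C. This is only an illustration of the widening, not the kernel code; it
    assumes a 32-bit int and EBADF == 9, as on Linux.

```c
#include <errno.h>

/* Error kept in an unsigned 32-bit variable: widening zero-extends it. */
long long widen_from_unsigned(void)
{
    unsigned int err = -EBADF;   /* wraps to 2^32 - 9 */

    return err;
}

/* Error kept in a signed int: widening sign-extends and stays negative. */
long long widen_from_signed(void)
{
    int err = -EBADF;

    return err;
}
```

    Under those assumptions the first function yields 4294967287 and the
    second yields -9, which is the difference the patch addresses.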

    Reported-by: Jiri Dluhos
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

30 Mar, 2009

1 commit

  • Lockdep gripes if file->f_lock is taken in a no-IRQ situation, since that
    is not always the case. We don't really want to disable IRQs for every
    acquisition of f_lock; instead, just move it outside of fasync_lock.

    Reported-by: Bartlomiej Zolnierkiewicz
    Reported-by: Larry Finger
    Reported-by: Wu Fengguang
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

16 Mar, 2009

3 commits

  • Most fasync implementations do something like:

        return fasync_helper(...);

    But fasync_helper() will return a positive value at times - a feature used
    in at least one place. Thus, a number of other drivers do:

        err = fasync_helper(...);
        if (err < 0)
                return err;
        return 0;

    In the interests of consistency and more concise code, it makes sense to
    map positive return values onto zero where ->fasync() is called.

    Cc: Al Viro
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     
  • Removing the BKL from FASYNC handling ran into the challenge of keeping the
    setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
    the underlying fasync() function. Andi Kleen suggested moving the handling
    of that bit into fasync(); this patch does exactly that. As a result, we
    have a couple of internal API changes: fasync() must now manage the FASYNC
    bit, and it will be called without the BKL held.

    As it happens, every fasync() implementation in the kernel with one
    exception calls fasync_helper(). So, if we make fasync_helper() set the
    FASYNC bit, we can avoid making any changes to the other fasync()
    functions - as long as those functions, themselves, have proper locking.
    Most fasync() implementations do nothing but call fasync_helper() - which
    has its own lock - so they are easily verified as correct. The BKL had
    already been pushed down into the rest.

    The networking code has its own version of fasync_helper(), so that code
    has been augmented with explicit FASYNC bit handling.

    Cc: Al Viro
    Cc: David Miller
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     
  • Traditionally, changes to struct file->f_flags have been done under BKL
    protection, or with no protection at all. This patch causes all f_flags
    changes after file open/creation time to be done under protection of
    f_lock. This allows the removal of some BKL usage and fixes a number of
    longstanding (if microscopic) races.

    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

14 Jan, 2009

1 commit


25 Dec, 2008

1 commit


06 Dec, 2008

1 commit

  • Changeset a238b790d5f99c7832f9b73ac8847025815b85f7 (Call fasync()
    functions without the BKL) introduced a race which could leave
    file->f_flags in a state inconsistent with what the underlying
    driver/filesystem believes. Revert that change, and also fix the same
    races in ioctl_fioasync() and ioctl_fionbio().

    This is a minimal, short-term fix; the real fix will not involve the
    BKL.

    Reported-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Al Viro
    Cc: stable@kernel.org
    Signed-off-by: Jonathan Corbet
    Signed-off-by: Linus Torvalds

    Jonathan Corbet
     

14 Nov, 2008

4 commits

  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap current->cred and a few other accessors to hide their actual
    implementation.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Signed-off-by: James Morris

    David Howells
     

01 Aug, 2008

2 commits


27 Jul, 2008

3 commits

  • * dup2() should return -EBADF on exceeded sysctl_nr_open
    * dup() should *not* return -EINVAL even if you have rlimit set to 0;
    it should get -EMFILE instead.

    Check for orig_start exceeding rlimit taken to sys_fcntl().
    Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
    Consequently, remaining checks for rlimit are taken to expand_files().
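
    The two error codes named above can be observed from user space with a
    sketch like the following. The helper name is invented; it temporarily
    lowers the soft RLIMIT_NOFILE to 0 and restores it before returning.

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if dup() reports EMFILE and dup2() to a
 * descriptor number beyond the limit reports EBADF, -1 otherwise. */
int check_dup_errors(void)
{
    struct rlimit old, zero;
    int p[2];
    int ok = 1;

    if (pipe(p) == -1)
        return -1;
    if (getrlimit(RLIMIT_NOFILE, &old) == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    zero = old;
    zero.rlim_cur = 0;                       /* soft limit only */
    if (setrlimit(RLIMIT_NOFILE, &zero) == -1)
        ok = 0;

    if (ok) {
        /* No free slot below the limit: dup() must say EMFILE. */
        errno = 0;
        if (dup(p[0]) != -1 || errno != EMFILE)
            ok = 0;

        /* Target number is outside the allowed range: dup2() says EBADF. */
        errno = 0;
        if (dup2(p[0], p[0] + 100) != -1 || errno != EBADF)
            ok = 0;

        setrlimit(RLIMIT_NOFILE, &old);      /* restore the previous limit */
    }

    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```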

    Signed-off-by: Al Viro

    Al Viro
     
  • Since Ulrich is OK with getting rid of dup3(fd, fd, flags) completely,
    to hell the damn thing goes. Corner case for dup2() is handled in
    sys_dup2() (complete with -EBADF if dup2(fd, fd) is called with fd
    that is not open), the rest is done in dup3().
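
    The resulting behaviour for equal descriptors can be checked from user
    space. The helper name is invented, and dup3 is invoked through
    syscall(SYS_dup3, ...) so the sketch does not depend on a libc wrapper.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if dup2(fd, fd) and dup3(fd, fd, 0)
 * behave as described, -1 otherwise. */
int check_same_fd(void)
{
    int p[2];
    int ok = 1;

    if (pipe(p) == -1)
        return -1;

    /* dup2() with equal, open descriptors is a no-op returning the fd. */
    if (dup2(p[0], p[0]) != p[0])
        ok = 0;

    /* dup3() with equal descriptors is rejected outright. */
    errno = 0;
    if (syscall(SYS_dup3, p[0], p[0], 0) != -1 || errno != EINVAL)
        ok = 0;

    /* dup2() with equal descriptors that are not open reports EBADF. */
    close(p[1]);
    errno = 0;
    if (dup2(p[1], p[1]) != -1 || errno != EBADF)
        ok = 0;

    close(p[0]);
    return ok ? 0 : -1;
}
```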

    Signed-off-by: Al Viro

    Al Viro
     
  • Al Viro noticed one corner case in the new dup3() code. The dup2()
    function, as a special case, handles dup-ing to the same file
    descriptor. In this case the current dup3() code does nothing at
    all. I.e., it ignores the flags parameter. This shouldn't happen;
    the close-on-exec flag should be set if requested.

    In case the O_CLOEXEC bit in the flags parameter is not set the
    dup3() function should behave in this respect identical to dup2().
    This means dup3(fd, fd, 0) should not actively reset the c-o-e
    flag.

    The patch below implements this minor change.

    [AV: credits to Artur Grabowski for bringing that up as potential subtle point
    in dup2() behaviour]

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Al Viro

    Ulrich Drepper
     

25 Jul, 2008

1 commit

  • This patch adds the new dup3 syscall. It extends the old dup2 syscall by one
    parameter which is meant to hold a flag value. Support for the O_CLOEXEC flag
    is added in this patch.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_dup3
    # ifdef __x86_64__
    #  define __NR_dup3 292
    # elif defined __i386__
    #  define __NR_dup3 330
    # else
    #  error "need __NR_dup3"
    # endif
    #endif

    int
    main (void)
    {
      /* Without flags, the new descriptor must not be close-on-exec.  */
      int fd = syscall (__NR_dup3, 1, 4, 0);
      if (fd == -1)
        {
          puts ("dup3(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("dup3(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      /* With O_CLOEXEC, the new descriptor must be close-on-exec.  */
      fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC);
      if (fd == -1)
        {
          puts ("dup3(O_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("dup3(O_CLOEXEC) did not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper