09 Feb, 2008

2 commits


15 Nov, 2007

1 commit

  • sys_open / sys_read were used in the early 1.2 days to load firmware from
    disk inside drivers. Since 2.0 or so this was deprecated behavior, but
    several drivers still were using this. Since a few years we have a
    request_firmware() API that implements this in a nice, consistent way.
    Only some old ISA sound drivers (pre-ALSA) still straggled along for some
    time.... however with commit c2b1239a9f22f19c53543b460b24507d0e21ea0c the
    last user is now gone.

    This is a good thing, since using sys_open / sys_read etc for firmware is a
    very buggy to dangerous thing to do; these operations put an fd in the
    process file descriptor table.... which then can be tampered with from
    other threads for example. For those who don't want the firmware loader,
    filp_open()/vfs_read are the better APIs to use, without this security
    issue.

    The patch below marks sys_open and sys_read unused now that they're
    really not used anymore, and for deletion in the 2.6.25 timeframe.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

21 Oct, 2007

1 commit


17 Oct, 2007

3 commits

  • Implement file posix capabilities. This allows programs to be given a
    subset of root's powers regardless of who runs them, without having to use
    setuid and giving the binary all of root's powers.

    This version works with Kaigai Kohei's userspace tools, found at
    http://www.kaigai.gr.jp/index.php. For more information on how to use this
    patch, Chris Friedhoff has posted a nice page at
    http://www.friedhoff.org/fscaps.html.

    Changelog:
    Nov 27:
    Incorporate fixes from Andrew Morton
    (security-introduce-file-caps-tweaks and
    security-introduce-file-caps-warning-fix)
    Fix Kconfig dependency.
    Fix change signaling behavior when file caps are not compiled in.

    Nov 13:
    Integrate comments from Alexey: Remove CONFIG_ ifdef from
    capability.h, and use %zd for printing a size_t.

    Nov 13:
    Fix endianness warnings by sparse as suggested by Alexey
    Dobriyan.

    Nov 09:
    Address warnings of unused variables at cap_bprm_set_security
    when file capabilities are disabled, and simultaneously clean
    up the code a little, by pulling the new code into a helper
    function.

    Nov 08:
    For pointers to required userspace tools and how to use
    them, see http://www.friedhoff.org/fscaps.html.

    Nov 07:
    Fix the calculation of the highest bit checked in
    check_cap_sanity().

    Nov 07:
    Allow file caps to be enabled without CONFIG_SECURITY, since
    capabilities are the default.
    Hook cap_task_setscheduler when !CONFIG_SECURITY.
    Move capable(TASK_KILL) to end of cap_task_kill to reduce
    audit messages.

    Nov 05:
    Add secondary calls in selinux/hooks.c to task_setioprio and
    task_setscheduler so that selinux and capabilities with file
    cap support can be stacked.

    Sep 05:
    As Seth Arnold points out, uid checks are out of place
    for capability code.

    Sep 01:
    Define task_setscheduler, task_setioprio, cap_task_kill, and
    task_setnice to make sure a user cannot affect a process in which
    they called a program with some fscaps.

    One remaining question is the note under task_setscheduler: are we
    ok with CAP_SYS_NICE being sufficient to confine a process to a
    cpuset?

    It is a semantic change, as without fsccaps, attach_task doesn't
    allow CAP_SYS_NICE to override the uid equivalence check. But since
    it uses security_task_setscheduler, which elsewhere is used where
    CAP_SYS_NICE can be used to override the uid equivalence check,
    fixing it might be tough.

    task_setscheduler
    note: this also controls cpuset:attach_task. Are we ok with
    CAP_SYS_NICE being used to confine to a cpuset?
    task_setioprio
    task_setnice
    sys_setpriority uses this (through set_one_prio) for another
    process. Need same checks as setrlimit

    Aug 21:
    Updated secureexec implementation to reflect the fact that
    euid and uid might be the same and nonzero, but the process
    might still have elevated caps.

    Aug 15:
    Handle endianness of xattrs.
    Enforce capability version match between kernel and disk.
    Enforce that no bits beyond the known max capability are
    set, else return -EPERM.
    With this extra processing, it may be worth reconsidering
    doing all the work at bprm_set_security rather than
    d_instantiate.

    Aug 10:
    Always call getxattr at bprm_set_security, rather than
    caching it at d_instantiate.

    [morgan@kernel.org: file-caps clean up for linux/capability.h]
    [bunk@kernel.org: unexport cap_inode_killpriv]
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morgan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The early LFS work that Linux uses favours EFBIG in various places. SuSv3
    specifically uses EOVERFLOW for this as noted by Michael (Bug 7253)

    [EOVERFLOW]
    The named file is a regular file and the size of the file cannot be
    represented correctly in an object of type off_t. We should therefore
    transition to the proper error return code

    Signed-off-by: Alan Cox
    Cc: Theodore Tso
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • It reduces the selinux overhead on read/write by only revalidating
    permissions in selinux_file_permission if the task or inode labels have
    changed or the policy has changed since the open-time check. A new LSM
    hook, security_dentry_open, is added to capture the necessary state at open
    time to allow this optimization.

    (see http://marc.info/?l=selinux&m=118972995207740&w=2)

    Signed-off-by: Yuichi Nakamura
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris

    Yuichi Nakamura
     

01 Aug, 2007

1 commit

  • It is possible that another process could acquire a new file lease right
    after break_lease() is called during a truncate, but before lease-granting
    is disabled by the subsequent get_write_access(). Merely switching the
    order of the break_lease() and get_write_access() calls prevents this race.

    Signed-off-by: David M. Richter
    Signed-off-by: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    david m. richter
     

25 Jul, 2007

1 commit

  • The fallocate syscall returns ENOSYS in case the filesystem does not support
    the operation and expects the userlevel code to fill in. This is good in
    concept.

    The problem is that the libc code for old kernels should be able to
    distinguish the case where the syscall is not at all available vs not
    functioning for a specific mount point. As is this is not possible and we
    always have to invoke the syscall even if the kernel doesn't support it.

    I suggest the following patch. Using EOPNOTSUPP is IMO the right thing to do.

    Cc: Amit Arora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

18 Jul, 2007

1 commit

  • fallocate() is a new system call being proposed here which will allow
    applications to preallocate space to any file(s) in a file system.
    Each file system implementation that wants to use this feature will need
    to support an inode operation called ->fallocate().
    Applications can use this feature to avoid fragmentation to certain
    level and thus get faster access speed. With preallocation, applications
    also get a guarantee of space for particular file(s) - even if later the
    the system becomes full.

    Currently, glibc provides an interface called posix_fallocate() which
    can be used for similar cause. Though this has the advantage of working
    on all file systems, but it is quite slow (since it writes zeroes to
    each block that has to be preallocated). Without a doubt, file systems
    can do this more efficiently within the kernel, by implementing
    the proposed fallocate() system call. It is expected that
    posix_fallocate() will be modified to call this new system call first
    and incase the kernel/filesystem does not implement it, it should fall
    back to the current implementation of writing zeroes to the new blocks.
    ToDos:
    1. Implementation on other architectures (other than i386, x86_64,
    and ppc). Patches for s390(x) and ia64 are already available from
    previous posts, but it was decided that they should be added later
    once fallocate is in the mainline. Hence not including those patches
    in this take.
    2. Changes to glibc,
    a) to support fallocate() system call
    b) to make posix_fallocate() and posix_fallocate64() call fallocate()

    Signed-off-by: Amit Arora

    Amit Arora
     

17 Jul, 2007

2 commits

  • Part two in the O_CLOEXEC saga: adding support for file descriptors received
    through Unix domain sockets.

    The patch is once again pretty minimal, it introduces a new flag for recvmsg
    and passes it just like the existing MSG_CMSG_COMPAT flag. I think this bit
    is not used otherwise but the networking people will know better.

    This new flag is not recognized by recvfrom and recv. These functions cannot
    be used for that purpose and the asymmetry this introduces is not worse than
    the already existing MSG_CMSG_COMPAT situations.

    The patch must be applied on the patch which introduced O_CLOEXEC. It has to
    remove static from the new get_unused_fd_flags function but since scm.c cannot
    live in a module the function still hasn't to be exported.

    Here's a test program to make sure the code works. It's so much longer than
    the actual patch...

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef O_CLOEXEC
    # define O_CLOEXEC 02000000
    #endif
    #ifndef MSG_CMSG_CLOEXEC
    # define MSG_CMSG_CLOEXEC 0x40000000
    #endif

    int
    main (int argc, char *argv[])
    {
    if (argc > 1)
    {
    int fd = atol (argv[1]);
    printf ("child: fd = %d\n", fd);
    if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
    {
    puts ("file descriptor valid in child");
    return 1;
    }
    return 0;

    }

    struct sockaddr_un sun;
    strcpy (sun.sun_path, "./testsocket");
    sun.sun_family = AF_UNIX;

    char databuf[] = "hello";
    struct iovec iov[1];
    iov[0].iov_base = databuf;
    iov[0].iov_len = sizeof (databuf);

    union
    {
    struct cmsghdr hdr;
    char bytes[CMSG_SPACE (sizeof (int))];
    } buf;
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1,
    .msg_control = buf.bytes,
    .msg_controllen = sizeof (buf) };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN (sizeof (int));

    msg.msg_controllen = cmsg->cmsg_len;

    pid_t child = fork ();
    if (child == -1)
    error (1, errno, "fork");
    if (child == 0)
    {
    int sock = socket (PF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
    error (1, errno, "socket");

    if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
    error (1, errno, "bind");
    if (listen (sock, SOMAXCONN) < 0)
    error (1, errno, "listen");

    int conn = accept (sock, NULL, NULL);
    if (conn == -1)
    error (1, errno, "accept");

    *(int *) CMSG_DATA (cmsg) = sock;
    if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0)
    error (1, errno, "sendmsg");

    return 0;
    }

    /* For a test suite this should be more robust like a
    barrier in shared memory. */
    sleep (1);

    int sock = socket (PF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
    error (1, errno, "socket");

    if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
    error (1, errno, "connect");
    unlink (sun.sun_path);

    *(int *) CMSG_DATA (cmsg) = -1;

    if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0)
    error (1, errno, "recvmsg");

    int fd = *(int *) CMSG_DATA (cmsg);
    if (fd == -1)
    error (1, 0, "no descriptor received");

    char fdname[20];
    snprintf (fdname, sizeof (fdname), "%d", fd);
    execl ("/proc/self/exe", argv[0], fdname, NULL);
    puts ("execl failed");
    return 1;
    }

    [akpm@linux-foundation.org: Fix fastcall inconsistency noted by Michael Buesch]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Ulrich Drepper
    Cc: Ingo Molnar
    Cc: Michael Buesch
    Cc: Michael Kerrisk
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • The problem is as follows: in multi-threaded code (or more correctly: all
    code using clone() with CLONE_FILES) we have a race when exec'ing.

    thread #1 thread #2

    fd=open()

    fork + exec

    fcntl(fd,F_SETFD,FD_CLOEXEC)

    In some applications this can happen frequently. Take a web browser. One
    thread opens a file and another thread starts, say, an external PDF viewer.
    The result can even be a security issue if that open file descriptor
    refers to a sensitive file and the external program can somehow be tricked
    into using that descriptor.

    Just adding O_CLOEXEC support to open() doesn't solve the whole set of
    problems. There are other ways to create file descriptors (socket,
    epoll_create, Unix domain socket transfer, etc). These can and should be
    addressed separately though. open() is such an easy case that it makes not
    much sense putting the fix off.

    The test program:

    #include
    #include
    #include
    #include

    #ifndef O_CLOEXEC
    # define O_CLOEXEC 02000000
    #endif

    int
    main (int argc, char *argv[])
    {
    int fd;
    if (argc > 1)
    {
    fd = atol (argv[1]);
    printf ("child: fd = %d\n", fd);
    if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
    {
    puts ("file descriptor valid in child");
    return 1;
    }
    return 0;
    }

    fd = open ("/proc/self/exe", O_RDONLY | O_CLOEXEC);
    printf ("in parent: new fd = %d\n", fd);
    char buf[20];
    snprintf (buf, sizeof (buf), "%d", fd);
    execl ("/proc/self/exe", argv[0], buf, NULL);
    puts ("execl failed");
    return 1;
    }

    [kyle@parisc-linux.org: parisc fix]
    Signed-off-by: Ulrich Drepper
    Acked-by: Ingo Molnar
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Chris Zankel
    Signed-off-by: Kyle McMartin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

09 May, 2007

2 commits


11 Dec, 2006

1 commit

  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

2 commits

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • Fix the locking of signal->tty.

    Use ->sighand->siglock to protect ->signal->tty; this lock is already used
    by most other members of ->signal/->sighand. And unless we are 'current'
    or the tasklist_lock is held we need ->siglock to access ->signal anyway.

    (NOTE: sys_unshare() is broken wrt ->sighand locking rules)

    Note that tty_mutex is held over tty destruction, so while holding
    tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys
    are governed by their open file handles. This leaves some holes for tty
    access from signal->tty (or any other non file related tty access).

    It solves the tty SLAB scribbles we were seeing.

    (NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
    be examined by someone familiar with the security framework, I think
    it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
    invocations)

    [schwidefsky@de.ibm.com: 3270 fix]
    [akpm@osdl.org: various post-viro fixes]
    Signed-off-by: Peter Zijlstra
    Acked-by: Alan Cox
    Cc: Oleg Nesterov
    Cc: Prarit Bhargava
    Cc: Chris Wright
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Jan Kara
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

01 Oct, 2006

2 commits


30 Sep, 2006

2 commits

  • The problem is that close() syscalls can call a file system's flush
    handler, which in turn might sleep interruptibly and ultimately pass back
    an -ERESTARTSYS return value. This happens for files backed by an
    interruptible NFS mount under nfs_file_flush() when a large file has just
    been written and nfs_wait_bit_interruptible() detects that there is a
    signal pending.

    I have a test case where the "strace" command is used to attach to a
    process sleeping in such a close(). Since the SIGSTOP is forced onto the
    victim process (removing it from the thread's "blocked" mask in
    force_sig_info()), the RPC wait is interrupted and the close() is
    terminated early.

    But the file table entry has already been cleared before the flush handler
    was called. Thus, when the syscall is restarted, the file descriptor
    appears closed and an EBADF error is returned (which is wrong). What's
    worse, there is the hypothetical case where another thread of a
    multi-threaded application might have reused the file descriptor, in which
    case that file would be mistakenly closed.

    The bottom line is that close() syscalls are not restartable, and thus
    -ERESTARTSYS return values should be mapped to -EINTR. This is consistent
    with the close(2) manual page. The fix is below.

    Signed-off-by: Ernie Petrides
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ernie Petrides
     
  • In the "operation does permission checking" model used by fuse, chdir
    permission is not checked, since there's no chdir method.

    For this case set a lookup flag, which will be passed to ->permission(), so
    fuse can distinguish it from permission checks for other operations.

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

26 Jun, 2006

1 commit

  • In the course of trying to track down a bug where a file mtime was not
    being updated correctly, it was discovered that the m/ctime updates were
    not quite being handled correctly for ftruncate() calls.

    Quoth SUSv3:

    open(2):

    If O_TRUNC is set and the file did previously exist, upon
    successful completion, open() shall mark for update the st_ctime
    and st_mtime fields of the file.

    truncate(2):

    Upon successful completion, if the file size is changed, this
    function shall mark for update the st_ctime and st_mtime fields
    of the file, and the S_ISUID and S_ISGID bits of the file mode
    may be cleared.

    ftruncate(2):

    Upon successful completion, if fildes refers to a regular file,
    the ftruncate() function shall mark for update the st_ctime and
    st_mtime fields of the file and the S_ISUID and S_ISGID bits of
    the file mode may be cleared. If the ftruncate() function is
    unsuccessful, the file is unaffected.

    The open(O_TRUNC) and truncate cases were being handled correctly, but the
    ftruncate case was being handled like the truncate case. The semantics of
    truncate and ftruncate don't quite match, so ftruncate needs to be handled
    slightly differently.

    The attached patch addresses this issue for ftruncate(2).

    My thanx to Stephen Tweedie and Trond Myklebust for their help in
    understanding the situation and semantics.

    Signed-off-by: Peter Staubach
    Cc: "Stephen C. Tweedie"
    Cc: Trond Myklebust
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Staubach
     

23 Jun, 2006

2 commits

  • Pass the POSIX lock owner ID to the flush operation.

    This is useful for filesystems which don't want to store any locking state
    in inode->i_flock but want to handle locking/unlocking POSIX locks
    internally. FUSE is one such filesystem but I think it possible that some
    network filesystems would need this also.

    Also add a flag to indicate that a POSIX locking request was generated by
    close(), so filesystems using the above feature won't send an extra locking
    request in this case.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

20 Jun, 2006

1 commit

  • When an audit event involves changes to a directory entry, include
    a PATH record for the directory itself. A few other notable changes:

    - fixed audit_inode_child() hooks in fsnotify_move()
    - removed unused flags arg from audit_inode()
    - added audit log routines for logging a portion of a string

    Here's some sample output.

    before patch:
    type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
    type=CWD msg=audit(1149821605.320:26): cwd="/root"
    type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

    after patch:
    type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
    type=CWD msg=audit(1149822032.332:24): cwd="/root"
    type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0
    type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

    Signed-off-by: Amy Griffis
    Signed-off-by: Al Viro

    Amy Griffis
     

16 May, 2006

1 commit


19 Apr, 2006

2 commits

  • Came up through a quick grep for other cases similar to the ftruncate()
    one in commit 0a489cb3b6a7b277030cdbc97c2c65905db94536.

    Also, add a comment, so that people who read the code understand why we
    do what looks like a no-op.

    (Again, this won't actually matter to any sane user, since libc will
    save and restore the register gcc stomps on, but it's still wrong to
    stomp on it)

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Gcc thinks it owns the incoming argument stack, but that's not true for
    "asmlinkage" functions, and it corrupts the caller-set-up argument stack
    when it pushes the third argument onto the stack. Which can result in
    %ebx getting corrupted in user space.

    Now, normally nobody sane would ever notice, since libc will save and
    restore %ebx anyway over the system call, but it's still wrong.

    I'd much rather have "asmlinkage" tell gcc directly that it doesn't own
    the stack, but no such attribute exists, so we're stuck with our hacky
    manual "prevent_tail_call()" macro once more (we've had the same issue
    before with sys_waitpid() and sys_wait4()).

    Thanks to Hans-Werner Hilse for reporting
    the issue and testing the fix.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Mar, 2006

2 commits

  • * 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits)
    [PATCH] fix audit_init failure path
    [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
    [PATCH] sem2mutex: audit_netlink_sem
    [PATCH] simplify audit_free() locking
    [PATCH] Fix audit operators
    [PATCH] promiscuous mode
    [PATCH] Add tty to syscall audit records
    [PATCH] add/remove rule update
    [PATCH] audit string fields interface + consumer
    [PATCH] SE Linux audit events
    [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
    [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
    [PATCH] Fix IA64 success/failure indication in syscall auditing.
    [PATCH] Miscellaneous bug and warning fixes
    [PATCH] Capture selinux subject/object context information.
    [PATCH] Exclude messages by message type
    [PATCH] Collect more inode information during syscall processing.
    [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
    [PATCH] Define new range of userspace messages.
    [PATCH] Filter rule comparators
    ...

    Fixed trivial conflict in security/selinux/hooks.c

    Linus Torvalds
     
  • I think it would be nice to put an usage warning in header of
    lookup_instantiate_filp() to indicate it is unsafe to use it on anything
    but regular files (even that is potentially unsafe, but there your ->open()
    is usually in your hands anyway), so that others won't fall into the same
    trap I did.

    Signed-off-by: Oleg Drokin
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Drokin
     

23 Mar, 2006

1 commit

  • 1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
    platforms, lowering kmalloc() allocated space by 50%.

    2) Reduce the size of (files_struct), using a special 32 bits (or
    64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
    close_on_exec_init and open_fds_init fields. This save some ram (248
    bytes per task) as most tasks dont open more than 32 files. D-Cache
    footprint for such tasks is also reduced to the minimum.

    3) Reduce size of allocated fdset. Currently two full pages are
    allocated, that is 32768 bits on x86 for example, and way too much. The
    minimum is now L1_CACHE_BYTES.

    UP and SMP should benefit from this patch, because most tasks will touch
    only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
    (next_fd, close_on_exec_init, open_fds_init, fd_array[0 .. 2] being in the
    same cache line)

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

21 Mar, 2006

1 commit

  • This patch augments the collection of inode info during syscall
    processing. It represents part of the functionality that was provided
    by the auditfs patch included in RHEL4.

    Specifically, it:

    - Collects information for target inodes created or removed during
    syscalls. Previous code only collects information for the target
    inode's parent.

    - Adds the audit_inode() hook to syscalls that operate on a file
    descriptor (e.g. fchown), enabling audit to do inode filtering for
    these calls.

    - Modifies filtering code to check audit context for either an inode #
    or a parent inode # matching a given rule.

    - Modifies logging to provide inode # for both parent and child.

    - Protect debug info from NULL audit_names.name.

    [AV: folded a later typo fix from the same author]

    Signed-off-by: Amy Griffis
    Signed-off-by: David Woodhouse
    Signed-off-by: Al Viro

    Amy Griffis
     

19 Jan, 2006

1 commit

  • Here is a series of patches which introduce in total 13 new system calls
    which take a file descriptor/filename pair instead of a single file
    name. These functions, openat etc, have been discussed on numerous
    occasions. They are needed to implement race-free filesystem traversal,
    they are necessary to implement a virtual per-thread current working
    directory (think multi-threaded backup software), etc.

    We have in glibc today implementations of the interfaces which use the
    /proc/self/fd magic. But this code is rather expensive. Here are some
    results (similar to what Jim Meyering posted before).

    The test creates a deep directory hierarchy on a tmpfs filesystem. Then
    rm -fr is used to remove all directories. Without syscall support I get
    this:

    real 0m31.921s
    user 0m0.688s
    sys 0m31.234s

    With syscall support the results are much better:

    real 0m20.699s
    user 0m0.536s
    sys 0m20.149s

    The interfaces are for obvious reasons currently not much used. But they'll
    be used. coreutils (and Jeff's posixutils) are already using them.
    Furthermore, code like ftw/fts in libc (maybe even glob) will also start using
    them. I expect a patch to make follow soon. Every program which is walking
    the filesystem tree will benefit.

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Cc: Al Viro
    Acked-by: Ingo Molnar
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

12 Jan, 2006

1 commit


10 Jan, 2006

1 commit


09 Jan, 2006

2 commits

  • uninline some open.c functions

    add/remove: 3/0 grow/shrink: 0/6 up/down: 679/-1166 (-487)
    function old new delta
    do_sys_truncate - 336 +336
    do_sys_ftruncate - 317 +317
    __put_unused_fd - 26 +26
    put_unused_fd 57 49 -8
    sys_close 150 119 -31
    sys_ftruncate64 260 26 -234
    sys_ftruncate 272 24 -248
    sys_truncate 339 25 -314
    sys_truncate64 336 5 -331

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • SUS requires that when truncating a file to the size that it currently
    is:
    truncate and ftruncate should NOT modify ctime or mtime
    O_TRUNC SHOULD modify ctime and mtime.

    Currently mtime and ctime are always modified on most local
    filesystems (side effect of ->truncate) or never modified (on NFS).

    With this patch:
    ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when
    an update of these times is required whether size changes or not
    (via a new argument to do_truncate). This allows NFS to do
    the right thing for O_TRUNC.
    inode_setattr nolonger forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE
    sets the size to it's current value. This allows local filesystems
    to do the right thing for f?truncate.

    Also, the logic in inode_setattr is changed a bit so there are two return
    points. One returns the error from vmtruncate if it failed, the other
    returns 0 (there can be no other failure).

    Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change
    requested, we now fall-through and mark_inode_dirty. If a filesystem did
    not have a ->truncate function, then vmtruncate will have changed i_size,
    without marking the inode as 'dirty', and I think this is wrong.

    Signed-off-by: Neil Brown
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

09 Nov, 2005

2 commits

  • A few more callers of permission() just want to check for a different access
    pattern on an already open file. This patch adds a wrapper for permission()
    that takes a file in preparation of per-mount read-only support and to clean
    up the callers a little. The helper is not intended for new code, everything
    without the interface set in stone should use vfs_permission()

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Most permission() calls have a struct nameidata * available. This helper
    takes that as an argument and thus makes sure we pass it down for lookup
    intents and prepares for per-mount read-only support where we need a struct
    vfsmount for checking whether a file is writeable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

07 Nov, 2005

1 commit