11 Dec, 2006

1 commit

  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we should not allocate it in the first
    place, and should keep the capacities of the fdarray and the fdsets equal.
    This patch removes fdtable->max_fdset. As an added bonus, most of the
    supporting code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.
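
    A minimal sketch of how such transitional aliases can work is shown here.
    The types are simplified stand-ins, the member order inside struct path is
    an assumption, and legacy_get_dentry() is a hypothetical caller that exists
    only for illustration.

```c
#include <stddef.h>

/* Opaque stand-ins for the real kernel types (assumption for this sketch). */
struct dentry;
struct vfsmount;

/* A (mount, dentry) pair grouped into one structure. */
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

/* Simplified struct file: only the member relevant to this change. */
struct file {
    struct path f_path;
};

/* Transitional aliases: code using the old member names still compiles. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt

/* Hypothetical legacy caller that has not been converted yet. */
static struct dentry *legacy_get_dentry(struct file *filp)
{
    return filp->f_dentry;   /* expands to filp->f_path.dentry */
}
```

    Because the aliases are plain textual macros, unconverted code keeps
    working while each user is moved over to f_path one at a time.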

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

08 Dec, 2006

2 commits

  • Signed-off-by: Heiko Carstens
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • OpenVZ Linux kernel team has found a problem with mounting in compat mode.

    Simple command "mount -t smbfs ..." on Fedora Core 5 distro in 32-bit mode
    leads to oops:

    Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: compat_sys_mount+0xd6/0x290
    Process mount (pid: 14656, veid=300, threadinfo ffff810034d30000, task ffff810034c86bc0)
    Call Trace: ia32_sysret+0x0/0xa

    The problem is that data_page pointer can be NULL, so we should skip data
    conversion in this case.
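
    The shape of the fix can be sketched as follows. This is not the kernel
    code: convert_data_sketch() and compat_mount_data_sketch() are hypothetical
    names, and the "conversion" is a placeholder.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-filesystem conversion of mount data. */
static void convert_data_sketch(char *data_page)
{
    data_page[0] = 'c';   /* pretend the 32-bit layout was rewritten in place */
}

/* Returns 1 if the conversion ran, 0 if it was skipped. */
static int compat_mount_data_sketch(char *data_page)
{
    if (data_page == NULL)
        return 0;          /* no mount data was passed: nothing to convert */
    convert_data_sketch(data_page);
    return 1;
}
```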

    Signed-off-by: Andrey Mirkin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Mirkin
     

04 Dec, 2006

2 commits


04 Nov, 2006

1 commit

  • 758333458aa719bfc26ec16eafd4ad3a9e96014d fixes the unchecked copy_to_user
    return value in compat_sys_pselect7. I ran into this too because of an old
    source tree, but my fix would look quite a bit different from Andi's fix.

    The reason is that the compat function IMHO should behave the very same as
    the non-compat function if possible. Since sys_pselect7 does not return
    -EFAULT in this specific case, change the compat code so it behaves like
    sys_pselect7.

    Cc: David Woodhouse
    Cc: Andi Kleen
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

11 Oct, 2006

1 commit


03 Oct, 2006

1 commit

  • These patches make the kernel pass 64-bit inode numbers internally when
    communicating to userspace, even on a 32-bit system. They are required
    because some filesystems have intrinsic 64-bit inode numbers: NFS3+ and XFS
    for example. The 64-bit inode numbers are then propagated to userspace
    automatically where the arch supports it.

    Problems have been seen with userspace (eg: ld.so) using the 64-bit inode
    number returned by stat64() or getdents64() to differentiate files, and
    failing because the 64-bit inode number space was compressed to 32-bits, and
    so overlaps occur.

    This patch:

    Make filldir_t take a 64-bit inode number and struct kstat carry a 64-bit
    inode number so that 64-bit inode numbers can be passed back to userspace.

    The stat functions then return the full 64-bit inode number where
    available and where possible. If it is not possible to represent the inode
    number supplied by the filesystem in the field provided by userspace, then
    error EOVERFLOW will be issued.

    Similarly, the getdents/readdir functions now pass the full 64-bit inode
    number to userspace where possible, returning EOVERFLOW instead when a
    directory entry is encountered that can't be properly represented.

    Note that this means that some inodes will not be stat'able on a 32-bit
    system with old libraries where they were before - but it does mean that
    there will be no ambiguity over what a 32-bit inode number refers to.

    Note similarly that directory scans may be cut short with an error on a
    32-bit system with old libraries where the scan would work before for the
    same reasons.

    It is judged unlikely that this situation will occur because modern glibc
    uses 64-bit capable versions of stat and getdents class functions
    exclusively, and that older systems are unlikely to encounter
    unrepresentable inode numbers anyway.
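
    The overflow rule described above can be sketched in userspace C. The
    helper name store_ino32() is hypothetical, and the 32-bit output field
    stands in for whatever narrower field the caller provides.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Store a 64-bit inode number into a 32-bit field.
 * Fails with -EOVERFLOW instead of silently truncating.
 */
static int store_ino32(uint64_t ino, uint32_t *out)
{
    uint32_t narrow = (uint32_t)ino;

    if ((uint64_t)narrow != ino)
        return -EOVERFLOW;   /* value does not survive the round trip */
    *out = narrow;
    return 0;
}
```

    Comparing the narrowed value against the original is what removes the
    ambiguity: two distinct 64-bit inode numbers can never be reported as the
    same 32-bit number.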

    [akpm: alpha build fix]
    Signed-off-by: David Howells
    Cc: Trond Myklebust
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

02 Oct, 2006

1 commit

  • Revert Andrew Morton's patch to temporarily hack around the lack of a
    declaration of sigset_t in linux/compat.h to make the block-disablement
    patches build on IA64. This got accidentally pushed to Linus and should
    be fixed in a different manner.

    Also make linux/compat.h #include asm/signal.h to gain a definition of
    sigset_t so that it can externally declare sigset_from_compat().

    This has been compile-tested for i386, x86_64, ia64, mips, mips64, frv, ppc and
    ppc64 and run-tested on frv.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

01 Oct, 2006

3 commits


26 Sep, 2006

1 commit

  • Fix

    linux/fs/compat.c: In function compat_sys_pselect7
    linux/fs/compat.c:1869: warning: ignoring return value of copy_to_user, declared with attribute warn_unused_result

    To make it easier to handle I changed the semantics to not try to
    write out a timespec if an error occurred. I hope that's ok.

    Cc: dwmw2@infradead.org

    Signed-off-by: Andi Kleen

    Andi Kleen
     

27 Jun, 2006

1 commit


23 Jun, 2006

1 commit

  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

22 May, 2006

1 commit

  • Functions compat_nfs_svc_trans, compat_nfs_clnt_trans,
    compat_nfs_exp_trans, compat_nfs_getfd_trans and compat_nfs_getfs_trans,
    which are called by compat_sys_nfsservctl (fs/compat.c), don't handle the
    return value of access_ok properly. access_ok returns 1 when the addr is
    valid, and 0 when it's not, but these functions have the reversed
    understanding. When the address is valid, they always return -EFAULT to
    compat_sys_nfsservctl.

    An example is to run /usr/sbin/rpc.nfsd (a 32-bit program on Power5). It
    doesn't function as expected. strace shows that nfsservctl returns
    -EFAULT.

    The patch fixes this by correcting the error handling on the return value
    of access_ok in the five functions.
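
    The inverted check can be illustrated with a small sketch. Here
    access_ok_stub() is a hypothetical stand-in that only follows the
    convention stated above (1 for a valid address, 0 otherwise), and the two
    trans_* functions are illustrative, not the kernel functions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for access_ok(): 1 if usable, 0 if not. */
static int access_ok_stub(const void *addr)
{
    return addr != NULL;
}

/* Reversed understanding: a valid address is reported as a fault. */
static int trans_buggy(const void *arg)
{
    if (access_ok_stub(arg))
        return -EFAULT;
    return 0;
}

/* Corrected: only an invalid address is reported as a fault. */
static int trans_fixed(const void *arg)
{
    if (!access_ok_stub(arg))
        return -EFAULT;
    return 0;
}
```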

    Signed-off-by: Lin Feng Shen
    Cc: Trond Myklebust
    Acked-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng Shen
     

16 May, 2006

1 commit


04 May, 2006

1 commit


02 May, 2006

1 commit


26 Apr, 2006

1 commit

  • This patch addresses a flaw in LSM, where there is no mediation of readv()
    and writev() for 32-bit compatible apps using a 64-bit kernel.

    This bug was discovered and fixed initially in the native readv/writev
    code [1], but was not fixed in the compat code. Thanks to Al for spotting
    this one.

    [1] http://lwn.net/Articles/154282/

    Signed-off-by: James Morris
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    James Morris
     

29 Mar, 2006

1 commit


26 Mar, 2006

2 commits


24 Mar, 2006

1 commit


18 Feb, 2006

1 commit

  • I got all of these backwards. We want to return

    min(input timeout, new timeout)

    to userspace to prevent increasing the time-remaining value.

    Thanks to Ernst Herzberg for reporting and diagnosing.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

12 Feb, 2006

1 commit

  • With David Woodhouse

    select() presently has a habit of increasing the value of the user's
    `timeout' argument on return.

    We were writing back a timeout larger than the original. We _deliberately_
    round up, since we know we must wait at _least_ as long as the caller asks
    us to.

    The patch adds a couple of helper functions for magnitude comparison of
    timespecs and of timevals, and uses them to prevent the various poll and
    select functions from returning a timeout which is larger than the one which
    was passed in.

    The patch also fixes a bug in compat_sys_pselect7(): it was adding the new
    timeout value to the old one and was returning that. It should just return
    the new timeout value.

    (We have various handy timespec/timeval-to-from-nsec conversion functions in
    time.h. But this code open-codes it all).
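
    A rough userspace sketch of the idea follows: compare two timevals by
    magnitude and never report back more time than the caller passed in. The
    names timeval_cmp() and timeout_to_report() are hypothetical, and both
    inputs are assumed to be normalized (0 <= tv_usec < 1000000).

```c
#include <sys/time.h>

/* Returns <0, 0 or >0 when a is smaller than, equal to or larger than b. */
static int timeval_cmp(const struct timeval *a, const struct timeval *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec ? -1 : 1;
    if (a->tv_usec != b->tv_usec)
        return a->tv_usec < b->tv_usec ? -1 : 1;
    return 0;
}

/* Value written back to the caller: min(input timeout, remaining time). */
static struct timeval timeout_to_report(struct timeval input,
                                        struct timeval remaining)
{
    return timeval_cmp(&remaining, &input) > 0 ? input : remaining;
}
```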

    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: Ulrich Drepper
    Cc: Thomas Gleixner
    Cc: george anzinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

02 Feb, 2006

2 commits

  • Most of the 64 bit architectures will zero extend the first argument to
    compat_sys_{openat,newfstatat,futimesat} which will fail if the 32 bit
    syscall was passed AT_FDCWD (which is a small negative number). Declare
    the first argument to be an unsigned int which will force the correct
    sign extension when the internal functions are called in each case.
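
    The effect can be reproduced in plain C. The sketch below models a 32-bit
    register value being widened to 64 bits both ways. AT_FDCWD_SKETCH mirrors
    the Linux value of AT_FDCWD (-100), and both function names are
    hypothetical. The unsigned-to-int conversion is implementation-defined in
    ISO C and is assumed to wrap, as it does with the usual compilers.

```c
#include <stdint.h>

#define AT_FDCWD_SKETCH (-100)   /* same value as AT_FDCWD on Linux */

/* Zero extension: the upper 32 bits are cleared, so -100 turns positive. */
static int64_t as_zero_extended(int32_t dfd32)
{
    return (int64_t)(uint64_t)(uint32_t)dfd32;
}

/*
 * Taking the argument as unsigned int and handing it to a function that
 * expects int makes the compiler redo the widening with sign extension.
 */
static int64_t as_resigned(unsigned int dfd)
{
    int fd = (int)dfd;
    return (int64_t)fd;
}
```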

    Also, do some small white space cleanups in fs/compat.c.

    Signed-off-by: Stephen Rothwell
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • fs/compat.c: In function `compat_sys_pselect7':
    fs/compat.c:1820: warning: passing arg 5 of `compat_core_sys_select' from incompatible pointer type

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

20 Jan, 2006

1 commit

  • The compat layer timeout handling changes in:

    9f72949f679df06021c9e43886c9191494fdb007

    are busted. This is most easily seen with an X application
    that uses sub-second select/poll timeout such as emacs. You
    hit a key and it takes a second or so before the app responds.

    The two ROUND_UP() calls upon entry are using {tv,ts}_sec where they
    should instead be using {tv_usec,ts_nsec}, which perfectly explains
    the observed incorrect behavior.
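
    To make the mix-up concrete, here is a sketch of a timeval-to-ticks
    conversion. The exact shape of the kernel expression is an assumption:
    HZ_SKETCH, the macro body and both function names are illustrative only.

```c
#define HZ_SKETCH 250                          /* assumed tick rate */
#define ROUND_UP(x, y) (((x) + (y) - 1) / (y))

/* Sub-second part taken from the seconds field, as the message describes. */
static long ticks_buggy(long tv_sec, long tv_usec)
{
    (void)tv_usec;                             /* microseconds are ignored */
    return ROUND_UP(tv_sec, 1000000 / HZ_SKETCH) + tv_sec * HZ_SKETCH;
}

/* Sub-second part rounded up from the microseconds field. */
static long ticks_fixed(long tv_sec, long tv_usec)
{
    return ROUND_UP(tv_usec, 1000000 / HZ_SKETCH) + tv_sec * HZ_SKETCH;
}
```

    Under these assumptions a 0.1 second timeout becomes 25 ticks in the fixed
    version, while the buggy version loses the sub-second part entirely.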

    Another bug shot down with git bisect.

    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David S. Miller
     

19 Jan, 2006

2 commits

  • The following implementation of ppoll() and pselect() system calls
    depends on the architecture providing a TIF_RESTORE_SIGMASK flag in the
    thread_info.

    These system calls have to change the signal mask during their
    operation, and signal handlers must be invoked using the new, temporary
    signal mask. The old signal mask must be restored either upon successful
    exit from the system call, or upon returning from the invoked signal
    handler if the system call is interrupted. We can't simply restore the
    original signal mask and return to userspace, since the restored signal
    mask may actually block the signal which interrupted the system call.

    The TIF_RESTORE_SIGMASK flag deals with this by causing the syscall exit
    path to trap into do_signal() just as TIF_SIGPENDING does, and by
    causing do_signal() to use the saved signal mask instead of the current
    signal mask when setting up the stack frame for the signal handler -- or
    by causing do_signal() to simply restore the saved signal mask in the
    case where there is no handler to be invoked.

    The first patch implements the sys_pselect() and sys_ppoll() system
    calls, which are present only if TIF_RESTORE_SIGMASK is defined. That
    #ifdef should go away in time when all architectures have implemented
    it. The second patch implements TIF_RESTORE_SIGMASK for the PowerPC
    kernel (in the -mm tree), and the third patch then removes the
    arch-specific implementations of sys_rt_sigsuspend() and replaces them
    with generic versions using the same trick.

    The fourth and fifth patches, provided by David Howells, implement
    TIF_RESTORE_SIGMASK for FR-V and i386 respectively, and the sixth patch
    adds the syscalls to the i386 syscall table.

    This patch:

    Add the pselect() and ppoll() system calls, providing core routines usable by
    the original select() and poll() system calls and also the new calls (with
    their semantics w.r.t timeouts).

    Signed-off-by: David Woodhouse
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
     
  • Here is a series of patches which introduce in total 13 new system calls
    which take a file descriptor/filename pair instead of a single file
    name. These functions, openat etc, have been discussed on numerous
    occasions. They are needed to implement race-free filesystem traversal,
    they are necessary to implement a virtual per-thread current working
    directory (think multi-threaded backup software), etc.
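
    From userspace, the directory-relative style looks roughly like the sketch
    below. open_in_dir() is a hypothetical helper, and it assumes a libc that
    exposes openat() under _POSIX_C_SOURCE 200809L.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `name` relative to the directory `dirpath`.
 * Returns a file descriptor, or -1 on failure.
 */
static int open_in_dir(const char *dirpath, const char *name)
{
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    int fd;

    if (dfd < 0)
        return -1;
    /* Resolved against dfd, not against the process-wide working directory. */
    fd = openat(dfd, name, O_RDONLY);
    close(dfd);
    return fd;
}
```

    A tree walker would normally keep the directory descriptor open across
    several such calls, so that renaming a parent directory in the meantime
    cannot redirect later lookups.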

    We have in glibc today implementations of the interfaces which use the
    /proc/self/fd magic. But this code is rather expensive. Here are some
    results (similar to what Jim Meyering posted before).

    The test creates a deep directory hierarchy on a tmpfs filesystem. Then
    rm -fr is used to remove all directories. Without syscall support I get
    this:

    real 0m31.921s
    user 0m0.688s
    sys 0m31.234s

    With syscall support the results are much better:

    real 0m20.699s
    user 0m0.536s
    sys 0m20.149s

    The interfaces are for obvious reasons currently not much used. But they'll
    be used. coreutils (and Jeff's posixutils) are already using them.
    Furthermore, code like ftw/fts in libc (maybe even glob) will also start using
    them. I expect a patch to make follow soon. Every program which is walking
    the filesystem tree will benefit.

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Cc: Al Viro
    Acked-by: Ingo Molnar
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

15 Jan, 2006

1 commit


09 Jan, 2006

1 commit

  • When making an fcntl locking call through compat_sys_fcntl64 (i.e. a 32bit
    app on a 64bit kernel), the syscall can return a locking range that is in
    conflict with the queried lock.

    If some aspect of this range does not fit in the 32bit structure, something
    needs to be done.

    The current code is wrong in several respects:

    - It returns data to userspace even if no conflict was found,
      i.e. it should check l_type for F_UNLCK.
    - It returns -EOVERFLOW too aggressively. A lock range covering
      the last possible byte of the file (start = COMPAT_OFF_T_MAX,
      len = 1) should be possible, but is rejected with the current test.
    - An extra-long 'len' should not be a problem: only that part
      of the conflicting lock that would be visible to the 32bit
      app needs to be reported to the 32bit app anyway.

    This patch addresses those three issues and adds a comment to (hopefully)
    record it for posterity.
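
    One way to read the three points above is sketched here. This is an
    interpretation, not the code from the patch: the struct layouts, the
    SKETCH_* constants and report_conflict_sketch() are all hypothetical, and
    start and len are assumed to be non-negative.

```c
#include <errno.h>
#include <stdint.h>

#define COMPAT_OFF_T_MAX_SKETCH INT32_MAX

enum { SKETCH_UNLCK = 0, SKETCH_RDLCK = 1 };

struct lock64_sketch { int type; int64_t start; int64_t len; };
struct lock32_sketch { int type; int32_t start; int32_t len; };

static int report_conflict_sketch(const struct lock64_sketch *in,
                                  struct lock32_sketch *out)
{
    /* No conflict found: report only the type, leave the range untouched. */
    if (in->type == SKETCH_UNLCK) {
        out->type = SKETCH_UNLCK;
        return 0;
    }
    /* Only a start offset beyond the 32-bit range is a real overflow. */
    if (in->start > COMPAT_OFF_T_MAX_SKETCH)
        return -EOVERFLOW;

    out->type = in->type;
    out->start = (int32_t)in->start;
    /* An over-long length is truncated to what the 32-bit app can see. */
    out->len = in->len > COMPAT_OFF_T_MAX_SKETCH
                   ? COMPAT_OFF_T_MAX_SKETCH
                   : (int32_t)in->len;
    return 0;
}
```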

    Note: this patch mainly affects test-cases. Real applications rarely if
    ever see the problems.

    This patch has been tested (LSB test suite), and works.

    Signed-off-by: Neil Brown
    Cc: Arnd Bergmann
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

05 Jan, 2006

1 commit

  • In particular, allow over-large read- or write-requests to be downgraded
    to a more reasonable range, rather than considering them outright errors.

    We want to protect lower layers from (the sadly all too common) overflow
    conditions, but prefer to do so by chopping the requests up, rather than
    just refusing them outright.
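
    The policy amounts to clamping rather than rejecting. In the sketch below,
    MAX_RW_SKETCH is a hypothetical cap chosen for illustration, not the limit
    the kernel uses, and clamp_rw_count() is a made-up helper.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_RW_SKETCH ((size_t)INT_MAX)   /* hypothetical upper bound */

/* Downgrade an over-large request to the cap instead of failing it. */
static size_t clamp_rw_count(size_t count)
{
    return count > MAX_RW_SKETCH ? MAX_RW_SKETCH : count;
}
```

    A caller that asked for more simply sees a short read or write and can
    issue the rest of the request in a follow-up call.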

    Cc: Peter Anvin
    Cc: Ulrich Drepper
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Nov, 2005

1 commit

  • In fs/compat.c, whenever put_compat_statfs() returns an error, the
    containing syscall returns -EFAULT. This is presumably by analogy with the
    non-compat case, where any non-zero code from copy_to_user() should be
    translated into an EFAULT. However, put_compat_statfs() can also return
    -EOVERFLOW. The same applies for put_compat_statfs64().

    This bug can be observed with a statfs() on a hugetlbfs directory.
    hugetlbfs, when mounted without limits, reports available, free and total
    blocks as -1 (itself a bug, another patch coming). statfs() will
    mysteriously return EFAULT although its parameters are perfectly valid
    addresses.

    This patch causes the compat versions of statfs() and statfs64() to
    correctly propagate the return values from put_compat_statfs() and
    put_compat_statfs64().
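
    The difference between the two behaviours can be sketched like this.
    put_statfs_stub() is a hypothetical stand-in that can fail in two distinct
    ways, and statfs_before() and statfs_after() are illustrative wrappers.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: fails with -EFAULT or -EOVERFLOW, else returns 0. */
static int put_statfs_stub(uint64_t blocks, int bad_address)
{
    if (bad_address)
        return -EFAULT;
    if (blocks > UINT32_MAX)
        return -EOVERFLOW;       /* value does not fit the 32-bit field */
    return 0;
}

/* Old behaviour: every failure is flattened into -EFAULT. */
static int statfs_before(uint64_t blocks, int bad_address)
{
    if (put_statfs_stub(blocks, bad_address))
        return -EFAULT;
    return 0;
}

/* New behaviour: the helper's own error code is passed through. */
static int statfs_after(uint64_t blocks, int bad_address)
{
    return put_statfs_stub(blocks, bad_address);
}
```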

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

21 Nov, 2005

1 commit

  • Originally for 2.6.16, but the semaphore causes problems for some
    people so get rid of it now.

    It's not needed anymore because the ioctl hash table is never changed
    at run time now.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

30 Oct, 2005

1 commit

  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be found inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.
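
    The convention can be sketched with a toy structure. struct mm_sketch and
    the raise/lower/peak helpers are hypothetical, and the real
    update_hiwater_rss is described above as a macro, not a function.

```c
/* Toy stand-in for the relevant mm fields. */
struct mm_sketch {
    unsigned long rss;
    unsigned long hiwater_rss;
};

/* Record the current rss as the high-water mark if it is a new maximum. */
static void update_hiwater_rss(struct mm_sketch *mm)
{
    if (mm->hiwater_rss < mm->rss)
        mm->hiwater_rss = mm->rss;
}

/* Hot path: raising rss does no high-water bookkeeping at all. */
static void raise_rss(struct mm_sketch *mm, unsigned long pages)
{
    mm->rss += pages;
}

/* Cold path: capture the peak just before rss is about to drop. */
static void lower_rss(struct mm_sketch *mm, unsigned long pages)
{
    update_hiwater_rss(mm);
    mm->rss -= pages;
}

/* The collector takes the maximum with the current rss itself. */
static unsigned long peak_rss(const struct mm_sketch *mm)
{
    return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
}
```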

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Sep, 2005

1 commit


10 Sep, 2005

1 commit