28 Jul, 2010

1 commit

  • fanotify, the upcoming notification system, actually needs a struct path so it can
    do opens in the context of listeners, and it needs a file so it can get f_flags
    from the original process. Close was the only operation that was already passing
    a struct file to the notification hook. This patch passes a file for access,
    modify, and open as well, since it is easily available to these hooks.

    Signed-off-by: Eric Paris

    Eric Paris
     

28 May, 2010

1 commit

  • This is an implementation of ->llseek usable for the rare special case
    when userspace expects the seek to succeed but the (device) file is
    actually not able to perform the seek. In this case you use noop_llseek()
    instead of falling back to the default implementation of ->llseek.

    Signed-off-by: Jan Blunck
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
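
A minimal userspace sketch of the idea described above, assuming a no-op seek simply reports the current position and changes nothing. The names `noop_file` and `noop_llseek_model` are hypothetical stand-ins, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct noop_file {
    int64_t f_pos; /* current file position */
};

/*
 * Model of a no-op llseek: ignore the requested offset and whence,
 * leave f_pos untouched, and return the current position so that
 * the caller sees the seek as having succeeded.
 */
int64_t noop_llseek_model(struct noop_file *file, int64_t offset, int whence)
{
    (void)offset;
    (void)whence;
    assert(file != NULL);
    return file->f_pos;
}
```

A device driver that cannot seek but must not fail a seek would plug such a function into its ->llseek slot.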
     

25 Mar, 2010

1 commit


04 Nov, 2009

1 commit

  • sendfile(2) was reworked with the splice infrastructure, but it still
    wrongly checks f_op.sendpage() instead of f_op.splice_write(). Although
    f_op.splice_write() currently always exists whenever f_op.sendpage()
    does, that assumption could silently break in the future. As a side
    effect, this patch also lets sendfile(2) work with any output file.
    Some security checks related to f_op are added too.

    Signed-off-by: Changli Gao
    Signed-off-by: Jens Axboe

    Changli Gao
     

24 Sep, 2009

1 commit

  • As Johannes Weiner pointed out, one of the range checks in do_sendfile
    is redundant and is already checked in rw_verify_area.

    Signed-off-by: Jeff Layton
    Reviewed-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Robert Love
    Cc: Mandeep Singh Baines
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Jeff Layton
     

11 May, 2009

1 commit

  • If f_op->splice_read() is not implemented, fall back to a plain read.
    Use vfs_readv() to read into previously allocated pages.

    This will allow splice and functions using splice, such as the loop
    device, to work on all filesystems. This includes "direct_io" files
    in fuse which bypass the page cache.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

05 Apr, 2009

1 commit

  • Instead of always splitting the file offset into 32-bit 'high' and 'low'
    parts, just split them into the largest natural word-size - which in C
    terms is 'unsigned long'.

    This allows 64-bit architectures to avoid the unnecessary 32-bit
    shifting and masking for native format (while the compat interfaces will
    obviously always have to do it).

    This also changes the order of 'high' and 'low' to be "low first". Why?
    Because when we have it like this, the 64-bit system calls now don't use
    the "pos_high" argument at all, and it makes more sense for the native
    system call to simply match the user-mode prototype.

    This results in a much more natural calling convention, and allows the
    compiler to generate much more straightforward code. On x86-64, we now
    generate

    testq %rcx, %rcx # pos_l
    js .L122 #,
    movq %rcx, -48(%rbp) # pos_l, pos

    from the C source

    loff_t pos = pos_from_hilo(pos_h, pos_l);
    ...
    if (pos < 0)
            return -EINVAL;

    and the 'pos_h' register isn't even touched. It used to generate code
    like

    mov %r8d, %r8d # pos_low, pos_low
    salq $32, %rcx #, tmp71
    movq %r8, %rax # pos_low, pos.386
    orq %rcx, %rax # tmp71, pos.386
    js .L122 #,
    movq %rax, -48(%rbp) # pos.386, pos

    which isn't _that_ horrible, but it does show how the natural word size
    is just a more sensible interface (same arguments will hold in the user
    level glibc wrapper function, of course, so the kernel side is just half
    of the equation!)

    Note: in all cases the user code wrapper can again be the same. You can
    just do

    #define HALF_BITS (sizeof(unsigned long)*4)
    __syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);

    or something like that. That way the user mode wrapper will also be
    nicely passing in a zero (it won't actually have to do the shifts, the
    compiler will understand what is going on) for the last argument.

    And that is a good idea, even if nobody will necessarily ever care: if
    we ever do move to a 128-bit lloff_t, this particular system call might
    be left alone. Of course, that will be the least of our worries if we
    really ever need to care, so this may not be worth really caring about.

    [ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]

    Acked-by: Gerd Hoffmann
    Cc: H. Peter Anvin
    Cc: Andrew Morton
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Ralf Baechle
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
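
The split and recombination described above can be sketched in portable userspace C. This is an illustration under stated assumptions, not the kernel's code: the `_sketch` function names are hypothetical, and the double half-width shift follows the HALF_BITS trick quoted in the message so that no single shift ever equals the full word width.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Half the width of unsigned long in bits (16 on ILP32, 32 on LP64). */
#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/* Kernel-side view: rebuild a 64-bit position from two native words. */
int64_t pos_from_hilo_sketch(unsigned long high, unsigned long low)
{
    /* Two half-width shifts; on LP64 this discards 'high' entirely. */
    uint64_t pos = ((uint64_t)high << HALF_LONG_BITS) << HALF_LONG_BITS;
    return (int64_t)(pos | low);
}

/* User-side view: split a 64-bit offset into "low first" native words. */
void pos_to_hilo_sketch(int64_t offset, unsigned long *high,
                        unsigned long *low)
{
    uint64_t u = (uint64_t)offset;

    assert(high != 0 && low != 0);
    *low = (unsigned long)u;
    /* On LP64 this is always zero, matching the message's remark. */
    *high = (unsigned long)((u >> HALF_LONG_BITS) >> HALF_LONG_BITS);
}
```

On a 64-bit build the high word is always zero and the low word carries the whole offset, which is why the compiler can ignore the high argument there.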
     

03 Apr, 2009

1 commit

  • This patch adds preadv and pwritev system calls. These syscalls are a
    pretty straightforward combination of pread and readv (same for write).
    They are quite useful for doing vectored I/O in threaded applications.
    Using lseek+readv instead opens race windows you'll have to plug with
    locking.

    Other systems have such system calls too, for example NetBSD, check
    here: http://www.daemon-systems.org/man/preadv.2.html

    The application-visible interface provided by glibc should look like
    this to be compatible to the existing implementations in the *BSD family:

    ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

    This prototype has one problem though: on 32-bit archs the (64-bit)
    offset argument is unaligned, which the syscall ABI of several archs
    doesn't allow. At least s390 needs a wrapper in glibc to handle this. As
    we'll need wrappers in glibc anyway, I've decided to push the problem to
    glibc entirely and use a syscall prototype which works without
    arch-specific wrappers inside the kernel: the offset argument is
    explicitly split into two 32-bit values.

    The patch contains the actual system call implementation and the wire-up
    in the x86 system call tables. Other archs follow as separate patches.

    Signed-off-by: Gerd Hoffmann
    Cc: Arnd Bergmann
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerd Hoffmann
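
As an illustration of the application-visible interface quoted above, here is a small sketch that reads two buffers from an explicit offset in one call. It assumes a Linux/glibc system where preadv() is declared in <sys/uio.h>; the helper names `read_header_and_body` and `preadv_demo` are hypothetical.

```c
#define _GNU_SOURCE /* for preadv() on glibc; must precede all includes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Read a header and a body starting at 'offset' with one system call,
 * without moving the file position shared by other threads using 'fd'.
 */
ssize_t read_header_and_body(int fd, void *hdr, size_t hdr_len,
                             void *body, size_t body_len, off_t offset)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    assert(hdr != NULL && body != NULL);
    return preadv(fd, iov, 2, offset);
}

/* Demo on a temporary file; returns 0 if everything behaved as expected. */
int preadv_demo(void)
{
    char hdr[2], body[3];
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;

    int ok = fputs("0123456789", fp) >= 0 && fflush(fp) == 0;
    int fd = fileno(fp);
    off_t before = lseek(fd, 0, SEEK_CUR);

    ok = ok && read_header_and_body(fd, hdr, sizeof hdr,
                                    body, sizeof body, 4) == 5;
    ok = ok && memcmp(hdr, "45", 2) == 0 && memcmp(body, "678", 3) == 0;
    /* Unlike lseek+readv, the file position is left where it was. */
    ok = ok && lseek(fd, 0, SEEK_CUR) == before;

    fclose(fp);
    return ok ? 0 : -1;
}
```

Because the offset travels with the call, two threads can read different regions through the same descriptor without any locking around a seek.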
     

14 Jan, 2009

5 commits


06 Jan, 2009

1 commit

  • This patch fixes a race condition in lseek. While it is expected that
    unpredictable behaviour may result while repositioning the offset of a
    file descriptor concurrently with reading/writing to the same file
    descriptor, this should not happen when merely *reading* the file
    descriptor's offset.

    Unfortunately, the only portable way in Unix to read a file
    descriptor's offset is lseek(fd, 0, SEEK_CUR); however executing this
    concurrently with read/write may mess up the position.

    [with fixes from akpm]

    Signed-off-by: Alain Knaff
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Alain Knaff
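
The message above does not spell out how the race is closed. One plausible approach, sketched here purely as an assumption, is to treat lseek(fd, 0, SEEK_CUR) as a pure query that never stores to the position field. `seek_file` and `llseek_model` are hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h> /* SEEK_SET, SEEK_CUR, NULL */

/* Hypothetical stand-in for struct file; 'stores' counts writes to f_pos. */
struct seek_file {
    int64_t f_pos;
    unsigned stores;
};

int64_t llseek_model(struct seek_file *file, int64_t offset, int whence)
{
    assert(file != NULL);

    switch (whence) {
    case SEEK_SET:
        break;
    case SEEK_CUR:
        /*
         * Position query: return without writing f_pos back, so a
         * concurrent read()/write() update is not overwritten with
         * a stale value.
         */
        if (offset == 0)
            return file->f_pos;
        offset += file->f_pos;
        break;
    default:
        return -EINVAL; /* other whence values are out of scope here */
    }

    if (offset < 0)
        return -EINVAL;

    file->f_pos = offset;
    file->stores++;
    return offset;
}
```

The point of the sketch is only that the query path performs a load and no store; real seeks still update the position as before.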
     

23 Oct, 2008

1 commit


03 Jul, 2008

1 commit

  • - Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
    failures in all users)
    - Change all users to either use generic_file_llseek_unlocked directly or
    take the BKL around it. I changed the file systems that don't use the BKL
    for anything (CIFS, GFS) to call it directly. NCPFS, SMBFS and NFS
    take the BKL, but explicitly in their own source now.

    I moved them all over in a single patch to avoid unbisectable sections.

    Open problem: 32-bit kernels can corrupt fpos because its modification
    is not atomic, but they can do that anyway because there are other paths
    that modify it without the BKL.

    Do we need a special lock for the pos/f_version = 0 checks?

    Trond says the NFS BKL is likely not needed, but keep it for now
    until his full audit.

    v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
    and factor duplicated code (suggested by hch)

    Cc: Trond.Myklebust@netapp.com
    Cc: swhiteho@redhat.com
    Cc: sfrench@samba.org
    Cc: vandrove@vc.cvut.cz

    Signed-off-by: Andi Kleen
    Signed-off-by: Andi Kleen
    Signed-off-by: Jonathan Corbet

    Andi Kleen
     

23 Apr, 2008

1 commit


09 Feb, 2008

1 commit

  • These exports (which aren't used and which are in fact dangerous to use
    because they pretty much form a security hole to use) have been marked
    _UNUSED since 2.6.24 with removal in 2.6.25. This patch is their final
    departure from the Linux kernel tree.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

29 Jan, 2008

1 commit


25 Jan, 2008

1 commit


15 Nov, 2007

1 commit

  • sys_open / sys_read were used in the early 1.2 days to load firmware from
    disk inside drivers. Since 2.0 or so this has been deprecated behavior, but
    several drivers were still using it. For a few years now we have had a
    request_firmware() API that implements this in a nice, consistent way.
    Only some old ISA sound drivers (pre-ALSA) still straggled along for some
    time... however, with commit c2b1239a9f22f19c53543b460b24507d0e21ea0c the
    last user is now gone.

    This is a good thing, since using sys_open / sys_read etc. for firmware is
    anywhere from very buggy to dangerous; these operations put an fd in the
    process file descriptor table... which can then be tampered with from
    other threads, for example. For those who don't want the firmware loader,
    filp_open()/vfs_read are the better APIs to use, without this security
    issue.

    The patch below marks sys_open and sys_read unused now that they're
    really not used anymore, and for deletion in the 2.6.25 timeframe.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

10 Oct, 2007

1 commit

  • The combination of S_ISGID bit set and S_IXGRP bit unset is used to mark the
    inode as "mandatory lockable" and there's a macro for this check called
    MANDATORY_LOCK(inode). However, fs/locks.c and some filesystems still perform
    the explicit i_mode checking. Besides, Andrew pointed out that this macro is
    buggy itself, as it dereferences the inode arg twice.

    Convert this macro into a static inline function and switch its users to it,
    making the code shorter and more readable.

    The __mandatory_lock() helper is to be used in places where the IS_MANDLOCK()
    for superblock is already known to be true.

    Signed-off-by: Pavel Emelyanov
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: David Howells
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton

    Pavel Emelyanov
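
The mode-bit test described above (S_ISGID set, S_IXGRP clear) can be written as a function that evaluates its argument exactly once. This is a userspace sketch operating on a plain mode_t; the name `mode_is_mandatory_lockable` is hypothetical and the kernel helper takes an inode instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * True when the setgid bit is set and the group-execute bit is clear,
 * the combination that marks a file as "mandatory lockable".
 * Being a function, it reads 'mode' once, unlike a macro that
 * expands its argument twice.
 */
bool mode_is_mandatory_lockable(mode_t mode)
{
    assert((mode & ~(mode_t)07777) == 0 || true); /* any mode is accepted */
    return (mode & (S_ISGID | S_IXGRP)) == S_ISGID;
}
```

For example, a mode of 02644 qualifies, while 02654 (group-execute set) and 0644 (no setgid) do not.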
     

10 Jul, 2007

3 commits


09 May, 2007

2 commits


13 Feb, 2007

1 commit

  • oprofile hunting showed a stall in rw_verify_area(), because of triple
    indirection and potential cache misses.
    (file->f_path.dentry->d_inode->i_flock)

    By moving initialization of 'struct inode' pointer before the pos/count
    sanity tests, we allow the compiler and processor to perform two loads by
    anticipation, reducing stall, without prefetch() hints. Even x86 arch has
    enough registers to not use temporary variables and not increase text size.

    I validated this patch running a bench and studied oprofile changes, and
    absolute perf of the test program.

    Results of my epoll_pipe_bench (source available on request) on a Pentium-M
    1.6 GHz machine

    Before :
    # ./epoll_pipe_bench -l 30 -t 20
    Avg: 436089 evts/sec read_count=8843037 write_count=8843040 21.218390 samples per call
    (best value out of 10 runs)

    After :
    # ./epoll_pipe_bench -l 30 -t 20
    Avg: 470980 evts/sec read_count=9549871 write_count=9549894 21.216694 samples per call
    (best value out of 10 runs)

    oprofile CPU_CLK_UNHALTED events gave a reduction from 5.3401 % to 2.5851 %
    for the rw_verify_area() function.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

12 Feb, 2007

1 commit

  • They are fat: 4x8 bytes in task_struct.
    They are unconditionally updated in every fork, read, write and sendfile.
    They are used only if you have some "extended acct fields feature".

    And please, please, please, read(2) knows about bytes, not characters,
    so why is it called "rchar"?

    Signed-off-by: Alexey Dobriyan
    Cc: Jay Lan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

14 Dec, 2006

1 commit


09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
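
A rough userspace model of the layout change and the two transition #defines described above. All struct names ending in `_model`, plus `path_file`, are hypothetical; only the field names f_path, dentry, mnt, f_dentry and f_vfsmnt come from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct dentry_model   { const char *name; };
struct vfsmount_model { const char *devname; };

/* The pair that used to be two independent pointers in struct file. */
struct path_model {
    struct vfsmount_model *mnt;
    struct dentry_model *dentry;
};

struct path_file {
    struct path_model f_path;
};

/* Transition aliases so that old-style field accesses keep compiling. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt

/* Old-style access, resolved through the alias. */
const char *file_name_old_style(const struct path_file *file)
{
    assert(file != NULL && file->f_dentry != NULL);
    return file->f_dentry->name;
}

/* New-style access through the embedded path. */
const char *file_name_new_style(const struct path_file *file)
{
    assert(file != NULL && file->f_path.dentry != NULL);
    return file->f_path.dentry->name;
}
```

Both accessors read the same storage, which is what lets callers be converted gradually.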
     

01 Oct, 2006

4 commits

  • This work was initially done by Zach Brown to add support for vectored aio.
    These are the core changes for AIO to support
    IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

    [akpm@osdl.org: huge build fix]
    Signed-off-by: Zach Brown
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Badari Pulavarty
    Acked-by: Benjamin LaHaise
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • This patch cleans up the generic_file_*_read/write() interfaces. Christoph
    Hellwig gave me the idea for these cleanups.

    In a nutshell, all filesystems should set .aio_read/.aio_write methods and use
    do_sync_read()/do_sync_write() as their .read/.write methods. This allows us
    to clean up all variants of generic_file_* routines.

    Final available interfaces:

    generic_file_aio_read() - read handler
    generic_file_aio_write() - write handler
    generic_file_aio_write_nolock() - no lock write handler

    __generic_file_aio_write_nolock() - internal worker routine

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • This patch removes readv() and writev() methods and replaces them with
    aio_read()/aio_write() methods.

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • This patch vectorizes aio_read() and aio_write() methods to prepare for
    collapsing all aio & vectored operations into one interface - which is
    aio_read()/aio_write().

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Christoph Hellwig
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

11 Jul, 2006

1 commit


11 Apr, 2006

1 commit


29 Mar, 2006

1 commit

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (harder to accidentally write to
    shared data structures) and to reduce the false sharing of cachelines with
    things that get dirty in .data (while .rodata is nicely read-only and thus
    cache clean).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

26 Mar, 2006

1 commit


10 Jan, 2006

1 commit