02 May, 2008

4 commits


01 May, 2008

12 commits

  • clamp() exists for this use.

    Signed-off-by: Harvey Harrison
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
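
As a rough illustration of the pattern a clamp helper replaces, here is a minimal userspace sketch. `clamp_int` is a hypothetical name used only for this example; it is not the kernel's type-checked `clamp()` macro, and it assumes `lo <= hi`.

```c
/* Hypothetical userspace stand-in for the kernel's clamp() macro. */
static inline int clamp_int(int val, int lo, int hi)
{
	/* Same result as the open-coded min(max(val, lo), hi), assuming lo <= hi. */
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}
```

For example, `clamp_int(5, 0, 3)` yields 3 and `clamp_int(-2, 0, 3)` yields 0.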
     
  • Here are some more places where path_{get,put}() can be used instead of
    a dput()/mntput() pair. Besides that, it fixes a bug in autofs4_mount_busy()
    where mntput() was called before dput().

    Signed-off-by: Jan Blunck
    Cc: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
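
The idea of handling the mount and dentry references as one pair can be modeled in userspace roughly as follows. All `mock_*` names are hypothetical and the counters are plain integers; this is a sketch of the ordering only, not the kernel's reference counting.

```c
/* Hypothetical mock types; the kernel uses struct vfsmount and struct dentry. */
struct mock_ref {
	int count;
};

struct mock_path {
	struct mock_ref *mnt;
	struct mock_ref *dentry;
};

/* Take a reference on both halves of the pair. */
static void mock_path_get(const struct mock_path *path)
{
	path->mnt->count++;
	path->dentry->count++;
}

/* Drop the dentry reference first, then the mount reference. */
static void mock_path_put(const struct mock_path *path)
{
	path->dentry->count--;
	path->mnt->count--;
}
```

Keeping the release order in one helper means a caller cannot get it backwards, which is the kind of mistake the message describes in autofs4_mount_busy().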
     
  • Jeff Moyer has identified a case where the autofs4 function
    root.c:try_to_fill_dentry() can return -EBUSY when it should return 0.

    Jeff's description of the way this happens is:

    "automount starts an expire for directory d. After the callout to the daemon,
    but before the rmdir, another process tries to walk into the same directory.
    It puts itself onto the waitq, pending the expiration.

    When the expire finishes, the second process is woken up. In
    try_to_fill_dentry, it does this check:

    status = d_invalidate(dentry);
    if (status != -EBUSY)
        return -EAGAIN;

    And status is EBUSY. The dentry still has a non-zero d_inode, and the
    flags do not contain LOOKUP_CONTINUE or LOOKUP_DIRECTORY

    So, we fall through and return -EBUSY to the caller."

    Signed-off-by: Jeff Moyer
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Jeff Moyer has identified a race due to an execution order dependency
    in the autofs4 function root.c:try_to_fill_dentry().

    Jeff's description of this race is:

    "P1 does a lookup of /mount/submount/foo. Since the VFS can't find an entry
    for "foo" under /mount/submount, it calls into the autofs4 kernel module to
    allocate a new dentry, D1. The kernel creates a new waitq for this lookup and
    calls the daemon to perform the mount.

    The daemon performs a mkdir of the "foo" directory under /mount/submount,
    which ends up creating a *new* dentry, D2.

    Then, P2 does a lookup of /mount/submount/foo. The VFS path walking logic
    finds a dentry in the dcache, D2, and calls the revalidate function with this.
    In the autofs4 revalidate code, we then trigger a mount, since the dentry is
    an empty directory that isn't a mountpoint, and so set DCACHE_AUTOFS_PENDING
    and call into the wait code to trigger the mount.

    The wait code finds our existing waitq entry (since it is keyed off of the
    directory name) and adds itself to the list of waiters.

    After the daemon finishes the mount, it calls back into the kernel to release
    the waiters. When this happens, P1 is woken up and goes about clearing the
    DCACHE_AUTOFS_PENDING flag, but it does this in D1! So, given that P1 in our
    case is a program that will immediately try to access a file under
    /mount/submount/foo, we end up finding the dentry D2 which still has the
    pending flag set, and we set out to wait for a mount *again*!

    So, one way to address this is to re-do the lookup at the end of
    try_to_fill_dentry, and to clear the pending flag on the hashed dentry. This
    seems a sane approach to me."

    And Jeff's patch does this.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Catch invalid dentry when calculating its path.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Re-order some code in expire.c:autofs4_expire_indirect() to avoid compile
    warning, reported by Harvey Harrison:

    CHECK fs/autofs4/expire.c
    fs/autofs4/expire.c:383:2: warning: context imbalance in
    'autofs4_expire_indirect' - unexpected unlock

    Signed-off-by: Ian Kent
    Reviewed-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • If utimensat() is called with both times set to UTIME_NOW or one of them to
    UTIME_NOW and the other to UTIME_OMIT, then it will update the file time
    without any permission checking.

    I don't think this can be used for anything other than a local DoS, but could
    be quite bewildering at that (e.g. "Why was that large source tree rebuilt
    when I didn't modify anything???")

    This affects all kernels from 2.6.22, when the utimensat() syscall was
    introduced.

    Fix by doing the same permission checking as for the "times == NULL" case.

    Thanks to Michael Kerrisk, whose utimensat-non-conformances-and-fixes.patch in
    -mm also fixes this (and breaks other stuff), only he didn't realize the
    security implications of this bug.

    Signed-off-by: Miklos Szeredi
    Cc: Ulrich Drepper
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
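
The condition described above can be sketched as a small userspace predicate. This is an illustration built from the commit text, not the kernel patch: `wants_null_times_check` is a hypothetical name, and the fallback `#define`s are assumptions that mirror the Linux values in case the headers do not expose the constants.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Fallbacks (assumed Linux values) in case the headers do not expose them. */
#ifndef UTIME_NOW
#define UTIME_NOW ((1L << 30) - 1L)
#endif
#ifndef UTIME_OMIT
#define UTIME_OMIT ((1L << 30) - 2L)
#endif

/*
 * Hypothetical helper: true when the call should get the same permission
 * check as times == NULL, i.e. both times are UTIME_NOW, or one is
 * UTIME_NOW and the other is UTIME_OMIT.
 */
static bool wants_null_times_check(const struct timespec *times)
{
	if (times == NULL)
		return true;

	bool a_now = times[0].tv_nsec == UTIME_NOW;
	bool m_now = times[1].tv_nsec == UTIME_NOW;
	bool a_omit = times[0].tv_nsec == UTIME_OMIT;
	bool m_omit = times[1].tv_nsec == UTIME_OMIT;

	return (a_now && m_now) || (a_now && m_omit) || (a_omit && m_now);
}
```

A call that carries an explicit timestamp in either slot returns false here and would follow the explicit-times path instead.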
     
  • Don't hold f->sem while calling into jffs2_do_create(). It makes lockdep
    unhappy, and we don't really need it -- the _reason_ it's a false
    positive is because nobody else can see this inode yet and so nobody
    will be trying to lock it anyway.

    Signed-off-by: David Woodhouse

    David Woodhouse
     
  • Ditch a couple of pointless casts from void *, and use the normal
    variable name 'f' for jffs2_inode_info pointers -- especially since
    it actually shows up in lockdep reports.

    Signed-off-by: David Woodhouse

    David Woodhouse
     
  • We have a race between fcntl() and close() that can lead to
    dnotify_struct inserted into inode's list *after* the last descriptor
    had been gone from current->files.

    Since that's the only point where dnotify_struct gets evicted, we are
    screwed - it will stick around indefinitely. Even after struct file in
    question is gone and freed. Worse, we can trigger send_sigio() on it at
    any later point, which allows sending an arbitrary signal to an arbitrary
    process if we manage to apply enough memory pressure to get the page
    that used to host that struct file and fill it with the right pattern...

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Greg Kroah-Hartman

    Robert P. J. Day
     
  • sysfs allows attribute files to be truncated, e.g. using ftruncate(), with the
    expected effect on their inode. For most attributes, this doesn't change the
    "real" size of the file i.e. how much can be read from it. However, the
    parameter validation for reading and writing binary attribute files is based
    on the inode size and not the size specified in the file's bin_attribute, so it
    can be broken by this. For example, if we try using dd to write to such a file:

    # pwd
    /sys/bus/pci/devices/0000:08:00.0
    # ls -l config
    -rw-r--r-- 1 root root 4096 Feb 1 17:35 config
    # dd if=/dev/zero of=config bs=4 count=1
    1+0 records in
    1+0 records out
    # ls -l config
    -rw-r--r-- 1 root root 0 Feb 1 17:50 config
    # dd if=/dev/zero of=config bs=4 count=1 seek=128
    dd: writing `config': No space left on device
    1+0 records in
    0+0 records out

    Also, after truncation to 0, parameter validation for read and write is
    disabled. Most bin_attribute read and write methods also validate the size and
    offset, but for some this will allow out-of-range access. This may be a
    security issue, though access to such files is often limited to root. In any
    case, the validation should remain for safety's sake!

    This was previously reported in Bugzilla as bug 9867.

    sysfs should ignore size changes or else refuse them (by returning -EINVAL).
    This patch makes it ignore them.

    Signed-off-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings
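
"Ignoring" a size change can be pictured as masking the size bit out of the attribute-change request before it is applied. The sketch below is a userspace model under that assumption; the `SKETCH_ATTR_*` values and `mock_iattr` type are hypothetical and only loosely modeled on the kernel's setattr flags.

```c
/* Hypothetical flag bits, loosely modeled on the kernel's ATTR_* flags. */
#define SKETCH_ATTR_MODE 0x1u
#define SKETCH_ATTR_SIZE 0x8u

struct mock_iattr {
	unsigned int valid; /* which fields the caller wants changed */
	long long size;     /* requested size, used only with SKETCH_ATTR_SIZE */
};

/* Drop any size change so the inode size keeps matching the attribute. */
static void ignore_size_change(struct mock_iattr *attr)
{
	attr->valid &= ~SKETCH_ATTR_SIZE;
}
```

With the size bit cleared, a truncating open or ftruncate() would leave the inode size untouched while other requested changes still go through.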
     

30 Apr, 2008

24 commits

  • * 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
    [S390] Update default configuration.
    [S390] use generic sys_ptrace
    [S390] Remove self ptrace IEEE_IP hack.
    [S390] Convert to SPARSEMEM & SPARSEMEM_VMEMMAP
    [S390] System z large page support.
    [S390] Convert machine feature detection code to C.
    [S390] vmemmap: use clear_table to initialise page tables.
    [S390] Move stfl to system.h and delete duplicated version.
    [S390] uaccess_mvcos: #ifdef config dependent code.
    [S390] cpu topology: Fix possible deadlock.
    [S390] Add topology_core_siblings to topology.h
    [S390] cio: Make isc handling more robust.
    [S390] remove -traditional
    [S390] Automatically detect added cpus.
    [S390] smp: Fix locking order.
    [S390] Add missing ifndef/define to include/asm-s390/sysinfo.h.
    [S390] Move show_regs to traps.c.
    [S390] cio: Use strict_strtoul() for attributes.

    Linus Torvalds
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
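
For reference, `__func__` is the predefined identifier from C99, while `__FUNCTION__` is the older GCC spelling. A minimal sketch of the replacement, with hypothetical function names:

```c
#include <stdio.h>

/* Hypothetical helper: formats a log line tagged with a function name. */
static int format_tag(char *buf, size_t len, const char *func, const char *msg)
{
	return snprintf(buf, len, "%s: %s", func, msg);
}

static int probe_device(char *buf, size_t len)
{
	/* Previously: format_tag(buf, len, __FUNCTION__, "starting"); */
	return format_tag(buf, len, __func__, "starting");
}
```

Calling `probe_device()` fills the buffer with "probe_device: starting".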
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Add calls to the generic object debugging infrastructure and provide fixup
    functions that keep the system alive when recoverable problems have
    been detected by the object debugging core code.

    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • fs/hfsplus/btree.c: In function 'hfsplus_bmap_alloc':
    fs/hfsplus/btree.c:239: warning: comparison is always false due to limited range of data type

    But this might hide a real bug?

    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • fs/hfs/btree.c: In function 'hfs_bmap_alloc':
    fs/hfs/btree.c:263: warning: comparison is always false due to limited range of data type

    The patch makes the warning go away, but the code might actually be buggy?

    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • replace all:
    big/little_endian_variable = cpu_to_[bl]eX([bl]eX_to_cpu(big/little_endian_variable) +
    expression_in_cpu_byteorder);
    with:
    [bl]eX_add_cpu(&big/little_endian_variable, expression_in_cpu_byteorder);
    generated with semantic patch

    Signed-off-by: Marcin Slusarz
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
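
The transformation can be illustrated in userspace for the 32-bit little-endian case. The `_sketch` names and the `le32_t` typedef are hypothetical stand-ins for the kernel helpers; the conversions go through explicit bytes so the example behaves the same on any host byte order.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t le32_t; /* hypothetical stand-in for the kernel's __le32 */

/* Read a little-endian stored value into CPU byte order. */
static uint32_t le32_to_cpu_sketch(le32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof b);
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Store a CPU byte order value as little-endian. */
static le32_t cpu_to_le32_sketch(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff), (uint8_t)((v >> 24) & 0xff)
	};
	le32_t out;

	memcpy(&out, b, sizeof out);
	return out;
}

/* One call instead of the open-coded convert, add, convert back. */
static void le32_add_cpu_sketch(le32_t *var, uint32_t val)
{
	*var = cpu_to_le32_sketch(le32_to_cpu_sketch(*var) + val);
}
```

The helper form names the variable once, which removes the chance of converting one variable and assigning to another.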
     
  • replace all:
    little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
    expression_in_cpu_byteorder);
    with:
    leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
    generated with semantic patch

    Signed-off-by: Marcin Slusarz
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • replace all:
    big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) +
    expression_in_cpu_byteorder);
    with:
    beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder);
    generated with semantic patch

    Signed-off-by: Marcin Slusarz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • replace all:
    big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) +
    expression_in_cpu_byteorder);
    with:
    beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder);
    generated with semantic patch

    Signed-off-by: Marcin Slusarz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • Use the proper helper to open a blockdevice by name for filesystem use,
    this makes sure it's properly claimed (also added for open-by-number) and
    gets rid of the struct file abuse.

    Tested by mounting a reiserfs filesystem with external journal.

    Signed-off-by: Christoph Hellwig
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Acked-by: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • fs/fuse/dev.c:306:2: warning: context imbalance in 'wait_answer_interruptible' - unexpected unlock
    fs/fuse/dev.c:361:2: warning: context imbalance in 'request_wait_answer' - unexpected unlock
    fs/fuse/dev.c:1002:4: warning: context imbalance in 'end_io_requests' - unexpected unlock

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fuse doesn't use i_mutex to protect setting i_size, and so
    generic_file_llseek() can be racy: it doesn't use i_size_read().

    So do a fuse specific llseek method, which does use i_size_read().

    [akpm@linux-foundation.org: make `retval' loff_t]
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
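
A userspace model of such an llseek method might look like the sketch below. The names and the `mock_file` type are hypothetical, and `size_snapshot` stands in for one consistent read of the file size, which is the role i_size_read() plays in the kernel; overflow handling is left out.

```c
#include <stdio.h> /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical model of an open file; not the kernel's struct file. */
struct mock_file {
	long long pos;
};

/*
 * Returns the new position, or -1 if the result would be out of range.
 * The size is passed in as a single snapshot so that it cannot change
 * between the bounds check and the position update.
 */
static long long llseek_sketch(struct mock_file *file, long long size_snapshot,
			       long long max_bytes, long long offset, int whence)
{
	switch (whence) {
	case SEEK_END:
		offset += size_snapshot;
		break;
	case SEEK_CUR:
		offset += file->pos;
		break;
	default: /* SEEK_SET: offset is already absolute */
		break;
	}

	if (offset < 0 || offset > max_bytes)
		return -1;

	file->pos = offset;
	return offset;
}
```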
     
  • Node ID is 64bit but it is passed as unsigned long to some functions. This
    breakage wasn't noticed, because libfuse uses unsigned long too.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fix a bug that Werner Baumann reported: fuse can send a bigger write request
    than the maximum specified. This only affected direct_io operation.

    In addition set a sane minimum for the max_read and max_write tunables, so I/O
    always makes some progress.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
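
Both parts of this fix amount to bounding a size. A minimal sketch, assuming a floor of 4096 bytes (the message does not state the value, so `SKETCH_MIN_IO` is an assumption, as are the function names):

```c
#include <stddef.h>

/* Assumed floor for the tunables; the commit message does not give a number. */
#define SKETCH_MIN_IO 4096u

/* Raise a too-small max_read/max_write tunable to the floor. */
static unsigned sane_max(unsigned requested)
{
	return requested < SKETCH_MIN_IO ? SKETCH_MIN_IO : requested;
}

/* Size of the next request: at most the limit, at most what is left. */
static size_t next_chunk(size_t remaining, unsigned max_write)
{
	size_t limit = sane_max(max_write);

	return remaining < limit ? remaining : limit;
}
```

A caller would loop on `next_chunk()` until nothing remains, so no single request exceeds the limit and every iteration makes progress.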
     
  • If the READ request returned a short count, then either

    - cached size is incorrect
    - filesystem is buggy, as short reads are only allowed on EOF

    So assume that the size is wrong and refresh it, so that cached read() doesn't
    zero fill the missing chunk.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large
    (1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
    (256MB/s vs 71MB/s).

    [mszeredi@suse.cz]:

    - split into smaller functions
    - testing
    - duplicate generic_file_aio_write(), so that there's no need to add a
    new ->perform_write() a_op. Comment from hch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Extract common code for setting i_size in write functions into a common
    helper.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Quoting Linus (3 years ago, FUSE inclusion discussions):

    "User-space filesystems are hard to get right. I'd claim that they
    are almost impossible, unless you limit them somehow (shared
    writable mappings are the nastiest part - if you don't have those,
    you can reasonably limit your problems by limiting the number of
    dirty pages you accept through normal "write()" calls)."

    Instead of attempting the impossible, I've just waited for the dirty page
    accounting infrastructure to materialize (thanks to Peter Zijlstra and
    others). This nicely solved the biggest problem: limiting the number of pages
    used for write caching.

    Some small details remained, however, which this largish patch attempts to
    address. It provides a page writeback implementation for fuse, which is
    completely safe against VM related deadlocks. Performance may not be very
    good for certain usage patterns, but generally it should be acceptable.

    It has been tested extensively with fsx-linux and bash-shared-mapping.

    Fuse page writeback design
    --------------------------

    fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
    It copies the contents of the original page, and queues a WRITE request to the
    userspace filesystem using this temp page.

    The writeback is finished instantly from the MM's point of view: the page is
    removed from the radix trees, and the PageDirty and PageWriteback flags are
    cleared.

    For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
    incremented. The per-bdi writeback count is not decremented until the actual
    write completes.

    On dirtying the page, fuse waits for a previous write to finish before
    proceeding. This ensures that only one temporary page is used at a
    time for one cached page.

    This approach is wasteful in both memory and CPU bandwidth, so why is this
    complication needed?

    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write. It may be buggy or even
    malicious, and fail to complete WRITE requests. We don't want unrelated parts
    of the system to grind to a halt in such cases.

    Also a filesystem may need additional resources (particularly memory) to
    complete a WRITE request. There's a great danger of a deadlock if that
    allocation may wait for the writepage to finish.

    Currently there are several cases where the kernel can block on page
    writeback:

    - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
    - page migration
    - throttle_vm_writeout (through NR_WRITEBACK)
    - sync(2)

    Of course in some cases (fsync, msync) we explicitly want to allow blocking.
    So for these cases new code has to be added to fuse, since the VM is not
    tracking writeback pages for us any more.

    As an extra safety measure, the maximum dirty ratio allocated to a single
    fuse filesystem is set to 1% by default. This way one (or several) buggy or
    malicious fuse filesystems cannot slow down the rest of the system by hogging
    dirty memory.

    With appropriate privileges, this limit can be raised through
    '/sys/class/bdi//max_ratio'.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point of view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
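
The two cleanups can be pictured with a small userspace model. The flag names, bit values, and `mock_bdi` type below are hypothetical; the kernel's actual bit assignments and helper names may differ.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration only. */
#define CAP_NO_ACCT_DIRTY 0x01u
#define CAP_NO_WRITEBACK  0x02u
#define CAP_NO_ACCT_WB    0x04u

/* One flag covering all three dirty/writeback related capabilities. */
#define CAP_NO_ACCT_AND_WRITEBACK \
	(CAP_NO_ACCT_DIRTY | CAP_NO_WRITEBACK | CAP_NO_ACCT_WB)

struct mock_bdi {
	unsigned int capabilities;
};

/* Static inline helpers give type checking that function-like macros lack. */
static inline bool mock_bdi_cap_account_writeback(const struct mock_bdi *bdi)
{
	return !(bdi->capabilities & CAP_NO_ACCT_WB);
}

static inline bool mock_bdi_cap_writeback_dirty(const struct mock_bdi *bdi)
{
	return !(bdi->capabilities & CAP_NO_WRITEBACK);
}
```

A memory-backed device would set the combined flag once, and every helper would then report that no accounting or writeback applies.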
     
  • Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR"

    Make the fuse control filesystem use s_dev instead of a fuse specific ID.
    This makes it easier to match directories under /sys/fs/fuse/connections/ with
    directories under /sys/class/bdi, and with actual mounts.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Register NFS' backing_dev_info under sysfs with the name "nfs-MAJOR:MINOR"

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Factor out the code used to allocate/free a pts index into new interfaces,
    devpts_new_index() and devpts_kill_index(). This localizes the external data
    structures used in managing the pts indices.

    [akpm@linux-foundation.org: undo accidental mutex2sem conversion]
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Serge Hallyn
    Signed-off-by: Matt Helsley
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
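
The shape of such an interface, with allocation state hidden behind two functions, can be sketched in userspace as follows. The `sketch_*` names and the fixed-size table are assumptions made for this example; the sketch does no locking and is not how devpts stores its indices.

```c
#include <stdbool.h>

#define SKETCH_MAX_PTYS 64 /* hypothetical limit for this sketch */

/* Allocation state stays private to this file, behind the two helpers. */
static bool index_in_use[SKETCH_MAX_PTYS];

/* Return the lowest free index, or -1 if none is left. */
static int sketch_new_index(void)
{
	for (int i = 0; i < SKETCH_MAX_PTYS; i++) {
		if (!index_in_use[i]) {
			index_in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Release an index so a later allocation can reuse it. */
static void sketch_kill_index(int idx)
{
	if (idx >= 0 && idx < SKETCH_MAX_PTYS)
		index_in_use[idx] = false;
}
```

Callers only see the two functions, so the underlying data structure can change without touching them.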