28 May, 2010

26 commits

  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems currently is called
    simple_sync_file which doesn't make too much sense to start with,
    and the generic one for simple filesystems is called simple_fsync
    which can lead to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync to make it obvious
    what to expect. In addition add some documentation for both methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a mutex_unlock missing on the error path. At other exits from the
    function that return an error flag, the mutex is unlocked, so do the same
    here.

    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression E1;
    @@

    * mutex_lock(E1,...);

    * mutex_unlock(E1,...);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Al Viro

    Julia Lawall
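
    The pattern this fix restores can be sketched in userspace C. This is only
    an illustration under stated assumptions: struct demo_dev and demo_update()
    are hypothetical names, and pthread mutexes stand in for the kernel mutex
    API. Every error exit goes through one label that drops the lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical state guarded by a lock; not taken from the patch. */
struct demo_dev {
    pthread_mutex_t lock;
    int value;
};

/* Update dev->value; every exit, including the error exit, unlocks. */
int demo_update(struct demo_dev *dev, int new_value)
{
    int err = 0;

    assert(dev);
    pthread_mutex_lock(&dev->lock);
    if (new_value < 0) {
        err = -EINVAL;
        goto out_unlock; /* an early "return err" here would leak the lock */
    }
    dev->value = new_value;
out_unlock:
    pthread_mutex_unlock(&dev->lock);
    return err;
}
```

    After a failed call the lock is free again, so a later caller does not
    block forever.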
     
  • __aio_put_req() plays sick games with file refcount. What
    it wants is fput() from atomic context; it's almost always
    done with f_count > 1, so they only have to deal with delayed
    work in rare cases when their reference happens to be the
    last one. Current code decrements f_count and if it hasn't
    hit 0, everything is fine. Otherwise it keeps a pointer
    to struct file (with zero f_count!) around and has delayed
    work do __fput() on it.

    Better way to do it: use atomic_long_add_unless( , -1, 1)
    instead of !atomic_long_dec_and_test(). IOW, decrement it
    only if it's not the last reference, leave refcount alone
    if it was. And use normal fput() in delayed work.

    I've made that atomic_long_add_unless call a new helper -
    fput_atomic(). Drops a reference to file if it's safe to
    do in atomic (i.e. if that's not the last one), tells if
    it had been able to do that. aio.c converted to it, __fput()
    use is gone. req->ki_file *always* contributes to refcount
    now. And __fput() became static.

    Signed-off-by: Al Viro

    Al Viro
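
    A minimal userspace sketch of the "decrement unless it is the last
    reference" idea, assuming C11 atomics in place of the kernel's atomic_long
    helpers. The names add_unless(), struct demo_file and demo_fput_atomic()
    are hypothetical stand-ins, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Stand-in for atomic_long_add_unless(v, a, u): add a to *v unless
 * *v == u. Returns true if the addition was performed.
 */
static bool add_unless(atomic_long *v, long a, long u)
{
    long cur = atomic_load(v);

    while (cur != u) {
        /* On failure the CAS reloads cur, so the loop re-checks it. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* Hypothetical file object carrying only a reference count. */
struct demo_file {
    atomic_long f_count;
};

/*
 * Drop a reference only if it is not the last one. Returns false when
 * the caller holds the last reference and must defer the final put.
 */
bool demo_fput_atomic(struct demo_file *f)
{
    assert(f);
    return add_unless(&f->f_count, -1, 1);
}
```

    With this shape the count never reaches zero in atomic context; the last
    reference stays in place until the delayed work drops it normally.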
     
  • Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.

    In particular, before this patch, the command
    ls -l
    in an NFS mounted directory would always check if the directory on the server
    had changed and if so would flush and refill the pagecache for the dir.
    After this patch, the same "ls -l" will repeatedly return stale data until
    the cached attributes for the directory time out.

    The following patch fixes this by ensuring the d_revalidate is called by
    do_last when "." is being looked-up.
    link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
    is not set so nfs_lookup_verify_inode chooses not to do any validation.

    The following patch restores the original behaviour.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro

    Neil Brown
     
  • This reverts commit a7cf4145bb86aaf85d4d4d29a69b50b688e2e49d.

    Al Viro
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
    ext4: Make fsync sync new parent directories in no-journal mode
    ext4: Drop whitespace at end of lines
    ext4: Fix compat EXT4_IOC_ADD_GROUP
    ext4: Conditionally define compat ioctl numbers
    tracing: Convert more ext4 events to DEFINE_EVENT
    ext4: Add new tracepoints to track mballoc's buddy bitmap loads
    ext4: Add a missing trace hook
    ext4: restart ext4_ext_remove_space() after transaction restart
    ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
    ext4: Avoid crashing on NULL ptr dereference on a filesystem error
    ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
    ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
    ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
    ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
    ext4: Use our own write_cache_pages()
    ext4: Show journal_checksum option
    ext4: Fix for ext4_mb_collect_stats()
    ext4: check for a good block group before loading buddy pages
    ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
    ext4: Remove extraneous newlines in ext4_msg() calls
    ...

    Fixed up trivial conflict in fs/ext4/fsync.c

    Linus Torvalds
     
  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: Fix another nfs_wb_page() deadlock
    NFS: Ensure that we mark the inode as dirty if we exit early from commit
    NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker
    sunrpc: fix leak on error on socket xprt setup

    Linus Torvalds
     
  • Do not use the fallback default_llseek() if the readdir operation of the
    filesystem still uses the big kernel lock.

    Since llseek() modifies
    file->f_pos of the directory directly it may need locking to not confuse
    readdir which usually uses file->f_pos directly as well.

    Since the special characteristics of the BKL (unlocked on schedule) are
    not necessary in this case, the inode mutex can be used for locking as
    provided by generic_file_llseek(). This is only possible since all
    filesystems, except reiserfs, either use a directory as a flat file or
    with disk address offsets. Reiserfs on the other hand uses a 32bit hash
    of the filename as the offset so generic_file_llseek() can get used as
    well since the hash is always smaller than sb->s_maxbytes (= (512 << 32) -
    blocksize).

    Signed-off-by: Jan Blunck
    Acked-by: Jan Kara
    Acked-by: Anders Larsen
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • This is an implementation of ->llseek useable for the rare special case
    when userspace expects the seek to succeed but the (device) file is
    actually not able to perform the seek. In this case you use noop_llseek()
    instead of falling back to the default implementation of ->llseek.

    Signed-off-by: Jan Blunck
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
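
    A sketch of what such a no-op seek looks like, assuming a hypothetical
    struct demo_file that carries only the file position. It reports success
    by returning the current position and changes nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file object; only the position matters here. */
struct demo_file {
    int64_t f_pos;
};

/*
 * Seek that always "succeeds": the requested offset and whence are
 * ignored and the unchanged current position is returned.
 */
int64_t demo_noop_llseek(struct demo_file *file, int64_t offset, int whence)
{
    assert(file);
    (void)offset;
    (void)whence;
    return file->f_pos;
}
```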
     
  • The aio compat code was not converting the struct iovecs from 32bit to
    64bit pointers, causing either EINVAL to be returned from io_getevents, or
    EFAULT as the result of the I/O. This patch passes a compat flag to
    io_submit to signal that pointer conversion is necessary for a given iocb
    array.

    A variant of this was tested by Michael Tokarev. I have also updated the
    libaio test harness to exercise this code path with good success.
    Further, I grabbed a copy of ltp and ran the
    testcases/kernel/syscall/readv and writev tests there (compiled with -m32
    on my 64bit system). All seems happy, but extra eyes on this would be
    welcome.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
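
    The conversion that was missing can be illustrated with a small sketch.
    The struct and function names below are hypothetical; the layout assumed
    for the 32bit side is a 32bit pointer followed by a 32bit length, which
    must be widened field by field rather than reinterpreted in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of an I/O vector as a 32bit task passes it. */
struct demo_compat_iovec {
    uint32_t iov_base; /* 32bit user pointer */
    uint32_t iov_len;  /* 32bit length */
};

/* Native I/O vector with full-width pointer and length. */
struct demo_iovec {
    void *iov_base;
    size_t iov_len;
};

/* Widen n compat vectors from src into native vectors in dst. */
void demo_widen_iovecs(const struct demo_compat_iovec *src,
                       struct demo_iovec *dst, size_t n)
{
    assert(src && dst);
    for (size_t i = 0; i < n; i++) {
        dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
        dst[i].iov_len = src[i].iov_len;
    }
}
```

    Treating the compat array as if it were already an array of native
    vectors reads past the intended entries on a 64bit kernel, which matches
    the EFAULT and EINVAL symptoms described above.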
     
  • It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and
    writev AIO operations were not functioning properly. It turns out that
    the code to convert the 32bit io vectors to 64 bits was never written.
    The results of that can be pretty bad, but in my testing, it mostly ended
    up in generating EFAULT as we walked off the list of I/O vectors provided.

    This patch set fixes the problem in my environment. Feedback is greatly
    appreciated.

    This patch:

    Factor out code that will be used by both compat_do_readv_writev and the
    compat aio submission code paths.

    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes the
    purpose of the operation clearer; the latter otherwise looks like a
    no-op.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
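
    Why the two spellings are equivalent can be shown with userspace
    stand-ins for the error-pointer helpers. The demo_ names are hypothetical
    and the definitions are a simplified sketch that assumes error codes are
    encoded in the top 4095 pointer values.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value carried by an error pointer. */
static inline long demo_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr falls in the range reserved for encoded errors. */
static inline int demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-(intptr_t)DEMO_MAX_ERRNO);
}

/* Pass an error pointer through unchanged, only dropping its type. */
static inline void *demo_err_cast(const void *ptr)
{
    assert(demo_is_err(ptr));
    return (void *)ptr;
}
```

    Both forms yield the same pointer value; the cast form just says directly
    that an error is being forwarded under a different pointer type.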
     
  • Extend KCORE_TEXT to cover the pages between _text and _stext, to allow
    examining some important page table pages.

    `readelf -a` output on x86_64 before and after patch:
           Type  Offset              VirtAddr            PhysAddr
    before LOAD  0x00007fff8100c000  0xffffffff81009000  0x0000000000000000
    after  LOAD  0x00007fff81003000  0xffffffff81000000  0x0000000000000000

    The newly covered pages are:

    0xffffffff81000000 etc.
    0xffffffff81001000
    0xffffffff81002000
    0xffffffff81003000
    0xffffffff81004000
    0xffffffff81005000
    0xffffffff81006000
    0xffffffff81007000
    0xffffffff81008000

    Before patch, /proc/kcore shows outdated contents for the above page
    table pages, for example:

    (gdb) p level3_ident_pgt
    $1 = {} 0xffffffff81002000
    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    $2 = {{pud = 0x1006063}, {pud = 0x0} }

    while the real content is:

    root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem
    1002000 6063 0100 0000 0000 8067 0000 0000 0000
    1002010 0000 0000 0000 0000 0000 0000 0000 0000
    *
    1003000

    That is, on an x86_64 box with 2GB memory, we can see first-1GB / full-2GB
    identity mapping before/after patch:

    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    before $1 = {{pud = 0x1006063}, {pud = 0x0} }
    after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} }

    Obviously the content before patch is wrong.

    Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • A quick test shows these comments are obsolete, so just remove them.

    Signed-off-by: WANG Cong
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • I removed 3 unused assignments. The first two get reset on the first
    statement of their functions. For "err" in root.c we don't return an
    error and we don't use the variable again.

    Signed-off-by: Dan Carpenter
    Cc: Oleg Nesterov
    Acked-by: Serge Hallyn
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Now that task->signal can't go away get_nr_threads() doesn't need
    ->siglock to read signal->count.

    Also, make it inline, move into sched.h, and convert 2 other proc users of
    signal->count to use this (now trivial) helper.

    Henceforth get_nr_threads() is the only valid user of signal->count, we
    are ready to turn it into "int nr_threads" or, perhaps, kill it.

    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - move the cprm.mm_flags checks up, before we take mmap_sem

    - move down_write(mmap_sem) and ->core_state check from do_coredump()
    to coredump_wait()

    This simplifies the code and makes the locking symmetrical.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
    to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - kill "int dump_count", argv_split(argcp) accepts argcp == NULL.

    - move "int dump_count" under " if (ispipe)" branch, fail_dropcount
    can check ispipe.

    - move "char **helper_argv" as well, change the code to do argv_free()
    right after call_usermodehelper_fns().

    - If call_usermodehelper_fns() fails goto close_fail label instead
    of closing the file by hand.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_coredump() does a lot of file checks after it opens the file or calls
    usermode helper. But all of these checks are only needed in !ispipe case.

    Move this code into the "else" branch and kill the ugly repetitive ispipe
    checks.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first patch in this series introduced an init function to the
    call_usermodehelper api so that processes could be customized by caller.
    This patch takes advantage of that fact, by customizing the helper in
    do_coredump to create the pipe and set its core limit to one (for our
    recursion check). This lets us clean up the previous ugliness in the
    usermodehelper internals and factor call_usermodehelper out entirely.
    While I'm at it, we can also modify the helper setup to look for a core
    limit value of 1 rather than zero for our recursion check.

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • I recently had to recover some files from an old broken machine that was
    running BorderWare Document Gateway. It's basically a drop in web server
    for sharing files. From the look of the init process and using strings on
    a few files, it seems to be based on FreeBSD 3.3.

    The process turned out to be more difficult than I imagined, but to cut a
    long story short BorderWare in their wisdom use a nonstandard magic number
    in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount
    the file systems in order to recover the data. After a bit of hunting I
    was able to make a quick fix to fs/ufs/super.c in order to detect the new
    magic number.

    I assume that this number is the same for all installations. It's quite
    easy to find out from ufs_fs.h. The superblock sits 8k into the block
    device and the magic number is 1372 bytes into the superblock struct.

    # dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd
    00000000 97 26 24 0f |.&$.|
    #

    Signed-off-by: Thomas Stewart
    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Stewart
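
    The dd command above can be mirrored in a few lines of C. This sketch
    assumes the image is already in memory and that the magic is stored
    little-endian on disk (an assumption, not stated in the message); the
    demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SB_OFFSET    8192u /* superblock sits 8k into the device */
#define DEMO_MAGIC_OFFSET 1372u /* magic offset inside the superblock */

/* Decode four bytes as a little-endian 32bit value. */
static uint32_t demo_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Store the magic in *magic and return 0, or -1 if the image is short. */
int demo_read_ufs_magic(const unsigned char *image, size_t len,
                        uint32_t *magic)
{
    size_t off = (size_t)DEMO_SB_OFFSET + DEMO_MAGIC_OFFSET;

    assert(image && magic);
    if (len < off + 4)
        return -1;
    *magic = demo_le32(image + off);
    return 0;
}
```

    Under that byte-order assumption, the bytes 97 26 24 0f shown by hd
    decode to 0x0f242697.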
     
  • Use memdup_user when user data is immediately copied into the allocated
    region. Eliminate the variable ads, which is no longer useful.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression from,to,size,flag;
    position p;
    identifier l1,l2;
    @@

    - to = \(kmalloc@p\|kzalloc@p\)(size,flag);
    + to = memdup_user(from,size);
    if (
    - to==NULL
    + IS_ERR(to)
    || ...) {

    }
    - if (copy_from_user(to, from, size) != 0) {
    -
    - }
    //

    Signed-off-by: Julia Lawall
    Cc: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     

27 May, 2010

5 commits


26 May, 2010

9 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
    squashfs: update documentation to include description of xattr layout
    squashfs: fix name reading in squashfs_xattr_get
    squashfs: constify xattr handlers
    squashfs: xattr fix sparse warnings
    squashfs: xattr_lookup sparse fix
    squashfs: add xattr support configure option
    squashfs: add new extended inode types
    squashfs: add support for xattr reading
    squashfs: add xattr id support

    Linus Torvalds
     
  • fs/fscache/object-list.c: In function 'fscache_objlist_lookup':
    fs/fscache/object-list.c:105: warning: cast to pointer from integer of different size

    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • btrfs_dirty_inode tries to sneak in without much waiting or
    space reservation, mostly for performance reasons. This
    usually works well but can cause problems when there are
    many many writers.

    When btrfs_update_inode fails with ENOSPC, we fall back
    to a slower btrfs_start_transaction call that will reserve
    some space.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This moves the delalloc space reservation done for O_DIRECT
    into btrfs_direct_IO. This way we don't leak reserved space
    if the generic O_DIRECT write code errors out before it
    calls into btrfs_direct_IO.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • J.R. Okajima reports that the call to sync_inode() in nfs_wb_page() can
    deadlock with other writeback flush calls. It boils down to the fact
    that we cannot ever call writeback_single_inode() while holding a page
    lock (even if we do set nr_to_write to zero) since another process may
    already be waiting in the call to do_writepages(), and so will deny us
    the I_SYNC lock.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc
    call has been completed, we must re-mark the inode as dirty. Otherwise,
    future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to
    ensure that the data is on the disk.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Commit 9c7e7e23371e629dbb3b341610a418cdf1c19d91 (NFS: Don't call iput() in
    nfs_access_cache_shrinker) unintentionally removed the spin unlock for the
    inode->i_lock.

    Reported-by: David Howells
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This changes O_DIRECT write code to mark extents as delalloc
    while it is processing them. Yan Zheng has reworked the
    enospc accounting based on tracking delalloc extents and
    this makes it much easier to track enospc in the O_DIRECT code.

    There are a few special cases with the O_DIRECT code though,
    it only sets the EXTENT_DELALLOC bits, instead of doing
    EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
    we don't want to mess with clearing the dirty and uptodate
    bits when things go wrong. This is important because there
    are no pages in the page cache, so any extent state structs
    that we put in the tree won't get freed by releasepage. We have
    to clear them ourselves as the DIO ends.

    With this commit, we reserve space in btrfs_file_aio_write,
    and then as each btrfs_direct_IO call progresses it sets
    EXTENT_DELALLOC on the range.

    btrfs_get_blocks_direct is responsible for clearing the delalloc
    at the same time it drops the extent lock.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This adds:
    alias: devname:
    to some common kernel modules, which will allow the on-demand loading
    of the kernel module when the device node is accessed.

    Ideally all these modules would be compiled-in, but distros seem so
    much in love with their modularization that we need to cover the common
    cases with this new facility. It will allow us to remove a bunch of pretty
    useless init scripts and modprobes from init scripts.

    The static device node aliases will be carried in the module itself. The
    program depmod will extract this information to a file in the module directory:
    $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
    # Device nodes to trigger on-demand module loading.
    microcode cpu/microcode c10:184
    fuse fuse c10:229
    ppp_generic ppp c108:0
    tun net/tun c10:200
    dm_mod mapper/control c10:235

    Udev will pick up the depmod created file on startup and create all the
    static device nodes which the kernel modules specify, so that these modules
    get automatically loaded when the device node is accessed:
    $ /sbin/udevd --debug
    ...
    static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
    static_dev_create_from_modules: mknod '/dev/fuse' c10:229
    static_dev_create_from_modules: mknod '/dev/ppp' c108:0
    static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
    static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
    udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
    udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666

    A few device nodes are switched to statically allocated numbers, to allow
    the static nodes to work. This might also be useful for systems which still run
    a plain static /dev, which is completely unsafe to use with any dynamic minor
    numbers.

    Note:
    The devname aliases must be limited to the *common* and *single*instance*
    device nodes, like the misc devices, and never be used for conceptually limited
    systems like the loop devices, which should rather get fixed properly and get a
    control node for losetup to talk to, instead of creating a random number of
    device nodes in advance, regardless if they are ever used.

    This facility is to hide the mess distros are creating with too modularized
    kernels, and just to hide that these modules are not compiled-in, and not to
    paper-over broken concepts. Thanks! :)

    Cc: Greg Kroah-Hartman
    Cc: David S. Miller
    Cc: Miklos Szeredi
    Cc: Chris Mason
    Cc: Alasdair G Kergon
    Cc: Tigran Aivazian
    Cc: Ian Kent
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
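
    How a consumer might read one modules.devname entry can be sketched as
    follows. The field layout is inferred from the sample file above (module
    name, node name, then a type letter with major:minor); struct
    demo_devname and demo_parse_devname() are hypothetical names and this is
    not the udev implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed entry of the modules.devname file. */
struct demo_devname {
    char module[64]; /* kernel module to load on access */
    char node[64];   /* device node path relative to /dev */
    char type;       /* 'c' for character device in the sample */
    unsigned int major;
    unsigned int minor;
};

/* Parse one line; returns 0 on success, -1 for comments or bad lines. */
int demo_parse_devname(const char *line, struct demo_devname *out)
{
    const char *p;

    assert(line && out);
    p = line + strspn(line, " \t"); /* skip leading blanks */
    if (*p == '#' || *p == '\0' || *p == '\n')
        return -1;
    if (sscanf(p, "%63s %63s %c%u:%u", out->module, out->node,
               &out->type, &out->major, &out->minor) != 5)
        return -1;
    return 0;
}
```

    For the sample line "tun net/tun c10:200" this yields module "tun", node
    "net/tun", type 'c', major 10 and minor 200, which is the information
    needed to create the node with mknod.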