31 May, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    mm: export generic_pipe_buf_*() to modules
    fuse: support splice() reading from fuse device
    fuse: allow splice to move pages
    mm: export remove_from_page_cache() to modules
    mm: export lru_cache_add_*() to modules
    fuse: support splice() writing to fuse device
    fuse: get page reference for readpages
    fuse: use get_user_pages_fast()
    fuse: remove unneeded variable

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
    quota: Convert quota statistics to generic percpu_counter
    ext3 uses rb_node = NULL; to zero rb_root.
    quota: Fixup dquot_transfer
    reiserfs: Fix resuming of quotas on remount read-write
    pohmelfs: Remove dead quota code
    ufs: Remove dead quota code
    udf: Remove dead quota code
    quota: rename default quotactl methods to dquot_
    quota: explicitly set ->dq_op and ->s_qcop
    quota: drop remount argument to ->quota_on and ->quota_off
    quota: move unmount handling into the filesystem
    quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
    quota: move remount handling into the filesystem
    ocfs2: Fix use after free on remount read-only

    Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c

    Linus Torvalds
     

30 May, 2010

11 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: clean up on forwarded aborted mds request
    ceph: fix leak of osd authorizer
    ceph: close out mds, osd connections before stopping auth
    ceph: make lease code DN specific
    fs/ceph: Use ERR_CAST
    ceph: renew auth tickets before they expire
    ceph: do not resend mon requests on auth ticket renewal
    ceph: removed duplicated #includes
    ceph: avoid possible null dereference
    ceph: make mds requests killable, not interruptible
    sched: add wait_for_completion_killable_timeout

    Linus Torvalds
     
  • If an mds request is aborted (timeout, SIGKILL), it is left registered to
    keep our state in sync with the mds. If we get a forward notification,
    though, we know the request didn't succeed and we can unregister it
    safely. We were trying to resend it, but then bailing out (and not
    unregistering) in __do_request.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Release the ceph_authorizer when releasing osd state.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The auth module (part of the mon_client) is needed to free any
    ceph_authorizer(s) used by the mds and osd connections. Flush the msgr
    workqueue before stopping monc to ensure that the destroy_authorizer
    auth op is available when those connections are closed out.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The lease code includes a mask in the CEPH_LOCK_* namespace, but that
    namespace is changing, and only one mask (formerly _DN == 1) is used, so
    hard code for that value for now.

    If we ever extend this code to handle leases over different data types we
    can extend it accordingly.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
    clear what is the purpose of the operation, which otherwise looks like a
    no-op.

    In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
    the returned value is the same as the type of the enclosing function.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Sage Weil

    Julia Lawall
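
    To illustrate why ERR_CAST(x) reads better than ERR_PTR(PTR_ERR(x)), here is
    a minimal userspace sketch. The helpers below are simplified stand-ins for
    the kernel's error-pointer macros, and toy_iget(), toy_lookup() and both
    struct types are hypothetical names invented for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers (simplified). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the last MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Re-type an error pointer without decoding and re-encoding the errno. */
static inline void *ERR_CAST(const void *ptr)
{
	return (void *)ptr;
}

/* Hypothetical types, used only to get two distinct pointer types. */
struct toy_inode { int ino; };
struct toy_dentry { int id; };

/* Hypothetical lookup that always fails with -ENOENT. */
struct toy_inode *toy_iget(int ino)
{
	(void)ino;
	return ERR_PTR(-ENOENT);
}

/*
 * The return type differs from toy_iget(), so the error pointer has to be
 * re-typed before it can be passed up to the caller.
 */
struct toy_dentry *toy_lookup(int ino)
{
	struct toy_inode *inode = toy_iget(ino);

	if (IS_ERR(inode))
		return ERR_CAST(inode);	/* was: ERR_PTR(PTR_ERR(inode)) */
	return NULL;
}
```

    Both spellings yield the same pointer value; ERR_CAST only makes the
    intent (changing the pointer type) visible at the call site.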
     
  • We were only requesting renewal after our tickets expire; do so before
    that. Most of the low-level logic for this was already there; just use
    it.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • We only want to send pending mon requests when we successfully
    authenticate. If we are already authenticated, like when we renew our
    ticket, there is no need to resend pending requests.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • fs/ceph/auth.c: linux/slab.h is included more than once.
    fs/ceph/super.h: linux/slab.h is included more than once.

    Acked-by: Christoph Lameter
    Signed-off-by: Andrea Gelmini
    Signed-off-by: Sage Weil

    Andrea Gelmini
     
  • ac->ops may be null; use protocol id in error message instead.

    Reported-by: Dan Carpenter
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The underlying problem is that many mds requests can't be restarted. For
    example, a restarted create() would return -EEXIST if the original request
    succeeds. However, we do not want a hung MDS to hang the client too. So,
    use the _killable wait_for_completion variants to abort on SIGKILL but
    nothing else.

    Signed-off-by: Sage Weil

    Sage Weil
     

29 May, 2010

1 commit

  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (27 commits)
    ACPI: Don't let acpi_pad needlessly mark TSC unstable
    drivers/acpi/sleep.h: Checkpatch cleanup
    ACPI: Minor cleanup eliminating redundant PMTIMER_TICKS to NS conversion
    ACPI: delete unused c-state promotion/demotion data structures
    ACPI: video: fix acpi_backlight=video
    ACPI: EC: Use kmemdup
    drivers/acpi: use kasprintf
    ACPI, APEI, EINJ injection parameters support
    Add x64 support to debugfs
    ACPI, APEI, Use ERST for persistent storage of MCE
    ACPI, APEI, Error Record Serialization Table (ERST) support
    ACPI, APEI, Generic Hardware Error Source memory error support
    ACPI, APEI, UEFI Common Platform Error Record (CPER) header
    Unified UUID/GUID definition
    ACPI Hardware Error Device (PNP0C33) support
    ACPI, APEI, PCIE AER, use general HEST table parsing in AER firmware_first setup
    ACPI, APEI, Document for APEI
    ACPI, APEI, EINJ support
    ACPI, APEI, HEST table parsing
    ACPI, APEI, APEI supporting infrastructure
    ...

    Linus Torvalds
     

28 May, 2010

26 commits

  • gets minix get_dir_page() in sync with its analogs; back in 2007
    Nick has switched read_cache_page() and friends to sync behaviour
    (i.e. they wait for the page to get unlocked, check if it's uptodate
    and if it isn't return ERR_PTR(-EIO) instead) and removed the
    duplicate logic from the callers. In the case of fs/minix/dir.c he'd
    removed only half of that...

    Signed-off-by: Al Viro

    Al Viro
     
  • got broken on ->sync_fs() conversion a year ago, nobody noticed...

    Signed-off-by: Al Viro

    Al Viro
     
  • Cc: OGAWA Hirofumi
    Cc: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • I also have commented a possible bug in existing ext2 code, marked with XXX.

    Cc: linux-ext4@vger.kernel.org
    Cc: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate
    sequence.

    Cc: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Lots of filesystems call vmtruncate despite not implementing the old
    ->truncate method. Switch them to use simple_setsize and add some
    comments about the truncate code where it seems fitting.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
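
    The ordering argument above can be made concrete with a toy model. This is
    only a sketch under stated assumptions: struct toy_inode, toy_newsize_ok()
    and toy_setsize() are hypothetical userspace names, not kernel interfaces,
    and the "blocks" counter merely stands in for filesystem-specific state.

```c
#include <errno.h>

#define TOY_BLOCK_SIZE 4096LL

/* Hypothetical in-memory inode; none of these fields mirror real kernel ones. */
struct toy_inode {
	long long i_size;	/* current file size in bytes */
	long long max_size;	/* largest size this toy filesystem supports */
	long blocks;		/* blocks currently allocated */
	int readonly;		/* simulates a filesystem-specific refusal */
};

/* Rough analogue of a new-size sanity check done before anything changes. */
int toy_newsize_ok(const struct toy_inode *inode, long long newsize)
{
	if (newsize < 0)
		return -EINVAL;
	if (newsize > inode->max_size)
		return -EFBIG;
	return 0;
}

/*
 * Size-change part of a setattr-style method in the new calling sequence:
 * the filesystem runs first, so it can still fail the call and it still
 * knows the old size.
 */
int toy_setsize(struct toy_inode *inode, long long newsize)
{
	long long oldsize = inode->i_size;
	int err = toy_newsize_ok(inode, newsize);

	if (err)
		return err;
	if (inode->readonly)
		return -EROFS;		/* fail before i_size is touched */

	inode->i_size = newsize;	/* pagecache trimming would happen here */

	if (newsize < oldsize)		/* free blocks past the new end of file */
		inode->blocks = (newsize + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;
	return 0;
}
```

    A failed call leaves both i_size and the block count untouched, which the
    old setattr > vmtruncate > truncate ordering could not guarantee.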
     
  • Fix fs/super.c kernel-doc warning and function notation:
    Warning(fs/super.c:957): No description found for parameter 'sb'

    Signed-off-by: Randy Dunlap
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    Randy Dunlap
     
  • The MINIX filesystem driver used a constant number of indirect block
    pointers in an indirect block. This worked only for filesystems with 1kb
    block, while the MINIX default block size is now 4kb. As a consequence,
    large files were read incorrectly on such filesystems and writing a
    large file would cause the filesystem to become corrupted. This patch
    computes the number of indirect block pointers based on the block size,
    making the driver work for each block size.

    I would like to thank Feiran Zheng ('Fam') for pointing out the cause
    of the corruption.

    Signed-off-by: Erik van der Kouwe
    Signed-off-by: Al Viro

    Erik van der Kouwe
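
    The block-size dependence is simple arithmetic. The sketch below assumes
    32-bit on-disk block pointers; toy_block_t and both function names are
    hypothetical and do not come from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: each on-disk block pointer is 32 bits wide. */
typedef uint32_t toy_block_t;

/* Number of block pointers that fit in one indirect block. */
size_t indirect_pointers_per_block(size_t block_size)
{
	return block_size / sizeof(toy_block_t);
}

/* Bytes of file data reachable through one single-indirect block. */
uint64_t single_indirect_span(size_t block_size)
{
	return (uint64_t)indirect_pointers_per_block(block_size) * block_size;
}
```

    Under that assumption a 1 KB block holds 256 pointers while a 4 KB block
    holds 1024, so a constant that is right for one block size maps file
    offsets to the wrong pointers on the other.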
     
  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems currently is called
    simple_sync_file which doesn't make too much sense to start with,
    and the generic one for simple filesystems is called simple_fsync
    which can lead to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync to make it obvious
    what to expect. In addition add some documentation for both methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a mutex_unlock missing on the error path. At other exits from the
    function that return an error flag, the mutex is unlocked, so do the same
    here.

    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression E1;
    @@

    * mutex_lock(E1,...);

    * mutex_unlock(E1,...);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Al Viro

    Julia Lawall
     
  • __aio_put_req() plays sick games with file refcount. What
    it wants is fput() from atomic context; it's almost always
    done with f_count > 1, so they only have to deal with delayed
    work in rare cases when their reference happens to be the
    last one. Current code decrements f_count and if it hasn't
    hit 0, everything is fine. Otherwise it keeps a pointer
    to struct file (with zero f_count!) around and has delayed
    work do __fput() on it.

    Better way to do it: use atomic_long_add_unless( , -1, 1)
    instead of !atomic_long_dec_and_test(). IOW, decrement it
    only if it's not the last reference, leave refcount alone
    if it was. And use normal fput() in delayed work.

    I've made that atomic_long_add_unless call a new helper -
    fput_atomic(). Drops a reference to file if it's safe to
    do in atomic (i.e. if that's not the last one), tells if
    it had been able to do that. aio.c converted to it, __fput()
    use is gone. req->ki_file *always* contributes to refcount
    now. And __fput() became static.

    Signed-off-by: Al Viro

    Al Viro
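
    The "decrement unless it is the last reference" idea can be sketched in
    userspace with C11 atomics. toy_add_unless(), toy_fput_atomic() and struct
    toy_file are hypothetical names for this illustration; the kernel's own
    helpers are implemented differently.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Add `a` to *v unless *v currently equals `u`.
 * Returns true if the addition was performed.
 */
bool toy_add_unless(atomic_long *v, long a, long u)
{
	long cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Hypothetical stand-in for struct file, reduced to its reference count. */
struct toy_file {
	atomic_long f_count;
};

/*
 * Drop one reference only if it is not the last one. Returns false when the
 * caller holds the final reference and must defer the real release to a
 * context that is allowed to sleep.
 */
bool toy_fput_atomic(struct toy_file *file)
{
	return toy_add_unless(&file->f_count, -1, 1);
}
```

    With a count of 2 the call succeeds and leaves 1; with a count of 1 it
    refuses and leaves the count alone, so the object never sits around with a
    zero refcount.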
     
  • Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.

    In particular, before this patch, the command
    ls -l
    in an NFS mounted directory would always check if the directory on the server
    had changed and if so would flush and refill the pagecache for the dir.
    After this patch, the same "ls -l" will repeatedly return stale data until
    the cached attributes for the directory time out.

    The following patch fixes this by ensuring the d_revalidate is called by
    do_last when "." is being looked-up.
    link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
    is not set so nfs_lookup_verify_inode chooses not to do any validation.

    The following patch restores the original behaviour.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro

    Neil Brown
     
  • This reverts commit a7cf4145bb86aaf85d4d4d29a69b50b688e2e49d.

    Al Viro
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
    ext4: Make fsync sync new parent directories in no-journal mode
    ext4: Drop whitespace at end of lines
    ext4: Fix compat EXT4_IOC_ADD_GROUP
    ext4: Conditionally define compat ioctl numbers
    tracing: Convert more ext4 events to DEFINE_EVENT
    ext4: Add new tracepoints to track mballoc's buddy bitmap loads
    ext4: Add a missing trace hook
    ext4: restart ext4_ext_remove_space() after transaction restart
    ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
    ext4: Avoid crashing on NULL ptr dereference on a filesystem error
    ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
    ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
    ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
    ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
    ext4: Use our own write_cache_pages()
    ext4: Show journal_checksum option
    ext4: Fix for ext4_mb_collect_stats()
    ext4: check for a good block group before loading buddy pages
    ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
    ext4: Remove extraneous newlines in ext4_msg() calls
    ...

    Fixed up trivial conflict in fs/ext4/fsync.c

    Linus Torvalds
     
  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: Fix another nfs_wb_page() deadlock
    NFS: Ensure that we mark the inode as dirty if we exit early from commit
    NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker
    sunrpc: fix leak on error on socket xprt setup

    Linus Torvalds
     
  • The generic per-cpu counter has some memory overhead, but it is negligible
    for modern systems, and embedded systems compile without quota support. And
    code reuse is a good thing. This patch should fix the complaint from
    preemptible kernels which was introduced by dde9588853b1bde.

    [Jan Kara: Fixed patch to work on 32-bit archs as well]

    Reported-by: Rafael J. Wysocki
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara

    Dmitry Monakhov
     
  • Do not use the fallback default_llseek() if the readdir operation of the
    filesystem still uses the big kernel lock.

    Since llseek() modifies file->f_pos of the directory directly, it may need
    locking so as not to confuse readdir, which usually uses file->f_pos
    directly as well.

    Since the special characteristics of the BKL (unlocked on schedule) are
    not necessary in this case, the inode mutex can be used for locking as
    provided by generic_file_llseek(). This is only possible since all
    filesystems, except reiserfs, either use a directory as a flat file or
    with disk address offsets. Reiserfs on the other hand uses a 32bit hash
    of the filename as the offset, so generic_file_llseek() can be used as
    well since the hash is always smaller than sb->s_maxbytes (= (512 << 32) -
    blocksize).

    Signed-off-by: Jan Blunck
    Acked-by: Jan Kara
    Acked-by: Anders Larsen
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
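
    The bound quoted above is easy to check numerically. In this sketch the
    function names are hypothetical and the 4096-byte block size mentioned in
    the note below is an assumption chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* s_maxbytes as quoted in the commit message: (512 << 32) - blocksize. */
uint64_t toy_maxbytes(uint32_t blocksize)
{
	/* 64-bit constant: shifting a plain int by 32 would be undefined. */
	return (UINT64_C(512) << 32) - blocksize;
}

/* True if a 32-bit filename hash is a valid seek offset below s_maxbytes. */
bool hash_offset_fits(uint32_t hash, uint32_t blocksize)
{
	return (uint64_t)hash < toy_maxbytes(blocksize);
}
```

    Even the largest 32-bit hash, 0xffffffff, is far below 2^41 minus a
    4096-byte block, so a range check against s_maxbytes can never reject it.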
     
  • This is an implementation of ->llseek useable for the rare special case
    when userspace expects the seek to succeed but the (device) file is
    actually not able to perform the seek. In this case you use noop_llseek()
    instead of falling back to the default implementation of ->llseek.

    Signed-off-by: Jan Blunck
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
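
    A userspace sketch of the behaviour described above: the seek reports
    success but leaves the position where it was. struct toy_file and
    toy_noop_llseek() are hypothetical names for this example.

```c
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END for callers */

/* Hypothetical stand-in for struct file, reduced to its position. */
struct toy_file {
	int64_t f_pos;
};

/*
 * Accept any offset/whence combination, change nothing, and report the
 * current position so the caller sees a successful seek.
 */
int64_t toy_noop_llseek(struct toy_file *file, int64_t offset, int whence)
{
	(void)offset;
	(void)whence;
	return file->f_pos;
}
```

    A caller that seeks to the end of such a file simply gets the current
    position back rather than an error such as -ESPIPE.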
     
  • The aio compat code was not converting the struct iovecs from 32bit to
    64bit pointers, causing either EINVAL to be returned from io_getevents, or
    EFAULT as the result of the I/O. This patch passes a compat flag to
    io_submit to signal that pointer conversion is necessary for a given iocb
    array.

    A variant of this was tested by Michael Tokarev. I have also updated the
    libaio test harness to exercise this code path with good success.
    Further, I grabbed a copy of ltp and ran the
    testcases/kernel/syscall/readv and writev tests there (compiled with -m32
    on my 64bit system). All seems happy, but extra eyes on this would be
    welcome.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
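
    The pointer conversion in question amounts to widening each 32-bit iovec
    entry into the native layout. The following is a userspace sketch only:
    struct toy_compat_iovec and toy_widen_iovec() are hypothetical names, and
    the real kernel code additionally copies from and validates user memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>	/* struct iovec */

/* Layout a 32-bit process uses for one iovec: both fields are 32 bits. */
struct toy_compat_iovec {
	uint32_t iov_base;	/* 32-bit user pointer */
	uint32_t iov_len;	/* 32-bit length */
};

/*
 * Widen n compat entries into native iovecs.
 * Returns 0 on success, -1 if the destination array is too small.
 */
int toy_widen_iovec(const struct toy_compat_iovec *src, size_t n,
		    struct iovec *dst, size_t dst_cap)
{
	size_t i;

	if (n > dst_cap)
		return -1;
	for (i = 0; i < n; i++) {
		dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
		dst[i].iov_len = src[i].iov_len;
	}
	return 0;
}
```

    Without this step a 64-bit kernel would read an array of 8-byte compat
    entries as if each were a full native struct iovec, picking up garbage
    addresses and lengths.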
     
  • It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and
    writev AIO operations were not functioning properly. It turns out that
    the code to convert the 32bit io vectors to 64 bits was never written.
    The results of that can be pretty bad, but in my testing, it mostly ended
    up in generating EFAULT as we walked off the list of I/O vectors provided.

    This patch set fixes the problem in my environment. Comments are greatly
    appreciated.

    This patch:

    Factor out code that will be used by both compat_do_readv_writev and the
    compat aio submission code paths.

    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
    clear what is the purpose of the operation, which otherwise looks like a
    no-op.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Extend KCORE_TEXT to cover the pages between _text and _stext, to allow
    examining some important page table pages.

    `readelf -a` output on x86_64 before and after patch:
            Type  Offset             VirtAddr           PhysAddr
    before  LOAD  0x00007fff8100c000 0xffffffff81009000 0x0000000000000000
    after   LOAD  0x00007fff81003000 0xffffffff81000000 0x0000000000000000

    The newly covered pages are:

    0xffffffff81000000 etc.
    0xffffffff81001000
    0xffffffff81002000
    0xffffffff81003000
    0xffffffff81004000
    0xffffffff81005000
    0xffffffff81006000
    0xffffffff81007000
    0xffffffff81008000

    Before patch, /proc/kcore shows outdated contents for the above page
    table pages, for example:

    (gdb) p level3_ident_pgt
    $1 = {} 0xffffffff81002000
    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    $2 = {{pud = 0x1006063}, {pud = 0x0} }

    while the real content is:

    root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem
    1002000 6063 0100 0000 0000 8067 0000 0000 0000
    1002010 0000 0000 0000 0000 0000 0000 0000 0000
    *
    1003000

    That is, on an x86_64 box with 2GB memory, we can see first-1GB / full-2GB
    identity mapping before/after patch:

    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    before $1 = {{pud = 0x1006063}, {pud = 0x0} }
    after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} }

    Obviously the content before patch is wrong.

    Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
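
    To relate the hexdump output to the pud values printed by gdb, the 16-bit
    words can be reassembled into 64-bit little-endian entries. This is a
    sketch with hypothetical helper names; the address mask assumes the usual
    x86_64 layout with the physical address in bits 12 to 51.

```c
#include <stdint.h>

/* Reassemble one 64-bit little-endian value from four 16-bit dump words. */
uint64_t words_to_u64(const uint16_t w[4])
{
	return (uint64_t)w[0] |
	       (uint64_t)w[1] << 16 |
	       (uint64_t)w[2] << 32 |
	       (uint64_t)w[3] << 48;
}

#define ENTRY_PRESENT	0x001ULL		/* bit 0: entry is valid */
#define ENTRY_ADDR_MASK	0x000FFFFFFFFFF000ULL	/* assumed: bits 12..51 */

/* Physical address of the next-level table the entry points to. */
uint64_t entry_phys_addr(uint64_t entry)
{
	return entry & ENTRY_ADDR_MASK;
}

/* Non-zero if the present bit is set. */
int entry_present(uint64_t entry)
{
	return (entry & ENTRY_PRESENT) != 0;
}
```

    The words "6063 0100 0000 0000" reassemble to 0x1006063 and
    "8067 0000 0000 0000" to 0x8067, both with the present bit set, matching
    the two non-zero entries shown after the patch.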
     
  • A quick test shows these comments are obsolete, so just remove them.

    Signed-off-by: WANG Cong
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang