11 Sep, 2010

1 commit


10 Sep, 2010

12 commits

  • The workqueue implementation in 2.6.36-rcX has changed, resulting
    in the workqueues no longer having dedicated threads for work
    processing. This has caused severe livelocks under heavy parallel
    create workloads because the log IO completions have been getting
    held up behind metadata IO completions. Hence log commits would
    stall, memory allocation would stall because pages could not be
    cleaned, and lock contention on the AIL during inode IO completion
    processing was being seen to slow everything down even further.

    By making the log IO completion workqueue a high priority workqueue,
    log IO completions are queued ahead of all data/metadata IO
    completions and processed before them. Hence the log never
    gets stalled, and operations needed to clean memory can continue as
    quickly as possible. This avoids the livelock conditions and allows
    the system to keep running under heavy load as per normal.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • An execve with a very large total of argument/environment strings
    can take a really long time in the execve system call. It runs
    uninterruptibly to count and copy all the strings. This change
    makes it abort the exec quickly if sent a SIGKILL.

    Note that this is the conservative change, to interrupt only for
    SIGKILL, by using fatal_signal_pending(). It would be perfectly
    correct semantics to let any signal interrupt the string-copying in
    execve, i.e. use signal_pending() instead of fatal_signal_pending().
    We'll save that change for later, since it could have user-visible
    consequences, such as having a timer set too quickly make it so that
    an execve can never complete, though it always happened to work before.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
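
The idea of polling for a fatal signal inside a long copy loop can be sketched in userspace. This is only an illustration under stated assumptions: `copy_strings_sketch`, `fatal_pending_fn` and the `countdown` helper are hypothetical names, the callback stands in for the kernel's fatal_signal_pending(), and `-EINTR` is a stand-in for whatever error the kernel code actually returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's fatal_signal_pending(current). */
typedef int (*fatal_pending_fn)(void *ctx);

/*
 * Helper for demonstration: reports "fatal signal pending" once it has
 * been polled `remaining` times; a negative value never fires.
 */
struct countdown { int remaining; };

int countdown_pending(void *ctx)
{
    struct countdown *c = ctx;

    if (c->remaining < 0)
        return 0;
    if (c->remaining == 0)
        return 1;
    c->remaining--;
    return 0;
}

/*
 * Copy `count` NUL-terminated strings back to back into dst (capacity
 * `cap` bytes), polling for a fatal signal before each string.
 * Returns the number of bytes copied, -EINTR if interrupted, or
 * -E2BIG if the buffer is too small.
 */
long copy_strings_sketch(char *dst, size_t cap, const char *const *src,
                         size_t count, fatal_pending_fn pending, void *ctx)
{
    size_t used = 0;

    assert(dst != NULL && src != NULL && pending != NULL);
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(src[i]) + 1;   /* include the NUL */

        if (pending(ctx))
            return -EINTR;                 /* abort quickly */
        if (len > cap - used)
            return -E2BIG;
        memcpy(dst + used, src[i], len);
        used += len;
    }
    return (long)used;
}
```

Polling once per string keeps the check cheap while bounding how long a killed process keeps copying.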
     
  • This adds a preemption point during the copying of the argument and
    environment strings for execve, in copy_strings(). There is already
    a preemption point in the count() loop, so this doesn't add any new
    points in the abstract sense.

    When the total argument+environment strings are very large, the time
    spent copying them can be much more than a normal user time slice.
    So this change improves the interactivity of the rest of the system
    when one process is doing an execve with very large arguments.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
    check the size of the argument/environment area on the stack.
    When it is unworkably large, shift_arg_pages() hits its BUG_ON.
    This is exploitable with a very large RLIMIT_STACK limit, to
    create a crash pretty easily.

    Check that the initial stack is not too large to make it possible
    to map in any executable. We're not checking that the actual
    executable (or interpreter, for binfmt_elf) will fit. So those
    mappings might clobber part of the initial stack mapping. But
    that is just userland lossage that userland made happen, not a
    kernel problem.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Range check cpu in blk_cpu_to_group
    scatterlist: prevent invalid free when alloc fails
    writeback: Fix lost wake-up shutting down writeback thread
    writeback: do not lose wakeup events when forking bdi threads
    cciss: fix reporting of max queue depth since init
    block: switch s390 tape_block and mg_disk to elevator_change()
    block: add function call to switch the IO scheduler from a driver
    fs/bio-integrity.c: return -ENOMEM on kmalloc failure
    bio-integrity.c: remove dependency on __GFP_NOFAIL
    BLOCK: fix bio.bi_rw handling
    block: put dev->kobj in blk_register_queue fail path
    cciss: handle allocation failure
    cfq-iosched: Documentation help for new tunables
    cfq-iosched: blktrace print per slice sector stats
    cfq-iosched: Implement tunable group_idle
    cfq-iosched: Do group share accounting in IOPS when slice_idle=0
    cfq-iosched: Do not idle if slice_idle=0
    cciss: disable doorbell reset on reset_devices
    blkio: Fix return code for mkdir calls

    Linus Torvalds
     
  • The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
    bytes of uninitialized stack memory, because the fsxattr struct
    declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
    the 12-byte fsx_pad member before copying it back to the user. This
    patch takes care of it.

    Signed-off-by: Dan Rosenberg
    Reviewed-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Dan Rosenberg
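
The class of bug described here (padding in a stack struct copied out without being initialized) and the usual remedy can be sketched as follows. The struct and function names are hypothetical and only loosely modelled on a reply struct with a 12-byte pad; this is not the XFS code itself, and zeroing the whole struct is one common way to address it rather than necessarily what the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply struct with a trailing pad member. */
struct reply_sketch {
    uint32_t flags;
    uint32_t extsize;
    unsigned char pad[12];
};

/*
 * Fill a reply that will later be copied to an untrusted consumer.
 * Zeroing the whole struct first guarantees that the pad (and any
 * compiler-inserted padding) carries no stale stack contents.
 */
void fill_reply(struct reply_sketch *r, uint32_t flags, uint32_t extsize)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
    r->flags = flags;
    r->extsize = extsize;
}
```

Zeroing up front is preferred over clearing individual members because it stays correct when fields are added later.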
     
  • Commit 9eed1fb721c ("minix: replace inode uid,gid,mode init with helper")
    broke directory creation on minix filesystems.

    Fix it by passing the needed mode flag to inode init helper.

    Signed-off-by: Jorge Boncompte [DTI2]
    Cc: Dmitry Monakhov
    Cc: Al Viro
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jorge Boncompte [DTI2]
     
  • O_NONBLOCK on parisc has a dual value:

    #define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */

    It is caught by the O_* bits uniqueness check and leads to a parisc
    compile error. The fix would be to take O_NONBLOCK out.

    Signed-off-by: Wu Fengguang
    Signed-off-by: James Bottomley
    Cc: Jamie Lokier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Bottomley
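
Why a two-bit O_NONBLOCK trips a uniqueness check can be shown with a small run-time sketch. It assumes the check compares the number of flags against the number of bits set in their union; the function names are hypothetical, and the kernel's own check may be done at build time rather than like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static int popcount32(uint32_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

/*
 * Returns 1 when every flag contributes exactly one bit and no two
 * flags share a bit, i.e. the union has exactly n bits set.
 */
int flags_are_distinct_single_bits(const uint32_t *flags, size_t n)
{
    uint32_t all = 0;

    assert(flags != NULL);
    for (size_t i = 0; i < n; i++)
        all |= flags[i];
    return (size_t)popcount32(all) == n;
}
```

With the parisc value 000200004 in the list, the union gains two bits for one flag, so the counts no longer match.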
     
  • Commit 74641f584da ("alpha: binfmt_aout fix") (May 2009) introduced a
    regression - binfmt_misc is now consulted after binfmt_elf, which will
    unfortunately break ia32el. ia32 ELF binaries on ia64 used to be matched
    using binfmt_misc and executed using wrapper. As 32bit binaries are now
    matched by binfmt_elf before binfmt_misc kicks in, the wrapper is ignored.

    The fix increases precedence of binfmt_misc to the original state.

    Signed-off-by: Jan Sembera
    Cc: Ivan Kokshaysky
    Cc: Al Viro
    Cc: Richard Henderson [2.6.everything.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Sembera
     
  • Fix the left-over old ifdef for PG_uncached in /proc/kpageflags. Now it's
    used by x86, too.

    Signed-off-by: Takashi Iwai
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Iwai
     
  • commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests)
    introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
    to the block layer. The problem is that the code expected
    dio->block_in_file to correspond to the current page in the dio. In fact,
    it corresponds to the previous page submitted via submit_page_section.
    This was purely an oversight, as the dio->cur_page_fs_offset field was
    introduced for just this purpose. This patch simply uses the correct
    variable when calculating whether there is a mismatch between contiguous
    logical blocks and contiguous physical blocks (as described in the
    comments).

    I also switched the if conditional following this check to an else if, to
    ensure that we never call dio_bio_submit twice for the same dio (in
    theory, this should not happen, anyway).

    I've tested this by running blktrace and verifying that a 64KB I/O was
    submitted as a single I/O. I also ran the patched kernel through
    xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
    and verified that there were no regressions as compared to an unpatched
    kernel.

    Signed-off-by: Jeff Moyer
    Acked-by: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • So it can be used by all that need to check for that.

    Signed-off-by: Stefan Bader
    Signed-off-by: Linus Torvalds

    Stefan Bader
     

09 Sep, 2010

2 commits

  • * 'fixes' of git://oss.oracle.com/git/tma/linux-2.6:
    ocfs2: Fix orphan add in ocfs2_create_inode_in_orphan
    ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functions
    ocfs2: allow return of new inode block location before allocation of the inode
    ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open coding
    ocfs2: split out inode alloc code from ocfs2_mknod_locked
    Ocfs2: Fix a regression bug from mainline commit(6b933c8e6f1a2f3118082c455eef25f9b1ac7b45).
    ocfs2: Fix deadlock when allocating page
    ocfs2: properly set and use inode group alloc hint
    ocfs2: Use the right group in nfs sync check.
    ocfs2: Flush drive's caches on fdatasync
    ocfs2: make __ocfs2_page_mkwrite handle file end properly.
    ocfs2: Fix incorrect checksum validation error
    ocfs2: Fix metaecc error messages

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: fix lock annotations
    fuse: flush background queue on connection close

    Linus Torvalds
     

08 Sep, 2010

19 commits

  • ocfs2_create_inode_in_orphan() is used by reflink to create the newly
    reflinked inode simultaneously in the orphan dir. This allows us to easily
    handle partially-reflinked files during recovery cleanup.

    We have a problem though - the orphan dir stringifies inode # to determine
    a unique name under which the orphan entry dirent can be created. Since
    ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
    before it can allocate the inode, we currently call into the orphan code:

    /*
    * We give the orphan dir the root blkno to fake an orphan name,
    * and allocate enough space for our insertion.
    */
    status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
    osb->root_blkno,
    orphan_name, &orphan_insert);

    Using osb->root_blkno might work fine on unindexed directories, but the
    orphan dir can have an index. When it has that index, the above code fails
    to allocate the proper index entry. Later, when we try to remove the file
    from the orphan dir (using the actual inode #), the reflink operation will
    fail.

    To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
    newly split out orphan and inode alloc code to figure out what the inode
    block number will be (once allocated) and then prepare the orphan dir from
    that data.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
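
The name mismatch described above can be illustrated with a small sketch. It assumes the orphan name is the block number rendered as 16 zero-padded hex digits; that format, the constant, and the function name are assumptions made for illustration, not taken from this log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_ORPHAN_NAMELEN 16   /* assumed name length */

/*
 * Render a block number as a fixed-width orphan directory name.
 * Returns 0 on success, -1 if the formatted length is unexpected.
 */
int orphan_name_sketch(unsigned long long blkno,
                       char name[SK_ORPHAN_NAMELEN + 1])
{
    int n;

    assert(name != NULL);
    n = snprintf(name, SK_ORPHAN_NAMELEN + 1, "%016llx", blkno);
    return n == SK_ORPHAN_NAMELEN ? 0 : -1;
}
```

Because the name is a pure function of the block number, an entry prepared from a placeholder number cannot be found later by looking it up under the real inode number.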
     
  • We do this because ocfs2_create_inode_in_orphan() wants to order locking of
    the orphan dir with respect to locking of the inode allocator *before*
    making any changes to the directory.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
     
  • This allows code which needs to know the eventual block number of an inode
    but can't allocate it yet due to transaction or lock ordering. For example,
    ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
    of the orphan dir because it can't yet know where the actual inode is placed
    - that code is actually in ocfs2_mknod_locked. This is a problem when the
    orphan dirs are indexed as the junk inode number will create an index entry
    which goes unused (and fails the later removal from the orphan dir). Now
    with these interfaces, ocfs2_create_inode_in_orphan() can run the block
    group search (and get back the inode block number) *before* any actual
    allocation occurs.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
     
  • ocfs2_search_chain() makes the same updates as
    ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding
    the bitmap update, use our helper function.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
     
  • Do this by splitting the bulk of the function away from the inode allocation
    code at the very top of ocfs2_mknod_locked(). Existing callers don't need to
    change and won't see any difference. The new function created,
    __ocfs2_mknod_locked() will be used shortly.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
     
  • This patch fixes a regression introduced by commit 6b933c8... ('ocfs2:
    Avoid direct write if we fall back to buffered I/O'):

    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285

    Commit 6b933c8e6f1a2f3118082c455eef25f9b1ac7b45 changed __generic_file_aio_write
    to generic_file_buffered_write, which does not call filemap_{write,wait}_range to flush
    the page cache when an O_DIRECT write falls back to a buffered one. That hurt
    O_DIRECT semantics for extended O_DIRECT writes.

    This patch tries to guarantee that O_DIRECT writes which fall back to buffered I/O
    are correctly flushed.

    Signed-off-by: Tristan Ye
    Signed-off-by: Tao Ma

    Tristan Ye
     
  • We cannot call grab_cache_page() when holding filesystem locks or with
    a transaction started as grab_cache_page() calls page allocation with
    GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
    causing deadlocks or various assertion failures. We have to use
    find_or_create_page() instead and pass it GFP_NOFS as we do with other
    allocations.

    Acked-by: Mark Fasheh
    Signed-off-by: Jan Kara
    Signed-off-by: Tao Ma

    Jan Kara
     
  • We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from
    res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under
    normal (non-fragmented) circumstances. The discontig block group patches
    effectively turned off that feature. Fix this by correctly calculating what
    the next group hint should be.

    Acked-by: Tao Ma
    Signed-off-by: Mark Fasheh
    Tested-by: Goldwyn Rodrigues
    Signed-off-by: Tao Ma

    Mark Fasheh
     
  • Now that discontig block groups have been added, an inode
    can be allocated in a discontig block group. So get
    the group in ocfs2_get_suballoc_slot_bit.

    The old ocfs2_test_suballoc_bit gets the group block number
    from the allocation inode, which is wrong. Fix it by
    passing the right group.

    Acked-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Tao Ma
     
  • When 'barrier' mount option is specified, we have to issue a cache flush
    during fdatasync(2). We have to do this even if inode doesn't have
    I_DIRTY_DATASYNC set because we still have to get written *data* to disk so
    that they are not lost in case of crash.

    Acked-by: Tao Ma
    Signed-off-by: Jan Kara
    Signed-off-by: Tao Ma

    Jan Kara
     
  • __ocfs2_page_mkwrite is currently broken in its handling of the file end:
    1. The last page should be the page that contains i_size - 1.
    2. The len in the last page is also calculated incorrectly.
    So change them accordingly.

    Acked-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Tao Ma
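
The arithmetic behind "the page that contains i_size - 1" can be sketched as below. The 4 KiB page size and the names are assumptions for illustration; the kernel uses its own page-size macros.

```c
#include <assert.h>

#define SK_PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)

/* Index of the page holding the last byte of a non-empty file. */
unsigned long long last_page_index(unsigned long long i_size)
{
    assert(i_size > 0);
    return (i_size - 1) >> SK_PAGE_SHIFT;
}

/* Number of valid bytes in that last page (1..SK_PAGE_SIZE). */
unsigned long long last_page_len(unsigned long long i_size)
{
    assert(i_size > 0);
    return ((i_size - 1) & (SK_PAGE_SIZE - 1)) + 1;
}
```

Using i_size rather than i_size - 1 goes wrong exactly when the file size is a multiple of the page size: for a 4096-byte file, `i_size >> SK_PAGE_SHIFT` names page 1, which holds no data.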
     
  • For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to
    read the inode off the disk. The latter first checks to see if that block is
    cached in the journal, and, if so, returns that block. That is ok.

    But ocfs2_read_locked_inode() goes wrong when it tries to validate the checksum
    of such blocks. Blocks that are cached in the journal may not have had their
    checksum computed as yet. We should not validate the checksums of such blocks.

    Fixes ossbz#1282
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282

    Signed-off-by: Sunil Mushran
    Cc: stable@kernel.org
    Signed-off-by: Tao Ma

    Sunil Mushran
     
  • Like tools, the checksum validate function now prints the values in hex.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Tao Ma

    Sunil Mushran
     
  • * 'for-2.6.36' of git://linux-nfs.org/~bfields/linux:
    nfsd4: mask out non-access bits in nfs4_access_to_omode

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: Make fiemap work with sparse files
    xfs: prevent 32bit overflow in space reservation
    xfs: Disallow 32bit project quota id
    xfs: improve buffer cache hash scalability

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9p: potential ERR_PTR() dereference

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    sysfs: checking for NULL instead of ERR_PTR

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
    nilfs2: fix leak of shadow dat inode in error path of load_nilfs

    Linus Torvalds
     
  • Sanity check the flags passed to change_mnt_propagation(). Exactly
    one flag should be set. Return EINVAL otherwise.

    Userspace can pass in arbitrary combinations of MS_* flags to mount().
    do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
    or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then
    calls change_mnt_propagation() with the rest of the user-supplied
    flags. change_mnt_propagation() clearly assumes only one flag is set
    but do_change_type() does not check that this is true. For example,
    mount() with flags MS_SHARED | MS_RDONLY does not actually make the
    mount shared or read-only but does clear MNT_UNBINDABLE.

    Signed-off-by: Valerie Aurora
    Signed-off-by: Linus Torvalds

    Valerie Aurora
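
An "exactly one propagation flag" check can be sketched in userspace as follows. The `SK_MS_*` constants are local stand-ins whose values are assumed to mirror the Linux MS_* flags, and `propagation_type_sketch` is a hypothetical name; a caller would translate a return value of 0 into EINVAL.

```c
#include <assert.h>

/* Local stand-ins for MS_* mount flags (values are assumptions). */
#define SK_MS_RDONLY     (1u << 0)
#define SK_MS_REC        (1u << 14)
#define SK_MS_UNBINDABLE (1u << 17)
#define SK_MS_PRIVATE    (1u << 18)
#define SK_MS_SLAVE      (1u << 19)
#define SK_MS_SHARED     (1u << 20)
#define SK_PROPAGATION   (SK_MS_SHARED | SK_MS_PRIVATE | \
                          SK_MS_SLAVE | SK_MS_UNBINDABLE)

/*
 * Returns the single propagation flag contained in `flags`
 * (ignoring SK_MS_REC), or 0 if the combination is invalid.
 */
unsigned int propagation_type_sketch(unsigned int flags)
{
    unsigned int type = flags & ~SK_MS_REC;

    /* Reject any flag that is not a propagation flag. */
    if (type & ~SK_PROPAGATION)
        return 0;
    /* Require exactly one bit: non-zero and a power of two. */
    if (type == 0 || (type & (type - 1)) != 0)
        return 0;
    assert(type & SK_PROPAGATION);
    return type;
}
```

The `type & (type - 1)` test clears the lowest set bit, so it is zero only when at most one bit was set.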
     

07 Sep, 2010

2 commits

  • Sparse doesn't understand lock annotations of the form
    __releases(&foo->lock). Change them to __releases(foo->lock). Same
    for __acquires().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • David Bartley reported that fuse can hang in fuse_get_req_nofail() when
    the connection to the filesystem server is no longer active.

    If bg_queue is not empty then flush_bg_queue() called from
    request_end() can put more requests on to the pending queue. If this
    happens while ending requests on the processing queue then those
    background requests will be queued to the pending list and never
    ended.

    Another problem is that fuse_dev_release() didn't wake up processes
    sleeping on blocked_waitq.

    Solve this by:

    a) flushing the background queue before calling end_requests() on the
    pending and processing queues

    b) setting blocked = 0 and waking up processes waiting on
    blocked_waitq

    Thanks to David for an excellent bug report.

    Reported-by: David Bartley
    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org

    Miklos Szeredi
     

04 Sep, 2010

1 commit


03 Sep, 2010

3 commits

  • Alex Elder
     
  • In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want
    to return fi_extent_max extents, but this does not work for
    a sparse file. The reason is that xfs_getbmap also records
    holes in 'out', while 'out' is allocated with
    bmv_count (fi_extent_max + 1) entries, which does not account
    for holes. So in the worst case, if the 'out' vector looks like
    [hole, extent, hole, extent, hole, ... hole, extent, hole],
    we will only return half of fi_extent_max extents.

    This patch adds a new flag, BMV_IF_NO_HOLES, for bmv_iflags.
    With this flag set, we don't use an 'out' entry in xfs_getbmap
    for a hole. The solution is a bit ugly: it simply does not
    increment the index of the 'out' vector. I felt that it is not
    easy to skip holes at the very beginning since we have the
    complicated check and some functions like
    xfs_getbmapx_fix_eof_hole to adjust 'out'.

    Cc: Dave Chinner
    Signed-off-by: Tao Ma
    Signed-off-by: Alex Elder

    Tao Ma
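
The "don't advance the index for a hole" approach can be sketched in isolation. Everything here is hypothetical scaffolding: `extent_sketch`, `collect_extents`, and the convention that a start block of -1 marks a hole are assumptions for illustration, not the XFS data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping record; start_block == -1 marks a hole. */
struct extent_sketch {
    long long start_block;
    unsigned long long length;
};

/*
 * Copy records from map[] into out[], at most max_out of them.
 * With no_holes set, a hole is written into the current slot but
 * the index is not advanced, so the next record overwrites it.
 * Returns the number of slots that count as filled.
 */
size_t collect_extents(const struct extent_sketch *map, size_t nmap,
                       struct extent_sketch *out, size_t max_out,
                       int no_holes)
{
    size_t cur = 0;

    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < nmap && cur < max_out; i++) {
        out[cur] = map[i];
        if (no_holes && map[i].start_block == -1)
            continue;   /* slot is reused by the next record */
        cur++;
    }
    return cur;
}
```

For an alternating hole/extent map and two output slots, the default mode returns one hole and one extent, while the no-holes mode returns two real extents.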
     
  • If we attempt to preallocate more than 2^32 blocks of space in a
    single syscall, the transaction block reservation will overflow
    leading to hangs in the superblock block accounting code. This
    is trivially reproduced with xfs_io. Fix the problem by capping the
    allocation reservation to the maximum number of blocks a single
    xfs_bmapi() call can allocate (2^21 blocks).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
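
The overflow and the capping fix can be illustrated with plain integer arithmetic. This sketch only shows a 64-bit block count being narrowed to 32 bits with and without a cap of 2^21; the function names are hypothetical and the real reservation calculation involves more than this.

```c
#include <assert.h>
#include <stdint.h>

/* Cap taken from the text above: 2^21 blocks per allocation call. */
#define SK_MAX_ALLOC_BLOCKS (1ULL << 21)

/* Buggy pattern: silently drops the upper 32 bits of the request. */
uint32_t reserve_blocks_truncating(uint64_t wanted)
{
    return (uint32_t)wanted;
}

/* Fixed pattern: clamp first, so the narrowing cast cannot wrap. */
uint32_t reserve_blocks_capped(uint64_t wanted)
{
    uint64_t capped = wanted < SK_MAX_ALLOC_BLOCKS
                          ? wanted : SK_MAX_ALLOC_BLOCKS;

    assert(capped <= UINT32_MAX);
    return (uint32_t)capped;
}
```

A request of exactly 2^32 blocks truncates to a reservation of 0 in the first version, while the capped version reserves 2^21 blocks per call and leaves larger requests to be satisfied over several calls.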