19 Aug, 2009

2 commits

  • The commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
    the mm_struct. It was a very good first step toward sanitizing OOM.

    However, Paul Menage reported that the commit causes a regression in his
    job scheduler: the current OOM logic can kill an OOM_DISABLE process.

    Why? His program has code similar to the following.

    ...
    set_oom_adj(OOM_DISABLE); /* The job scheduler is never killed by oom */
    ...
    if (vfork() == 0) {
        set_oom_adj(0); /* Invoked child can be killed */
        execve("foo-bar-cmd");
    }
    ...

    The vfork() parent and child share the same mm_struct, so the above
    set_oom_adj(0) changes oom_adj not only for the vfork() child but also
    for the vfork() parent. The vfork() parent (the job scheduler) then
    lost its OOM immunity and was killed.
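
    The sharing problem described above can be modelled in plain userspace C.
    This is only a sketch: struct mm_sim, struct task_sim and the helper
    functions are hypothetical stand-ins, not kernel code. OOM_DISABLE is
    assumed to be -17, as in kernels of this era.

```c
#define OOM_DISABLE (-17)   /* assumed value, as in kernels of this era */

/* Hypothetical, simplified stand-ins for mm_struct and task_struct. */
struct mm_sim   { int oom_adj; };
struct task_sim { struct mm_sim *mm; int oom_adj; };

/* With commit 2ff05b2b: the value is stored in the (possibly shared) mm. */
static void set_oom_adj_per_mm(struct task_sim *t, int val)
{
    t->mm->oom_adj = val;
}

/* After the revert: the value is stored in the task itself. */
static void set_oom_adj_per_task(struct task_sim *t, int val)
{
    t->oom_adj = val;
}

/* Parent's value after the vfork() child resets its own, per-mm storage. */
static int parent_adj_per_mm(void)
{
    struct mm_sim mm = { 0 };
    struct task_sim parent = { &mm, 0 };
    struct task_sim child  = { &mm, 0 };  /* vfork(): child borrows parent's mm */

    set_oom_adj_per_mm(&parent, OOM_DISABLE);
    set_oom_adj_per_mm(&child, 0);        /* also overwrites the parent's value */
    return parent.mm->oom_adj;
}

/* Same sequence with per-task storage: the parent keeps its protection. */
static int parent_adj_per_task(void)
{
    struct mm_sim mm = { 0 };
    struct task_sim parent = { &mm, 0 };
    struct task_sim child  = { &mm, 0 };

    set_oom_adj_per_task(&parent, OOM_DISABLE);
    set_oom_adj_per_task(&child, 0);      /* touches only the child */
    return parent.oom_adj;
}
```

    In this model parent_adj_per_mm() yields 0 (protection lost) while
    parent_adj_per_task() still yields OOM_DISABLE.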

    The fork-setting-exec idiom is very frequently used in userland programs.
    We must not break this assumption.

    Therefore, this patch reverts commit 2ff05b2b and the related commits.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_sb_pseudo sets s_maxbytes to ~0ULL which becomes negative when cast
    to a signed value. Fix it to use MAX_LFS_FILESIZE which casts properly
    to a positive signed value.
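
    The sign problem can be reproduced in a few lines of userspace C. This is
    a sketch under assumptions: loff_sim_t stands in for the kernel's signed
    64-bit loff_t, and MAX_LFS_FILESIZE_SIM assumes the 64-bit definition of
    MAX_LFS_FILESIZE (the largest positive signed 64-bit value).

```c
/* Hypothetical stand-in for the kernel's loff_t (signed 64-bit). */
typedef long long loff_sim_t;

/* Assumed 64-bit value of MAX_LFS_FILESIZE. */
#define MAX_LFS_FILESIZE_SIM 0x7fffffffffffffffLL

/* Old behaviour: s_maxbytes = ~0ULL, later read through a signed type. */
static int old_limit_is_negative(void)
{
    unsigned long long s_maxbytes = ~0ULL;
    /* The conversion is implementation-defined in C; on the usual
     * two's-complement targets it produces -1. */
    return (loff_sim_t)s_maxbytes < 0;
}

/* New behaviour: the limit stays positive when viewed as signed. */
static int new_limit_is_positive(void)
{
    unsigned long long s_maxbytes = MAX_LFS_FILESIZE_SIM;
    return (loff_sim_t)s_maxbytes > 0;
}
```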

    Signed-off-by: Jeff Layton
    Reviewed-by: Johannes Weiner
    Acked-by: Steve French
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Robert Love
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

18 Aug, 2009

4 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix locking in xfs_iget_cache_hit

    Linus Torvalds
     
  • The inotify_add_watch man page specifies that inotify_add_watch() will
    return a non-negative integer. However, historically the inotify
    watches started at 1, not at 0.

    Turns out that the inotifywait program provided by the inotify-tools
    package doesn't properly handle a 0 watch descriptor. In 7e790dd5 we
    changed from starting at 1 to starting at 0. This patch starts at 1,
    just like in previous kernels, but also just like in previous kernels
    it's possible for it to wrap back to 0. This preserves the kernel
    functionality exactly as it was before the patch (neither method broke
    the spec).
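
    On the userspace side, a caller that follows the man page treats every
    non-negative return value as a valid watch descriptor. A minimal sketch
    (wd_is_valid() and can_watch() are hypothetical helper names; only
    inotify_init(), inotify_add_watch() and close() are real calls):

```c
#include <sys/inotify.h>
#include <unistd.h>

/* Per the man page, any non-negative value is a valid watch descriptor. */
static int wd_is_valid(int wd)
{
    return wd >= 0;   /* not "wd > 0": 0 is permitted by the spec */
}

/* Hypothetical helper: returns 1 if a watch on `path` can be added. */
static int can_watch(const char *path)
{
    int fd = inotify_init();
    if (fd < 0)
        return 0;

    int wd = inotify_add_watch(fd, path, IN_MODIFY);
    int ok = wd_is_valid(wd);

    close(fd);        /* closing the inotify fd also drops its watches */
    return ok;
}
```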

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • In f44aebcc the tail drop logic of events with no file backing
    (q_overflow and in_ignored) was reversed so IN_IGNORED events would
    never be tail dropped. This now means that Q_OVERFLOW events are NOT
    tail dropped. The fix is to not tail drop IN_IGNORED, but to tail drop
    Q_OVERFLOW.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • inotify decides if private data it passed to get added to an event was
    used by checking list_empty(). But it's possible that the event may
    have been dequeued and the private event removed so it would look empty.

    The fix is to use the return code from fsnotify_add_notify_event rather
    than looking at the list.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     

17 Aug, 2009

1 commit

  • The locking in xfs_iget_cache_hit currently has numerous problems:

    - we clear the reclaim tag without i_flags_lock which protects
    modifications to it
    - we call inode_init_always which can sleep with pag_ici_lock
    held (this is oss.sgi.com BZ #819)
    - we acquire and drop i_flags_lock a lot and thus provide no
    consistency between the various flags we set/clear under it

    This patch fixes all that with a major revamp of the locking in
    the function. The new version acquires i_flags_lock early and
    only drops it once we need to call into inode_init_always or before
    calling xfs_ilock.

    This patch fixes a bug seen in the wild where we race modifying the
    reclaim tag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Reviewed-by: Eric Sandeen
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     

16 Aug, 2009

1 commit

  • The triggered field of struct poll_wqueues was introduced in commit
    5f820f648c92a5ecc771a96b3c29aa6e90013bba ("poll: allow f_op->poll to
    sleep").

    It was first set to 1 in pollwake() (now __pollwake() ), tested and
    later set to 0 in poll_schedule_timeout(), but not initialized before.

    As a result when the process needs to sleep, triggered was likely to be
    non-zero even if pollwake() is not called before the first
    poll_schedule_timeout(), meaning schedule_hrtimeout_range() would not be
    called and an extra loop calling all ->poll() would be done.

    This patch initializes triggered to 0 in poll_initwait() so the ->poll()
    methods are not called twice before the process goes to sleep when it
    needs to.
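
    The effect of the missing initialization can be modelled in userspace.
    This is a simplified sketch: struct pwq_sim and its helpers are
    hypothetical stand-ins for struct poll_wqueues, poll_initwait(),
    pollwake() and the check in poll_schedule_timeout().

```c
/* Hypothetical, heavily simplified stand-in for struct poll_wqueues. */
struct pwq_sim { int triggered; };

/* Mirrors the fix: the init routine clears the flag. */
static void pwq_sim_init(struct pwq_sim *p)
{
    p->triggered = 0;
}

/* Mirrors the wake callback, which sets the flag. */
static void pwq_sim_wake(struct pwq_sim *p)
{
    p->triggered = 1;
}

/*
 * Mirrors the check before sleeping: returns 1 if the caller would really
 * sleep, 0 if it would skip the sleep and rescan every ->poll().
 * The flag is cleared afterwards in both cases.
 */
static int pwq_sim_would_sleep(struct pwq_sim *p)
{
    int would_sleep = !p->triggered;
    p->triggered = 0;
    return would_sleep;
}
```

    Without pwq_sim_init(), a stale non-zero triggered makes the first
    pwq_sim_would_sleep() return 0, which corresponds to the extra loop
    described above.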

    Signed-off-by: Guillaume Knispel
    Acked-by: Thomas Gleixner
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Guillaume Knispel
     

14 Aug, 2009

2 commits

  • Although this file is only ever written and not read by
    userspace, it seems that the utils are opening this
    file O_RDWR, so we need to allow that.

    Also fixes the whitespace which seemed to be broken.

    Signed-off-by: Steven Whitehouse
    Cc: David Teigland

    Steven Whitehouse
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
    ocfs2: Fix possible deadlock when extending quota file
    ocfs2: keep index within status_map[]
    ocfs2: Initialize the cluster we're writing to in a non-sparse extend
    ocfs2: Remove redundant BUG_ON in __dlm_queue_ast()
    ocfs2/quota: Release lock for error in ocfs2_quota_write.
    ocfs2: Define credit counts for quota operations
    ocfs2: Remove syncjiff field from quota info
    ocfs2: Fix initialization of blockcheck stats
    ocfs2: Zero out padding of on disk dquot structure
    ocfs2: Initialize blocks allocated to local quota file
    ocfs2: Mark buffer uptodate before calling ocfs2_journal_access_dq()
    ocfs2: Make global quota files blocksize aligned
    ocfs2: Use ocfs2_rec_clusters in ocfs2_adjust_adjacent_records.
    ocfs2: Fix deadlock on umount
    ocfs2: Add extra credits and access the modified bh in update_edge_lengths.
    ocfs2: Fail ocfs2_get_block() immediately when a block needs allocation
    ocfs2: Fix error return in ocfs2_write_cluster()
    ocfs2: Fix compilation warning for fs/ocfs2/xattr.c
    ocfs2: Initialize count in aio_write before generic_write_checks
    ocfs2: log the actual return value of ocfs2_file_aio_write()
    ...

    Linus Torvalds
     

12 Aug, 2009

14 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix spin_is_locked assert on uni-processor builds
    xfs: check for dinode realtime flag corruption
    use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
    xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
    xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
    xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
    xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
    xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
    xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make
    xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
    xfs: switch to NOFS allocation under i_lock in xfs_getbmap
    xfs: avoid memory allocation under m_peraglock in growfs code

    Linus Torvalds
     
  • We can't call nfs_readdata_release()/nfs_writedata_release() without
    first initialising and referencing args.context. Doing so inside
    nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
    causes an Oops.

    We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
    those cases.

    Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
    referencing the nfs_open_context for us. Since the readdata and writedata
    structures carry a reference to that, we can simplify things by getting rid
    of the extra nfs_open_context references, so that we can replace all
    instances of nfs_readdata_release()/nfs_writedata_release().

    Reported-by: Catalin Marinas
    Signed-off-by: Trond Myklebust
    Tested-by: Catalin Marinas
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • Without SMP or preemption spin_is_locked always returns false,
    so we can't do an assert with it. Instead use assert_spin_locked,
    which does the right thing on all builds.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reported-by: Johannes Engel
    Tested-by: Johannes Engel
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • Ramon tested XFS with a modified version of fsfuzzer and hit a NULL
    pointer dereference in __xfs_get_blocks due to the RT device target
    pointer being NULL.

    To fix this, reject inodes with the realtime bit set on a filesystem
    without an RT subvolume during inode read.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Felix Blyakher
    Reported-by: Ramon de Carvalho Valle
    Tested-by: Ramon de Carvalho Valle
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • In Red Hat Bug 512552
    - Can't write to XFS mount during raid5 resync

    a user ran into corruption while resyncing a raid, and we failed
    a consistency test, but didn't get much more info; it'd be nice
    to call XFS_CORRUPTION_ERROR here so we can see the buffer
    contents.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Felix Blyakher

    Eric Sandeen
     
  • xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken
    in reclaim context so all allocations under it must avoid recursions into
    the filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_readlink_bmap is called with i_lock held, but i_lock is taken in
    reclaim context so all allocations under it must avoid recursions into
    the filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken
    in reclaim context so all allocations under it must avoid recursions into
    the filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_buf_associate_memory is used for setting up the spare buffer for the
    log wrap case in xlog_sync which can happen under i_lock when called from
    xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations
    under it must avoid recursions into the filesystem. There are a couple
    more uses of xfs_buf_associate_memory in the log recovery code that are
    also affected by this, but I'd rather keep the code simple than passing on
    a gfp_mask argument. Longer term we should just stop requiring the memory
    allocation in xlog_sync by some smaller rework of the buffer layer.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken
    in reclaim context so all allocations under it must avoid recursions into the
    filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • i_lock is taken in the reclaim context so all allocations under it
    must avoid recursions into the filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in
    reclaim context so all allocations under it must avoid recursions into the
    filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • xfs_getbmap allocates memory with i_lock held, but i_lock is taken in
    reclaim context so all allocations under it must avoid recursions into
    the filesystem.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     
  • Allocate the memory for the larger m_perag array before taking the
    per-AG lock as the per-AG lock can be taken under the i_lock which
    can be taken from reclaim context.

    Reported by the new reclaim context tracing in lockdep.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     

11 Aug, 2009

1 commit

  • In OCFS2, allocator locks rank above transaction start. Thus we
    cannot extend the quota file from inside a transaction, lest we
    deadlock.

    We solve the problem by starting the transaction not in
    ocfs2_acquire_dquot() but only later, in ocfs2_local_read_dquot() and
    ocfs2_global_read_dquot(), and by allocating blocks to quota files before
    starting the transaction. If we crash, quota files will just have a few
    extra blocks, but that is no problem since we simply use them the next
    time we extend the quota file.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     

10 Aug, 2009

3 commits

  • The problem is minor, but without ->cred_guard_mutex held we can race
    with exec() and get the new ->mm but check old creds.

    Now we do not need to re-check task->mm after ptrace_may_access(), it
    can't be changed to the new mm under us.

    Strictly speaking, this also fixes another very minor problem. Unless
    security check fails or the task exits mm_for_maps() should never
    return NULL, the caller should get either old or new ->mm.

    Signed-off-by: Oleg Nesterov
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Oleg Nesterov
     
  • mm_for_maps() takes ->mmap_sem after security checks, this looks
    strange and obfuscates the locking rules. Move this lock to its
    single caller, m_start().

    Signed-off-by: Oleg Nesterov
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Oleg Nesterov
     
  • It would be nice to kill __ptrace_may_access(). It requires task_lock(),
    but this lock is only needed to read mm->flags in the middle.

    Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
    the code a little bit.

    Also, we do not need to take ->mmap_sem in advance. In fact I think
    mm_for_maps() should not play with ->mmap_sem at all, the caller should
    take this lock.

    With or without this patch, without ->cred_guard_mutex held we can race
    with exec() and get the new ->mm but check old creds.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Serge Hallyn
    Signed-off-by: James Morris

    Oleg Nesterov
     

08 Aug, 2009

12 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY
    Btrfs: correct error-handling zlib error handling
    Btrfs: remove superfluous NULL pointer check in btrfs_rename()
    Btrfs: make sure the async caching thread advances the key
    Btrfs: fix btrfs_remove_from_free_space corner case

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/hch/xfs-icache-races:
    xfs: fix freeing of inodes not yet added to the inode cache
    vfs: add __destroy_inode
    vfs: fix inode_init_always calling convention

    Linus Torvalds
     
  • Do not exceed array status_map[]

    Signed-off-by: Roel Kluin
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Joel Becker

    Roel Kluin
     
  • In a non-sparse extend, we correctly allocate (and zero) the clusters between
    the old_i_size and pos, but we don't zero the portions of the cluster we're
    writing to outside of poslen.

    It handles clustersize > pagesize and blocksize < pagesize.

    [Cleaned up by Joel Becker.]

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • invalidate_inode_pages2_range may occasionally return -EBUSY,
    which results in an Oops. This patch fixes the issue by moving
    invalidate_inode_pages2_range into a loop and calling it
    repeatedly until the return value is not -EBUSY.

    The EBUSY return is temporary, and can happen when the btrfs release page
    function is unable to release a page because the EXTENT_LOCK
    bit is set.
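
    The retry pattern can be sketched in userspace as follows. The stub
    inval_sim_try() is a hypothetical stand-in for
    invalidate_inode_pages2_range() that reports -EBUSY a configurable number
    of times; it is not the btrfs code.

```c
#include <errno.h>

/* Hypothetical stub state: how many more calls will report -EBUSY. */
struct inval_sim {
    int busy_left;   /* remaining transient failures */
    int calls;       /* total number of attempts made */
};

/* Stand-in for the invalidation call: fails transiently, then succeeds. */
static int inval_sim_try(struct inval_sim *s)
{
    s->calls++;
    if (s->busy_left > 0) {
        s->busy_left--;
        return -EBUSY;
    }
    return 0;
}

/* Keep calling until the result is anything other than -EBUSY. */
static int inval_sim_retry(struct inval_sim *s)
{
    int ret;

    do {
        ret = inval_sim_try(s);
    } while (ret == -EBUSY);

    return ret;
}
```

    The loop terminates only because the busy condition is temporary, as
    the message notes; an unbounded retry would not be appropriate for a
    permanent error.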

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     
  • find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @match exists@
    expression x, E;
    statement S1, S2;
    @@

    x = find_zlib_workspace(...)
    ... when != x = E
    (
    * if (x == NULL || ...) S1 else S2
    |
    * if (x == NULL && ...) S1 else S2
    )
    //
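
    Why a NULL test misses this kind of failure can be shown with a userspace
    imitation of the kernel's error-pointer convention. The *_sim names are
    hypothetical; they only mimic ERR_PTR(), IS_ERR() and PTR_ERR(), assuming
    the usual limit of 4095 encodable error numbers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SIM 4095   /* assumed, mirrors the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer. */
static void *err_ptr_sim(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top MAX_ERRNO_SIM addresses. */
static int is_err_sim(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SIM;
}

/* Recover the errno value from an error pointer. */
static long ptr_err_sim(const void *p)
{
    return (long)(intptr_t)p;
}

/* Hypothetical stand-in for a failing find_zlib_workspace(). */
static void *find_workspace_sim(void)
{
    return err_ptr_sim(-ENOMEM);
}
```

    A caller that only checks the result against NULL treats the value
    returned by find_workspace_sim() as a valid pointer; checking with
    is_err_sim() catches it.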

    Signed-off-by: Julia Lawall
    Signed-off-by: Chris Mason

    Julia Lawall
     
  • This takes care of the following entry from Dan's list:

    fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'

    Reported-by: Dan Carpenter
    Cc: Jonathan Corbet
    Cc: Eugene Teo
    Cc: Julia Lawall
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Chris Mason

    Bartlomiej Zolnierkiewicz
     
  • * git://git.infradead.org/mtd-2.6:
    jffs2: Fix return value from jffs2_do_readpage_nolock()
    mtd: mtdblock: introduce mtdblks_lock
    mtd: remove 'SBC8240 Wind River' Device Driver Code
    mtd: OneNAND: OMAP2/3: free GPMC CS on module removal
    mtd: OneNAND: fix incorrect bufferram offset
    mtd: blkdevs: do not forget to get MTD devices
    mtd: fix the conversion from dev to mtd_info
    mtd: let include/linux/mtd/partitions.h stand on its own

    Linus Torvalds
     
  • The new credentials code broke load_flat_shared_library() as it now uses
    an uninitialized cred pointer.

    Reported-by: Bernd Schmidt
    Tested-by: Bernd Schmidt
    Cc: Mike Frysinger
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • I suspect that mnt_want_write_file() may make a wrong assumption. I think
    mnt_want_write_file() assumes that ->mnt_writers was incremented if
    (file->f_mode & FMODE_WRITE). But is that false for a special_file()?

    Signed-off-by: OGAWA Hirofumi
    Acked-by: Dave Hansen
    Cc: Al Viro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • The FIEMAP_IOC_FIEMAP mapping ioctl was missing a 32-bit compat handler,
    which means that 32-bit userspace on 64-bit kernels cannot use this ioctl
    command.

    The structure is nicely aligned, padded, and sized, so it is just this
    simple.
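
    The layout claim can be checked with a userspace mirror of the fixed-size
    header of struct fiemap. This is a sketch: struct fiemap_sim is a
    hypothetical copy that uses fixed-width types and omits the trailing
    extent array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fixed-size part of struct fiemap. */
struct fiemap_sim {
    uint64_t fm_start;           /* logical offset where mapping starts */
    uint64_t fm_length;          /* logical length of the mapping */
    uint32_t fm_flags;           /* request flags */
    uint32_t fm_mapped_extents;  /* number of extents that were mapped */
    uint32_t fm_extent_count;    /* size of the extent array */
    uint32_t fm_reserved;
};

/* Only fixed-width members, 64-bit ones first: no padding is needed. */
static size_t fiemap_sim_size(void)
{
    return sizeof(struct fiemap_sim);
}

static size_t fiemap_sim_flags_offset(void)
{
    return offsetof(struct fiemap_sim, fm_flags);
}
```

    Because every member has a fixed width and the two 64-bit members come
    first, the header is 32 bytes with the same offsets for 32-bit and 64-bit
    callers, which is why no argument translation is needed.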

    Tested w/ 32-bit ioctl tester (from Josef) on a 64-bit kernel on ext4.

    Signed-off-by: Eric Sandeen
    Cc: Mark Lord
    Cc: Arnd Bergmann
    Cc: Josef Bacik
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • When freeing an inode that lost the race to be added to the inode cache
    we must not call into ->destroy_inode, because that would delete the inode
    that won the race from the inode cache radix tree.

    This patch splits a new xfs_inode_free helper out of xfs_ireclaim
    and uses that plus __destroy_inode to make sure we really only free
    the memory allocated for the inode that lost the race, and do not mess
    with the inode cache state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reported-by: Alex Samad
    Reported-by: Andrew Randrianasulu
    Reported-by: Stephane
    Reported-by: Tommy
    Reported-by: Miah Gregory
    Reported-by: Gabriel Barazer
    Reported-by: Leandro Lucarella
    Reported-by: Daniel Burr
    Reported-by: Nickolay
    Reported-by: Michael Guntsche
    Reported-by: Dan Carley
    Reported-by: Michael Ole Olsen
    Reported-by: Michael Weissenbacher
    Reported-by: Martin Spott
    Reported-by: Christian Kujau
    Tested-by: Michael Guntsche
    Tested-by: Dan Carley
    Tested-by: Christian Kujau

    Christoph Hellwig