03 Apr, 2012

20 commits

  • commit 05e9cfb408b24debb3a85fd98edbfd09dd148881 upstream.

    We can currently loop forever in nfs4_lookup_root() and in
    nfs41_proc_secinfo_no_name(), if the first iteration returns a
    NFS4ERR_DELAY or something else that causes exception.retry to get
    set.
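
    A minimal user-space sketch of the general hazard, not the actual NFS
    code: a retry flag left set by an earlier iteration keeps a do/while
    loop alive after a later attempt has already succeeded. All names here
    (retry_state, handle_error, call_with_retry, delay_once) are
    hypothetical; the sketch clears the flag before every attempt and also
    bounds the number of tries.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_OK = 0, OP_DELAY = 1, OP_FATAL = 2 };

/* Hypothetical per-call state, loosely modelled on an "exception" struct. */
struct retry_state {
    int retry; /* nonzero when the handler wants another attempt */
};

/* Hypothetical handler: only a transient OP_DELAY asks for a retry. */
static void handle_error(int err, struct retry_state *st)
{
    st->retry = (err == OP_DELAY);
}

typedef int (*op_fn)(void *ctx);

/*
 * Run op() until it succeeds, fails hard, or max_tries is reached.
 * The retry flag is cleared before every attempt, so a flag left over
 * from an earlier OP_DELAY cannot keep the loop alive after a success.
 */
static int call_with_retry(op_fn op, void *ctx, int max_tries, int *tries)
{
    struct retry_state st = { 0 };
    int err = OP_FATAL;
    int n = 0;

    assert(op != NULL && max_tries > 0);
    do {
        st.retry = 0;
        err = op(ctx);
        n++;
        if (err != OP_OK)
            handle_error(err, &st);
    } while (st.retry && n < max_tries);

    if (tries)
        *tries = n;
    return err;
}

/* Example operation: reports OP_DELAY once, then succeeds. */
static int delay_once(void *ctx)
{
    int *calls = ctx;
    return (*calls)++ == 0 ? OP_DELAY : OP_OK;
}
```

    Calling call_with_retry(delay_once, &calls, 10, &tries) with calls
    starting at 0 makes two attempts and then stops.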

    Reported-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit d97d32edcd732110758799ae60af725e5110b3dc upstream.

    When an IO error happens during inode deletion run from
    xlog_recover_process_iunlinks(), the filesystem gets shut down. Thus any
    subsequent attempt to read buffers fails. Code in
    xlog_recover_process_iunlinks() does not account for the fact that a read
    of a buffer which was read a while ago can really fail, which results in
    an oops on
    agi = XFS_BUF_TO_AGI(agibp);

    Fix the problem by cleaning up the buffer handling in
    xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer
    lock but keep buffer reference to AG buffer. That is enough for buffer to stay
    pinned in memory and we don't have to call xfs_read_agi() all the time.

    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit b18dafc86bb879d2f38a1743985d7ceb283c2f4d upstream.

    In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
    case; in two subcases the inode i_lock is properly released but this
    does not occur in the -ELOOP subcase.

    This seems to have been introduced by commit 1836750115f2 ("fix loop
    checks in d_materialise_unique()").

    Signed-off-by: Michel Lespinasse
    [ Added a comment, and moved the unlock to where we generate the -ELOOP,
    which seems to be more natural.

    You probably can't actually trigger this without a buggy network file
    server - d_materialise_unique() is for finding aliases on non-local
    filesystems, and the d_ancestor() case is for a hardlinked directory
    loop.

    But we should be robust in the case of such buggy servers anyway. ]
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michel Lespinasse
     
  • commit 31d4f3a2f3c73f279ff96a7135d7202ef6833f12 upstream.

    Explicitly test for an extent whose length is zero, and flag that as a
    corrupted extent.

    This avoids a kernel BUG_ON assertion failure.
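
    As a rough illustration of the kind of check described, here is a
    user-space sketch that rejects a zero-length extent up front instead
    of letting it reach an assertion later. The struct and function names
    are hypothetical and much simpler than the real ext4 on-disk format;
    the two range checks are included only to make the example complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: a run of len blocks starting at start. */
struct extent {
    uint64_t start;
    uint32_t len;
};

/* Returns 1 when the extent looks sane, 0 when it should be flagged corrupt. */
static int extent_is_valid(const struct extent *ex, uint64_t fs_blocks)
{
    assert(ex != NULL);
    if (ex->len == 0)                     /* zero-length extent: reject */
        return 0;
    if (ex->start >= fs_blocks)           /* starts past the end of the fs */
        return 0;
    if (ex->len > fs_blocks - ex->start)  /* runs past the end of the fs */
        return 0;
    return 1;
}
```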

    Tested: Without this patch, the file system image found in
    tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a
    kernel panic. With this patch, an ext4 file system error is noted
    instead, and the file system is marked as being corrupted.

    https://bugzilla.kernel.org/show_bug.cgi?id=42859

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 491caa43639abcffaa645fbab372a7ef4ce2975c upstream.

    The following command line will leave the aio-stress process unkillable
    on an ext4 file system (in my case, mounted on /mnt/test):

    aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2

    This is using the aio-stress program from the xfstests test suite.
    That particular command line tells aio-stress to do random writes to
    20 files from 20 threads (one thread per file). The files are NOT
    preallocated, so you will get writes to random offsets within the
    file, thus creating holes and extending i_size. It also opens the
    file with O_DIRECT and O_SYNC.

    On to the problem. When an I/O requires unwritten extent conversion,
    it is queued onto the completed_io_list for the ext4 inode. Two code
    paths will pull work items from this list. The first is the
    ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
    which is called via the fsync path (and O_SYNC handling, as well).
    There are two issues I've found in these code paths. First, if the
    fsync path beats the work routine to a particular I/O, the work
    routine will free the io_end structure! It does not take into account
    the fact that the io_end may still be in use by the fsync path. I've
    fixed this issue by adding yet another IO_END flag, indicating that
    the io_end is being processed by the fsync path.

    The second problem is that the work routine will make an assignment to
    io->flag outside of the lock. I have witnessed this result in a hang
    at umount. Moving the flag setting inside the lock resolved that
    problem.

    The problem was introduced by commit b82e384c7b ("ext4: optimize
    locking for end_io extent conversion"), which first appeared in 3.2.
    As such, the fix should be backported to that release (probably along
    with the unwritten extent conversion race fix).

    Signed-off-by: Jeff Moyer
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     
  • commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.

    The following comment in ext4_end_io_dio caught my attention:

    /* XXX: probably should move into the real I/O completion handler */
    inode_dio_done(inode);

    The truncate code takes i_mutex, then calls inode_dio_wait. Because the
    ext4 code path above will end up dropping the mutex before it is
    reacquired by the worker thread that does the extent conversion, it
    seems to me that the truncate can happen out of order. Jan Kara
    mentioned that this might result in error messages in the system logs,
    but that should be the extent of the "damage."

    The fix is pretty straight-forward: don't call inode_dio_done until the
    extent conversion is complete.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Moyer
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     
  • commit 3d2b158262826e8b75bbbfb7b97010838dd92ac7 upstream.

    Ext4 does not support data journalling with delayed allocation enabled.
    We do not even allow mounting the file system with delayed allocation
    and data journalling enabled. However, the flag can be set via
    FS_IOC_SETFLAGS, so we can hit an inode with EXT4_INODE_JOURNAL_DATA set
    even on a file system mounted with delayed allocation (the default), and
    that is where the problem arises. The easiest way to reproduce this
    problem is with the following set of commands:

    mkfs.ext4 /dev/sdd
    mount /dev/sdd /mnt/test1
    dd if=/dev/zero of=/mnt/test1/file bs=1M count=4
    chattr +j /mnt/test1/file
    dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc
    chattr -j /mnt/test1/file

    Additionally it can be reproduced quite reliably with xfstests 272 and
    269. In fact the above reproducer is a part of test 272.

    To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if
    the file system is mounted with delayed allocation. This can be easily
    done by fixing the ext4_should_*_data() functions to ignore the data
    journal flag when delalloc is set (suggested by Ted). We also have to set
    the appropriate address space operations for the inode (again, ignoring
    the data journal flag if delalloc is enabled).

    Additionally this commit introduces ext4_inode_journal_mode() function
    because ext4_should_*_data() has already had a lot of common code and
    this change is putting it all into one function so it is easier to
    read.
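
    The decision described above can be modelled roughly as follows. This
    is a simplified user-space sketch with hypothetical types (mount_opts,
    inode_info), not the kernel's ext4_inode_journal_mode(); it shows only
    the rule that the per-inode journal flag is honoured when delalloc is
    off and ignored when delalloc is on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum data_mode { MODE_WRITEBACK, MODE_ORDERED, MODE_JOURNAL };

/* Hypothetical, simplified view of the mount options. */
struct mount_opts {
    enum data_mode data; /* data=writeback | ordered | journal */
    bool delalloc;       /* delayed allocation enabled */
};

/* Hypothetical, simplified per-inode state. */
struct inode_info {
    bool journal_data_flag; /* models the flag set by chattr +j */
};

/* Single place that decides the effective mode for one inode. */
static enum data_mode inode_journal_mode(const struct mount_opts *m,
                                         const struct inode_info *i)
{
    assert(m != NULL && i != NULL);
    if (m->data == MODE_JOURNAL)
        return MODE_JOURNAL;
    /* The per-inode flag only counts when delalloc is not in use. */
    if (i->journal_data_flag && !m->delalloc)
        return MODE_JOURNAL;
    return m->data;
}

/* The three predicates become thin wrappers over the one helper. */
static bool should_journal_data(const struct mount_opts *m,
                                const struct inode_info *i)
{
    return inode_journal_mode(m, i) == MODE_JOURNAL;
}

static bool should_order_data(const struct mount_opts *m,
                              const struct inode_info *i)
{
    return inode_journal_mode(m, i) == MODE_ORDERED;
}

static bool should_writeback_data(const struct mount_opts *m,
                                  const struct inode_info *i)
{
    return inode_journal_mode(m, i) == MODE_WRITEBACK;
}
```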

    Successfully tested with xfstests in following configurations:

    delalloc + data=ordered
    delalloc + data=writeback
    data=journal
    nodelalloc + data=ordered
    nodelalloc + data=writeback
    nodelalloc + data=journal

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Lukas Czerner
     
  • commit 15291164b22a357cb211b618adfef4fa82fc0de3 upstream.

    journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
    state ala discard_buffer(), but does not touch _Delay or _Unwritten as
    discard_buffer() does.

    This can be problematic in some areas of the ext4 code which assume
    that if they have found a buffer marked unwritten or delay, then it's
    a live one. Perhaps those spots should check whether it is mapped
    as well, but if jbd2 is going to tear down a buffer, let's really
    tear it down completely.

    Without this I get some fsx failures on sub-page-block filesystems
    up until v3.2, at which point 4e96b2dbbf1d7e81f22047a50f862555a6cb87cb
    and 189e868fa8fdca702eb9db9d8afc46b5cb9144c9 make the failures go
    away, because buried within that large change is some more flag
    clearing. I still think it's worth doing in jbd2, since
    ->invalidatepage leads here directly, and it's the right place
    to clear away these flags.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit 9a3ba432330e504ac61ff0043dbdaba7cea0e35a upstream.

    Prevent the state manager from filling up system logs when recovery
    fails on the server.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4e474a00d7ff746ed177ddae14fa8b2d4bad7a00 upstream.

    Protect code accessing ctl_table by grabbing the header with grab_header()
    and releasing it afterwards with sysctl_head_finish(). This is needed if
    poll() is called on entries created by modules: currently only hostname
    and domainname support poll(), but this bug may be triggered when/if
    modules use it and if a user calls poll() on a file that doesn't support
    it.

    Dave Jones reported the following when using a syscall fuzzer while
    hibernating/resuming:

    RIP: 0010:[] [] proc_sys_poll+0x4e/0x90
    RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
    RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
    [ ... ]
    Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
    7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 16
    b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

    If an entry goes away while we are polling() it, ctl_table may not exist
    anymore.

    Reported-by: Dave Jones
    Signed-off-by: Lucas De Marchi
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Lucas De Marchi
     
  • commit 1b26c9b334044cff6d1d2698f2be41bc7d9a0864 upstream.

    The namespace cleanup path leaks a dentry which holds a reference count
    on a network namespace, keeping that network namespace from being freed
    when the last user goes away and leaving things like vlan devices in the
    leaked network namespace.

    If you use ip netns add for much real work this problem becomes apparent
    pretty quickly. In light testing the problem hides because frequently
    you simply don't notice the leak.

    Use d_set_d_op() so that DCACHE_OP_* flags are set correctly.

    This issue exists back to 3.0.

    Acked-by: "Eric W. Biederman"
    Reported-by: Justin Pettit
    Signed-off-by: Pravin B Shelar
    Signed-off-by: Jesse Gross
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pravin B Shelar
     
  • commit ce85852b90a214cf577fc1b4f49d99fd7e98784a upstream.

    Signed-off-by: Pavel Shilovsky
    Reviewed-by: Jeff Layton
    Reported-by: Ben Hutchings
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 1daaae8fa4afe3df78ca34e724ed7e8187e4eb32 upstream.

    This patch fixes an issue when cifs_mount receives a
    STATUS_BAD_NETWORK_NAME error during cifs_get_tcon but is able to
    continue after a DFS ROOT referral. In this case, the return code
    variable is not reset prior to trying to mount from the system referred
    to. Thus, is_path_accessible is not executed and the final DFS referral
    is not performed causing a mount error.

    Use case: In DNS, example.com resolves to the secondary AD server
    ad2.example.com. Our primary domain controller is ad1.example.com and has
    a DFS redirection set up from \\ad1\share\Users to \\files\share\Users.
    Mounting \\example.com\share\Users fails.

    Regression introduced by commit 724d9f1.

    Reviewed-by: Pavel Shilovsky
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 10b9b98e41ba248a899f6175ce96ee91431b6194 upstream.

    Some servers set this value to less than 50, which was the hardcoded
    limit, and we lost the connection when we exceeded the server's limit.
    Fix this by respecting this value - not sending more than the server
    allows.
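
    A small sketch of the idea, assuming a hypothetical connection
    structure rather than the real CIFS one: the number of requests in
    flight is capped by whatever the server advertised, and the old
    constant of 50 is used only as a fallback when no value is known (the
    fallback behaviour is an assumption made for this example).

```c
#include <assert.h>
#include <stddef.h>

#define FALLBACK_MAX_REQ 50 /* the previously hardcoded value */

/* Hypothetical connection state. */
struct conn {
    int server_max_req; /* limit advertised by the server, 0 if unknown */
    int in_flight;      /* requests sent but not yet answered */
};

/* Effective cap on outstanding requests for this connection. */
static int request_limit(const struct conn *c)
{
    assert(c != NULL);
    return c->server_max_req > 0 ? c->server_max_req : FALLBACK_MAX_REQ;
}

/* Returns 1 and accounts for the request if it may be sent, else 0. */
static int try_begin_request(struct conn *c)
{
    if (c->in_flight >= request_limit(c))
        return 0;
    c->in_flight++;
    return 1;
}

/* Called when a response arrives and the slot becomes free again. */
static void end_request(struct conn *c)
{
    assert(c != NULL && c->in_flight > 0);
    c->in_flight--;
}
```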

    Reviewed-by: Jeff Layton
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit f30d500f809eca67a21704347ab14bb35877b5ee upstream.

    When we get concurrent lookups of the same inode that is not in the
    per-AG inode cache, there is a race condition that triggers warnings
    in unlock_new_inode() indicating that we are initialising an inode
    that isn't in the correct state for a new inode.

    When we do an inode lookup via a file handle or a bulkstat, we don't
    serialise lookups at a higher level through the dentry cache (i.e.
    pathless lookup), and so we can get concurrent lookups of the same
    inode.

    The race condition is between the insertion of the inode into the
    cache in the case of a cache miss and a concurrent lookup:

    Thread 1                        Thread 2
    xfs_iget()
      xfs_iget_cache_miss()
        xfs_iread()
        lock radix tree
          radix_tree_insert()
                                    rcu_read_lock
                                    radix_tree_lookup
                                    lock inode flags
                                    XFS_INEW not set
                                    igrab()
                                    unlock inode flags
                                    rcu_read_unlock
                                    use uninitialised inode
                                    .....
          lock inode flags
          set XFS_INEW
          unlock inode flags
        unlock radix tree
      xfs_setup_inode()
        inode flags = I_NEW
        unlock_new_inode()
          WARNING as inode flags != I_NEW

    This can lead to inode corruption, inode list corruption, etc, and
    is generally a bad thing to occur.

    Fix this by setting XFS_INEW before inserting the inode into the
    radix tree. This will ensure any concurrent lookup will find the new
    inode with XFS_INEW set and that forces the lookup to wait until the
    XFS_INEW flag is removed before allowing the lookup to succeed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 3114ea7a24d3264c090556a2444fc6d2c06176d4 upstream.

    If a setattr() fails because of an NFS4ERR_OPENMODE error, it is
    probably due to us holding a read delegation. Ensure that the
    recovery routines return that delegation in this case.

    Reported-by: Miklos Szeredi
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit a1d0b5eebc4fd6e0edb02688b35f17f67f42aea5 upstream.

    If we know that the delegation stateid is bad or revoked, we need to
    remove that delegation as soon as possible, and then mark all the
    stateids that relied on that delegation for recovery. We cannot use
    the delegation as part of the recovery process.

    Also note that NFSv4.1 uses a different error code (NFS4ERR_DELEG_REVOKED)
    to indicate that the delegation was revoked.

    Finally, ensure that setlk() and setattr() can both recover safely from
    a revoked delegation.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit a05b0855fd15504972dba2358e5faa172a1e50ba upstream.

    Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as
    explained below

    Thread A:
      read() on hugetlbfs
        hugetlbfs_read() called
          i_mutex grabbed
            hugetlbfs_read_actor() called
              __copy_to_user() called
                page fault is triggered
    Thread B, sharing address space with A:
      mmap() the same file
        ->mmap_sem is grabbed on task_B->mm->mmap_sem
          hugetlbfs_file_mmap() is called
            attempt to grab ->i_mutex and block waiting for A to give it up
    Thread A:
      pagefault handled blocked on attempt to grab task_A->mm->mmap_sem,
      which happens to be the same thing as task_B->mm->mmap_sem. Block waiting
      for B to give it up.

    AFAIU the i_mutex locking was added to hugetlbfs_read() as per
    http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take
    care of the race between truncate and read. This patch fixes this by
    looking at page->mapping under lock_page() (find_lock_page()) to ensure
    that the inode didn't get truncated in the range during a parallel read.

    Ideally we can extend the patch to make sure we don't increase i_size in
    mmap. But that will break userspace, because applications will now have
    to use truncate(2) to increase i_size in hugetlbfs.

    Based on the original patch from Hillf Danton.

    Signed-off-by: Aneesh Kumar K.V
    Cc: Hillf Danton
    Cc: KAMEZAWA Hiroyuki
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

    In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem: khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code
    review it seems vm86 mode on 32bit kernels requires that too unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd becomes a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).
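
    The snapshot idea can be sketched in user space as follows. This is
    not the kernel helper: entry_t, ENTRY_HUGE and classify_entry() are
    hypothetical, and a volatile load stands in for the read plus compiler
    barrier. The point is only that every test is made on one local copy,
    so the entry cannot look "not huge" to one check and "huge" to the
    next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t entry_t;

#define ENTRY_HUGE 0x80u /* hypothetical bit marking a huge mapping */

enum walk_action { WALK_SKIP, WALK_HUGE, WALK_TABLE };

/*
 * Classify a table slot that another thread may be updating.
 * The slot is loaded exactly once through a volatile pointer; all later
 * decisions use the local snapshot and never re-read *slot.
 */
static enum walk_action classify_entry(const volatile entry_t *slot)
{
    entry_t snap;

    assert(slot != NULL);
    snap = *slot;            /* single load of the shared value */

    if (snap == 0)
        return WALK_SKIP;    /* nothing mapped */
    if (snap & ENTRY_HUGE)
        return WALK_HUGE;    /* must not be walked as a lower-level table */
    return WALK_TABLE;       /* safe to descend */
}
```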

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

       143 void pmd_clear_bad(pmd_t *pmd)
       144 {
    -> 145         pmd_ERROR(*pmd);
       146         pmd_clear(pmd);
       147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                 virtual address space
                .---------------------.
                |                     |
                |                     |
              .-|---------------------|
              | |                     |
              | |                     |<-- B(fault)
              | |                     |
        2 MB  | |/////////////////////|-.
        huge <  |/////////////////////|  > A(range)
        page  | |/////////////////////|-'
              | |                     |
              | |                     |
              '-|---------------------|
                |                     |
                |                     |
                '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 93518dd2ebafcc761a8637b2877008cfd748c202 upstream.

    This patch fixes the following two memory leak patterns that are reported
    by kmemleak. sysfs_sd_setsecdata() is called during the sys_lsetxattr()
    operation. It checks whether sd->s_iattr is NULL. If it is NULL, it calls
    sysfs_init_inode_attrs() to allocate memory.
    The code is this:

    iattrs = sd->s_iattr;
    if (!iattrs)
            iattrs = sysfs_init_inode_attrs(sd);

    iattrs receives the result of sysfs_init_inode_attrs(), but sd->s_iattr
    is never updated with that address. The address needs to be stored in
    sd->s_iattr so that the memory can be freed by other functions.
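
    A user-space sketch of the pattern, with hypothetical names (node,
    attrs, node_get_attrs) standing in for the sysfs structures: the
    lazily allocated object is stored back into its owner, so a later
    release path can find and free it. This is an illustration of the
    idea, not the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the attribute block and its owner. */
struct attrs {
    int mode;
};

struct node {
    struct attrs *attrs; /* lazily allocated, owned by the node */
};

/*
 * Return the node's attribute block, allocating it on first use.
 * The allocation is recorded in n->attrs rather than only in a local
 * variable, otherwise nothing could free it later.
 */
static struct attrs *node_get_attrs(struct node *n)
{
    assert(n != NULL);
    if (!n->attrs)
        n->attrs = calloc(1, sizeof(*n->attrs));
    return n->attrs; /* NULL if the allocation failed */
}

/* Release path: works because the owner remembers the pointer. */
static void node_release(struct node *n)
{
    assert(n != NULL);
    free(n->attrs);
    n->attrs = NULL;
}
```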

    unreferenced object 0xffff880250b73e60 (size 32):
    comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
    hex dump (first 32 bytes):
    73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_
    72 3a 73 79 73 66 73 5f 74 3a 73 30 00 00 00 00 r:sysfs_t:s0....
    backtrace:
    [] kmemleak_alloc+0x73/0x98
    [] __kmalloc+0x100/0x12c
    [] context_struct_to_string+0x106/0x210
    [] security_sid_to_context_core+0x10b/0x129
    [] security_sid_to_context+0x10/0x12
    [] selinux_inode_getsecurity+0x7d/0xa8
    [] selinux_inode_getsecctx+0x22/0x2e
    [] security_inode_getsecctx+0x16/0x18
    [] sysfs_setxattr+0x96/0x117
    [] __vfs_setxattr_noperm+0x73/0xd9
    [] vfs_setxattr+0x83/0xa1
    [] setxattr+0xcf/0x101
    [] sys_lsetxattr+0x6a/0x8f
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff
    unreferenced object 0xffff88024163c5a0 (size 96):
    comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
    hex dump (first 32 bytes):
    00 00 00 00 ed 41 00 00 00 00 00 00 00 00 00 00 .....A..........
    00 00 00 00 00 00 00 00 0c 64 42 4f 00 00 00 00 .........dBO....
    backtrace:
    [] kmemleak_alloc+0x73/0x98
    [] kmem_cache_alloc_trace+0xc4/0xee
    [] sysfs_init_inode_attrs+0x2a/0x83
    [] sysfs_setxattr+0xbf/0x117
    [] __vfs_setxattr_noperm+0x73/0xd9
    [] vfs_setxattr+0x83/0xa1
    [] setxattr+0xcf/0x101
    [] sys_lsetxattr+0x6a/0x8f
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    Signed-off-by: Masami Ichikawa
    Signed-off-by: Greg Kroah-Hartman

    Masami Ichikawa
     

19 Mar, 2012

1 commit

  • Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
    number of possible wakeup paths in epoll is causing a few applications
    to no longer work (dovecot for one).

    The original patch is really about limiting the amount of epoll nesting
    (since epoll fds can be attached to other fds). Thus, we probably can
    allow an unlimited number of paths of depth 1. My current patch limits
    it at 1000 and enforces the limits on paths that have a greater depth.
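
    A rough sketch of per-depth accounting along these lines. Only the
    limit of 1000 for depth 1 comes from the text above; the limits for
    deeper paths, the maximum depth of 5, and all names are illustrative
    assumptions, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 5 /* assumed maximum nesting depth for this sketch */

/*
 * Per-depth limits on the number of wakeup paths. Index 0 is depth 1.
 * Only the first value is taken from the text; the rest are placeholders.
 */
static const int path_limits[MAX_DEPTH] = { 1000, 500, 100, 50, 10 };

struct path_counts {
    int count[MAX_DEPTH];
};

/*
 * Account for one more path of the given depth (1-based).
 * Returns 0 on success, -1 if the depth is out of range or the limit
 * for that depth has already been reached.
 */
static int path_count_inc(struct path_counts *pc, int depth)
{
    assert(pc != NULL);
    if (depth < 1 || depth > MAX_DEPTH)
        return -1;
    if (pc->count[depth - 1] >= path_limits[depth - 1])
        return -1;
    pc->count[depth - 1]++;
    return 0;
}
```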

    This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578

    Signed-off-by: Jason Baron
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

17 Mar, 2012

5 commits

  • Merge some more email patches from Andrew Morton:
    "A couple of nilfs fixes"

    * emailed from Andrew Morton:
    nilfs2: fix NULL pointer dereference in nilfs_load_super_block()
    nilfs2: clamp ns_r_segments_percentage to [1, 99]

    Linus Torvalds
     
  • According to the report from Slicky Devil, nilfs caused a kernel oops in
    the nilfs_load_super_block function during mount after he shrank the
    partition without resizing the filesystem:

    BUG: unable to handle kernel NULL pointer dereference at 00000048
    IP: [] nilfs_load_super_block+0x17e/0x280 [nilfs2]
    *pde = 00000000
    Oops: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [] init_nilfs+0x4b/0x2e0 [nilfs2]
    [] nilfs_mount+0x447/0x5b0 [nilfs2]
    [] mount_fs+0x36/0x180
    [] vfs_kern_mount+0x51/0xa0
    [] do_kern_mount+0x3e/0xe0
    [] do_mount+0x169/0x700
    [] sys_mount+0x6b/0xa0
    [] sysenter_do_call+0x12/0x28
    Code: 53 18 8b 43 20 89 4b 18 8b 4b 24 89 53 1c 89 43 24 89 4b 20 8b 43
    20 c7 43 2c 00 00 00 00 23 75 e8 8b 50 68 89 53 28 8b 54 b3 20 72
    48 8b 7a 4c 8b 55 08 89 b3 84 00 00 00 89 bb 88 00 00 00
    EIP: [] nilfs_load_super_block+0x17e/0x280 [nilfs2] SS:ESP 0068:ca9bbdcc
    CR2: 0000000000000048

    This turned out to be due to a defect in an error path which runs if the
    calculated location of the secondary super block is invalid.

    This patch fixes it and eliminates the reported oops.

    Reported-by: Slicky Devil
    Signed-off-by: Ryusuke Konishi
    Tested-by: Slicky Devil
    Cc: [2.6.30+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     
  • ns_r_segments_percentage is read from the disk. A bogus or malicious
    value could cause integer overflow and malfunction due to a meaningless
    disk usage calculation. This patch reports an error when mounting such
    bogus volumes.
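
    A minimal sketch of such a check, using the [1, 99] range named in
    the merge summary for this change. The function name and the choice
    of -EINVAL as the error are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define R_SEGMENTS_PCT_MIN 1
#define R_SEGMENTS_PCT_MAX 99

/*
 * Validate a percentage read from untrusted on-disk data.
 * On success stores the value in *out and returns 0; otherwise returns
 * -EINVAL and leaves *out untouched, so the caller can refuse the mount.
 */
static int parse_r_segments_percentage(uint32_t raw, uint32_t *out)
{
    assert(out != NULL);
    if (raw < R_SEGMENTS_PCT_MIN || raw > R_SEGMENTS_PCT_MAX)
        return -EINVAL;
    *out = raw;
    return 0;
}
```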

    Signed-off-by: Haogang Chen
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haogang Chen
     
  • When writing files to afs I sometimes hit a BUG:

    kernel BUG at fs/afs/rxrpc.c:179!

    With a backtrace of:

    afs_free_call
    afs_make_call
    afs_fs_store_data
    afs_vnode_store_data
    afs_write_back_from_locked_page
    afs_writepages_region
    afs_writepages

    The cause is:

    ASSERT(skb_queue_empty(&call->rx_queue));

    Looking at a tcpdump of the session the abort happens because we
    are exceeding our disk quota:

    rx abort fs reply store-data error diskquota exceeded (32)

    So the abort error is valid. We hit the BUG because we haven't
    freed all the resources for the call.

    By freeing any skbs in call->rx_queue before calling afs_free_call
    we avoid leaking memory and avoid hitting the BUG.

    Signed-off-by: Anton Blanchard
    Signed-off-by: David Howells
    Cc:
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • A read of a large file on an afs mount failed:

    # cat junk.file > /dev/null
    cat: junk.file: Bad message

    Looking at the trace, call->offset wrapped since it is only an
    unsigned short. In afs_extract_data:

    _enter("{%u},{%zu},%d,,%zu", call->offset, len, last, count);
    ...

    if (call->offset < count) {
            if (last) {
                    _leave(" = -EBADMSG [%d < %zu]", call->offset, count);
                    return -EBADMSG;
            }

    Which matches the trace:

    [cat ] ==> afs_extract_data({65132},{524},1,,65536)
    [cat ] <== afs_extract_data() = -EBADMSG [0 < 65536]

    call->offset went from 65132 to 0. Fix this by making call->offset an
    unsigned int.

    Signed-off-by: Anton Blanchard
    Signed-off-by: David Howells
    Cc:
    Signed-off-by: Linus Torvalds

    Anton Blanchard
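
The wrap described in the entry above can be reproduced in isolation. This is a minimal sketch, not the afs code: the struct and function names are hypothetical, uint16_t and uint32_t stand in for the kernel's unsigned short and unsigned int, and the 404-byte step is an inference from the trace (65536 - 65132), assuming the copy is clamped to the bytes still missing from count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the offset field before and after the fix. */
struct call_u16 { uint16_t offset; };   /* models the old unsigned short */
struct call_u32 { uint32_t offset; };   /* models the new unsigned int */

/* Advance a 16-bit offset; the sum is reduced modulo 65536 on store. */
static uint32_t advance_u16(struct call_u16 *call, size_t copied)
{
    call->offset = (uint16_t)(call->offset + copied);
    return call->offset;
}

/* Advance a 32-bit offset; 65536 is representable, so nothing wraps. */
static uint32_t advance_u32(struct call_u32 *call, size_t copied)
{
    call->offset = (uint32_t)(call->offset + copied);
    return call->offset;
}

/* Self-check using the numbers from the trace (404 is an assumption). */
static void check_offset_wrap(void)
{
    struct call_u16 narrow = { 65132 };
    struct call_u32 wide = { 65132 };

    assert(advance_u16(&narrow, 404) == 0);      /* wrapped: 0 < 65536 */
    assert(advance_u32(&wide, 404) == 65536);    /* reaches count */
}
```

With the 16-bit field the offset lands on 0, so a "call->offset < count" test sees 0 < 65536 and takes the -EBADMSG path. With the 32-bit field the offset reaches 65536 and the comparison is false.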
     

15 Mar, 2012

1 commit

  • Pull block fixes from Jens Axboe:
    "Been sitting on this for a while, but let's get this out the door.
    This fixes various important bugs for 3.3 final, along with a few more
    trivial ones. Please pull!"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix ioc leak in put_io_context
    block, sx8: fix pointer math issue getting fw version
    Block: use a freezable workqueue for disk-event polling
    drivers/block/DAC960: fix -Wuninitialized warning
    drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
    block: fix __blkdev_get and add_disk race condition
    block: Fix setting bio flags in drivers (sd_dif/floppy)
    block: Fix NULL pointer dereference in sd_revalidate_disk
    block: exit_io_context() should call elevator_exit_icq_fn()
    block: simplify ioc_release_fn()
    block: replace icq->changed with icq->flags

    Linus Torvalds
     

14 Mar, 2012

1 commit


11 Mar, 2012

5 commits

  • wait_on_inode() doesn't have ->i_lock

    Signed-off-by: Al Viro

    Al Viro
     
  • complete_walk() returns either ECHILD or ESTALE. do_last() turns this into
    ECHILD unconditionally. If not in RCU mode, this error will reach userspace,
    which is complete nonsense.

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • complete_walk() already puts nd->path, no need to do it again at cleanup time.

    This would result in Oopses if triggered; apparently the codepath is not too
    well exercised.

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • udf_release_file() can be called from munmap() path with mmap_sem held. Thus
    we cannot take i_mutex there because that ranks above mmap_sem. Luckily,
    i_mutex is not needed in udf_release_file() anymore since protection by
    i_data_sem is enough to protect from races with write and truncate.

    Reported-by: Al Viro
    Reviewed-by: Namjae Jeon
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • 9a7aa12f3911853a introduced additional logic around setting the i_mutex
    lockdep class for directory inodes. The idea was that some filesystems
    may want their own special lockdep class for different directory
    inodes and calling unlock_new_inode() should not clobber one of
    those special classes.

    I believe that the added conditional, around the *negated* return value
    of lockdep_match_class(), caused directory inodes to be placed in the
    wrong lockdep class.

    inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
    all inodes. If the filesystem did not change the class during inode
    initialization, then the conditional mentioned above was false and the
    directory inode was incorrectly left in the non-directory lockdep class.
    If the filesystem did set a special lockdep class, then the conditional
    mentioned above was true and that class was clobbered with
    i_mutex_dir_key.

    This patch removes the negation from the conditional so that the i_mutex
    lockdep class is properly set for directory inodes. Special classes are
    preserved and directory inodes with unmodified classes are set with
    i_mutex_dir_key.

    Signed-off-by: Tyler Hicks
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Tyler Hicks
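
The effect of the negated conditional in the entry above can be modelled without lockdep. This is a toy sketch under stated assumptions: the enum values, struct and function names are hypothetical, and a plain equality test stands in for lockdep_match_class() against the default i_mutex_key class.

```c
#include <assert.h>

/* Hypothetical lock classes: the default, the directory one, an fs-specific one. */
enum toy_class { CLASS_DEFAULT, CLASS_DIR, CLASS_FS_SPECIAL };

struct toy_inode {
    int is_dir;             /* non-zero for directory inodes */
    enum toy_class cls;     /* class assigned so far */
};

/* Buggy form: the match against the default class is negated. */
static void unlock_new_buggy(struct toy_inode *inode)
{
    if (inode->is_dir && !(inode->cls == CLASS_DEFAULT))
        inode->cls = CLASS_DIR;     /* clobbers a special class */
}

/* Fixed form: only an untouched default class is switched to the directory class. */
static void unlock_new_fixed(struct toy_inode *inode)
{
    if (inode->is_dir && inode->cls == CLASS_DEFAULT)
        inode->cls = CLASS_DIR;
}

/* Self-check covering the two cases the commit message describes. */
static void check_lock_class(void)
{
    struct toy_inode plain = { 1, CLASS_DEFAULT };
    struct toy_inode special = { 1, CLASS_FS_SPECIAL };

    unlock_new_buggy(&plain);
    unlock_new_buggy(&special);
    assert(plain.cls == CLASS_DEFAULT);     /* left in the non-directory class */
    assert(special.cls == CLASS_DIR);       /* special class lost */

    plain.cls = CLASS_DEFAULT;
    special.cls = CLASS_FS_SPECIAL;
    unlock_new_fixed(&plain);
    unlock_new_fixed(&special);
    assert(plain.cls == CLASS_DIR);
    assert(special.cls == CLASS_FS_SPECIAL);
}
```

The buggy variant gets both cases backwards, which matches the description: unmodified directory inodes stay in the non-directory class and special classes are overwritten.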
     

10 Mar, 2012

3 commits

  • Current code has put_ioctx() called asynchronously from aio_fput_routine();
    that's done *after* we have killed the request that used to pin ioctx,
    so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
    from progressing. As a result, we can end up with an async call of
    put_ioctx() being the last one and possibly happening during exit_mmap()
    or elf_core_dump(), neither of which expects stray munmap() being done
    to them...

    We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
    with that, but that's all we care about - neither io_destroy() nor
    exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
    does really_put_req(), so the ioctx teardown won't be done until then
    and we don't care about the contents of ioctx past that point.

    Since actual freeing of these suckers is RCU-delayed, we don't need to
    bump ioctx refcount when request goes into list for async removal.
    All we need is rcu_read_lock held just over the ->ctx_lock-protected
    area in aio_fput_routine().

    Signed-off-by: Al Viro
    Reviewed-by: Jeff Moyer
    Acked-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Have ioctx_alloc() return an extra reference, so that caller would drop it
    on success and not bother with re-grabbing it on failure exit. The current
    code is obviously broken - io_destroy() from another thread that managed
    to guess the address io_setup() would've returned would free ioctx right
    under us; gets especially interesting if aio_context_t * we pass to
    io_setup() points to PROT_READ mapping, so put_user() fails and we end
    up doing io_destroy() on kioctx another thread has just got freed...

    Signed-off-by: Al Viro
    Acked-by: Benjamin LaHaise
    Reviewed-by: Jeff Moyer
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Pull btrfs updates from Chris Mason:
    "I have two additional btrfs fixes in my for-linus branch. One is
    a casting error that leads to memory corruption on i386 during scrub,
    and the other fixes a corner case in the backref walking code (also
    triggered by scrub)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix casting error in scrub reada code
    btrfs: fix locking issues in find_parent_nodes()

    Linus Torvalds
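
The "extra reference" idea from the ioctx_alloc() entry earlier in this group can be sketched with a plain counter. This is a single-threaded toy, not the aio code: every name below is hypothetical, the freed flag stands in for the real teardown, and the concurrent io_destroy() is simulated by a direct call.

```c
#include <assert.h>

/* Hypothetical toy context: refs counts holders, freed marks teardown. */
struct toy_ioctx {
    int refs;
    int freed;
};

/* Drop one reference; the last drop stands in for freeing the context. */
static void toy_put(struct toy_ioctx *ctx)
{
    assert(ctx->refs > 0);
    if (--ctx->refs == 0)
        ctx->freed = 1;
}

/* Allocation publishes the context and hands the caller its own reference. */
static void toy_alloc(struct toy_ioctx *ctx)
{
    ctx->refs = 2;      /* one for the lookup table, one for the caller */
    ctx->freed = 0;
}

/* Destroy drops only the lookup table's reference. */
static void toy_destroy(struct toy_ioctx *ctx)
{
    toy_put(ctx);
}

/* Self-check: a destroy that sneaks in early does not free under the caller. */
static void check_extra_ref(void)
{
    struct toy_ioctx ctx;

    toy_alloc(&ctx);
    toy_destroy(&ctx);          /* simulated destroy from another thread */
    assert(ctx.freed == 0);     /* caller's reference keeps it alive */
    toy_put(&ctx);              /* caller is done */
    assert(ctx.freed == 1);
}
```

Because the caller already owns a reference when allocation returns, it has nothing to re-grab on the failure path and simply drops its reference on every exit.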
     

07 Mar, 2012

3 commits


06 Mar, 2012

1 commit

  • Merge the emailed series of 19 patches from Andrew Morton

    * akpm:
    rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
    memcg: fix mapcount check in move charge code for anonymous page
    mm: thp: fix BUG on mm->nr_ptes
    alpha: fix 32/64-bit bug in futex support
    memcg: fix GPF when cgroup removal races with last exit
    debugobjects: Fix selftest for static warnings
    floppy/scsi: fix setting of BIO flags
    memcg: fix deadlock by inverting lrucare nesting
    drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
    c2port: class_create() returns an ERR_PTR
    pps: class_create() returns an ERR_PTR, not NULL
    hung_task: fix the broken rcu_lock_break() logic
    vfork: kill PF_STARTING
    coredump_wait: don't call complete_vfork_done()
    vfork: make it killable
    vfork: introduce complete_vfork_done()
    aio: wake up waiters when freeing unused kiocbs
    kprobes: return proper error code from register_kprobe()
    kmsg_dump: don't run on non-error paths by default

    Linus Torvalds