16 Aug, 2013

1 commit

  • Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.
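
    As a rough illustration of the idea (not the kernel's actual code), the
    sketch below uses a hypothetical, simplified struct gather in place of
    struct mmu_gather. The names gather_init() and gather_covers() are
    assumptions made for this example only. The point is that the range is a
    mandatory argument of the initializer, so no later code path can forget
    to set it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct mmu_gather. */
struct gather {
    unsigned long start;   /* first address covered by this gather */
    unsigned long end;     /* one past the last address covered */
    bool need_flush;       /* set once any page has been queued */
};

/* The range is a required argument, so it can never be left unset. */
static void gather_init(struct gather *g, unsigned long start,
                        unsigned long end)
{
    assert(start <= end);
    g->start = start;
    g->end = end;
    g->need_flush = false;
}

/* Report whether addr falls inside the range a flush would cover. */
static bool gather_covers(const struct gather *g, unsigned long addr)
{
    return addr >= g->start && addr < g->end;
}
```

    A caller that later flushes in chunks can still narrow the range, but
    the full range is always available as a safe default.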

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Aug, 2013

7 commits

  • Recently we met quite a lot of random kernel panic issues after enabling
    CONFIG_PROC_PAGE_MONITOR. After debugging we found this has something
    to do with the following bug in pagemap:

    In struct pagemapread:

    struct pagemapread {
            int pos, len;
            pagemap_entry_t *buffer;
            bool v2;
    };

    pos is the number of PM_ENTRY_BYTES-sized entries in the buffer, but len
    is the size of the buffer in bytes. Comparing pos with len in
    add_page_map() to check whether the buffer is full is therefore a
    mistake, and it can lead to a buffer overflow and random kernel panics.

    Correct len to be the total number of PM_ENTRY_BYTES-sized entries in the
    buffer.
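
    The unit mismatch can be sketched as follows. This is a minimal
    user-space illustration, not the kernel code: entry_t, struct readbuf,
    readbuf_init() and readbuf_add() are hypothetical names. The key
    assumption is that both pos and len count entries, so comparing them is
    meaningful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pagemap_entry_t. */
typedef struct { uint64_t pme; } entry_t;

struct readbuf {
    int pos;          /* entries filled so far */
    int len;          /* capacity in entries, NOT in bytes */
    entry_t *buffer;
};

/* Set up the buffer so that len counts entries rather than bytes. */
static void readbuf_init(struct readbuf *rb, entry_t *storage, size_t bytes)
{
    rb->pos = 0;
    rb->len = (int)(bytes / sizeof(entry_t));
    rb->buffer = storage;
}

/* Append one entry; returns false once the buffer is full. */
static bool readbuf_add(struct readbuf *rb, uint64_t value)
{
    assert(rb->pos <= rb->len);
    if (rb->pos >= rb->len)
        return false;
    rb->buffer[rb->pos++].pme = value;
    return true;
}
```

    Had len been left in bytes, the full check would only trigger after
    writing sizeof(entry_t) times more entries than the buffer can hold.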

    [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
    Signed-off-by: Yonghua Zheng
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yonghua zheng
     
  • Fix a NULL pointer dereference while removing an empty directory, which
    was introduced by commit 3704412bdbf3 ("[readdir] convert ocfs2").

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] (null)
    PGD 6da85067 PUD 6da89067 PMD 0
    Oops: 0010 [#1] SMP
    CPU: 0 PID: 6564 Comm: rmdir Tainted: G O 3.11.0-rc1 #4
    RIP: 0010:[] [< (null)>] (null)
    Call Trace:
    ocfs2_dir_foreach+0x49/0x50 [ocfs2]
    ocfs2_empty_dir+0x12c/0x3e0 [ocfs2]
    ocfs2_unlink+0x56e/0xc10 [ocfs2]
    vfs_rmdir+0xd5/0x140
    do_rmdir+0x1cb/0x1e0
    SyS_rmdir+0x16/0x20
    system_call_fastpath+0x16/0x1b
    Code: Bad RIP value.
    RIP [< (null)>] (null)
    RSP
    CR2: 0000000000000000

    [dan.carpenter@oracle.com: fix pointer math]
    Signed-off-by: Jie Liu
    Reported-by: David Weber
    Cc: Al Viro
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with NULL as
    the struct file pointer, it finally results in a NULL pointer dereference
    in ocfs2_duplicate_clusters_by_page.

    This patch replaces the file pointer with an inode pointer in
    cow_duplicate_clusters to fix this issue.

    [jeff.liu@oracle.com: rebased patch against linux-next tree]
    Signed-off-by: Tiger Yang
    Signed-off-by: Jie Liu
    Cc: Joel Becker
    Cc: Mark Fasheh
    Acked-by: Tao Ma
    Tested-by: David Weber
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tiger Yang
     
  • Revert commit 40bd62eb7fb8 ("fs/ocfs2/journal.h: add bits_wanted while
    calculating credits in ocfs2_calc_extend_credits").

    Unfortunately this change broke fallocate even if there is sufficient
    disk space for the preallocation, which is a serious problem.

    # df -h
    /dev/sda8 22G 1.2G 21G 6% /ocfs2
    # fallocate -o 0 -l 200M /ocfs2/testfile
    fallocate: /ocfs2/test: fallocate failed: No space left on device

    and a kernel warning:

    CPU: 3 PID: 3656 Comm: fallocate Tainted: G W O 3.11.0-rc3 #2
    Call Trace:
    dump_stack+0x77/0x9e
    warn_slowpath_common+0xc4/0x110
    warn_slowpath_null+0x2a/0x40
    start_this_handle+0x6c/0x640 [jbd2]
    jbd2__journal_start+0x138/0x300 [jbd2]
    jbd2_journal_start+0x23/0x30 [jbd2]
    ocfs2_start_trans+0x166/0x300 [ocfs2]
    __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
    ocfs2_allocate_unwritten_extents+0x3c9/0x520
    __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
    ocfs2_fallocate+0xb1/0xe0 [ocfs2]
    do_fallocate+0x1cb/0x220
    SyS_fallocate+0x6f/0xb0
    system_call_fastpath+0x16/0x1b
    JBD2: fallocate wants too many credits (51216 > 4381)

    Signed-off-by: Jie Liu
    Cc: Goldwyn Rodrigues
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jie Liu
     
  • Dave has reported the following lockdep splat:

    =================================
    [ INFO: inconsistent lock state ]
    3.11.0-rc1+ #9 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: [] page_referenced+0x87/0x5e3
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x81/0xe7
    lockdep_trace_alloc+0x5e/0xbc
    __alloc_pages_nodemask+0x8b/0x9b6
    __get_free_pages+0x20/0x31
    get_zeroed_page+0x12/0x14
    __pmd_alloc+0x1c/0x6b
    huge_pmd_share+0x265/0x283
    huge_pte_alloc+0x5d/0x71
    hugetlb_fault+0x7c/0x64a
    handle_mm_fault+0x255/0x299
    __do_page_fault+0x142/0x55c
    do_page_fault+0xd/0x16
    error_code+0x6c/0x74
    irq event stamp: 3136917
    hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
    hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
    softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
    softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&mapping->i_mmap_mutex);

    lock(&mapping->i_mmap_mutex);

    *** DEADLOCK ***
    no locks held by kswapd0/49.

    stack backtrace:
    CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
    Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
    Call Trace:
    dump_stack+0x4b/0x79
    print_usage_bug+0x1d9/0x1e3
    mark_lock+0x1e0/0x261
    __lock_acquire+0x623/0x17f2
    lock_acquire+0x7d/0x195
    mutex_lock_nested+0x6c/0x3a7
    page_referenced+0x87/0x5e3
    shrink_page_list+0x3d9/0x947
    shrink_inactive_list+0x155/0x4cb
    shrink_lruvec+0x300/0x5ce
    shrink_zone+0x53/0x14e
    kswapd+0x517/0xa75
    kthread+0xa8/0xaa
    ret_from_kernel_thread+0x1b/0x28

    which is a false positive caused by hugetlb pmd sharing code which
    allocates a new pmd from within mapping->i_mmap_mutex. If this
    allocation causes reclaim then the lockdep detector complains that we
    might self-deadlock.

    This is not correct though, because hugetlb pages are not reclaimable so
    their mapping will never be touched from the reclaim path.

    The patch tells lockdep that the hugetlb i_mmap_mutex is special by
    assigning it a separate lockdep class so it won't report possible
    deadlocks on unrelated mappings.

    [peterz@infradead.org: comment for annotation]
    Reported-by: Dave Jones
    Signed-off-by: Michal Hocko
    Cc: Peter Zijlstra
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Andy reported that if a file page gets reclaimed we lose the soft-dirty
    bit if it was set, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
    address gets encoded into the pte entry. Thus when a #pf happens on such
    a non-present pte we can restore it.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Minchan Kim
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit set
    gets swapped out, the bit is lost and is no longer available when the
    pte is read back.

    To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is saved
    in the pte entry for the page being swapped out. When such a page is read
    back from the swap cache we check for the bit's presence, and if it's
    there we clear it and restore the former _PAGE_SOFT_DIRTY bit.

    One of the problems was to find a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. _PAGE_PSE was
    chosen for that, since it doesn't intersect with the swap entry format
    stored in the pte.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

13 Aug, 2013

2 commits

  • Pull CIFS fixes from Steve French:
    "A set of small cifs fixes, including 3 relating to symlink handling"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately
    cifs: set sb->s_d_op before calling d_make_root()
    cifs: fix bad error handling in crypto code
    cifs: file: initialize oparms.reconnect before using it
    Do not attempt to do cifs operations reading symlinks with SMB2
    cifs: extend the buffer length enought for sprintf() using

    Linus Torvalds
     
  • Pull more ext4 bugfixes from Ted Ts'o:
    "A number of miscellaneous ext4 bugs fixes for v3.11, including a fix
    so that if ext4 is built as a module, to allow it to be unloaded"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
    ext4: fix mount/remount error messages for incompatible mount options
    ext4: allow the mount options nodelalloc and data=journal

    Linus Torvalds
     

12 Aug, 2013

1 commit


11 Aug, 2013

3 commits

  • Pull btrfs fixes from Chris Mason:
    "These are assorted fixes, mostly from Josef nailing down xfstests
    runs. Zach also has a long standing fix for problems with readdir
    wrapping f_pos (or ctx->pos)

    These patches were spread out over different bases, so I rebased
    things on top of rc4 and retested overnight"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: don't loop on large offsets in readdir
    Btrfs: check to see if root_list is empty before adding it to dead roots
    Btrfs: release both paths before logging dir/changed extents
    Btrfs: allow splitting of hole em's when dropping extent cache
    Btrfs: make sure the backref walker catches all refs to our extent
    Btrfs: fix backref walking when we hit a compressed extent
    Btrfs: do not offset physical if we're compressed
    Btrfs: fix extent buffer leak after backref walking
    Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
    btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified

    Linus Torvalds
     
  • Pull NFS client bugfixes from Trond Myklebust:

    - Stable patch for lockd to fix Oopses due to inappropriate calls to
    utsname()->nodename

    - Stable patches for sunrpc to fix Oopses on shutdown when using
    AF_LOCAL sockets with rpcbind

    - Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint

    - Fix a regression with the sync mount option failing to work for nfs4
    mounts

    - Fix a writeback performance issue when doing cache invalidation

    - Remove an incorrect call to nfs_setsecurity in nfs_fhget

    * tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Fix up nfs4_proc_lookup_mountpoint
    NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget()
    NFSv4: Fix the sync mount option for nfs4 mounts
    NFS: Fix writeback performance issue on cache invalidation
    SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister
    SUNRPC: Don't auto-disconnect from the local rpcbind socket
    LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs

    Linus Torvalds
     
  • Pull nfsd fixes from Bruce Fields:
    "Some fixes for a 4.1 feature that in retrospect probably should have
    waited for 3.12.... But it appears to be working now"

    * 'for-3.11' of git://linux-nfs.org/~bfields/linux:
    nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID
    nfsd4: Fix MACH_CRED NULL dereference

    Linus Torvalds
     

10 Aug, 2013

11 commits

  • When btrfs readdir() hits the last entry it sets the readdir offset to a
    huge value to stop buggy apps from breaking when the same name is
    returned by readdir() with concurrent rename()s.

    But unconditionally setting the offset to INT_MAX causes readdir() to
    loop returning any entries with offsets past INT_MAX. It only takes a
    few hours of constant file creation and removal to create entries past
    INT_MAX.

    So let's set the huge offset to LLONG_MAX if the last entry has already
    overflowed 32bit loff_t. Without large offsets behaviour is identical.
    With large offsets 64bit apps will work and 32bit apps will be no more
    broken than they currently are if they see large offsets.
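
    The rule described above can be sketched in a few lines. This is an
    illustration under assumptions, not the btrfs code: end_of_dir_pos() is
    a hypothetical helper, and last_pos stands for the offset reached after
    the last directory entry.

```c
#include <assert.h>
#include <limits.h>

/*
 * Pick the "end of directory" offset. If the last entry already sits at or
 * beyond INT_MAX, parking the offset at INT_MAX would make readdir() return
 * those entries again, so jump to LLONG_MAX instead.
 */
static long long end_of_dir_pos(long long last_pos)
{
    assert(last_pos >= 0);
    if (last_pos >= INT_MAX)
        return LLONG_MAX;
    return INT_MAX;
}
```

    For directories whose offsets never reach INT_MAX the result is the
    same as before, which matches the "behaviour is identical" claim.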

    Signed-off-by: Zach Brown
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Zach Brown
     
  • A user reported a panic when running with autodefrag and deleting snapshots.
    This is because we could end up trying to add the root to the dead roots list
    twice. To fix this check to see if we are empty before adding ourselves to the
    dead roots list. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The ceph guys tripped over this bug where we were still holding onto the
    original path that we used to copy the inode with when logging. This is based
    on Chris's fix which was reported to fix the problem. We need to drop the paths
    in two cases anyway so just move the drop up so that we don't have duplicate
    code. Thanks,

    Cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • I noticed while running multi-threaded fsync tests that sometimes fsck would
    complain about an improper gap. This happens because we fail to add a hole
    extent to the file, which was happening when we'd split a hole EM because
    btrfs_drop_extent_cache was just discarding the whole em instead of splitting
    it. So this patch fixes this by allowing us to split a hole em properly, which
    means that added holes actually get logged properly and we no longer see this
    fsck error. Thankfully we're tolerant of these sort of problems so a user would
    not see any adverse effects of this bug, other than fsck complaining. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Because we don't mess with the offset into the extent for compressed we will
    properly find both extents for this case

    [extent a][extent b][rest of extent a]

    but because we already added a ref for the front half we won't add the inode
    information for the second half. This causes us to leak that memory and not
    print out the other offset when we do logical-resolve. So fix this by calling
    ulist_add_merge and then add our eie to the existing entry if there is one.
    With this patch we get both offsets out of logical-resolve. With this and the
    other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
    set. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If you do btrfs inspect-internal logical-resolve on a compressed extent that has
    been partly overwritten it won't find anything. This is because we try and
    match the extent offset we've searched for based on the extent offset in the
    data extent entry. However this doesn't work for compressed extents because the
    offsets are for the uncompressed size, not the compressed size. So instead only
    do this check if we are not compressed, that way we can get an actual entry for
    the physical offset rather than nothing for compressed. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
    offsetting the physical based on the extent offset. This is perfectly fine with
    uncompressed extents, however the extent offset is into the uncompressed area,
    not the compressed. So we can return a physical value that isn't at all within
    the area we have allocated on disk. Fix this by returning the start of the
    extent if it is compressed no matter what the offset. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Commit 47fb091fb787420cd195e66f162737401cce023f ("Btrfs: fix unlock after
    free on rewinded tree blocks") takes an extra reference on the allocated
    dummy extent buffer, so now we cannot free this dummy one, and end up
    with an extent buffer leak.

    Signed-off-by: Liu Bo
    Reviewed-by: Jan Schmidt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • For partial extents, snapshot-aware defrag does not work as expected,
    since
    a) we use the wrong logical offset to search for parents, which should be
    disk_bytenr + extent_offset, not just disk_bytenr,
    b) 'offset' returned by the backref walking just refers to key.offset, not
    the 'offset' stored in btrfs_extent_data_ref which is
    (key.offset - extent_offset).

    The reproducer:
    $ mkfs.btrfs sda
    $ mount sda /mnt
    $ btrfs sub create /mnt/sub
    $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
    $ btrfs sub snap /mnt/sub /mnt/snap1
    $ btrfs sub snap /mnt/sub /mnt/snap2
    $ sync; btrfs filesystem defrag /mnt/sub/foo;
    $ umount /mnt
    $ btrfs-debug-tree sda (here we can check whether the defrag operation is snapshot-aware)

    This addresses the above two problems.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Create a small file and fallocate it to a big size with the
    FALLOC_FL_KEEP_SIZE option, then truncate it back to the
    small size again: the disk free space is not restored
    in this case, i.e.,

    total 4
    -rw-r--r-- 1 root root 512 Jun 28 11:35 test

    Filesystem Size Used Avail Use% Mounted on
    ....
    /dev/sdb1 8.0G 56K 7.2G 1% /mnt

    -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test

    Filesystem Size Used Avail Use% Mounted on
    ....
    /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt

    Filesystem Size Used Avail Use% Mounted on
    ....
    /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt

    With this fix, the space allocated past the file size is released again:
    Filesystem Size Used Avail Use% Mounted on
    ....
    /dev/sdb1 8.0G 56K 7.2G 1% /mnt

    Signed-off-by: Jie Liu
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Jie Liu
     
  • device_close()->recalc_sigpending() is not needed, sigprocmask() takes
    care of TIF_SIGPENDING correctly.

    And without ->siglock it is racy and wrong, it can wrongly clear
    TIF_SIGPENDING and miss a signal.

    But even with this patch device_close() is still buggy:

    1. sigprocmask() should not be used, we have set_task_blocked(),
    but this is minor.

    2. We should never block SIGKILL or SIGSTOP, and this is what
    the code tries to do.

    3. This can't protect against SIGKILL or SIGSTOP anyway. Another
    thread can do signal_wake_up(), say, do_signal_stop() or
    complete_signal() or debugger.

    4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clear
    TIF_SIGPENDING, say, freezing() or ->jobctl.

    5. device_write() looks equally wrong for the same reason.

    Looks like, this tries to protect some wait_event_interruptible() logic
    from signals, it should be turned into uninterruptible wait. Or we need
    to implement something like signals_stop/start for such a use-case.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Aug, 2013

3 commits

  • Commit 5688978 ("ext4: improve handling of conflicting mount options")
    introduced incorrect messages shown while choosing wrong mount options.

    First of all, both cases of incorrect mount options,
    "data=journal,delalloc" and "data=journal,dioread_nolock" result in
    the same error message.

    Secondly, the problem above isn't solved for remount option: the
    mismatched parameter is simply ignored. Moreover, ext4_msg states
    that remount with options "data=journal,delalloc" succeeded, which is
    not true.

    To fix it up, I added a simple check after parse_options() call to
    ensure that data=journal and delalloc/dioread_nolock parameters are
    not present at the same time.
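
    A check of this kind might look like the sketch below. The option bits,
    the helper name check_journal_conflict() and the message strings are all
    hypothetical and chosen for illustration; the real ext4 flags and
    messages differ. The sketch only shows that each conflicting pair gets
    its own message and that the same check can run after both mount and
    remount parsing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bits; the real ext4 mount flags differ. */
enum {
    OPT_DATA_JOURNAL   = 1u << 0,
    OPT_DELALLOC       = 1u << 1,
    OPT_DIOREAD_NOLOCK = 1u << 2,
    OPT_ALL            = (1u << 3) - 1,
};

/*
 * Returns NULL when the options are compatible, otherwise a message that
 * names the option conflicting with data=journal.
 */
static const char *check_journal_conflict(unsigned int opts)
{
    assert((opts & ~(unsigned int)OPT_ALL) == 0);  /* only known bits */
    if (!(opts & OPT_DATA_JOURNAL))
        return NULL;
    if (opts & OPT_DELALLOC)
        return "can't use both data=journal and delalloc";
    if (opts & OPT_DIOREAD_NOLOCK)
        return "can't use both data=journal and dioread_nolock";
    return NULL;
}
```

    A caller would run this once parsing is complete and fail the mount or
    remount when the result is non-NULL, instead of silently ignoring the
    mismatched option.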

    Signed-off-by: Piotr Sarna
    Acked-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Piotr Sarna
     
  • Commit 26092bf ("ext4: use a table-driven handler for mount options")
    wrongly disallows specifying the mount options nodelalloc and
    data=journal simultaneously. This is incorrect; it should have only
    disallowed the combination of delalloc and data=journal.

    Reported-by: Piotr Sarna
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • Pull ext4 bugfixes from Ted Ts'o.

    Misc ext4 fixes, delayed by Ted moving mail servers and email getting
    marked as spam due to bad spf records.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: add WARN_ON to check the length of allocated blocks
    ext4: fix retry handling in ext4_ext_truncate()
    ext4: destroy ext4_es_cachep on module unload
    ext4: make sure group number is bumped after a inode allocation race

    Linus Torvalds
     

08 Aug, 2013

7 commits

  • Currently, we do not check the return value of client = rpc_clone_client(),
    nor do we shut down the resulting cloned rpc_clnt in the case where a
    NFS4ERR_WRONGSEC has caused nfs4_proc_lookup_common() to replace the
    original value of 'client' (causing a memory leak).

    Fix both issues and simplify the code by moving the call to
    rpc_clone_client() until after nfs4_proc_lookup_common() has
    done its business.

    Reported-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We only need to call it on the creation of the inode.

    Reported-by: Julia Lawall
    Cc: Steve Dickson
    Cc: Dave Quigley
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • The sync mount option stopped working for NFSv4 mounts after commit
    c02d7adf8c5429727a98bad1d039bccad4c61c50 (NFSv4: Replace nfs4_path_walk() with
    FS path lookup in a private namespace). If MS_SYNCHRONOUS is set in the
    super_block that we're cloning from, then it should be set in the new
    super_block as well.

    Signed-off-by: Scott Mayhew
    Signed-off-by: Trond Myklebust

    Scott Mayhew
     
  • If a cache invalidation is triggered, and we happen to have a lot of
    writebacks cached at the time, then the call to invalidate_inode_pages2()
    will end up calling ->launder_page() on each and every dirty page in order
    to sync its contents to disk, thus defeating write coalescing.
    The following patch ensures that we try to sync the inode to disk before
    calling invalidate_inode_pages2() so that we do the writeback as efficiently
    as possible.

    Reported-by: William Dauchy
    Reported-by: Pascal Bouchareine
    Signed-off-by: Trond Myklebust
    Tested-by: William Dauchy
    Reviewed-by: Jeff Layton

    Trond Myklebust
     
  • …t/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "Oleg Nesterov has been working hard in closing all the holes that can
    lead to race conditions between deleting an event and accessing an
    event debugfs file. This included a fix to the debugfs system (acked
    by Greg Kroah-Hartman). We think that all the holes have been patched
    and hopefully we don't find more. I haven't marked all of them for
    stable because I need to examine them more to figure out how far back
    some of the changes need to go.

    Along the way, some other fixes have been made. Alexander Z Lam fixed
    some logic where the wrong buffer was being modified.

    Andrew Vagin found a possible corruption for machines that actually
    allocate cpumask, as a reference to one was being zeroed out by
    mistake.

    Dhaval Giani found a bad prototype when tracing is not configured.

    And I not only had some changes to help Oleg, but also finally fixed a
    long standing bug that Dave Jones and others have been hitting, where
    a module unload and reload can cause the function tracing accounting
    to get screwed up"

    * tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix reset of time stamps during trace_clock changes
    tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer
    tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set
    tracing: Fix fields of struct trace_iterator that are zeroed by mistake
    tracing/uprobes: Fail to unregister if probe event files are in use
    tracing/kprobes: Fail to unregister if probe event files are in use
    tracing: Add comment to describe special break case in probe_remove_event_call()
    tracing: trace_remove_event_call() should fail if call/file is in use
    debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs)
    ftrace: Check module functions being traced on reload
    ftrace: Consolidate some duplicate code for updating ftrace ops
    tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private
    tracing: Introduce remove_event_file_dir()
    tracing: Change f_start() to take event_mutex and verify i_private != NULL
    tracing: Change event_filter_read/write to verify i_private != NULL
    tracing: Change event_enable/disable_read() to verify i_private != NULL
    tracing: Turn event/id->i_private into call->event.type

    Linus Torvalds
     
  • - don't BUG_ON() when not SP4_NONE
    - calculate recv and send reserve sizes correctly

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: J. Bruce Fields

    Weston Andros Adamson
     
  • Fixes a NULL-dereference on attempts to use MACH_CRED protection over
    auth_sys.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

07 Aug, 2013

1 commit

  • David reported that commit c2b93e06 (cifs: only set ops for inodes in
    I_NEW state) caused a regression with mfsymlinks. Prior to that patch,
    if a mfsymlink dentry was instantiated at readdir time, the inode would
    get a new set of ops when it was revalidated. After that patch, this
    did not occur.

    This patch addresses this by simply skipping instantiating dentries in
    the readdir codepath when we know that they will need to be immediately
    revalidated. The next attempt to use that dentry will cause a new lookup
    to occur (which is basically what we want to happen anyway).

    Cc:
    Cc: "Stefan (metze) Metzmacher"
    Cc: Sachin Prabhu
    Reported-and-Tested-by: David McBride
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     

06 Aug, 2013

1 commit

  • Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
    which case we're in entirely the wrong namespace.

    Secondly, commit 8aac62706adaaf0fab02c4327761561c8bda9448 (move
    exit_task_namespaces() outside of exit_notify()) now means that
    exit_task_work() is called after exit_task_namespaces(), which
    triggers an Oops when we're freeing up the locks.

    Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
    time, so that the cl_nodename field is initialised to the value of
    utsname()->nodename that the net namespace uses. Then replace the
    lockd callers of utsname()->nodename.

    Signed-off-by: Trond Myklebust
    Cc: Toralf Förster
    Cc: Oleg Nesterov
    Cc: Nix
    Cc: Jeff Layton
    Cc: stable@vger.kernel.org # 3.10.x

    Trond Myklebust
     

05 Aug, 2013

3 commits

  • As the comment in include/uapi/asm-generic/fcntl.h describes, when
    introducing new O_* bits we need to check their uniqueness in
    fcntl_init(). But the __O_TMPFILE bit is missing from that check. So
    fix it.
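
    The uniqueness requirement can be illustrated with a small run-time
    check. This is a sketch under assumptions rather than the kernel's
    compile-time check: popcount32() and flags_are_unique() are hypothetical
    helpers, and the flag values used with them are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the set bits in v (portable population count). */
static unsigned int popcount32(unsigned int v)
{
    unsigned int n = 0;

    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * True only if every flag is a single bit and no two flags share a bit.
 * A flag that was left out of the list is never checked, which is the
 * kind of omission the commit above repairs.
 */
static bool flags_are_unique(const unsigned int *flags, size_t count)
{
    unsigned int seen = 0;

    assert(flags != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (popcount32(flags[i]) != 1)
            return false;
        if (seen & flags[i])
            return false;
        seen |= flags[i];
    }
    return true;
}
```

    Composite flags that combine several bits would need to be checked via
    their underlying single-bit component, which is presumably why the check
    names __O_TMPFILE rather than O_TMPFILE.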

    Signed-off-by: Zheng Liu
    Signed-off-by: Al Viro

    Zheng Liu
     
  • Every now and then someone proposes a new flink syscall, and this spawns
    a long discussion of whether it would be a security problem. I think
    that this is missing the point: flink is *already* allowed without
    privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.

    Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
    write it, and link it in is very convenient. The only problem is that
    it requires that /proc be mounted so that you can do:

    linkat(AT_FDCWD, "/proc/self/fd/", dfd, path, AT_SYMLINK_FOLLOW)

    This sucks -- it's much nicer to do:

    linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)

    Let's allow it.

    If this turns out to be excessively scary, we could instead require
    that the inode in question be I_LINKABLE, but this seems pointless given
    the /proc situation.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Al Viro

    Andy Lutomirski
     
  • O_TMPFILE, like O_CREAT, should respect the requested mode and should
    create regular files.

    This fixes two bugs: O_TMPFILE required privilege (because the mode
    ended up as 000) and it produced bogus inodes with no type.
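
    The intended mode handling can be sketched in user space as follows.
    This is an illustration, not the kernel change itself: tmpfile_mode()
    and MODE_PERM_MASK are hypothetical names, and the mask value 07777 is
    assumed to cover the permission, setuid, setgid and sticky bits.

```c
#define _XOPEN_SOURCE 700   /* make S_IFREG and S_IFMT visible in strict modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Permission bits a caller may request (assumed mask for this sketch). */
#define MODE_PERM_MASK 07777

/*
 * Keep only the permission bits the caller asked for and force the file
 * type to "regular file", as O_CREAT would.
 */
static mode_t tmpfile_mode(mode_t requested)
{
    mode_t mode = (requested & MODE_PERM_MASK) | S_IFREG;

    assert(S_ISREG(mode));
    return mode;
}
```

    With this rule a request for mode 0600 yields a regular file with mode
    0600, rather than a typeless inode with mode 000.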

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Al Viro

    Andy Lutomirski