26 Sep, 2014

5 commits

  • commit 66d2f4d28cd030220e7ea2a628993fcabcb956d1 upstream.

    Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.
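
    Roughly, the corrected SGP_WRITE handling looks like this (a sketch only;
    the surrounding context in shmem_getpage_gfp() is elided):

        /* page found in page cache or swap cache: already visible, may be on LRU */
        if (sgp == SGP_WRITE)
                mark_page_accessed(page);
        ...
        /* freshly allocated page: not yet in the radix_tree or on the LRU */
        if (sgp == SGP_WRITE)
                init_page_accessed(page);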

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

    aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may be
    called before the page is visible and can be done non-atomically.
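
    Roughly, the idea (a sketch, not necessarily the exact helper in the patch)
    is to set the referenced bit with a plain, non-atomic bit operation, which
    is only safe while no one else can see the page yet:

        /* sketch: safe only before the page is added to the page cache/LRU */
        static inline void init_page_accessed(struct page *page)
        {
                if (!PageReferenced(page))
                        __SetPageReferenced(page);      /* non-atomic variant */
        }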

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper,
    pagecache_get_page(), which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.
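
    As an illustration of the wrapper shape (signatures simplified; the real
    patch passes the gfp masks slightly differently, and the FGP_* names are
    the flags it introduces):

        static inline struct page *find_or_create_page(struct address_space *mapping,
                                                       pgoff_t index, gfp_t gfp)
        {
                /* locked, marked accessed, created if missing */
                return pagecache_get_page(mapping, index,
                                          FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
        }

    The other helpers listed above differ only in which FGP_* flags they pass;
    grab_cache_page_write_begin() uses additional flags such as FGP_WRITE and
    FGP_NOFS for the write_begin path.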

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed, so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                            3.15.0-rc3            3.15.0-rc3
                               vanilla           accessed-v2
    ext3    Max elapsed   13.9900 (  0.00%)   11.5900 ( 17.16%)
    tmpfs   Max elapsed    0.5100 (  0.00%)    0.4900 (  3.92%)
    btrfs   Max elapsed   12.8100 (  0.00%)   12.7800 (  0.23%)
    ext4    Max elapsed   18.6000 (  0.00%)   13.3400 ( 28.28%)
    xfs     Max elapsed   12.5600 (  0.00%)    2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

              samples  percentage
    ext3        86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3        23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3         5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4        64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4         5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4         2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs         62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs          1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs           103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs       10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs        2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs         587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs       59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs        1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs          94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 07a427884348d38a6fd56fa4d78249c407196650 upstream.

    shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before it's even added to the LRU or visible. This is unnecessary as what
    could it possibly race against? Use an unlocked variant.
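
    The change amounts to the following (sketch of the before/after in
    shmem_getpage_gfp(), before the page is added to the page cache):

        SetPageSwapBacked(page);        /* before: locked (atomic) bit op */
        __SetPageSwapBacked(page);      /* after: non-atomic, page not yet visible */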

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.

    shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
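
    Illustrative difference between the raw and the filtered lookup (names as
    introduced by this patch; error handling elided):

        struct page *page = find_get_entry(mapping, index); /* may be exceptional */
        if (radix_tree_exceptional_entry(page)) {
                /* a swap slot or eviction cookie, not a page:
                 * find_get_page() would return NULL here instead */
        }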

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87 upstream.

    Page cache radix tree slots are usually stabilized by the page lock, but
    shmem's swap cookies have no such thing. Because the overall truncation
    loop is lockless, the swap entry is currently confirmed by a tree lookup
    and then deleted by another tree lookup under the same tree lock region.

    Use radix_tree_delete_item() instead, which does the verification and
    deletion with only one lookup. This also allows removing the
    delete-only special case from shmem_radix_tree_replace().
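
    Roughly what the shmem side becomes (a sketch of shmem_free_swap(); the
    real function's surrounding context differs slightly):

        static int shmem_free_swap(struct address_space *mapping,
                                   pgoff_t index, void *radswap)
        {
                void *old;

                spin_lock_irq(&mapping->tree_lock);
                /* delete the slot only if it still contains radswap */
                old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
                spin_unlock_irq(&mapping->tree_lock);
                if (old != radswap)
                        return -ENOENT;         /* someone swizzled it meanwhile */
                free_swap_and_cache(radix_to_swp_entry(radswap));
                return 0;
        }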

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     

28 Jul, 2014

3 commits

  • commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.

    shmem_fault() is the actual culprit in trinity's hole-punch starvation,
    and the most significant cause of such problems: since a page faulted is
    one that then appears page_mapped(), needing unmap_mapping_range() and
    i_mmap_mutex to be unmapped again.

    But it is not the only way in which a page can be brought into a hole in
    the radix_tree while that hole is being punched; and Vlastimil's testing
    implies that if enough other processors are busy filling in the hole,
    then shmem_undo_range() can be kept from completing indefinitely.

    shmem_file_splice_read() is the main other user of SGP_CACHE, which can
    instantiate shmem pagecache pages in the read-only case (without holding
    i_mutex, so perhaps concurrently with a hole-punch). Probably it's
    silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
    ought to be safe, but might bring surprises - not a change to be rushed.

    shmem_read_mapping_page_gfp() is an internal interface used by
    drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
    shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
    called internally by the kernel (perhaps for a stacking filesystem,
    which might rely on holes to be reserved): it's unclear whether it could
    be provoked to keep hole-punch busy or not.

    We could apply the same umbrella as now used in shmem_fault() to
    shmem_file_splice_read() and the others; but it looks ugly, and use over
    a range raises questions - should it actually be per page? can these get
    starved themselves?

    The origin of this part of the problem is my v3.1 commit d0823576bf4b
    ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
    into shmem.c. It seemed like a nice idea at the time, to ensure
    (barring RCU lookup fuzziness) that there's an instant when the entire
    hole is empty; but the indefinitely repeated scans to ensure that make
    it vulnerable.

    Revert that "enhancement" to hole-punch from shmem_undo_range(), but
    retain the unproblematic rescanning when it's truncating; add a couple
    of comments there.

    Remove the "indices[0] >= end" test: that is now handled satisfactorily
    by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
    to be worth avoiding here.

    But if we do not always loop indefinitely, we do need to handle the case
    of swap swizzled back to page before shmem_free_swap() gets it: add a
    retry for that case, as suggested by Konstantin Khlebnikov; and for the
    case of page swizzled back to swap, as suggested by Johannes Weiner.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Suggested-by: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.

    Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
    punched") was buggy: Sasha sent a lockdep report to remind us that
    grabbing i_mutex in the fault path is a no-no (write syscall may already
    hold i_mutex while faulting user buffer).

    We tried a completely different approach (see following patch) but that
    proved inadequate: good enough for a rational workload, but not good
    enough against trinity - which forks off so many mappings of the object
    that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
    into serious starvation when concurrent faults force the puncher to fall
    back to single-page unmap_mapping_range() searches of the i_mmap tree.

    So return to the original umbrella approach, but keep away from i_mutex
    this time. We really don't want to bloat every shmem inode with a new
    mutex or completion, just to protect this unlikely case from trinity.
    So extend the original with wait_queue_head on stack at the hole-punch
    end, and wait_queue item on the stack at the fault end.

    This involves further use of i_lock to guard against the races: lockdep
    has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
    i_lock around wake_up_bit(), which is comparable to what we do here.
    i_lock is more convenient, but we could switch to shmem's info->lock.
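
    A sketch of the hole-punch end (the fault end mirrors it with a wait_queue
    entry on its own stack; names follow the description above, and details of
    the real patch differ):

        DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);

        shmem_falloc.waitq = &shmem_falloc_waitq;
        /* ... punch the hole ... */
        spin_lock(&inode->i_lock);
        inode->i_private = NULL;                /* punching is done */
        wake_up_all(&shmem_falloc_waitq);       /* release any parked faulters */
        spin_unlock(&inode->i_lock);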

    This issue has been tagged with CVE-2014-4171, which will require commit
    f00cdc6df7d7 and this and the following patch to be backported: we
    suggest to 3.1+, though in fact the trinity forkbomb effect might go
    back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
    not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
    Anyone running trinity on 3.0 and earlier? I don't think we need care.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.

    Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Jiri Slaby

    Hugh Dickins
     

12 Sep, 2013

2 commits

  • Conditionally call the appropriate fs_init and fill_super functions. Add a
    use-once guard to shmem_init() so that a second call simply succeeds.

    (Note that IS_ENABLED() is a compile time constant so dead code
    elimination removes unused function calls when CONFIG_TMPFS is disabled.)
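
    One plausible shape of the guard (illustrative only; the guard variable
    name is hypothetical):

        int __init shmem_init(void)
        {
                static bool initialised;        /* hypothetical use-once guard */

                if (initialised)
                        return 0;               /* second call simply succeeds */
                initialised = true;

                /* ... register shmem_fs_type, set up the inode cache, ... */
                return 0;
        }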

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

        radix_tree_preload()
        ...
        radix_tree_insert()
          radix_tree_node_alloc()
            if (rtp->nr) {
              ret = rtp->nodes[rtp->nr - 1];
        <interrupt>
        ...
        radix_tree_preload()
        ...
        radix_tree_insert()
          radix_tree_node_alloc()
            if (rtp->nr) {
              ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption, with different outcomes (usually an oops) depending on
    which two users of the radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.
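
    The new helper is essentially (as described above; sketch):

        int radix_tree_maybe_preload(gfp_t gfp_mask)
        {
                if (gfp_mask & __GFP_WAIT)
                        return radix_tree_preload(gfp_mask);
                /* Preloading cannot help an atomic allocation; just keep the
                 * preload/insert pairing (preemption disabled) for the caller. */
                preempt_disable();
                return 0;
        }

    Callers that can cope with radix_tree_insert() failing switch to
    radix_tree_maybe_preload(); the rest keep radix_tree_preload(), which now
    warns if the passed gfp mask does not allow waiting.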

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

05 Aug, 2013

1 commit

  • Commit 46a1c2c7ae53 ("vfs: export lseek_execute() to modules") broke the
    tmpfs SEEK_DATA/SEEK_HOLE implementation, because vfs_setpos() converts
    the carefully prepared -ENXIO to -EINVAL. Other filesystems avoid it in
    error cases: do the same in tmpfs.
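
    The shape of the fix in shmem_file_llseek() is simply to keep error values
    away from vfs_setpos() (sketch):

        if (offset >= 0)
                offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE);
        /* a prepared -ENXIO (no further data/hole) is now returned as-is
         * instead of being converted to -EINVAL */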

    Signed-off-by: Hugh Dickins
    Cc: Jie Liu
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Jul, 2013

1 commit

  • Pull security subsystem updates from James Morris:
    "In this update, Smack learns to love IPv6 and to mount a filesystem
    with a transmutable hierarchy (i.e. security labels are inherited
    from parent directory upon creation rather than creating process).

    The rest of the changes are maintenance"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (37 commits)
    tpm/tpm_i2c_infineon: Remove unused header file
    tpm: tpm_i2c_infinion: Don't modify i2c_client->driver
    evm: audit integrity metadata failures
    integrity: move integrity_audit_msg()
    evm: calculate HMAC after initializing posix acl on tmpfs
    maintainers: add Dmitry Kasatkin
    Smack: Fix the bug smackcipso can't set CIPSO correctly
    Smack: Fix possible NULL pointer dereference at smk_netlbl_mls()
    Smack: Add smkfstransmute mount option
    Smack: Improve access check performance
    Smack: Local IPv6 port based controls
    tpm: fix regression caused by section type conflict of tpm_dev_release() in ppc builds
    maintainers: Remove Kent from maintainers
    tpm: move TPM_DIGEST_SIZE defintion
    tpm_tis: missing platform_driver_unregister() on error in init_tis()
    security: clarify cap_inode_getsecctx description
    apparmor: no need to delay vfree()
    apparmor: fix fully qualified name parsing
    apparmor: fix setprocattr arg processing for onexec
    apparmor: localize getting the security context to a few macros
    ...

    Linus Torvalds
     

03 Jul, 2013

1 commit

  • For those file systems (btrfs/ext4/ocfs2/tmpfs) that support
    SEEK_DATA/SEEK_HOLE, we end up handling the same matter in
    lseek_execute() to update the current file offset to the desired
    offset if it is valid; ceph also does similar things in ceph_llseek().

    To reduce the duplication, this patch makes lseek_execute()
    publicly accessible so that we can call it directly from the
    underlying file systems.

    Thanks Dave Chinner for this suggestion.

    [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

    v2->v1:
    - Add kernel-doc comments for lseek_execute()
    - Call lseek_execute() in ceph->llseek()
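
    The renamed helper is essentially the old lseek_execute() (sketch;
    unsigned_offsets() is the existing fs/read_write.c check for
    FMODE_UNSIGNED_OFFSET files):

        loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize)
        {
                if (offset < 0 && !unsigned_offsets(file))
                        return -EINVAL;
                if (offset > maxsize)
                        return -EINVAL;

                if (offset != file->f_pos) {
                        file->f_pos = offset;
                        file->f_version = 0;
                }
                return offset;
        }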

    Signed-off-by: Jie Liu
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Ben Myers
    Cc: Ted Tso
    Cc: Hugh Dickins
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Sage Weil
    Signed-off-by: Al Viro

    Jie Liu
     

20 Jun, 2013

1 commit

  • Included in the EVM hmac calculation is the i_mode. Any changes to
    the i_mode need to be reflected in the hmac. shmem_mknod() currently
    calls generic_acl_init(), which modifies the i_mode, after calling
    security_inode_init_security(). This patch reverses the order in
    which they are called.
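
    Sketch of the reordered sequence in shmem_mknod() (error handling and exact
    argument lists abbreviated; shmem_initxattrs is shmem's xattr initializer
    callback):

        error = generic_acl_init(inode, dir);           /* may change i_mode */
        if (!error)
                error = security_inode_init_security(inode, dir, &dentry->d_name,
                                                     shmem_initxattrs, NULL);
        /* the EVM hmac computed here now covers the final i_mode */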

    Reported-by: Sven Vermeulen
    Signed-off-by: Mimi Zohar
    Acked-by: Hugh Dickins

    Mimi Zohar
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

3 commits

  • This patch is a follow-up to the patch below:

    [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
    commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

    Signed-off-by: Namjae Jeon
    Signed-off-by: Vivek Trivedi
    Acked-by: Steven Whitehouse
    Acked-by: Sage Weil
    Signed-off-by: Al Viro

    Namjae Jeon
     
  • Note that provided ->d_dname() reproduces what we used to get for
    those guys in e.g. /proc/self/maps; it might be a good idea to change
    that to something less ugly, but for now let's keep the existing
    user-visible behaviour

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

24 Feb, 2013

3 commits

  • Fix several mempolicy leaks in the tmpfs mount logic. These leaks are
    slow - on the order of one object leaked per mount attempt.

    Leak 1 (umount doesn't free mpol allocated in mount):
        while true; do
            mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
            umount /mnt
        done

    Leak 2 (errors parsing remount options will leak mpol):
        mount -t tmpfs -o size=100M nodev /mnt
        while true; do
            mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
        done
        umount /mnt

    Leak 3 (multiple mpol per mount leak mpol):
        while true; do
            mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
            umount /mnt
        done

    This patch fixes all of the above. I could have broken the patch into
    three pieces but it seemed easier to review as one.
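
    For leak 1, the umount half of the fix plausibly looks like this (a sketch;
    the placement of the mpol_put() is the point, the rest is the existing
    teardown):

        static void shmem_put_super(struct super_block *sb)
        {
                struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

                percpu_counter_destroy(&sbinfo->used_blocks);
                mpol_put(sbinfo->mpol);         /* drop the mount-time mempolicy */
                kfree(sbinfo);
                sb->s_fs_info = NULL;
        }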

    [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
    option is not specified in the remount request. A new policy can be
    specified if mpol=M is given.

    Before this patch remounting an mpol bound tmpfs without specifying
    mpol= mount option in the remount request would set the filesystem's
    mempolicy object to a freed mempolicy object.

    To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
    # mkdir /tmp/x

    # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0

    # mount -o remount,size=200M nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
    # note ? garbage in mpol=... output above

    # dd if=/dev/zero of=/tmp/x/f count=1
    # panic here

    Panic:
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [< (null)>] (null)
    [...]
    Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
    Call Trace:
    mpol_shared_policy_init+0xa5/0x160
    shmem_get_inode+0x209/0x270
    shmem_mknod+0x3e/0xf0
    shmem_create+0x18/0x20
    vfs_create+0xb5/0x130
    do_last+0x9a1/0xea0
    path_openat+0xb3/0x4d0
    do_filp_open+0x42/0xa0
    do_sys_open+0xfe/0x1e0
    compat_sys_open+0x1b/0x20
    cstar_dispatch+0x7/0x1f

    Non-debug kernels will not crash immediately because referencing the
    dangling mpol will not cause a fault. Instead the filesystem will
    reference a freed mempolicy object, which will cause unpredictable
    behavior.

    The problem boils down to a dropped mpol reference below if
    shmem_parse_options() does not allocate a new mpol:

    config = *sbinfo
    shmem_parse_options(data, &config, true)
    mpol_put(sbinfo->mpol)
    sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */

    This patch avoids the crash by not releasing the mempolicy if
    shmem_parse_options() doesn't create a new mpol.
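
    Sketch of the corrected remount path:

        /* preserve the previous mempolicy unless mpol= was specified */
        if (config.mpol) {
                mpol_put(sbinfo->mpol);         /* drop the old one only when replacing it */
                sbinfo->mpol = config.mpol;     /* transfers the initial reference */
        }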

    How far back does this issue go? I see it in both 2.6.36 and 3.3. I did
    not look back further.

    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • In shmem_find_get_pages_and_swap(), use the faster radix tree iterator
    construct from commit 78c1d78488a3 ("radix-tree: introduce bit-optimized
    iterator").

    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Feb, 2013

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we lose the exact
    reason for the error. All callers of get_empty_filp() assume that the
    function can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]
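
    An illustrative caller after the conversion (shmem_file_setup()-style;
    abbreviated):

        file = alloc_file(&path, FMODE_WRITE | FMODE_READ, &shmem_file_operations);
        if (IS_ERR(file))
                goto put_dentry;        /* propagate PTR_ERR(file): -ENOMEM, -EPERM, -ENFILE, ... */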

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     
  • Signed-off-by: Al Viro

    Al Viro
     

27 Jan, 2013

1 commit

  • There is no backing store to tmpfs and file creation rules are the
    same as for any other filesystem so it is semantically safe to allow
    unprivileged users to mount it. ramfs is safe for the same reasons so
    allow either flavor of tmpfs to be mounted by a user namespace root
    user.

    The memory control group successfully limits how much memory tmpfs can
    consume, so on any system that cares about a user-namespace root using
    tmpfs to exhaust memory, the memory control group can be deployed.
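
    The change itself amounts to setting FS_USERNS_MOUNT in the tmpfs (and
    ramfs) file_system_type, roughly:

        static struct file_system_type shmem_fs_type = {
                .owner          = THIS_MODULE,
                .name           = "tmpfs",
                .mount          = shmem_mount,
                .kill_sb        = kill_litter_super,
                .fs_flags       = FS_USERNS_MOUNT,      /* user-ns root may mount it */
        };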

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 Dec, 2012

1 commit

  • Revert 3.5's commit f21f8062201f ("tmpfs: revert SEEK_DATA and
    SEEK_HOLE") to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and
    SEEK_HOLE"), with the intervening additional arg to
    generic_file_llseek_size().

    In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
    SEEK_DATA and SEEK_HOLE support; and a good case has now been made for
    it on tmpfs, so let's join the party.

    It's quite easy for tmpfs to scan the radix_tree to support llseek's new
    SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are
    still on my mind (in particular, the !PageUptodate-ness of pages
    fallocated but still unwritten).
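
    From userspace the new support is exercised like any other
    SEEK_DATA/SEEK_HOLE filesystem (hypothetical helper for illustration):

        #define _GNU_SOURCE
        #include <unistd.h>

        /* report the first data extent at or after 'from'; returns -1 on error,
         * including ENXIO when only a hole remains before EOF */
        static int first_extent(int fd, off_t from, off_t *start, off_t *end)
        {
                *start = lseek(fd, from, SEEK_DATA);
                if (*start == (off_t)-1)
                        return -1;
                *end = lseek(fd, *start, SEEK_HOLE);    /* every file has a hole at EOF */
                return *end == (off_t)-1 ? -1 : 0;
        }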

    [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
    Signed-off-by: Hugh Dickins
    Cc: Dave Chinner
    Cc: Jaegeuk Hanse
    Cc: "Theodore Ts'o"
    Cc: Zheng Liu
    Cc: Jeff liu
    Cc: Paul Eggert
    Cc: Christoph Hellwig
    Cc: Josef Bacik
    Cc: Andi Kleen
    Cc: Andreas Dilger
    Cc: Marco Stornelli
    Cc: Chris Mason
    Cc: Sunil Mushran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 Dec, 2012

1 commit

  • This fixes a regression in 3.7-rc, which has since gone into stable.

    Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
    imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
    refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
    on expecting alloc_page_vma() to drop the refcount it had acquired.
    This deserves a rework: but for now fix the leak in shmem_alloc_page().

    Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
    the same refcounting there as in shmem_alloc_page(), delete its onstack
    mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
    those were invented to let swapin_readahead() make an unknown number of
    calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
    alloc_pages_vma() has kept refcount in balance, so now no problem.

    Reported-and-tested-by: Tommi Rantala
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Nov, 2012

2 commits

  • Under a particular load on one machine, I have hit shmem_evict_inode()'s
    BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
    race between swapout and eviction.

    It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
    and the lack of coherent locking between mapping's nrpages and shmem's
    swapped count. There's a window in shmem_writepage(), between lowering
    nrpages in shmem_delete_from_page_cache() and then raising swapped
    count, when the freed count appears to be +1 when it should be 0, and
    then the asymmetry stops it from being corrected with -1 before hitting
    the BUG.

    One answer is coherent locking: using tree_lock throughout, without
    info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
    used_blocks makes that messier than expected. Another answer may be a
    further effort to eliminate the weird shmem_recalc_inode() altogether,
    but previous attempts at that failed.

    So far undecided, but for now change the BUG_ON to WARN_ON: in usual
    circumstances it remains a useful consistency check.

    Signed-off-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
    has converted to WARNING) in shmem_getpage_gfp():

    WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
    Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
    Call Trace:
    warn_slowpath_common+0x7f/0xc0
    warn_slowpath_null+0x1a/0x20
    shmem_getpage_gfp+0xa5c/0xa70
    shmem_fault+0x4f/0xa0
    __do_fault+0x71/0x5c0
    handle_pte_fault+0x97/0xae0
    handle_mm_fault+0x289/0x350
    __do_page_fault+0x18e/0x530
    do_page_fault+0x2b/0x50
    page_fault+0x28/0x30
    tracesys+0xe1/0xe6

    Thanks to Johannes for pointing to truncation: free_swap_and_cache()
    only does a trylock on the page, so the page lock we've held since
    before confirming swap is not enough to protect against truncation.

    What cleanup is needed in this case? Just delete_from_swap_cache(),
    which takes care of the memcg uncharge.

    Signed-off-by: Hugh Dickins
    Reported-by: Dave Jones
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Oct, 2012

1 commit

  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.
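
    The tmpfs part of the fix is a bounds check before the raw[] access
    (sketch of the guard in shmem_fh_to_dentry(); the other filesystems gain
    equivalent fh_len tests):

        u64 inum;

        if (fh_len < 3)
                return NULL;            /* don't read fid->raw[2] of a short handle */
        inum = fid->raw[2];
        inum = (inum << 32) | fid->raw[1];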

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support,
    if it uses filemap_fault() then generic_file_remap_pages() can be used.

    Now device drivers can implement this method and obtain nonlinear vma support.
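
    For a filesystem built on filemap_fault() the hookup is a single line in
    its vm_operations (illustrative; 'example_file_vm_ops' is a made-up name):

        static const struct vm_operations_struct example_file_vm_ops = {
                .fault          = filemap_fault,
                .remap_pages    = generic_file_remap_pages,     /* non-linear mapping support */
        };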

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

25 Aug, 2012

1 commit

  • Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.

    $ size vmlinux.o
    text data bss dec hex filename
    4658782 880729 5195032 10734543 a3cbcf vmlinux.o
    $ size vmlinux.o
    text data bss dec hex filename
    4658957 880729 5195032 10734718 a3cc7e vmlinux.o

    v7:
    - checkpatch warnings fixed
    - Implement the changes requested by Hugh Dickins:
    - make simple_xattrs_init and simple_xattrs_free inline
    - get rid of locking and list reinitialization in simple_xattrs_free,
    they're not needed
    v6:
    - no changes
    v5:
    - no changes
    v4:
    - move simple_xattrs_free() to fs/xattr.c
    v3:
    - in kmem_xattrs_free(), reinitialize the list
    - use simple_xattr_* prefix
    - introduce simple_xattr_add() to prevent direct list usage
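
    Illustrative use of the extracted API (names as introduced by the patch;
    exact signatures may differ slightly):

        struct simple_xattrs xattrs;
        char buf[64];
        int err, len;

        simple_xattrs_init(&xattrs);
        err = simple_xattr_set(&xattrs, "trusted.example", "value", 5, 0);
        len = simple_xattr_get(&xattrs, "trusted.example", buf, sizeof(buf));
        simple_xattrs_free(&xattrs);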

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Acked-by: Hugh Dickins
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski