08 Aug, 2020

2 commits

  • The default is still set to inode32 for backwards compatibility, but
    system administrators can opt in to the new 64-bit inode numbers by
    either:

    1. Passing inode64 on the command line when mounting, or
    2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

    The inode64 and inode32 names are used based on existing precedent from
    XFS.
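
    For illustration, a minimal userspace sketch of option 1 via mount(2);
    the mount point and size are hypothetical:

      #include <stdio.h>
      #include <sys/mount.h>

      int main(void)
      {
          /* Opt this tmpfs instance in to 64-bit inode numbers; the
           * /mnt/tmp path and size= value are made up for the example. */
          if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "inode64,size=1G")) {
              perror("mount");
              return 1;
          }
          return 0;
      }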

    [hughd@google.com: Kconfig fixes]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils

    Signed-off-by: Chris Down
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Amir Goldstein
    Acked-by: Hugh Dickins
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

    In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
    affected tiers, in excess of 10% of hosts show multiple files with
    different content and the same inode number, with some servers even having
    as many as 150 duplicated inode numbers with differing file content.

    This causes actual, tangible problems in production. For example, we have
    complaints from those working on remote caches that their application is
    reporting cache corruptions because it uses (device, inodenum) to
    establish the identity of a particular cache object, but because it's not
    unique any more, the application refuses to continue and reports cache
    corruption. Even worse, sometimes applications may not even detect the
    corruption but may continue anyway, causing phantom and hard to debug
    behaviour.

    In general, userspace applications expect that (device, inodenum) should
    be enough to uniquely point to one inode, which seems fair enough. One
    might also need to check the generation, but in this case:

    1. That's not currently exposed to userspace
    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
    2. Even with generation, there shouldn't be two live inodes with the
    same inode number on one device.
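
    As a sketch of the identity assumption above (the names here are ours,
    not the patch's), a userspace cache might key its objects like this:

      #include <sys/stat.h>

      /* (device, inode number) identity key as described above; with
       * i_ino wraparound, two live files can collide on this key. */
      struct file_id {
          dev_t dev;
          ino_t ino;
      };

      static int file_id_of(const char *path, struct file_id *id)
      {
          struct stat st;

          if (stat(path, &st) != 0)
              return -1;
          id->dev = st.st_dev;
          id->ino = st.st_ino;
          return 0;
      }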

    In order to mitigate this, we take a two-pronged approach:

    1. Moving inum generation from being global to per-sb for tmpfs. This
    itself allows some reduction in i_ino churn. This works on both 64-
    and 32-bit machines.
    2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
    64-bit ino_t only: we allow users to mount tmpfs with a new inode64
    option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

    You can see how this compares to previous related patches which didn't
    implement this per-superblock:

    - https://patchwork.kernel.org/patch/11254001/
    - https://patchwork.kernel.org/patch/11023915/

    This patch (of 2):

    get_next_ino has a number of problems:

    - It uses and returns a uint, which is susceptible to overflow
    if a lot of volatile inodes that use get_next_ino are created.
    - It's global, with no specificity per-sb or even per-filesystem. This
    means it's not that difficult to cause inode number wraparounds on a
    single device, which can result in having multiple distinct inodes
    with the same inode number.

    This patch adds a per-superblock counter that mitigates the second case.
    This design also allows us to later have a specific i_ino size per-device,
    for example, allowing users to choose whether to use 32- or 64-bit inodes
    for each tmpfs mount. This is implemented in the next commit.

    For internal shmem mounts which may be less tolerant to spinlock delays,
    we implement a percpu batching scheme which only takes the stat_lock at
    each batch boundary.
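
    A rough userspace analogue of that batching scheme (a sketch of the
    idea only; the names and batch size are ours, not the patch's):

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      #define INO_BATCH 4096

      struct sb_counter {
          pthread_mutex_t lock;   /* stands in for the sb's stat_lock */
          uint64_t next_ino;      /* per-"superblock" counter */
      };

      /* One batch per thread; a real version would keep one per
       * (cpu, superblock) pair. */
      static _Thread_local uint64_t batch_next, batch_end;

      static uint64_t next_ino(struct sb_counter *sb)
      {
          if (batch_next == batch_end) {  /* refill at batch boundary */
              pthread_mutex_lock(&sb->lock);
              batch_next = sb->next_ino;
              sb->next_ino += INO_BATCH;
              pthread_mutex_unlock(&sb->lock);
              batch_end = batch_next + INO_BATCH;
          }
          return batch_next++;            /* lock-free fast path */
      }

      int main(void)
      {
          struct sb_counter sb = { PTHREAD_MUTEX_INITIALIZER, 1 };

          for (int i = 0; i < 5; i++)
              printf("%llu\n", (unsigned long long)next_ino(&sb));
          return 0;
      }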

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Cc: Amir Goldstein
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     

08 Apr, 2020

1 commit

  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Feb, 2020

1 commit


13 Sep, 2019

1 commit

  • Convert the ramfs, shmem, tmpfs, devtmpfs and rootfs filesystems to the new
    internal mount API as the old one will be obsoleted and removed. This
    allows greater flexibility in communication of mount parameters between
    userspace, the VFS and the filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Note that tmpfs is slightly tricky as its mount options can contain
    embedded commas, so they can't be trivially split up using strsep() to
    break on commas in generic_parse_monolithic(). Instead, tmpfs has to
    supply its own generic parser.
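
    To illustrate why a plain strsep() on ',' is not enough, here is a toy
    splitter (ours, not the kernel's): it keeps a comma inside a value
    unless what follows looks like a new key, so mpol=bind:0-3,5 survives
    intact:

      #include <ctype.h>
      #include <stdio.h>

      static void split_opts(char *s)
      {
          char *start = s, *p = s;

          while (*p) {
              if (*p == ',') {
                  /* Peek: is the comma followed by "key=", or by a
                   * trailing bare key? Deliberately simplistic. */
                  const char *q = p + 1;
                  while (isalnum((unsigned char)*q) || *q == '_')
                      q++;
                  if (*q == '=' || *q == '\0') {
                      *p = '\0';
                      printf("option: %s\n", start);
                      start = p + 1;
                  }
              }
              p++;
          }
          printf("option: %s\n", start);
      }

      int main(void)
      {
          char opts[] = "size=1G,mpol=bind:0-3,5,huge=always";

          split_opts(opts);   /* three options; nodelist comma kept */
          return 0;
      }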

    However, if tmpfs changes, then devtmpfs and rootfs, which are wrappers
    around tmpfs or ramfs, must change too - and thus so must ramfs, so these
    had to be converted also.

    [AV: rewritten]

    Signed-off-by: David Howells
    cc: Hugh Dickins
    cc: linux-mm@kvack.org
    Signed-off-by: Al Viro

    David Howells
     

06 Sep, 2019

1 commit


20 Apr, 2019

1 commit

  • The igrab() in shmem_unuse() looks good, but we forgot that it gives no
    protection against concurrent unmounting: a point made by Konstantin
    Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
    ("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
    swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
    Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

    Once again, give up on using igrab(); but don't go back to making such
    heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
    the new design, and I expect could deadlock inside shmem_swapin_page().

    Instead, shmem_unuse() just raises a "stop_eviction" count in the shmem-
    specific inode, and shmem_evict_inode() waits for that to go down to 0.
    Call it "stop_eviction" rather than "swapoff_busy" because it can be put
    to use for others later (huge tmpfs patches expect to use it).

    That simplifies shmem_unuse(), protecting it from both unlink and
    unmount; and in practice lets it locate all the swap in its first try.
    But do not rely on that: there's still a theoretical case, when
    shmem_writepage() might have been preempted after its get_swap_page(),
    before making the swap entry visible to swapoff.
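
    A userspace sketch of that pattern, using a condition variable where
    the kernel uses its own waiting primitives:

      #include <pthread.h>

      struct obj {
          pthread_mutex_t lock;
          pthread_cond_t cond;
          int stop_eviction;   /* raised while swapoff works on us */
      };

      static void unuse_begin(struct obj *o)   /* shmem_unuse() side */
      {
          pthread_mutex_lock(&o->lock);
          o->stop_eviction++;
          pthread_mutex_unlock(&o->lock);
      }

      static void unuse_end(struct obj *o)
      {
          pthread_mutex_lock(&o->lock);
          if (--o->stop_eviction == 0)
              pthread_cond_broadcast(&o->cond);
          pthread_mutex_unlock(&o->lock);
      }

      static void evict(struct obj *o)   /* shmem_evict_inode() side */
      {
          pthread_mutex_lock(&o->lock);
          while (o->stop_eviction > 0)
              pthread_cond_wait(&o->cond, &o->lock);
          pthread_mutex_unlock(&o->lock);
          /* ... now safe to tear the object down ... */
      }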

    [hughd@google.com: remove incorrect list_del()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Mar, 2019

1 commit

  • This patch was initially posted by Kelley Nielsen. Reposting the patch
    with all review comments addressed and with minor modifications and
    optimizations. Also, folding in the fixes offered by Hugh Dickins and
    Huang Ying. Tests were rerun and commit message updated with new
    results.

    try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
    It unuses swap entries one by one, potentially iterating over all the
    page tables for all the processes in the system for each one.

    This new proposed implementation of try_to_unuse simplifies its
    complexity to linear. It iterates over the system's mms once, unusing
    all the affected entries as it walks each set of page tables. It also
    makes similar changes to shmem_unuse.

    Improvement

    swapoff was called on a swap partition containing about 6G of data, in a
    VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were counted.

    Present implementation: about 1200M calls (8 min, avg 80% CPU util).
    Prototype: about 9.0K calls (3 min, avg 5% CPU util).

    Details

    In shmem_unuse(), iterate over the shmem_swaplist and, for each
    shmem_inode_info that contains a swap entry, pass it to
    shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
    iterate over its associated xarray, and store the index and value of
    each swap entry in an array for passing to shmem_swapin_page() outside
    of the RCU critical section.

    In try_to_unuse(), instead of iterating over the entries in the type and
    unusing them one by one, perhaps walking all the page tables for all the
    processes for each one, iterate over the mmlist, making one pass. Pass
    each mm to unuse_mm() to begin its page table walk, and during the walk,
    unuse all the ptes that have backing store in the swap type received by
    try_to_unuse(). After the walk, check the type for orphaned swap
    entries with find_next_to_unuse(), and remove them from the swap cache.
    If find_next_to_unuse() starts over at the beginning of the type, repeat
    the check of the shmem_swaplist and the walk a maximum of three times.

    Change unuse_mm() and the intervening walk functions down to
    unuse_pte_range() to take the type as a parameter, and to iterate over
    their entire range, calling the next function down on every iteration.
    In unuse_pte_range(), make a swap entry from each pte in the range using
    the passed in type. If it has backing store in the type, call
    swapin_readahead() to retrieve the page and pass it to unuse_pte().

    Pass the count of pages_to_unuse down the page table walks in
    try_to_unuse(), and return from the walk when the desired number of
    pages has been swapped back in.
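
    A toy model of the complexity change (ours; nothing like the real page
    table walk): the old scheme rescans every mm for each swap slot, the
    new one makes a single pass over the mms:

      #define NPROC 3
      #define NPTE  4
      #define NSLOT 8

      /* "ptes": one swap slot per pte, or -1 if not swapped out. */
      static int pte[NPROC][NPTE] = {
          {  1, -1,  2, -1 },
          { -1,  3, -1,  1 },
          {  2, -1,  4, -1 },
      };

      static void old_try_to_unuse(void)   /* O(slots * mms * ptes) */
      {
          for (int slot = 0; slot < NSLOT; slot++)
              for (int mm = 0; mm < NPROC; mm++)
                  for (int i = 0; i < NPTE; i++)
                      if (pte[mm][i] == slot)
                          pte[mm][i] = -1;  /* swap the page back in */
      }

      static void new_try_to_unuse(void)   /* one pass: O(mms * ptes) */
      {
          for (int mm = 0; mm < NPROC; mm++)      /* unuse_mm() */
              for (int i = 0; i < NPTE; i++)      /* unuse_pte_range() */
                  if (pte[mm][i] >= 0)
                      pte[mm][i] = -1;      /* swap the page back in */
      }

      int main(void)
      {
          new_try_to_unuse();   /* or old_try_to_unuse() */
          return 0;
      }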

    Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
    Signed-off-by: Vineeth Remanan Pillai
    Signed-off-by: Kelley Nielsen
    Signed-off-by: Huang Ying
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineeth Remanan Pillai
     

08 Jun, 2018

1 commit

  • With the addition of memfd hugetlbfs support, we now have the situation
    where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only
    supported on tmpfs, so it made sense that the code resided in shmem.c.
    In the current code, memfd is only functional if TMPFS is defined. If
    HUGETLBFS is defined and TMPFS is not defined, then memfd functionality
    will not be available for hugetlbfs. This does not cause BUGs, just a
    lack of potentially desired functionality.

    Code is restructured in the following way:
    - include/linux/memfd.h is a new file containing memfd specific
    definitions previously contained in shmem_fs.h.
    - mm/memfd.c is a new file containing memfd specific code previously
    contained in shmem.c.
    - memfd specific code is removed from shmem_fs.h and shmem.c.
    - A new config option MEMFD_CREATE is added that is defined if TMPFS
    or HUGETLBFS is defined.

    No functional changes are made to the code: restructuring only.

    Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Khalid Aziz
    Cc: Andrea Arcangeli
    Cc: David Herrmann
    Cc: Hugh Dickins
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Feb, 2018

2 commits

  • Those functions are called for memfd files, backed by shmem or hugetlb
    (the next patches will handle hugetlb).

    Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • Patch series "memfd: add sealing to hugetlb-backed memory", v3.

    Recently, Mike Kravetz added hugetlbfs support to memfd. However, he
    didn't add sealing support. One of the reasons to use memfd is to have
    shared memory sealing when doing IPC or sharing memory with another
    process with some extra safety. qemu uses shared memory & hugetlb
    with vhost-user (used by dpdk), so it is reasonable to now use memfd
    instead, for convenience and security reasons.

    This patch (of 9):

    The functions are called through shmem_fcntl() only. There is no
    danger in removing the EXPORTs, as the routines only work with shmem
    file structs.

    Link: http://lkml.kernel.org/r/20171107122800.25517-2-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     

16 Nov, 2017

1 commit

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for v4.15.

    Core:
    - Atomic object lifetime fixes
    - Atomic iterator improvements
    - Sparse/smatch fixes
    - Legacy kms ioctls to be interruptible
    - EDID override improvements
    - fb/gem helper cleanups
    - Simple outreachy patches
    - Documentation improvements
    - Fix dma-buf rcu races
    - DRM mode object leasing for improving VR use cases.
    - vgaarb improvements for non-x86 platforms.

    New driver:
    - tve200: Faraday Technology TVE200 block.

    This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
    the StorLink SL3516 (later Cortina Systems CS3516) as well as the
    Grain Media GM8180.

    New bridges:
    - SiI9234 support

    New panels:
    - S6E63J0X03, OTM8009A, Seiko 43WVF1G, 7" rpi touch panel, Toshiba
    LT089AC19000, Innolux AT043TN24

    i915:
    - Remove Coffeelake from alpha support
    - Cannonlake workarounds
    - Infoframe refactoring for DisplayPort
    - VBT updates
    - DisplayPort vswing/emph/buffer translation refactoring
    - CCS fixes
    - Restore GPU clock boost on missed vblanks
    - Scatter list updates for userptr allocations
    - Gen9+ transition watermarks
    - Display IPC (Isochronous Priority Control)
    - Private PAT management
    - GVT: improved error handling and pci config sanitizing
    - Execlist refactoring
    - Transparent Huge Page support
    - User defined priorities support
    - HuC/GuC firmware refactoring
    - DP MST fixes
    - eDP power sequencing fixes
    - Use RCU instead of stop_machine
    - PSR state tracking support
    - Eviction fixes
    - BDW DP aux channel timeout fixes
    - LSPCON fixes
    - Cannonlake PLL fixes

    amdgpu:
    - Per VM BO support
    - Powerplay cleanups
    - CI powerplay support
    - PASID mgr for kfd
    - SR-IOV fixes
    - initial GPU reset for vega10
    - Prime mmap support
    - TTM updates
    - Clock query interface for Raven
    - Fence to handle ioctl
    - UVD encode ring support on Polaris
    - Transparent huge page DMA support
    - Compute LRU pipe tweaks
    - BO flag to allow buffers to opt out of implicit sync
    - CTX priority setting API
    - VRAM lost infrastructure plumbing

    qxl:
    - fix flicker since atomic rework

    amdkfd:
    - Further improvements from internal AMD tree
    - Usermode events
    - Drop radeon support

    nouveau:
    - Pascal temperature sensor support
    - Improved BAR2 handling
    - MMU rework to support Pascal MMU

    exynos:
    - Improved HDMI/mixer support
    - HDMI audio interface support

    tegra:
    - Prep work for tegra186
    - Cleanup/fixes

    msm:
    - Preemption support for a5xx
    - Display fixes for 8x96 (snapdragon 820)
    - Async cursor plane fixes
    - FW loading rework
    - GPU debugging improvements

    vc4:
    - Prep for DSI panels
    - fix T-format tiling scanout
    - New madvise ioctl

    Rockchip:
    - LVDS support

    omapdrm:
    - omap4 HDMI CEC support

    etnaviv:
    - GPU performance counters groundwork

    sun4i:
    - refactor driver load + TCON backend
    - HDMI improvements
    - A31 support
    - Misc fixes

    udl:
    - Probe/EDID read fixes.

    tilcdc:
    - Misc fixes.

    pl111:
    - Support more variants

    adv7511:
    - Improve EDID handling.
    - HDMI CEC support

    sii8620:
    - Add remote control support"

    * tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
    drm/rockchip: analogix_dp: Use mutex rather than spinlock
    drm/mode_object: fix documentation for object lookups.
    drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
    drm/i915: Move init_clock_gating() back to where it was
    drm/i915: Prune the reservation shared fence array
    drm/i915: Idle the GPU before shinking everything
    drm/i915: Lock llist_del_first() vs llist_del_all()
    drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
    drm/i915: Disable lazy PPGTT page table optimization for vGPU
    drm/i915/execlists: Remove the priority "optimisation"
    drm/i915: Filter out spurious execlists context-switch interrupts
    drm/amdgpu: use irq-safe lock for kiq->ring_lock
    drm/amdgpu: bypass lru touch for KIQ ring submission
    drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
    drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
    drm/amd/powerplay: initialize a variable before using it
    drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
    drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
    drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
    drm/rockchip: add CONFIG_OF dependency for lvds
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
      <5 lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Oct, 2017

1 commit

  • We are planning to use our own tmpfs mnt in i915 in place of the
    shm_mnt, such that we can control the mount options, in particular
    huge=, which we require to support huge-gtt-pages. So rather than roll
    our own version of __shmem_file_setup, it would be preferred if we could
    just give shmem our mnt, and let it do the rest.
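
    A kernel-style sketch of the resulting interface (simplified from the
    i915 usage; this only builds in-tree): mount a private tmpfs instance
    with kern_mount() and create files on it with the new helper:

      #include <linux/err.h>
      #include <linux/fs.h>
      #include <linux/shmem_fs.h>

      static struct vfsmount *gemfs;   /* our private tmpfs instance */

      static int gemfs_init(void)
      {
          struct file_system_type *type = get_fs_type("tmpfs");

          if (!type)
              return -ENODEV;
          gemfs = kern_mount(type);  /* mount options under our control */
          return PTR_ERR_OR_ZERO(gemfs);
      }

      static struct file *gemfs_create_file(loff_t size)
      {
          return shmem_file_setup_with_mnt(gemfs, "i915", size,
                                           VM_NORESERVE);
      }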

    Signed-off-by: Matthew Auld
    Cc: Joonas Lahtinen
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Acked-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Joonas Lahtinen
    Link: https://patchwork.freedesktop.org/patch/msgid/20171006145041.21673-2-matthew.auld@intel.com
    Signed-off-by: Chris Wilson
    Link: https://patchwork.freedesktop.org/patch/msgid/20171006221833.32439-1-chris@chris-wilson.co.uk

    Matthew Auld
     

07 Sep, 2017

1 commit

  • shmem_mfill_zeropage_pte is the low level routine that implements the
    userfaultfd UFFDIO_ZEROPAGE command. Since for shmem mappings zero
    pages are always allocated and accounted, the new method is a slight
    extension of the existing shmem_mcopy_atomic_pte.
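
    From userspace, the command looks roughly like this (a sketch; it
    assumes a userfaultfd already registered over a shmem mapping):

      #include <linux/userfaultfd.h>
      #include <sys/ioctl.h>

      /* Resolve a missing-page fault at addr with a zero page. */
      static int resolve_with_zeropage(int uffd, void *addr,
                                       long page_size)
      {
          struct uffdio_zeropage zp = {
              .range = {
                  .start = (unsigned long)addr & ~(page_size - 1),
                  .len   = page_size,
              },
              .mode = 0,
          };

          return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
      }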

    Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Feb, 2017

1 commit

  • shmem_mcopy_atomic_pte is the low level routine that implements the
    userfaultfd UFFDIO_COPY command. It is based on the existing
    mcopy_atomic_pte routine with modifications for shared memory pages.
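
    Its userspace counterpart, sketched (again assuming a userfaultfd
    already registered over the shmem mapping):

      #include <linux/userfaultfd.h>
      #include <sys/ioctl.h>

      /* Resolve a missing-page fault at addr by copying one page
       * from src. */
      static int resolve_with_copy(int uffd, void *addr, const void *src,
                                   long page_size)
      {
          struct uffdio_copy copy = {
              .dst  = (unsigned long)addr & ~(page_size - 1),
              .src  = (unsigned long)src,
              .len  = page_size,
              .mode = 0,
          };

          return ioctl(uffd, UFFDIO_COPY, &copy);
      }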

    Link: http://lkml.kernel.org/r/20161216144821.5183-29-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

27 Jul, 2016

6 commits

  • Even if the user asked to allocate huge pages always (huge=always), we
    should be able to free up some memory by splitting pages which are
    partly beyond i_size if memory pressure comes or once we hit the limit
    on filesystem size (-o size=).

    In order to do this we maintain a per-superblock list of inodes, which
    potentially have huge pages on the border of file size.

    A per-fs shrinker can reclaim memory by splitting such pages.

    If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to
    free up space on the filesystem and retry allocation if it succeeds.

    Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear the pmd and let subsequent page faults reconstruct the page
    table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide THP page cache, including huge tmpfs, under a separate config
    option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch extends khugepaged to support collapse of tmpfs/shmem pages.
    We share a fair amount of infrastructure with anon-THP collapse.

    A few design points:

    - First we look for a VMA suitable for mapping a huge page;

    - If the VMA maps a shmem file, the remaining scan/collapse operations
    operate on the page cache, not on page tables as in the anon VMA case.

    - khugepaged_scan_shmem() finds a range which is suitable for huge
    page. The scan is lockless and shouldn't disturb the system too much.

    - once the candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

    + scan over the radix tree, making the range point to the new huge
    page;

    + the new huge page is not-uptodate, locked and frozen (refcount
    is 0), so nobody can touch it until we say so.

    + we swap in pages during the scan. khugepaged_scan_shmem()
    filters out ranges with more than khugepaged_max_ptes_swap
    swapped out pages. It's HPAGE_PMD_NR/8 by default.

    + old pages are isolated, unmapped and put on a local list so they
    can be restored if the collapse fails.

    - if collapse succeeds, we retract pte page tables from VMAs where
    huge page mapping is possible. The huge page will be mapped as PMD on
    next minor fault into the range.

    Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto radix-tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they are fully within range.
    A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on what
    pages happened to be allocated.

    - no need to change shmem_fault: core-mm will map a compound page as
    huge if VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address that chose; but when our huge stars are
    rightly aligned, yet that did not provide a suitable address, go back to
    ask for a larger arena, within which to align the mapping suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch adds a new mount option, "huge=". It can have the following
    values:

    - "always":
    Attempt to allocate huge pages every time we need a new page;

    - "never":
    Do not allocate huge pages;

    - "within_size":
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

    - "advise:
    Only allocate huge pages if requested with fadvise()/madvise();

    Default is "never" for now.

    "mount -o remount,huge= /mountpoint" works fine after mount: remounting
    huge=never will not attempt to break up huge pages at all, just stop
    more from being allocated.
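
    For illustration, requesting the option from userspace might look like
    this (the mount point and size are hypothetical):

      #include <stdio.h>
      #include <sys/mount.h>

      int main(void)
      {
          /* Allocate huge pages only when fully within i_size. */
          if (mount("tmpfs", "/mnt/thp", "tmpfs", 0,
                    "huge=within_size,size=2G")) {
              perror("mount");
              return 1;
          }
          return 0;
      }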

    No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
    is the appropriate option to protect those who don't want the new bloat,
    and with which we shall share some pmd code.

    Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
    invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
    explicit).

    Allow enabling THP only if the machine has_transparent_hugepage().

    But what about Shmem with no user-visible mount? SysV SHM, memfds,
    shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
    objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob
    /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
    huge on those.

    And allow shmem_enabled two further values:

    - "deny":
    For use in emergencies, to force the huge option off from
    all mounts;
    - "force":
    Force the huge option on for all - very useful for testing;

    Based on patch by Hugh Dickins.

    Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit


15 Jan, 2016

2 commits

  • Following the previous patch, further reduction of /proc/pid/smaps cost
    is possible for private writable shmem mappings with unpopulated areas
    where the page walk invokes the .pte_hole function. We can use radix
    tree iterator for each such area instead of calling find_get_entry() in
    a loop. This is possible at the extra maintenance cost of introducing
    another shmem function shmem_partial_swap_usage().

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I time how long it takes to
    cat /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch has improved swap accounting for shmem mapping, which
    however made /proc/pid/smaps more expensive for shmem mappings, as we
    consult the radix tree for each pte_none entry, so the overall complexity
    is O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics the shmem object
    itself tracks (if the mapping contains the whole object, or the swap
    usage of the whole object is zero), or use the radix tree iterator,
    which is much more effective than repeated find_get_entry() calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the find_get_entry()
    approach.

    Hopefully such mappings are relatively uncommon.

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB, and
    time how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
    (which needs to employ the radix tree iterator).

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

09 Aug, 2014

1 commit

  • If two processes share a common memory region, they usually want some
    guarantees to allow safe access. This often includes:
    - one side cannot overwrite data while the other reads it
    - one side cannot shrink the buffer while the other accesses it
    - one side cannot grow the buffer beyond previously set boundaries

    If there is a trust-relationship between both parties, there is no need
    for policy enforcement. However, if there's no trust relationship (e.g.,
    for general-purpose IPC) sharing memory-regions is highly fragile and
    often not possible without local copies. Look at the following two
    use-cases:

    1) A graphics client wants to share its rendering-buffer with a
    graphics-server. The memory-region is allocated by the client for
    read/write access and a second FD is passed to the server. While
    scanning out from the memory region, the server has no guarantee that
    the client doesn't shrink the buffer at any time, requiring rather
    cumbersome SIGBUS handling.
    2) A process wants to perform an RPC on another process. To avoid huge
    bandwidth consumption, zero-copy is preferred. After a message is
    assembled in-memory and a FD is passed to the remote side, both sides
    want to be sure that neither modifies this shared copy, anymore. The
    source may have put sensitive data into the message without a separate
    copy and the target may want to parse the message inline, to avoid a
    local copy.

    While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
    ways to achieve most of this, the first one is disproportionately ugly to
    use in libraries and the latter two are broken/racy or even disabled due
    to denial of service attacks.

    This patch introduces the concept of SEALING. If you seal a file, a
    specific set of operations is blocked on that file forever. Unlike locks,
    seals can only be set, never removed. Hence, once you have verified that
    a specific set of seals is set, you're guaranteed that no-one can perform
    the blocked operations on this file anymore.

    An initial set of SEALS is introduced by this patch:
    - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
    in size. This affects ftruncate() and open(O_TRUNC).
    - GROW: If SEAL_GROW is set, the file in question cannot be increased
    in size. This affects ftruncate(), fallocate() and write().
    - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
    are possible. This affects fallocate(PUNCH_HOLE), mmap() and
    write().
    - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
    This basically prevents the F_ADD_SEAL operation on a file and
    can be set to prevent others from adding further seals that you
    don't want.

    The described use-cases can easily use these seals to provide safe use
    without any trust-relationship:

    1) The graphics server can verify that a passed file-descriptor has
    SEAL_SHRINK set. This allows safe scanout, while the client is
    allowed to increase buffer size for window-resizing on-the-fly.
    Concurrent writes are explicitly allowed.
    2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
    SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
    process can modify the data while the other side parses it.
    Furthermore, it guarantees that even with writable FDs passed to the
    peer, it cannot increase the size to hit memory-limits of the source
    process (in case the file-storage is accounted to the source).

    The new API is an extension to fcntl(), adding two new commands:
    F_GET_SEALS: Return a bitset describing the seals on the file. This
    can be called on any FD if the underlying file supports
    sealing.
    F_ADD_SEALS: Change the seals of a given file. This requires WRITE
    access to the file and F_SEAL_SEAL may not already be set.
    Furthermore, the underlying file must support sealing and
    there may not be any existing shared mapping of that file.
    Otherwise, EBADF/EPERM is returned.
    The given seals are _added_ to the existing set of seals
    on the file. You cannot remove seals again.

    The fcntl() handler is currently specific to shmem and disabled on all
    other files. A file needs to explicitly support sealing for this interface to
    work. A separate syscall is added in a follow-up, which creates files that
    support sealing. There is no intention to support this on other
    file-systems. Semantics are unclear for non-volatile files and we lack any
    use-case right now. Therefore, the implementation is specific to shmem.
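
    A minimal userspace sketch of the API, using the memfd_create()
    syscall added in the follow-up (F_SEAL_WRITE requires that no shared
    writable mapping of the file exists):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
          int fd = memfd_create("demo", MFD_ALLOW_SEALING);

          if (fd < 0) {
              perror("memfd_create");
              return 1;
          }
          ftruncate(fd, 4096);
          /* ... fill in the message, then freeze it: */
          if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW |
                                     F_SEAL_WRITE | F_SEAL_SEAL) < 0) {
              perror("F_ADD_SEALS");
              return 1;
          }
          /* A receiver of this fd can verify the guarantees first. */
          printf("seals: %#x\n", fcntl(fd, F_GET_SEALS));
          return 0;
      }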

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     

04 Apr, 2014

1 commit

  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Dec, 2013

1 commit

  • We have a problem where the big_key key storage implementation uses a
    shmem-backed inode to hold the key contents. Because of this
    implementation detail, LSM checks are being done between processes
    trying to read the keys and the tmpfs-backed inode. The LSM checks are
    already being handled at the key interface level and should not be
    enforced at the inode level (since the inode is an implementation
    detail, not a part of the security model).

    This patch implements a new function shmem_kernel_file_setup() which
    returns the equivalent of shmem_file_setup(), only the underlying inode
    has S_PRIVATE set. This means that all LSM checks for the inode in
    question are skipped. It should only be used for kernel internal
    operations where the inode is not exposed to userspace without proper
    LSM checking. It is possible that some other users of
    shmem_file_setup() should use the new interface, but this has not been
    explored.

    Reproducing this bug is a little bit difficult. The steps I used on
    Fedora are:

    (1) Turn off selinux enforcing:

    setenforce 0

    (2) Create a huge key

    k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s`

    (3) Access the key in another context:

    runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null

    (4) Examine the audit logs:

    ausearch -m AVC -i --subject httpd_t | audit2allow

    If the last command's output includes a line that looks like:

    allow httpd_t user_tmpfs_t:file { open read };

    There was an inode check between httpd and the tmpfs filesystem. With
    this patch no such denial will be seen. (NOTE! you should clear your
    audit log if you have tested for this previously)

    (Please return your box to enforcing.)

    Signed-off-by: Eric Paris
    Signed-off-by: David Howells
    cc: Hugh Dickins
    cc: linux-mm@kvack.org

    Eric Paris
     

25 Aug, 2012

1 commit

  • Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.

    $ size vmlinux.o              (before)
    text data bss dec hex filename
    4658782 880729 5195032 10734543 a3cbcf vmlinux.o
    $ size vmlinux.o              (after)
    text data bss dec hex filename
    4658957 880729 5195032 10734718 a3cc7e vmlinux.o

    v7:
    - checkpatch warnings fixed
    - Implement the changes requested by Hugh Dickins:
    - make simple_xattrs_init and simple_xattrs_free inline
    - get rid of locking and list reinitialization in simple_xattrs_free,
    they're not needed
    v6:
    - no changes
    v5:
    - no changes
    v4:
    - move simple_xattrs_free() to fs/xattr.c
    v3:
    - in kmem_xattrs_free(), reinitialize the list
    - use simple_xattr_* prefix
    - introduce simple_xattr_add() to prevent direct list usage

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Acked-by: Hugh Dickins
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     

16 May, 2012

1 commit


24 Jan, 2012

1 commit

  • Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
    find_get_pages") correctly fixed an infinite loop; but left a problem
    that find_get_pages() on shmem would return 0 (appearing to callers to
    mean end of tree) when it meets a run of nr_pages swap entries.

    The only uses of find_get_pages() on shmem are via pagevec_lookup(),
    called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
    scan_mapping_unevictable_pages(). The first is already commented, and
    not worth worrying about; but the second can leave pages on the
    Unevictable list after an unusual sequence of swapping and locking.

    Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
    swap) instead of pagevec_lookup().

    But I don't want to contaminate vmscan.c with shmem internals, nor
    shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
    shmem.c, renaming it shmem_unlock_mapping(); and rename
    check_move_unevictable_page() to check_move_unevictable_pages(), looping
    down an array of pages, oftentimes under the same lock.

    Leave out the "rotate unevictable list" block: that's a leftover from
    when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
    handling involved looking at pages at tail of LRU.

    Was there significance to the sequence first ClearPageUnevictable, then
    test page_evictable, then SetPageUnevictable here? I think not, we're
    under LRU lock, and have no barriers between those.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: [back to 3.1 but will need respins]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Jan, 2012

1 commit


04 Aug, 2011

4 commits

  • But we've not yet removed the old swp_entry_t i_direct[16] from
    shmem_inode_info. That's because it was still being shared with the
    inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
    size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

    I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
    rather than shmem_evict_inode(), where we usually do such freeing? I
    guess it doesn't matter, and I'm not into NUMA mpol testing right now.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
    had to move swappage to filecache with GFP_NOWAIT.

    Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
    moving its call out from shmem_add_to_page_cache() to two of its three
    callers. But leave it doing mem_cgroup_uncharge_cache_page() on error:
    although asymmetrical, it's easier for all 3 callers to handle.

    These two changes would also be appropriate if anyone were to start
    using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

    Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
    radix_tree_exceptional_entry() to get what it needs for itself.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While it's at its least, make a number of boring nitpicky cleanups to
    shmem.c, mostly for consistency of variable naming. Things like "swap"
    instead of "entry", "pgoff_t index" instead of "unsigned long idx".

    And since everything else here is prefixed "shmem_", better change
    init_tmpfs() to shmem_init().

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The maximum size of a shmem/tmpfs file has been limited by the maximum
    size of its triple-indirect swap vector. With 4kB page size, maximum
    filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
    that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
    over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
    MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

    It's a shame that tmpfs should be more restrictive than ramfs, and this
    limitation has now been noticed. Add another level to the swap vector?
    No, it became obscure and hard to maintain, once I complicated it to
    make use of highmem pages nine years ago: better choose another way.

    Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
    tmpfs would never have invented its own peculiar radix tree: we would
    have fitted swap entries into the common radix tree instead, in much the
    same way as we fit swap entries into page tables.

    And why should each file have a separate radix tree for its pages and
    for its swap entries? The swap entries are required precisely where and
    when the pages are not. We want to put them together in a single radix
    tree: which can then avoid much of the locking which was needed to
    prevent them from being exchanged underneath us.

    This also avoids the waste of memory devoted to swap vectors, first in
    the shmem_inode itself, then at least two more pages once a file grew
    beyond 16 data pages (pages accounted by df and du, but not by memcg).
    Allocated upfront, to avoid allocation when under swapping pressure, but
    pure waste when CONFIG_SWAP is not set - I have never spattered around
    the ifdefs to prevent that, preferring this move to sharing the common
    radix tree instead.

    There are three downsides to sharing the radix tree. One, that it binds
    tmpfs more tightly to the rest of mm, either requiring knowledge of swap
    entries in radix tree there, or duplication of its code here in shmem.c.
    I believe that the simplifications and memory savings (and probable higher
    performance, not yet measured) justify that.

    Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
    nodes that cannot be freed under memory pressure - whereas before it was
    the less precious highmem swap vector pages that could not be freed.
    I'm hoping that 64-bit has now been accessible for long enough, that the
    highmem argument has grown much less persuasive.

    Three, that swapoff is slower than it used to be on tmpfs files, since
    it's using a simple generic mechanism not tailored to it: I find this
    noticeable, and shall want to improve, but maybe nobody else will
    notice.

    So... now remove most of the old swap vector code from shmem.c. But,
    for the moment, keep the simple i_direct vector of 16 pages, with simple
    accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
    to help mark where swap needs to be handled in subsequent patches.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Jun, 2011

3 commits

  • Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
    is unsuited to tmpfs, because it inserts a page into pagecache before
    calling the filesystem's ->readpage: tmpfs may have pages in swapcache
    which only it knows how to locate and switch to filecache.

    At present tmpfs provides a ->readpage method, and copes with this by
    copying pages; but soon we can simplify it by removing its ->readpage.
    Provide shmem_read_mapping_page_gfp() now, ready for that transition.

    Export shmem_read_mapping_page_gfp() and add it to the list in shmem_fs.h,
    with shmem_read_mapping_page() inline for the common mapping_gfp case.
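
    The inline for the common case is essentially just:

      static inline struct page *shmem_read_mapping_page(
                      struct address_space *mapping, pgoff_t index)
      {
          return shmem_read_mapping_page_gfp(mapping, index,
                                             mapping_gfp_mask(mapping));
      }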

    (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
    read_mapping_page functions use the mapping's ->readpage, and the
    read_cache_page functions use the supplied filler, so I think
    read_cache_page_gfp was slightly misnamed.)

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 2.6.35's new truncate convention gave tmpfs the opportunity to control
    its file truncation, no longer enforced from outside by vmtruncate().
    We shall want to build upon that, to handle pagecache and swap together.

    Slightly redefine the ->truncate_range interface: let it now be called
    between the unmap_mapping_range()s, with the filesystem responsible for
    doing the truncate_inode_pages_range() from it - just as the filesystem
    is nowadays responsible for doing that from its ->setattr.

    Let's rename shmem_notify_change() to shmem_setattr(). Instead of
    calling the generic truncate_setsize(), bring that code in so we can
    call shmem_truncate_range() - which will later be updated to perform its
    own variant of truncate_inode_pages_range().

    Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
    now that the COW's unmap_mapping_range() comes after ->truncate_range,
    there is no need to call it a third time.

    Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
    that i915_gem_object_truncate() can call it explicitly in future; get
    this patch in first, then update drm/i915 once this is available (until
    then, i915 will just be doing the truncate_inode_pages() twice).

    Though introduced five years ago, no other filesystem is implementing
    ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
    expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
    whereupon ->truncate_range can be removed from inode_operations -
    shmem_truncate_range() will help i915 across that transition too.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins