07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • When putback_lru_page() races with shmem_lock() called with lock=0,
    execution proceeds in the order shown below; but the clear_bit in
    processor #1 could be reordered to just before processor #1's
    spin_unlock. The page would then be stranded on the unevictable list.

    processor #0                            processor #1
    (putback_lru_page)                      (shmem_lock, lock=0)

    spin_lock
    SetPageLRU
    spin_unlock
                                            clear_bit(AS_UNEVICTABLE)
                                            spin_lock
                                            if PageLRU()
                                                if !test_bit(AS_UNEVICTABLE)
                                                    move evictable list
    smp_mb
    if !test_bit(AS_UNEVICTABLE)
        move evictable list
                                            spin_unlock

    In practice, pagevec_lookup() in scan_mapping_unevictable_pages() takes
    rcu_read_[un]lock(), which happens to prevent that reordering before
    test_bit(AS_UNEVICTABLE) is reached on processor #1, so the problem
    never actually occurs. But that is an unintended side effect, and we
    should solve the problem properly.

    This patch adds a barrier after mapping_clear_unevictable.

    I never hit this problem in practice; I just found it during review.
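
    A sketch of shmem_lock()'s unlock path with the barrier added (the
    surrounding code is assumed from kernels of this era, not quoted from
    the patch):

        if (!lock && (info->flags & VM_LOCKED) && user) {
                user_shm_unlock(inode->i_size, user);
                info->flags &= ~VM_LOCKED;
                mapping_clear_unevictable(file->f_mapping);
                /*
                 * Ensure clear_bit(AS_UNEVICTABLE) is visible before the
                 * scan below, so a concurrent putback_lru_page() cannot
                 * recheck the bit too early and strand pages on the
                 * unevictable list.
                 */
                smp_mb();
                scan_mapping_unevictable_pages(file->f_mapping);
        }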

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     


04 Aug, 2011

11 commits

  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.
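
    The rearranged lookup pattern in filemap.c then reads roughly like this
    (a sketch of the shape described, not a verbatim quote of the patch):

        page = radix_tree_deref_slot(pagep);
        if (unlikely(!page))
                goto repeat;
        if (radix_tree_exception(page)) {
                if (radix_tree_deref_retry(page))
                        /* transient state: slot moved, restart the lookup */
                        goto repeat;
                /*
                 * Otherwise a shmem/tmpfs exceptional entry: a swap entry
                 * is standing in for a swapped-out page.
                 */
                goto out;
        }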

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have already acknowledged that swapoff of a tmpfs file is slower than
    it was before conversion to the generic radix_tree: a little slower
    there will be acceptable, if the hotter paths are faster.

    But it was a shock to find swapoff of a 500MB file 20 times slower on my
    laptop, taking 10 minutes; and at that rate it significantly slows down
    my testing.

    Now, most of that turned out to be overhead from PROVE_LOCKING and
    PROVE_RCU: without those it was only 4 times slower than before; and
    more realistic tests on other machines don't fare as badly.

    I've tried a number of things to improve it, including tagging the swap
    entries, then doing lookup by tag: I'd expected that to halve the time,
    but in practice it's erratic, and often counter-productive.

    The only change I've so far found to make a consistent improvement, is
    to short-circuit the way we go back and forth, gang lookup packing
    entries into the array supplied, then shmem scanning that array for the
    target entry. Scanning in place doubles the speed, so it's now only
    twice as slow as before (or three times slower when the PROVEs are on).

    So, add radix_tree_locate_item() as an expedient, once-off,
    single-caller hack to do the lookup directly in place. #ifdef it on
    CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
    applicability as save space in other configurations. And, sadly,
    #include sched.h for cond_resched().
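
    The expedient's interface is tiny (a sketch of the declaration, guarded
    as described above):

        #if defined(CONFIG_SHMEM) && defined(CONFIG_SWAP)
        /*
         * Scan the radix tree nodes in place for item: returns the index
         * at which it was found, or -1 if not present.
         */
        unsigned long radix_tree_locate_item(struct radix_tree_root *root,
                                             void *item);
        #endif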

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • But we've not yet removed the old swp_entry_t i_direct[16] from
    shmem_inode_info. That's because it was still being shared with the
    inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
    size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
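
    The short-symlink path in shmem_symlink() then becomes roughly (a
    sketch; SHORT_SYMLINK_LEN assumed to be the 128 mentioned above):

        if (len <= SHORT_SYMLINK_LEN) {
                /* keep the target in a small kmalloc'ed copy */
                info->symlink = kmemdup(symname, len, GFP_KERNEL);
                if (!info->symlink) {
                        iput(inode);
                        return -ENOMEM;
                }
                inode->i_op = &shmem_short_symlink_operations;
        } else {
                /* longer targets still go through the page cache */
        }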

    I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
    rather than shmem_evict_inode(), where we usually do such freeing? I
    guess it doesn't matter, and I'm not into NUMA mpol testing right now.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
    shmem_radix_tree_replace() to substitute swap entry for page pointer
    atomically in the radix tree.

    As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
    copying such code from delete_from_swap_cache, but again judged easier
    to sell than making its other callers go through the extras.

    Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
    now unreferenced, and the hack to disable swap: it's now good to go.

    The way things have worked out, info->lock no longer helps to guard the
    shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
    That global mutex exclusion between shmem_writepage() and shmem_unuse()
    is not pretty, and we ought to find another way; but it's been forced on
    us by recent race discoveries, not a consequence of this patchset.

    And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
    swap entry was found already present? That's no longer possible, the
    (unknown) one inserting this page into filecache would hit the swap
    entry occupying that slot.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
    had to move swappage to filecache with GFP_NOWAIT.

    Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
    moving its call out from shmem_add_to_page_cache() to two of its three
    callers. But leave it doing mem_cgroup_uncharge_cache_page() on error:
    although asymmetrical, it's easier for all three callers to handle.

    These two changes would also be appropriate if anyone were to start
    using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

    Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
    radix_tree_exceptional_entry() to get what it needs for itself.
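
    mc_handle_file_pte() can then recognize a swapped-out tmpfs page on its
    own, along these lines (a sketch; radix_to_swp_entry() is the helper
    this series introduces for unpacking exceptional entries):

        page = find_get_page(mapping, pgoff);
        if (radix_tree_exceptional_entry(page)) {
                /* shmem/tmpfs stored a swap entry where the page would be */
                swp_entry_t swap = radix_to_swp_entry(page);
                if (do_swap_account)
                        *entry = swap;
                page = find_get_page(&swapper_space, swap.val);
        }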

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
    swap entry returned from radix tree by find_lock_page().

    Whereas the repetitive old method proceeded mainly under info->lock,
    dropping and repeating whenever one of the conditions needed was not
    met, now we can proceed without it, leaving shmem_add_to_page_cache() to
    check for a race.

    This way there is no need to preallocate a page, no need for an early
    radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

    Move the error unwinding down to the bottom instead of repeating it
    throughout. ENOSPC handling is a little different from before: there is
    no longer any race between find_lock_page() and finding swap, but we can
    arrive at ENOSPC before calling shmem_recalc_inode(), which might
    occasionally discover freed space.

    Be stricter about checking i_size before returning. info->lock is used
    for little but the alloced, swapped and i_blocks updates. Move the
    i_blocks updates out from under the max_blocks check, so even an
    unlimited size=0 mount can show accurate du.
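
    At its head, the rewritten shmem_getpage_gfp() splits the two cases
    roughly like so (a sketch of the shape described):

        page = find_lock_page(mapping, index);
        if (radix_tree_exceptional_entry(page)) {
                /* not a struct page: a swap entry occupies the slot */
                swap = radix_to_swp_entry(page);
                page = NULL;
        }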

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
    tree, searching for matching swap.

    This is somewhat slower than the old method: because of repeated radix
    tree descents, because of copying entries up, but probably most because
    the old method noted and skipped once a vector page was cleared of swap.
    Perhaps we can devise a use of radix tree tagging to achieve that later.

    shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
    for the lockless lookup by checking that the expected entry is in place,
    under lock. It is not very satisfactory to be copying this much from
    add_to_page_cache_locked(), but I think easier to sell than insisting
    that every caller of add_to_page_cache*() go through the extras.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Disable the toy swapping implementation in shmem_writepage() - it's hard
    to support two schemes at once - and convert shmem_truncate_range() to a
    lockless gang lookup of swap entries along with pages, freeing both.

    Since the second loop tightens its noose until all entries of either
    kind have been squeezed out (and we shall make sure that there's not an
    instant when neither is visible), there is no longer a need for yet
    another pass below.

    shmem_radix_tree_replace() compensates for the lockless lookup by
    checking that the expected entry is in place, under lock, before
    replacing it. Here it just deletes, but will be used in later patches
    to substitute swap entry for page or page for swap entry.
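
    A sketch of shmem_radix_tree_replace() as described (the protected
    slot-deref helper is assumed from the same kernel era):

        static int shmem_radix_tree_replace(struct address_space *mapping,
                        pgoff_t index, void *expected, void *replacement)
        {
                void **pslot;
                void *item;

                pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
                if (!pslot)
                        return -ENOENT;
                item = radix_tree_deref_slot_protected(pslot,
                                                &mapping->tree_lock);
                /* the lockless lookup was raced: tell caller to back out */
                if (item != expected)
                        return -ENOENT;
                if (replacement)
                        radix_tree_replace_slot(pslot, replacement);
                else
                        radix_tree_delete(&mapping->page_tree, index);
                return 0;
        }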

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Bring truncate.c's code for truncate_inode_pages_range() inline into
    shmem_truncate_range(), replacing its first call (there's a followup
    call below, but leave that one, it will disappear next).

    Don't play with it yet, apart from leaving out the cleancache flush, and
    (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
    partial page preparation into its partial page handling.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While it's at its least, make a number of boring nitpicky cleanups to
    shmem.c, mostly for consistency of variable naming. Things like "swap"
    instead of "entry", "pgoff_t index" instead of "unsigned long idx".

    And since everything else here is prefixed "shmem_", better change
    init_tmpfs() to shmem_init().

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The maximum size of a shmem/tmpfs file has been limited by the maximum
    size of its triple-indirect swap vector. With 4kB page size, maximum
    filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
    that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
    over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
    MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

    It's a shame that tmpfs should be more restrictive than ramfs, and this
    limitation has now been noticed. Add another level to the swap vector?
    No, it became obscure and hard to maintain, once I complicated it to
    make use of highmem pages nine years ago: better choose another way.

    Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
    tmpfs would never have invented its own peculiar radix tree: we would
    have fitted swap entries into the common radix tree instead, in much the
    same way as we fit swap entries into page tables.

    And why should each file have a separate radix tree for its pages and
    for its swap entries? The swap entries are required precisely where and
    when the pages are not. We want to put them together in a single radix
    tree: which can then avoid much of the locking which was needed to
    prevent them from being exchanged underneath us.

    This also avoids the waste of memory devoted to swap vectors, first in
    the shmem_inode itself, then at least two more pages once a file grew
    beyond 16 data pages (pages accounted by df and du, but not by memcg).
    Allocated upfront, to avoid allocation when under swapping pressure, but
    pure waste when CONFIG_SWAP is not set - I have never spattered around
    the ifdefs to prevent that, preferring this move to sharing the common
    radix tree instead.

    There are three downsides to sharing the radix tree. One, that it binds
    tmpfs more tightly to the rest of mm, either requiring knowledge of swap
    entries in radix tree there, or duplication of its code here in shmem.c.
    I believe that the simplifications and memory savings (and probable higher
    performance, not yet measured) justify that.

    Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
    nodes that cannot be freed under memory pressure - whereas before it was
    the less precious highmem swap vector pages that could not be freed.
    I'm hoping that 64-bit has now been accessible for long enough, that the
    highmem argument has grown much less persuasive.

    Three, that swapoff is slower than it used to be on tmpfs files, since
    it's using a simple generic mechanism not tailored to it: I find this
    noticeable, and shall want to improve, but maybe nobody else will
    notice.

    So... now remove most of the old swap vector code from shmem.c. But,
    for the moment, keep the simple i_direct vector of 16 pages, with simple
    accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
    to help mark where swap needs to be handled in subsequent patches.
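
    The toy accessors are as simple as they sound (a sketch matching the
    description; SHMEM_NR_DIRECT is the 16 retained above):

        static void shmem_put_swap(struct shmem_inode_info *info,
                                   pgoff_t index, swp_entry_t swap)
        {
                if (index < SHMEM_NR_DIRECT)
                        info->i_direct[index] = swap;
        }

        static swp_entry_t shmem_get_swap(struct shmem_inode_info *info,
                                          pgoff_t index)
        {
                return index < SHMEM_NR_DIRECT ?
                        info->i_direct[index] : (swp_entry_t){0};
        }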

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Jul, 2011

10 commits

  • * Merge akpm patch series: (122 commits)
    drivers/connector/cn_proc.c: remove unused local
    Documentation/SubmitChecklist: add RCU debug config options
    reiserfs: use hweight_long()
    reiserfs: use proper little-endian bitops
    pnpacpi: register disabled resources
    drivers/rtc/rtc-tegra.c: properly initialize spinlock
    drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
    drivers/rtc: add support for Qualcomm PMIC8xxx RTC
    drivers/rtc/rtc-s3c.c: support clock gating
    drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
    init: skip calibration delay if previously done
    misc/eeprom: add eeprom access driver for digsy_mtc board
    misc/eeprom: add driver for microwire 93xx46 EEPROMs
    checkpatch.pl: update $logFunctions
    checkpatch: make utf-8 test --strict
    checkpatch.pl: add ability to ignore various messages
    checkpatch: add a "prefer __aligned" check
    checkpatch: validate signature styles and To: and Cc: lines
    checkpatch: add __rcu as a sparse modifier
    checkpatch: suggest using min_t or max_t
    ...

    Did this as a merge because of (trivial) conflicts in
    - Documentation/feature-removal-schedule.txt
    - arch/xtensa/include/asm/uaccess.h
    that were just easier to fix up in the merge than in the patch series.

    Linus Torvalds
     
  • shmem_unuse_inode() and shmem_writepage() contain a little code to cope
    with pages inserted independently into the filecache, probably by a
    filesystem stacked on top of tmpfs, then fed to its ->readpage() or
    ->writepage().

    Unionfs was indeed experimenting with working in that way three years ago,
    but I find no current examples: nowadays the stacking filesystems use vfs
    interfaces to the lower filesystem.

    It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
    filepage passed in via shmem_readpage(), then swappage found, which must
    then be copied over to it.

    Although at first it's tempting to replace the **pagep arg by returning
    struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
    callers, so leave as is.

    Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
    complication came from uninitialized pages inserted into filecache prior
    to readpage; but now we're in control, and only release pagelock on
    filecache once it's uptodate (if an error occurs in reading back from
    swap, the page remains in swapcache, never moved to filecache).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
    complicated: first simplify that before going on to filepage/swappage.

    That's right, don't report ENOMEM when the preallocation fails: we may or
    may not need the page. But simply report ENOMEM once we find we do need
    it, instead of dropping lock, repeating allocation, unwinding on failure
    etc. And leave the out label on the fast path, don't goto.

    Fix something that looks like a bug but turns out not to be: set
    PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
    the removed case was doing. That's important before adding to LRU (it
    determines which LRU the page goes on), and does affect which path it
    takes through memcontrol.c, but in the end MEM_CGROUP_CHARGE_TYPE_SHMEM
    is handled no differently from CACHE.

    Signed-off-by: Hugh Dickins
    Acked-by: Shaohua Li
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove that pernicious shmem_readpage() at last: the things we needed it
    for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
    shmem_file_splice_read() and shmem_read_mapping_page_gfp().

    This removal clears the way for a simpler shmem_getpage_gfp(), since page
    is never passed in; but leave most of that cleanup until after.

    sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now return
    EINVAL, instead of unexpectedly trying to read ahead on tmpfs: if that
    proves to be an issue for someone, then we can either arrange for them to
    return success instead, or try to implement async readahead on tmpfs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
    shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().

    Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
    CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().

    Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
    though we might later want to let that case route to read_cache_page_gfp().

    It annoys me to have these two almost-redundant args, gfp and fault_type:
    I can't find a better way; but initialize fault_type only in shmem_fault().

    Note that before, read_cache_page_gfp() was allocating i915_gem's pages
    with __GFP_NORETRY as intended; but the corresponding swap vector pages
    got allocated without it, leaving a small possibility of OOM.
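
    The wrapper itself is a one-liner (a sketch of the shape described):

        static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                        struct page **pagep, enum sgp_type sgp,
                        int *fault_type)
        {
                return shmem_getpage_gfp(inode, index, pagep, sgp,
                                mapping_gfp_mask(inode->i_mapping),
                                fault_type);
        }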

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up shmem_file_splice_read():

    Remove readahead: okay, we could implement shmem readahead on swap,
    but have never done so before, swap being the slow exceptional path.

    Use shmem_getpage() instead of find_or_create_page() plus ->readpage().

    Remove several comments: sorry, I found them more distracting than
    helpful, and this will not be the reference version of splice_read().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Copy __generic_file_splice_read() and generic_file_splice_read() from
    fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
    page_cache_pipe_buf_ops and spd_release_page() accessible to it.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 2.6.36's commit 7e496299d4d2 ("tmpfs: make tmpfs scalable with
    percpu_counter for used blocks") used inode->i_lock in place of
    sbinfo->stat_lock around i_blocks updates; but that was adverse to
    scalability, and unnecessary, since info->lock is already held there in
    the fast paths.

    Remove those uses of i_lock, and add info->lock in the three error paths
    where it's then needed across shmem_free_blocks(). It's not actually
    needed across shmem_unacct_blocks(), but they're so often paired that it
    looks wrong to split them apart.
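
    After the change, shmem_free_blocks() updates i_blocks with no i_lock,
    relying on callers holding info->lock (a sketch under that assumption,
    not a verbatim quote of the patch):

        static void shmem_free_blocks(struct inode *inode, long pages)
        {
                struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

                if (sbinfo->max_blocks) {
                        percpu_counter_add(&sbinfo->used_blocks, -pages);
                        /* i_lock gone: callers hold info->lock here */
                        inode->i_blocks -= pages * BLOCKS_PER_PAGE;
                }
        }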

    Signed-off-by: Hugh Dickins
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace the ->check_acl method with a ->get_acl method that simply reads an
    ACL from disk after having a cache miss. This means we can replace the ACL
    checking boilerplate code with a single implementation in namei.c.
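
    The new hook slots into inode_operations in place of ->check_acl (a
    sketch of its shape in this era):

        struct posix_acl *(*get_acl)(struct inode *inode, int type);

    On a cache miss, namei.c calls ->get_acl, caches the result, and does
    the POSIX ACL permission check itself.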

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

19 Jul, 2011

1 commit

  • This patch changes the security_inode_init_security API by adding a
    filesystem specific callback to write security extended attributes.
    This change is in preparation for supporting the initialization of
    multiple LSM xattrs and the EVM xattr. Initially the callback function
    walks an array of xattrs, writing each xattr separately, but could be
    optimized to write multiple xattrs at once.

    For existing security_inode_init_security() calls, which have not yet
    been converted to use the new callback function, such as those in
    reiserfs and ocfs2, this patch defines security_old_inode_init_security().
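
    The callback and the xattr array element it walks look roughly like
    this (a sketch; field and typedef names assumed from this era):

        struct xattr {
                char *name;
                void *value;
                size_t value_len;
        };

        typedef int (*initxattrs)(struct inode *inode,
                        const struct xattr *xattr_array, void *fs_info);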

    Signed-off-by: Mimi Zohar

    Mimi Zohar
     

28 Jun, 2011

2 commits

  • Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
    is unsuited to tmpfs, because it inserts a page into pagecache before
    calling the filesystem's ->readpage: tmpfs may have pages in swapcache
    which only it knows how to locate and switch to filecache.

    At present tmpfs provides a ->readpage method, and copes with this by
    copying pages; but soon we can simplify it by removing its ->readpage.
    Provide shmem_read_mapping_page_gfp() now, ready for that transition.

    Export shmem_read_mapping_page_gfp() and add it to the list in shmem_fs.h,
    with shmem_read_mapping_page() inline for the common mapping_gfp case.

    (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
    read_mapping_page functions use the mapping's ->readpage, and the
    read_cache_page functions use the supplied filler, so I think
    read_cache_page_gfp was slightly misnamed.)
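
    The common-case inline just supplies the mapping's own gfp mask (a
    sketch of the shape described):

        static inline struct page *shmem_read_mapping_page(
                        struct address_space *mapping, pgoff_t index)
        {
                return shmem_read_mapping_page_gfp(mapping, index,
                                mapping_gfp_mask(mapping));
        }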

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 2.6.35's new truncate convention gave tmpfs the opportunity to control
    its file truncation, no longer enforced from outside by vmtruncate().
    We shall want to build upon that, to handle pagecache and swap together.

    Slightly redefine the ->truncate_range interface: let it now be called
    between the unmap_mapping_range()s, with the filesystem responsible for
    doing the truncate_inode_pages_range() from it - just as the filesystem
    is nowadays responsible for doing that from its ->setattr.

    Let's rename shmem_notify_change() to shmem_setattr(). Instead of
    calling the generic truncate_setsize(), bring that code in so we can
    call shmem_truncate_range() - which will later be updated to perform its
    own variant of truncate_inode_pages_range().

    Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
    now that the COW's unmap_mapping_range() comes after ->truncate_range,
    there is no need to call it a third time.

    Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
    that i915_gem_object_truncate() can call it explicitly in future; get
    this patch in first, then update drm/i915 once this is available (until
    then, i915 will just be doing the truncate_inode_pages() twice).

    Though introduced five years ago, no other filesystem is implementing
    ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
    expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
    whereupon ->truncate_range can be removed from inode_operations -
    shmem_truncate_range() will help i915 across that transition too.
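
    Under the new convention, a vmtruncate_range()-style caller looks
    roughly like this (a sketch; locking details assumed):

        mutex_lock(&inode->i_mutex);
        unmap_mapping_range(mapping, offset, end - offset, 0);
        /* the filesystem now does truncate_inode_pages_range() itself */
        inode->i_op->truncate_range(inode, offset, end);
        /* catch pages faulted or COWed back in after the first unmap */
        unmap_mapping_range(mapping, offset, end - offset, 0);
        mutex_unlock(&inode->i_mutex);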

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 May, 2011

1 commit

  • While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging
    (interruptibly), repeatedly failing to locate the owner of a 0xff entry in
    the swap_map.

    Although shmem_writepage() does abandon when it sees incoming page index
    is beyond eof, there was still a window in which shmem_truncate_range()
    could come in between writepage's dropping lock and updating swap_map,
    find the half-completed swap_map entry, and in trying to free it,
    leave it in a state that swap_shmem_alloc() could not correct.

    Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling
    of the different cases, but easiest to fix by moving swap_shmem_alloc()
    under cover of the lock.

    More interesting than the bug: it's been there since 2.6.33, why could
    I not see it with earlier kernels? The mmotm of two weeks ago seems to
    have some magic for generating races, this is just one of three I found.

    With yesterday's git I first saw this in mainline, bisected in search of
    that magic, but the easy reproducibility evaporated. Oh well, fix the bug.

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 May, 2011

1 commit

  • Add two new stats to per-memcg memory.stat, tracking the number of page
    faults and the number of major page faults.

    "pgfault"
    "pgmajfault"

    They differ from the "pgpgin"/"pgpgout" stats, which count the number of
    pages charged/discharged to the cgroup and say nothing about reading or
    writing pages to disk.

    It is valuable to track these two stats, both for measuring an
    application's performance and for gauging the efficiency of the kernel
    page reclaim path. Counting page faults per process is useful, but we
    also need the aggregated value, since processes are monitored and
    controlled on a per-cgroup basis in memcg.
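
    The accounting hook called from the fault paths is a one-liner (a
    sketch; this is the per-memcg counterpart of count_vm_event()):

        /* in the major-fault paths of do_swap_page() and filemap_fault() */
        mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);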

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads, faulting
    in 15G of anon pages in a 16G container. No regression was noticed in the
    "flt/cpu/s" figure.

    Sample output from pft:

    TAG pft:anon-sys-default:
      Gb  Thr  CLine  User   System   Wall    flt/cpu/s  fault/wsec
      15  16   1      0.67s  233.41s  14.76s  16798.546  266356.260

    +------------------------------------------------------------------+
        N        Min        Max     Median        Avg     Stddev
    x  10  16682.962  17344.027  16913.524  16928.812  166.5362
    +  10  16695.568  16923.896  16820.604  16824.652  84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

25 May, 2011

1 commit

  • Implement generic xattrs for tmpfs filesystems. The Fedora project, while
    trying to replace suid apps with file capabilities, realized that tmpfs,
    which is used on its build systems, does not support file capabilities and
    thus cannot be used to build packages which use file capabilities. Xattrs
    are also needed for overlayfs.

    The xattr interface is a bit odd. If a filesystem does not implement any
    {get,set,list}xattr functions the VFS will call into some random LSM hooks
    and the running LSM can then implement some method for handling xattrs.
    SELinux for example provides a method to support security.selinux but no
    other security.* xattrs.

    As it stands today, when one enables CONFIG_TMPFS_POSIX_ACL, tmpfs gets
    xattr handler routines specifically to handle ACLs. Because of this,
    tmpfs would lose the VFS/LSM helpers that support the running LSM. To
    make up for that, tmpfs had stub functions that did nothing but call into
    the LSM hooks which implement the helpers.

    This new patch does not use the LSM fallback functions and instead just
    implements a native get/set/list xattr feature for the full security.* and
    trusted.* namespace like a normal filesystem. This means that tmpfs can
    now support both security.selinux and security.capability, which was not
    previously possible.

    The basic implementation is that I attach a:

    struct shmem_xattr {
            struct list_head list;  /* anchored by shmem_inode_info->xattr_list */
            char *name;
            size_t size;
            char value[0];
    };

    Into the struct shmem_inode_info for each xattr that is set. This
    implementation could easily support the user.* namespace as well, except
    some care needs to be taken to prevent large amounts of unswappable memory
    being allocated for unprivileged users.

    [mszeredi@suse.cz: new config option, support trusted.*, support symlinks]
    Signed-off-by: Eric Paris
    Signed-off-by: Miklos Szeredi
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Cc: Kyle McMartin
    Acked-by: Hugh Dickins
    Tested-by: Jordi Pujol
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     

21 May, 2011

1 commit

  • Commit 778dd893ae78 ("tmpfs: fix race between umount and swapoff")
    forgot the new rules for strict atomic kmap nesting, causing

    WARNING: at arch/x86/mm/highmem_32.c:81

    from __kunmap_atomic(), then

    BUG: unable to handle kernel paging request at fffb9000

    from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with
    highmem in use. My disgrace again.

    See
    https://bugzilla.kernel.org/show_bug.cgi?id=35352
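
    The strict-nesting rule: stack-based atomic kmaps must now be released
    in LIFO order (a sketch, using the KM-slot API still current then):

        ptr1 = kmap_atomic(page1, KM_USER0);
        ptr2 = kmap_atomic(page2, KM_USER1);
        /* ... use ptr1 and ptr2 ... */
        kunmap_atomic(ptr2, KM_USER1);  /* innermost mapping first */
        kunmap_atomic(ptr1, KM_USER0);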

    Reported-by: Witold Baryluk
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 May, 2011

1 commit

  • Shame on me! Commit b1dea800ac39 "tmpfs: fix race between umount and
    writepage" fixed the advertised race, but introduced another: as even
    its comment makes clear, we cannot safely rely on a peek at list_empty()
    while holding no lock - until info->swapped is set, shmem_unuse_inode()
    may delete any formerly-swapped inode from the shmem_swaplist, which
    in this case would leave a swap area impossible to swapoff.

    Although I don't relish taking the mutex every time, I don't care much
    for the alternatives either; and at least the peek at list_empty() in
    shmem_evict_inode() (a hotter path since most inodes would never have
    been swapped) remains safe, because we already truncated the whole file.
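
    So shmem_writepage() now takes the mutex before peeking at the list (a
    sketch under that description; surrounding details omitted):

        mutex_lock(&shmem_swaplist_mutex);
        /* safe: holders of the mutex cannot race our list_empty() peek */
        if (list_empty(&info->swaplist))
                list_add_tail(&info->swaplist, &shmem_swaplist);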

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins