30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
      since almost all users will want to have them together (see the sketch below)
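
    A minimal, self-contained sketch of the resulting shape (the flag values
    and the helper body are illustrative, not copied from the kernel headers):

    #define BDI_CAP_NO_ACCT_DIRTY   0x00000001
    #define BDI_CAP_NO_WRITEBACK    0x00000002
    #define BDI_CAP_NO_ACCT_WB      0x00000004
    #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
            (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)

    struct bdi_sketch {
            unsigned int capabilities;
    };

    /* writeback stats are accounted only when neither "no writeback"
     * nor "no writeback accounting" is set on the bdi */
    static inline int bdi_cap_account_writeback_sketch(const struct bdi_sketch *bdi)
    {
            return !(bdi->capabilities &
                     (BDI_CAP_NO_ACCT_WB | BDI_CAP_NO_WRITEBACK));
    }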

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Apr, 2008

9 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().
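
    The reference-under-lock idea can be shown with a small standalone sketch;
    a pthread mutex stands in for the sb stat_lock and a plain counter for the
    mempolicy refcount, so the names and types below are illustrative only:

    #include <pthread.h>

    struct mempolicy_sketch {
            int refcnt;
    };

    struct shmem_sb_sketch {
            pthread_mutex_t stat_lock;
            struct mempolicy_sketch *mpol;
    };

    static struct mempolicy_sketch *get_sbmpol(struct shmem_sb_sketch *sb)
    {
            struct mempolicy_sketch *mpol;

            pthread_mutex_lock(&sb->stat_lock);
            mpol = sb->mpol;
            if (mpol)
                    mpol->refcnt++;     /* reference dropped later by the consumer */
            pthread_mutex_unlock(&sb->stat_lock);
            return mpol;
    }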

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().

    Note 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy reference counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of mempolicy, based on (vma, address) need only add a reference for
    shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.
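
    The conditional put/copy pair can be sketched with standalone types; the
    flag bit and the refcount handling below are simplified stand-ins for the
    kernel's MPOL_F_SHARED and mpol_put(), not the real definitions:

    #define MPOL_F_SHARED_SKETCH    (1 << 0)

    struct mempolicy_sketch {
            int refcnt;
            unsigned short flags;
    };

    /* drop the reference only if this is a shared (conditionally counted) policy */
    static inline void mpol_cond_put_sketch(struct mempolicy_sketch *pol)
    {
            if (pol && (pol->flags & MPOL_F_SHARED_SKETCH))
                    pol->refcnt--;              /* stands in for the real unref */
    }

    /* shallow-copy a shared policy onto the stack and clear the shared flag,
     * so the copy never needs a conditional free; assumes the policy holds
     * no pointers that would need duplicating, as noted above */
    static inline struct mempolicy_sketch *
    mpol_cond_copy_sketch(struct mempolicy_sketch *to, struct mempolicy_sketch *from)
    {
            if (!from || !(from->flags & MPOL_F_SHARED_SKETCH))
                    return from;
            *to = *from;
            to->flags &= ~MPOL_F_SHARED_SKETCH;
            return to;
    }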

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:

    Setting a valid flag works OK:
    #mount -o remount,mpol=bind=static:1-2 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
    ...

    However, we can't remove them or change them, once we've
    set a valid flag:

    #mount -o remount,mpol=bind:1-2 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
    ...

    It SAYS it removed it, but that's just a copy of the input
    string. If we now try to set it to a different flag, we
    get:

    #mount -o remount,mpol=bind=relative:1-2 /dev/shm
    mount: /dev/shm not mounted already, or bad option

    And on the console, we see:
    tmpfs: Bad value 'bind' for mount option 'mpol'
    ^ lost remainder of string

    Furthermore, bogus flags are accepted without error.
    Granted, they are a no-op:

    #mount -o remount,mpol=interleave=foo:0-3 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)

    Again, that's just a copy of the input string shown by the mount command.

    This patch fixes the behavior by pre-zeroing the flags so that only one of the
    mutually exclusive flags can be set at one time. It also reports an error
    when an unrecognized flag is specified.

    The check for both flags being set is removed because it can't happen with
    this implementation. If we ever want to support multiple non-exclusive flags,
    this area will need rework and we will need to check that any mutually
    exclusive flags aren't specified.
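
    A standalone sketch of the fixed flag handling; the flag names and bit
    values are illustrative, not the kernel's:

    #include <string.h>

    static int parse_mpol_flags(const char *flagstr, unsigned short *flags)
    {
            *flags = 0;                 /* pre-zero: a stale flag can't survive a remount */
            if (!flagstr)
                    return 0;           /* no "=flag" suffix given */
            if (!strcmp(flagstr, "static"))
                    *flags |= 1 << 0;   /* stands in for MPOL_F_STATIC_NODES */
            else if (!strcmp(flagstr, "relative"))
                    *flags |= 1 << 1;   /* stands in for MPOL_F_RELATIVE_NODES */
            else
                    return -1;          /* unrecognized flag: report an error */
            return 0;
    }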

    Signed-off-by: Lee Schermerhorn
    Cc: David Rientjes
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
    nodemasks passed via set_mempolicy() or mbind() should be considered relative
    to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped onto
    the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the resulting nodemask is
    5-7 (the second, third, and fourth nodes of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

    mems      result
    1-3       1-3
    1-7       2-4
    1,5-6     1,5-6
    1,5-7     5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the currently
    effective nodemask to the relative nodemask.

    This mempolicy mode flag was conceived of by Paul Jackson.
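
    The fold itself can be illustrated with a standalone computation over
    64-bit masks standing in for nodemask_t; this mirrors the mapping described
    above (bit n of the relative mask selects the (n mod weight)-th allowed
    node), not the kernel implementation:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t nth_set_bit(uint64_t mask, unsigned int n)
    {
            unsigned int bit;

            for (bit = 0; bit < 64; bit++)
                    if ((mask & (1ULL << bit)) && n-- == 0)
                            return 1ULL << bit;
            return 0;
    }

    static uint64_t fold_relative(uint64_t relative, uint64_t allowed)
    {
            unsigned int weight = (unsigned int)__builtin_popcountll(allowed);
            uint64_t result = 0;
            unsigned int bit;

            if (!weight)
                    return 0;
            for (bit = 0; bit < 64; bit++)
                    if (relative & (1ULL << bit))
                            result |= nth_set_bit(allowed, bit % weight);
            return result;
    }

    int main(void)
    {
            /* relative nodes 1-3 over mems_allowed 4-7 -> expect nodes 5-7 (0xe0) */
            printf("0x%llx\n",
                   (unsigned long long)fold_relative(0xeULL, 0xf0ULL));
            return 0;
    }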

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
            ...
            union {
                    nodemask_t cpuset_mems_allowed;
                    nodemask_t user_nodemask;
            } w;
    }

    that stores the nodemask that the user passed when he or she created the
    mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.

    This creates an interesting side effect in that it allows the mempolicy
    "intent" to lie dormant, taking no effect until the task has access to the
    node(s) that it desires. For example, if you currently ask for an interleaved
    policy over a set of nodes that you do not have access to, the mempolicy is
    not created and the task continues to use the previous policy. With this
    change, however, it is possible to create the same mempolicy; it only takes
    effect once access to nodes in the nodemask is acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount,mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new() since it
    is now obsoleted. The unused vma_mpol_equal() is also removed.
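
    A userspace sketch of requesting the static behavior through
    set_mempolicy(); it assumes a kernel carrying this patch and libnuma's
    <numaif.h> wrapper (link with -lnuma), with a fallback definition of the
    flag in case the installed headers predate it:

    #include <numaif.h>
    #include <stdio.h>

    #ifndef MPOL_F_STATIC_NODES
    #define MPOL_F_STATIC_NODES (1 << 15)       /* assumed to match the kernel uapi value */
    #endif

    int main(void)
    {
            /* interleave over nodes 1-2, and keep exactly those nodes even if
             * a cpuset later rebinds the task's allowed memories */
            unsigned long nodemask = (1UL << 1) | (1UL << 2);

            if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                              &nodemask, sizeof(nodemask) * 8) != 0)
                    perror("set_mempolicy");
            return 0;
    }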

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
            ...
            unsigned short policy;
            unsigned short flags;
    }

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
            ...
    case MPOL_INTERLEAVE:
            ...
    }

    statements or

    if (policy == MPOL_INTERLEAVE) {
            ...
    }

    statements. Such statements would only stop working if the application starts
    passing optional mode flags to set_mempolicy() or mbind(); in that case, it
    will need to mask the optional flags off the policy in switch and conditional
    statements that test only the mode.
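
    A hedged userspace sketch of that masking (the flag values are local
    stand-ins assumed to match the kernel's uapi definitions; link with -lnuma):

    #include <numaif.h>
    #include <stdio.h>

    #ifndef MPOL_F_STATIC_NODES
    #define MPOL_F_STATIC_NODES     (1 << 15)
    #endif
    #ifndef MPOL_F_RELATIVE_NODES
    #define MPOL_F_RELATIVE_NODES   (1 << 14)
    #endif
    #define MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

    int main(void)
    {
            int policy;

            if (get_mempolicy(&policy, NULL, 0, NULL, 0) != 0) {
                    perror("get_mempolicy");
                    return 1;
            }

            switch (policy & ~MODE_FLAGS) {     /* strip the optional flags */
            case MPOL_INTERLEAVE:
                    printf("interleave\n");
                    break;
            case MPOL_BIND:
                    printf("bind\n");
                    break;
            default:
                    printf("other mode: %d\n", policy & ~MODE_FLAGS);
                    break;
            }
            return 0;
    }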

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
    MPOL_INTERLEAVE, are better declared as part of an enum since they are
    sequentially numbered and cannot be combined.

    The policy member of struct mempolicy is also converted from type short to
    type unsigned short. A negative policy does not have any legitimate meaning,
    so it is possible to change its type in preparation for adding optional mode
    flags later.

    The equivalent member of struct shmem_sb_info is also changed from int to
    unsigned short.

    For compatibility, the policy formal to get_mempolicy() remains as a pointer
    to an int:

    int get_mempolicy(int *policy, unsigned long *nmask,
                      unsigned long maxnode, unsigned long addr,
                      unsigned long flags);

    although the only possible values are in the range of type unsigned short.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
    mem_cgroup_charge_common. It seemed convenient at the time, but hard to
    justify now: there's a perfectly appropriate swappage to charge and uncharge
    instead, this is not on any hot path through shmem_getpage, and no performance
    hit was observed from the slight extra overhead.

    So revert that NULL page handling from mem_cgroup_charge_common; and make it
    clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
    was a helper I found more of a hindrance to understanding.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Feb, 2008

2 commits


08 Feb, 2008

1 commit

  • The memcgroup regime relies upon a cgroup reclaiming pages from itself within
    add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs
    rely upon using add_to_page_cache while holding a spinlock: when it cannot
    wait. The consequence is that when a cgroup reaches its limit, shmem_getpage
    just hangs - unless there is outside memory pressure too, neither kswapd nor
    radix_tree_preload get it out of the retry loop.

    In most cases we can mem_cgroup_cache_charge the page waitably first, to
    attach the page_cgroup in advance, so add_to_page_cache will do no more than
    increment a count; then mem_cgroup_uncharge_page after (in both success and
    failure cases) to balance the books again.

    And where there used to be a congestion_wait for kswapd (recently made
    redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
    to go through a cycle of allocation and freeing, without accounting to any
    particular page, and without updating the statistics vector. This brings the
    cgroup below its limit so the next try usually succeeds.

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Feb, 2008

16 commits

  • This patch modifies the interface to inode_getsecurity to have the function
    return a buffer containing the security blob and its length via parameters
    instead of relying on the calling function to give it an appropriately sized
    buffer.

    Security blobs obtained with this function should be freed using the
    release_secctx LSM hook. This alleviates the problem of the caller having to
    guess a length and preallocate a buffer for this function allowing it to be
    used elsewhere for Labeled NFS.

    The patch also removed the unused err parameter. The conversion is similar to
    the one performed by Al Viro for the security_getprocattr hook.
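
    The shape of the new calling convention, sketched generically with
    illustrative names (not the real LSM prototypes):

    #include <stdlib.h>
    #include <string.h>

    /* the hook allocates the blob itself and hands back pointer + length;
     * the caller releases it later through the matching release hook */
    static int get_security_blob(void **buffer, size_t *len)
    {
            static const char blob[] = "system_u:object_r:tmpfs_t:s0";

            *buffer = malloc(sizeof(blob));
            if (!*buffer)
                    return -1;
            memcpy(*buffer, blob, sizeof(blob));
            *len = sizeof(blob) - 1;    /* length excludes the terminating NUL */
            return 0;
    }

    static void release_security_blob(void *buffer)
    {
            free(buffer);
    }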

    Signed-off-by: David P. Quigley
    Cc: Stephen Smalley
    Cc: Chris Wright
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Casey Schaufler
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David P. Quigley
     
  • Intensive swapoff testing shows shmem_unuse spinning on an entry in
    shmem_swaplist pointing to itself: how does that come about? Days pass...

    First guess is this: shmem_delete_inode tests list_empty without taking the
    global mutex (so the swapping case doesn't slow down the common case); but
    there's an instant in shmem_unuse_inode's list_move_tail when the list entry
    may appear empty (a rare case, because it's actually moving the head not the
    list member). So there's a danger of leaving the inode on the swaplist
    when it's freed, then reinitialized to point to itself when reused. Fix that
    by skipping the list_move_tail when it's a no-op, which happens to plug this.

    But this same spinning then surfaces on another machine. Ah, I'd never
    suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
    still hold page lock, which would hold off inode deletion if the page were in
    pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
    doesn't wait on locked pages). Hmm: we could put the inode on swaplist
    earlier, but then shmem_unuse_inode could never prune unswapped inodes.

    Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
    though I am a little uneasy about the iput which has to follow - it works, and
    I see nothing wrong with it, but it is surprising that shmem inode deletion
    may now occur below shmem_writepage. Revisit this fix later?

    And while we're looking at these races: the way shmem_unuse tests swapped
    without holding info->lock looks unsafe, if we've more than one swap area: a
    racing shmem_writepage on another page of the same inode could be putting it
    in swapcache, just as we're deciding to remove the inode from swaplist -
    there's a danger of going on swap without being listed, so a later swapoff
    would hang, being unable to locate the entry. Move that test and removal down
    into shmem_unuse_inode, once info->lock is held.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
    or swap cache, without any radix tree preload: so tending to deplete emergency
    reserves of memory.

    GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
    being called under memory pressure, so must not wait for more memory to become
    available. But shmem_unuse_inode now has a window in which it can and should
    preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
    add_to_page_cache.

    shmem_getpage is not so straightforward: its filepage/swappage integrity
    relies upon exchanging between caches under spinlock, and it would need a lot
    of restructuring to place the preloads correctly. Instead, follow its pattern
    of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
    add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
    radix_tree_preload, followed immediately by radix_tree_preload_end - that
    won't guarantee success in the next add_to_page_cache, but doesn't need to.

    And we can then remove that bothersome congestion_wait: when needed, it'll
    automatically get done in the course of the radix_tree_preload.

    Signed-off-by: Hugh Dickins
    Looks-good-to: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There are a couple of reasons (patches follow) why it would be good to open a
    window for sleep in shmem_unuse_inode, between its search for a matching swap
    entry, and its handling of the entry found.

    shmem_unuse_inode must then use igrab to hold the inode against deletion in
    that window, and its corresponding iput might result in deletion: so it had
    better unlock_page before the iput, and might as well release the page too.

    Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
    leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
    into shmem_unuse_inode, in the case when it finds a match.

    Let try_to_unuse break on error in the shmem_unuse case, as it does in the
    unuse_mm case: though at this point in the series, no error to break on.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_unuse is at present an unbroken search through every swap vector page of
    every tmpfs file which might be swapped, all under shmem_swaplist_lock. This
    dates from long ago, when the caller held mmlist_lock over it all too: long
    gone, but there's never been much pressure for preemptible swapoff.

    Make it a little more preemptible, replacing shmem_swaplist_lock by
    shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
    cond_resched_lock (on info->lock) at one convenient point in the
    shmem_unuse_inode loop, where it has no outstanding kmap_atomic.

    If we're serious about preemptible swapoff, there's much further to go e.g.
    I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
    case dictate preemptibility for other configs. But as in the earlier patch
    to make swapoff scan ptes preemptibly, my hidden agenda is really towards
    making memcgroups work, hardly about preemptibility at all.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
    size=0). But if a stacked filesystem such as unionfs gets pages from a sparse
    tmpfs file by reading holes, and then writes to them, it can easily exceed any
    such limit at present.

    So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
    reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed,
    pessimistically mark such pages as dirty, so they cannot get reclaimed and
    unaccounted by mistake. The venerable shmem_recalc_inode code (originally to
    account for the reclaim of clean pages) suffices to get the accounting right
    when swappages are dropped in favour of more uptodate filepages.

    This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
    caused by unionfs writing to a very sparse tmpfs file: to minimize memory
    allocation in swapout, tmpfs requires the swap vector be allocated upfront,
    which wasn't always happening in this stacked case.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tmpfs has long allowed for a fresh filepage to be created in pagecache, just
    before shmem_getpage gets the chance to match it up with the swappage which
    already belongs to that offset. But unionfs_writepage now does a
    find_or_create_page, divorced from shmem_getpage, which leaves conflicting
    filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.

    Therefore shmem_writepage (where a page is swizzled from file to swap) must
    now be on the lookout for existing swap, ready to free it in favour of the
    more uptodate filepage, instead of BUGging on that clash. And when the
    add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
    filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails
    in shmem_getpage, it should retry in the same way it already does.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
    between tmpfs page cache and swap cache, to avoid page copying) are only used
    by shmem.c; and our subsequent fix for unionfs needs different treatments in
    the two instances of move_from_swap_cache. Move them from swap_state.c into
    their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
    add_to_swap_cache externally visible.

    shmem.c likes to say set_page_dirty where swap_state.c liked to say
    SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
    makes moot (and implies we should lose that "shift page from clean_pages to
    dirty_pages list" comment: it's on neither).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When tmpfs is mounted with a size less than one page, the number of blocks
    is set to 0 which makes the tmpfs mount unlimited. This can lead to a
    quick and surprising death if someone typos a tmpfs mount command and
    writes too much.

    tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0,
    as Documentation/filesystems/tmpfs.txt says.

    Hugh: do this by rounding size up instead of down in all cases: which
    slightly expands other odd-sized tmpfs mounts, but in a consistent way.

    Signed-off-by: Michael Marineau
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Marineau
     
  • The shmem_sb_info structure has a number of free_inodes. This
    value is altered in appropriate places under spinlock and with
    the sbi->max_inodes != 0 check.

    Consolidate these manipulations into two helpers.

    This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
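
    A hedged sketch of the pair of helpers (locking elided; the names and the
    error value are assumptions based on the description above):

    #include <errno.h>

    struct shmem_sb_info_sketch {
            unsigned long max_inodes;   /* 0 means "no inode limit" */
            unsigned long free_inodes;
    };

    static int shmem_reserve_inode_sketch(struct shmem_sb_info_sketch *sbinfo)
    {
            if (sbinfo->max_inodes) {
                    if (!sbinfo->free_inodes)
                            return -ENOSPC;
                    sbinfo->free_inodes--;
            }
            return 0;
    }

    static void shmem_free_inode_sketch(struct shmem_sb_info_sketch *sbinfo)
    {
            if (sbinfo->max_inodes)
                    sbinfo->free_inodes++;
    }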

    [akpm@linux-foundation.org: fix error return values]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • With the old aops, writing to a tmpfs file had to use its own special method:
    the generic method would pass in a fresh page to prepare_write when the right
    page was there in swapcache - which was inefficient to handle, even once we'd
    concocted the code to handle it.

    With the new aops, the generic method uses shmem_write_end, which lets
    shmem_getpage find the right page: so now abandon shmem_file_write in favour
    of the generic method. Yes, that does do several things that tmpfs hasn't
    really needed (notably balance_dirty_pages_ratelimited, which ramfs also
    calls); but more use of common code is preferable.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In the new aops, write_begin is supposed to return the page locked: though
    I've seen no ill effects, that's been overlooked in the case of
    shmem_write_begin, and should be fixed. Then shmem_write_end must unlock the
    page: do so _after_ updating i_size, as we found to be important in other
    filesystems (though since shmem pages don't go the usual writeback route, they
    never suffered from that corruption).

    For shmem_write_begin to return the page locked, we need shmem_getpage to
    return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's
    simplify the interface and return it locked even when SGP_READ.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no
    users now. Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and
    shmem_getpage is about to return with page always locked).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Building in a filesystem on a loop device on a tmpfs file can hang when
    swapping, the loop thread caught in that infamous throttle_vm_writeout.

    In theory this is a long standing problem, which I've either never seen in
    practice, or long ago suppressed the recollection, after discounting my load
    and my tmpfs size as unrealistically high. But now, with the new aops, it has
    become easy to hang on one machine.

    Loop used to grab_cache_page before the old prepare_write to tmpfs, which
    seems to have been enough to free up some memory for any swapin needed; but
    the new write_begin lets tmpfs find or allocate the page (much nicer, since
    grab_cache_page missed tmpfs pages in swapcache).

    When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
    has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
    break out when __GFP_IO or __GFP_FS is unset; but when tmpfs swaps in,
    read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
    mapping_gfp_mask - hence the hang.

    So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
    swapin_readahead to read_swap_cache_async to add_to_swap_cache.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
    beside its kindred read_swap_cache_async. Why were its args in a different
    order? rearrange them. And since it was always followed by a
    read_swap_cache_async of the target page, fold that in and return struct
    page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and
    read_swap_cache_async stubs.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
    code, advancing addr, and stepping on to the next vma at the boundary, to line
    up the mempolicy for each page allocation.

    It _might_ be a good idea to allocate swap more according to vma layout; but
    the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
    allocated as needed for pages as they sink to the bottom of the inactive LRUs.
    Sometimes that may match vma layout, but not so often that it's worth going
    to these misleading vma->vm_next lengths: rip all that out.

    Originally I intended to retain the incrementation of addr, but correct its
    initial value: valid_swaphandles generally supplies an offset below the target
    addr (this is readaround rather than readahead), but addr has not been
    adjusted accordingly, so in the interleave case it has usually been allocating
    the target page from the "wrong" node (though that may not matter very much).

    But look at the equivalent shmem_swapin code: either by oversight or by
    design, though it has all the apparatus for choosing a new mempolicy per page,
    it uses the same idx throughout, choosing the same mempolicy and interleave
    node for each page of the cluster.

    Which is actually a much better strategy: each node has its own LRUs and its
    own kswapd, so if you're betting on any particular relationship between swap
    and node, the best bet is that nearby swap entries belong to pages from the
    same node - even when the mempolicy of the target page is to interleave. And
    examining a map of nodes corresponding to swap entries on a numa=fake system
    bears this out. (We could later tweak swap allocation to make it even more
    likely, but this patch is merely about removing cruft.)

    So, neither adjust nor increment addr in swapin_readahead, and then
    shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
    once per cluster, and so few fields of pvma are used, let's skip the memset -
    from shmem_alloc_page also.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Nov, 2007

1 commit

  • tmpfs was misconverted to __GFP_ZERO in 2.6.11. There's an unusual case in
    which shmem_getpage receives the page from its caller instead of allocating.
    We must cover this case by clear_highpage before SetPageUptodate, as before.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Oct, 2007

1 commit

  • It's possible to provoke unionfs (not yet in mainline, though in mm and
    some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect
    it's possible to provoke the 2.6.23 ecryptfs in the same way (but the
    2.6.24 ecryptfs no longer calls lower level's ->writepage).

    This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could
    leak from tmpfs via write_cache_pages and unionfs to userspace. There's
    already a fix (e423003028183df54f039dfda8b58c49e78c89d7 - writeback: don't
    propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so
    far as it goes; but insufficient because it doesn't address the underlying
    issue, that shmem_writepage expects to be called only by vmscan (relying on
    backing_dev_info capabilities to prevent the normal writeback path from
    ever approaching it).

    That's an increasingly fragile assumption, and ramdisk_writepage (the other
    source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check
    wbc->for_reclaim before returning it. Make the same check in
    shmem_writepage, thereby sidestepping the page_mapped BUG also.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Oct, 2007

2 commits

    Now that nfsd has stopped writing to the find_exported_dentry member, we can
    mark the export_operations const.

    Signed-off-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Dave Kleikamp
    Cc: Anton Altaparmakov
    Cc: David Chinner
    Cc: Timothy Shimmin
    Cc: OGAWA Hirofumi
    Cc: Hugh Dickins
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • I'm not sure what people were thinking when adding support to export tmpfs,
    but here's the conversion anyway:

    Signed-off-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

17 Oct, 2007

5 commits

  • Why do we need r/o bind mounts?

    This feature allows a read-only view into a read-write filesystem. In the
    process of doing that, it also provides infrastructure for keeping track of
    the number of writers to any given mount.

    This has a number of uses. It allows chroots to have parts of filesystems
    writable. It will be useful for containers in the future because users may
    have root inside a container, but should not be allowed to write to
    some filesystems. This also replaces patches that vserver has had out of the
    tree for several years.

    It allows security enhancement by making sure that parts of your filesystem are
    read-only (such as when you don't trust your FTP server), when you don't want
    to have entire new filesystems mounted, or when you want atime selectively
    updated. I've been using the following script to test that the feature is
    working as desired. It takes a directory and makes a regular bind and a r/o
    bind mount of it. It then performs some normal filesystem operations on the
    three directories, including ones that are expected to fail, like creating a
    file on the r/o mount.

    This patch:

    Some filesystems forego the vfs and may_open() and create their own 'struct
    file's.

    This patch creates a couple of helper functions which can be used by these
    filesystems, and will provide a unified place which the r/o bind mount code
    may patch.

    Also, rename an existing, static-scope init_file() to a less generic name.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • These aren't modular, so SLAB_PANIC is OK.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel
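
    A before/after sketch on an illustrative constructor (the cache type here
    is an opaque stand-in, not the kernel's struct kmem_cache):

    struct kmem_cache_sketch;                   /* opaque stand-in */

    struct thing {
            int field;
    };

    /* old form:
     *   static void thing_ctor(void *object, struct kmem_cache_sketch *s,
     *                          unsigned long flags);
     * new form: cache pointer first, unused flags argument dropped */
    static void thing_ctor(struct kmem_cache_sketch *s, void *object)
    {
            struct thing *t = object;

            (void)s;                            /* this ctor doesn't need the cache */
            t->field = 0;                       /* one-time object initialization */
    }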

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • provide BDI constructor/destructor hooks

    [akpm@linux-foundation.org: compile fix]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This patch makes three needlessly global functions static.

    Signed-off-by: Adrian Bunk
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk