21 Oct, 2020

1 commit

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     

17 Oct, 2020

2 commits

  • In order to use multi-index entries for huge pages in the page cache, we
    need to be able to split a multi-index entry (eg if a file is truncated in
    the middle of a huge page entry). This version does not support splitting
    more than one level of the tree at a time. This is an acceptable
    limitation for the page cache as we do not expect to support order-12
    pages in the near future.

    [akpm@linux-foundation.org: export xas_split_alloc() to modules]
    [willy@infradead.org: fix xarray split]
    Link: https://lkml.kernel.org/r/20200910175450.GV6583@casper.infradead.org
    [willy@infradead.org: fix xarray]
    Link: https://lkml.kernel.org/r/20201001233943.GW20115@casper.infradead.org

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: "Kirill A . Shutemov"
    Cc: Qian Cai
    Cc: Song Liu
    Link: https://lkml.kernel.org/r/20200903183029.14930-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Fix read-only THP for non-tmpfs filesystems".

    As described more verbosely in the [3/3] changelog, we can inadvertently
    put an order-0 page in the page cache which occupies 512 consecutive
    entries. Users are running into this if they enable the
    READ_ONLY_THP_FOR_FS config option; see
    https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also
    reported it here:
    https://lore.kernel.org/lkml/20200616013309.GB815@lca.pw/

    This is a rather intrusive way of fixing the problem, but has the
    advantage that I've actually been testing it with the THP patches, which
    means that it sees far more use than it does upstream -- indeed, Song has
    been entirely unable to reproduce it. It also has the advantage that it
    removes a few patches from my gargantuan backlog of THP patches.

    This patch (of 3):

    This function returns the order of the entry at the index. We need this
    because there isn't space in the shadow entry to encode its order.

    [akpm@linux-foundation.org: export xa_get_order to modules]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: "Kirill A . Shutemov"
    Cc: Qian Cai
    Cc: Song Liu
    Link: https://lkml.kernel.org/r/20200903183029.14930-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200903183029.14930-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Oct, 2020

2 commits


08 Oct, 2020

1 commit


09 Jun, 2020

1 commit

  • __xa_store() and xa_store() document that the functions can fail, and
    that the return code can be an xa_err() encoded error code.

    xa_store_bh() and xa_store_irq() do not document that the functions can
    fail and that they can also return xa_err() encoded error codes.

    Thus: Update the documentation.

    Signed-off-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200430111424.16634-1-manfred@colorfullife.com
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

13 Mar, 2020

1 commit

  • xas_for_each_marked() is using entry == NULL as a termination condition
    of the iteration. When xas_for_each_marked() is used protected only by
    RCU, this can however race with xas_store(xas, NULL) in the following
    way:

    TASK1                                    TASK2
    page_cache_delete()                      find_get_pages_range_tag()
                                               xas_for_each_marked()
                                                 xas_find_marked()
                                                   off = xas_find_chunk()

    xas_store(&xas, NULL)
      xas_init_marks(&xas);
      ...
      rcu_assign_pointer(*slot, NULL);
                                                   entry = xa_entry(off);

    And thus xas_for_each_marked() terminates prematurely possibly leading
    to missed entries in the iteration (translating to missing writeback of
    some pages or a similar problem).

    If we find a NULL entry that has been marked, skip it (unless we're trying
    to allocate an entry).

    Reported-by: Jan Kara
    CC: stable@vger.kernel.org
    Fixes: ef8e5717db01 ("page cache: Convert delete_batch to XArray")
    Signed-off-by: Matthew Wilcox (Oracle)

    Matthew Wilcox (Oracle)
     

01 Feb, 2020

1 commit


18 Jan, 2020

2 commits


15 Oct, 2019

1 commit

  • Fix (Sphinx) kernel-doc warning in include/linux/xarray.h:

    include/linux/xarray.h:232: WARNING: Unexpected indentation.

    Link: http://lkml.kernel.org/r/89ba2134-ce23-7c10-5ee1-ef83b35aa984@infradead.org
    Fixes: a3e4d3f97ec8 ("XArray: Redesign xa_alloc API")
    Signed-off-by: Randy Dunlap
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

01 Jun, 2019

1 commit

  • Since a28334862993 ("page cache: Finish XArray conversion"), on most
    major Linux distributions, the page cache doesn't correctly transition
    when the hot data set is changing, and leaves the new pages thrashing
    indefinitely instead of kicking out the cold ones.

    On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
    running stock Arch Linux:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120086/153600 workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120268/153600 workingset-b

    workingset-b is a 600M file on a 1G host that is otherwise entirely
    idle. No matter how often it's being accessed, it won't get cached.

    While investigating, I noticed that the non-resident information gets
    aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
    a problem because a workingset transition like this relies on the
    non-resident information tracked in the page cache tree of evicted
    file ranges: when the cache faults are refaults of recently evicted
    cache, we challenge the existing active set, and that allows a new
    workingset to establish itself.

    Tracing the shrinker that maintains this memory revealed that all page
    cache tree nodes were allocated to the root cgroup. This is a problem,
    because 1) the shrinker sizes the amount of non-resident information
    it keeps to the size of the cgroup's other memory and 2) on most major
    Linux distributions, only kernel threads live in the root cgroup and
    everything else gets put into services or session groups:

    [root@ham ~]# cat /proc/self/cgroup
    0::/user.slice/user-0.slice/session-c1.scope

    As a result, we basically maintain no non-resident information for the
    workloads running on the system, thus breaking the caching algorithm.

    Looking through the code, I found the culprit in the above-mentioned
    patch: when switching from the radix tree to xarray, it dropped the
    __GFP_ACCOUNT flag from the tree node allocations - the flag that
    makes sure the allocated memory gets charged to and tracked by the
    cgroup of the calling process - in this case, the one doing the fault.

    To fix this, allow xarray users to specify per-tree flag that makes
    xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
    tree annotation to request such cgroup tracking for the cache nodes.

    With this patch applied, the page cache correctly converges on new
    workingsets again after just a few iterations:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    124607/153600 workingset-a
    87876/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    81313/153600 workingset-a
    133321/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    63036/153600 workingset-a
    153600/153600 workingset-b

    Cc: stable@vger.kernel.org # 4.20+
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Signed-off-by: Matthew Wilcox (Oracle)

    Johannes Weiner
     

21 Feb, 2019

2 commits

  • Jason feels this is clearer, and it saves a function and an exported
    symbol.

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • xa_cmpxchg() was a little too magic in turning ZERO entries into NULL,
    and would leave the entry set to the ZERO entry instead of releasing
    it for future use. After careful review of existing users of
    xa_cmpxchg(), change the semantics so that it does not translate either
    incoming argument from NULL into ZERO entries.

    Add several tests to the test-suite to make sure this problem doesn't
    come back.

    Reported-by: Jason Gunthorpe
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

09 Feb, 2019

1 commit


07 Feb, 2019

4 commits

  • This differs slightly from the IDR equivalent in five ways.

    1. It can allocate up to UINT_MAX instead of being limited to INT_MAX,
    like xa_alloc(). Also like xa_alloc(), it will write to the 'id'
    pointer before placing the entry in the XArray.
    2. The 'next' cursor is allocated separately from the XArray instead
    of being part of the IDR. This saves memory for all the users which
    do not use the cyclic allocation API and suits some users better.
    3. It returns -EBUSY instead of -ENOSPC.
    4. It will attempt to wrap back to the minimum value on memory allocation
    failure as well as on an -EBUSY error, assuming that a user would
    rather allocate a small ID than suffer an ID allocation failure.
    5. It reports whether it has wrapped, which is important to some users.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • It was too easy to forget to initialise the start index. Add an
    xa_limit data structure which can be used to pass min & max, and
    define a couple of special values for common cases. Also add some
    more tests cribbed from the IDR test suite. Change the return value
    from -ENOSPC to -EBUSY to match xa_insert().

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • A lot of places want to allocate IDs starting at 1 instead of 0.
    While the xa_alloc() API supports this, it's not very efficient if lots
    of IDs are allocated, due to having to walk down to the bottom of the
    tree to see if ID 1 is available, then all the way over to the next
    non-allocated ID. This method marks ID 0 as being occupied which wastes
    one slot in the XArray, but preserves xa_empty() as working.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Userspace translates EEXIST to "File exists" which isn't a very good
    error message for the problem. "Device or resource busy" is a better
    indication of what went wrong.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

05 Feb, 2019

1 commit


17 Jan, 2019

1 commit

  • There is a math problem here which leads to a lot of static checker
    warnings for me:

    net/sunrpc/clnt.c:451 rpc_new_client() error: (-4096) too low for ERR_PTR

    Error values are from -1 to -4095 or from 0xffffffff to 0xfffff001 in
    hexadecimal. (I am assuming a 32 bit system for simplicity). We are
    using the lowest two bits to hold some internal XArray data so the
    error is shifted two spaces to the left. 0xfffff001 << 2 is 0xffffc004.
    And finally we want to check that BIT(1) is set so we add 2 which gives
    us 0xffffc006.

    In other words, we should be checking that "entry >= 0xffffc006", but
    the check is actually testing if "entry >= 0xffffc002".

    Fixes: 76b4e5299565 ("XArray: Permit storing 2-byte-aligned pointers")
    Signed-off-by: Dan Carpenter
    [Use xa_mk_internal() instead of changing the bracketing]
    Signed-off-by: Matthew Wilcox

    Dan Carpenter
     

15 Jan, 2019

1 commit


07 Jan, 2019

4 commits

  • xa_insert() should treat reserved entries as occupied, not as available.
    Also, it should treat requests to insert a NULL pointer as a request
    to reserve the slot. Add xa_insert_bh() and xa_insert_irq() for
    completeness.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • On m68k, statically allocated pointers may only be two-byte aligned.
    This clashes with the XArray's method for tagging internal pointers.
    Permit storing these pointers in single slots (ie not in multislots).

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • There were three problems with this API:
    1. It took too many arguments; almost all users wanted to iterate over
    every element in the array rather than a subset.
    2. It required that 'index' be initialised before use, and there's no
    realistic way to make GCC catch that.
    3. 'index' and 'entry' were the opposite way round from every other
    member of the XArray APIs.

    So split it into three different APIs:

    xa_for_each(xa, index, entry)
    xa_for_each_start(xa, index, entry, start)
    xa_for_each_marked(xa, index, entry, filter)

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • A regular xa_init_flags() put all dynamically-initialised XArrays into
    the same locking class. That leads to lockdep believing that taking
    one XArray lock while holding another is a deadlock. It's possible to
    work around some of these situations with separate locking classes for
    irq/bh/regular XArrays, and SINGLE_DEPTH_NESTING, but that's ugly, and
    it doesn't work for all situations (where we have completely unrelated
    XArrays).

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

06 Dec, 2018

1 commit


06 Nov, 2018

5 commits


21 Oct, 2018

7 commits