11 Sep, 2013

6 commits

  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch ameliorates the situation by dynamically allocating the node
    arrays, sized by the firmware-provided nr_node_ids (see the sketch below).
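
    A minimal sketch of the shape of this change, using the struct and
    function names from the list_lru code (field layout and error handling
    abbreviated here):

      #include <linux/list.h>
      #include <linux/nodemask.h>
      #include <linux/slab.h>
      #include <linux/spinlock.h>

      struct list_lru_node {
              spinlock_t       lock;
              struct list_head list;
              long             nr_items;
      };

      struct list_lru {
              /* was: struct list_lru_node node[MAX_NUMNODES]; */
              struct list_lru_node *node;
              nodemask_t           active_nodes;
      };

      int list_lru_init(struct list_lru *lru)
      {
              int i;

              /* Size the array by the node count the firmware reports. */
              lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL);
              if (!lru->node)
                      return -ENOMEM;

              nodes_clear(lru->active_nodes);
              for (i = 0; i < nr_node_ids; i++) {
                      spin_lock_init(&lru->node[i].lock);
                      INIT_LIST_HEAD(&lru->node[i].list);
                      lru->node[i].nr_items = 0;
              }
              return 0;
      }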

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • The list_lru implementation has one function, list_lru_dispose_all, with
    only one user (the dentry code). At first, such a function appears to make
    sense because we are really not interested in the result of isolating each
    dentry separately - all of them are going away anyway. However, its
    implementation is buggy in the following way:

    When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries,
    marking them with DCACHE_SHRINK_LIST. However, this is done without the
    nlru->lock taken. The immediate result is that someone else may add or
    remove the dentry from the LRU at the same time. When list_lru_del happens
    in that scenario, we will see an element that is not yet marked with
    DCACHE_SHRINK_LIST (even though it will be in the future) and remove it
    from an LRU it is no longer on. Since list_lru_dispose_all counts down
    nlru's nr_items and list_lru_del does the same, this leads to an
    imbalance.

    The solution for this would not be so simple: we can obviously just keep
    the lru_lock taken, but then we have no guarantees that we will be able to
    acquire the dentry lock (dentry->d_lock). To properly solve this, we need
    a communication mechanism between the lru and dentry code, so they can
    coordinate this with each other.

    Such a mechanism already exists in the form of the list_lru_walk_cb
    callback. So it is possible to construct a dcache-side prune function that
    does the right thing simply by calling list_lru_walk in a loop until no
    more dentries are available (a sketch follows below).

    With only one user, plus the fact that a sane solution for the problem
    would involve bouncing between dcache and list_lru anyway, I see little
    justification for keeping the special-case list_lru_dispose_all in the
    tree.
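
    A minimal sketch of such a prune loop, assuming a dcache-side isolate
    callback (dentry_lru_isolate_shrink here is illustrative) that moves each
    dentry it visits onto a private dispose list:

      #include <linux/fs.h>
      #include <linux/kernel.h>
      #include <linux/list.h>
      #include <linux/list_lru.h>

      /* dcache-internal helpers, declared here only for illustration */
      extern enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
                                                       spinlock_t *lock,
                                                       void *arg);
      extern void shrink_dentry_list(struct list_head *list);

      void shrink_dcache_sb(struct super_block *sb)
      {
              long freed;

              do {
                      LIST_HEAD(dispose);

                      /* Walk until the callback finds no more dentries. */
                      freed = list_lru_walk(&sb->s_dentry_lru,
                                            dentry_lru_isolate_shrink,
                                            &dispose, UINT_MAX);

                      shrink_dentry_list(&dispose);
              } while (freed > 0);
      }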

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Acked-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • This patch adapts the list_lru API to accept an optional node argument, to
    be used by NUMA-aware shrinking functions. Code that does not care about
    the NUMA placement of objects can still call into the very same functions
    as before; they will simply iterate over all nodes (see the sketch below).
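
    A minimal sketch of the resulting API shape, using the list_lru names
    from these patches; the node-agnostic wrappers just loop over the
    populated nodes:

      #include <linux/list_lru.h>
      #include <linux/nodemask.h>

      /* Node-aware entry points take an explicit NUMA node id. */
      unsigned long list_lru_count_node(struct list_lru *lru, int nid);
      unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
                                       list_lru_walk_cb isolate, void *cb_arg,
                                       unsigned long *nr_to_walk);

      /* Node-agnostic callers keep the old interface unchanged. */
      static inline unsigned long list_lru_count(struct list_lru *lru)
      {
              unsigned long count = 0;
              int nid;

              for_each_node_mask(nid, lru->active_nodes)
                      count += list_lru_count_node(lru, nid);

              return count;
      }

      static inline unsigned long
      list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
                    void *cb_arg, unsigned long nr_to_walk)
      {
              unsigned long isolated = 0;
              int nid;

              for_each_node_mask(nid, lru->active_nodes) {
                      isolated += list_lru_walk_node(lru, nid, isolate,
                                                     cb_arg, &nr_to_walk);
                      if (nr_to_walk == 0)
                              break;
              }
              return isolated;
      }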

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • The LRU_RETRY code assumes that the list traversal status is unchanged
    after we have dropped and regained the list lock. Unfortunately, this is
    not a valid assumption, and it can lead to racing traversals isolating
    objects that the other traversal expects to be the next item on the list.

    This is causing problems with the inode cache shrinker isolation, with
    races resulting in an inode on a dispose list being "isolated" because a
    racing traversal still thinks it is on the LRU. The inode is then never
    reclaimed and that causes hangs if a subsequent lookup on that inode
    occurs.

    Fix it by always restarting the list walk on an LRU_RETRY return from the
    isolate callback. Guard against the livelock the current code was trying
    to avoid by always decrementing the nr_to_walk counter on retries, so
    that even if we keep hitting the same item on the list we will eventually
    stop trying to walk and exit the situation causing the problem (see the
    sketch below).
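
    A minimal sketch of the fixed walk loop, using the list_lru names (the
    full function is abbreviated):

      #include <linux/list_lru.h>
      #include <linux/nodemask.h>

      static unsigned long
      list_lru_walk_node(struct list_lru *lru, int nid,
                         list_lru_walk_cb isolate, void *cb_arg,
                         unsigned long *nr_to_walk)
      {
              struct list_lru_node *nlru = &lru->node[nid];
              struct list_head *item, *n;
              unsigned long isolated = 0;

              spin_lock(&nlru->lock);
      restart:
              list_for_each_safe(item, n, &nlru->list) {
                      enum lru_status ret;

                      /*
                       * Decrement nr_to_walk before calling the callback so
                       * that repeated LRU_RETRY returns on the same item
                       * cannot livelock the walk.
                       */
                      if (*nr_to_walk == 0)
                              break;
                      (*nr_to_walk)--;

                      ret = isolate(item, &nlru->lock, cb_arg);
                      switch (ret) {
                      case LRU_REMOVED:  /* callback took item off the list */
                              if (--nlru->nr_items == 0)
                                      node_clear(nid, lru->active_nodes);
                              isolated++;
                              break;
                      case LRU_ROTATE:
                              list_move_tail(item, &nlru->list);
                              break;
                      case LRU_SKIP:
                              break;
                      case LRU_RETRY:
                              /*
                               * The lru lock was dropped: our cursor into
                               * the list is invalid, so restart from the
                               * head.
                               */
                              goto restart;
                      }
              }
              spin_unlock(&nlru->lock);
              return isolated;
      }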

    Reported-by: Michal Hocko
    Signed-off-by: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now that we have an LRU list API, we can start to enhance the
    implementation. This splits the single LRU list into per-node lists and
    locks to enhance scalability. Items are placed on lists according to the
    node the memory belongs to. To make scanning the lists efficient, we also
    track which per-node lists have entries in an active nodemask, as sketched
    below.

    Note: We use a fixed-size array for the node LRU; this struct can be very
    big if MAX_NUMNODES is big. If this becomes a problem, it is fixable by
    turning the array into a pointer and dynamically allocating it to
    nr_node_ids. This quantity is firmware-provided, and would still leave
    room for all nodes, at the cost of a pointer lookup and an extra
    allocation. Because that allocation will most likely come from a
    different slab cache than the main structure holding it, it may very well
    fail.
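
    A minimal sketch of the per-node add path with the nodemask tracking,
    using the names from this patch:

      #include <linux/list_lru.h>
      #include <linux/mm.h>
      #include <linux/nodemask.h>

      bool list_lru_add(struct list_lru *lru, struct list_head *item)
      {
              /* File the item on the list of the node its memory lives on. */
              int nid = page_to_nid(virt_to_page(item));
              struct list_lru_node *nlru = &lru->node[nid];

              spin_lock(&nlru->lock);
              if (list_empty(item)) {
                      list_add_tail(item, &nlru->list);
                      /* First item on this node: mark the node active. */
                      if (nlru->nr_items++ == 0)
                              node_set(nid, lru->active_nodes);
                      spin_unlock(&nlru->lock);
                      return true;
              }
              spin_unlock(&nlru->lock);
              return false;
      }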

    [glommer@openvz.org: fix warnings, added note about node lru]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Greg Thelen
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Several subsystems use the same construct for LRU lists - a list head, a
    spin lock, and an item count. They also use exactly the same code for
    adding and removing items from the LRU. Create a generic type for these
    LRU lists.

    This is the beginning of generic, node-aware LRUs for shrinkers to work
    with (a sketch of the type follows below).
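
    A minimal sketch of the generic type and its add/remove helpers as this
    patch introduces them (a single lock and list; the per-node split comes
    in a later patch):

      #include <linux/list.h>
      #include <linux/spinlock.h>
      #include <linux/types.h>

      /* The construct the subsystems kept open-coding, made generic. */
      struct list_lru {
              spinlock_t       lock;
              struct list_head list;
              long             nr_items;
      };

      /* Returns true if the item was added (not already on an LRU). */
      bool list_lru_add(struct list_lru *lru, struct list_head *item)
      {
              spin_lock(&lru->lock);
              if (list_empty(item)) {
                      list_add_tail(item, &lru->list);
                      lru->nr_items++;
                      spin_unlock(&lru->lock);
                      return true;
              }
              spin_unlock(&lru->lock);
              return false;
      }

      /* Returns true if the item was removed (it was still on the LRU). */
      bool list_lru_del(struct list_lru *lru, struct list_head *item)
      {
              spin_lock(&lru->lock);
              if (!list_empty(item)) {
                      list_del_init(item);
                      lru->nr_items--;
                      spin_unlock(&lru->lock);
                      return true;
              }
              spin_unlock(&lru->lock);
              return false;
      }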

    [glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Greg Thelen
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner