15 Jan, 2016

40 commits

  • Currently, looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implications for actual
    memory use are quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As preparation for extending the userspace interfaces,
    this patch adds an MM_SHMEMPAGES counter to mm_rss_stat to account for
    shmem pages separately from MM_FILEPAGES. The next patch will expose it
    to userspace; this patch doesn't change the exported values yet, since
    it adds MM_SHMEMPAGES to MM_FILEPAGES at the places where MM_FILEPAGES
    was used before. The only user-visible change after this patch is in
    the OOM killer message, which now reports "shmem-rss" separately from
    "file-rss".
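
    For illustration, a standalone sketch of the counter split (the struct
    layout and the helper below are simplified mock-ups, not the kernel's
    actual mm_rss_stat definitions):

    /* Standalone illustration (not kernel code): the per-mm RSS counters
     * gain MM_SHMEMPAGES, and the value reported where MM_FILEPAGES alone
     * used to be reported becomes the sum of the two, so the exported
     * numbers do not change yet. */
    #include <stdio.h>

    enum {
            MM_FILEPAGES,   /* pages of regular file mappings */
            MM_ANONPAGES,   /* anonymous pages */
            MM_SWAPENTS,    /* anonymous swap entries */
            MM_SHMEMPAGES,  /* shmem/tmpfs pages, tracked separately now */
            NR_MM_COUNTERS
    };

    struct mm_rss_stat {
            long count[NR_MM_COUNTERS];     /* atomic in the real kernel */
    };

    /* Hypothetical helper mirroring "add up MM_SHMEMPAGES to MM_FILEPAGES
     * at places where MM_FILEPAGES was used before". */
    static long reported_file_rss(const struct mm_rss_stat *rss)
    {
            return rss->count[MM_FILEPAGES] + rss->count[MM_SHMEMPAGES];
    }

    int main(void)
    {
            struct mm_rss_stat rss = { .count = { [MM_FILEPAGES] = 100,
                                                  [MM_SHMEMPAGES] = 40 } };

            printf("file-rss as exported today: %ld pages\n",
                   reported_file_rss(&rss));
            return 0;
    }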

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Following the previous patch, a further reduction of the
    /proc/pid/smaps cost is possible for private writable shmem mappings
    with unpopulated areas, where the page walk invokes the .pte_hole
    function. We can use a radix tree iterator for each such area instead
    of calling find_get_entry() in a loop. This comes at the extra
    maintenance cost of introducing another shmem function,
    shmem_partial_swap_usage().
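
    A rough kernel-context sketch of such a walk (the locking, retry and
    cond_resched() details of the real shmem_partial_swap_usage() are
    omitted; treat this as an illustration, not the actual implementation):

    #include <linux/fs.h>
    #include <linux/radix-tree.h>

    /* Count swapped-out shmem pages in [start, end) with a single radix
     * tree walk instead of one find_get_entry() call per index. */
    static unsigned long swap_usage_sketch(struct address_space *mapping,
                                           pgoff_t start, pgoff_t end)
    {
            struct radix_tree_iter iter;
            void **slot;
            unsigned long swapped = 0;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                    void *entry;

                    if (iter.index >= end)
                            break;
                    entry = radix_tree_deref_slot(slot);
                    /* swapped-out shmem pages are stored as exceptional entries */
                    if (radix_tree_exceptional_entry(entry))
                            swapped++;
            }
            rcu_read_unlock();

            return swapped << PAGE_SHIFT;
    }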

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped-out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I time how long it takes to cat
    /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch improved swap accounting for shmem mappings, but it
    also made /proc/pid/smaps more expensive for them, as we consult the
    radix tree for each pte_none entry, so the overall complexity is
    O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics that the shmem
    object itself tracks (if the mapping contains the whole object, or the
    swap usage of the whole object is zero), or use the radix tree
    iterator, which is much more efficient than repeated find_get_entry()
    calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the
    find_get_entry() approach.

    Hopefully such mappings are relatively uncommon.
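
    A kernel-context sketch of that decision (object_swapped_bytes() and
    partial_swap_usage() are illustrative stand-ins for the internal shmem
    statistic and for the radix tree walk mentioned above; locking and race
    handling are omitted):

    unsigned long shmem_swap_usage(struct vm_area_struct *vma)
    {
            struct inode *inode = file_inode(vma->vm_file);
            unsigned long swapped = object_swapped_bytes(inode);

            /* Nothing of the object is in swap: trivially zero. */
            if (!swapped)
                    return 0;

            /* The vma maps the whole object: reuse the object-wide stat. */
            if (!vma->vm_pgoff &&
                vma->vm_end - vma->vm_start >= i_size_read(inode))
                    return swapped;

            /* Otherwise walk only the relevant part of the radix tree. */
            return partial_swap_usage(inode->i_mapping,
                                      linear_page_index(vma, vma->vm_start),
                                      linear_page_index(vma, vma->vm_end));
    }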

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB,
    and timed how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping covering almost all of a partially swapped /dev/shm/file
    (which needs to employ the radix tree iterator):

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with the /dev/shm/file not swapped (so no radix tree walk is
    needed):

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
    shmem-backed mappings, even if the mapped portion does contain pages
    that were swapped out. This is because, unlike private anonymous
    mappings, shmem does not change the pte to a swap entry but to pte_none
    when swapping the page out. In the smaps page walk, such a page thus
    looks like it was never faulted in.

    This patch changes smaps_pte_entry() to determine the swap status of
    such pte_none entries for shmem mappings, similarly to how
    mincore_page() does it. Swapped-out shmem pages are thus accounted
    for. For private mappings of tmpfs files that COWed some of the pages,
    the swapped-out status of the original shmem pages is naturally
    ignored. If some of the private copies were also swapped out, they are
    accounted via their page table swap entries, so the resulting reported
    swap usage is the sum of both the swapped-out private copies and the
    swapped-out shmem pages that were not COWed. No double accounting can
    thus happen.

    The accounting is arguably still not as precise as for private
    anonymous mappings, since we will now also count pages that the process
    in question never accessed, but that another process populated and then
    let become swapped out. I believe it is still less confusing and
    subtle than not showing any swap usage for shmem mappings at all. The
    swapped-out counter might be of interest to users who would like to
    prevent future swapins during a performance-critical operation and
    pre-fault the pages at their convenience. Especially for larger
    swapped-out regions, the cost of swapin is much higher than a fresh
    page allocation, so differentiating between pte_none and swapped out is
    important for those use cases.

    One downside of this patch is that it makes /proc/pid/smaps more
    expensive for shmem mappings, as we consult the radix tree for each
    pte_none entry, so the overall complexity is O(n*log(n)). I have
    measured this on a process that creates a 2GB mapping and dirties
    single pages with a stride of 2MB, and timed how long it takes to cat
    /proc/pid/smaps of this process 100 times.
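
    A userspace sketch of this kind of measurement (the file name, constants
    and the in-process timing loop are mine; the original numbers below were
    obtained by simply timing cat /proc/<pid>/smaps externally):

    /* Map 2GB of a tmpfs file privately, dirty one page every 2MB, then
     * read the process's own smaps 100 times and report how long it took.
     * Build with: cc -O2 -o smaps-bench smaps-bench.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define MAP_SIZE (2UL << 30)            /* 2 GB */
    #define STRIDE   (2UL << 20)            /* dirty one page every 2 MB */

    int main(void)
    {
            int fd = open("/dev/shm/smaps-bench", O_RDWR | O_CREAT, 0600);
            if (fd < 0 || ftruncate(fd, MAP_SIZE) != 0) {
                    perror("setup");
                    return 1;
            }
            char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);  /* private writable */
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            for (size_t off = 0; off < MAP_SIZE; off += STRIDE)
                    map[off] = 1;           /* dirty a single page per 2MB */

            struct timespec t0, t1;
            char buf[1 << 16];
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < 100; i++) { /* mimic "cat smaps" 100 times */
                    int sfd = open("/proc/self/smaps", O_RDONLY);
                    ssize_t n;

                    if (sfd < 0) {
                            perror("open smaps");
                            return 1;
                    }
                    while ((n = read(sfd, buf, sizeof(buf))) > 0)
                            ;               /* discard; we only time it */
                    close(sfd);
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("100 reads of smaps took %.3f s\n",
                   (t1.tv_sec - t0.tv_sec) +
                   (t1.tv_nsec - t0.tv_nsec) / 1e9);

            munmap(map, MAP_SIZE);
            close(fd);
            unlink("/dev/shm/smaps-bench");
            return 0;
    }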

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    Mapping of a /dev/shm/file:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    The difference is rather substantial, so the next patch will reduce the
    cost for shared or read-only mappings.

    In a less controlled experiment, I've gathered pids of processes on my
    desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
    included the Chrome browser and some KDE processes. Again, I've run cat
    /proc/pid/smaps on each 100 times.

    Before this patch:

    real 0m9.050s
    user 0m0.518s
    sys 0m8.066s

    After this patch:

    real 0m9.221s
    user 0m0.541s
    sys 0m8.187s

    This suggests low impact on average systems.

    Note that this patch doesn't attempt to adjust the SwapPss field for
    shmem mappings, which would need extra work to determine who else could
    have the pages mapped. Thus the value stays zero except for COWed
    swapped out pages in a shmem mapping, which are accounted as usual.

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Jerome Marchand
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This series is based on Jerome Marchand's [1] so let me quote the first
    paragraph from there:

    There are several shortcomings with the accounting of shared memory
    (sysV shm, shared anonymous mapping, mapping to a tmpfs file). The
    values in /proc/<pid>/status and statm don't allow one to distinguish
    between shmem memory and a shared mapping to a regular file, even
    though their implications on memory usage are quite different: at
    reclaim, a file mapping can be dropped or written back to disk while
    shmem needs a place in swap. As for shmem pages that are swapped out
    or in the swap cache, they aren't accounted at all.

    My original motivation is that a customer found it (IMHO rightfully)
    confusing that e.g. top's output for process swap usage is unreliable
    with respect to swapped-out shmem pages, which are not accounted for.

    The fundamental difference between private anonymous and shmem pages is
    that the latter have their PTEs converted to pte_none, not to swapents.
    As such, they are not accounted in the number of swapents visible e.g.
    in the /proc/pid/status VmSwap row. It might be theoretically possible
    to use swapents when swapping out shmem (without extra cost, as one has
    to change all mappers anyway), and on swap-in only convert the swapent
    for the faulting process, leaving swapents in other processes until
    they also fault (so again no extra cost). But I don't know how many
    assumptions this would break, and it would be too disruptive a change
    for a relatively small benefit.

    Instead, my approach is to document the limitation of VmSwap, and to
    provide a means to determine the swap usage of shmem areas for those
    who are interested and willing to pay the price, using
    /proc/pid/smaps. Outside of ipcs, I don't think it's currently
    possible to determine this usage at all. The previous patchset [1] did
    introduce new shmem-specific fields into the smaps output, and
    functions to determine the values. I take a simpler approach, noting
    that the smaps output already has a "Swap: X kB" line, where currently
    X == 0 always for shmem areas. I think we can just consider this a bug
    and provide the proper value by consulting the radix tree, as e.g.
    mincore_page() does. In the patch changelog I explain why this is also
    not perfect (and cannot be without swapents), but still arguably much
    better than showing a 0.

    The last two patches are adapted from Jerome's patchset and provide a
    VmRSS breakdown into RssAnon, RssFile and RssShm in /proc/pid/status.
    Hugh noted that this is a welcome addition, and I agree that it might
    help e.g. debugging process memory usage, at an albeit non-zero but
    still rather low cost of an extra per-mm counter and some page flag
    checks.

    [1] http://lwn.net/Articles/611966/

    This patch (of 6):

    The documentation for /proc/pid/status does not mention that the value
    of VmSwap counts only swapped out anonymous private pages, and not
    swapped out pages of the underlying shmem objects (for shmem mappings).
    This is not obvious, so document this limitation.

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Jerome Marchand
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Make memmap_valid_within return bool due to this particular function
    only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • To make the intention clearer, use list_{next,first}_entry instead of
    list_entry.
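
    For illustration, a kernel-context sketch of the kind of substitution
    meant here (struct item is a made-up example type):

    #include <linux/list.h>

    struct item {
            int value;
            struct list_head node;
    };

    /* Before: the reader has to spot the "->next" to see what is meant. */
    static struct item *first_item_old(struct list_head *head)
    {
            return list_entry(head->next, struct item, node);
    }

    /* After: list_first_entry() states the intent directly. */
    static struct item *first_item_new(struct list_head *head)
    {
            return list_first_entry(head, struct item, node);
    }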

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • __alloc_pages_slowpath loops over ALLOC_NO_WATERMARKS requests if
    __GFP_NOFAIL is requested. This is fragile because we are basically
    relying on somebody else to do the reclaim (be it direct reclaim or the
    OOM killer) for us. The caller might be holding resources (e.g.
    locks) which block other reclaimers from making any progress, for
    example. Remove the retry loop and rely on __alloc_pages_slowpath to
    invoke all allowed reclaim steps and retry logic.

    We have to be careful about __GFP_NOFAIL allocations from the
    PF_MEMALLOC context, even though this is a very bad idea to begin with
    because no progress can be guaranteed at all. We shouldn't break the
    __GFP_NOFAIL semantics here, though. It could be argued that this is
    essentially a GFP_NOWAIT context, which we do not support, but
    PF_MEMALLOC is much harder to check for in existing users because the
    allocation might happen deep down a code path executed long after the
    flag was set, so we cannot really rule out that some kernel path
    triggers this combination.

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_high_priority doesn't do anything special other than call
    get_page_from_freelist and loop around a GFP_NOFAIL allocation until it
    succeeds. It would be better if the first part was done in
    __alloc_pages_slowpath, where we modify the zonelist, because this
    would be easier to read and understand. Open-coding the function into
    its only caller also allows it to be simplified a bit.

    This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hardcoding the index into the zonelists array in gfp_zonelist() is not
    a good idea; let's enumerate it to improve readability.

    No functional change.
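
    A kernel-context sketch of the shape of the change (the enumerator
    names here are illustrative, not verified against the patch):

    enum {
            ZONELIST_FALLBACK,              /* zonelist with fallback */
    #ifdef CONFIG_NUMA
            ZONELIST_NOFALLBACK,            /* zonelist for __GFP_THISNODE */
    #endif
            MAX_ZONELISTS
    };

    static inline int gfp_zonelist(gfp_t flags)
    {
    #ifdef CONFIG_NUMA
            if (unlikely(flags & __GFP_THISNODE))
                    return ZONELIST_NOFALLBACK;
    #endif
            return ZONELIST_FALLBACK;       /* instead of a hardcoded 0/1 */
    }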

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
    Signed-off-by: Yaowei Bai
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Since commit a0b8cab3b9b2 ("mm: remove lru parameter from
    __pagevec_lru_add and remove parts of pagevec API") there's no
    user of this function anymore, so remove it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make memblock_is_memory() and memblock_is_reserved() return bool to
    improve readability, due to these particular functions only using
    either one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make is_file_hugepages() return bool to improve readability, due to
    this particular function only using either one or zero as its return
    value.

    This patch also removes the if condition so that is_file_hugepages()
    returns directly.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Move the node_id, zone_idx and shrink-flags calculations into the trace
    function, so that we don't need to calculate these arguments when the
    tracepoint is disabled, and the function takes fewer arguments.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Now we have a tracepoint in test_pages_isolated() to report a pfn that
    cannot be isolated. But in alloc_contig_range(), some error paths
    don't call test_pages_isolated(), so it's still hard to know the exact
    pfn that causes the allocation failure.

    This patch changes the situation by calling test_pages_isolated() in
    almost every error path. In the allocation failure case, some overhead
    is added by this change, but allocation failure is a really rare event,
    so it should not matter.

    In the fatal-signal-pending case, we don't call test_pages_isolated(),
    because that failure is an intentional one.

    There was also a bogus outer_start problem due to an unchecked buddy
    order, and this patch fixes it. Before this patch it didn't matter,
    because the end result was the same; but after this patch the
    tracepoint will report the failed pfn, so it should be accurate.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA allocation should be guaranteed to succeed, but sometimes it can
    fail in the current implementation. To track down the problem, we need
    to know which page is problematic, and this new tracepoint will report
    it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step for reporting the failed pfn in a new
    tracepoint, to analyze CMA allocation failure problems. There is no
    functional change in this patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Michal Nazarewicz
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When running the SPECint_rate gcc benchmark on some very large boxes,
    it was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue, since I found its setup
    to be easier.

    To be clear, the binaries were on tmpfs because of disk I/O
    requirements. We then used text replication to avoid icache misses and
    having all the copies banging on the memory where the instruction code
    resides. This results in us hitting a bottleneck in
    mpol_shared_policy_lookup(), since the lookup is serialised by the
    shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks, getting worse
    until it threatens to livelock once it gets large enough. For example,
    on the gamess benchmark at 128 ranks this area consumes only ~1% of the
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area, I converted the spinlock to
    an rwlock. This allows a large number of lookups to happen
    simultaneously. The results were quite good, reducing this consumption
    at max ranks to around 2%.
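
    A kernel-context sketch of the conversion pattern (the structure and
    functions below are simplified stand-ins, not the actual shared_policy
    code):

    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    struct policy_root {
            rwlock_t lock;                  /* was: spinlock_t */
            struct rb_root tree;
    };

    /* Lookups only take the read side, so many of them can run at once. */
    static struct rb_node *policy_lookup(struct policy_root *root)
    {
            struct rb_node *n;

            read_lock(&root->lock);
            n = root->tree.rb_node;         /* ...walk the tree here... */
            read_unlock(&root->lock);
            return n;
    }

    /* Inserts and removals still take the write side exclusively. */
    static void policy_insert(struct policy_root *root, struct rb_node *new)
    {
            write_lock(&root->lock);
            /* ...rb_link_node()/rb_insert_color() here... */
            write_unlock(&root->lock);
    }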

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • __phys_to_pfn and __pfn_to_phys are symmetric, and PHYS_PFN and
    PFN_PHYS are symmetric as well:

    - if y = PFN_PHYS(x) = (phys_addr_t)x << PAGE_SHIFT,

    - then y >> PAGE_SHIFT = (phys_addr_t)x,

    - and PHYS_PFN(y) = (unsigned long)(y >> PAGE_SHIFT) = x.
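
    A self-contained userspace check of that round trip (the PAGE_SHIFT
    value and the phys_addr_t typedef are stand-ins for the kernel
    definitions; the macro bodies follow the relations above):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t phys_addr_t;           /* stand-in for the kernel type */
    #define PAGE_SHIFT 12                   /* stand-in for the kernel value */

    #define PFN_PHYS(x)     ((phys_addr_t)(x) << PAGE_SHIFT)
    #define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))

    int main(void)
    {
            unsigned long pfn = 0x12345;
            phys_addr_t phys = PFN_PHYS(pfn);   /* y = (phys_addr_t)x << PAGE_SHIFT */

            assert(PHYS_PFN(phys) == pfn);      /* (unsigned long)(y >> PAGE_SHIFT) == x */
            printf("pfn 0x%lx <-> phys 0x%llx\n",
                   pfn, (unsigned long long)phys);
            return 0;
    }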

    [akpm@linux-foundation.org: use macro arg name `x']
    [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
    Signed-off-by: Chen Gang
    Cc: Oleg Nesterov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Move trace_reclaim_flags() into the trace function, so that we don't
    need to calculate these flags when the tracepoint is disabled.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Simplify may_expand_vm().

    [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
    Signed-off-by: Chen Gang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • The page pointer is initialized to NULL but is reinitialized by
    follow_page_mask() before use. Drop the useless initialization of the
    page pointer at the beginning of the loop.

    Signed-off-by: Alexey Klimov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
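
    As an illustration of what such a conversion looks like at a single
    allocation site (kernel-context sketch; the structure is a made-up
    example, not one of the objects listed above):

    #include <linux/gfp.h>
    #include <linux/slab.h>

    struct per_user_thing {
            int id;
            char name[32];
    };

    /* A userspace-triggerable allocation opts in to memcg accounting by
     * adding __GFP_ACCOUNT to its gfp mask; nothing else changes. */
    static struct per_user_thing *alloc_thing(void)
    {
            return kmalloc(sizeof(struct per_user_thing),
                           GFP_KERNEL | __GFP_ACCOUNT);
    }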

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Make vmalloc family functions allocate vmalloc area pages with
    alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
    to memcg. This is needed, at least, to account alloc_fdmem allocations.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.
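
    A kernel-context sketch of using the new flag (the cache name and
    object type are made up):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/slab.h>

    struct widget {
            unsigned long id;
            char payload[64];
    };

    static struct kmem_cache *widget_cachep;

    static int __init widget_cache_init(void)
    {
            /* Every allocation from this cache is accounted to memcg,
             * without passing __GFP_ACCOUNT to each kmem_cache_alloc(). */
            widget_cachep = kmem_cache_create("widget_cache",
                                              sizeof(struct widget), 0,
                                              SLAB_ACCOUNT, NULL);
            return widget_cachep ? 0 : -ENOMEM;
    }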

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So this patch switches kmem accounting to the white-list policy: now
    only those kmem allocations that are marked as __GFP_ACCOUNT are
    accounted to memcg. Currently, no kmem allocations are marked like
    this. The following patches will mark several kmem allocations that
    are known to be easily triggered from userspace and therefore should be
    accounted to memcg.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT").

    The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch
    reverts bits introducing the black-list policy. The white-list policy
    will be introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures; many of them
    implicitly rely on the fact that kmalloc never fails, while memcg makes
    failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, and might result in a noticeable increase in
    memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay the accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add a new helper function, get_first_slab(), that gets the first slab
    from a kmem_cache_node.

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_for_each_entry().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_first_entry_or_null().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • A little cleanup - the invocation site provides the semicolon.

    Cc: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • lksb flags are defined in both dlmapi.h and dlmcommon.h, so remove the
    duplicate definitions from dlmcommon.h.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Found this during patch review; remove it to make the code clearer and
    save a little CPU time.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • In ocfs2_orphan_del, the code currently finds and deletes the entry
    first, and then accesses the orphan dir dinode. This is a problem once
    ocfs2_journal_access_di fails: the entry will have been removed from
    the orphan dir, but the inode hasn't actually been deleted
    successfully. In other words, the file is missing but not actually
    deleted. So we should access the orphan dinode first, as unlink and
    rename do.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When two processes are migrating the same lockres,
    dlm_add_migration_mle() returns -EEXIST but inserts a new mle into the
    hash list. dlm_migrate_lockres() will then detach the old mle and free
    the new one, which is already in the hash list, and that will corrupt
    the list.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • We have found that the migration source will trigger a BUG when the
    target goes down during migration, because the refcount of the mle is
    already zero before the put. The situation is as follows:

    dlm_migrate_lockres
      dlm_add_migration_mle
      dlm_mark_lockres_migrating
      dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
      dlm_send_one_lockres, then wait for the target to become the new
      master.
        <<<<<< o2hb detects that the target is down and cleans up the
        migration mle. Now the refcount is 1.

    dlm_migrate_lockres is then woken and puts the mle twice when it finds
    that the target has gone down, which triggers the BUG with the
    following message: "ERROR: bad mle: ".

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • DLM does not cache locks, so blocking lock and unlock calls will only
    make performance worse where contention over the locks is high.

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • The following case will lead to a slot being overwritten:

    1. N1 mounts the ocfs2 volume, finds and allocates slot 0, sets
       osb->slot_num to 0, and begins to write the slot info to disk.
    2. N2 mounts the same volume and waits for the super lock.
    3. On N1 the block write fails because the storage link is down, and
       N1 unlocks the super lock.
    4. N2 gets the super lock, also allocates slot 0, and then unlocks the
       super lock.
    5. The N1 mount fails and the volume is dismounted. Since
       osb->slot_num is 0, N1 tries to put the invalid slot to disk, and
       this will succeed once the storage link is restored. N2's slot info
       is now overwritten.

    Once another node, say N3, mounts, it will find and allocate slot 0
    again, which will lead to a mount hang because the journal has already
    been locked by N2. So when writing the slot info fails, invalidate the
    slot in advance to avoid overwriting it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • dlm_grab() may return NULL when the node is unmounting. During code
    review, we found that some dlm handlers return an error to the caller
    when dlm_grab() returns NULL, which makes the caller BUG or causes
    other problems. Here is an example:

    1. Node 1 receives a migration message from node 3 and sends a migrate
       request to the other nodes.
    2. Node 2 starts unmounting.
    3. Node 2 receives the migrate request from node 1 and calls
       dlm_migrate_request_handler().
    4. Node 2's unmount thread unregisters the domain handlers and removes
       the dlm_context from dlm_domains.
    5. dlm_migrate_request_handler() returns -EINVAL to node 1.
    6. Node 1 exits migration without clearing the migration state or
       sending an assert master message to node 3, which causes node 3 to
       hang.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei