20 Jan, 2017

1 commit

  • commit f05714293a591038304ddae7cb0dd747bb3786cc upstream.

    During development of zram-swap asynchronous writeback, I found strange
    corruption of a compressed page, resulting in:

    Modules linked in: zram(E)
    CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff88007620b840 task.stack: ffff880078090000
    RIP: set_freeobj.part.43+0x1c/0x1f
    RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
    RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
    RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
    RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
    R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
    Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

    Investigation revealed that stable pages are currently not supported
    for anonymous pages. IOW, reuse_swap_page can reuse the page without
    waiting for writeback completion, so it can overwrite a page zram is
    still compressing.

    Unfortunately, zram has used the per-cpu stream feature since v4.7.
    It aims to increase the cache hit ratio of the scratch buffer used
    for compression. The downside of that approach is that zram must
    allocate memory space for the compressed page in per-cpu context,
    which requires a strict gfp flag and can fail. If it fails, zram
    retries the allocation outside of per-cpu context, where it may
    succeed, then compresses the data again and copies it into the newly
    allocated space.

    In this scenario, zram assumes the data never changes, but that does
    not hold without stable page support. So, if the data is changed under
    us, zram can overrun the buffer, because the second compression can
    produce a larger size than the first trial did, and the bigger object
    is then blindly copied into the smaller buffer, which is a buffer
    overrun. The overrun breaks zsmalloc's free object chaining, so the
    system crashes as shown above.

    I think the report below is the same problem.
    https://bugzilla.suse.com/show_bug.cgi?id=997574

    Unfortunately, reuse_swap_page should be atomic, so we cannot wait on
    writeback there; the approach in this patch is to simply return false
    if we find the page needs stable writes. Although this temporarily
    increases the memory footprint, it happens rarely and the extra memory
    should be reclaimed easily when it does. It is also better than waiting
    for IO completion, which is on the critical path for application
    latency.
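
    A minimal sketch of that idea (flag and helper names are assumptions,
    not a copy of the upstream diff): mark the swap device at swapon time
    when its backing device demands stable pages, and have
    reuse_swap_page() refuse to reuse a swap cache page that is still
    under writeback on such a device.

        /* at swapon time (sketch) */
        if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
                p->flags |= SWP_STABLE_WRITES;

        /* in reuse_swap_page(), which must stay atomic (sketch) */
        if (PageSwapCache(page) && PageWriteback(page) &&
            (page_swap_info(page)->flags & SWP_STABLE_WRITES))
                return false;  /* force a copy instead of waiting for writeback */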

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
    Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Darrick J. Wong
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

12 Nov, 2016

1 commit

  • When root activates a swap partition whose header has the wrong
    endianness, nr_badpages elements of badpages are swabbed before
    nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

    This normally is not a security issue because it can only be exploited
    by root (more specifically, a process with CAP_SYS_ADMIN or the ability
    to modify a swap file/partition), and such a process can already e.g.
    modify swapped-out memory of any other userspace process on the system.
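
    A self-contained sketch of the ordering the fix needs (simplified,
    hypothetical header layout, not the kernel's actual structures):
    validate nr_badpages against the bound implied by the header before
    byte-swapping the array it sizes.

        #include <stdint.h>
        #include <byteswap.h>

        /* Illustrative bound; the kernel derives it from the header page size. */
        #define MAX_SWAP_BADPAGES 637

        struct swap_header_info {          /* simplified stand-in */
                uint32_t version;
                uint32_t last_page;
                uint32_t nr_badpages;
                uint32_t badpages[MAX_SWAP_BADPAGES];
        };

        /* Returns 0 on success, -1 if the foreign-endian header is bogus. */
        static int swab_swap_header(struct swap_header_info *info)
        {
                uint32_t i, nr = bswap_32(info->nr_badpages);

                /* Check the count BEFORE trusting it as a loop bound... */
                if (nr > MAX_SWAP_BADPAGES)
                        return -1;
                info->nr_badpages = nr;

                /* ...so the swab loop can no longer run past the array. */
                for (i = 0; i < nr; i++)
                        info->badpages[i] = bswap_32(info->badpages[i]);
                return 0;
        }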

    Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Acked-by: Jerome Marchand
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

08 Oct, 2016

2 commits

  • This patch improves the performance of swap cache operations when the
    type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree for each swap device. If the type of the swap device is not
    0, the height of the radix tree of the swap cache is increased
    unnecessarily, especially on 64-bit architectures. For example, for a
    1GB swap device on the x86_64 architecture, the height of the radix
    tree of the swap cache is 11. But if the offset of the swap entry is
    used as the key of the swap cache, the height of the radix tree of the
    swap cache is 4. The increased height causes unnecessary radix tree
    descending and increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache by
    using the offset of the swap entry instead of the whole swap entry
    value as the key of the swap cache. In a 32-process sequential
    swap-out test case on a Xeon E5 v3 system with a RAM disk as swap, the
    lock contention on the swap cache spinlock is reduced from 20.15% to
    12.19% when the type of the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
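
    A rough standalone illustration of why the key choice matters (the
    constants below are assumptions for the sake of the example, so the
    computed heights need not match the measured 11 vs. 4 above): the swap
    type lives in the high bits of the swap entry value, so using the full
    value as the radix tree key forces the tree to span a huge key space
    even for a small swap device.

        #include <stdio.h>

        #define RADIX_TREE_MAP_SHIFT 6   /* assumed: 64 slots per node */
        #define SWP_TYPE_SHIFT       58  /* assumed: type in the high bits */

        static unsigned int levels_needed(unsigned long long max_key)
        {
                unsigned int levels = 1;

                while (max_key >> RADIX_TREE_MAP_SHIFT) {
                        max_key >>= RADIX_TREE_MAP_SHIFT;
                        levels++;
                }
                return levels;
        }

        int main(void)
        {
                unsigned long long max_offset = (1ULL << 18) - 1; /* ~1GB of 4K pages */
                unsigned long long type = 1;                      /* second swap device */
                unsigned long long full_entry = (type << SWP_TYPE_SHIFT) | max_offset;

                printf("levels with offset-only key: %u\n", levels_needed(max_offset));
                printf("levels with full-entry key:  %u\n", levels_needed(full_entry));
                return 0;
        }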

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This is a code cleanup patch without functional changes. The
    swap_cluster_list data structure and its operations are introduced to
    provide better encapsulation for the free-cluster and discard-cluster
    list operations. This avoids some code duplication, improves code
    readability, and reduces the total line count.
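
    A hedged, userspace-sized sketch of the encapsulation idea (the kernel
    version threads cluster indices through swap_info_struct's
    cluster_info array; names and fields here are simplified assumptions):

        #include <stdio.h>
        #include <stdbool.h>

        #define NR_CLUSTERS  8
        #define CLUSTER_NULL NR_CLUSTERS              /* sentinel: "no cluster" */

        struct cluster_info { unsigned int next; };   /* simplified stand-in */

        struct swap_cluster_list { unsigned int head, tail; };

        static bool cluster_list_empty(const struct swap_cluster_list *list)
        {
                return list->head == CLUSTER_NULL;
        }

        /* One helper replaces the head/tail juggling duplicated at call sites. */
        static void cluster_list_add_tail(struct swap_cluster_list *list,
                                          struct cluster_info *ci, unsigned int idx)
        {
                ci[idx].next = CLUSTER_NULL;
                if (cluster_list_empty(list))
                        list->head = idx;
                else
                        ci[list->tail].next = idx;
                list->tail = idx;
        }

        static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
                                                   struct cluster_info *ci)
        {
                unsigned int idx = list->head;

                list->head = ci[idx].next;
                if (list->head == CLUSTER_NULL)
                        list->tail = CLUSTER_NULL;
                return idx;
        }

        int main(void)
        {
                struct cluster_info ci[NR_CLUSTERS];
                struct swap_cluster_list free_clusters = { CLUSTER_NULL, CLUSTER_NULL };

                for (unsigned int i = 0; i < NR_CLUSTERS; i++)
                        cluster_list_add_tail(&free_clusters, ci, i);
                while (!cluster_list_empty(&free_clusters))
                        printf("free cluster %u\n",
                               cluster_list_del_first(&free_clusters, ci));
                return 0;
        }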

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

20 Sep, 2016

1 commit

  • Commit 62c230bc1790 ("mm: add support for a filesystem to activate
    swap files and use direct_IO for writing swap pages") replaced the
    swap_aops dirty hook from __set_page_dirty_no_writeback() with
    swap_set_page_dirty().

    For normal cases without these special SWP flags, the code path falls
    back to __set_page_dirty_no_writeback(), so the behaviour is expected
    to be the same as before.

    But swap_set_page_dirty() makes use of the page_swap_info() helper to
    get the swap_info_struct to check for the flags like SWP_FILE,
    SWP_BLKDEV etc as desired for those features. This helper has
    BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
    set_page_dirty_lock() path.

    For the set_page_dirty() path, which often needs to be callable from
    irq context, kswapd can toggle the flag behind our back while the call
    is executing, when the system is low on memory and heavy swapping is
    ongoing.

    This ends up in an undesired kernel panic.

    This patch simply moves the check out of the helper and into its
    users, as appropriate, to fix the kernel panic on the described path.
    A couple of users of the helper already take care of the SwapCache
    condition, so they were skipped.
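
    A hedged sketch of the shape of the change (simplified, not the literal
    diff): the racy caller checks PageSwapCache() itself and falls back
    safely if the page has left the swap cache, instead of relying on an
    assertion inside the helper.

        int swap_set_page_dirty(struct page *page)
        {
                /*
                 * Sketch: the PageSwapCache() check lives in the caller now.
                 * If the page raced out of the swap cache, fall through to
                 * the ordinary no-writeback path instead of hitting a
                 * BUG_ON inside page_swap_info().
                 */
                if (PageSwapCache(page)) {
                        struct swap_info_struct *sis = page_swap_info(page);

                        if (sis->flags & SWP_FILE)
                                return set_page_dirty_for_swapfile(page); /* hypothetical name */
                }
                return __set_page_dirty_no_writeback(page);
        }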

    Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
    Signed-off-by: Santosh Shilimkar
    Cc: Mel Gorman
    Cc: Joe Perches
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: David S. Miller
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     

27 Jul, 2016

1 commit

  • I have noticed that frontswap.h first declares "frontswap_enabled" as
    an extern bool variable, and then overrides it with "#define
    frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y, or (0) when disabled.
    The bool variable isn't actually instantiated anywhere.

    This all looks like an unfinished attempt to make frontswap_enabled
    reflect whether a backend is instantiated. But in the current state,
    all frontswap hooks call unconditionally into frontswap.c just to check
    if frontswap_ops is non-NULL. This should at least be checked inline,
    but we can further eliminate the overhead when CONFIG_FRONTSWAP is
    enabled and no backend registered, using a static key that is initially
    disabled, and gets enabled only upon first backend registration.

    Thus, checks for "frontswap_enabled" are replaced with
    "frontswap_enabled()" wrapping the static key check. There are two
    exceptions:

    - xen's selfballoon_process() was testing frontswap_enabled in code guarded
    by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
    The patch just removes this check. Using frontswap_enabled() does not sound
    correct here, as this can be true even without xen's own backend being
    registered.

    - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
    as it seems the bitmap allocation cannot currently be postponed until a
    backend is registered. This means that frontswap will still have some
    memory overhead by being configured, but without a backend.

    After the patch, we can expect that some functions in frontswap.c are
    called only when frontswap_ops is non-NULL. Change the checks there to
    VM_BUG_ONs. While at it, convert other BUG_ONs to VM_BUG_ONs as
    frontswap has been stable for some time.
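
    A hedged sketch of the pattern, using the kernel's static-key API as I
    understand it (details may differ from the actual patch):

        /* frontswap.h */
        extern struct static_key_false frontswap_enabled_key;

        static inline bool frontswap_enabled(void)
        {
                /* Stays a cheap, patched-out branch until a backend registers. */
                return static_branch_unlikely(&frontswap_enabled_key);
        }

        /* frontswap.c */
        DEFINE_STATIC_KEY_FALSE(frontswap_enabled_key);

        void frontswap_register_ops(struct frontswap_ops *ops)
        {
                /* ... hook the backend up ... */
                static_branch_enable(&frontswap_enabled_key); /* first backend arms the key */
        }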

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Juergen Gross
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 May, 2016

1 commit

  • This will provide full accuracy for the mapcount calculation in write
    protect faults, so page pinning will not get broken by false positive
    copy-on-writes.

    total_mapcount() isn't the right calculation needed in
    reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
    that is effectively the fully accurate return value of page_mapcount()
    when dealing with Transparent Hugepages; however, we only use
    page_trans_huge_mapcount() during COW faults, where it is strictly
    needed, due to its higher runtime cost.

    This also provides, at practically zero cost, the total_mapcount
    information which is needed to know whether we can still relocate the
    page anon_vma to the local vma. If page_trans_huge_mapcount() returns
    1 we can reuse the page no matter if it's a pte or a pmd_trans_huge
    triggering the fault, but we can only relocate the page anon_vma to
    the local vma->anon_vma if we're sure it's only this "vma" mapping the
    whole THP physical range.

    Kirill A. Shutemov discovered the problem with moving the page
    anon_vma to the local vma->anon_vma in a previous version of this
    patch and another problem in the way page_move_anon_rmap() was called.

    Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
    previous version, because reuse_swap_page must be a macro to call
    page_trans_huge_mapcount from swap.h, so this uses a macro again
    instead of an inline function. With this change at least it's a less
    dangerous usage than it was before, because "page" is used only once
    now, while with the previous code reuse_swap_page(page++) would have
    called page_mapcount on page+1 and it would have increased page twice
    instead of just once.
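
    A standalone toy illustrating the macro hazard being described (names
    are made up; this is not the real reuse_swap_page()):

        #include <stdio.h>

        static int fake_mapcount(int page) { return page; }

        /* Uses its argument twice: reuse_bad(p++) increments p twice. */
        #define reuse_bad(page)  (fake_mapcount(page) + fake_mapcount(page) == 2)

        /* Uses its argument exactly once, like the macro described above. */
        #define reuse_good(page) (fake_mapcount(page) == 1)

        int main(void)
        {
                int p = 1;

                (void)reuse_bad(p++);
                printf("after reuse_bad:  p = %d\n", p);  /* prints 3 */

                p = 1;
                (void)reuse_good(p++);
                printf("after reuse_good: p = %d\n", p);  /* prints 2 */
                return 0;
        }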

    Dean Luick noticed an uninitialized variable that could result in a
    rmap inefficiency for the non-THP case in a previous version.

    Mike Marciniszyn said:

    : Our RDMA tests are seeing an issue with memory locking that bisects to
    : commit 61f5d698cc97 ("mm: re-enable THP")
    :
    : The test program registers two rather large MRs (512M) and RDMA
    : writes data to a passive peer using the first and RDMA reads it back
    : into the second MR and compares that data. The sizes are chosen randomly
    : between 0 and 1024 bytes.
    :
    : The test will get through a few (
    Reviewed-by: "Kirill A. Shutemov"
    Reviewed-by: Dean Luick
    Tested-by: Alex Williamson
    Tested-by: Mike Marciniszyn
    Tested-by: Josh Collier
    Cc: Marc Haber
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and unlikely ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <<nothing>>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <<nothing>>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 4.6 kernel.

    Overall the coolest thing here for me is the nouveau maxwell signed
    firmware support from NVidia, it's taken a long while to extract this
    from them.

    I also wish the ARM vendors just designed one set of display IP, ARM
    display block proliferation is definitely increasing.

    Core:
    - drm_event cleanups
    - Internal API cleanup making mode_fixup optional.
    - Apple GMUX vga switcheroo support.
    - DP AUX testing interface

    Panel:
    - Refactoring of DSI core for use over more transports.

    New driver:
    - ARM hdlcd driver

    i915:
    - FBC/PSR (framebuffer compression, panel self refresh) enabled by default.
    - Ongoing atomic display support work
    - Ongoing runtime PM work
    - Pixel clock limit checks
    - VBT DSI description support
    - GEM fixes
    - GuC firmware scheduler enhancements

    amdkfd:
    - Deferred probing fixes to avoid make file or link ordering.

    amdgpu/radeon:
    - ACP support for i2s audio support.
    - Command Submission/GPU scheduler/GPUVM optimisations
    - Initial GPU reset support for amdgpu

    vmwgfx:
    - Support for DX10 gen mipmaps
    - Pageflipping and other fixes.

    exynos:
    - Exynos5420 SoC support for FIMD
    - Exynos5422 SoC support for MIPI-DSI

    nouveau:
    - GM20x secure boot support - adds acceleration for Maxwell GPUs.
    - GM200 support
    - GM20B clock driver support
    - Power sensors work

    etnaviv:
    - Correctness fixes for GPU cache flushing
    - Better support for i.MX6 systems.

    imx-drm:
    - VBlank IRQ support
    - Fence support
    - OF endpoint support

    msm:
    - HDMI support for 8996 (snapdragon 820)
    - Adreno 430 support
    - Timestamp queries support

    virtio-gpu:
    - Fixes for Android support.

    rockchip:
    - Add support for Innosilicon HDMI

    rcar-du:
    - Support for 4 crtcs
    - R8A7795 support
    - RCar Gen 3 support

    omapdrm:
    - HDMI interlace output support
    - dma-buf import support
    - Refactoring to remove a lot of legacy code.

    tilcdc:
    - Rewrite of pageflipping code
    - dma-buf support
    - pinctrl support

    vc4:
    - HDMI modesetting bug fixes
    - Significant 3D performance improvement.

    fsl-dcu (FreeScale):
    - Lots of fixes

    tegra:
    - Two small fixes

    sti:
    - Atomic support for planes
    - Improved HDMI support"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits)
    drm/amdgpu: release_pages requires linux/pagemap.h
    drm/sti: restore mode_fixup callback
    drm/amdgpu/gfx7: add MTYPE definition
    drm/amdgpu: removing BO_VAs shouldn't be interruptible
    drm/amd/powerplay: show uvd/vce power gate enablement for tonga.
    drm/amd/powerplay: show uvd/vce power gate info for fiji
    drm/amdgpu: use sched fence if possible
    drm/amdgpu: move ib.fence to job.fence
    drm/amdgpu: give a fence param to ib_free
    drm/amdgpu: include the right version of gmc header files for iceland
    drm/radeon: fix indentation.
    drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ
    drm/amdgpu: switch back to 32bit hw fences v2
    drm/amdgpu: remove amdgpu_fence_is_signaled
    drm/amdgpu: drop the extra fence range check v2
    drm/amdgpu: signal fences directly in amdgpu_fence_process
    drm/amdgpu: cleanup amdgpu_fence_wait_empty v2
    drm/amdgpu: keep all fences in an RCU protected array v2
    drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring
    drm/amdgpu: RCU protected amd_sched_fence_release
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments
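
    An illustrative example of the single-string preference (made up, not
    taken from the patch itself):

        /* Splitting the user-visible string makes it ungreppable: */
        pr_info("swapon: swapfile has holes, "
                "and will be ignored\n");

        /* Preferred: keep it as a single string, even past 80 columns: */
        pr_info("swapon: swapfile has holes, and will be ignored\n");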

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

09 Feb, 2016

1 commit

  • - support for v3 vbt dsi blocks (Jani)
    - improve mmio debug checks (Mika Kuoppala)
    - reorg the ddi port translation table entries and related code (Ville)
    - reorg gen8 interrupt handling for future platforms (Tvrtko)
    - refactor tile width/height computations for framebuffers (Ville)
    - kerneldoc integration for intel_pm.c (Jani)
    - move default context from engines to device-global dev_priv (Dave Gordon)
    - make seqno/irq ordering coherent with execlist (Chris)
    - decouple internal engine number from UABI (Chris&Tvrtko)
    - tons of small fixes all over, as usual

    * tag 'drm-intel-next-2016-01-24' of git://anongit.freedesktop.org/drm-intel: (148 commits)
    drm/i915: Update DRIVER_DATE to 20160124
    drm/i915: Seal busy-ioctl uABI and prevent leaking of internal ids
    drm/i915: Decouple execbuf uAPI from internal implementation
    drm/i915: Use ordered seqno write interrupt generation on gen8+ execlists
    drm/i915: Limit the auto arming of mmio debugs on vlv/chv
    drm/i915: Tune down "GT register while GT waking disabled" message
    drm/i915: tidy up a few leftovers
    drm/i915: abolish separate per-ring default_context pointers
    drm/i915: simplify allocation of driver-internal requests
    drm/i915: Fix NULL plane->fb oops on SKL
    drm/i915: Do not put big intel_crtc_state on the stack
    Revert "drm/i915: Add two-stage ILK-style watermark programming (v10)"
    drm/i915: add DOC: headline to RC6 kernel-doc
    drm/i915: turn some bogus kernel-doc comments to normal comments
    drm/i915/sdvo: revert bogus kernel-doc comments to normal comments
    drm/i915/gen9: Correct max save/restore register count during gpu reset with GuC
    drm/i915: Demote user facing DMC firmware load failure message
    drm/i915: use hlist_for_each_entry
    drm/i915: skl_update_scaler() wants a rotation bitmask instead of bit number
    drm/i915: Don't reject primary plane windowing with color keying enabled on SKL+
    ...

    Dave Airlie
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
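
    A sketch of the wrappers as described (essentially inode_foo()
    forwarding to mutex_foo() on ->i_mutex; the exact set of helpers may
    differ):

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

        static inline int inode_is_locked(struct inode *inode)
        {
                return mutex_is_locked(&inode->i_mutex);
        }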

    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2016

2 commits

  • Swap cache pages are freed aggressively if swap is nearly full (>50%
    currently), because otherwise we are likely to stop scanning anonymous
    pages when we near the swap limit, even if there are plenty of
    freeable swap cache pages. We should follow the same trend for the
    memory cgroup, which has its own swap limit.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
    swap usage less than memory usage. Taking into account the fact that
    the primary limiting mechanism in the unified hierarchy is
    memory.high while memory.limit is either left unset or set to a very
    large value, moving memsw.limit knob to the unified hierarchy would
    effectively make it impossible to limit swap usage according to the
    user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
    both swap entry and a swap cache page is charged only once to memsw
    counter. As a result, it is possible to effectively eat up to
    memory.limit of memory pages *and* memsw.limit of swap entries, which
    looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using new memory.swap.current file and
    limited using memory.swap.max.

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Jan, 2016

4 commits

  • Both s390 and powerpc have hit the issue of swapoff hanging, when
    CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
    quite as x86_64 had them. I think it would be much clearer if
    HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
    determine whether the MEM_SOFT_DIRTY option should be offered, and the
    actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.

    But won't embark on that change myself: instead make swapoff more
    robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
    without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That being a no-op,
    whether the bit in question is defined as 0 or the asm-generic fallback
    is used, unless soft dirty is fully turned on.

    Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().

    Signed-off-by: Hugh Dickins
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Cyrill Gorcunov
    Cc: Laurent Dufour
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • With the new refcounting we will be able to map the same compound page
    with PTEs and PMDs. This requires adjusting the conditions under which
    we can reuse the page on a write-protection fault.

    For a PTE fault we can't reuse the page if it's part of a huge page.

    For a PMD we can only reuse the page if nobody else maps the huge page
    or any part of it. We can do that by checking page_mapcount() on each
    sub-page, but it's expensive.

    The cheaper way is to check that page_count() is equal to 1: every
    mapcount takes a page reference, so this way we can guarantee that the
    PMD is the only mapping.

    This approach can give false negative if somebody pinned the page, but
    that doesn't affect correctness.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with rmap, with the new refcounting we cannot rely on the
    PageTransHuge() check to decide if we need to charge the size of a
    huge page to the cgroup. We need information from the caller to know
    whether it was mapped with a PMD or a PTE.

    We uncharge when the last reference on the page is gone. At that
    point, if we see PageTransHuge() it means we need to uncharge the
    whole huge page.

    The tricky part is partial unmap -- when we try to unmap part of a
    huge page. We don't do any special handling of this situation, meaning
    we don't uncharge the part of the huge page unless the last user is
    gone or split_huge_page() is triggered. If cgroup memory pressure
    happens, the partially unmapped page will be split through the
    shrinker. This should be good enough.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we cannot rely on the PageTransHuge() check to decide
    whether to map/unmap a small page or the whole THP.

    The patch adds a new argument to the rmap functions to indicate
    whether we want to operate on the whole compound page or only the
    small page.

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

2 commits


05 Jan, 2016

1 commit

  • Some modules, like i915.ko, use swappable objects and may try to swap
    them out under memory pressure (via the shrinker). Before doing so, they
    want to check using get_nr_swap_pages() to see if any swap space is
    available as otherwise they will waste time purging the object from the
    device without recovering any memory for the system. This requires the
    nr_swap_pages counter to be exported to the modules.
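
    The resulting usage pattern in a module's shrinker looks roughly like
    this (illustrative sketch; the function name is made up):

        static unsigned long my_gem_shrinker_scan(struct shrinker *shrinker,
                                                  struct shrink_control *sc)
        {
                unsigned long freed = 0;

                /*
                 * Without free swap, purging swappable objects from the
                 * device cannot recover any memory for the system.
                 */
                if (get_nr_swap_pages() <= 0)
                        return SHRINK_STOP;

                /* ... unbind objects so they can be swapped out ... */
                return freed;
        }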

    Signed-off-by: Chris Wilson
    Cc: "Goel, Akash"
    Cc: Johannes Weiner
    Cc: linux-mm@kvack.org
    Link: http://patchwork.freedesktop.org/patch/msgid/1449244734-25733-1-git-send-email-chris@chris-wilson.co.uk
    Acked-by: Andrew Morton
    Acked-by: Johannes Weiner
    Signed-off-by: Daniel Vetter

    Chris Wilson
     

09 Sep, 2015

1 commit

  • We want to know the per-process working set size for smart memory
    management in userland, and we use swap (e.g. zram) heavily to
    maximize memory efficiency, so the working set includes swap as well
    as RSS.

    On such a system, if there are lots of shared anonymous pages, it's
    really hard to figure out exactly how much memory each process
    consumes (i.e. rss + swap), for example on Android.

    This patch introduces a SwapPss field in /proc/<pid>/smaps so we can
    get a more exact working set size per process.

    Bongkyu tested it. Result is below.

    1. 50M used swap
    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

    2. 240M used swap
    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744

    [akpm@linux-foundation.org: simplify kunmap_atomic() call]
    Signed-off-by: Minchan Kim
    Reported-by: Bongkyu Kim
    Tested-by: Bongkyu Kim
    Cc: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Jonathan Corbet
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

21 Aug, 2015

1 commit

  • While running KernelThreadSanitizer (ktsan) on upstream kernel with
    trinity, we got a few reports from SyS_swapon, here is one of them:

    Read of size 8 by thread T307 (K7621):
    [< inlined >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395
    [] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345
    [] ia32_do_call+0x1b/0x25

    Looks like the swap_lock should be taken when iterating through the
    swap_info array on lines 2392 - 2401: q->swap_file may be reset to
    NULL by another thread before it is dereferenced for f_mapping.

    But why is that iteration needed at all? Doesn't the claim_swapfile()
    which follows do all that is needed to check for a duplicate entry -
    FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile?

    Well, not quite: bd_may_claim() allows the same "holder" to claim the
    bdev again, so we do need to use a different holder than "sys_swapon";
    and we should not replace appropriate -EBUSY by inappropriate -EINVAL.

    Index i was reused in a cpu loop further down: renamed cpu there.

    Reported-by: Andrey Konovalov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Al Viro

    Hugh Dickins
     

24 Jun, 2015

1 commit


16 Apr, 2015

1 commit

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since it doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE, and uses the
    new READ_ONCE API for the read accesses. This makes things cleaner,
    instead of using separate/multiple sets of APIs.

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

11 Dec, 2014

1 commit

  • Now that the external page_cgroup data structure and its lookup is gone,
    the only code remaining in there is swap slot accounting.

    Rename it and move the conditional compilation into mm/Makefile.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.
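
    A hedged sketch of what a callsite looks like with the transactional
    API described above (argument lists simplified; treat the exact
    signatures as assumptions):

        struct mem_cgroup *memcg;
        int ret;

        ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
        if (ret)
                goto out;       /* reclaim inside the memcg did not help */

        /* ... establish page->mapping and rmap, so the page has a type ... */

        if (succeeded)
                mem_cgroup_commit_charge(page, memcg, false);
        else
                mem_cgroup_cancel_charge(page, memcg);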

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2014

3 commits

  • Since commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"),
    all SWP_SOLIDSTATE "seek is cheap" (SSD) cases go through the
    si->cluster_info scan_swap_map_try_ssd_cluster() route, so the
    "last_in_cluster < scan_base" loop in the body of scan_swap_map() has
    become dead code and should be deleted.

    This patch deletes the redundant loop, as Hugh and Shaohua suggested.

    [hughd@google.com: fix comment, simplify code]
    Signed-off-by: Chen Yucong
    Cc: Shaohua Li
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle
    ordering the list entries based on priority, which is exactly what
    swap was doing manually. All the ordering code is now removed, and
    swap_info_struct entries are simply added to their corresponding plist
    and automatically ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the plist
    when it becomes full or not-full; previously it could not do so
    because the swap_info_struct->lock is held when it changes from full
    to not-full, and the swap_lock protecting the main swap_active_head
    must be ordered before any swap_info_struct->lock.
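
    A hedged sketch of the plist usage described (field names and details
    are assumptions; the locking in the real code is richer):

        static PLIST_HEAD(swap_avail_head);
        static DEFINE_SPINLOCK(swap_avail_lock);

        /* A device drops out of the "available" plist when it fills up... */
        spin_lock(&swap_avail_lock);
        plist_del(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);

        /* ...and re-adds itself once entries are freed again.  plist keeps
         * entries sorted by prio, so get_swap_page() can simply start at
         * the first (highest priority, not-full) entry. */
        spin_lock(&swap_avail_lock);
        plist_node_init(&si->avail_list, -si->prio);  /* lower value sorts first */
        plist_add(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);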

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

07 Feb, 2014

1 commit

  • swapoff clears swap_info's SWP_USED flag prematurely and frees its
    resources after that. A concurrent swapon can reuse this swap_info
    while its previous resources are not completely cleared.

    These late freed resources are:
    - p->percpu_cluster
    - swap_cgroup_ctrl[type]
    - block_device setting
    - inode->i_flags &= ~S_SWAPFILE

    This patch clears the SWP_USED flag after all its resources are freed,
    so that swapon can reuse this swap_info by alloc_swap_info() safely.

    [akpm@linux-foundation.org: tidy up code comment]
    Signed-off-by: Weijie Yang
    Acked-by: Hugh Dickins
    Cc: Krzysztof Kozlowski
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     

24 Jan, 2014

2 commits

  • In the second half of scan_swap_map()'s scan loop, offset is set to
    si->lowest_bit and then incremented before entering the loop for the
    first time, causing si->swap_map[si->lowest_bit] to be skipped.
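
    The shape of the off-by-one in a standalone toy form (illustrative
    only, not the kernel loop):

        #include <stdio.h>

        int main(void)
        {
                int lowest_bit = 3, highest_bit = 6, offset;

                /* Buggy shape: incrementing before the first test skips lowest_bit. */
                offset = lowest_bit;
                while (++offset <= highest_bit)
                        printf("buggy scan visits offset %d\n", offset);  /* 4 5 6 */

                /* Fixed shape: examine lowest_bit itself on the first pass. */
                for (offset = lowest_bit; offset <= highest_bit; offset++)
                        printf("fixed scan visits offset %d\n", offset);  /* 3 4 5 6 */
                return 0;
        }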

    Signed-off-by: Jamie Liu
    Cc: Shaohua Li
    Acked-by: Hugh Dickins
    Cc: Minchan Kim
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Liu
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
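
    A hedged sketch of the macro's shape (simplified; the real definition
    is guarded by CONFIG_DEBUG_VM and may differ in detail):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_PAGE(cond, page)                                 \
                do {                                                       \
                        if (unlikely(cond)) {                              \
                                dump_page(page); /* show the page first */ \
                                BUG();                                     \
                        }                                                  \
                } while (0)
        #else
        #define VM_BUG_ON_PAGE(cond, page) do { } while (0)
        #endif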

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

13 Nov, 2013

2 commits

  • During swapoff the frontswap_map was NULL-ified before calling
    frontswap_invalidate_area(). However the frontswap_invalidate_area()
    exits early if frontswap_map is NULL. Invalidate was never called
    during swapoff.

    This patch moves frontswap_map_set() in swapoff just after calling
    frontswap_invalidate_area() so outside of locks (swap_lock and
    swap_info_struct->lock). This shouldn't be a problem as during swapon
    the frontswap_map_set() is called also outside of any locks.

    Signed-off-by: Krzysztof Kozlowski
    Reviewed-by: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Signed-off-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

17 Oct, 2013

1 commit

  • Fix race between swapoff and swapon. Swapoff used old_block_size from
    swap_info outside of swapon_mutex so it could be overwritten by
    concurrent swapon.

    The race has a visible effect only if more than one swap block device
    exists with different block sizes (e.g. /dev/sda1 with block size 4096
    and /dev/sdb1 with 512). In such a case it leads to setting the block
    size of the swapped-off device to the wrong value.

    The bug can be triggered with multiple concurrent swapoff and swapon:
    0. Swap for some device is on.
    1. swapoff:
    First the swapoff is called on this device and "struct swap_info_struct
    *p" is assigned. This is done under swap_lock however this lock is
    released for the call try_to_unuse().

    2. swapon:
    After the assignment above (and before swapoff acquires swapon_mutex &
    swap_lock) swapon is called on the same device.
    p->old_block_size is assigned the block size of the device. This block
    size should be the same as before, but sometimes it is not. The swapon
    ends successfully.

    3. swapoff:
    Swapoff resumes, grabs the locks and mutex and continues to disable
    this swap device. Now it sets the block size to the value taken from
    swap_info, which was overwritten by swapon in step 2.

    Signed-off-by: Krzysztof Kozlowski
    Reported-by: Weijie Yang
    Cc: Bob Liu
    Cc: Konrad Rzeszutek Wilk
    Cc: Shaohua Li
    Cc: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     

12 Sep, 2013

3 commits

  • Swap cluster allocation exists to get better request merging and
    improve performance. But the cluster is shared globally: if multiple
    tasks are swapping, this causes interleaved disk access. Multiple
    tasks swapping is quite common; for example, each NUMA node has a
    kswapd thread doing swap, plus multiple threads/processes doing direct
    page reclaim.

    The I/O scheduler can't help much here, because tasks don't send
    swapout IO down to the block layer at the same time. The block layer
    does merge some IOs, but a lot are not merged, depending on how many
    tasks are doing swapout concurrently. In practice, I've seen a lot of
    small-size IO in swapout workloads.

    We make the cluster allocation per-cpu here, and the interleaved disk
    access issue goes away. All tasks swap out to their own cluster, so
    swapout becomes sequential, which can easily be merged into big IOs.
    If one CPU can't get its per-cpu cluster (for example, there are no
    free clusters left in the swap area), it falls back to scanning
    swap_map. The CPU can still continue to swap; we don't need to recycle
    the free swap entries of other CPUs.

    In my test (swap to a 2-disk raid0 partition), this improves swapout
    throughput by around 10%, and the request size is increased
    significantly.

    How this impacts swap readahead is uncertain though. On one side, page
    reclaim always isolates and swaps several adjacent pages, which makes
    page reclaim write the pages sequentially and benefits readahead. On
    the other side, several CPUs writing pages in an interleaved manner
    means the pages don't live _sequentially_ but relatively _near_ each
    other. In the per-cpu allocation case, if adjacent pages are written
    by different CPUs, they will live relatively _far_ apart. So how this
    impacts swap readahead depends on how many pages page reclaim isolates
    and swaps at a time. If the number is big, this patch will benefit
    swap readahead. Of course, this is about the sequential access
    pattern. The patch has no impact on the random access pattern, because
    the new cluster allocation algorithm is only for SSDs.

    An alternative solution is organizing the swap layout per-mm instead
    of this per-cpu approach. In the per-mm layout, we allocate a disk
    range for each mm, so pages of one mm live adjacently on the swap
    disk. The per-mm layout has potential lock contention issues if
    multiple reclaimers are swapping pages from one mm. For a sequential
    workload, the per-mm layout is better for implementing swap readahead,
    because pages from the mm are adjacent on disk. But the per-cpu layout
    isn't very bad in this workload, as page reclaim always isolates and
    swaps several pages at a time, so such pages will still live
    sequentially on disk and readahead can utilize this. For a random
    workload, the per-mm layout doesn't benefit request merging, because
    it's quite possible that pages from different mms are swapped out at
    the same time and the IO can't be merged in the per-mm layout, while
    with the per-cpu layout we can merge requests from any mm. Considering
    that random workloads are more popular among workloads with swap (and
    the per-cpu approach isn't too bad for sequential workloads either),
    I'm choosing the per-cpu layout.
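
    A hedged sketch of the per-cpu bookkeeping described (field and helper
    names are assumptions, not the patch's exact code):

        struct percpu_cluster {
                unsigned int idx;    /* cluster this CPU is currently filling */
                unsigned int next;   /* likely next free offset inside it */
        };

        /* at swapon time, for SSDs */
        si->percpu_cluster = alloc_percpu(struct percpu_cluster);

        /* in the allocation path */
        struct percpu_cluster *cluster = this_cpu_ptr(si->percpu_cluster);

        if (cluster_has_room(si, cluster)) {            /* hypothetical helper */
                offset = cluster->idx * SWAPFILE_CLUSTER + cluster->next++;
        } else if (!grab_free_cluster(si, cluster)) {   /* hypothetical helper */
                /* no free cluster left: fall back to scanning swap_map */
        }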

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • The previous patch can expose races, according to Hugh:

    swapoff was sometimes failing with "Cannot allocate memory", coming from
    try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
    on a free entry temporarily SWAP_MAP_BAD while being discarded.

    We should use ACCESS_ONCE() there, and whenever accessing swap_map
    locklessly; but rather than peppering it throughout try_to_unuse(), just
    declare *swap_map with volatile.

    try_to_unuse() is accustomed to *swap_map going down racily, but not
    necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
    prevent that transition once SWP_WRITEOK is switched off, when it's a
    waste of time to issue discards anyway (swapon can do a whole discard).

    Another issue is:

    In swapin_readahead(), read_swap_cache_async() can read a bad swap
    entry, because we don't check whether the readahead swap entry is bad.
    This doesn't break anything, but such a swapped-in page is wasteful
    and can only be freed at page reclaim time. We should avoid reading
    such a swap entry. And in discard, we mark the swap entry SWAP_MAP_BAD
    and then switch it back to normal when the discard is finished. If
    readahead reads such a swap entry, we have the same issue, so we must
    check whether the swap entry is bad there too.

    Thanks to Hugh for pointing out that swapin_readahead could use a bad
    swap entry.

    [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
    Signed-off-by: Shaohua Li
    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • swap can do cluster discard for SSD, which is good, but there are some
    problems here:

    1. swap does the discard just before page reclaim gets a swap entry
    and writes the disk sectors. This is useless for high-end SSDs,
    because an overwrite of a sector implies a discard of the original
    sector too. A discard + overwrite == overwrite.

    2. the purpose of doing discard is to improve SSD firmware garbage
    collection. Ideally we should send discard as early as possible, so
    the firmware can do something smart. Sending discard just after the
    swap entry is freed is considered early compared to sending discard
    before the write. Of course, if the workload is already bound by gc
    speed, sending discard earlier or later doesn't make much difference.

    3. block discard is a sync API, which will delay scan_swap_map()
    significantly.

    4. Write and discard commands can be executed in parallel on a PCIe
    SSD. Making swap discard async makes execution more efficient.

    This patch makes swap discard async and moves the discard to where the
    swap entry is freed. Discard and write have no dependency now, so the
    above issues can be avoided. Ideally we should discard any freed
    sectors, but some SSDs are very slow at discard. This patch still
    discards a whole cluster at a time.

    My test does several rounds of 'mmap, write, unmap', which trigger a
    lot of swap discard. On a fusionio card, with this patch the test
    runtime is reduced to 18% of the time without it, so around 5.5x
    faster.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li