16 Apr, 2015

40 commits

  • In check_hung_uninterruptible_tasks(), avoid the use of the
    deprecated while_each_thread().

    The "max_count" logic will prevent a livelock - see commit 0c740d0a
    ("introduce for_each_thread() to replace the buggy while_each_thread()").
    Having said this, let's use for_each_process_thread().
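
    A hedged sketch of the resulting loop shape (simplified; the actual
    hung_task.c batching and max_count bookkeeping are elided):

    struct task_struct *g, *t;

    rcu_read_lock();
    for_each_process_thread(g, t) {
            if (t->state == TASK_UNINTERRUPTIBLE)
                    check_hung_task(t, timeout);    /* existing helper */
    }
    rcu_read_unlock();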

    Signed-off-by: Aaron Tomlin
    Acked-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Dave Wysochanski
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • All users of __check_region(), check_region(), and check_mem_region() are
    gone. We got rid of the last user in v4.0-rc1. Remove them.

    bloat-o-meter on x86_64 shows:

    add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
    function                  old   new   delta
    __kstrtab___check_region   15     -     -15
    __ksymtab___check_region   16     -     -16
    __check_region             71     -     -71
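
    For reference, the non-racy interface that superseded the removed
    check-then-claim helpers is request_region(), which tests and
    reserves in one step; a hedged sketch (io_base/io_len illustrative):

    if (!request_region(io_base, io_len, "mydev"))
            return -EBUSY;          /* region already claimed */
    /* ... use the I/O region ... */
    release_region(io_base, io_len);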

    Signed-off-by: Jakub Sitnicki
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Sitnicki
     
  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is
    available under the CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, the capable() function was moved in order to
    avoid adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25 KB goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionality works.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
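
    As a quick check, a hedged userspace sketch (not part of the patch)
    of the behaviour noted above -- on a !CONFIG_MULTIUSER kernel the
    compiled-out syscalls fail with ENOSYS:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            if (setuid(1000) == -1 && errno == ENOSYS)
                    printf("setuid: ENOSYS (CONFIG_MULTIUSER disabled)\n");
            return 0;
    }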

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     
  • A `const char *...[]` array is not itself const; it is a (writable)
    array of pointers to const char. Such arrays therefore cannot be
    __initconst, but must be __initdata.

    This fixes section conflicts with LTO.
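
    A minimal sketch of the distinction (array names illustrative):

    /* The elements point to const strings, but the array itself is
     * writable, so it must live in a writable init section: */
    static const char *names[] __initdata = { "foo", "bar" };

    /* Only with a second const is the array object itself read-only: */
    static const char * const names2[] __initconst = { "foo", "bar" };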

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The verbose module parameter can be set to 2 for extremely verbose
    messages, so the type should be int instead of bool.
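
    A hedged sketch of the shape of the fix (surrounding driver code
    assumed):

    static int verbose;               /* was: static bool verbose; */
    module_param(verbose, int, 0644); /* 0=quiet, 1=verbose, 2=very verbose */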

    Signed-off-by: Dan Carpenter
    Cc: Tim Waugh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Commit 607ca46e97a1 ("UAPI: (Scripted) Disintegrate include/linux") left
    behind some empty conditional blocks. Since they are useless and may
    cause a reader to wonder whether something is missing, remove them.

    Signed-off-by: Rasmus Villemoes
    Cc: Geert Uytterhoeven
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • If an issue occurs inside a container, a user on the host cannot tell
    which process is in trouble from the guest pid alone: users inside the
    container only know the pids of their own pid namespace. This is an
    obstacle for troubleshooting.

    This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

    a) In init_pid_ns, nothing changed;

    b) Inside a pid namespace, they show the pid at each namespace level:
    NStgid: 21776 5 1
    NSpid: 21776 5 1
    NSpgid: 21776 5 1
    NSsid: 21729 1 0
    ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

    c) If the pidns is nested, the values depend on which pidns you are in.
    NStgid: 5 1
    NSpid: 5 1
    NSpgid: 5 1
    NSsid: 1 0
    ** Views from level 1
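
    A hedged userspace sketch (not part of the patch) that dumps the new
    fields for the current task:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/status", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f))
                    if (!strncmp(line, "NS", 2)) /* NStgid/NSpid/NSpgid/NSsid */
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }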

    [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
    Signed-off-by: Chen Hanxiao
    Acked-by: Serge Hallyn
    Acked-by: "Eric W. Biederman"
    Tested-by: Serge Hallyn
    Tested-by: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Hanxiao
     
  • Return a negative error code on failure.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @@
    identifier ret; expression e1,e2;
    @@

    (
    if (\(ret < 0\|ret != 0\))
     { ... return ret; }
    |
    ret = 0
    )
    ... when != ret = e1
        when != &ret
    *if(...)
    {
      ... when != ret = e2
          when forall
     return ret;
    }
    //
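
    As a hedged illustration (hypothetical function, not from the patched
    file), the bug class the rule flags looks like this:

    struct foo { void *buf; };      /* hypothetical */

    static int setup_buffer(struct foo *f)
    {
            int ret = 0;

            f->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
            if (!f->buf)
                    return ret;     /* bug: returns 0 (success) on
                                     * failure; should be -ENOMEM */
            return 0;
    }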

    Signed-off-by: Julia Lawall
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Do not perform cond_resched() before the busy compaction loop in
    __zs_compact(), because this loop does it when needed.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no point in overriding the size class below. It causes fatal
    corruption of the next chunk in the 3264-byte size class, which is the
    last size class that is not huge.

    For example, if the requested size was exactly 3264 bytes, current
    zsmalloc allocates and returns a chunk from the 3264-byte size class,
    not 4096. User access to this chunk may overwrite the head of the next
    adjacent chunk.

    Here is the panic log captured when the freelist was corrupted by this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [] lr : [] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

    Reported-by: Sooyong Suk
    Signed-off-by: Heesub Shin
    Tested-by: Sooyong Suk
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • In putback_zspage(), we don't need to insert a zspage into the
    size_class's zspage list again just to fix its fullness group. We can
    set the fullness group directly, without reinsertion, saving some
    instructions.

    Reported-by: Heesub Shin
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Ganesh Mahendran
    Cc: Luigi Semenzato
    Cc: Gunho Lee
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A micro-optimization. Avoid additional branching and reduce (a bit)
    register pressure (e.g. s_off += size; d_off += size; may be calculated
    twice: first for the >= PAGE_SIZE check and later for the offset update
    in the "else" clause).

    scripts/bloat-o-meter shows some improvement

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
    function         old   new   delta
    zs_object_copy   550   540     -10
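
    A minimal sketch of the idea (simplified, not the actual zsmalloc
    code):

    /* before: the sum may be evaluated twice */
    if (off + size >= PAGE_SIZE) {
            /* crossed a page boundary */
    } else {
            off += size;
    }

    /* after: one addition, then branch on the result */
    off += size;
    if (off >= PAGE_SIZE) {
            /* crossed a page boundary */
    }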

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not synchronize rcu in zs_compact(). Neither zsmalloc nor zram
    uses rcu.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and
    deprecated attributes there. The patch also adds additional information
    to zram documentation and describes the basic strategy:

    - the existing RW nodes will be downgraded to WO nodes (in 4.11)
    - deprecated RO sysfs nodes will eventually be removed (in 4.11)

    Users will be additionally notified about deprecated attr usage by
    pr_warn_once() (added to every deprecated attr _show()), as suggested by
    Minchan Kim.

    User space is advised to use zram/stat, zram/io_stat and
    zram/mm_stat files.

    Signed-off-by: Sergey Senozhatsky
    Reported-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Per-device `zram/mm_stat' file provides mm statistics of a particular
    zram device in a format similar to block layer statistics. The file
    consists of a single line and represents the following stats (separated by
    whitespace):

    orig_data_size
    compr_data_size
    mem_used_total
    mem_limit
    mem_used_max
    zero_pages
    num_migrated

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Per-device `zram/io_stat' file provides accumulated I/O statistics of
    a particular zram device in a format similar to block layer statistics.
    The file consists of a single line and represents the following stats
    (separated by whitespace):

    failed_reads
    failed_writes
    invalid_io
    notify_free

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Briefly describe exported device stat attrs in zram documentation. We
    will eventually get rid of per-stat sysfs nodes and, thus, clean up
    Documentation/ABI/testing/sysfs-block-zram file, which is the only source
    of information about device sysfs nodes.

    Add a `num_migrated' description: since there is no independent
    `num_migrated' sysfs node (and no corresponding sysfs-block-zram
    entry), it will be exported via the zram/mm_stat file.

    At this point we can provide minimal description, because sysfs-block-zram
    still contains detailed information.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Use the generic_start_io_acct() and generic_end_io_acct() bio helpers
    to account the device's block layer statistics. This lets users monitor
    zram activity using sysstat and similar packages/tools.

    Apart from the usual per-stat sysfs attrs, zram IO stats are now also
    available in the '/sys/block/zram<id>/stat' and '/proc/diskstats' files.

    We will slowly get rid of per-stat sysfs files.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • A cosmetic change. We have a new code layout that keeps zram's
    per-device sysfs store and show functions in one place. Move
    compact_store() to that handler block to conform to the current layout.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patch introduces a rework of zram stats. We have per-stat sysfs
    nodes, and that makes things a bit hard to use from user space: it
    doesn't give an immediate stats 'snapshot', and it requires user space
    to make more syscalls - open, read, close for every stat file, with
    appropriate error checks at every step, etc.

    First, zram now accounts block layer statistics, available in the
    /sys/block/zram<id>/stat and /proc/diskstats files. So some new stats
    are available (see Documentation/block/stat.txt); besides, zram's
    activity can now be monitored by sysstat's iostat or similar tools.

    Example:
    cat /sys/block/zram0/stat
    248 0 1984 0 251029 0 2008232 5120 0 5116 5116

    Second, group the nodes currently exported on a per-stat basis into
    two categories (files):

    -- zram/io_stat
    accumulates the device's IO stats that are not accounted by the block
    layer, and contains:
    failed_reads
    failed_writes
    invalid_io
    notify_free

    Example:
    cat /sys/block/zram0/io_stat
    0 0 0 652572

    -- zram/mm_stat
    accumulates zram mm stats and contains:
    orig_data_size
    compr_data_size
    mem_used_total
    mem_limit
    mem_used_max
    zero_pages
    num_migrated

    Example:
    cat /sys/block/zram0/mm_stat
    434634752 270288572 279158784 0 579895296 15060 0

    Per-stat sysfs nodes are now considered deprecated and we plan to
    remove them (and clean up some of the existing stat code) in two years
    (as of now, no warning is printed to syslog about deprecated stats
    being used). User space is advised to use the above mentioned 3 files.
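
    A hedged userspace sketch (not part of the patch) that reads the
    seven whitespace-separated mm_stat fields in the order listed above:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long v[7];
            FILE *f = fopen("/sys/block/zram0/mm_stat", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) == 7)
                    printf("orig=%llu compr=%llu used=%llu migrated=%llu\n",
                           v[0], v[1], v[2], v[6]);
            fclose(f);
            return 0;
    }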

    This patch (of 7):

    Remove the sysfs `num_migrated' attribute. We are moving away from
    per-stat device attrs towards 3 stat files that will accumulate io and
    mm stats in a format similar to the block layer statistics in
    /sys/block/<dev>/stat. That will be easier to use in user space, and
    will reduce the number of syscalls needed to read zram device
    statistics.

    `num_migrated' will come back in the zram/mm_stat file.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Yinghao Xie
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghao Xie
     
  • Create zsmalloc doc which explains design concept and stat information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating compaction, per-class fullness information is
    helpful for seeing how well compaction works. With it, we can see more
    clearly how compaction behaves on each size class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We store the handle in the header of each allocated object, so it
    increases the size of each object by sizeof(unsigned long).

    If zram stores 4096 bytes to zsmalloc (ie, bad compression), zsmalloc
    needs the 4104B class to fit the handle.

    However, the 4104B class has 1-pages_per_zspage, so the size wasted to
    internal fragmentation is 8192 - 4104, which is terrible.

    So this patch records the handle in page->private for such huge
    objects (ie, pages_per_zspage == 1 && maxobj_per_zspage == 1) instead
    of in the header of each object, so we can use the 4096B class instead
    of the 4104B class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now that zsmalloc supports compaction, zram can use it. For the first
    step, this patch exports compact knob via sysfs so user can do compaction
    via "echo 1 > /sys/block/zram0/compact".

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage
    has under 1/4 of its objects used (ie, fullness_threshold_frac). This
    can result in loose packing, since zsmalloc migrates only
    ZS_ALMOST_EMPTY zspages out.

    This patch changes the rule so that zsmalloc marks a zspage with more
    than 3/4 of its objects used as ZS_ALMOST_FULL, enabling tighter
    packing.
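
    A hedged sketch of the new rule (simplified; not the exact zsmalloc
    code):

    static enum fullness_group get_fullness_group(int inuse, int max_objs)
    {
            if (inuse == 0)
                    return ZS_EMPTY;
            if (inuse == max_objs)
                    return ZS_FULL;
            if (inuse > max_objs * 3 / 4)
                    return ZS_ALMOST_FULL;  /* packed tightly; migration dst */
            return ZS_ALMOST_EMPTY;         /* candidate to be migrated out */
    }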

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch provides the core functions for zsmalloc migration. The
    migration policy is simple:

    for each size class {
            while {
                    src_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!src_page)
                            break;
                    dst_page = get zs_page from ZS_ALMOST_FULL
                    if (!dst_page)
                            dst_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!dst_page)
                            break;
                    migrate(from src_page, to dst_page);
            }
    }

    For migration, we need to identify which objects in a zspage are
    allocated, to migrate them out. We could learn this by iterating over
    the free objects in a zspage (first_page of a zspage keeps the free
    objects as a singly-linked list), but that is not efficient. Instead,
    this patch adds a tag (ie, OBJ_ALLOCATED_TAG) in the header of each
    object (ie, the handle) so we can easily check whether an object is
    allocated.

    This patch also adds another status bit in the handle to synchronize
    between user access through zs_map_object() and migration. During
    migration, we cannot move objects that users are accessing, to keep
    data coherent between the old and the new object.
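
    A hedged sketch of the tag check (the exact bit layout is
    simplified):

    #define OBJ_ALLOCATED_TAG 1UL   /* low bit of the stored handle word */

    static int is_obj_allocated(unsigned long head)
    {
            return head & OBJ_ALLOCATED_TAG;
    }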

    [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A later patch's migration support needs parts of zs_malloc() and
    zs_free(), so this patch factors them out.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, we started to use zram heavily and some issues popped up.

    1) external fragmentation

    I got a report from Juneho Choi that fork failed although there were
    plenty of free pages in the system. His investigation revealed zram is
    one of the culprits behind heavy fragmentation, so there were no
    contiguous 16K pages left for the pgd needed by fork on ARM.

    2) non-movable pages

    The other problem with zram is that, inherently, users want to use
    zram as swap on small-memory systems, so they combine zram with CMA to
    use memory efficiently. Unfortunately, it doesn't work well because
    zram cannot use CMA's movable pages unless it supports compaction. I
    got several reports of OOMs happening with zram although there was
    lots of swap space and free space in the CMA area.

    3) internal fragmentation

    zram has started supporting a memory limit feature to bound memory
    usage, and I sent a patchset (https://lkml.org/lkml/2014/9/21/148) for
    the VM to be harmonized with zram-swap, stopping anonymous page
    reclaim once zram has consumed memory up to the limit even though
    there is free space on the swap. One problem with that direction is
    that zram has no way to know about holes, caused by internal
    fragmentation, in the memory space zsmalloc allocated, so zram would
    regard the swap as full although there is free space in zsmalloc. To
    solve this, zram wants to trigger compaction of zsmalloc before it
    decides whether it is full.

    This patchset is the first step toward addressing the above issues.
    For that, it adds an indirection layer between handle and object
    location, and supports manual compaction, solving the third problem
    first of all.

    After this patchset is merged, the next step is to make the VM aware
    of zsmalloc compaction so that generic compaction will move
    zsmalloc-ed pages automatically at runtime.

    In my synthetic experiment (ie, high-compression-ratio data with heavy
    swap in/out on an 8G zram-swap), the data is as follows:

    Before =
    zram allocated object : 60212066 bytes
    zram total used: 140103680 bytes
    ratio: 42.98 percent
    MemFree: 840192 kB

    Compaction

    After =
    frag ratio after compaction
    zram allocated object : 60212066 bytes
    zram total used: 76185600 bytes
    ratio: 79.03 percent
    MemFree: 901932 kB

    Juneho reported the numbers below from his real platform with light
    aging, so I think the benefit would be bigger on a system aged over a
    long time.

    - frag_ratio increased 3% (ie, higher is better)
    - memfree increased about 6MB
    - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

    frag ratio after swap fragment
    used : 156677 kbytes
    total: 166092 kbytes
    frag_ratio : 94
    meminfo before compaction
    MemFree: 83724 kB
    Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
    Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0

    num_migrated : 23630
    compaction done

    frag ratio after compaction
    used : 156673 kbytes
    total: 160564 kbytes
    frag_ratio : 97
    meminfo after compaction
    MemFree: 89060 kB
    Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
    Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0

    This patchset adds more logic (about 480 lines) to zsmalloc, but when
    I tested a heavy swap-in/out program, the regression in swap-in/out
    speed was marginal because most of the overhead comes from
    compress/decompress and other MM reclaim work.

    This patch (of 7):

    Currently, a zsmalloc handle encodes the object's location directly,
    which makes supporting migration hard.

    This patch decouples handle and object by adding an indirection layer.
    For that, it allocates the handle dynamically and returns it to the
    user. The handle is an address obtained from slab allocation, so it is
    unique, and we can keep the object's location in the memory allocated
    for the handle.

    With this, we can change an object's position without changing the
    handle itself.
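
    A hedged sketch of the indirection (simplified; not the exact
    zsmalloc code):

    static unsigned long alloc_handle(struct kmem_cache *handle_cachep)
    {
            /* the handle is just the address of a slab-allocated word */
            return (unsigned long)kmem_cache_alloc(handle_cachep, GFP_KERNEL);
    }

    static void record_obj(unsigned long handle, unsigned long obj)
    {
            /* store the encoded object location behind the handle;
             * migration can later rewrite this word in place */
            *(unsigned long *)handle = obj;
    }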

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The original dax patchset split the ext2/4_file_operations because of the
    two NULL splice_read/splice_write in the dax case.

    In the vfs if splice_read/splice_write are NULL we then call
    default_splice_read/write.

    What we do here is make generic_file_splice_read aware of IS_DAX() so the
    original ext2/4_file_operations can be used as is.

    For writes, it appears that iter_file_splice_write is just fine. It
    uses the regular f_op->write(file, ...) or new_sync_write(file, ...).
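
    A hedged sketch of the shape of the check (simplified, not the full
    patch):

    ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                                     struct pipe_inode_info *pipe,
                                     size_t len, unsigned int flags)
    {
            if (IS_DAX(file_inode(in)))
                    return default_file_splice_read(in, ppos, pipe,
                                                    len, flags);
            /* ... existing page-cache splice path ... */
    }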

    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • From: Yigal Korman

    [v1]
    Without this patch, c/mtime is not updated correctly when an mmap'ed
    page is first read from and then written to.

    A new xfstest has been submitted for testing this (generic/080).

    [v2]
    Jan Kara has pointed out that if we add the sb_start/end_pagefault
    pair in the new pfn_mkwrite, we also fix another bug: a user could
    start writing to the page while the filesystem is frozen.

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page
    structs) to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page in a read-only PFN, then mmap-write to the same page.

    We need this functionality to fix a DAX bug: in the scenario above we
    fail to set ctime/mtime even though we modified the file. An xfstest
    is attached to this patchset that shows the failure and the fix. (A
    DAX patch will follow.)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious
    audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
    users to make sure they do not now crash; for example, the
    current DAX code (which this is for) would crash. If we wanted
    to reuse page_mkwrite, we would need to first patch all users so
    they do not crash on no-page, and only then enable this patch.
    But even if I did that, I would not sleep so well at night.
    Adding a new vector is the safest thing to do, and is not that
    expensive: an extra pointer in a static function vector per
    driver. The new vector is also better for performance, because
    otherwise we would call all current kernel vectors only to
    check-has-no-page, do nothing, and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
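
    A hedged sketch of how a VM_PFNMAP/VM_MIXEDMAP driver would opt in
    (handler body illustrative; drivers that don't set the vector are
    unaffected):

    static int my_pfn_mkwrite(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
    {
            /* mark backing store dirty, update c/mtime, etc. */
            return 0;
    }

    static const struct vm_operations_struct my_vm_ops = {
            /* .fault, etc. as before */
            .pfn_mkwrite = my_pfn_mkwrite,
    };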

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
    (which is almost always implemented and filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mempools keep allocated objects in reserve for situations when
    ordinary allocations cannot be satisfied. These objects shouldn't be
    accessed before they leave the pool.

    This patch poisons elements when they enter the pool and unpoisons
    them when they leave it. This lets KASan detect use-after-free of
    mempool elements.
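
    A hedged sketch of the idea (poison/unpoison helper names
    hypothetical; add_element()/remove_element() are the existing mempool
    entry points):

    static void add_element(mempool_t *pool, void *element)
    {
            poison_element(pool, element);   /* unreadable while pooled */
            pool->elements[pool->curr_nr++] = element;
    }

    static void *remove_element(mempool_t *pool)
    {
            void *element = pool->elements[--pool->curr_nr];

            unpoison_element(pool, element); /* accessible again */
            return element;
    }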

    Signed-off-by: Andrey Ryabinin
    Tested-by: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
    to the immediately preceding function.

    Cc: Dmitry Safonov
    Cc: Michal Nazarewicz
    Cc: Stefan Strogin
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Here are two functions that provide an interface to compute/get the
    used size and the size of the biggest free chunk in a cma region. Add
    that information to debugfs.

    [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
    [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Stefan Strogin
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • Few trivial cleanups:

    - no need to call set_recommended_min_free_kbytes() from
    late_initcall() -- start_khugepaged() calls it;

    - no need to call set_recommended_min_free_kbytes() from
    start_khugepaged() if khugepaged is not started;

    - there isn't much point in running start_khugepaged() if we've just
    set transparent_hugepage_flags to zero;

    - start_khugepaged() is misnamed -- it also used to stop the thread;

    Signed-off-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The most-used page->mapping helper -- page_mapping() -- has already
    been uninlined. Let's uninline page_rmapping() and page_anon_vma() as
    well. Depending on configuration, this saves around 400 bytes of text:

      text    data    bss      dec     hex filename
    660318   99254 410000  1169572  11d8a4 mm/built-in.o-before
    659854   99254 410000  1169108  11d6d4 mm/built-in.o

    I also tried to make the code a bit cleaner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add trace events for cma_alloc() and cma_release().

    The cma_alloc tracepoint is used for both successful and failed
    allocations; in case of allocation failure, pfn=-1UL is stored and
    printed.

    Signed-off-by: Stefan Strogin
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Laurent Pinchart
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Strogin