06 Feb, 2016

12 commits

  • We need to iterate over split_queue, not the local empty list, to get
    anything split from the shrinker.
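
    A hedged sketch of the corrected loop in deferred_split_scan(): the scan
    has to walk the shared per-node split_queue, since the local list is
    still empty at that point.

    list_for_each_entry_safe(page, next, &pgdata->split_queue, lru) {
            /* pin each THP and move it onto the local list, to be split
               once the irq-disabled spinlock has been dropped */
    }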

    Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()")
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
    anon_vma appears between the lock and the unlock. We have to check
    anon_vma first, or call anon_vma_prepare(), to be sure that it's
    present. There are only a few users of these legacy helpers. Let's get
    rid of them.

    This patch fixes the anon_vma lock imbalance in validate_mm(). A write
    lock isn't required here; a read lock is enough.

    It also reorders expand_downwards()/expand_upwards(): security_mmap_addr()
    and the wrap-around check don't have to be under the anon_vma lock.
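
    A rough sketch of the resulting pattern in validate_mm() (simplified;
    anon_vma_lock_read()/anon_vma_unlock_read() are the existing rmap
    helpers):

    struct anon_vma *anon_vma = vma->anon_vma;

    if (anon_vma) {
            anon_vma_lock_read(anon_vma);   /* read lock is enough here */
            /* walk/validate the anon_vma chain for this VMA */
            anon_vma_unlock_read(anon_vma);
    }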

    Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when CONFIG_CMA
    is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
    associated infrastructure, it is possible with a few simple adjustments to
    require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Attempting to preallocate 1G gigantic huge pages at boot time with
    "hugepagesz=1G hugepages=1" on the kernel command line will prevent
    booting with the following:

    kernel BUG at mm/hugetlb.c:1218!

    When mapcount accounting was reworked, the setting of
    compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
    a result, the validation of mapcount in free_huge_page fails.

    The "BUG_ON" checks in free_huge_page were also changed to
    "VM_BUG_ON_PAGE" to assist with debugging.

    Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Tested-by: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Jerome Marchand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Calling isolate_lru_page() is wrong and shouldn't happen, but it's not
    necessarily fatal: the page just will not be isolated if it's not on the
    LRU.

    Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT().
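
    An illustrative sketch of the downgrade (the exact condition and message
    in isolate_lru_page() differ; this only shows the shape of the change):

    /* before: a fatal assertion */
    VM_BUG_ON_PAGE(!PageLRU(page), page);

    /* after: complain, rate-limited, and let the caller see the failure */
    WARN_RATELIMIT(!PageLRU(page), "isolating a page that is not on the LRU\n");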

    Signed-off-by: Kirill A. Shutemov
    Cc: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Maybe I'm missing something, but I don't see a reason why we try to
    queue pages from non-migratable VMAs.

    This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <numaif.h>

    #define SIZE 0x2000

    int foo;

    int main()
    {
            int fd;
            char *p;
            unsigned long mask = 2;

            fd = open("/dev/sg0", O_RDWR);
            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            /* Fault in pages */
            foo = p[0] + p[0x1000];
            mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
            return 0;
    }

    The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
    plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has pages on the
    LRU, but whose gfp mask is not suitable for migration (see the
    mapping_gfp_mask() check in vma_migratable()). That looks like a bug to
    me.

    Let's filter out non-migratable VMAs at the start of
    queue_pages_test_walk() and go to queue_pages_pte_range() only if the
    MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set.
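
    A hedged sketch of that filter (heavily simplified relative to the real
    queue_pages_test_walk() in mm/mempolicy.c):

    static int queue_pages_test_walk(unsigned long start, unsigned long end,
                                     struct mm_walk *walk)
    {
            struct vm_area_struct *vma = walk->vma;
            struct queue_pages *qp = walk->private;

            /* never try to queue pages from a VMA that cannot be migrated */
            if (!vma_migratable(vma))
                    return 1;               /* skip this VMA entirely */

            /* descend into the pte walk only if pages may actually be moved */
            if (qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
                    return 0;
            return 1;
    }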

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Jan Stancek has reported that the system occasionally hangs after the
    "oom01" testcase from LTP triggers OOM. Judging from the result (there
    is a kworker thread doing memory allocation, and the values between
    "Node 0 Normal free:" and "Node 0 Normal:" differ while hanging), vmstat
    is not up-to-date for some reason.

    Commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover
    memory reclaim doesn't make any progress") meant to force the kworker
    thread to take a short sleep, but it mistakenly used
    schedule_timeout(1). We missed that schedule_timeout() in the
    TASK_RUNNING state doesn't do anything.

    Fix it by using schedule_timeout_uninterruptible(1) which forces the
    kworker thread to take a short sleep in order to make sure that vmstat
    is up-to-date.
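
    The difference between the two helpers, as a short sketch:

    schedule_timeout(1);                    /* task state is TASK_RUNNING, so
                                               this returns immediately
                                               without sleeping */

    schedule_timeout_uninterruptible(1);    /* sets TASK_UNINTERRUPTIBLE and
                                               then really sleeps for one
                                               jiffy */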

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Signed-off-by: Tetsuo Handa
    Reported-by: Jan Stancek
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and
    shut down on idle") made vmstat_shepherd deferrable. vmstat_update
    itself is still useing standard timer which might interrupt idle task.
    This is possible because "mm, vmstat: make quiet_vmstat lighter" removed
    cancel_delayed_work from the quiet_vmstat.

    Change vmstat_work to use DEFERRABLE_WORK to prevent pointless wakeups
    from the idle context.
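
    A hedged sketch of the change in mm/vmstat.c (the exact initialisation
    site may differ):

    /* deferrable: an idle CPU is not woken up just to run vmstat_update() */
    INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), vmstat_update);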

    Acked-by: Christoph Lameter
    Signed-off-by: Michal Hocko
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Mike has reported a considerable overhead of refresh_cpu_vm_stats from
    the idle entry during pipe test:

    12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
    4.75% [kernel] [k] __schedule
    4.70% [kernel] [k] mutex_unlock
    3.14% [kernel] [k] __switch_to

    This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
    deferrable again and shut down on idle"), which placed quiet_vmstat
    into cpu_idle_loop. The main reason seems to be that the idle entry has
    to iterate over all zones and perform atomic operations for each vmstat
    entry even though there might be no per-cpu diffs. This is a pointless
    overhead for _each_ idle entry.

    Make sure that quiet_vmstat is as light as possible.

    First of all, it doesn't make any sense to do any local sync if the
    current cpu is already set in cpu_stat_off, because vmstat_update puts
    itself there only if there is nothing to do.

    Then we can check need_update which should be a cheap way to check for
    potential per-cpu diffs and only then do refresh_cpu_vm_stats.

    The original patch also did cancel_delayed_work which we are not doing
    here. There are two reasons for that. Firstly cancel_delayed_work from
    idle context will blow up on RT kernels (reported by Mike):

    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
    Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
    Call Trace:
    dump_stack+0x49/0x67
    ___might_sleep+0xf5/0x180
    rt_spin_lock+0x20/0x50
    try_to_grab_pending+0x69/0x240
    cancel_delayed_work+0x26/0xe0
    quiet_vmstat+0x75/0xa0
    cpu_idle_loop+0x38/0x3e0
    cpu_startup_entry+0x13/0x20
    start_secondary+0x114/0x140

    And secondly, even on !RT kernels it might add some non-trivial overhead
    which is not necessary. Even if the vmstat worker wakes up and preempts
    idle, it will most likely be a single-shot noop because the stats were
    already synced, and so it would end up in cpu_stat_off anyway.
    We just need to teach both vmstat_shepherd and vmstat_update to stop
    scheduling the worker if there is nothing to do.
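
    Putting the pieces together, a hedged sketch of the lighter
    quiet_vmstat() (simplified; the helpers are the ones named above):

    void quiet_vmstat(void)
    {
            if (system_state != SYSTEM_RUNNING)
                    return;

            /* already parked in cpu_stat_off: the shepherd owns this cpu,
               so there is nothing to sync locally */
            if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
                    return;

            /* cheap check for per-cpu diffs before the expensive sync */
            if (!need_update(smp_processor_id()))
                    return;

            refresh_cpu_vm_stats(false);
    }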

    [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
    Signed-off-by: Michal Hocko
    Reported-by: Mike Galbraith
    Acked-by: Christoph Lameter
    Signed-off-by: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The description mentions kswapd threads, while the deferred struct page
    initialization is actually done by one-off "pgdatinitX" threads.

    Fix the description so that users are not confused about pgdatinit
    threads, rather than kswapd, using CPU after boot.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • At the moment memblock_phys_mem_size() is marked as __init, and so is
    discarded after boot. This is different from most of the memblock
    functions which are marked __init_memblock, and are only discarded after
    boot if memory hotplug is not configured.

    To allow for upcoming code which will need memblock_phys_mem_size() in
    the hotplug path, change it from __init to __init_memblock.
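
    The change itself is just the section annotation on the helper, roughly:

    phys_addr_t __init_memblock memblock_phys_mem_size(void)
    {
            return memblock.memory.total_size;      /* was marked __init */
    }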

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Holding the mmap_sem for reading in validate_mm(), called from
    expand_stack(), is not enough to prevent the augmented rbtree's
    rb_subtree_gap information from changing under us, because
    expand_stack() may be running concurrently in other threads which also
    hold the mmap_sem for reading.

    The augmented rbtree is updated with vma_gap_update() under the
    page_table_lock, so take that lock in browse_rb() too to avoid false
    positives.
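
    A hedged sketch of the corresponding check in browse_rb() (mm/mmap.c,
    simplified, with mm passed down so the lock is reachable):

    spin_lock(&mm->page_table_lock);
    if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
            pr_emerg("free gap %lx, correct %lx\n",
                     vma->rb_subtree_gap, vma_compute_subtree_gap(vma));
            bug = 1;
    }
    spin_unlock(&mm->page_table_lock);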

    Signed-off-by: Andrea Arcangeli
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Feb, 2016

10 commits

  • Merge fixes from Andrew Morton:
    "18 fixes"

    [ The 18 fixes turned into 17 commits, because one of the fixes was a
    fix for another patch in the series that I just folded in by editing
    the patch manually - hopefully correctly - Linus ]

    * emailed patches from Andrew Morton :
    mm: fix memory leak in copy_huge_pmd()
    drivers/hwspinlock: fix race between radix tree insertion and lookup
    radix-tree: fix race in gang lookup
    mm/vmpressure.c: fix subtree pressure detection
    mm: polish virtual memory accounting
    mm: warn about VmData over RLIMIT_DATA
    Documentation: cgroup-v2: add memory.stat::sock description
    mm: memcontrol: drop superfluous entry in the per-memcg stats array
    drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
    proc: revert /proc//maps [stack:TID] annotation
    numa: fix /proc//numa_maps for hugetlbfs on s390
    MAINTAINERS: update Seth email
    ocfs2/cluster: fix memory leak in o2hb_region_release
    lib/test-string_helpers.c: fix and improve string_get_size() tests
    thp: limit number of object to scan on deferred_split_scan()
    thp: change deferred_split_count() to return number of THP in queue
    thp: make split_queue per-node

    Linus Torvalds
     
  • Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
    cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
    areas"). The warning has served its purpose, nobody was harmed by that
    change, so just remove the warning to generate less noise from Trinity.

    Which reminds me of the comment I wrongly left behind with that commit
    (but was spotted at the time by Kirill), which has since moved into a
    separate function, and become even more obscure: delete it.

    Reported-by: Dave Jones
    Suggested-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We allocate a pgtable but do not attach it to anything if the PMD is in
    a DAX VMA, causing it to leak.

    We certainly try to not free pgtables associated with the huge zero page
    if the zero page is in a DAX VMA, so I think this is the right solution.
    This needs to be properly audited.

    Signed-off-by: Matthew Wilcox
    Cc: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • When vmpressure is called for the entire subtree under pressure we
    mistakenly use vmpressure->scanned instead of vmpressure->tree_scanned
    when checking if vmpressure work is to be scheduled. This results in
    suppressing all vmpressure events in the legacy cgroup hierarchy. Fix it.

    Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • * add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture
    * always account VMAs with flag VM_STACK as stack (as it was before)
    * cleanup classifying helpers
    * update comments and documentation

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Sudip Mukherjee
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch provides a way of working around a slight regression
    introduced by commit 84638335900f ("mm: rework virtual memory
    accounting").

    Before that commit, RLIMIT_DATA had control only over the size of the
    brk region. But that change has caused problems with all existing
    versions of valgrind, because it sets RLIMIT_DATA to zero.

    This patch fixes the rlimit check (the limit is actually in bytes, not
    pages) and by default turns it into a warning which prints at the first
    VmData misuse:

    "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."

    Behavior is controlled by the boot parameter ignore_rlimit_data=y/n and
    by the sysfs knob /sys/module/kernel/parameters/ignore_rlimit_data. For
    now it is set to "y".
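
    A hedged sketch of the corrected check in may_expand_vm() (mm/mmap.c;
    RLIMIT_DATA is in bytes while the counters are in pages, hence the
    shift):

    if (is_data_mapping(flags) &&
        mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) {
            if (ignore_rlimit_data)
                    pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Will be forbidden soon.\n",
                                 current->comm, current->pid,
                                 (mm->data_vm + npages) << PAGE_SHIFT,
                                 rlimit(RLIMIT_DATA));
            else
                    return false;
    }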

    [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
    Signed-off-by: Konstantin Khlebnikov
    Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
    Reported-by: Christian Borntraeger
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Vegard Nossum
    Cc: Peter Zijlstra
    Cc: Vladimir Davydov
    Cc: Andy Lutomirski
    Cc: Quentin Casasnovas
    Cc: Kees Cook
    Cc: Willy Tarreau
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc//maps") added [stack:TID] annotation to /proc//maps.

    Finding the task of a stack VMA requires walking the entire thread list,
    turning this into quadratic behavior: a thousand threads means a
    thousand stacks, so the rendering of /proc//maps needs to look at a
    million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc//maps (and
    /proc//numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc//task//maps is retained, as
    identifying the stack VMA there is an O(1) operation.

    Siddhesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If we have a lot of pages queued to be split, deferred_split_scan() can
    spend an unreasonable amount of time under the spinlock with interrupts
    disabled.

    Let's cap the number of pages to split per scan at sc->nr_to_scan.
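
    Roughly, the scan loop becomes (hedged sketch; the surrounding
    deferred_split_scan() details are omitted):

    spin_lock_irqsave(&pgdata->split_queue_lock, flags);
    list_for_each_entry_safe(page, next, &pgdata->split_queue, lru) {
            /* pin the THP and move it to a local list for splitting */
            if (!--sc->nr_to_scan)          /* stop after nr_to_scan THPs */
                    break;
    }
    spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);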

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • I got the meaning of shrinker::count_objects() wrong: it should return
    the number of potentially freeable objects, which does not necessarily
    correlate with the amount of freeable memory.

    Returning 256 per THP in the queue is not reasonable:
    shrinker::scan_objects() is never called with nr_to_scan > 128 in my
    setup.

    Let's return 1 per THP and correct scan_objects() accordingly.
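
    count_objects() then simply reports the queue length (hedged sketch,
    assuming the per-node queue from this series):

    static unsigned long deferred_split_count(struct shrinker *shrink,
                                              struct shrink_control *sc)
    {
            struct pglist_data *pgdata = NODE_DATA(sc->nid);

            /* one freeable object per THP waiting on the queue */
            return READ_ONCE(pgdata->split_queue_len);
    }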

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andrea Arcangeli suggested making the split queue per-node to improve
    scalability. Let's do it.
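
    Concretely, the queue and its lock move from globals in mm/huge_memory.c
    into each node's pg_data_t (hedged sketch of the new fields):

    typedef struct pglist_data {
            /* ... existing fields ... */
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            spinlock_t split_queue_lock;
            struct list_head split_queue;
            unsigned long split_queue_len;
    #endif
    } pg_data_t;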

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Feb, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3/ A fix for pfn_t usage in vm_insert_mixed that led to a NULL
    pointer dereference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     

01 Feb, 2016

1 commit

  • pfn_t_to_page() honors the flags in the pfn_t value to determine if a
    pfn is backed by a page. However, vm_insert_mixed() was originally
    written to use pfn_valid() to make this determination. To restore the
    old/correct behavior, ignore the pfn_t flags in the !pfn_t_devmap() case
    and fall back to trusting pfn_valid().
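
    A hedged sketch of the restored logic in vm_insert_mixed() (mm/memory.c,
    simplified):

    if (!pfn_t_devmap(pfn) && pfn_valid(pfn_t_to_pfn(pfn))) {
            /* a normal, refcounted page: insert it as such */
            struct page *page = pfn_t_to_page(pfn);

            return insert_page(vma, addr, page, vma->vm_page_prot);
    }
    /* devmap or otherwise pageless pfn */
    return insert_pfn(vma, addr, pfn, vma->vm_page_prot);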

    Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    Cc: Dave Hansen
    Cc: David Airlie
    Reported-by: Tomi Valkeinen
    Tested-by: Tomi Valkeinen
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Jan, 2016

1 commit


27 Jan, 2016

1 commit


25 Jan, 2016

1 commit

  • If we detect that there is nothing to do just set the flag and do not
    check if it was already set before. Races really do not matter. If the
    flag is set by any code then the shepherd will start dealing with the
    situation and reenable the vmstat workers when necessary again.

    Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
    and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
    a particular cpu to be handled by vmstat_shepherd. This might trigger a
    VM_BUG_ON in vmstat_update because the work item might have been
    sleeping during the idle period and see the cpu_stat_off updated after
    the wake up. The VM_BUG_ON is therefore misleading and no longer
    appropriate. Moreover, it doesn't really provide any protection from
    real bugs, because vmstat_shepherd will simply reschedule the
    vmstat_work anytime it sees a particular cpu set, and vmstat_update
    would do the same from the worker context directly. Even when the two
    race, the result wouldn't be incorrect, as the counters update is fully
    idempotent.
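
    In vmstat_update() that boils down to (hedged sketch):

    /* nothing was updated: park this cpu; whether the bit was already set
       does not matter, the shepherd copes either way */
    cpumask_set_cpu(smp_processor_id(), cpu_stat_off);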

    Reported-by: Sasha Levin
    Signed-off-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

6 commits

  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
            vfree(ptr);
    else
            kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you found
    problems.
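
    The branch then collapses to a single kvfree(ptr) call; kvfree() itself
    (mm/util.c) is essentially:

    void kvfree(const void *addr)
    {
            if (is_vmalloc_addr(addr))
                    vfree(addr);
            else
                    kfree(addr);
    }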

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • To properly handle fsync/msync in an efficient way DAX needs to track
    dirty pages so it is able to flush them durably to media on demand.

    The tracking of dirty pages is done via the radix tree in struct
    address_space. This radix tree is already used by the page writeback
    infrastructure for tracking dirty pages associated with an open file,
    and it already has support for exceptional (non struct page*) entries.
    We build upon these features to add exceptional entries to the radix
    tree for DAX dirty PMD or PTE pages at fault time.

    [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add find_get_entries_tag() to the family of functions that include
    find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
    needed for DAX dirty page handling because we need a list of both page
    offsets and radix tree entries ('indices' and 'entries' in this
    function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
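
    A hedged usage sketch (the argument order follows the sibling
    find_get_entries()/find_get_pages_tag() helpers and may differ in
    detail):

    struct pagevec pvec;
    pgoff_t indices[PAGEVEC_SIZE];
    unsigned int nr;

    /* up to PAGEVEC_SIZE entries tagged for writeback, plus their offsets */
    nr = find_get_entries_tag(mapping, start_index, PAGECACHE_TAG_TOWRITE,
                              PAGEVEC_SIZE, pvec.pages, indices);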

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Al Viro

    Al Viro
     
  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, with
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become a rwsem, with ->lookup() done with it held
    only shared.
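
    Per that description, each inode_*() wrapper maps onto the matching
    mutex_*() call on ->i_mutex:

    static inline void inode_lock(struct inode *inode)
    {
            mutex_lock(&inode->i_mutex);
    }

    static inline void inode_unlock(struct inode *inode)
    {
            mutex_unlock(&inode->i_mutex);
    }

    static inline int inode_trylock(struct inode *inode)
    {
            return mutex_trylock(&inode->i_mutex);
    }

    static inline int inode_is_locked(struct inode *inode)
    {
            return mutex_is_locked(&inode->i_mutex);
    }

    static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
    {
            mutex_lock_nested(&inode->i_mutex, subclass);
    }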

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

3 commits

  • This crash is caused by a NULL pointer dereference in the page_to_pfn()
    macro when page == NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    Internal error: Oops: 94000006 [#1] SMP
    Modules linked in:
    CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    PC is at khugepaged+0x378/0x1af8
    LR is at khugepaged+0x418/0x1af8
    Process khugepaged (pid: 26, stack limit = 0xffffffc079638020)
    Call trace:
    khugepaged+0x378/0x1af8
    kthread+0xdc/0xf4
    ret_from_fork+0xc/0x40
    Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0)
    ---[ end trace 637503d8e28ae69e ]---
    Kernel panic - not syncing: Fatal exception
    CPU2: stopping
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    Hardware name: linux,dummy-virt (DT)

    [akpm@linux-foundation.org: fix fat-fingered merge resolution]
    Signed-off-by: yalin wang
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
            void *addr;

            system("grep Mlocked /proc/meminfo");
            addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                        -1, 0);
            if (addr == MAP_FAILED)
                    printf("mmap() failed\n"), exit(1);
            munmap(addr, SIZE);
            system("grep Mlocked /proc/meminfo");
            return 0;
    }

    It happens in munlock_vma_page() due to an unfortunate choice of data
    type for nr_pages:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    With nr_pages being an unsigned int, implicitly cast to long in
    __mod_zone_page_state(), it becomes something around UINT_MAX.

    munlock_vma_page() is usually called for THP, as small pages go through
    the pagevec.

    Let's make nr_pages signed int.

    Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()") used `long' type, but `int' here is OK for a
    count of the number of sub-pages in a huge page.
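
    A tiny userspace illustration of that implicit conversion (not kernel
    code):

    #include <stdio.h>

    static void mod_state(long delta)
    {
            printf("%ld\n", delta);
    }

    int main(void)
    {
            unsigned int nr_pages = 512;    /* sub-pages of a 2M THP */
            int nr_pages_fixed = 512;

            mod_state(-nr_pages);           /* prints 4294966784: the underflow */
            mod_state(-nr_pages_fixed);     /* prints -512, as intended */
            return 0;
    }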

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After the THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    the ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return the ptl on success and
    NULL on failure.
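
    Callers then follow this pattern (sketch based on the convention
    described above; the argument order is assumed):

    spinlock_t *ptl;

    ptl = pmd_trans_huge_lock(pmd, vma);
    if (ptl) {
            /* pmd is a stable huge pmd here, protected by ptl */
            spin_unlock(ptl);
    }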

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

3 commits

  • Provide statistics on how much of a cgroup's memory footprint is made up
    of socket buffers from network connections owned by the group.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Provide a cgroup2 memory.stat that provides statistics on LRU memory
    and fault event counters. More consumers and breakdowns will follow.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing page->mem_cgroup of a live page is tricky and fragile. In
    particular, the memcg writeback code relies on that mapping being stable
    and users of mem_cgroup_replace_page() not overlapping with dirtyable
    inodes.

    Page cache replacement doesn't have to do that, though. Instead of being
    clever and transferring the charge from the old page to the new,
    force-charge the new page and leave the old page alone. A temporary
    overcharge won't matter in practice, and the old page is going to be freed
    shortly after this anyway. And this is not performance critical.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner