Commit 0f8975ec4db2c8b5bd111b211292ca9be0feb6b8

Authored by Pavel Emelyanov
Committed by Linus Torvalds
1 parent 2b0a9f0175

mm: soft-dirty bits for user memory changes tracking

Soft-dirty is a bit on a PTE which helps track which pages a task
writes to.  In order to do this tracking one should

  1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs")
  2. Wait some time.
  3. Read the soft-dirty bits (bit 55 in /proc/PID/pagemap2 entries)

To make this tracking possible, the writable bit is cleared from a PTE
whenever its soft-dirty bit is.  Thus, when the task later tries to modify
a page at some virtual address, a #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.

Note that although the task's entire address space is marked read-only
after the soft-dirty bits are cleared, the #PF-s that occur afterwards are
handled quickly.  This is because the pages are still mapped to physical
memory, so all the kernel has to do is detect that fact and put the
writable, dirty and soft-dirty bits back on the PTE.

Another thing to note is that when mremap moves PTEs, they are marked
soft-dirty as well, since from the user's perspective mremap modifies the
virtual memory at its new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 11 changed files with 158 additions and 10 deletions

Documentation/filesystems/proc.txt
... ... @@ -473,7 +473,8 @@
473 473 enabled.
474 474  
475 475 The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
476   -bits on both physical and virtual pages associated with a process.
  476 +bits on both physical and virtual pages associated with a process, and the
  477 +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
477 478 To clear the bits for all the pages associated with the process
478 479 > echo 1 > /proc/PID/clear_refs
479 480  
... ... @@ -482,6 +483,10 @@
482 483  
483 484 To clear the bits for the file mapped pages associated with the process
484 485 > echo 3 > /proc/PID/clear_refs
  486 +
  487 +To clear the soft-dirty bit
  488 + > echo 4 > /proc/PID/clear_refs
  489 +
485 490 Any other value written to /proc/PID/clear_refs will have no effect.
486 491  
487 492 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
Documentation/vm/soft-dirty.txt
  1 + SOFT-DIRTY PTEs
  2 +
  3 + Soft-dirty is a bit on a PTE which helps track which pages a task
  4 +writes to. In order to do this tracking one should
  5 +
  6 + 1. Clear soft-dirty bits from the task's PTEs.
  7 +
  8 + This is done by writing "4" into the /proc/PID/clear_refs file of the
  9 + task in question.
  10 +
  11 + 2. Wait some time.
  12 +
  13 + 3. Read soft-dirty bits from the PTEs.
  14 +
  15 + This is done by reading /proc/PID/pagemap. Bit 55 of each
  16 + 64-bit qword is the soft-dirty one. If it is set, the respective PTE
  17 + was written to since step 1.
  18 +
  19 +
  20 + Internally, to do this tracking, the writable bit is cleared from PTEs
  21 +when the soft-dirty bit is cleared. So, after this, when the task tries to
  22 +modify a page at some virtual address, a #PF occurs and the kernel sets
  23 +the soft-dirty bit on the respective PTE.
  24 +
  25 + Note that although all the task's address space is marked as r/o after
  26 +the soft-dirty bits are cleared, the #PF-s that occur after that are handled
  27 +fast. This is because the pages are still mapped to physical memory, so all
  28 +the kernel does is detect this fact and put both the writable and soft-dirty
  29 +bits back on the PTE.
  30 +
  31 +
  32 + This feature is actively used by the checkpoint-restore project. You
  33 +can find more details about it on http://criu.org
  34 +
  35 +
  36 +-- Pavel Emelyanov, Apr 9, 2013
arch/Kconfig
... ... @@ -365,6 +365,9 @@
365 365 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
366 366 bool
367 367  
  368 +config HAVE_ARCH_SOFT_DIRTY
  369 + bool
  370 +
368 371 config HAVE_MOD_ARCH_SPECIFIC
369 372 bool
370 373 help
arch/x86/Kconfig
... ... @@ -102,6 +102,7 @@
102 102 select HAVE_ARCH_SECCOMP_FILTER
103 103 select BUILDTIME_EXTABLE_SORT
104 104 select GENERIC_CMOS_UPDATE
  105 + select HAVE_ARCH_SOFT_DIRTY
105 106 select CLOCKSOURCE_WATCHDOG
106 107 select GENERIC_CLOCKEVENTS
107 108 select ARCH_CLOCKSOURCE_DATA if X86_64
arch/x86/include/asm/pgtable.h
... ... @@ -207,7 +207,7 @@
207 207  
208 208 static inline pte_t pte_mkdirty(pte_t pte)
209 209 {
210   - return pte_set_flags(pte, _PAGE_DIRTY);
  210 + return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
211 211 }
212 212  
213 213 static inline pte_t pte_mkyoung(pte_t pte)
... ... @@ -271,7 +271,7 @@
271 271  
272 272 static inline pmd_t pmd_mkdirty(pmd_t pmd)
273 273 {
274   - return pmd_set_flags(pmd, _PAGE_DIRTY);
  274 + return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
275 275 }
276 276  
277 277 static inline pmd_t pmd_mkhuge(pmd_t pmd)
... ... @@ -292,6 +292,26 @@
292 292 static inline pmd_t pmd_mknotpresent(pmd_t pmd)
293 293 {
294 294 return pmd_clear_flags(pmd, _PAGE_PRESENT);
  295 +}
  296 +
  297 +static inline int pte_soft_dirty(pte_t pte)
  298 +{
  299 + return pte_flags(pte) & _PAGE_SOFT_DIRTY;
  300 +}
  301 +
  302 +static inline int pmd_soft_dirty(pmd_t pmd)
  303 +{
  304 + return pmd_flags(pmd) & _PAGE_SOFT_DIRTY;
  305 +}
  306 +
  307 +static inline pte_t pte_mksoft_dirty(pte_t pte)
  308 +{
  309 + return pte_set_flags(pte, _PAGE_SOFT_DIRTY);
  310 +}
  311 +
  312 +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
  313 +{
  314 + return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
295 315 }
296 316  
297 317 /*
arch/x86/include/asm/pgtable_types.h
... ... @@ -55,6 +55,18 @@
55 55 #define _PAGE_HIDDEN (_AT(pteval_t, 0))
56 56 #endif
57 57  
  58 +/*
  59 + * The same hidden bit is used by kmemcheck, but since kmemcheck
  60 + * works on kernel pages while soft-dirty engine on user space,
  61 + * they do not conflict with each other.
  62 + */
  63 +
  64 +#ifdef CONFIG_MEM_SOFT_DIRTY
  65 +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
  66 +#else
  67 +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0))
  68 +#endif
  69 +
58 70 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
59 71 #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
60 72 #else
fs/proc/task_mmu.c
... ... @@ -11,6 +11,7 @@
11 11 #include <linux/rmap.h>
12 12 #include <linux/swap.h>
13 13 #include <linux/swapops.h>
  14 +#include <linux/mmu_notifier.h>
14 15  
15 16 #include <asm/elf.h>
16 17 #include <asm/uaccess.h>
17 18  
18 19  
... ... @@ -692,13 +693,32 @@
692 693 CLEAR_REFS_ALL = 1,
693 694 CLEAR_REFS_ANON,
694 695 CLEAR_REFS_MAPPED,
  696 + CLEAR_REFS_SOFT_DIRTY,
695 697 CLEAR_REFS_LAST,
696 698 };
697 699  
698 700 struct clear_refs_private {
699 701 struct vm_area_struct *vma;
  702 + enum clear_refs_types type;
700 703 };
701 704  
  705 +static inline void clear_soft_dirty(struct vm_area_struct *vma,
  706 + unsigned long addr, pte_t *pte)
  707 +{
  708 +#ifdef CONFIG_MEM_SOFT_DIRTY
  709 + /*
  710 + * The soft-dirty tracker uses #PF-s to catch writes
  711 + * to pages, so write-protect the pte as well. See the
  712 + * Documentation/vm/soft-dirty.txt for full description
  713 + * of how soft-dirty works.
  714 + */
  715 + pte_t ptent = *pte;
  716 + ptent = pte_wrprotect(ptent);
  717 + ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY);
  718 + set_pte_at(vma->vm_mm, addr, pte, ptent);
  719 +#endif
  720 +}
  721 +
702 722 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
703 723 unsigned long end, struct mm_walk *walk)
704 724 {
... ... @@ -718,6 +738,11 @@
718 738 if (!pte_present(ptent))
719 739 continue;
720 740  
  741 + if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
  742 + clear_soft_dirty(vma, addr, pte);
  743 + continue;
  744 + }
  745 +
721 746 page = vm_normal_page(vma, addr, ptent);
722 747 if (!page)
723 748 continue;
... ... @@ -759,6 +784,7 @@
759 784 mm = get_task_mm(task);
760 785 if (mm) {
761 786 struct clear_refs_private cp = {
  787 + .type = type,
762 788 };
763 789 struct mm_walk clear_refs_walk = {
764 790 .pmd_entry = clear_refs_pte_range,
... ... @@ -766,6 +792,8 @@
766 792 .private = &cp,
767 793 };
768 794 down_read(&mm->mmap_sem);
  795 + if (type == CLEAR_REFS_SOFT_DIRTY)
  796 + mmu_notifier_invalidate_range_start(mm, 0, -1);
769 797 for (vma = mm->mmap; vma; vma = vma->vm_next) {
770 798 cp.vma = vma;
771 799 if (is_vm_hugetlb_page(vma))
... ... @@ -786,6 +814,8 @@
786 814 walk_page_range(vma->vm_start, vma->vm_end,
787 815 &clear_refs_walk);
788 816 }
  817 + if (type == CLEAR_REFS_SOFT_DIRTY)
  818 + mmu_notifier_invalidate_range_end(mm, 0, -1);
789 819 flush_tlb_mm(mm);
790 820 up_read(&mm->mmap_sem);
791 821 mmput(mm);
... ... @@ -827,6 +857,7 @@
827 857 /* in "new" pagemap pshift bits are occupied with more status bits */
828 858 #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
829 859  
  860 +#define __PM_SOFT_DIRTY (1LL)
830 861 #define PM_PRESENT PM_STATUS(4LL)
831 862 #define PM_SWAP PM_STATUS(2LL)
832 863 #define PM_FILE PM_STATUS(1LL)
... ... @@ -868,6 +899,7 @@
868 899 {
869 900 u64 frame, flags;
870 901 struct page *page = NULL;
  902 + int flags2 = 0;
871 903  
872 904 if (pte_present(pte)) {
873 905 frame = pte_pfn(pte);
874 906  
875 907  
... ... @@ -888,13 +920,15 @@
888 920  
889 921 if (page && !PageAnon(page))
890 922 flags |= PM_FILE;
  923 + if (pte_soft_dirty(pte))
  924 + flags2 |= __PM_SOFT_DIRTY;
891 925  
892   - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
  926 + *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
893 927 }
894 928  
895 929 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
896 930 static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
897   - pmd_t pmd, int offset)
  931 + pmd_t pmd, int offset, int pmd_flags2)
898 932 {
899 933 /*
900 934 * Currently pmd for thp is always present because thp can not be
901 935  
... ... @@ -903,13 +937,13 @@
903 937 */
904 938 if (pmd_present(pmd))
905 939 *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
906   - | PM_STATUS2(pm->v2, 0) | PM_PRESENT);
  940 + | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
907 941 else
908 942 *pme = make_pme(PM_NOT_PRESENT(pm->v2));
909 943 }
910 944 #else
911 945 static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
912   - pmd_t pmd, int offset)
  946 + pmd_t pmd, int offset, int pmd_flags2)
913 947 {
914 948 }
915 949 #endif
916 950  
... ... @@ -926,12 +960,15 @@
926 960 /* find the first VMA at or above 'addr' */
927 961 vma = find_vma(walk->mm, addr);
928 962 if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
  963 + int pmd_flags2;
  964 +
  965 + pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0);
929 966 for (; addr != end; addr += PAGE_SIZE) {
930 967 unsigned long offset;
931 968  
932 969 offset = (addr & ~PAGEMAP_WALK_MASK) >>
933 970 PAGE_SHIFT;
934   - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
  971 + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
935 972 err = add_to_pagemap(addr, &pme, pm);
936 973 if (err)
937 974 break;
include/asm-generic/pgtable.h
... ... @@ -396,6 +396,28 @@
396 396 #define arch_start_context_switch(prev) do {} while (0)
397 397 #endif
398 398  
  399 +#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
  400 +static inline int pte_soft_dirty(pte_t pte)
  401 +{
  402 + return 0;
  403 +}
  404 +
  405 +static inline int pmd_soft_dirty(pmd_t pmd)
  406 +{
  407 + return 0;
  408 +}
  409 +
  410 +static inline pte_t pte_mksoft_dirty(pte_t pte)
  411 +{
  412 + return pte;
  413 +}
  414 +
  415 +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
  416 +{
  417 + return pmd;
  418 +}
  419 +#endif
  420 +
399 421 #ifndef __HAVE_PFNMAP_TRACKING
400 422 /*
401 423 * Interfaces that can be used by architecture code to keep track of
mm/Kconfig
... ... @@ -477,4 +477,16 @@
477 477 and swap data is stored as normal on the matching swap device.
478 478  
479 479 If unsure, say Y to enable frontswap.
  480 +
  481 +config MEM_SOFT_DIRTY
  482 + bool "Track memory changes"
  483 + depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
  484 + select PROC_PAGE_MONITOR
  485 + help
  486 + This option enables memory changes tracking by introducing a
  487 + soft-dirty bit on pte-s. This bit is set when someone writes
  488 + into a page, just like the regular dirty bit, but unlike the
  489 + latter it can be cleared by hand.
  490 +
  491 + See Documentation/vm/soft-dirty.txt for more details.
mm/huge_memory.c
... ... @@ -1429,7 +1429,7 @@
1429 1429 if (ret == 1) {
1430 1430 pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
1431 1431 VM_BUG_ON(!pmd_none(*new_pmd));
1432   - set_pmd_at(mm, new_addr, new_pmd, pmd);
  1432 + set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
1433 1433 spin_unlock(&mm->page_table_lock);
1434 1434 }
1435 1435 out:
mm/mremap.c
... ... @@ -126,7 +126,7 @@
126 126 continue;
127 127 pte = ptep_get_and_clear(mm, old_addr, old_pte);
128 128 pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
129   - set_pte_at(mm, new_addr, new_pte, pte);
  129 + set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
130 130 }
131 131  
132 132 arch_leave_lazy_mmu_mode();