20 Jan, 2017

1 commit

  • commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream.

    return_unused_surplus_pages() decrements the global reservation count,
    and frees any unused surplus pages that were backing the reservation.

    Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
    return_unused_surplus_pages()") added a call to cond_resched_lock in the
    loop freeing the pages.

    As a result, the hugetlb_lock could be dropped, and someone else could
    use the pages that will be freed in subsequent iterations of the loop.
    This could result in inconsistent global hugetlb page state, application
    API (such as mmap) failures, or application crashes.

    When dropping the lock in return_unused_surplus_pages, make sure that
    the global reservation count (resv_huge_pages) remains sufficiently
    large to prevent someone else from claiming pages about to be freed.
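
    A minimal sketch of the idea (simplified, not the exact upstream diff):
    give up reservations one page at a time, just before each page is freed,
    so that any point where cond_resched_lock() drops hugetlb_lock still
    shows a reservation count large enough to cover the pages that remain
    to be freed.

    static void return_unused_surplus_pages(struct hstate *h,
                                            unsigned long unused_resv_pages)
    {
            unsigned long nr_pages = min(unused_resv_pages,
                                         h->surplus_huge_pages);

            while (nr_pages--) {
                    h->resv_huge_pages--;           /* give up one reservation */
                    unused_resv_pages--;
                    if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
                            goto out;
                    /* hugetlb_lock may be dropped and re-taken here */
                    cond_resched_lock(&hugetlb_lock);
            }

    out:
            /* uncommit whatever reservations could not be backed by pages */
            h->resv_huge_pages -= unused_resv_pages;
    }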

    Analyzed by Paul Cassella.

    Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
    Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Paul Cassella
    Suggested-by: Michal Hocko
    Cc: Masayoshi Mizuma
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

12 Jan, 2017

1 commit

  • commit 3999f52e3198e76607446ab1a4610c1ddc406c56 upstream.

    We cannot use the pte value passed to set_pte_at for the pte_same
    comparison, because archs like ppc64 filter/add new pte flags in
    set_pte_at. Instead, fetch the pte value inside hugetlb_cow. We compare
    pte values to make sure the pte didn't change since we dropped the page
    table lock. hugetlb_cow gets called with the page table lock held, so
    we can take a copy of the pte value before we drop the lock.

    With hugetlbfs, we optimize the MAP_PRIVATE write fault path when there
    is no previous mapping (huge_pte_none entries) by forcing a COW in the
    fault path. This avoids taking an additional fault to convert a
    read-only mapping to read/write. Here we were comparing a recently
    instantiated pte (via set_pte_at) to the pte value from the Linux page
    table. As explained above, on ppc64 such a pte_same check returned the
    wrong result, causing us to take an additional fault on ppc64.
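
    A hedged sketch of the resulting flow in hugetlb_cow() (heavily
    simplified; details omitted): the pte value used for the later
    pte_same() check is read back from the page table with huge_ptep_get()
    while the page table lock is still held, rather than reusing the value
    that was passed to set_pte_at().

    static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep,
                           struct page *pagecache_page, spinlock_t *ptl)
    {
            pte_t pte = huge_ptep_get(ptep);        /* called with ptl held */
            struct page *old_page = pte_page(pte);

            /* ... ptl is dropped while allocating and copying the new page ... */

            spin_lock(ptl);                         /* re-take before the check */
            if (likely(pte_same(huge_ptep_get(ptep), pte))) {
                    /* pte unchanged since the lock was dropped: install new page */
            }
            /* ... (returns with ptl held, as the caller expects) */
            return 0;
    }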

    Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit")
    Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Jan Stancek
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Scott Wood
    Cc: Michael Ellerman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

12 Nov, 2016

1 commit

  • Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
    allocated huge page.

    If a reservation was associated with the huge page, alloc_huge_page()
    consumed the reservation while allocating. When the newly allocated
    page is freed in free_huge_page(), it will increment the global
    reservation count. However, the reservation entry in the reserve map
    will remain.

    This is not an issue for shared mappings as the entry in the reserve map
    indicates a reservation exists. But, an entry in a private mapping
    reserve map indicates the reservation was consumed and no longer exists.
    This results in an inconsistency between the reserve map and the global
    reservation count. This 'leaks' a reserved huge page.

    Create a new routine restore_reserve_on_error() to restore the reserve
    entry in these specific error paths. This routine makes use of a new
    function vma_add_reservation() which will add a reserve entry for a
    specific address/page.

    In general, these error paths were rarely (if ever) taken on most
    architectures. However, powerpc contained arch specific code that
    resulted in an extra fault and execution of these error paths on all
    private mappings.
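
    A hedged sketch of the new helper, based on the description above (the
    upstream version handles a few more corner cases):

    static void restore_reserve_on_error(struct hstate *h,
                                         struct vm_area_struct *vma,
                                         unsigned long address,
                                         struct page *page)
    {
            if (unlikely(PagePrivate(page))) {
                    long rc = vma_needs_reservation(h, vma, address);

                    if (unlikely(rc < 0))
                            /*
                             * Reserve map manipulation failed; clear
                             * PagePrivate so free_huge_page() does not
                             * increment the global reserve count.
                             */
                            ClearPagePrivate(page);
                    else if (rc)
                            /* re-create the missing reserve map entry */
                            (void)vma_add_reservation(h, vma, address);
                    else
                            vma_end_reservation(h, vma, address);
            }
    }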

    Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
    Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Tested-by: Jan Stancek
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Kirill A . Shutemov
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

08 Oct, 2016

5 commits

    When the huge page is added to the page cache (huge_add_to_page_cache),
    the page private flag will be cleared. Since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cache, PagePrivate(page) will always be false.

    The patch removes the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
    Avoid the #ifdef becoming unwieldy if many architectures support
    gigantic pages. No functional change with this patch.

    Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hanjun Guo
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Sudeep Holla
    Cc: Catalin Marinas
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
    For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
    call dissolve_free_huge_page(), which takes the hugetlb spinlock, even
    if the page is not huge at all or is a hugepage that is in use.

    Improve this by doing the PageHuge() and page_count() checks already in
    dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
    dissolve_free_huge_page(), when holding the spinlock, those checks need
    to be revalidated.
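
    A simplified sketch of the pattern (not the exact upstream code): the
    cheap checks are done lock-free in the caller and repeated once the
    lock is held, before anything is modified.

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
            struct page *page = pfn_to_page(pfn);

            /* lock-free pre-check: skip pages that clearly cannot be
             * dissolved, so hugetlb_lock is not taken for every pfn */
            if (PageHuge(page) && !page_count(page))
                    dissolve_free_huge_page(page);
    }

    /* ... and inside dissolve_free_huge_page(): */
    spin_lock(&hugetlb_lock);
    if (PageHuge(page) && !page_count(page)) {      /* revalidate under lock */
            /* remove the free hugepage from its hstate's free list */
    }
    spin_unlock(&hugetlb_lock);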

    Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • In dissolve_free_huge_pages(), free hugepages will be dissolved without
    making sure that there are enough of them left to satisfy hugepage
    reservations.

    Fix this by adding a return value to dissolve_free_huge_pages() and
    checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
    lead to the situation where dissolve_free_huge_page() returns an error
    and all free hugepages that were dissolved before that error are lost,
    while the memory block still cannot be set offline.
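
    A hedged sketch of the reservation check added under hugetlb_lock
    (simplified from the description above):

    static int dissolve_free_huge_page(struct page *page)
    {
            int rc = 0;

            spin_lock(&hugetlb_lock);
            if (PageHuge(page) && !page_count(page)) {
                    struct hstate *h = page_hstate(page);

                    if (h->free_huge_pages - h->resv_huge_pages == 0) {
                            /* dissolving would leave reservations unbacked */
                            rc = -EBUSY;
                            goto out;
                    }
                    /* ... dissolve the page ... */
            }
    out:
            spin_unlock(&hugetlb_lock);
            return rc;
    }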

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

    This addresses several issues with hugepages and memory offline. While
    the first patch fixes a panic, and is therefore rather important, the
    last patch is just a performance optimization.

    The second patch fixes a theoretical issue with reserved hugepages,
    while still leaving some ugly usability issue, see description.

    This patch (of 3):

    dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
    list corruption and addressing exception when trying to set a memory
    block offline that is part (but not the first part) of a "gigantic"
    hugetlb page with a size > memory block size.

    When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
    will trigger directly. In the other case we will run into an addressing
    exception later, because dissolve_free_huge_page() will not work on the
    head page of the compound hugetlb page which will result in a NULL
    hstate from page_hstate().

    To fix this, first remove the VM_BUG_ON() because it is wrong, and then
    use the compound head page in dissolve_free_huge_page(). This means
    that an unused pre-allocated gigantic page that has any part of itself
    inside the memory block that is going offline will be dissolved
    completely. Losing an unused gigantic hugepage is preferable to failing
    the memory offline, for example in the situation where a (possibly
    faulty) memory DIMM needs to go offline.
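
    A hedged sketch of the body of dissolve_free_huge_page() under
    hugetlb_lock after this change (simplified): always operate on the
    compound head, so a pfn that lands in the middle of a free gigantic
    page dissolves the whole page instead of tripping over a tail page with
    no valid hstate.

    if (PageHuge(page) && !page_count(page)) {
            struct page *head = compound_head(page);
            struct hstate *h = page_hstate(head);
            int nid = page_to_nid(head);

            list_del(&head->lru);
            h->free_huge_pages--;
            h->free_huge_pages_node[nid]--;
            h->max_huge_pages--;
            update_and_free_page(h, head);
    }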

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

12 Aug, 2016

1 commit

  • When memory hotplug operates, free hugepages will be freed if the
    movable node is offline. Therefore, /proc/sys/vm/nr_hugepages will be
    incorrect.

    Fix it by reducing max_huge_pages when the node is offlined.

    n-horiguchi@ah.jp.nec.com said:

    : dissolve_free_huge_page intends to break a hugepage into buddy, and the
    : destination hugepage is supposed to be allocated from the pool of the
    : destination node, so the system-wide pool size is reduced. So adding
    : h->max_huge_pages-- makes sense to me.

    Link: http://lkml.kernel.org/r/1470624546-902-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

05 Aug, 2016

1 commit

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     

03 Aug, 2016

2 commits

    Zhong Jiang has reported a BUG_ON in huge_pte_alloc hitting when he
    runs his database load with memory onlining and offlining running in
    parallel. The reason is that huge_pmd_share might detect a shared pmd
    which is currently under migration and so carries a migration pte,
    which is !pte_huge.

    There doesn't seem to be any easy way to prevent the race, and in fact
    seeing the migration swap entry is not harmful. Both callers of
    huge_pte_alloc are prepared to handle it. copy_hugetlb_page_range will
    copy the swap entry and make it COW if needed. hugetlb_fault will back
    off, so the page fault is retried if the page is still under migration,
    and it waits for the migration to complete in hugetlb_fault.

    That means that the BUG_ON is wrong and we should update it. Let's
    simply check that all present ptes are pte_huge instead.
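
    A hedged sketch of the relaxed assertion at the end of huge_pte_alloc()
    (a migration entry is !pte_present(), so it no longer trips the check):

    BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));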

    Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    On powerpc servers with large memory (32TB), we observed several soft
    lockups for hugepages under stress tests.

    The call traces are as follows:
    1.
    get_page_from_freelist+0x2d8/0xd50
    __alloc_pages_nodemask+0x180/0xc20
    alloc_fresh_huge_page+0xb0/0x190
    set_max_huge_pages+0x164/0x3b0

    2.
    prep_new_huge_page+0x5c/0x100
    alloc_fresh_huge_page+0xc8/0x190
    set_max_huge_pages+0x164/0x3b0

    This patch fixes such soft lockups. It is safe to call cond_resched()
    there because it is outside the spin_lock/unlock section.
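
    A hedged sketch of the allocation loop in set_max_huge_pages() with the
    added reschedule point (simplified):

    while (count > persistent_huge_pages(h)) {
            /* ... */
            spin_unlock(&hugetlb_lock);

            /* yield the CPU; safe because hugetlb_lock is not held here */
            cond_resched();

            ret = alloc_fresh_huge_page(h, nodes_allowed);
            spin_lock(&hugetlb_lock);
            if (!ret)
                    goto out;
    }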

    Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia He
     

29 Jul, 2016

3 commits

  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
    dequeue_hwpoisoned_huge_page() can be called without the page lock
    held, so let's remove the incorrect comment.

    The reason the page lock is not really needed is that
    dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
    hugetlb_lock, which allows us to avoid trying to dequeue a hugepage
    that has just been allocated but not yet linked to the active list,
    even without taking the page lock.

    Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Zhan Chen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

27 Jul, 2016

4 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
    This allows an arch which needs to do special handling with respect to
    different page sizes when flushing the TLB to implement the same in the
    mmu gather.

    Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    For hugetlb, like THP (and unlike regular pages), we do the TLB flush
    after dropping the ptl. Because of the above, we don't need to track
    force_flush like we do now. Instead we can simply call
    tlb_remove_page(), which will do the flush if needed.

    No functionality change in this patch.

    Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Pull s390 updates from Martin Schwidefsky:
    "There are a couple of new things for s390 with this merge request:

    - a new scheduling domain "drawer" is added to reflect the unusual
    topology found on z13 machines. Performance tests showed up to 8
    percent gain with the additional domain.

    - the new crc-32 checksum crypto module uses the vector-galois-field
    multiply and sum SIMD instruction to speed up crc-32 and crc-32c.

    - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
    the generic vmlinux.lds linker script definitions.

    - kcov instrumentation support. A prerequisite for that is the
    inline assembly basic block cleanup, which is the reason for the
    net/iucv/iucv.c change.

    - support for 2GB pages is added to the hugetlbfs backend.

    Then there are two removals:

    - the oprofile hardware sampling support is dead code and is removed.
    The oprofile user space uses the perf interface nowadays.

    - the ETR clock synchronization is removed, this has been superseded
    by the STP clock synchronization. And it always has been
    "interesting" code..

    And the usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
    s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
    s390/smp: clean up a condition
    s390/cio/chp : Remove deprecated create_singlethread_workqueue
    s390/chsc: improve channel path descriptor determination
    s390/chsc: sanitize fmt check for chp_desc determination
    s390/cio: make fmt1 channel path descriptor optional
    s390/chsc: fix ioctl CHSC_INFO_CU command
    s390/cio/device_ops: fix kernel doc
    s390/cio: allow to reset channel measurement block
    s390/console: Make preferred console handling more consistent
    s390/mm: fix gmap tlb flush issues
    s390/mm: add support for 2GB hugepages
    s390: have unique symbol for __switch_to address
    s390/cpuinfo: show maximum thread id
    s390/ptrace: clarify bits in the per_struct
    s390: stack address vs thread_info
    s390: remove pointless load within __switch_to
    s390: enable kcov support
    s390/cpumf: use basic block for ecctr inline assembly
    s390/hypfs: use basic block for diag inline assembly
    ...

    Linus Torvalds
     

15 Jul, 2016

1 commit

  • The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
    worth: the syzkaller fuzzer hit it again. It's still wrong for some THP
    cases, because linear_page_index() was never intended to apply to
    addresses before the start of a vma.

    That's easily fixed with a signed long cast inside linear_page_index();
    and Dmitry has tested such a patch, to verify the false positive. But
    why extend linear_page_index() just for this case? when the avoidance in
    page_move_anon_rmap() has already grown ugly, and there's no reason for
    the check at all (nothing else there is using address or index).

    Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
    remove CONFIG_DEBUG_VM PageTransHuge adjustment.

    And one more thing: should the compound_head(page) be done inside or
    outside page_move_anon_rmap()? It's usually pushed down to the lowest
    level nowadays (and mm/memory.c shows no other explicit use of it), so I
    think it's better done in page_move_anon_rmap() than by caller.

    Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Mika Westerberg
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Jun, 2016

2 commits

  • While working on s390 support for gigantic hugepages I ran into the
    following "Bad page state" warning when freeing gigantic pages:

    BUG: Bad page state in process bash pfn:580001
    page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
    flags: 0x7fffc0000000000()
    page dumped because: non-NULL mapping

    This is because page->compound_mapcount, which is part of a union with
    page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
    and not cleared again during destroy_compound_gigantic_page(). Fix this
    by clearing the compound_mapcount in destroy_compound_gigantic_page()
    before clearing compound_head.

    Interestingly enough, the warning will not show up on x86_64, although
    this should not be architecture specific. Apparently there is an
    endianness issue, combined with the fact that the union contains both a
    64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
    members. The resulting bogus page->mapping on x86_64 therefore contains
    00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
    trigger the PageAnon() check in free_pages_prepare() because
    page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
    like x86_64 in this case (the page is not compound anymore,
    ->compound_head was already cleared before). As a result, page->mapping
    will be cleared before doing the checks in free_pages_check().

    Not sure if the bogus "PageAnon() returning true" on x86_64 for the
    first tail page of a gigantic page (at this stage) has other theoretical
    implications, but they would also be fixed with this patch.
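
    A hedged sketch of the fix in destroy_compound_gigantic_page()
    (simplified): clear the compound_mapcount, which lives in the first
    tail page and shares storage with ->mapping, before the compound
    structure is torn down.

    static void destroy_compound_gigantic_page(struct page *page,
                                               unsigned int order)
    {
            int i;
            int nr_pages = 1 << order;
            struct page *p = page + 1;

            atomic_set(compound_mapcount_ptr(page), 0);     /* the fix */
            for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                    clear_compound_head(p);
                    set_page_refcounted(p);
            }

            set_compound_order(page, 0);
            __ClearPageHead(page);
    }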

    Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A . Shutemov"
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
    We account HugeTLB's shared page table to all processes that share it.
    The accounting happens during huge_pmd_share().

    If somebody populates the pud entry under us, we should decrease the
    page table's refcount and decrease nr_pmds of the process.

    By mistake, I increase nr_pmds again in this case. :-/ It will lead to
    "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.

    Let's fix this by increasing nr_pmds only when we're sure that the page
    table will be used.
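
    A hedged sketch of the corrected accounting in huge_pmd_share()
    (simplified): nr_pmds is only bumped when the shared page table is
    actually installed.

    spin_lock(ptl);
    if (pud_none(*pud)) {
            /* we install the shared page table: account it to this mm */
            pud_populate(mm, pud,
                         (pmd_t *)((unsigned long)spte & PAGE_MASK));
            mm_inc_nr_pmds(mm);
    } else {
            /* somebody else populated the pud: drop our extra reference */
            put_page(virt_to_page(spte));
    }
    spin_unlock(ptl);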

    Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
    Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: zhongjiang
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

10 Jun, 2016

1 commit

  • When creating a private mapping of a hugetlbfs file, it is possible to
    unmap pages via ftruncate or fallocate hole punch. If subsequent faults
    repopulate these mappings, the reserve counts will go negative. This is
    because the code currently assumes all faults to private mappings will
    consume reserves. The problem can be recreated as follows:

    - mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
    - write fault in pages in the mapping
    - fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
    - write fault in pages in the hole

    This will result in negative huge page reserve counts and negative
    subpool usage counts for the hugetlbfs. Note that this can also be
    recreated with ftruncate, but fallocate is more straightforward.

    This patch modifies the routines vma_needs_reserves and vma_has_reserves
    to examine the reserve map associated with private mappings similar to
    that for shared mappings. However, the reserve map semantics for
    private and shared mappings are very different. This results in subtly
    different code that is explained in the comments.
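
    A minimal userspace reproduction of the steps listed above, assuming a
    hugetlbfs mount at /dev/hugepages and a few free 2MB huge pages; the
    path, file name and sizes are illustrative and not taken from the
    commit.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t huge = 2UL << 20;          /* assumed huge page size */
            char *addr;
            int fd = open("/dev/hugepages/repro", O_CREAT | O_RDWR, 0600);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            addr = mmap(NULL, 4 * huge, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(addr, 1, 4 * huge);              /* write fault all pages */
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      huge, 2 * huge);              /* punch a hole */
            memset(addr + huge, 2, 2 * huge);       /* write fault in the hole */

            /* before the fix, HugePages_Rsvd in /proc/meminfo could now be
             * seen going negative (wrapping to a huge value) */
            munmap(addr, 4 * huge);
            close(fd);
            unlink("/dev/hugepages/repro");
            return 0;
    }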

    Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: Kirill Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

24 May, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this update was stabilized before the merge window and
    appeared in -next. The "device dax" implementation was revised this
    week in response to review feedback, and to address failures detected
    by the recently expanded ndctl unit test suite.

    Not included in this pull request are two dax topic branches (dax
    error handling, and dax radix-tree locking). These topics were
    deferred to get a few more days of -next integration testing, and to
    coordinate a branch baseline with Ted and the ext4 tree. Vishal and
    Ross will send the error handling and locking topics respectively in
    the next few days.

    This branch has received a positive build result from the kbuild robot
    across 226 configs.

    Summary:

    - Device DAX for persistent memory: Device DAX is the device-centric
    analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
    ranges to be allocated and mapped without need of an intervening
    file system. Device DAX is strict, precise and predictable.
    Specifically this interface:

    a) Guarantees fault granularity with respect to a given page size
    (pte, pmd, or pud) set at configuration time.

    b) Enforces deterministic behavior by being strict about what
    fault scenarios are supported.

    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature
    differentiated memory ranges.

    - Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.

    - Further ACPI 6.1 compliance with support for the common dimm
    identifier format.

    - Various fixes and cleanups across the subsystem"

    * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
    libnvdimm, dax: fix deletion
    libnvdimm, dax: fix alignment validation
    libnvdimm, dax: autodetect support
    libnvdimm: release ida resources
    Revert "block: enable dax for raw block devices"
    /dev/dax, core: file operations and dax-mmap
    /dev/dax, pmem: direct access to persistent memory
    libnvdimm: stop requiring a driver ->remove() method
    libnvdimm, dax: record the specified alignment of a dax-device instance
    libnvdimm, dax: reserve space to store labels for device-dax
    libnvdimm, dax: introduce device-dax infrastructure
    nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
    tools/testing/nvdimm: ND_CMD_CALL support
    nfit: disable vendor specific commands
    nfit: export subsystem ids as attributes
    nfit: fix format interface code byte order per ACPI6.1
    nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
    nfit, libnvdimm: clarify "commands" vs "_DSMs"
    libnvdimm: increase max envelope size for ioctl
    acpi/nfit: Add sysfs "id" for NVDIMM ID
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • The "Device DAX" core enables dax mappings of performance / feature
    differentiated memory. An open mapping or file handle keeps the backing
    struct device live, but new mappings are only possible while the device
    is enabled. Faults are handled under rcu_read_lock to synchronize
    with the enabled state of the device.

    Similar to the filesystem-dax case the backing memory may optionally
    have struct page entries. However, unlike fs-dax there is no support
    for private mappings, or mappings that are not backed by media (see
    use of zero-page in fs-dax).

    Mappings are always guaranteed to match the alignment of the dax_region.
    If the dax_region is configured to have a 2MB alignment, all mappings
    are guaranteed to be backed by a pmd entry. Contrast this determinism
    with the fs-dax case where pmd mappings are opportunistic. If userspace
    attempts to force a misaligned mapping, the driver will fail the mmap
    attempt. See dax_dev_check_vma() for other scenarios that are rejected,
    like MAP_PRIVATE mappings.

    Cc: Hannes Reinecke
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Ross Zwisler
    Acked-by: "Paul E. McKenney"
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

20 May, 2016

5 commits

  • This patchset deals with some problematic sites that iterate pfn ranges.

    There is a system whose nodes' pfns overlap as follows:

    -----pfn-------->
    N0 N1 N2 N0 N1 N2

    Therefore, we need to take care of this overlap when iterating over a
    pfn range.

    I audited many iterating sites that use pfn_valid(), pfn_valid_within(),
    zone_start_pfn, etc., and the others look safe to me. This is a
    preparation step for a new CMA implementation, ZONE_CMA
    (https://lkml.org/lkml/2015/2/12/95), because it would be easily
    overlapped with other zones. But, zone overlap check is also needed for
    the general case so I send it separately.

    This patch (of 5):

    alloc_gigantic_page() uses alloc_contig_range() and this requires that
    the requested range is in a single zone. To satisfy this requirement,
    add this check to pfn_range_valid_gigantic().
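
    A hedged sketch of the check added to pfn_range_valid_gigantic()
    (simplified; the upstream version also rejects reserved, huge and
    referenced pages):

    static bool pfn_range_valid_gigantic(struct zone *z,
                                         unsigned long start_pfn,
                                         unsigned long nr_pages)
    {
            unsigned long i, end_pfn = start_pfn + nr_pages;

            for (i = start_pfn; i < end_pfn; i++) {
                    struct page *page;

                    if (!pfn_valid(i))
                            return false;
                    page = pfn_to_page(i);
                    /* with overlapping nodes, a pfn in this range may
                     * belong to another zone, which alloc_contig_range()
                     * cannot handle, so reject such ranges */
                    if (page_zone(page) != z)
                            return false;
                    /* ... further checks elided ... */
            }
            return true;
    }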

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: "Aneesh Kumar K.V"
    Cc: "Rafael J. Wysocki"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Instead of open-coding it.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    When any unsupported hugepage size is specified, 'hugepagesz=' and
    'hugepages=' should be ignored during command line parsing until any
    supported hugepage size is found. But currently an incorrect number of
    hugepages is allocated when an unsupported size is specified, because
    the parser fails to ignore the 'hugepages=' option.

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
    After boot, dmesg output shows that X number of hugepages of the size 2M
    is pre-allocated instead of 0.

    So, to handle such command line options, introduce a new routine,
    hugetlb_bad_size. The routine hugetlb_bad_size sets the global variable
    parsed_valid_hugepagesz. We use parsed_valid_hugepagesz to record that
    an unsupported hugepage size was found, so that we can ignore the
    'hugepages=' parameters after that and then reset the variable when a
    supported hugepage size is found.

    The routine hugetlb_bad_size can be called while setting 'hugepagesz='
    parameter in an architecture specific code.
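
    A hedged sketch of the mechanism described above (simplified):

    static bool parsed_valid_hugepagesz __initdata = true;

    /* called from arch code when hugepagesz= names an unsupported size */
    void __init hugetlb_bad_size(void)
    {
            parsed_valid_hugepagesz = false;
    }

    static int __init hugetlb_nrpages_setup(char *s)
    {
            if (!parsed_valid_hugepagesz) {
                    pr_warn("hugepages=%s preceded by an unsupported hugepagesz, ignoring\n", s);
                    parsed_valid_hugepagesz = true; /* only skip this one */
                    return 1;
            }
            /* ... normal parsing of the hugepages= value ... */
            return 1;
    }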

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Yaowei Bai
    Cc: Dominik Dingel
    Cc: Kirill A. Shutemov
    Cc: Paul Gortmaker
    Cc: Dave Hansen
    Cc: Benjamin Herrenschmidt
    Cc: James Hogan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     
  • It was observed that minimum size accounting associated with the
    hugetlbfs min_size mount option may not perform optimally and as
    expected. As huge pages/reservations are released from the filesystem
    and given back to the global pools, they are reserved for subsequent
    filesystem use as long as the subpool reserved count is less than
    subpool minimum size. It does not take into account used pages within
    the filesystem. The filesystem size limits are not exceeded and this is
    technically not a bug. However, better behavior would be to wait for
    the number of used pages/reservations associated with the filesystem to
    drop below the minimum size before taking reservations to satisfy
    minimum size.

    An optimization is also made to the hugepage_subpool_get_pages() routine
    which is called when pages/reservations are allocated. This does not
    change behavior, but simply avoids the accounting if all reservations
    have already been taken (subpool reserved count == 0).

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
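
    A hedged sketch of the helper's semantics (the kernel implementation is
    a macro plus an out-of-line __next_node_in() in lib/nodemask.c; this
    only illustrates the wrap-around behaviour):

    static inline int next_node_in(int node, nodemask_t mask)
    {
            int ret = next_node(node, mask);

            if (ret == MAX_NUMNODES)
                    ret = first_node(mask);
            return ret;
    }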

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Apr, 2016

1 commit

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

10 Mar, 2016

2 commits

  • Replace ENOTSUPP with EOPNOTSUPP. If hugepages are not supported, this
    value is propagated to userspace. EOPNOTSUPP is part of uapi and is
    widely supported by libc libraries.

    It gives a nicer message to the user, rather than:

    # cat /proc/sys/vm/nr_hugepages
    cat: /proc/sys/vm/nr_hugepages: Unknown error 524

    And also LTP's proc01 test was failing because this ret code (524)
    was unexpected:

    proc01 1 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
    proc01 2 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
    proc01 3 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524

    Signed-off-by: Jan Stancek
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
  • The warning message "killed due to inadequate hugepage pool" simply
    indicates that SIGBUS was sent, not that the process was forcibly killed.
    If the process has a signal handler installed, that does not fix the
    problem, and this message can rapidly spam the kernel log.

    On my amd64 dev machine that does not have hugepages configured, I can
    reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
    4 megabytes of huge pages) and running something that sets a signal
    handler and forks, like

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    sig_atomic_t counter = 10;
    void handler(int signal)
    {
            if (counter-- == 0)
                    exit(0);
    }

    int main(void)
    {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            *addr = 'x';
            switch (fork()) {
            case -1:
                    perror("fork");
                    return 1;
            case 0:
                    signal(SIGBUS, handler);
                    *addr = 'x';
                    break;
            default:
                    *addr = 'x';
                    wait(&status);
                    if (WIFSIGNALED(status))
                            psignal(WTERMSIG(status), "child");
                    break;
            }
    }

    Signed-off-by: Geoffrey Thomas
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoffrey Thomas
     

19 Feb, 2016

1 commit

    Currently an incorrect default hugepage pool size is reported by
    /proc/sys/vm/nr_hugepages when the number of pages for the default huge
    page size is specified twice.

    When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
    indicates the current number of pre-allocated huge pages of the default
    size. Basically /proc/sys/vm/nr_hugepages displays
    default_hstate->max_huge_pages and, after boot time pre-allocation,
    max_huge_pages should equal the number of pre-allocated pages
    (nr_hugepages).

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with command line option 'default_hugepagesz=1G
    hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'. After
    boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
    returns the value X. However, dmesg output shows that Z huge pages were
    pre-allocated.

    So, the root cause of the problem here is that the global variable
    default_hstate_max_huge_pages is set if a default huge page size is
    specified (directly or indirectly) on the command line. After the command
    line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
    the value is assigned to default_hstate.max_huge_pages. However,
    default_hstate.max_huge_pages may have already been set based on the
    number of pre-allocated huge pages of default_hstate size.

    The solution to this problem is that if hstate->max_huge_pages is
    already set, it should not be overridden by the global max_huge_pages
    value. Basically, if the hugepages value is set multiple times on the
    command line for a specific supported hugepagesz, the proc layer should
    report the last specified value.
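
    A hedged sketch of the corresponding change in hugetlb_init()
    (simplified): only apply the global default when the default hstate was
    not already sized by an explicit "hugepagesz=<default size> hugepages=Z"
    pair on the command line.

    if (default_hstate_max_huge_pages) {
            if (!default_hstate.max_huge_pages)
                    default_hstate.max_huge_pages =
                            default_hstate_max_huge_pages;
    }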

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     

06 Feb, 2016

1 commit

  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when CONFIG_CMA
    is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
    associated infrastructure, it is possible with a few simple adjustments to
    require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka