Commit 4779280d1ea4d361af13ae77ba55217fbcd16d4c
Committed by
Linus Torvalds
1 parent
91bf189c3a
Exists in
master
and in
4 other branches
mm: make get_user_pages() interruptible
The initial implementation of checking TIF_MEMDIE covers the cases of OOM killing. If the process has been OOM killed, the TIF_MEMDIE is set and it return immediately. This patch includes: 1. add the case that the SIGKILL is sent by user processes. The process can try to get_user_pages() unlimited memory even if a user process has sent a SIGKILL to it(maybe a monitor find the process exceed its memory limit and try to kill it). In the old implementation, the SIGKILL won't be handled until the get_user_pages() returns. 2. change the return value to be ERESTARTSYS. It makes no sense to return ENOMEM if the get_user_pages returned by getting a SIGKILL signal. Considering the general convention for a system call interrupted by a signal is ERESTARTNOSYS, so the current return value is consistant to that. Lee: An unfortunate side effect of "make-get_user_pages-interruptible" is that it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked, resulting in freeing of mlocked pages. Freeing of mlocked pages, in itself, is not so bad. We just count them now--altho' I had hoped to remove this stat and add PG_MLOCKED to the free pages flags check. However, consider pages in shared libraries mapped by more than one task that a task mlocked--e.g., via mlockall(). If the task that mlocked the pages exits via SIGKILL, these pages would be left mlocked and unevictable. Proposed fix: Add another GUP flag to ignore sigkill when calling get_user_pages from munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for the same purpose. We are not actually allocating memory in this case, which "make-get_user_pages-interruptible" intends to avoid. We're just munlocking pages that are already resident and mapped, and we're reusing get_user_pages() to access those pages. ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL into a single flag: GUP_FLAGS_MUNLOCK ??? [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rohit Seth <rohitseth@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Showing 3 changed files with 15 additions and 9 deletions Side-by-side Diff
mm/internal.h
... | ... | @@ -276,6 +276,7 @@ |
276 | 276 | #define GUP_FLAGS_WRITE 0x1 |
277 | 277 | #define GUP_FLAGS_FORCE 0x2 |
278 | 278 | #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4 |
279 | +#define GUP_FLAGS_IGNORE_SIGKILL 0x8 | |
279 | 280 | |
280 | 281 | int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, |
281 | 282 | unsigned long start, int len, int flags, |
mm/memory.c
... | ... | @@ -1210,6 +1210,7 @@ |
1210 | 1210 | int write = !!(flags & GUP_FLAGS_WRITE); |
1211 | 1211 | int force = !!(flags & GUP_FLAGS_FORCE); |
1212 | 1212 | int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS); |
1213 | + int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL); | |
1213 | 1214 | |
1214 | 1215 | if (len <= 0) |
1215 | 1216 | return 0; |
1216 | 1217 | |
... | ... | @@ -1288,12 +1289,15 @@ |
1288 | 1289 | struct page *page; |
1289 | 1290 | |
1290 | 1291 | /* |
1291 | - * If tsk is ooming, cut off its access to large memory | |
1292 | - * allocations. It has a pending SIGKILL, but it can't | |
1293 | - * be processed until returning to user space. | |
1292 | + * If we have a pending SIGKILL, don't keep faulting | |
1293 | + * pages and potentially allocating memory, unless | |
1294 | + * current is handling munlock--e.g., on exit. In | |
1295 | + * that case, we are not allocating memory. Rather, | |
1296 | + * we're only unlocking already resident/mapped pages. | |
1294 | 1297 | */ |
1295 | - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) | |
1296 | - return i ? i : -ENOMEM; | |
1298 | + if (unlikely(!ignore_sigkill && | |
1299 | + fatal_signal_pending(current))) | |
1300 | + return i ? i : -ERESTARTSYS; | |
1297 | 1301 | |
1298 | 1302 | if (write) |
1299 | 1303 | foll_flags |= FOLL_WRITE; |
mm/mlock.c
... | ... | @@ -173,12 +173,13 @@ |
173 | 173 | (atomic_read(&mm->mm_users) != 0)); |
174 | 174 | |
175 | 175 | /* |
176 | - * mlock: don't page populate if page has PROT_NONE permission. | |
177 | - * munlock: the pages always do munlock althrough | |
178 | - * its has PROT_NONE permission. | |
176 | + * mlock: don't page populate if vma has PROT_NONE permission. | |
177 | + * munlock: always do munlock although the vma has PROT_NONE | |
178 | + * permission, or SIGKILL is pending. | |
179 | 179 | */ |
180 | 180 | if (!mlock) |
181 | - gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS; | |
181 | + gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS | | |
182 | + GUP_FLAGS_IGNORE_SIGKILL; | |
182 | 183 | |
183 | 184 | if (vma->vm_flags & VM_WRITE) |
184 | 185 | gup_flags |= GUP_FLAGS_WRITE; |