Doug / smarc-fsl-linux-kernel | Embedian Git Server

25 Sep, 2013

1 commit

53dad6d3a ipc: fix race with LSMs ... Browse Code »

Currently, IPC mechanisms do security and auditing related checks under
RCU. However, since security modules can free the security structure,
for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
race if the structure is freed before other tasks are done with it,
creating a use-after-free condition. Manfred illustrates this nicely,
for instance with shared mem and selinux:

-> do_shmat calls rcu_read_lock()
-> do_shmat calls shm_object_check().
Checks that the object is still valid - but doesn't acquire any locks.
Then it returns.
-> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
-> selinux_shm_shmat calls ipc_has_perm()
-> ipc_has_perm accesses ipc_perms->security

shm_close()
-> shm_close acquires rw_mutex & shm_lock
-> shm_close calls shm_destroy
-> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
-> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
-> ipc_free_security calls kfree(ipc_perms->security)

This patch delays the freeing of the security structures after all RCU
readers are done. Furthermore it aligns the security life cycle with
that of the rest of IPC - freeing them based on the reference counter.
For situations where we need not free security, the current behavior is
kept. Linus states:

"... the old behavior was suspect for another reason too: having the
security blob go away from under a user sounds like it could cause
various other problems anyway, so I think the old code was at least
_prone_ to bugs even if it didn't have catastrophic behavior."

I have tested this patch with IPC testcases from LTP on both my
quad-core laptop and on a 64 core NUMA server. In both cases selinux is
enabled, and tests pass for both voluntary and forced preemption models.
While the mentioned races are theoretical (at least no one as reported
them), I wanted to make sure that this new logic doesn't break anything
we weren't aware of.

Suggested-by: Linus Torvalds
Signed-off-by: Davidlohr Bueso
Acked-by: Manfred Spraul
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-25 00:36:53 +0800

12 Sep, 2013

10 commits

7a25dd9e0 ipc, shm: drop shm_lock_check ... Browse Code »

This function was replaced by a the lockless shm_obtain_object_check(),
and no longer has any users.

Signed-off-by: Davidlohr Bueso
Cc: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:44 +0800
530fcd16d ipc, shm: guard against non-existant vma in shmdt(2) ... Browse Code »

When !CONFIG_MMU there's a chance we can derefence a NULL pointer when the
VM area isn't found - check the return value of find_vma().

Also, remove the redundant -EINVAL return: retval is set to the proper
return code and *only* changed to 0, when we actually unmap the segments.

Signed-off-by: Davidlohr Bueso
Cc: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:44 +0800
d9a605e40 ipc: rename ids->rw_mutex ... Browse Code »

Since in some situations the lock can be shared for readers, we shouldn't
be calling it a mutex, rename it to rwsem.

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:42 +0800
c2c737a04 ipc,shm: shorten critical region for shmat ... Browse Code »

Similar to other system calls, acquire the kern_ipc_perm lock after doing
the initial permission and security checks.

[sasha.levin@oracle.com: dont leave do_shmat with rcu lock held]
Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:42 +0800
f42569b13 ipc,shm: cleanup do_shmat pasta ... Browse Code »

Clean up some of the messy do_shmat() spaghetti code, getting rid of
out_free and out_put_dentry labels. This makes shortening the critical
region of this function in the next patch a little easier to do and read.

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:42 +0800
2caacaa82 ipc,shm: shorten critical region for shmctl ... Browse Code »

With the *_INFO, *_STAT, IPC_RMID and IPC_SET commands already optimized,
deal with the remaining SHM_LOCK and SHM_UNLOCK commands. Take the
shm_perm lock after doing the initial auditing and security checks. The
rest of the logic remains unchanged.

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:41 +0800
c97cb9cca ipc,shm: make shmctl_nolock lockless ... Browse Code »

While the INFO cmd doesn't take the ipc lock, the STAT commands do acquire
it unnecessarily. We can do the permissions and security checks only
holding the rcu lock.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:39 +0800
68eccc1dc ipc,shm: introduce shmctl_nolock ... Browse Code »

Similar to semctl and msgctl, when calling msgctl, the *_INFO and *_STAT
commands can be performed without acquiring the ipc object.

Add a shmctl_nolock() function and move the logic of *_INFO and *_STAT out
of msgctl(). Since we are just moving functionality, this change still
takes the lock and it will be properly lockless in the next patch.

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:39 +0800
79ccf0f8c ipc,shm: shorten critical region in shmctl_down ... Browse Code »

Instead of holding the ipc lock for the entire function, use the
ipcctl_pre_down_nolock and only acquire the lock for specific commands:
RMID and SET.

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:39 +0800
8b8d52ac3 ipc,shm: introduce lockless functions to obtain the ipc object ... Browse Code »

This is the third and final patchset that deals with reducing the amount
of contention we impose on the ipc lock (kern_ipc_perm.lock). These
changes mostly deal with shared memory, previous work has already been
done for semaphores and message queues:

http://lkml.org/lkml/2013/3/20/546 (sems)
http://lkml.org/lkml/2013/5/15/584 (mqueues)

With these patches applied, a custom shm microbenchmark stressing shmctl
doing IPC_STAT with 4 threads a million times, reduces the execution
time by 50%. A similar run, this time with IPC_SET, reduces the
execution time from 3 mins and 35 secs to 27 seconds.

Patches 1-8: replaces blindly taking the ipc lock for a smarter
combination of rcu and ipc_obtain_object, only acquiring the spinlock
when updating.

Patch 9: renames the ids rw_mutex to rwsem, which is what it already was.

Patch 10: is a trivial mqueue leftover cleanup

Patch 11: adds a brief lock scheme description, requested by Andrew.

This patch:

Add shm_obtain_object() and shm_obtain_object_check(), which will allow us
to get the ipc object without acquiring the lock. Just as with other
forms of ipc, these functions are basically wrappers around
ipc_obtain_object*().

Signed-off-by: Davidlohr Bueso
Tested-by: Sedat Dilek
Cc: Rik van Riel
Cc: Manfred Spraul
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:38 +0800

10 Jul, 2013

4 commits

7b4cc5d84 ipc: move locking out of ipcctl_pre_down_nolock ... Browse Code »

This function currently acquires both the rw_mutex and the rcu lock on
successful lookups, leaving the callers to explicitly unlock them,
creating another two level locking situation.

Make the callers (including those that still use ipcctl_pre_down())
explicitly lock and unlock the rwsem and rcu lock.

Signed-off-by: Davidlohr Bueso
Cc: Andi Kleen
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-07-10 01:33:27 +0800
cf9d5d78d ipc: close open coded spin lock calls ... Browse Code »

Signed-off-by: Davidlohr Bueso
Cc: Andi Kleen
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-07-10 01:33:27 +0800
dbfcd91f0 ipc: move rcu lock out of ipc_addid ... Browse Code »

This patchset continues the work that began in the sysv ipc semaphore
scaling series, see

https://lkml.org/lkml/2013/3/20/546

Just like semaphores used to be, sysv shared memory and msg queues also
abuse the ipc lock, unnecessarily holding it for operations such as
permission and security checks.

This patchset mostly deals with mqueues, and while shared mem can be
done in a very similar way, I want to get these patches out in the open
first. It also does some pending cleanups, mostly focused on the two
level locking we have in ipc code, taking care of ipc_addid() and
ipcctl_pre_down_nolock() - yes there are still functions that need to be
updated as well.

This patch:

Make all callers explicitly take and release the RCU read lock.

This addresses the two level locking seen in newary(), newseg() and
newqueue(). For the last two, explicitly unlock the ipc object and the
rcu lock, instead of calling the custom shm_unlock and msg_unlock
functions. The next patch will deal with the open coded locking for
->perm.lock

Signed-off-by: Davidlohr Bueso
Cc: Andi Kleen
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-07-10 01:33:26 +0800
c103a4dc4 ipc/shmc.c: eliminate ugly 80-col tricks ... Browse Code »

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-07-10 01:33:26 +0800

10 May, 2013

1 commit

091d0d55b shm: fix null pointer deref when userspace specifies invalid hugepage size ... Browse Code »

Dave reported an oops triggered by trinity:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: newseg+0x10d/0x390
PGD cf8c1067 PUD cf8c2067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
...
Call Trace:
ipcget+0x182/0x380
SyS_shmget+0x5a/0x60
tracesys+0xdd/0xe2

This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
failure in unaligned size request").

Reported-by: Dave Jones
Cc:
Signed-off-by: Li Zefan
Reviewed-by: Naoya Horiguchi
Acked-by: Rik van Riel
Signed-off-by: Linus Torvalds

Li Zefan
2013-05-10 05:22:47 +0800

08 May, 2013

1 commit

af73e4d95 hugetlbfs: fix mmap failure in unaligned size request ... Browse Code »

The current kernel returns -EINVAL unless a given mmap length is
"almost" hugepage aligned. This is because in sys_mmap_pgoff() the
given length is passed to vm_mmap_pgoff() as it is without being aligned
with hugepage boundary.

This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
alignment of huge page requests"), where alignment code is pushed into
hugetlb_file_setup() and the variable len in caller side is not changed.

To fix this, this patch partially reverts that commit, and adds
alignment code in caller side. And it also introduces hstate_sizelog()
in order to get proper hstate to specified hugepage size.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

[akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
Signed-off-by: Naoya Horiguchi
Signed-off-by: Johannes Weiner
Reported-by:
Cc: Steven Truelove
Cc: Jianguo Wu
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2013-05-08 09:38:27 +0800

01 May, 2013

1 commit

d69f3bad4 ipc: sysv shared memory limited to 8TiB ... Browse Code »

Trying to run an application which was trying to put data into half of
memory using shmget(), we found that having a shmall value below 8EiB-8TiB
would prevent us from using anything more than 8TiB. By setting
kernel.shmall greater than 8EiB-8TiB would make the job work.

In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.

ipc/shm.c:
458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
459 {
...
465 int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
...
474 if (ns->shm_tot + numpages > ns->shm_ctlall)
475 return -ENOSPC;

[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
Signed-off-by: Robin Holt
Reported-by: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2013-05-01 23:12:58 +0800

27 Feb, 2013

1 commit

d895cb1af Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.

The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.

Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.

PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...

Linus Torvalds
2013-02-27 12:16:07 +0800

24 Feb, 2013

2 commits

41badc15c mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool ... Browse Code »

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introduced the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800
bebeb3d68 mm: introduce mm_populate() for populating new vmas ... Browse Code »

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse
Reported-by: Andy Lutomirski
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:10 +0800

23 Feb, 2013

2 commits

39b652527 fs: Preserve error code in get_empty_filp(), part 2 ... Browse Code »

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Al Viro

Anatol Pomozov
2013-02-23 12:31:32 +0800
496ad9aa8 new helper: file_inode(file) ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-02-23 12:31:31 +0800

12 Dec, 2012

1 commit

42d7395fe mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB ... Browse Code »

There was some desire in large applications using MAP_HUGETLB or
SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
others. This is useful together with NUMA policy: use 2MB interleaving
on some mappings, but 1GB on local mappings.

This patch extends the IPC/SHM syscall interfaces slightly to allow
specifying the page size.

It borrows some upper bits in the existing flag arguments and allows
encoding the log of the desired page size in addition to the *_HUGETLB
flag. When 0 is specified the default size is used, this makes the
change fully compatible.

Extending the internal hugetlb code to handle this is straight forward.
Instead of a single mount it just keeps an array of them and selects the
right mount based on the specified page size. When no page size is
specified it uses the mount of the default page size.

The change is not visible in /proc/mounts because internal mounts don't
appear there. It also has very little overhead: the additional mounts
just consume a super block, but not more memory when not used.

I also exported the new flags to the user headers (they were previously
under __KERNEL__). Right now only symbols for x86 and some other
architecture for 1GB and 2MB are defined. The interface should already
work for all other architectures though. Only architectures that define
multiple hugetlb sizes actually need it (that is currently x86, tile,
powerpc). However tile and powerpc have user configurable hugetlb
sizes, so it's not easy to add defines. A program on those
architectures would need to query sysfs and use the appropiate log2.

[akpm@linux-foundation.org: cleanups]
[rientjes@google.com: fix build]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen
Cc: Michael Kerrisk
Acked-by: Rik van Riel
Acked-by: KAMEZAWA Hiroyuki
Cc: Hillf Danton
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-12-12 09:22:25 +0800

07 Sep, 2012

1 commit

1efdb69b0 userns: Convert ipc to use kuid and kgid where appropriate ... Browse Code »

- Store the ipc owner and creator with a kuid
- Store the ipc group and the crators group with a kgid.
- Add error handling to ipc_update_perms, allowing it to
fail if the uids and gids can not be converted to kuids
or kgids.
- Modify the proc files to display the ipc creator and
owner in the user namespace of the opener of the proc file.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-09-07 13:17:20 +0800

31 Jul, 2012

1 commit

079a96ae3 ipc: add COMPAT_SHMLBA support ... Browse Code »

If the SHMLBA definition for a native task differs from the definition for
a compat task, the do_shmat() function would need to handle both.

This patch introduces COMPAT_SHMLBA, which is used by the compat shmat
syscall when calling the ipc code and allows architectures such as AArch64
(where the native SHMLBA is 64k but the compat (AArch32) definition is
16k) to provide the correct semantics for compat IPC system calls.

Cc: David S. Miller
Cc: Chris Zankel
Cc: Arnd Bergmann
Acked-by: Catalin Marinas
Signed-off-by: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Will Deacon
2012-07-31 08:25:20 +0800

08 Jun, 2012

1 commit

7d8a45695 ipc: shm: restore MADV_REMOVE functionality on shared memory segments ... Browse Code »

Commit 17cf28afea2a ("mm/fs: remove truncate_range") removed the
truncate_range inode operation in favour of the fallocate file
operation.

When using SYSV IPC shared memory segments, calling madvise with the
MADV_REMOVE advice on an area of shared memory will attempt to invoke
the .fallocate function for the shm_file_operations, which is NULL and
therefore returns -EOPNOTSUPP to userspace. The previous behaviour
would inherit the inode_operations from the underlying tmpfs file and
invoke truncate_range there.

This patch restores the previous behaviour by wrapping the underlying
fallocate function in shm_fallocate, as we do for fsync.

[hughd@google.com: use -ENOTSUPP in shm_fallocate()]
Signed-off-by: Will Deacon
Acked-by: Hugh Dickins
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Will Deacon
2012-06-08 05:43:55 +0800

01 Jun, 2012

2 commits

e3fc629d7 switch aio and shm to do_mmap_pgoff(), make do_mmap() static ... Browse Code »

after all, 0 bytes and 0 pages is the same thing...

Signed-off-by: Al Viro

Al Viro
2012-06-01 22:37:17 +0800
8b3ec6814 take security_mmap_file() outside of ->mmap_sem ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-06-01 22:37:01 +0800

22 Mar, 2012

1 commit

40716e292 hugetlbfs: fix alignment of huge page requests ... Browse Code »

When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
PAGE_SIZE, but this is not sufficient.

Modify hugetlb_file_setup() to align requests to the huge page size, and
to accept an address argument so that all alignment checks can be
performed in hugetlb_file_setup(), rather than in its callers. Change
newseg() and mmap_pgoff() to match the new prototype and eliminate a now
redundant alignment check.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Steven Truelove
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Steven Truelove
2012-03-22 08:54:59 +0800

24 Jan, 2012

2 commits

245132643 SHM_UNLOCK: fix Unevictable pages stranded after swap ... Browse Code »

Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.

The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages(). The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.

Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().

But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.

Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.

Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Shaohua Li
Cc: Eric Dumazet
Cc: Johannes Weiner
Cc: Michel Lespinasse
Cc: [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-24 00:38:48 +0800
85046579b SHM_UNLOCK: fix long unpreemptible section ... Browse Code »

scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked. It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here. A cond_resched() every
PAGEVEC_SIZE pages would be good.

However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.

So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling. Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.

Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().

Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.

The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.

The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.

Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Shaohua Li
Cc: Eric Dumazet
Cc: Johannes Weiner
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-24 00:38:48 +0800

05 Aug, 2011

1 commit

140d0b210 Do 'shm_init_ns()' in an early pure_initcall ... Browse Code »

This isn't really critical any more, since other patches (commit
298507d4d2cf: "shm: optimize exit_shm()") have caused us to not actually
need to touch the rw_mutex unless there are actual shm segments
associated with the namespace, but we really should do tne shm_init_ns()
earlier than we do now.

This, together with commit 288d5abec831 ("Boot up with usermodehelper
disabled") will mean that we really do initialize the initial ipc
namespace data structure before we run any tasks.

Tested-by: Marc Zyngier
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-08-05 13:35:59 +0800

04 Aug, 2011

2 commits

298507d4d shm: optimize exit_shm() ... Browse Code »

We may optimistically check .in_use == 0 without holding the rw_mutex:
it's the common case, and if it's zero, there certainly won't be any
segments associated with us.

After taking the lock, the idr_for_each() will do the right thing, so we
could now drop the re-check inside the lock without any real cost. But
it won't hurt.

Signed-off-by: Vasiliy Kulikov
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-08-04 08:45:55 +0800
33a30ed4b shm: fix wrong tests ... Browse Code »

Commit 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting")
introduced a copy-paste bug. Due to the bug cycle optimizations were
disabled.

Signed-off-by: Vasiliy Kulikov
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-08-04 08:45:55 +0800

31 Jul, 2011

2 commits

4c677e2ee shm: optimize locking and ipc_namespace getting ... Browse Code »

shm_lock() does a lookup of shm segment in shm_ids(ns).ipcs_idr, which
is redundant as we already know shmid_kernel address. An actual lock is
also not required for reads until we really want to destroy the segment.

exit_shm() and shm_destroy_orphaned() may avoid the loop by checking
whether there is at least one segment in current ipc_namespace.

The check of nsproxy and ipc_ns against NULL is redundant as exit_shm()
is called from do_exit() before the call to exit_notify(), so the
dereferencing current->nsproxy->ipc_ns is guaranteed to be safe.

Reported-by: Oleg Nesterov
Signed-off-by: Vasiliy Kulikov
Acked-by: Serge Hallyn
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-07-31 02:44:20 +0800
5774ed014 shm: handle separate PID namespaces case ... Browse Code »

shm_try_destroy_orphaned() and shm_try_destroy_current() didn't handle
the case of separate PID namespaces, but a single IPC namespace. If
there are tasks with the same PID values using the same shmem object,
the wrong destroy decision could be reached.

On shm segment creation store the pointer to the creator task in
shmid_kernel->shm_creator field and zero it on task exit. Then
use the ->shm_creator insread of shm_cprid in both functions. As
shmid_kernel object is already locked at this stage, no additional
locking is needed.

Signed-off-by: Vasiliy Kulikov
Acked-by: Serge Hallyn
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-07-31 02:44:19 +0800

27 Jul, 2011

1 commit

b34a6b1da ipc: introduce shm_rmid_forced sysctl ... Browse Code »

Add support for the shm_rmid_forced sysctl. If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory. Use shm_rmid_forced=0 by default for compatability
reasons.

The feature was previously impemented in -ow as a configure option.

[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov
Cc: Randy Dunlap
Cc: "Eric W. Biederman"
Cc: "Serge E. Hallyn"
Cc: Daniel Lezcano
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Ingo Molnar
Cc: Alan Cox
Cc: Solar Designer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-07-27 07:49:44 +0800

21 Jul, 2011

1 commit

02c24a821 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers ... Browse Code »

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara
Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2011-07-21 08:47:59 +0800

27 May, 2011

1 commit

ca16d140a mm: don't access vm_flags as 'int' ... Browse Code »

The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro
[ Changed to use a typedef - we'll extend it to cover more cases
later, since there has been discussion about making it a 64-bit
type.. - Linus ]
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2011-05-27 00:20:31 +0800