Eric Lee / smarc-fsl-linux-kernel

09 Mar, 2010

1 commit

e10154189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (62 commits)
msi-laptop: depends on RFKILL
msi-laptop: Detect 3G device exists by standard ec command
msi-laptop: Add resume method for set the SCM load again
msi-laptop: Support some MSI 3G netbook that is need load SCM
msi-laptop: Add threeg sysfs file for support query 3G state by standard 66/62 ec command
msi-laptop: Support standard ec 66/62 command on MSI notebook and nebook
Driver core: create lock/unlock functions for struct device
sysfs: fix for thinko with sysfs_bin_attr_init()
sysfs: Kill unused sysfs_sb variable.
sysfs: Pass super_block to sysfs_get_inode
driver core: Use sysfs_rename_link in device_rename
sysfs: Implement sysfs_rename_link
sysfs: Pack sysfs_dirent more tightly.
sysfs: Serialize updates to the vfs inode
sysfs: windfarm: init sysfs attributes
sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes
sysfs: Document sysfs_attr_init and sysfs_bin_attr_init
sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes
sysfs: Use one lockdep class per sysfs attribute.
sysfs: Only take active references on attributes.
...

Linus Torvalds
2010-03-09 02:17:20 +0800

08 Mar, 2010

15 commits

d4014030d FS-Cache: Remove the EXPERIMENTAL flag ... Browse Code »

Remove the EXPERIMENTAL flag from FS-Cache so that Ubuntu can make use of the
facility.

Signed-off-by: Christian Kujau
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

Christian Kujau
2010-03-08 23:32:34 +0800
0f4288ec6 sysfs: Kill unused sysfs_sb variable. ... Browse Code »

Now that there are no more users we can remove
the sysfs_sb variable.

Acked-by: Tejun Heo
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:52 +0800
fac2622bb sysfs: Pass super_block to sysfs_get_inode ... Browse Code »

Currently sysfs_get_inode magically returns an inode on
sysfs_sb. Make the super_block parameter explicit and
the code becomes clearer.

Acked-by: Tejun Heo
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:52 +0800
7cb32942d sysfs: Implement sysfs_rename_link ... Browse Code »

Because of rename ordering problems we occassionally give false
warnings about invalid sysfs operations. So using sysfs_rename
create a sysfs_rename_link function that doesn't need strange
workarounds.

Cc: Benjamin Thery
Cc: Daniel Lezcano
Acked-by: Serge Hallyn
Acked-by: Tejun Heo
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:52 +0800
19c38b632 sysfs: Pack sysfs_dirent more tightly. ... Browse Code »

Placing the 16bit s_mode between a pointer and a long doesn't pack well
especailly on 64bit where we wast 48 bits. So move s_mode and
declare it as a unsigned short. This is the sysfs backing store
after all we don't need fields extra large just in case someday
we want userspace to be able to use a larger value.

Acked-by: Tejun Heo
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:52 +0800
f8d4f618f sysfs: Serialize updates to the vfs inode ... Browse Code »

The vfs depends upon filesystem methods to update the
vfs inode. Sysfs adds to the normal number of places
where the vfs inode is updated by also updatng the
vfs inode in sysfs_refresh_inode.

Typically the inode mutex is used to serialize updates
to the vfs inode, but grabbing the inode mutex in
sysfs_permission and sysfs_getattr causes deadlocks,
because sometimes the vfs calls those operations with
the inode mutex held. Therefore sysfs can not use the
inode mutex to serial updates to the vfs inode.

The sysfs_mutex is acquired in all of the routines
where sysfs updates the vfs inode, and with a small
change we can consistently protext sysfs vfs inode
updates with the sysfs_mutex. To protect the sysfs
vfs inode updates with the sysfs_mutex simply requires
extending the scope of sysfs_mutex in sysfs_setattr
over inode_setattr, and over inode_change_ok (so we
have an unchanging inode when we perform the check).

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:52 +0800
6992f5334 sysfs: Use one lockdep class per sysfs attribute. ... Browse Code »
47

Acknowledge that the logical sysfs rwsem has one instance per
sysfs attribute with different locking depencencies for different
attributes.

There is a sysfs idiom where writing to one sysfs file causes the
addition or removal of other sysfs files. Lumping all of the
sysfs attributes together in one lock class causes lockdep to
generate lots of false positives.

This introduces the requirement that non-static sysfs attributes
need to be initialized with sysfs_attr_init or sysfs_bin_attr_init.
Strictly speaking this requirement only exists when lockdep is
enabled, and when lockdep is enabled we get a bit fat warning
if this requirement is not met.

Signed-off-by: Eric W. Biederman
Acked-by: WANG Cong
Cc: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:51 +0800
a2db68428 sysfs: Only take active references on attributes. ... Browse Code »

If we exclude directories and symlinks from the set of sysfs
dirents where we need active references we are left with
sysfs attributes (binary or not).

- Tweak sysfs_deactivate to only do something on attributes
- Move lockdep initialization into sysfs_file_add_mode to
limit it to just attributes.

Signed-off-by: Eric W. Biederman
Acked-by: WANG Cong
Cc: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:51 +0800
e72ceb8cc sysfs: Remove sysfs_get/put_active_two ... Browse Code »

It turns out that holding an active reference on a directory is
pointless. The purpose of the active references are to allows us to
block when removing sysfs entries that have custom methods so we don't
remove modules while running modular code and to keep those custom
methods from accessing data structures after the files have been
removed. Further sysfs_remove_dir remove all elements in the
directory before removing the directory itself, so there is no chance
we will remove a directory with active children.

Signed-off-by: Eric W. Biederman
Cc: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:51 +0800
52cf25d0a Driver core: Constify struct sysfs_ops in struct kobj_type ... Browse Code »

Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime

* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime

* potentially better optimized code as the compiler
can assume that the const data cannot be changed

* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing

Signed-off-by: Emese Revfy
Acked-by: David Teigland
Acked-by: Matt Domsch
Acked-by: Maciej Sosnowski
Acked-by: Hans J. Koch
Acked-by: Pekka Enberg
Acked-by: Jens Axboe
Acked-by: Stephen Hemminger
Signed-off-by: Greg Kroah-Hartman

Emese Revfy
2010-03-08 09:04:49 +0800
9cd43611c kobject: Constify struct kset_uevent_ops ... Browse Code »

Constify struct kset_uevent_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime

* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime

* potentially better optimized code as the compiler
can assume that the const data cannot be changed

* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing

Signed-off-by: Emese Revfy
Signed-off-by: Greg Kroah-Hartman

Emese Revfy
2010-03-08 09:04:49 +0800
1e5289c97 sysfs: Cache the last sysfs_dirent to improve readdir scalability v2 ... Browse Code »

When sysfs_readdir stops short we now cache the next
sysfs_dirent to return to user space in filp->private_data.
There is no impact on the rest of sysfs by doing this and
in the common case it allows us to pick up exactly where
we left off with no seeking.

Additionally I drop and regrab the sysfs_mutex around
filldir to avoid a page fault abritrarily increasing the
hold time on the sysfs_mutex.

v2: Returned to using INT_MAX as the EOF condition.
seekdir is ambiguous unless all directory entries have
a unique f_pos value.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14949

Signed-off-by: Eric W. Biederman
Cc: Linus Torvalds
Cc: stable
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2010-03-08 09:04:48 +0800
1c205ae18 sysfs: Add sysfs_add/remove_files utility functions ... Browse Code »

Adding/Removing a whole array of attributes is very common. Add a standard
utility function to do this with a simple function call, instead of
requiring drivers to open code this.

Signed-off-by: Andi Kleen
Signed-off-by: Greg Kroah-Hartman

Andi Kleen
2010-03-08 09:04:47 +0800
138860b95 seq_file: fix new kernel-doc warnings ... Browse Code »

Fix kernel-doc notation in new seq-file functions and
correct spelling.

Signed-off-by: Randy Dunlap
Cc: Li Zefan
Cc: Alexander Viro
Signed-off-by: Linus Torvalds

Randy Dunlap
2010-03-08 07:48:26 +0800
b8fa05719 Revert "lib: build list_sort() only if needed" ... Browse Code »

This reverts commit a069c266ae5fdfbf5b4aecf2c672413aa33b2504.

It turns ou that not only was it missing a case (XFS) that needed it,
but perhaps more importantly, people sometimes want to enable new
modules that they hadn't had enabled before, and if such a module uses
list_sort(), it can't easily be inserted any more.

So rather than add a "select LIST_SORT" to the XFS case, just leave it
compiled in. It's not all _that_ big, after all, and the inconvenience
isn't worth it.

Requested-by: Alexey Dobriyan
Cc: Christoph Hellwig
Cc: Don Mullis
Cc: Andrew Morton
Cc: Dave Chinner
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-03-08 01:54:44 +0800

07 Mar, 2010

24 commits

66b89159c Merge git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs:
[LogFS] Change magic number
[LogFS] Remove h_version field
[LogFS] Check feature flags
[LogFS] Only write journal if dirty
[LogFS] Fix bdev erases
[LogFS] Silence gcc
[LogFS] Prevent 64bit divisions in hash_index
[LogFS] Plug memory leak on error paths
[LogFS] Add MAINTAINERS entry
[LogFS] add new flash file system

Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in
fs/logfs/inode.c introduced by write_inode() being changed to use
writeback_control' by commit a9185b41a4f84971b930c519f0c63bd450c4810d
("pass writeback_control to ->write_inode")

Linus Torvalds
2010-03-07 05:18:03 +0800
66ce3cf84 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs ... Browse Code »

* 'for-linus' of git://oss.sgi.com/xfs/xfs: (21 commits)
xfs: return inode fork offset in bulkstat for fsr
xfs: Increase the default size of the reserved blocks pool
xfs: truncate delalloc extents when IO fails in writeback
xfs: check for more work before sleeping in xfssyncd
xfs: Fix a build warning in xfs_aops.c
xfs: fix locking for inode cache radix tree tag updates
xfs: remove xfs_ipin/xfs_iunpin
xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait
xfs: kill xfs_lrw.h
xfs: factor common xfs_trans_bjoin code
xfs: stop passing opaque handles to xfs_log.c routines
xfs: split xfs_bmap_btalloc
xfs: fix xfs_fsblock_t tracing
xfs: fix inode pincount check in fsync
xfs: Non-blocking inode locking in IO completion
xfs: implement optimized fdatasync
xfs: remove wrapper for the fsync file operation
xfs: remove wrappers for read/write file operations
xfs: merge xfs_lrw.c into xfs_file.c
xfs: fix dquota trace format
...

Linus Torvalds
2010-03-07 03:32:21 +0800
05c5cb31e Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux ... Browse Code »

* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd4: fix minor memory leak
svcrpc: treat uid's as unsigned
nfsd: ensure sockets are closed on error
Revert "sunrpc: move the close processing after do recvfrom method"
Revert "sunrpc: fix peername failed on closed listener"
sunrpc: remove unnecessary svc_xprt_put
NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN
xfs_export_operations.commit_metadata
commit_metadata export operation replacing nfsd_sync_dir
lockd: don't clear sm_monitored on nsm_reboot_lookup
lockd: release reference to nsm_handle in nlm_host_rebooted
nfsd: Use vfs_fsync_range() in nfsd_commit
NFSD: Create PF_INET6 listener in write_ports
SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found"
SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt()
NFSD: Support AF_INET6 in svc_addsock() function
SUNRPC: Use rpc_pton() in ip_map_parse()
nfsd: 4.1 has an rfc number
nfsd41: Create the recovery entry for the NFSv4.1 client
nfsd: use vfs_fsync for non-directories
...

Linus Torvalds
2010-03-07 03:31:38 +0800
76595f79d coredump: suppress uid comparison test if core output files are pipes ... Browse Code »

Modify uid check in do_coredump so as to not apply it in the case of
pipes.

This just got noticed in testing. The end of do_coredump validates the
uid of the inode for the created file against the uid of the crashing
process to ensure that no one can pre-create a core file with different
ownership and grab the information contained in the core when they
shouldn' tbe able to. This causes failures when using pipes for a core
dumps if the crashing process is not root, which is the uid of the pipe
when it is created.

The fix is simple. Since the check for matching uid's isn't relevant for
pipes (a process can't create a pipe that the uermodehelper code will open
anyway), we can just just skip it in the event ispipe is non-zero

Reverts a pipe-affecting change which was accidentally made in

: commit c46f739dd39db3b07ab5deb4e3ec81e1c04a91af
: Author: Ingo Molnar
: AuthorDate: Wed Nov 28 13:59:18 2007 +0100
: Commit: Linus Torvalds
: CommitDate: Wed Nov 28 10:58:01 2007 -0800
:
: vfs: coredumping fix

Signed-off-by: Neil Horman
Cc: Andi Kleen
Cc: Oleg Nesterov
Cc: Alan Cox
Cc: Al Viro
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2010-03-07 03:26:46 +0800
5c99cbf49 coredump: set ->group_exit_code for other CLONE_VM tasks too ... Browse Code »

User visible change.

do_coredump() kills all threads which share the same ->mm but only the
coredumping process gets the proper exit_code. Other tasks which share
the same ->mm die "silently" and return status == 0 to parent.

This is historical behaviour, not actually a bug. But I think Frank
Heckenbach rightly dislikes the current behaviour. Simple test-case:

#include
#include
#include
#include

int main(void)
{
int stat;

if (!fork()) {
if (!vfork())
kill(getpid(), SIGQUIT);
}

wait(&stat);
printf("stat=%x\n", stat);
return 0;
}

Before this patch it prints "stat=0" despite the fact the child was killed
by SIGQUIT. After this patch the output is "stat=3" which obviously makes
more sense.

Even with this patch, only the task which originates the coredumping gets
"|= 0x80" if the core was actually dumped, but at least the coredumping
signal is visible to do_wait/etc.

Reported-by: Frank Heckenbach
Signed-off-by: Oleg Nesterov
Acked-by: WANG Cong
Cc: Roland McGrath
Cc: Neil Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2010-03-07 03:26:46 +0800
30736a4d4 coredump: pass mm->flags as a coredump parameter for consistency ... Browse Code »

Pass mm->flags as a coredump parameter for consistency.

---
1787 if (mm->core_state || !get_dumpable(mm)) { mmap_sem);
1789 put_cred(cred);
1790 goto fail;
1791 }
1792
[...]
1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode */ fsuid = 0; /* Dump root private */
1801 }
---

Since dumpable bits are not protected by lock, there is a chance to change
these bits between (1) and (2).

To solve this issue, this patch copies mm->flags to
coredump_params.mm_flags at the beginning of do_coredump() and uses it
instead of get_dumpable() while dumping core.

This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
dump_filter bits in mm->flags.

[akpm@linux-foundation.org: fix merge]
Signed-off-by: Masami Hiramatsu
Acked-by: Roland McGrath
Cc: Hidehiro Kawai
Cc: Oleg Nesterov
Cc: Ingo Molnar
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masami Hiramatsu
2010-03-07 03:26:46 +0800
8d9032bbe elf coredump: add extended numbering support ... Browse Code »

The current ELF dumper implementation can produce broken corefiles if
program headers exceed 65535. This number is determined by the number of
vmas which the process have. In particular, some extreme programs may use
more than 65535 vmas. (If you google max_map_count, you can find some
users facing this problem.) This kind of program never be able to generate
correct coredumps.

This patch implements ``extended numbering'' that uses sh_info field of
the first section header instead of e_phnum field in order to represent
upto 4294967295 vmas.

This is supported by
AMD64-ABI(http://www.x86-64.org/documentation.html) and
Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
Of course, we are preparing patches for gdb and binutils.

Signed-off-by: Daisuke HATAYAMA
Cc: "Luck, Tony"
Cc: Jeff Dike
Cc: David Howells
Cc: Greg Ungerer
Cc: Roland McGrath
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Alexander Viro
Cc: Andi Kleen
Cc: Alan Cox
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke HATAYAMA
2010-03-07 03:26:46 +0800
93eb211e6 elf coredump: make offset calculation process and writing process explicit ... Browse Code »

By the next patch, elf_core_dump() and elf_fdpic_core_dump() will support
extended numbering and so will produce the corefiles with section header
table in a special case.

The problem is the process of writing a file header offset of the section
header table into e_shoff field of the ELF header. ELF header is
positioned at the beginning of the corefile, while section header at the
end. So, we need to take which of the following ways:

1. Seek backward to retry writing operation for ELF header
after writing process for a whole part

2. Make offset calculation process and writing process
totally sequential

The clause 1. is not always possible: one cannot assume that file system
supports seek function. Consider the no_llseek case.

Therefore, this patch adopts the clause 2.

Signed-off-by: Daisuke HATAYAMA
Cc: "Luck, Tony"
Cc: Jeff Dike
Cc: David Howells
Cc: Greg Ungerer
Cc: Roland McGrath
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Alexander Viro
Cc: Andi Kleen
Cc: Alan Cox
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke HATAYAMA
2010-03-07 03:26:46 +0800
1fcccbac8 elf coredump: replace ELF_CORE_EXTRA_* macros by functions ... Browse Code »

elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
macro for hiding _multiline_ logics in functions. This patch removes
#ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For
architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
order to reduce a range of modification.

This cleanup is for my next patches, but I think this cleanup itself is
worth doing regardless of my firnal purpose.

Signed-off-by: Daisuke HATAYAMA
Cc: "Luck, Tony"
Cc: Jeff Dike
Cc: David Howells
Cc: Greg Ungerer
Cc: Roland McGrath
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Alexander Viro
Cc: Andi Kleen
Cc: Alan Cox
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke HATAYAMA
2010-03-07 03:26:45 +0800
088e7af73 coredump: move dump_write() and dump_seek() into a header file ... Browse Code »

My next patch will replace ELF_CORE_EXTRA_* macros by functions, putting
them into other newly created *.c files. Then, each files will contain
dump_write(), where each pair of binfmt_*.c and elfcore.c should be the
same. So, this patch moves them into a header file with dump_seek().
Also, the patch deletes confusing DUMP_WRITE macros in each files.

Signed-off-by: Daisuke HATAYAMA
Cc: "Luck, Tony"
Cc: Jeff Dike
Cc: David Howells
Cc: Greg Ungerer
Cc: Roland McGrath
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Alexander Viro
Cc: Andi Kleen
Cc: Alan Cox
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke HATAYAMA
2010-03-07 03:26:45 +0800
05f47fda9 coredump: unify dump_seek() implementations for each binfmt_*.c ... Browse Code »

The current ELF dumper can produce broken corefiles if program headers
exceed 65535. In particular, the program in 64-bit environment often
demands more than 65535 mmaps. If you google max_map_count, then you can
find many users facing this problem.

Solaris has already dealt with this issue, and other OSes have also
adopted the same method as in Solaris. Currently, Sun's document and AMD
64 ABI include the description for the extension, where they call the
extension Extended Numbering. See Reference for further information.

I believe that linux kernel should adopt the same way as they did, so I've
written this patch.

I am also preparing for patches of GDB and binutils.

How to fix
==========

In new dumping process, there are two cases according to weather or
not the number of program headers is equal to or more than 65535.

- if less than 65535, the produced corefile format is exactly the same
as the ordinary one.

- if equal to or more than 65535, then e_phnum field is set to newly
introduced constant PN_XNUM(0xffff) and the actual number of program
headers is set to sh_info field of the section header at index 0.

Compatibility Concern
=====================

* As already mentioned in Summary, Sun and AMD64 has already adopted
this. See Reference.

* There are four combinations according to whether kernel and userland
tools are respectively modified or not. The next table summarizes
shortly for each combination.

---------------------------------------------
Original Kernel | Modified Kernel
---------------------------------------------
< 65535 | >= 65535 | < 65535 | >= 65535
-------------------------------------------------------------
Original Tools | OK | broken | OK | broken (#)
-------------------------------------------------------------
Modified Tools | OK | broken | OK | OK
-------------------------------------------------------------

Note that there is no case that `OK' changes to `broken'.

(#) Although this case remains broken, O-M behaves better than
O-O. That is, while in O-O case e_phnum field would be extremely
small due to integer overflow, in O-M case it is guaranteed to be at
least 65535 by being set to PN_XNUM(0xFFFF), much closer to the
actual correct value than the O-O case.

Test Program
============

Here is a test program mkmmaps.c that is useful to produce the
corefile with many mmaps. To use this, please take the following
steps:

$ ulimit -c unlimited
$ sysctl vm.max_map_count=70000 # default 65530 is too small
$ sysctl fs.file-max=70000
$ mkmmaps 65535

Then, the program will abort and a corefile will be generated.

If failed, there are two cases according to the error message
displayed.

* ``out of memory'' means vm.max_map_count is still smaller

* ``too many open files'' means fs.file-max is still smaller

So, please change it to a larger value, and then retry it.

mkmmaps.c
==
#include
#include
#include
#include
#include
int main(int argc, char **argv)
{
int maps_num;
if (argc < 2) {
fprintf(stderr, "mkmmaps [number of maps to be created]\n");
exit(1);
}
if (sscanf(argv[1], "%d", &maps_num) == EOF) {
perror("sscanf");
exit(2);
}
if (maps_num < 0) {
fprintf(stderr, "%d is invalid\n", maps_num);
exit(3);
}
for (; maps_num > 0; --maps_num) {
if (MAP_FAILED == mmap((void *)NULL, (size_t) 1, PROT_READ,
MAP_SHARED | MAP_ANONYMOUS, (int) -1,
(off_t) NULL)) {
perror("mmap");
exit(4);
}
}
abort();
{
char buffer[128];
sprintf(buffer, "wc -l /proc/%u/maps", getpid());
system(buffer);
}
return 0;
}

Tested on i386, ia64 and um/sys-i386.
Built on sh4 (which covers fs/binfmt_elf_fdpic.c)

References
==========

- Sun microsystems: Linker and Libraries.
Part No: 817-1984-17, September 2008.
URL: http://docs.sun.com/app/docs/doc/817-1984

- System V ABI AMD64 Architecture Processor Supplement
Draft Version 0.99., May 11, 2009.
URL: http://www.x86-64.org/

This patch:

There are three different definitions for dump_seek() functions in
binfmt_aout.c, binfmt_elf.c and binfmt_elf_fdpic.c, respectively. The
only for binfmt_elf.c.

My next patch will move dump_seek() into a header file in order to share
the same implementations for dump_write() and dump_seek(). As the first
step, this patch unify these three definitions for dump_seek() by applying
the past commits that have been applied only for binfmt_elf.c.

Specifically, the modification made here is part of the following commits:

* d025c9db7f31fc0554ce7fb2dfc78d35a77f3487
* 7f14daa19ea36b200d237ad3ac5826ae25360461

This patch does not change a shape of corefiles.

Signed-off-by: Daisuke HATAYAMA
Cc: "Luck, Tony"
Cc: Jeff Dike
Cc: David Howells
Cc: Greg Ungerer
Cc: Roland McGrath
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Alexander Viro
Cc: Andi Kleen
Cc: Alan Cox
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke HATAYAMA
2010-03-07 03:26:45 +0800
12bac0d9f proc: warn on non-existing proc entries ... Browse Code »

* warn if creation goes on to non-existent directory
* warn if removal goes on from non-existing directory
* warn if non-existing proc entry is removed

Signed-off-by: Alexey Dobriyan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2010-03-07 03:26:45 +0800
e17a5765f proc: do translation + unlink atomically at remove_proc_entry() ... Browse Code »

remove_proc_entry() does

lock
lookup parent
unlock
lock
unlink proc entry from lists
unlock

which can be made bit more correct by doing parent translation + unlink
without dropping lock.

Signed-off-by: Alexey Dobriyan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2010-03-07 03:26:45 +0800
45bf5cd7b fs/compat_ioctl.c: suppress two warnings ... Browse Code »

fs/compat_ioctl.c: In function 'do_ioctl_trans':
fs/compat_ioctl.c:534: warning: 'karg' may be used uninitialized in this function
fs/compat_ioctl.c:533: warning: 'kcmd' may be used uninitialized in this function
fs/compat_ioctl.c:656: warning: 'ret' may be used uninitialized in this function

Reduces text size by 44 bytes.

If someone calls one of these functions with an unexpected argument, the
code's buggy as-is.

Amerigo Wang
Cc: Alexander Viro
Acked-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2010-03-07 03:26:35 +0800
a069c266a lib: build list_sort() only if needed ... Browse Code »

Build list_sort() only for configs that need it -- those that don't save
~581 bytes (i386).

Signed-off-by: Don Mullis
Cc: Dave Airlie
Cc: Andi Kleen
Cc: Dave Chinner
Cc: Artem Bityutskiy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Don Mullis
2010-03-07 03:26:35 +0800
5ef097dd7 exec: create initial stack independent of PAGE_SIZE ... Browse Code »

Currently we create the initial stack based on the PAGE_SIZE. This is
unnecessary.

This creates this initial stack independent of the PAGE_SIZE.

It also bumps up the number of 4k pages allocated from 20 to 32, to
align with 64K page systems.

Signed-off-by: Michael Neuling
Cc: Helge Deller
Reviewed-by: KOSAKI Motohiro
Cc: Americo Wang
Cc: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Neuling
2010-03-07 03:26:33 +0800
d554ed895 fs: use rlimit helpers ... Browse Code »

Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.

I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2010-03-07 03:26:29 +0800
5beb49305 mm: change anon_vma linking to fix multi-process server scalability issue ... Browse Code »

The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Cc: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2010-03-07 03:26:26 +0800
42e496086 vfs: take f_lock on modifying f_mode after open time ... Browse Code »

We'll introduce FMODE_RANDOM which will be runtime modified. So protect
all runtime modification to f_mode with f_lock to avoid races.

Signed-off-by: Wu Fengguang
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Trond Myklebust
Cc: Chuck Lever
Cc: [2.6.33.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-03-07 03:26:25 +0800
b084d4353 mm: count swap usage ... Browse Code »

A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc//smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc//status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
34e55232e mm: avoid false sharing of mm_counter ... Browse Code »

Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
d559db086 mm: clean up mm_counter ... Browse Code »

Presently, per-mm statistics counter is defined by macro in sched.h

This patch modifies it to
- defined in mm.h as inlinf functions
- use array instead of macro's name creation.

This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:23 +0800
984b3f574 bitops: rename for_each_bit() to for_each_set_bit() ... Browse Code »

Rename for_each_bit to for_each_set_bit in the kernel source tree. To
permit for_each_clear_bit(), should that ever be added.

The patch includes a macro to map the old for_each_bit() onto the new
for_each_set_bit(). This is a (very) temporary thing to ease the migration.

[akpm@linux-foundation.org: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan
Suggested-by: Andrew Morton
Signed-off-by: Akinobu Mita
Cc: "David S. Miller"
Cc: Russell King
Cc: David Woodhouse
Cc: Artem Bityutskiy
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2010-03-07 03:26:23 +0800
781b16775 Fix a dumb typo - use of & instead of && ... Browse Code »

We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")

Reported-by: Walter Sheets
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2010-03-07 02:54:48 +0800