Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

19 Dec, 2012

12 commits

ba6c496ed slab/slub: struct memcg_params ... Browse Code »

For the kmem slab controller, we need to record some extra information in
the kmem_cache structure.

Signed-off-by: Glauber Costa
Signed-off-by: Suleiman Souhlal
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: Johannes Weiner
Cc: JoonSoo Kim
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Pekka Enberg
Cc: Rik van Riel
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:13 +0800
2ad306b17 fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs ... Browse Code »

Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.

For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path. Once the slab is also tracked by memcg, we can
get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa
Acked-by: Frederic Weisbecker
Acked-by: Kamezawa Hiroyuki
Reviewed-by: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Greg Thelen
Cc: Johannes Weiner
Cc: JoonSoo Kim
Cc: Mel Gorman
Cc: Pekka Enberg
Cc: Rik van Riel
Cc: Suleiman Souhlal
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:13 +0800
a8964b9b8 memcg: use static branches when code not in use ... Browse Code »

We can use static branches to patch the code in or out when not used.

Because the _ACTIVE bit on kmem_accounted is only set after the increment
is done, we guarantee that the root memcg will always be selected for kmem
charges until all call sites are patched (see memcg_kmem_enabled). This
guarantees that no mischarges are applied.

Static branch decrement happens when the last reference count from the
kmem accounting in memcg dies. This will only happen when the charges
drop down to 0.

When that happens, we need to disable the static branch only on those
memcgs that enabled it. To achieve this, we would be forced to complicate
the code by keeping track of which memcgs were the ones that actually
enabled limits, and which ones got it from its parents.

It is a lot simpler just to do static_key_slow_inc() on every child
that is accounted.

Signed-off-by: Glauber Costa
Acked-by: Michal Hocko
Acked-by: Kamezawa Hiroyuki
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Johannes Weiner
Cc: Suleiman Souhlal
Cc: Tejun Heo
Cc: David Rientjes
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: JoonSoo Kim
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:13 +0800
50bdd430c res_counter: return amount of charges after res_counter_uncharge() ... Browse Code »

It is useful to know how many charges are still left after a call to
res_counter_uncharge. While it is possible to issue a res_counter_read
after uncharge, this can be racy.

If we need, for instance, to take some action when the counters drop down
to 0, only one of the callers should see it. This is the same semantics
as the atomic variables in the kernel.

Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on that, and only
users appearing from now on will be checking this value.

Signed-off-by: Glauber Costa
Reviewed-by: Michal Hocko
Acked-by: Kamezawa Hiroyuki
Acked-by: David Rientjes
Cc: Johannes Weiner
Cc: Suleiman Souhlal
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: JoonSoo Kim
Cc: Mel Gorman
Cc: Pekka Enberg
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:12 +0800
6a1a0d3b6 mm: allocate kernel pages to the right memcg ... Browse Code »

When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.

To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().

Signed-off-by: Glauber Costa
Acked-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: Kamezawa Hiroyuki
Acked-by: David Rientjes
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Johannes Weiner
Cc: Suleiman Souhlal
Cc: Tejun Heo
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: JoonSoo Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:12 +0800
7ae1e1d0f memcg: kmem controller infrastructure ... Browse Code »

Introduce infrastructure for tracking kernel memory pages to a given
memcg. This will happen whenever the caller includes the flag
__GFP_KMEMCG flag, and the task belong to a memcg other than the root.

In memcontrol.h those functions are wrapped in inline acessors. The idea
is to later on, patch those with static branches, so we don't incur any
overhead when no mem cgroups with limited kmem are being used.

Users of this functionality shall interact with the memcg core code
through the following functions:

memcg_kmem_newpage_charge: will return true if the group can handle the
allocation. At this point, struct page is not
yet allocated.

memcg_kmem_commit_charge: will either revert the charge, if struct page
allocation failed, or embed memcg information
into page_cgroup.

memcg_kmem_uncharge_page: called at free time, will revert the charge.

Signed-off-by: Glauber Costa
Acked-by: Michal Hocko
Acked-by: Kamezawa Hiroyuki
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Johannes Weiner
Cc: Tejun Heo
Cc: David Rientjes
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: JoonSoo Kim
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Suleiman Souhlal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:12 +0800
7a64bf05b mm: add a __GFP_KMEMCG flag ... Browse Code »

This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to current's
memcg.

Signed-off-by: Glauber Costa
Acked-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Acked-by: Kamezawa Hiroyuki
Acked-by: Michal Hocko
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: Suleiman Souhlal
Cc: Tejun Heo
Cc: David Rientjes
Cc: Frederic Weisbecker
Cc: Greg Thelen
Cc: JoonSoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-12-19 07:02:12 +0800
ae664dba2 Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux ... Browse Code »

Pull SLAB changes from Pekka Enberg:
"This contains preparational work from Christoph Lameter and Glauber
Costa for SLAB memcg and cleanups and improvements from Ezequiel
Garcia and Joonsoo Kim.

Please note that the SLOB cleanup commit from Arnd Bergmann already
appears in your tree but I had also merged it myself which is why it
shows up in the shortlog."

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
mm/sl[aou]b: Common alignment code
slab: Use the new create_boot_cache function to simplify bootstrap
slub: Use statically allocated kmem_cache boot structure for bootstrap
mm, sl[au]b: create common functions for boot slab creation
slab: Simplify bootstrap
slub: Use correct cpu_slab on dead cpu
mm: fix slab.c kernel-doc warnings
mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
slab: Ignore internal flags in cache creation
mm/slob: Use free_page instead of put_page for page-size kmalloc allocations
mm/sl[aou]b: Move common kmem_cache_size() to slab.h
mm/slob: Use object_size field in kmem_cache_size()
mm/slob: Drop usage of page->private for storing page-sized allocations
slub: Commonize slab_cache field in struct page
sl[au]b: Process slabinfo_show in common code
mm/sl[au]b: Move print_slabinfo_header to slab_common.c
mm/sl[au]b: Move slabinfo processing to slab_common.c
slub: remove one code path and reduce lock contention in __slab_free()

Linus Torvalds
2012-12-19 02:56:07 +0800
16e024f30 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc ... Browse Code »

Pull powerpc update from Benjamin Herrenschmidt:
"The main highlight is probably some base POWER8 support. There's more
to come such as transactional memory support but that will wait for
the next one.

Overall it's pretty quiet, or rather I've been pretty poor at picking
things up from patchwork and reviewing them this time around and Kumar
no better on the FSL side it seems..."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
powerpc+of: Rename and fix OF reconfig notifier error inject module
powerpc: mpc5200: Add a3m071 board support
powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
powerpc/mpc52xx: use module_platform_driver macro
powerpc+of: Export of_reconfig_notifier_[register,unregister]
powerpc/dma/raidengine: add raidengine device
powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
powerpc/mpc85xx: Change spin table to cached memory
powerpc/fsl-pci: Add PCI controller ATMU PM support
powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
powerpc: Disable relocation on exceptions when kexecing
powerpc: Enable relocation on during exceptions at boot
powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
powerpc: Add wrappers to enable/disable relocation on exceptions
powerpc: Add set_mode hcall
powerpc: Setup relocation on exceptions for bare metal systems
powerpc: Move initial mfspr LPCR out of __init_LPCR
powerpc: Add relocation on exception vector handlers
...

Linus Torvalds
2012-12-19 01:58:09 +0800
a22180d26 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs update from Chris Mason:
"A big set of fixes and features.

In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.

Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.

Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.

I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.

Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...

Linus Torvalds
2012-12-19 01:42:05 +0800
2d4dce007 Merge tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client updates from Trond Myklebust:
"Features include:

- Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
code. Remove altogether where possible, and replace with
WARN_ON_ONCE and appropriate error returns where not.
- NFSv4.1 client adds session dynamic slot table management. There
is matching server side code that has been submitted to Bruce for
consideration.

Together, this code allows the server to dynamically manage the
amount of memory it allocates to the duplicate request cache for
each client. It will constantly resize those caches to reserve
more memory for clients that are hot while shrinking caches for
those that are quiescent.

In addition, there are assorted bugfixes for the generic NFS write
code, fixes to deal with the drop_nlink() warnings, and yet another
fix for NFSv4 getacl."

* tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
SUNRPC: continue run over clients list on PipeFS event instead of break
NFS: Don't use SetPageError in the NFS writeback code
SUNRPC: variable 'svsk' is unused in function bc_send_request
SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
NFSv4.1: Deal effectively with interrupted RPC calls.
NFSv4.1: Move the RPC timestamp out of the slot.
NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
NFS: Fix calls to drop_nlink()
NFS: Ensure that we always drop inodes that have been marked as stale
nfs: Remove unused list nfs4_clientid_list
nfs: Remove duplicate function declaration in internal.h
NFS: avoid NULL dereference in nfs_destroy_server
SUNRPC handle EKEYEXPIRED in call_refreshresult
SUNRPC set gss gc_expiry to full lifetime
nfs: fix page dirtying in NFS DIO read codepath
nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
NFSv4.1: Be conservative about the client highest slotid
NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
nfs: don't extend writes to cover entire page if pagecache is invalid
...

Linus Torvalds
2012-12-19 01:36:34 +0800
ea88eeac0 Merge tag 'md-3.8' of git://neil.brown.name/md ... Browse Code »

Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."

* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.

Linus Torvalds
2012-12-19 01:32:44 +0800

18 Dec, 2012

28 commits

08afe22c6 Merge branch 'slab/next' into slab/for-linus ... Browse Code »

Fix up a trivial merge conflict with commit baaf1dd ("mm/slob: use
min_t() to compare ARCH_SLAB_MINALIGN") that did not go through the slab
tree.

Conflicts:
mm/slob.c

Signed-off-by: Pekka Enberg

Pekka Enberg
2012-12-18 18:46:20 +0800
848b81415 Merge branch 'akpm' (Andrew's patch-bomb) ... Browse Code »

Merge misc patches from Andrew Morton:
"Incoming:

- lots of misc stuff

- backlight tree updates

- lib/ updates

- Oleg's percpu-rwsem changes

- checkpatch

- rtc

- aoe

- more checkpoint/restart support

I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton : (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc//fdinfo/ fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc//fdinfo/ output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...

Linus Torvalds
2012-12-18 12:58:12 +0800
711c7bf99 fs, exportfs: add exportfs_encode_inode_fh() helper ... Browse Code »

We will need this helper in the next patch to provide a file handle for
inotify marks in /proc/pid/fdinfo output.

The patch is rather providing the way to use inodes directly when dentry
is not available (like in case of inotify system).

Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-12-18 09:15:27 +0800
138d22b58 fs, epoll: add procfs fdinfo helper ... Browse Code »

This allows us to print out eventpoll target file descriptor, events and
data, the /proc/pid/fdinfo/fd consists of

| pos: 0
| flags: 02
| tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

[avagin@: fix for unitialized ret variable]

Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-12-18 09:15:27 +0800
55985dd72 procfs: add ability to plug in auxiliary fdinfo providers ... Browse Code »

This patch brings ability to print out auxiliary data associated with
file in procfs interface /proc/pid/fdinfo/fd.

In particular further patches make eventfd, evenpoll, signalfd and
fsnotify to print additional information complete enough to restore
these objects after checkpoint.

To simplify the code we add show_fdinfo callback inside struct
file_operations (as Al and Pavel are proposing).

Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-12-18 09:15:27 +0800
6582c665d prandom: introduce prandom_bytes() and prandom_bytes_state() ... Browse Code »

Add functions to get the requested number of pseudo-random bytes.

The difference from get_random_bytes() is that it generates pseudo-random
numbers by prandom_u32(). It doesn't consume the entropy pool, and the
sequence is reproducible if the same rnd_state is used. So it is suitable
for generating random bytes for testing.

Signed-off-by: Akinobu Mita
Cc: "Theodore Ts'o"
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: David Woodhouse
Cc: Eilon Greenstein
Cc: David Laight
Cc: Michel Lespinasse
Cc: Robert Love
Cc: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2012-12-18 09:15:26 +0800
496f2f93b random32: rename random32 to prandom ... Browse Code »

This renames all random32 functions to have 'prandom_' prefix as follows:

void prandom_seed(u32 seed); /* rename from srandom32() */
u32 prandom_u32(void); /* rename from random32() */
void prandom_seed_state(struct rnd_state *state, u64 seed);
/* rename from prandom32_seed() */
u32 prandom_u32_state(struct rnd_state *state);
/* rename from prandom32() */

The purpose of this renaming is to prevent some kernel developers from
assuming that prandom32() and random32() might imply that only
prandom32() was the one using a pseudo-random number generator by
prandom32's "p", and the result may be a very embarassing security
exposure. This concern was expressed by Theodore Ts'o.

And furthermore, I'm going to introduce new functions for getting the
requested number of pseudo-random bytes. If I continue to use both
prandom32 and random32 prefixes for these functions, the confusion
is getting worse.

As a result of this renaming, "prandom_" is the common prefix for
pseudo-random number library.

Currently, srandom32() and random32() are preserved because it is
difficult to rename too many users at once.

Signed-off-by: Akinobu Mita
Cc: "Theodore Ts'o"
Cc: Robert Love
Cc: Michel Lespinasse
Cc: Valdis Kletnieks
Cc: David Laight
Cc: Adrian Hunter
Cc: Artem Bityutskiy
Cc: David Woodhouse
Cc: Eilon Greenstein
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2012-12-18 09:15:26 +0800
8529091e8 linux/compiler.h: add __must_hold macro for functions called with a lock held ... Browse Code »

linux/compiler.h has macros to denote functions that acquire or release
locks, but not to denote functions called with a lock held that return
with the lock still held. Add a __must_hold macro to cover that case.

Signed-off-by: Josh Triplett
Reported-by: Ed Cashin
Tested-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josh Triplett
2012-12-18 09:15:23 +0800
a5ba911ec pidns: remove unused is_container_init() ... Browse Code »

Since commit 1cdcbec1a337 ("CRED: Neuter sys_capset()")
is_container_init() has no callers.

Signed-off-by: Gao feng
Cc: David Howells
Acked-by: Serge Hallyn
Cc: James Morris
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gao feng
2012-12-18 09:15:23 +0800
d74026986 exec: use -ELOOP for max recursion depth ... Browse Code »

To avoid an explosion of request_module calls on a chain of abusive
scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
as maximum recursion depth is hit, the error will fail all the way back
up the chain, aborting immediately.

This also has the side-effect of stopping the user's shell from attempting
to reexecute the top-level file as a shell script. As seen in the
dash source:

if (cmd != path_bshell && errno == ENOEXEC) {
*argv-- = cmd;
*argv = cmd = path_bshell;
goto repeat;
}

The above logic was designed for running scripts automatically that lacked
the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
things continue to behave as the shell expects.

Additionally, when tracking recursion, the binfmt handlers should not be
involved. The recursion being tracked is the depth of calls through
search_binary_handler(), so that function should be exclusively responsible
for tracking the depth.

Signed-off-by: Kees Cook
Cc: halfdog
Cc: P J P
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2012-12-18 09:15:23 +0800
992fb6e17 ptrace: introduce PTRACE_O_EXITKILL ... Browse Code »

Ptrace jailers want to be sure that the tracee can never escape
from the control. However if the tracer dies unexpectedly the
tracee continues to run in potentially unsafe mode.

Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits
it sends SIGKILL to every tracee which has this bit set.

Note that the new option is not equal to the last-option << 1. Because
currently all options have an event, and the new one starts the eventless
group. It uses the random 20 bit, so we have the room for 12 more events,
but we can also add the new eventless options below this one.

Suggested by Amnon Shiloh.

Signed-off-by: Oleg Nesterov
Tested-by: Amnon Shiloh
Cc: Denys Vlasenko
Cc: Michael Kerrisk
Cc: Serge Hallyn
Cc: Chris Evans
Cc: David Howells
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-12-18 09:15:22 +0800
4c925d603 kstrto*: add documentation ... Browse Code »

As Bruce Fields pointed out, kstrto* is currently lacking kerneldoc
comments. This patch adds kerneldoc comments to common variants of
kstrto*: kstrto(u)l, kstrto(u)ll and kstrto(u)int.

Signed-off-by: Eldad Zack
Cc: J. Bruce Fields
Cc: Joe Perches
Cc: Randy Dunlap
Cc: Alexey Dobriyan
Cc: Rob Landley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eldad Zack
2012-12-18 09:15:22 +0800
0ad50c389 compat: generic compat_sys_sched_rr_get_interval() implementation ... Browse Code »

This function is used by sparc, powerpc tile and arm64 for compat support.
The patch adds a generic implementation with a wrapper for PowerPC to do
the u32->int sign extension.

The reason for a single patch covering powerpc, tile, sparc and arm64 is
to keep it bisectable, otherwise kernel building may fail with mismatched
function declarations.

Signed-off-by: Catalin Marinas
Acked-by: Chris Metcalf [for tile]
Acked-by: David S. Miller
Acked-by: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Catalin Marinas
2012-12-18 09:15:18 +0800
8ebe34731 percpu_rw_semaphore: add lockdep annotations ... Browse Code »

Add lockdep annotations. Not only this can help to find the potential
problems, we do not want the false warnings if, say, the task takes two
different percpu_rw_semaphore's for reading. IOW, at least ->rw_sem
should not use a single class.

This patch exposes this internal lock to lockdep so that it represents the
whole percpu_rw_semaphore. This way we do not need to add another "fake"
->lockdep_map and lock_class_key. More importantly, this also makes the
output from lockdep much more understandable if it finds the problem.

In short, with this patch from lockdep pov percpu_down_read() and
percpu_up_read() acquire/release ->rw_sem for reading, this matches the
actual semantics. This abuses __up_read() but I hope this is fine and in
fact I'd like to have down_read_no_lockdep() as well,
percpu_down_read_recursive_readers() will need it.

Signed-off-by: Oleg Nesterov
Cc: Anton Arapov
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Michal Marek
Cc: Mikulas Patocka
Cc: "Paul E. McKenney"
Cc: Peter Zijlstra
Cc: Srikar Dronamraju
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-12-18 09:15:18 +0800
9390ef0c8 percpu_rw_semaphore: kill ->writer_mutex, add ->write_ctr ... Browse Code »

percpu_rw_semaphore->writer_mutex was only added to simplify the initial
rewrite, the only thing it protects is clear_fast_ctr() which otherwise
could be called by multiple writers. ->rw_sem is enough to serialize the
writers.

Kill this mutex and add "atomic_t write_ctr" instead. The writers
increment/decrement this counter, the readers check it is zero instead of
mutex_is_locked().

Move atomic_add(clear_fast_ctr(), slow_read_ctr) under down_write() to
avoid the race with other writers. This is a bit sub-optimal, only the
first writer needs this and we do not need to exclude the readers at this
stage. But this is simple, we do not want another internal lock until we
add more features.

And this speeds up the write-contended case. Before this patch the racing
writers sleep in synchronize_sched_expedited() sequentially, with this
patch multiple synchronize_sched_expedited's can "overlap" with each
other. Note: we can do more optimizations, this is only the first step.

Signed-off-by: Oleg Nesterov
Cc: Anton Arapov
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Michal Marek
Cc: Mikulas Patocka
Cc: "Paul E. McKenney"
Cc: Peter Zijlstra
Cc: Srikar Dronamraju
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-12-18 09:15:18 +0800
a1fd3e24d percpu_rw_semaphore: reimplement to not block the readers unnecessarily ... Browse Code »

Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov
Reviewed-by: Paul E. McKenney
Cc: Linus Torvalds
Cc: Mikulas Patocka
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Srikar Dronamraju
Cc: Ananth N Mavinakayanahalli
Cc: Anton Arapov
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-12-18 09:15:18 +0800
b18888ab2 string: introduce helper to get base file name from given path ... Browse Code »

There are several places in the kernel that use functionality like
basename(3) with the exception: in case of '/foo/bar/' we expect to get an
empty string. Let's do it common helper for them.

Signed-off-by: Andy Shevchenko
Cc: Jason Baron
Cc: YAMANE Toshiaki
Cc: Greg Kroah-Hartman
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2012-12-18 09:15:17 +0800
762a936fb backlight: add of_find_backlight_by_node() ... Browse Code »

This function finds the struct backlight_device for a given device tree
node. A dummy function is provided so that it safely compiles out if OF
support is disabled.

[akpm@linux-foundation.org: Don't use IS_ENABLED(CONFIG_OF)]
Signed-off-by: Thierry Reding
Acked-by: Jingoo Han
Reviewed-by: Grant Likely
Cc: Thierry Reding
Reviewed-by: Grant Likely
Acked-by: Jingoo Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thierry Reding
2012-12-18 09:15:16 +0800
8cc9764c9 drivers/video/backlight/lp855x_bl.c: use generic PWM functions ... Browse Code »

The LP855x family devices support the PWM input for the backlight control.
Period of the PWM is configurable in the platform side. Platform
specific functions are unnecessary anymore because generic PWM functions
are used inside the driver.

(PWM input mode)
To set the brightness, new lp855x_pwm_ctrl() is used.
If a PWM device is not allocated, devm_pwm_get() is called.
The PWM consumer name is from the chip name such as 'lp8550' and 'lp8556'.
To get the brightness value, no additional handling is required.
Just the value of 'props.brightness' is returned.

If the PWM driver is not ready while initializing the LP855x driver, it's
OK. The PWM device can be retrieved later, when the brightness value is
changed.

Documentation is updated with an example.

[akpm@linux-foundation.org: coding-style simplification, per Thierry]
Signed-off-by: Milo(Woogyom) Kim
Cc: Thierry Reding
Cc: Richard Purdie
Cc: Bryan Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kim, Milo
2012-12-18 09:15:16 +0800
41739ee35 asm-generic: io: don't perform swab during {in,out} string functions ... Browse Code »

The {in,out}s{b,w,l} functions are designed to operate on a stream of
bytes and therefore should not perform any byte-swapping, regardless of
the CPU byte order.

This patch fixes the generic IO header so that {in,out}s{b,w,l} call the
__raw_{read,write} functions directly rather than going via the
endian-correcting accessors.

Signed-off-by: Will Deacon
Cc: Mike Frysinger
Acked-by: Arnd Bergmann
Acked-by: Ben Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Will Deacon
2012-12-18 09:15:13 +0800
965c8e59c lseek: the "whence" argument is called "whence" ... Browse Code »
13

But the kernel decided to call it "origin" instead. Fix most of the
sites.

Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-12-18 09:15:12 +0800
7929d407e include/linux/init.h: use the stringify operator for the __define_initcall macro ... Browse Code »

Currently the __define_initcall() macro takes three arguments, fn, id and
level. The level argument is exactly the same as the id argument but
wrapped in quotes. To overcome this need to specify three arguments to
the __define_initcall macro, where one argument is the stringification of
another, we can just use the stringification macro instead.

Signed-off-by: Matthew Leach
Cc: Benjamin Herrenschmidt
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Leach
2012-12-18 09:15:12 +0800
6a2b60b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.

The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.

The files under /proc//ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.

Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc//ns/ files in this tree.

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...

Linus Torvalds
2012-12-18 07:44:47 +0800
376bddd34 Merge remote-tracking branch 'agust/next' into next ... Browse Code »

Brings some 52xx updates. Also manually merged tools/perf/perf.h.

Signed-off-by: Benjamin Herrenschmidt

Benjamin Herrenschmidt
2012-12-18 07:22:27 +0800
9228ff903 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:

- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.

- A few cleanups from Akinobu Mita for drbd and cciss.

- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.

- A few fixes for xen back/front block driver from Roger Pau Monne.

- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...

Linus Torvalds
2012-12-18 05:39:11 +0800
9360b5366 Revert "bdi: add a user-tunable cpu_list for the bdi flusher threads" ... Browse Code »

This reverts commit 8fa72d234da9b6b473bbb1f74d533663e4996e6b.

People disagree about how this should be done, so let's revert this for
now so that nobody starts using the new tuning interface. Tejun is
thinking about a more generic interface for thread pool affinity.

Requested-by: Tejun Heo
Acked-by: Jeff Moyer
Acked-by: Jens Axboe
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-12-18 03:29:09 +0800
60da5bf47 Merge branch 'for-3.8/core' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block layer core updates from Jens Axboe:
"Here are the core block IO bits for 3.8. The branch contains:

- The final version of the surprise device removal fixups from Bart.

- Don't hide EFI partitions under advanced partition types. It's
fairly wide spread these days. This is especially dangerous for
systems that have both msdos and efi partition tables, where you
want to keep them in sync.

- Cleanup of using -1 instead of the proper NUMA_NO_NODE

- Export control of bdi flusher thread CPU mask and default to using
the home node (if known) from Jeff.

- Export unplug tracepoint for MD.

- Core improvements from Shaohua. Reinstate the recursive merge, as
the original bug has been fixed. Add plugging for discard and also
fix a problem handling non pow-of-2 discard limits.

There's a trivial merge in block/blk-exec.c due to a fix that went
into 3.7-rc at a later point than -rc4 where this is based."

* 'for-3.8/core' of git://git.kernel.dk/linux-block:
block: export block_unplug tracepoint
block: add plug for blkdev_issue_discard
block: discard granularity might not be power of 2
deadline: Allow 0ms deadline latency, increase the read speed
partitions: enable EFI/GPT support by default
bsg: Remove unused function bsg_goose_queue()
block: Make blk_cleanup_queue() wait until request_fn finished
block: Avoid scheduling delayed work on a dead queue
block: Avoid that request_fn is invoked on a dead queue
block: Let blk_drain_queue() caller obtain the queue lock
block: Rename queue dead flag
bdi: add a user-tunable cpu_list for the bdi flusher threads
block: use NUMA_NO_NODE instead of -1
block: recursive merge requests
block CFQ: avoid moving request to different queue

Linus Torvalds
2012-12-18 00:27:23 +0800
3c2e81ef3 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux ... Browse Code »

Pull DRM updates from Dave Airlie:
"This is the one and only next pull for 3.8, we had a regression we
found last week, so I was waiting for that to resolve itself, and I
ended up with some Intel fixes on top as well.

Highlights:
- new driver: nvidia tegra 20/30/hdmi support
- radeon: add support for previously unused DMA engines, more HDMI
regs, eviction speeds ups and fixes
- i915: HSW support enable, agp removal on GEN6, seqno wrapping
- exynos: IPP subsystem support (image post proc), HDMI
- nouveau: display class reworking, nv20->40 z compression
- ttm: start of locking fixes, rcu usage for lookups,
- core: documentation updates, docbook integration, monotonic clock
usage, move from connector to object properties"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (590 commits)
drm/exynos: add gsc ipp driver
drm/exynos: add rotator ipp driver
drm/exynos: add fimc ipp driver
drm/exynos: add iommu support for ipp
drm/exynos: add ipp subsystem
drm/exynos: support device tree for fimd
radeon: fix regression with eviction since evict caching changes
drm/radeon: add more pedantic checks in the CP DMA checker
drm/radeon: bump version for CS ioctl support for async DMA
drm/radeon: enable the async DMA rings in the CS ioctl
drm/radeon: add VM CS parser support for async DMA on cayman/TN/SI
drm/radeon/kms: add evergreen/cayman CS parser for async DMA (v2)
drm/radeon/kms: add 6xx/7xx CS parser for async DMA (v2)
drm/radeon: fix htile buffer size computation for command stream checker
drm/radeon: fix fence locking in the pageflip callback
drm/radeon: make indirect register access concurrency-safe
drm/radeon: add W|RREG32_IDX for MM_INDEX|DATA based mmio accesss
drm/exynos: support extended screen coordinate of fimd
drm/exynos: fix x, y coordinates for right bottom pixel
drm/exynos: fix fb offset calculation for plane
...

Linus Torvalds
2012-12-18 00:26:17 +0800