23 Oct, 2007
1 commit
-
Fix mprotect bug in recent commit 3ed75eb8f1cd89565966599c4f77d2edb086d5b0
(setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify
case was setting the same prot as when not.Nothing wrong with the use of protection_map[] in mmap_region(),
but use vm_get_page_prot() there too in the same ~VM_SHARED way.Signed-off-by: Hugh Dickins
Cc: Coly Li
Cc: Tony Luck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Oct, 2007
4 commits
-
Now that nfsd has stopped writing to the find_exported_dentry member we an
mark the export_operations constSigned-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Cc:
Cc: Dave Kleikamp
Cc: Anton Altaparmakov
Cc: David Chinner
Cc: Timothy Shimmin
Cc: OGAWA Hirofumi
Cc: Hugh Dickins
Cc: Chris Mason
Cc: Jeff Mahoney
Cc: "Vladimir V. Saveliev"
Cc: Steven Whitehouse
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I'm not sure what people were thinking when adding support to export tmpfs,
but here's the conversion anyway:Signed-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix a panic due to access NULL pointer of kmem_cache_node at discard_slab()
after memory online.When memory online is called, kmem_cache_nodes are created for all SLUBs
for new node whose memory are available.slab_mem_going_online_callback() is called to make kmem_cache_node() in
callback of memory online event. If it (or other callbacks) fails, then
slab_mem_offline_callback() is called for rollback.In memory offline, slab_mem_going_offline_callback() is called to shrink
all slub cache, then slab_mem_offline_callback() is called later.[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: locking fix]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Current memory notifier has some defects yet. (Fortunately, nothing uses
it.) This patch is to fix and rearrange for them.- Add information of start_pfn, nr_pages, and node id if node status is
changes from/to memoryless node for callback functions.
Callbacks can't do anything without those information.
- Add notification going-online status.
It is necessary for creating per node structure before the node's
pages are available.
- Move GOING_OFFLINE status notification after page isolation.
It is good place for return memory like cache for callback,
because returned page is not used again.
- Make CANCEL events for rollingback when error occurs.
- Delete MEM_MAPPING_INVALID notification. It will be not used.
- Fix compile error of (un)register_memory_notifier().Signed-off-by: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Oct, 2007
1 commit
-
Wrong order of arguments
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds
20 Oct, 2007
14 commits
-
Signed-off-by: Adrian Bunk
-
Typo fixes retrun -> return
Signed-off-by: Gabriel Craciunescu
Signed-off-by: Adrian Bunk -
Spelling fixes in mm/.
Signed-off-by: Simon Arlott
Signed-off-by: Adrian Bunk -
Fix the wishy-washy comment to clearly explain why kmalloc() can't
use the __GFP_HIGHMEM zone modifier.Signed-off-by: Robert P. J. Day
Signed-off-by: Adrian Bunk -
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov
Cc: Dave Airlie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
With pid namespaces this field is now dangerous to use explicitly, so hide
it behind the helpers.Also the pid and pgrp fields o task_struct and signal_struct are to be
deprecated. Unfortunately this patch cannot be sent right now as this
leads to tons of warnings, so start isolating them, and deprecate later.Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
but Oleg pointed out that in case of posix cpu timers this is the same, and
thread_group_leader() is more preferable.Signed-off-by: Pavel Emelyanov
Acked-by: Oleg Nesterov
Cc: Sukadev Bhattiprolu
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.Signed-off-by: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Alexey Dobriyan
Cc: Sukadev Bhattiprolu
Cc: Oleg Nesterov
Cc: Paul Menage
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm/oom_kill.c: Convert list_for_each to list_for_each_entry in
oom_kill_process()Signed-off-by: Matthias Kaehlcke
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().A cgroup init has it's tsk->pid == 1.
A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.Changelog:
2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().2.6.21-mm2-pidns2:
- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn
Signed-off-by: Sukadev Bhattiprolu
Acked-by: Pavel Emelianov
Cc: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Dave Hansen
Cc: Herbert Poetzel
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystemThe "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.Signed-off-by: Paul Menage
Cc: Serge E. Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Hansen
Cc: Balbir Singh
Cc: Paul Jackson
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: Srivatsa Vaddagiri
Cc: Cedric Le Goater
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix kernel-api docbook contents problems.
docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'Signed-off-by: Randy Dunlap
Cc: Jens Axboe
Cc: WU Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch uses vm_get_page_prot() to setup vma->vm_page_prot.
Though inside vm_get_page_prot() the protection flags is AND with
(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.Signed-off-by: Coly Li
Cc: Hugh Dickins
Cc: Tony Luck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nobody uses flush_tlb_pgtables anymore, this patch removes all remaining
traces of it from all archs.Signed-off-by: Benjamin Herrenschmidt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Oct, 2007
4 commits
-
It gets it indirectly from blkdev.h when CONFIG_BLOCK is enabled, but it
needs it unconditionally for the definition of mapping_cap_writeback_dirty.Noticed and bisected down to 4af3c9cc4fad54c3627e9afebf905aafde5690ed
("Drop some headers from mm.h") by Avuton Olrich.Cc: Avuton Olrich
Cc: Alexey Dobriyan
Cc: Andrew Morton
Signed-off-by: Linus Torvalds -
Get rid of sparse related warnings from places that use integer as NULL
pointer.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger
Cc: Andi Kleen
Cc: Jeff Garzik
Cc: Matt Mackall
Cc: Ian Kent
Cc: Arnd Bergmann
Cc: Davide Libenzi
Cc: Stephen Smalley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch fixes memory leak in error path.
In reality, we don't need to call cpuup_canceled(cpu) for now. But upcoming
cpu hotplug error handling change needs this.Cc: Christoph Lameter
Cc: Gautham R Shenoy
Acked-by: Pekka Enberg
Signed-off-by: Akinobu Mita
Cc: Gautham R Shenoy
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
cpuup_callback() is too long. This patch factors out CPU_UP_CANCELLED and
CPU_UP_PREPARE handlings from cpuup_callback().Cc: Christoph Lameter
Cc: Pekka Enberg
Signed-off-by: Akinobu Mita
Cc: Gautham R Shenoy
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Oct, 2007
1 commit
-
* 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xfs: eagerly remove vmap mappings to avoid upsetting Xen
xen: add some debug output for failed multicalls
xen: fix incorrect vcpu_register_vcpu_info hypercall argument
xen: ask the hypervisor how much space it needs reserved
xen: lock pte pages while pinning/unpinning
xen: deal with stale cr3 values when unpinning pagetables
xen: add batch completion callbacks
xen: yield to IPI target if necessary
Clean up duplicate includes in arch/i386/xen/
remove dead code in pgtable_cache_init
paravirt: clean up lazy mode handling
paravirt: refactor struct paravirt_ops into smaller pv_*_ops
17 Oct, 2007
15 commits
-
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exportsSigned-off-by: Adrian Bunk
Acked-by: James Morris
Cc: Chris Wright
Cc: Stephen Smalley
Cc: Serge Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Implement file posix capabilities. This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php. For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.Changelog:
Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?It is a semantic change, as without fsccaps, attach_task doesn't
allow CAP_SYS_NICE to override the uid equivalence check. But since
it uses security_task_setscheduler, which elsewhere is used where
CAP_SYS_NICE can be used to override the uid equivalence check,
fixing it might be tough.task_setscheduler
note: this also controls cpuset:attach_task. Are we ok with
CAP_SYS_NICE being used to confine to a cpuset?
task_setioprio
task_setnice
sys_setpriority uses this (through set_one_prio) for another
process. Need same checks as setrlimitAug 21:
Updated secureexec implementation to reflect the fact that
euid and uid might be the same and nonzero, but the process
might still have elevated caps.Aug 15:
Handle endianness of xattrs.
Enforce capability version match between kernel and disk.
Enforce that no bits beyond the known max capability are
set, else return -EPERM.
With this extra processing, it may be worth reconsidering
doing all the work at bprm_set_security rather than
d_instantiate.Aug 10:
Always call getxattr at bprm_set_security, rather than
caching it at d_instantiate.[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn
Cc: Stephen Smalley
Cc: James Morris
Cc: Chris Wright
Cc: Andrew Morgan
Signed-off-by: Andrew Morgan
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fix kernel-doc for sys_remap_file_pages() and add info to the 'prot' NOTE.
Rename __prot parameter to prot.Signed-off-by: Randy Dunlap
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.This patch:
Some filesystems forego the vfs and may_open() and create their own 'struct
file's.This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.Also, rename an existing, static-scope init_file() to a less generic name.
Signed-off-by: Dave Hansen
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We don't want to introduce pointless delays in throttle_vm_writeout() when
the writeback limits are not yet exceeded, do we?Cc: Nick Piggin
Cc: OGAWA Hirofumi
Cc: Kumar Gala
Cc: Pete Zaitcev
Cc: Greg KH
Reviewed-by: Rik van Riel
Signed-off-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.Also document the various bits and change their order from historical to
logical.[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel
Cc: Dave Kleikamp
Cc: David Chinner
Cc: Anton Altaparmakov
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays. But sometimes the following happens instead:- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92MSome analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write 0, no more writes(BUG)nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a
flag writeback_control.more_io to indicate this situation. With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.Cc: David Chinner
Cc: Ken Chen
Signed-off-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since nothing earlier than gcc-3.2 is supported for kernel
compilation, that 2.95 hack can be removed.Signed-off-by: Robert P. J. Day
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.Cross-compile tested on many configs and archs.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
These aren't modular, so SLAB_PANIC is OK.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a writeback-internal marker but we're propagating it all the way back
to userspace!.Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zone->lock is quite an "inner" lock and mostly constrained to page alloc as
well, so like slab locks, it probably isn't something that is critically
important to document here. However unlike slab locks, zone lock could be
used more widely in future, and page_alloc.c might possibly have more
business to do tricky things with pagecache than does slab. So... I don't
think it hurts to document it.Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduces new zone flag interface for testing and setting flags:
int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
called, this flag is test and set before starting zone reclaim. Zone reclaim
starts in __alloc_pages() when a zone's watermark fails and the system is in
zone_reclaim_mode. If it's already in reclaim, there's no need to start again
so it is simply considered full for that allocation attempt.There is a change of behavior with regard to concurrent zone shrinking. It is
now possible for try_to_free_pages() or kswapd to already be shrinking a
particular zone when __alloc_pages() starts zone reclaim. In this case, it is
possible for two concurrent threads to invoke shrink_zone() for a single zone.This change forbids a zone to be in zone reclaim twice, which was always the
behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
when starting zone reclaim.Cc: Andrea Arcangeli
Cc: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if
the lock can't be acquired; it will be available soon enough once the zonelist
scanning is done. All other threads waiting for the OOM killer are also
contingent on the exiting task being able to acquire the lock in
clear_zonelist_oom() so it doesn't make sense to put it to sleep.Cc: Andrea Arcangeli
Cc: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since no task descriptor's 'cpuset' field is dereferenced in the execution of
the OOM killer anymore, it is no longer necessary to take callback_mutex.[akpm@linux-foundation.org: restore cpuset_lock for other patches]
Cc: Andrea Arcangeli
Acked-by: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds