19 Dec, 2012
4 commits
-
Pull (again) user namespace infrastructure changes from Eric Biederman:
"Those bugs, those darn embarrasing bugs just want don't want to get
fixed.Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.When you get a chance can you please repull my branch"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Fix typo in description of the limitation of userns_install
userns: Add a more complete capability subset test to commit_creds
userns: Require CAP_SYS_ADMIN for most uses of setns.
Fix cap_capable to only allow owners in the parent user namespace to have caps. -
Pull exofs changes from Boaz Harrosh:
"These are just 3 patches, the last two are bug fixes on the error
paths in exofs.The important patch is the one to osd_uld which adds sysfs info to osd
devices for use by user-mode clustering discovery software. I'm
already sitting on this patch since before February this year, It is
important for some of the big installation cluster systems, who's been
compiling their own kernel just for that patch."Ugh. The osd_uld patch already went through the SCSI tree, so this was
kind of pointless. But at least it has the two small error-path fixes..* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: don't leak io_state and pages on read error
exofs: clean up the correct page collection on write error
osduld: Add osdname & systemid sysfs at scsi_osd class -
Pull btrfs update from Chris Mason:
"A big set of fixes and features.In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
... -
Pull NFS client updates from Trond Myklebust:
"Features include:- Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
code. Remove altogether where possible, and replace with
WARN_ON_ONCE and appropriate error returns where not.
- NFSv4.1 client adds session dynamic slot table management. There
is matching server side code that has been submitted to Bruce for
consideration.Together, this code allows the server to dynamically manage the
amount of memory it allocates to the duplicate request cache for
each client. It will constantly resize those caches to reserve
more memory for clients that are hot while shrinking caches for
those that are quiescent.In addition, there are assorted bugfixes for the generic NFS write
code, fixes to deal with the drop_nlink() warnings, and yet another
fix for NFSv4 getacl."* tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
SUNRPC: continue run over clients list on PipeFS event instead of break
NFS: Don't use SetPageError in the NFS writeback code
SUNRPC: variable 'svsk' is unused in function bc_send_request
SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
NFSv4.1: Deal effectively with interrupted RPC calls.
NFSv4.1: Move the RPC timestamp out of the slot.
NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
NFS: Fix calls to drop_nlink()
NFS: Ensure that we always drop inodes that have been marked as stale
nfs: Remove unused list nfs4_clientid_list
nfs: Remove duplicate function declaration in internal.h
NFS: avoid NULL dereference in nfs_destroy_server
SUNRPC handle EKEYEXPIRED in call_refreshresult
SUNRPC set gss gc_expiry to full lifetime
nfs: fix page dirtying in NFS DIO read codepath
nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
NFSv4.1: Be conservative about the client highest slotid
NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
nfs: don't extend writes to cover entire page if pagecache is invalid
...
18 Dec, 2012
26 commits
-
Merge misc patches from Andrew Morton:
"Incoming:- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."* emailed patches from Andrew Morton : (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc//fdinfo/ fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc//fdinfo/ output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
... -
The kernel keeps FAN_MARK_IGNORED_SURV_MODIFY bit separately from
fsnotify_mark::mask|ignored_mask thus put it in @mflags (mark flags)
field so the user-space reader will be able to detect if such bit were
used on mark creation procedure.| pos: 0
| flags: 04002
| fanotify flags:10 event-flags:0
| fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
| fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This allow us to print out fsnotify details such as watchee inode, device,
mask and optionally a file handle.For inotify objects if kernel compiled with exportfs support the output
will be| pos: 0
| flags: 02000000
| inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
| inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:11a1000020542153
| inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:49b1060023552153If kernel compiled without exportfs support, the file handle
won't be provided but inode and device only.| pos: 0
| flags: 02000000
| inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0
| inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0
| inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0For fanotify the output is like
| pos: 0
| flags: 04002
| fanotify flags:10 event-flags:0
| fanotify mnt_id:12 mask:3b ignored_mask:0
| fanotify ino:50205 sdev:800013 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:05020500fb1d47e7To minimize impact on general fsnotify code the new functionality
is gathered in fs/notify/fdinfo.c file.Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We will need this helper in the next patch to provide a file handle for
inotify marks in /proc/pid/fdinfo output.The patch is rather providing the way to use inodes directly when dentry
is not available (like in case of inotify system).Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This routine will be used to generate a file handle in fdinfo output for
inotify subsystem, where if no s_export_op present the general
export_encode_fh should be used. Thus add a test if s_export_op present
inside exportfs_encode_fh itself.Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This allows us to print out eventpoll target file descriptor, events and
data, the /proc/pid/fdinfo/fd consists of| pos: 0
| flags: 02
| tfd: 5 events: 1d data: ffffffffffffffff enabled: 1[avagin@: fix for unitialized ret variable]
Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This allows us to print out raw counter value. The /proc/pid/fdinfo/fd
output is| pos: 0
| flags: 04002
| eventfd-count: 5aSigned-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch brings ability to print out auxiliary data associated with
file in procfs interface /proc/pid/fdinfo/fd.In particular further patches make eventfd, evenpoll, signalfd and
fsnotify to print additional information complete enough to restore
these objects after checkpoint.To simplify the code we add show_fdinfo callback inside struct
file_operations (as Al and Pavel are proposing).Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This also converts filling memory loop to use memset.
Signed-off-by: Akinobu Mita
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: "Theodore Ts'o"
Cc: David Laight
Cc: David Woodhouse
Cc: Eilon Greenstein
Cc: Michel Lespinasse
Cc: Robert Love
Cc: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
To avoid an explosion of request_module calls on a chain of abusive
scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
as maximum recursion depth is hit, the error will fail all the way back
up the chain, aborting immediately.This also has the side-effect of stopping the user's shell from attempting
to reexecute the top-level file as a shell script. As seen in the
dash source:if (cmd != path_bshell && errno == ENOEXEC) {
*argv-- = cmd;
*argv = cmd = path_bshell;
goto repeat;
}The above logic was designed for running scripts automatically that lacked
the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
things continue to behave as the shell expects.Additionally, when tracking recursion, the binfmt handlers should not be
involved. The recursion being tracked is the depth of calls through
search_binary_handler(), so that function should be exclusively responsible
for tracking the depth.Signed-off-by: Kees Cook
Cc: halfdog
Cc: P J P
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We display a list of supplementary group for each process in
/proc//status. However, we show only the first 32 groups, not all of
them.Although this is rare, but sometimes processes do have more than 32
supplementary groups, and this kernel limitation breaks user-space apps
that rely on the group list in /proc//status.Number 32 comes from the internal NGROUPS_SMALL macro which defines the
length for the internal kernel "small" groups buffer. There is no
apparent reason to limit to this value.This patch removes the 32 groups printing limit.
The Linux kernel limits the amount of supplementary groups by NGROUPS_MAX,
which is currently set to 65536. And this is the maximum count of groups
we may possibly print.Signed-off-by: Artem Bityutskiy
Acked-by: Serge E. Hallyn
Acked-by: Kees Cook
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is currently impossible to examine the state of seccomp for a given
process. While attaching with gdb and attempting "call
prctl(PR_GET_SECCOMP,...)" will work with some situations, it is not
reliable. If the process is in seccomp mode 1, this query will kill the
process (prctl not allowed), if the process is in mode 2 with prctl not
allowed, it will similarly be killed, and in weird cases, if prctl is
filtered to return errno 0, it can look like seccomp is disabled.When reviewing the state of running processes, there should be a way to
externally examine the seccomp mode. ("Did this build of Chrome end up
using seccomp?" "Did my distro ship ssh with seccomp enabled?")This adds the "Seccomp" line to /proc/$pid/status.
Signed-off-by: Kees Cook
Reviewed-by: Cyrill Gorcunov
Cc: Andrea Arcangeli
Cc: James Morris
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
During c/r sessions we've found that there is no way at the moment to
fetch some VMA associated flags, such as mlock() and madvise().This leads us to a problem -- we don't know if we should call for mlock()
and/or madvise() after restore on the vma area we're bringing back to
life.This patch intorduces a new field into "smaps" output called VmFlags,
where all set flags associated with the particular VMA is shown as two
letter mnemonics.[ Strictly speaking for c/r we only need mlock/madvise bits but it has been
said that providing just a few flags looks somehow inconsistent. So all
flags are here now. ]This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
other applications may start to use these fields.The data is encoded in a somewhat awkward two letters mnemonic form, to
encourage userspace to be prepared for fields being added or removed in
the future.[a.p.zijlstra@chello.nl: props to use for_each_set_bit]
[sfr@canb.auug.org.au: props to use array instead of struct]
[akpm@linux-foundation.org: overall redesign and simplification]
[akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Peter Zijlstra
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Without this patch it is really hard to interpret a bounding set, if
CAP_LAST_CAP is unknown for a current kernel.Non-existant capabilities can not be deleted from a bounding set with help
of prctl.E.g.: Here are two examples without/with this patch.
CapBnd: ffffffe0fdecffff
CapBnd: 00000000fdecffffI suggest to hide non-existent capabilities. Here is two reasons.
* It's logically and easier for using.
* It helps to checkpoint-restore capabilities of tasks, because tasks
can be restored on another kernel, where CAP_LAST_CAP is bigger.Signed-off-by: Andrew Vagin
Cc: Andrew G. Morgan
Reviewed-by: Serge E. Hallyn
Cc: Pavel Emelyanov
Reviewed-by: Kees Cook
Cc: KAMEZAWA Hiroyuki
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Option parsing code expects an unsigned integer for the codepage option,
but prefixes and stores this option with "cp" before passing to
load_nls(). This makes the displayed option in /proc an invalid one.
Strip the prefix when printing so that the displayed option is valid for
reuse.Signed-off-by: Dave Reisner
Acked-by: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
parse_options() is supposed to return value < 0 on error however we
returned 0 (success) in a lot of cases. This actually was not a problem
in practice because match_token() used by parse_options() is clever and
catches most of the problems for us.Signed-off-by: Jan Kara
Cc: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
So far FAT either offsets time stamps by sys_tz.minuteswest or leaves them
as they are (when tz=UTC mount option is used). However in some cases it
is useful if one can specify time stamp offset on his own (e.g. when time
zone of the camera connected is different from time zone of the computer,
or when HW clock is in UTC and thus sys_tz.minuteswest == 0).So provide a mount option time_offset= which allows user to specify offset
in minutes that should be applied to time stamps on the filesystem.akpm: this code would work incorrectly when used via `mount -o remount',
because cached inodes would not be updated. But fatfs's fat_remount() is
basically a no-op anyway.Signed-off-by: Jan Kara
Acked-by: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Change fatfs so that a warning is emitted when an attempt is made to mount
a filesystem with the unsupported `discard' option.ext4 aready does this: http://patchwork.ozlabs.org/patch/192668/
Signed-off-by: Namjae Jeon
Signed-off-by: Amit Sahrawat
Acked-by: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If elf_core_dump() is called and fill_note_info() fails in the kmalloc()
then it returns 0 but has not yet initialised all the needed fields. As a
result we do a kfree(randomness) after correctly skipping the thread data.[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
[yongjun_wei@trendmicro.com.cn: remove duplicated include]
Signed-off-by: Andy Shevchenko
Signed-off-by: Wei Yongjun
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fixes following sparse warning:
fs/notify/inode_mark.c:127:22: warning: symbol 'fsnotify_find_inode_mark_locked' was not declared. Should it be static?
Signed-off-by: Tushar Behera
Cc: Eric Paris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
But the kernel decided to call it "origin" instead. Fix most of the
sites.Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.The files under /proc//ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc//ns/ files in this tree.Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
... -
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three partsThe new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.Signed-off-by: Liu Bo
Signed-off-by: Chris Mason -
The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland. For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped. The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.This fixes a few problem spots. First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite. Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.Signed-off-by: Chris Mason
Reported-by: Pascal Junod -
Pull ext3, udf, quota fixes from Jan Kara:
"Some ext3 & quota cleanups and couple of udf fixes"* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Use the pre-processor to compile out quotactl_cmd_write when !CONFIG_BLOCK
ext3: drop if around WARN_ON
ext3: get rid of the duplicate code on ext3_fill_super
udf: remove un-needed variable from inode_getblk
udf: don't increment lenExtents while writing to a hole
udf: fix memory leak while allocating blocks during write
17 Dec, 2012
10 commits
-
This confuses and angers lockdep even though it's ok. We don't really need
the lock for free space inodes since only the transaction committer will be
reserving space. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
This happens because writeback_inodes_sb_nr_if_idle does down_read. This
doesn't work for us and it has not been fixed upstream yet, so do it
ourselves and use that instead so we can stop having this stupid long
standing lockup. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.Signed-off-by: Filipe Brandenburger
Reviewed-by: Liu Bo
Signed-off-by: Chris Mason -
Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.Signed-off-by: Liu Bo
Signed-off-by: Chris Mason -
This fixes a very special case that can be reproduced by just
disconnecting a disk at runtime, and without unmounting the
filesystem first, start scrub on the filesystem with the
disconnected disk. All read and write EIOs are handled
correctly, only the first superblock is an exception and gives
a BUG() in a subfunction. The BUG() is correct, it would crash
later otherwise. The subfunction must not be called for
superblocks and this is what the fix changes.Reported-by: Joeri Vanthienen
Signed-off-by: Stefan Behrens
Signed-off-by: Chris Mason -
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload. We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
I noticed while doing fsync tests that we were always dropping the path and
re-searching when we first cow the log root even though we've already gotten
the write lock on the root. That's because we don't take into account that
there might not be a parent node, so fix the check to make sure there is
actually a parent node before we undo all of this work for nothing. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
If we are syncing over and over the overhead of doing all those maps in
fill_inode_item and log_changed_extents really starts to hurt, so use map
tokens so we can avoid all the extra mapping. Since the token maps from our
offset to the end of the page make sure to set the first thing in the item
first so we really only do one map. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
This gets called at least 4 times for every level while adding an object,
and it involves 3 kmapping calls, which on my box take about 5us a piece.
So instead use a token, which brings us down to 1 kmap call and makes this
function take 1/3 of the time per call. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason -
Our token logic depends on token->kaddr being set, and if it is not it sets
everything properly as needed. So instead of memsetting just set
token->kaddr to NULL. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason