28 May, 2010
26 commits
-
We don't name our generic fsync implementations very well currently.
The no-op implementation for in-memory filesystems is currently called
simple_sync_file, which doesn't make much sense to start with, and
the generic one for simple filesystems is called simple_fsync,
which can lead to some confusion.

This patch renames the generic file fsync method to generic_file_fsync
to match the other generic_file_* routines it is supposed to be used
with, and the no-op implementation to noop_fsync to make it obvious
what to expect. In addition add some documentation for both methods.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
-
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
-
Add a mutex_unlock missing on the error path. At other exits from the
function that return an error flag, the mutex is unlocked, so do the same
here.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

//
@@
expression E1;
@@

* mutex_lock(E1,...);
* mutex_unlock(E1,...);
//

Signed-off-by: Julia Lawall
Signed-off-by: Al Viro
-
__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro
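The "decrement only if it is not the last reference" idea can be modelled in user space with C11 atomics. This is only a sketch under assumptions, not the kernel code: add_unless() is a hypothetical stand-in for atomic_long_add_unless(), and put_ref_atomic() plays the role the message describes for fput_atomic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for atomic_long_add_unless():
 * add 'a' to *v unless *v equals 'u'. Returns true if the add happened. */
bool add_unless(atomic_long *v, long a, long u)
{
    long cur = atomic_load(v);

    while (cur != u) {
        /* On failure, 'cur' is refreshed with the value currently in *v. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* Models the described helper: drop a reference only when it is not the
 * last one, and report whether the drop was done. */
bool put_ref_atomic(atomic_long *count)
{
    assert(count != NULL);
    return add_unless(count, -1, 1);
}
```

With a count of 2, put_ref_atomic() drops to 1 and returns true. Called again, it leaves the count at 1 and returns false, which is the case where the caller would hand the final put to delayed work.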
-
Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.
In particular, before this patch, the command
ls -l
in an NFS mounted directory would always check if the directory on the server
had changed and if so would flush and refill the pagecache for the dir.
After this patch, the same "ls -l" will repeatedly return stale data until
the cached attributes for the directory time out.

The following patch fixes this by ensuring the d_revalidate is called by
do_last when "." is being looked-up.
link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
is not set so nfs_lookup_verify_inode chooses not to do any validation.

The following patch restores the original behaviour.

Cc: stable@kernel.org
Signed-off-by: NeilBrown
Signed-off-by: Al Viro
-
This reverts commit a7cf4145bb86aaf85d4d4d29a69b50b688e2e49d.
-
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
Btrfs: add more error checking to btrfs_dirty_inode
Btrfs: allow unaligned DIO
Btrfs: drop verbose enospc printk
Btrfs: Fix block generation verification race
Btrfs: fix preallocation and nodatacow checks in O_DIRECT
Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
Btrfs: rework O_DIRECT enospc handling
Btrfs: use async helpers for DIO write checksumming
Btrfs: don't walk around with task->state != TASK_RUNNING
Btrfs: do aio_write instead of write
Btrfs: add basic DIO read/write support
direct-io: do not merge logically non-contiguous requests
direct-io: add a hook for the fs to provide its own submit_bio function
fs: allow short direct-io reads to be completed via buffered IO
Btrfs: Metadata ENOSPC handling for balance
Btrfs: Pre-allocate space for data relocation
Btrfs: Metadata ENOSPC handling for tree log
Btrfs: Metadata reservation for orphan inodes
Btrfs: Introduce global metadata reservation
...
-
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: Make fsync sync new parent directories in no-journal mode
ext4: Drop whitespace at end of lines
ext4: Fix compat EXT4_IOC_ADD_GROUP
ext4: Conditionally define compat ioctl numbers
tracing: Convert more ext4 events to DEFINE_EVENT
ext4: Add new tracepoints to track mballoc's buddy bitmap loads
ext4: Add a missing trace hook
ext4: restart ext4_ext_remove_space() after transaction restart
ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
ext4: Avoid crashing on NULL ptr dereference on a filesystem error
ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
ext4: Use our own write_cache_pages()
ext4: Show journal_checksum option
ext4: Fix for ext4_mb_collect_stats()
ext4: check for a good block group before loading buddy pages
ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
ext4: Remove extraneous newlines in ext4_msg() calls
...

Fixed up trivial conflict in fs/ext4/fsync.c
-
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix another nfs_wb_page() deadlock
NFS: Ensure that we mark the inode as dirty if we exit early from commit
NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker
sunrpc: fix leak on error on socket xprt setup
-
Do not use the fallback default_llseek() if the readdir operation of the
filesystem still uses the big kernel lock.

Since llseek() modifies
file->f_pos of the directory directly it may need locking to not confuse
readdir which usually uses file->f_pos directly as well.

Since the special characteristics of the BKL (unlocked on schedule) are
not necessary in this case, the inode mutex can be used for locking as
provided by generic_file_llseek(). This is only possible since all
filesystems, except reiserfs, either use a directory as a flat file or
with disk address offsets. Reiserfs on the other hand uses a 32bit hash
of the filename as the offset so generic_file_llseek() can get used as
well since the hash is always smaller than sb->s_maxbytes (= (512 << 32) -
blocksize).

Signed-off-by: Jan Blunck
Acked-by: Jan Kara
Acked-by: Anders Larsen
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is an implementation of ->llseek useable for the rare special case
when userspace expects the seek to succeed but the (device) file is
actually not able to perform the seek. In this case you use noop_llseek()
instead of falling back to the default implementation of ->llseek.

Signed-off-by: Jan Blunck
Cc: Frederic Weisbecker
Cc: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The aio compat code was not converting the struct iovecs from 32bit to
64bit pointers, causing either EINVAL to be returned from io_getevents, or
EFAULT as the result of the I/O. This patch passes a compat flag to
io_submit to signal that pointer conversion is necessary for a given iocb
array.

A variant of this was tested by Michael Tokarev. I have also updated the
libaio test harness to exercise this code path with good success.
Further, I grabbed a copy of ltp and ran the
testcases/kernel/syscall/readv and writev tests there (compiled with -m32
on my 64bit system). All seems happy, but extra eyes on this would be
welcome.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
Signed-off-by: Jeff Moyer
Reported-by: Michael Tokarev
Cc: Zach Brown
Cc: [2.6.35.1]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and
writev AIO operations were not functioning properly. It turns out that
the code to convert the 32bit io vectors to 64 bits was never written.
The results of that can be pretty bad, but in my testing, it mostly ended
up in generating EFAULT as we walked off the list of I/O vectors provided.

This patch set fixes the problem in my environment. Comments are greatly
appreciated.

This patch:
Factor out code that will be used by both compat_do_readv_writev and the
compat aio submission code paths.

Signed-off-by: Jeff Moyer
Reported-by: Michael Tokarev
Cc: Zach Brown
Cc: [2.6.35.1]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

//
@@
type T;
T x;
identifier f;
@@

T f (...) { }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
//

Signed-off-by: Julia Lawall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Extend KCORE_TEXT to cover the pages between _text and _stext, to allow
examining some important page table pages.

`readelf -a` output on x86_64 before and after patch:

        Type  Offset              VirtAddr            PhysAddr
before  LOAD  0x00007fff8100c000  0xffffffff81009000  0x0000000000000000
after   LOAD  0x00007fff81003000  0xffffffff81000000  0x0000000000000000

The newly covered pages are:

0xffffffff81000000 etc.
0xffffffff81001000
0xffffffff81002000
0xffffffff81003000
0xffffffff81004000
0xffffffff81005000
0xffffffff81006000
0xffffffff81007000
0xffffffff81008000

Before patch, /proc/kcore shows outdated contents for the above page
table pages, for example:

(gdb) p level3_ident_pgt
$1 = {} 0xffffffff81002000
(gdb) p/x *((pud_t *)&level3_ident_pgt)@512
$2 = {{pud = 0x1006063}, {pud = 0x0} }

while the real content is:

root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem
1002000 6063 0100 0000 0000 8067 0000 0000 0000
1002010 0000 0000 0000 0000 0000 0000 0000 0000
*
1003000

That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB
identity mapping before/after patch:

(gdb) p/x *((pud_t *)&level3_ident_pgt)@512
before $1 = {{pud = 0x1006063}, {pud = 0x0} }
after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} }

Obviously the content before patch is wrong.
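To relate the hexdump words to the pud values printed by gdb, the 16-bit words can be reassembled by hand. The sketch below assumes a little-endian machine, where hexdump's default output shows each entry as four 16-bit words with the lowest word first. It also assumes the usual x86_64 entry layout, with the physical address in bits 12..51 and flags in the low 12 bits. The function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild one 64-bit table entry from the four 16-bit words hexdump
 * prints for it (lowest word first on a little-endian machine). */
uint64_t entry_from_words(const uint16_t w[4])
{
    assert(w != NULL);
    return (uint64_t)w[0] | (uint64_t)w[1] << 16 |
           (uint64_t)w[2] << 32 | (uint64_t)w[3] << 48;
}

/* Assumed x86_64 layout: bits 12..51 hold the physical address. */
uint64_t entry_phys_addr(uint64_t e)
{
    return e & 0x000ffffffffff000ULL;
}

/* Assumed x86_64 layout: the low 12 bits hold the flag bits. */
uint64_t entry_flags(uint64_t e)
{
    return e & 0xfffULL;
}
```

Feeding in the first hexdump row, the words 6063 0100 0000 0000 give 0x1006063 and the words 8067 0000 0000 0000 give 0x8067, matching the two non-zero entries in the "after" gdb output.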
Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A quick test shows these comments are obsolete, so just remove them.
Signed-off-by: WANG Cong
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I removed 3 unused assignments. The first two get reset on the first
statement of their functions. For "err" in root.c we don't return an
error and we don't use the variable again.

Signed-off-by: Dan Carpenter
Cc: Oleg Nesterov
Acked-by: Serge Hallyn
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Now that task->signal can't go away get_nr_threads() doesn't need
->siglock to read signal->count.

Also, make it inline, move into sched.h, and convert 2 other proc users of
signal->count to use this (now trivial) helper.

Henceforth get_nr_threads() is the only valid user of signal->count, we
are ready to turn it into "int nr_threads" or, perhaps, kill it.

Signed-off-by: Oleg Nesterov
Cc: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
de_thread() and __exit_signal() use signal_struct->count/notify_count for
synchronization. We can simplify the code and use ->notify_count only.
Instead of comparing these two counters, we can change de_thread() to set
->notify_count = nr_of_sub_threads, then change __exit_signal() to
dec-and-test this counter and notify group_exit_task.

Note that __exit_signal() checks "notify_count > 0" just for symmetry with
exit_notify(), we could just check it is != 0.

Signed-off-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc: Veaceslav Falico
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
- move the cprm.mm_flags checks up, before we take mmap_sem
- move down_write(mmap_sem) and ->core_state check from do_coredump()
  to coredump_wait()

This simplifies the code and makes the locking symmetrical.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
- kill "int dump_count", argv_split(argcp) accepts argcp == NULL.
- move "int dump_count" under " if (ispipe)" branch, fail_dropcount
  can check ispipe.
- move "char **helper_argv" as well, change the code to do argv_free()
  right after call_usermodehelper_fns().
- If call_usermodehelper_fns() fails goto close_fail label instead
  of closing the file by hand.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
do_coredump() does a lot of file checks after it opens the file or calls
usermode helper. But all of these checks are only needed in !ispipe case.

Move this code into the "else" branch and kill the ugly repetitive ispipe
checks.

Signed-off-by: Oleg Nesterov
Cc: David Howells
Cc: Neil Horman
Cc: Roland McGrath
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The first patch in this series introduced an init function to the
call_usermodehelper api so that processes could be customized by caller.
This patch takes advantage of that fact, by customizing the helper in
do_coredump to create the pipe and set its core limit to one (for our
recursion check). This lets us clean up the previous ugliness in the
usermodehelper internals and factor call_usermodehelper out entirely.
While I'm at it, we can also modify the helper setup to look for a core
limit value of 1 rather than zero for our recursion check.

Signed-off-by: Neil Horman
Reviewed-by: Oleg Nesterov
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I recently had to recover some files from an old broken machine that was
running BorderWare Document Gateway. It's basically a drop in web server
for sharing files. From the look of the init process and using strings on
a few files it seems to be based on FreeBSD 3.3.

The process turned out to be more difficult than I imagined, but to cut a
long story short BorderWare in their wisdom use a nonstandard magic number
in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount
the file systems in order to recover the data. After a bit of hunting I
was able to make a quick fix to fs/ufs/super.c in order to detect the new
magic number.

I assume that this number is the same for all installations. It's quite
easy to find out from ufs_fs.h. The superblock sits 8k into the block
device and the magic number is 1372 bytes into the superblock struct.

# dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd
00000000 97 26 24 0f |.&$.|
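The same lookup as the dd command above can be sketched in C. This is an illustration only: it assumes the on-disk value is stored little-endian (as the byte dump from this x86-era system suggests), and the helper names le32_from_bytes() and read_ufs_magic() are made up for the example. The two offsets are the ones quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SB_OFFSET    8192L /* superblock sits 8k into the block device */
#define MAGIC_OFFSET 1372L /* magic number offset inside the superblock */

/* Assemble a 32-bit value from four bytes stored least significant first. */
uint32_t le32_from_bytes(const unsigned char b[4])
{
    assert(b != NULL);
    return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Read the magic number from a device or image file.
 * Returns 0 on success and -1 on any I/O error. */
int read_ufs_magic(const char *path, uint32_t *magic)
{
    unsigned char b[4];
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return -1;
    ok = fseek(f, SB_OFFSET + MAGIC_OFFSET, SEEK_SET) == 0 &&
         fread(b, 1, sizeof(b), f) == sizeof(b);
    fclose(f);
    if (!ok)
        return -1;
    *magic = le32_from_bytes(b);
    return 0;
}
```

Under the little-endian assumption, the four bytes 97 26 24 0f shown in the dump decode to 0x0f242697.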
#

Signed-off-by: Thomas Stewart
Cc: Evgeniy Dushistov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use memdup_user when user data is immediately copied into the allocated
region. Elimination of the variable ads, which is no longer useful.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

//
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
}
- if (copy_from_user(to, from, size) != 0) {
-
- }
//

Signed-off-by: Julia Lawall
Cc: Ian Kent
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 May, 2010
5 commits
-
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.

Signed-off-by: Chris Mason
-
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.

Signed-off-by: Chris Mason
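The alignment test behind such a fallback can be sketched as follows. This is not the btrfs code: dio_is_aligned() and choose_io_path() are hypothetical helpers, and the sketch checks only the file offset and length, whereas a real check may also look at the user buffer addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum io_path { IO_DIRECT, IO_BUFFERED };

/* Hypothetical helper: true if both the file offset and the length are
 * multiples of 'blocksize', which must be a non-zero power of two. */
bool dio_is_aligned(uint64_t offset, uint64_t len, uint32_t blocksize)
{
    uint64_t mask;

    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
    mask = (uint64_t)blocksize - 1;
    return ((offset | len) & mask) == 0;
}

/* Pick direct I/O for aligned requests, buffered I/O for everything else. */
enum io_path choose_io_path(uint64_t offset, uint64_t len, uint32_t blocksize)
{
    return dio_is_aligned(offset, len, blocksize) ? IO_DIRECT : IO_BUFFERED;
}
```

With a 4096-byte block size, a request at offset 4096 of length 8192 stays on the direct path, while one at offset 512 is sent to the buffered path.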
-
Less printk is good printk.
Signed-off-by: Chris Mason
-
After the path is released, the generation number got from block
pointer is no longer valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.

Signed-off-by: Yan Zheng
Signed-off-by: Chris Mason
-
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents. This means it
wasn't honoring snapshots properly.

The fix here is to add an explicit check for multiple references.
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.

Signed-off-by: Chris Mason
26 May, 2010
9 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
squashfs: update documentation to include description of xattr layout
squashfs: fix name reading in squashfs_xattr_get
squashfs: constify xattr handlers
squashfs: xattr fix sparse warnings
squashfs: xattr_lookup sparse fix
squashfs: add xattr support configure option
squashfs: add new extended inode types
squashfs: add support for xattr reading
squashfs: add xattr id support
-
fs/fscache/object-list.c: In function 'fscache_objlist_lookup':
fs/fscache/object-list.c:105: warning: cast to pointer from integer of different size

Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons. This
usually works well but can cause problems when there are
many many writers.

When btrfs_update_inode fails with ENOSPC, we fall back
to a slower btrfs_start_transaction call that will reserve
some space.

Signed-off-by: Chris Mason
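The "try the cheap path, retry through the reserving path only on ENOSPC" pattern can be sketched generically. Nothing below is btrfs code: update_with_fallback() and the three stub callbacks are hypothetical and exist only to make the control flow concrete.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*update_fn)(void *ctx);

/* Run the fast update first. Only if it reports -ENOSPC, retry through
 * the slower path that reserves space. Other results pass through. */
int update_with_fallback(update_fn fast, update_fn slow, void *ctx)
{
    int ret;

    assert(fast != NULL && slow != NULL);
    ret = fast(ctx);
    if (ret == -ENOSPC)
        ret = slow(ctx);
    return ret;
}

/* Illustrative stubs: a fast path that runs out of space, a fast path
 * that succeeds, and a slow path that counts how often it was used. */
int fast_fails_enospc(void *ctx) { (void)ctx; return -ENOSPC; }
int fast_ok(void *ctx) { (void)ctx; return 0; }
int slow_reserving(void *ctx) { *(int *)ctx += 1; return 0; }
```

The slow path runs only when the fast path reports -ENOSPC, so the common case keeps the cheap behaviour the message describes.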
-
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO. This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.

Signed-off-by: Chris Mason
-
J.R. Okajima reports that the call to sync_inode() in nfs_wb_page() can
deadlock with other writeback flush calls. It boils down to the fact
that we cannot ever call writeback_single_inode() while holding a page
lock (even if we do set nr_to_write to zero) since another process may
already be waiting in the call to do_writepages(), and so will deny us
the I_SYNC lock.

Signed-off-by: Trond Myklebust
-
If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc
call has been completed, we must re-mark the inode as dirty. Otherwise,
future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to
ensure that the data is on the disk.

Signed-off-by: Trond Myklebust
-
Commit 9c7e7e23371e629dbb3b341610a418cdf1c19d91 (NFS: Don't call iput() in
nfs_access_cache_shrinker) unintentionally removed the spin unlock for the
inode->i_lock.

Reported-by: David Howells
Signed-off-by: Trond Myklebust
-
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.

There are a few special cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.

With this commit, we reserve space in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.

btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.

Signed-off-by: Chris Mason
-
This adds:
alias: devname:
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.

Ideally all these modules would be compiled-in, but distros seem too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.

The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235

Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666

A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also be useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.

Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.

This facility is to hide the mess distros are creating with too modularized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)

Cc: Greg Kroah-Hartman
Cc: David S. Miller
Cc: Miklos Szeredi
Cc: Chris Mason
Cc: Alasdair G Kergon
Cc: Tigran Aivazian
Cc: Ian Kent
Signed-Off-By: Kay Sievers
Signed-off-by: Greg Kroah-Hartman
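
A consumer of the modules.devname listing shown in this message could parse each line roughly as below. This is a sketch based only on the sample lines in the message: it assumes the format is module name, node path relative to /dev, then a type letter followed by major:minor. The struct and function names are made up for the example, and accepting 'b' for block devices is an assumption, since the samples only show 'c'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct devname_entry {
    char module[64];        /* kernel module name */
    char node[128];         /* device node path, relative to /dev */
    char type;              /* 'c' for character, 'b' assumed for block */
    unsigned int dev_major;
    unsigned int dev_minor;
};

/* Parse one line of the listing.
 * Returns 1 for an entry, 0 for a comment or blank line, -1 if malformed. */
int parse_devname_line(const char *line, struct devname_entry *e)
{
    assert(line != NULL && e != NULL);
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return 0;
    if (sscanf(line, "%63s %127s %c%u:%u", e->module, e->node,
               &e->type, &e->dev_major, &e->dev_minor) != 5)
        return -1;
    if (e->type != 'c' && e->type != 'b')
        return -1;
    return 1;
}
```

For the sample line "tun net/tun c10:200" this yields the node net/tun as a character device with major 10 and minor 200, which matches the mknod line in the udevd debug output.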