07 Mar, 2010
1 commit
-
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!Tested-by: Quentin Barnes
Signed-off-by: Wu Fengguang
Cc: Nick Piggin
Cc: Andi Kleen
Cc: Steven Whitehouse
Cc: David Howells
Cc: Jonathan Corbet
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Trond Myklebust
Cc: Chuck Lever
Cc: [2.6.33.x]
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jun, 2009
1 commit
-
Impact: code simplification.
Cc: Nick Piggin
Signed-off-by: Wu Fengguang
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Jan, 2009
1 commit
-
System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.So we handle them as special case and add some extra wrappers instead.
Signed-off-by: Heiko Carstens
17 Oct, 2008
1 commit
-
People can use the real name an an index into MAINTAINERS to find the
current email address.Signed-off-by: Francois Cami
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Apr, 2008
1 commit
-
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
the user mappings.This requires the get_xip_page API to be changed to an address based one.
Improve the API layering a little bit too, while we're here.This is required in order to support XIP filesystems on memory that isn't
backed with struct page (but memory with struct page is still supported too).Signed-off-by: Nick Piggin
Acked-by: Carsten Otte
Cc: Jared Hulbert
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Feb, 2008
1 commit
-
I've written some test programs in ltp project. During writing I met an
problem which I cannot solve in user land. So I wrote a patch for linux
kernel. Please, include this patch if acceptable.The test program tests the 4th parameter of fadvise64_64:
long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
My test case calls fadvise64_64 with invalid advice value and checks errno is
set to EINVAL. About the advice parameter man page says:...
Permissible values for advice include:POSIX_FADV_NORMAL
...
POSIX_FADV_SEQUENTIAL
...
POSIX_FADV_RANDOM
...
POSIX_FADV_NOREUSE
...
POSIX_FADV_WILLNEED
...
POSIX_FADV_DONTNEED
...
ERRORS
...
EINVAL An invalid value was specified for advice.However, I got a bug report that the system call invocations
in my test case returned 0 unexpectedly.I've inspected the kernel code:
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
{
struct file *file = fget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
int ret = 0;if (!file)
return -EBADF;if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
ret = -ESPIPE;
goto out;
}mapping = file->f_mapping;
if (!mapping || len < 0) {
ret = -EINVAL;
goto out;
}if (mapping->a_ops->get_xip_page)
/* no bad return value, but ignore advice */
goto out;
...
out:
fput(file);
return ret;
}I found the advice parameter is just ignored in the case
mapping->a_ops->get_xip_page is given. This behavior is different from
what is written on the man page. Is this o.k.?get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
Anyway I cannot find the easy way to detect get_xip_page
field is given or CONFIG_EXT2_FS_XIP is true from the
user space.I propose the following patch which checks the advice parameter
even if get_xip_page is given.Signed-off-by: Masatake YAMATO
Acked-by: Carsten Otte
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Dec, 2006
1 commit
-
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.
Signed-off-by: Josef "Jeff" Sipek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Aug, 2006
1 commit
-
The POSIX_FADV_NOREUSE hint means "the application will use this range of the
file a single time". It seems to be intended that the implementation will use
this hint to perform drop-behind of that part of the file when the application
gets around to reading or writing it.However for reasons which aren't obvious (or sane?) I mapped
POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.
This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
to perform readahead, they lose. The likelihood is low.If/when we later implement POSIX_FADV_NOREUSE things will get interesting - to
do it fully we'll need to maintain file offset/length ranges and peform all
sorts of complex tricks, and managing the lifetime of those ranges' data
structures will be interesting..A sensible implementation would probably ignore the file range and would
simply mark the entire file as needing some form of drop-behind treatment.Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jul, 2006
1 commit
-
Cc: "Michael Kerrisk"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Apr, 2006
1 commit
-
Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
Reasons:- It's more flexible. Things which would require two or three syscalls with
fadvise() can be done in a single syscall.- Using fadvise() in this manner is something not covered by POSIX.
The patch wires up the syscall for x86.
The sycall is implemented in the new fs/sync.c. The intention is that we can
move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.Documentation for the syscall is in fs/sync.c.
A test app (sync_file_range.c) is in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
NFS_DATA_SYNC which is hopefully the more common."Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
the queue is congested. This is trivial to fix: add a new flag bit, set
wbc->nonblocking. But I'm not sure that we want to expose implementation
details down to that level.Note: it's notable that we can sync an fd which wasn't opened for writing.
Same with fsync() and fdatasync()).Note: the code takes some care to handle attempts to sync file contents
outside the 16TB offset on 32-bit machines. It makes such attempts appear to
succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
requests fail...Cc: Nick Piggin
Cc: Michael Kerrisk
Cc: Ulrich Drepper
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Mar, 2006
1 commit
-
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent that last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".But these commands are a perfect fit with the fadvise() impementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Jan, 2006
1 commit
-
The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
fully POSIX-compliant.Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jun, 2005
1 commit
-
Make sys_madvice/fadvice return sane with xip.
Signed-off-by: Carsten Otte
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Apr, 2005
1 commit
-
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.Let it rip!