Eric Lee / smarc-fsl-linux-kernel

15 Mar, 2011

1 commit

1abf0c718 New kind of open files - "location only". ... Browse Code »

New flag for open(2) - O_PATH. Semantics:
* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
* almost all operations on the resulting descriptors shall
fail with -EBADF. Exceptions are:
1) operations on descriptors themselves (i.e.
close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
fcntl(fd, F_SETFD, ...))
2) fcntl(fd, F_GETFL), for a common non-destructive way to
check if descriptor is open
3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
points of pathname resolution
* closing such descriptor does *NOT* affect dnotify or
posix locks.
* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself. Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).

fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them. That protects
existing code from dealing with those things.

There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.

Signed-off-by: Al Viro

Al Viro
2011-03-15 14:21:45 +0800

17 Jan, 2011

1 commit

c2b3e74b7 fs: Remove unlikely() from fput_light() ... Browse Code »

In fput_light(), there's an unlikely(fput_needed), which running on
my normal desktop doing firefox, xchat, evolution and part of my distcc farm,
and running the annotate branch profiler shows that the unlikely is not
very unlikely.

correct incorrect % Function File Line
------- --------- - -------- ---- ----
0 48 100 fput_light file.h 26
115828710 897415279 88 fput_light file.h 26
865271179 5286128445 85 fput_light file.h 26
19568539 8923664 31 fput_light file.h 26
12353677 3562279 22 fput_light file.h 26
267691 67062 20 fput_light file.h 26
15014853 348172 2 fput_light file.h 26
209258 205 0 fput_light file.h 26
1364164 0 0 fput_light file.h 26

Which gives 1032903812 times it was correct and 6203351846 times it was
incorrect, or 85% incorrect.

Cc: Andrew Morton
Signed-off-by: Steven Rostedt
Signed-off-by: Al Viro

Steven Rostedt
2011-01-17 16:26:26 +0800

28 May, 2010

1 commit

d7065da03 get rid of the magic around f_count in aio ... Browse Code »

__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro

Al Viro
2010-05-28 10:03:07 +0800

17 Dec, 2009

2 commits

2c48b9c45 switch alloc_file() to passing struct path ... Browse Code »

... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.

Signed-off-by: Al Viro

Al Viro
2009-12-17 01:16:42 +0800
3d1e46315 get rid of init_file() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2009-12-17 01:16:42 +0800

21 Oct, 2008

1 commit

aeb5d7270 [PATCH] introduce fmode_t, do annotations ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2008-10-21 19:47:06 +0800

01 Aug, 2008

1 commit

1027abe88 [PATCH] merge locate_fd() and get_unused_fd() ... Browse Code »

New primitive: alloc_fd(start, flags). get_unused_fd() and
get_unused_fd_flags() become wrappers on top of it.

Signed-off-by: Al Viro

Al Viro
2008-08-01 23:25:23 +0800

02 May, 2008

1 commit

9f3acc314 [PATCH] split linux/file.h ... Browse Code »

Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro

Al Viro
2008-05-02 01:08:16 +0800

25 Apr, 2008

1 commit

3b1253880 [PATCH] sanitize unshare_files/reset_files_struct ... Browse Code »

* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
it will break the system if somebody ever calls it for anything
else (we can't modify ->files of somebody else). Lose the
task_struct * argument.

Signed-off-by: Al Viro

Al Viro
2008-04-25 21:23:59 +0800

19 Apr, 2008

1 commit

aceaf78da [PATCH] r/o bind mounts: create helper to drop file write access ... Browse Code »

If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

AV: drop write access in __fput() only after we evict from file list.

Signed-off-by: Dave Hansen
Cc: Erez Zadok
Cc: Trond Myklebust
Cc: "J Bruce Fields"
Acked-by: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Dave Hansen
2008-04-19 12:25:32 +0800

14 Feb, 2008

1 commit

b3c975286 include/linux: Remove all users of FASTCALL() macro ... Browse Code »

FASTCALL() is always expanded to empty, remove it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Harvey Harrison
2008-02-14 08:21:18 +0800

17 Oct, 2007

1 commit

ce8d2cdf3 r/o bind mounts: filesystem helpers for custom 'struct file's ... Browse Code »

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2007-10-17 23:43:04 +0800

17 Jul, 2007

1 commit

4a19542e5 O_CLOEXEC for SCM_RIGHTS ... Browse Code »

Part two in the O_CLOEXEC saga: adding support for file descriptors received
through Unix domain sockets.

The patch is once again pretty minimal, it introduces a new flag for recvmsg
and passes it just like the existing MSG_CMSG_COMPAT flag. I think this bit
is not used otherwise but the networking people will know better.

This new flag is not recognized by recvfrom and recv. These functions cannot
be used for that purpose and the asymmetry this introduces is not worse than
the already existing MSG_CMSG_COMPAT situations.

The patch must be applied on the patch which introduced O_CLOEXEC. It has to
remove static from the new get_unused_fd_flags function but since scm.c cannot
live in a module the function still hasn't to be exported.

Here's a test program to make sure the code works. It's so much longer than
the actual patch...

#include
#include
#include
#include
#include
#include
#include
#include

#ifndef O_CLOEXEC
# define O_CLOEXEC 02000000
#endif
#ifndef MSG_CMSG_CLOEXEC
# define MSG_CMSG_CLOEXEC 0x40000000
#endif

int
main (int argc, char *argv[])
{
if (argc > 1)
{
int fd = atol (argv[1]);
printf ("child: fd = %d\n", fd);
if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
{
puts ("file descriptor valid in child");
return 1;
}
return 0;

}

struct sockaddr_un sun;
strcpy (sun.sun_path, "./testsocket");
sun.sun_family = AF_UNIX;

char databuf[] = "hello";
struct iovec iov[1];
iov[0].iov_base = databuf;
iov[0].iov_len = sizeof (databuf);

union
{
struct cmsghdr hdr;
char bytes[CMSG_SPACE (sizeof (int))];
} buf;
struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1,
.msg_control = buf.bytes,
.msg_controllen = sizeof (buf) };
struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);

cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN (sizeof (int));

msg.msg_controllen = cmsg->cmsg_len;

pid_t child = fork ();
if (child == -1)
error (1, errno, "fork");
if (child == 0)
{
int sock = socket (PF_UNIX, SOCK_STREAM, 0);
if (sock < 0)
error (1, errno, "socket");

if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
error (1, errno, "bind");
if (listen (sock, SOMAXCONN) < 0)
error (1, errno, "listen");

int conn = accept (sock, NULL, NULL);
if (conn == -1)
error (1, errno, "accept");

*(int *) CMSG_DATA (cmsg) = sock;
if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0)
error (1, errno, "sendmsg");

return 0;
}

/* For a test suite this should be more robust like a
barrier in shared memory. */
sleep (1);

int sock = socket (PF_UNIX, SOCK_STREAM, 0);
if (sock < 0)
error (1, errno, "socket");

if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
error (1, errno, "connect");
unlink (sun.sun_path);

*(int *) CMSG_DATA (cmsg) = -1;

if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0)
error (1, errno, "recvmsg");

int fd = *(int *) CMSG_DATA (cmsg);
if (fd == -1)
error (1, 0, "no descriptor received");

char fdname[20];
snprintf (fdname, sizeof (fdname), "%d", fd);
execl ("/proc/self/exe", argv[0], fdname, NULL);
puts ("execl failed");
return 1;
}

[akpm@linux-foundation.org: Fix fastcall inconsistency noted by Michael Buesch]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Ulrich Drepper
Cc: Ingo Molnar
Cc: Michael Buesch
Cc: Michael Kerrisk
Acked-by: David S. Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ulrich Drepper
2007-07-17 00:05:45 +0800

23 Dec, 2006

1 commit

01b2d93ca [PATCH] fdtable: Provide free_fdtable() wrapper ... Browse Code »

Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel. The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage. Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.

Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vadim Lobanov
2006-12-23 00:55:50 +0800

11 Dec, 2006

3 commits

5466b456e [PATCH] fdtable: Implement new pagesize-based fdtable allocator ... Browse Code »

This patch provides an improved fdtable allocation scheme, useful for
expanding fdtable file descriptor entries. The main focus is on the fdarray,
as its memory usage grows 128 times faster than that of an fdset.

The allocation algorithm sizes the fdarray in such a way that its memory usage
increases in easy page-sized chunks. The overall algorithm expands the allowed
size in powers of two, in order to amortize the cost of invoking vmalloc() for
larger allocation sizes. Namely, the following sizes for the fdarray are
considered, and the smallest that accommodates the requested fd count is
chosen:

pagesize / 4
pagesize / 2
pagesize open_fds is now used as the anchor for the
fdset memory allocation.

Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vadim Lobanov
2006-12-11 01:57:22 +0800
4fd45812c [PATCH] fdtable: Remove the free_files field ... Browse Code »

An fdtable can either be embedded inside a files_struct or standalone (after
being expanded). When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.

Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded. We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().

Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vadim Lobanov
2006-12-11 01:57:22 +0800
bbea9f696 [PATCH] fdtable: Make fdarray and fdsets equal in size ... Browse Code »

Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets. The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).

In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.

Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal. This
patch removes fdtable->max_fdset. As an added bonus, most of the supporting
code becomes simpler.

Signed-off-by: Vadim Lobanov
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vadim Lobanov
2006-12-11 01:57:22 +0800

08 Dec, 2006

2 commits

8b7d91eb7 [PATCH] Move filep_cachep to include/file.h ... Browse Code »

filp_cachep is only used in fs/file_table.c and in fs/dcache.c where
it is defined.

Move it to related definitions in linux/file.h.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-12-08 00:39:23 +0800
5d6538fcf [PATCH] Move files_cachep to include/file.h ... Browse Code »

Proper place is in file.h since files_cachep uses are rated to file I/O.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-12-08 00:39:22 +0800

30 Sep, 2006

1 commit

3b9b8ab65 [PATCH] Fix unserialized task->files changing ... Browse Code »

Fixed race on put_files_struct on exec with proc. Restoring files on
current on error path may lead to proc having a pointer to already kfree-d
files_struct.

->files changing at exit.c and khtread.c are safe as exit_files() makes all
things under lock.

Found during OpenVZ stress testing.

[akpm@osdl.org: add export]
Signed-off-by: Pavel Emelianov
Signed-off-by: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill Korotaev
2006-09-30 00:18:12 +0800

23 Mar, 2006

1 commit

0c9e63fd3 [PATCH] Shrinks sizeof(files_struct) and better layout ... Browse Code »

1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
platforms, lowering kmalloc() allocated space by 50%.

2) Reduce the size of (files_struct), using a special 32 bits (or
64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
close_on_exec_init and open_fds_init fields. This save some ram (248
bytes per task) as most tasks dont open more than 32 files. D-Cache
footprint for such tasks is also reduced to the minimum.

3) Reduce size of allocated fdset. Currently two full pages are
allocated, that is 32768 bits on x86 for example, and way too much. The
minimum is now L1_CACHE_BYTES.

UP and SMP should benefit from this patch, because most tasks will touch
only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
(next_fd, close_on_exec_init, open_fds_init, fd_array[0 .. 2] being in the
same cache line)

Signed-off-by: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2006-03-23 23:38:09 +0800

09 Mar, 2006

1 commit

529bf6be5 [PATCH] fix file counting ... Browse Code »

I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.

The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844

At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.

Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma
Cc: "Paul E. McKenney"
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dipankar Sarma
2006-03-09 06:14:01 +0800

14 Nov, 2005

1 commit

95e861db3 [PATCH] reorder struct files_struct ... Browse Code »

The file_lock spinlock sits close to mostly read fields of 'struct
files_struct'

In SMP (and NUMA) environments, each time a thread wants to open or close
a file, it has to acquire the spinlock, thus invalidating the cache line
containing this spinlock on other CPUS. So other threads doing
read()/write()/... calls that use RCU to access the file table are going
to ask further memory (possibly NUMA) transactions to read again this
memory line.

Move the spinlock to another cache line, so that concurrent threads can
share the cache line containing 'count' and 'fdt' fields.

It's worth up to 9% on a microbenchmark using a 4-thread 2-package x86
machine. See
http://marc.theaimsgroup.com/?l=linux-kernel&m=112680448713342&w=2

Signed-off-by: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2005-11-14 10:14:10 +0800

07 Nov, 2005

1 commit

2109a2d1b [PATCH] mm: rename kmem_cache_s to kmem_cache ... Browse Code »

This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.

Signed-off-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pekka J Enberg
2005-11-07 23:53:24 +0800

10 Sep, 2005

2 commits

ab2af1f50 [PATCH] files: files struct with RCU ... Browse Code »

Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure. The fdtable structure is
protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead. This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dipankar Sarma
2005-09-10 04:57:55 +0800
badf16621 [PATCH] files: break up files struct ... Browse Code »

In order for the RCU to work, the file table array, sets and their sizes must
be updated atomically. Instead of ensuring this through too many memory
barriers, we put the arrays and their sizes in a separate structure. This
patch takes the first step of putting the file table elements in a separate
structure fdtable that is embedded withing files_struct. It also changes all
the users to refer to the file table using files_fdtable() macro. Subsequent
applciation of RCU becomes easier after this.

Signed-off-by: Dipankar Sarma
Signed-Off-By: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dipankar Sarma
2005-09-10 04:57:55 +0800

17 Apr, 2005

1 commit

1da177e4c Linux-2.6.12-rc2 ... Browse Code »

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!

Linus Torvalds
2005-04-17 06:20:36 +0800