Eric Lee / smarc-fsl-linux-kernel

30 Apr, 2008

40 commits

e5d9a0df0 fuse: fix max i/o size calculation ... Browse Code »

Fix a bug that Werner Baumann reported: fuse can send a bigger write request
than the maximum specified. This only affected direct_io operation.

In addition set a sane minimum for the max_read and max_write tunables, so I/O
always makes some progress.

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:51 +0800
5c5c5e51b fuse: update file size on short read ... Browse Code »

If the READ request returned a short count, then either

- cached size is incorrect
- filesystem is buggy, as short reads are only allowed on EOF

So assume that the size is wrong and refresh it, so that cached read() doesn't
zero fill the missing chunk.

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
ea9b9907b fuse: implement perform_write ... Browse Code »

Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large
(1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
(256MB/s vs 71MB/s).

[mszeredi@suse.cz]:

- split into smaller functions
- testing
- duplicate generic_file_aio_write(), so that there's no need to add a
new ->perform_write() a_op. Comment from hch.

Signed-off-by: Nick Piggin
Signed-off-by: Miklos Szeredi
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-04-30 23:29:50 +0800
854512ec3 fuse: clean up setting i_size in write ... Browse Code »

Extract common code for setting i_size in write functions into a common
helper.

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
3be5a52b3 fuse: support writable mmap ... Browse Code »

Quoting Linus (3 years ago, FUSE inclusion discussions):

"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.

Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.

On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure, there can only be one temporary page used at a
time for one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?

The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.

Currently there are several cases where the kernel can block on page
writeback:

- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.

As an extra safetly measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.

With appropriate privileges, this limit can be raised through
'/sys/class/bdi//max_ratio'.

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
b88473f73 mm: document missing fields for /proc/meminfo ... Browse Code »

A few fields in /proc/meminfo were not documented. Fix.

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
fc3ba692a mm: Add NR_WRITEBACK_TEMP counter ... Browse Code »

Fuse will use temporary buffers to write back dirty data from memory mappings
(normal writes are done synchronously). This is needed, because there cannot
be any guarantee about the time in which a write will complete.

By using temporary buffers, from the MM's point if view the page is written
back immediately. If the writeout was due to memory pressure, this
effectively migrates data from a full zone to a less full zone.

This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
as temporary buffers.

[Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
Signed-off-by: Miklos Szeredi
Cc: Christoph Lameter
Signed-off-by: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
dd5656e59 mm: bdi: export bdi_writeout_inc() ... Browse Code »

Fuse needs this for writable mmap support.

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
e4ad08fe6 mm: bdi: add separate writeback accounting capability ... Browse Code »

Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().

Misc cleanups:

- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
76f1418b4 mm: bdi: move statistics to debugfs ... Browse Code »

Move BDI statistics to debugfs:

/sys/kernel/debug/bdi//stats

Use postcore_initcall() to initialize the sysfs class and debugfs,
because debugfs is initialized in core_initcall().

Update descriptions in ABI documentation.

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800
a42dde041 mm: bdi: allow setting a maximum for the bdi dirty limit ... Browse Code »

Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
the global dirty threshold allocated to this bdi.

[mszeredi@suse.cz]

- fix parsing in max_ratio_store().
- export bdi_set_max_ratio() to modules
- limit bdi_dirty with bdi->max_ratio
- document new sysfs attribute

Signed-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2008-04-30 23:29:50 +0800
189d3c4a9 mm: bdi: allow setting a minimum for the bdi dirty limit ... Browse Code »

Under normal circumstances each device is given a part of the total write-back
cache that relates to its current avg writeout speed in relation to the other
devices.

min_ratio - allows one to assign a minimum portion of the write-back cache to
a particular device. This is useful in situations where you might want to
provide a minimum QoS. (One request for this feature came from flash based
storage people who wanted to avoid writing out at all costs - they of course
needed some pdflush hacks as well)

max_ratio - allows one to assign a maximum portion of the dirty limit to a
particular device. This is useful in situations where you want to avoid one
device taking all or most of the write-back cache. Eg. an NFS mount that is
prone to get stuck, or a FUSE mount which you don't trust to play fair.

Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
the global dirty threshold allocated to this bdi.

[mszeredi@suse.cz]

- fix parsing in min_ratio_store()
- document new sysfs attribute

Signed-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2008-04-30 23:29:50 +0800
b6f2fcbcf mm: bdi: expose the BDI object in sysfs for FUSE ... Browse Code »

Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR"

Make the fuse control filesystem use s_dev instead of a fuse specific ID.
This makes it easier to match directories under /sys/fs/fuse/connections/ with
directories under /sys/class/bdi, and with actual mounts.

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:49 +0800
fa799759f mm: bdi: expose the BDI object in sysfs for NFS ... Browse Code »

Register NFS' backing_dev_info under sysfs with the name "nfs-MAJOR:MINOR"

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Cc: Trond Myklebust
Cc: "J. Bruce Fields"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:49 +0800
cf0ca9fe5 mm: bdi: export BDI attributes in sysfs ... Browse Code »

Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.

In particular this properly exposes the read-ahead window for all relevant
users and /sys/block//queue/read_ahead_kb should be deprecated.

With patient help from Kay Sievers and Greg KH

[mszeredi@suse.cz]

- split off NFS and FUSE changes into separate patches
- document new sysfs attributes under Documentation/ABI
- do bdi_class_init as a core_initcall, otherwise the "default" BDI
won't be initialized
- remove bdi_init_fmt macro, it's not used very much

[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra
Cc: Kay Sievers
Acked-by: Greg KH
Cc: Trond Myklebust
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2008-04-30 23:29:49 +0800
caafa4324 pidns: make pid->level and pid_ns->level unsigned ... Browse Code »

These values represent the nesting level of a namespace and pids living in it,
and it's always non-negative.

Turning this from int to unsigned int saves some space in pid.c (11 bytes on
x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
E.g. on ia64 this removes the sign extension calls, which compiler adds to
optimize access to pid->nubers[ns->level].

Signed-off-by: Pavel Emelyanov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2008-04-30 23:29:49 +0800
ab883af53 make marker_debug static ... Browse Code »

With the needlessly global marker_debug being static gcc can optimize the
unused code away.

Signed-off-by: Adrian Bunk
Acked-by: Mathieu Desnoyers
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-04-30 23:29:49 +0800
148ff86b1 mxser: convert large macros to functions ... Browse Code »

Signed-off-by: Christoph Hellwig
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2008-04-30 23:29:49 +0800
12a3de0a9 pids: sys_getpgid: fix unsafe *pid usage, s/tasklist/rcu/ ... Browse Code »

1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setpgid().

2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure
that we don't use the NULL pid if the task exits right after successful
find_task_by_vpid().

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:49 +0800
1dd768c08 pids: sys_getsid: fix unsafe *pid usage, fix possible 0 instead of -ESRCH ... Browse Code »

1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setsid().

2. The task can exit between find_task_by_vpid() and task_session_vnr(),
in that unlikely case sys_getsid() returns 0 instead of -ESRCH.

Signed-off-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:49 +0800
7d8da0962 pids: __set_special_pids: use change_pid() helper ... Browse Code »

Use change_pid() instead of detach_pid() + attach_pid() in
__set_special_pids().

This way task_session() is not NULL in between.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:49 +0800
83beaf3c6 pids: sys_setpgid: use change_pid() helper ... Browse Code »

Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid().

This way task_pgrp() is not NULL in between.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:49 +0800
24336eaee pids: introduce change_pid() helper ... Browse Code »

Based on Eric W. Biederman's idea.

Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
caller races with setprgp()/setsid() which does detach_pid() + attach_pid().
This can happen even if task == current.

Intoduce the new helper, change_pid(), which should be used instead. This way
the caller always sees the special pid != NULL, either old or new.

Also change the prototype of attach_pid(), it always returns 0 and nobody
check the returned value.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:48 +0800
65450cebc pids: de_thread: don't clear session/pgrp pids for the old leader ... Browse Code »

Based on Eric W. Biederman's idea.

Unless task == current, without tasklist_lock held task_session()/task_pgrp()
can return NULL if the caller races with de_thread() which switches the group
leader.

Change transfer_pid() to not clear old->pids[type].pid for the old leader.
This means that its .pid can point to "nowhere", but this is already true for
sub-threads, and the old leader is not group_leader() any longer. IOW, with
or without this change we can't trust task's special pids unless it is the
group leader.

With this change the following code

rcu_read_lock();
task = find_task_by_xxx();
do_something(task_pgrp(task), task_session(task));
rcu_read_unlock();

can't race with exec and hit the NULL pid.

Signed-off-by: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:48 +0800
5cd204550 Deprecate find_task_by_pid() ... Browse Code »

There are some places that are known to operate on tasks'
global pids only:

* the rest_init() call (called on boot)
* the kgdb's getthread
* the create_kthread() (since the kthread is run in init ns)

So use the find_task_by_pid_ns(..., &init_pid_ns) there
and schedule the find_task_by_pid for removal.

[sukadev@us.ibm.com: Fix warning in kernel/pid.c]
Signed-off-by: Pavel Emelyanov
Cc: "Eric W. Biederman"
Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2008-04-30 23:29:48 +0800
cb41d6d06 Use find_task_by_vpid in taskstats ... Browse Code »

The pid to lookup a task by is passed inside taskstats code via genetlink
message.

Since netlink packets are now processed in the context of the sending task,
this is correct to lookup the task with find_task_by_vpid() here.

Besides, I fix the call to fill_pid() from taskstats_exit(), since the
tsk->pid is not required in fill_pid() in this case, and the pid field on
task_struct is going to be deprecated as well.

Signed-off-by: Pavel Emelyanov
Cc: "Eric W. Biederman"
Cc: Balbir Singh
Cc: Jay Lan
Cc: Jonathan Lim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2008-04-30 23:29:48 +0800
b7127aa45 free_pidmap: turn it into free_pidmap(struct upid *) ... Browse Code »

The callers of free_pidmap() pass 2 members of "struct upid", we can just
pass "struct upid *" instead. Shaves off 10 bytes from pid.o.

Also, simplify the alloc_pid's "out_free:" error path a little bit. This
way it looks more clear which subset of pid->numbers[] we are freeing.

Signed-off-by: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: "Eric W. Biederman"
Cc :Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2008-04-30 23:29:48 +0800
718a91633 devpts: factor out PTY index allocation ... Browse Code »

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index(). This localizes the external data
structures used in managing the pts indices.

[akpm@linux-foundation.org: undo accidental mutex2sem conversion]
Signed-off-by: Sukadev Bhattiprolu
Signed-off-by: Serge Hallyn
Signed-off-by: Matt Helsley
Acked-by: H. Peter Anvin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2008-04-30 23:29:48 +0800
4f8f9d66c devpts: propagate error code from devpts_pty_new ... Browse Code »

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Serge Hallyn
Acked-by: H. Peter Anvin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukadev Bhattiprolu
2008-04-30 23:29:48 +0800
86a965381 tty: fix routine name in ptmx_open() ... Browse Code »

At ptmx_open(), the 2nd parameter for check_tty_count() should
be "ptmx_open".

Signed-off-by: Hiroshi Shimamoto
Acked-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hiroshi Shimamoto
2008-04-30 23:29:48 +0800
24cb23352 char serial: switch drivers to ioremap_nocache ... Browse Code »

Simple search/replace except for synclink.c where I noticed a real bug and
fixed it too. It was doing NULL + offset, then checking for NULL if the remap
failed.

Signed-off-by: Alan Cox
Cc: Paul Fulghum
Acked-by: Jiri Slaby
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:48 +0800
a6fc819eb ip2: switch remaining direct call of ops->flush_buffer ... Browse Code »

Signed-off-by: Alan Cox

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
39c2e60f8 tty: add throttle/unthrottle helpers ... Browse Code »

Something Arjan suggested which allows us to clean up the code nicely

Signed-off-by: Alan Cox
Cc: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
8cd64518a isicom: fix buffer allocation ... Browse Code »

Fix the rather strange buffer management on open that turned up while auditing
for BKL dependencies.

Signed-off-by: Alan Cox
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
fb100b6ea esp: clean up to modern coding style ... Browse Code »

Signed-off-by: Alan Cox
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
191260a01 epca: coding style ... Browse Code »

Clean up the epca driver

Signed-off-by: Alan Cox
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
9492e1351 riscom8: coding style ... Browse Code »

Signed-off-by: Alan Cox
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
8e8bcf16c strip: Fix up strip for the new order ... Browse Code »

- Use the tty baud functions
- Call driver termios methods directly holding the right locking
- Check for a write method

Signed-off-by: Alan Cox
Cc: David S. Miller
Cc: Jeff Garzik
Cc: "John W. Linville"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
f34d7a5b7 tty: The big operations rework ... Browse Code »

- Operations are now a shared const function block as with most other Linux
objects

- Introduce wrappers for some optional functions to get consistent behaviour

- Wrap put_char which used to be patched by the tty layer

- Document which functions are needed/optional

- Make put_char report success/fail

- Cache the driver->ops pointer in the tty as tty->ops

- Remove various surplus lock calls we no longer need

- Remove proc_write method as noted by Alexey Dobriyan

- Introduce some missing sanity checks where certain driver/ldisc
combinations would oops as they didn't check needed methods were present

[akpm@linux-foundation.org: fix fs/compat_ioctl.c build]
[akpm@linux-foundation.org: fix isicom]
[akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build]
[akpm@linux-foundation.org: fix kgdb]
Signed-off-by: Alan Cox
Acked-by: Greg Kroah-Hartman
Cc: Jason Wessel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800
251b8dd7e isicom: bring into coding style ... Browse Code »

[akpm@linux-foundation.org: fix arm, cleanups]
Signed-off-by: Alan Cox
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alan Cox
2008-04-30 23:29:47 +0800